IT Manager, Platform Research Systems
Robert Half
About the role
Reporting to the Director of Enterprise Infrastructure within the Office of Information Technology (OIT), the IT Manager, Platform Research Systems leads the team responsible for clients’ High-Performance Computing (HPC) platforms and associated research infrastructure.
This position oversees the day-to-day reliability, operational excellence, and strategic evolution of clustered compute environments, GPU resources, high-performance storage systems, and the scheduling and tooling platforms that enable research at scale.
The manager fosters a culture of collaboration, innovation, and service excellence by partnering closely with researchers, faculty, campus IT teams, and external vendors to deliver secure, high-performing, scalable, and cost-effective HPC services. This role is also responsible for capacity planning, technology roadmaps, vendor coordination, and ensuring alignment with university priorities, compliance requirements, and long-term research computing goals.
Key Duties and Responsibilities
30% – HPC Platform Reliability and Lifecycle Management
- Lead the reliability, availability, and lifecycle management of multi-tenant HPC clusters, including:
  - Operating system provisioning and standardized image management
  - Patch management and system updates
  - Scheduler health and workload management services
  - Control-plane services administration
  - Incident response, problem management, and root-cause analysis
  - Change control and planned maintenance activities
- Establish and maintain service level indicators (SLIs), service level objectives (SLOs), and operational runbooks.
- Ensure HPC services meet institutional security, compliance, and availability standards, including support for regulated and sensitive research workloads.
20% – Operational Automation, Observability, and Continuous Improvement
- Drive operational efficiency across compute and storage platforms through automation and standardization, including:
  - Automated provisioning and deployment workflows
  - Configuration management and patch orchestration
  - Standardized system builds and image templates
  - Governance controls and security guardrails
- Strengthen observability and reporting capabilities by implementing monitoring solutions for:
  - System health
  - Performance metrics
  - Resource utilization
  - Capacity trends
  - Operational bottlenecks
- Define and track key operational metrics to improve system reliability, performance efficiency, capacity utilization, and service delivery outcomes.
- Use insights from these metrics to guide continuous improvement initiatives and investment decisions.
20% – Research Storage and Data Lifecycle Management
- Manage high-performance research storage systems that support HPC workloads, including:
  - Capacity planning and growth forecasting
  - Quota management
  - Backup and disaster recovery (DR) strategies
  - Performance tuning and storage tiering
- Develop and enforce data lifecycle policies covering:
  - Data ingest
  - Active research use
  - Archiving
  - Retention
  - Decommissioning
- Ensure storage practices align with research best practices, institutional policies, and compliance requirements.
15% – Team Leadership and Staff Development
- Lead the day-to-day management and professional development of the Platform Research Systems team, including:
  - Hiring and onboarding
  - Coaching and mentoring
  - Performance feedback and evaluations
  - Workload planning and prioritization
- Promote a collaborative, inclusive, and service-oriented culture that encourages:
  - Innovation and creative problem-solving
  - Adoption of advanced technologies
  - Knowledge sharing
  - Documentation best practices
  - Blameless postmortems and continuous learning
- Establish operational and engineering standards that enhance team effectiveness and service quality.
10% – Service Enablement, Documentation, and Strategic Roadmap
- Improve service delivery through the development and maintenance of:
  - Onboarding materials
  - Service documentation
  - Training resources
  - Platform service catalog
- Define clear service interfaces, usage policies, quotas, and support expectations.
- Partner with research and education facilitation teams by providing guidance, tooling support, utilization reporting, and forward-looking platform roadmaps.
- Evaluate, pilot, and implement advanced capabilities aligned with enterprise standards and research needs.
- Coordinate vendor relationships, enterprise integrations, and cross-functional collaboration with infrastructure, security, compliance, and change-management teams.