HPC Infrastructure Solutions Architect – GPU Platforms
Doghouse Recruitment
About the role
Join an AI infrastructure team building the GPU, networking, and storage platforms underneath large-scale AI training workloads. This role focuses on making that infrastructure performant, scalable, reliable, and efficient before workloads ever arrive.
While ML Specialists focus on models and workloads, this role owns the underlying platform that enables them to run at scale.
Team & Responsibilities
Work alongside senior infrastructure and AI engineers in a hands‑on, client‑facing role.
You will:
- Design and operate production‑grade GPU and HPC platforms for AI training and simulation
- Build and scale GPU clusters, with a strong focus on Slurm‑based scheduling
- Design and optimize high-performance networking using RDMA, InfiniBand, NVLink, and NVSwitch
- Design and tune storage and I/O paths for large-scale datasets
- Build cloud infrastructure using open-source tooling such as Kubernetes, Terraform, and Helm
Required Skills
- Hands‑on experience building and operating GPU or HPC clusters
- Strong Linux, Kubernetes, networking, and storage background
- Deep understanding of HPC networking and RDMA stacks
- Experience with GPU schedulers, preferably Slurm
- Strong cloud experience, ideally multi‑cloud
- Strong storage and I/O expertise is required; hands-on experience with specific storage technologies is a plus
This role is not a fit if your experience is limited to model development or high-level cloud architecture without deep GPU and networking exposure. We are looking for experienced senior engineers.