HPC Infrastructure Solutions Architect – GPU Platforms
Doghouse Recruitment
About the role
Join an AI infrastructure team building the GPU, networking, and storage platforms underneath large-scale AI training workloads. This role focuses on making that infrastructure performant, scalable, reliable, and efficient before workloads ever arrive.
While ML Specialists focus on models and workloads, this role owns the underlying platform that enables them to run at scale.
Team & Responsibilities
Work alongside senior infrastructure and AI engineers in a hands‑on, client‑facing role.
You will:
- Design and operate production‑grade GPU and HPC platforms for AI training and simulation
- Build and scale GPU clusters, with a strong focus on Slurm‑based scheduling
- Design and optimize high-performance networking using RDMA, InfiniBand, NVLink, and NVSwitch
- Design and tune storage and I/O paths for large-scale datasets
- Build cloud infrastructure using open-source tooling such as Kubernetes, Terraform, and Helm
Required Skills
- Hands‑on experience building and operating GPU or HPC clusters
- Strong Linux, Kubernetes, networking, and storage background
- Deep understanding of HPC networking and RDMA stacks
- Experience with GPU schedulers, preferably Slurm
- Strong cloud experience, ideally multi‑cloud
- Strong storage and I/O expertise is required; hands-on experience with specific storage technologies is a plus
This role is not a fit if your experience is limited to model development or high-level cloud architecture without deep GPU and networking exposure. We are looking for experienced senior engineers.