Principal Infrastructure Engineer — GPU Clusters / HPC

Acceler8 Talent

Mountain View · On-site · Full-time · Lead · Posted 1mo ago

About the role

We are working with an early-stage AI compute company building next-generation infrastructure for large-scale AI workloads. They are looking for a hands-on Principal Infrastructure Engineer to design, build and scale GPU compute environments across bare metal, on-prem data center and cloud infrastructure.

The focus is on building scalable AI/HPC infrastructure from the ground up, owning large hardware clusters and working closely with senior technical leadership across hardware, systems and software.

The ideal candidate will have strong experience across Linux, GPU clusters, HPC schedulers, Kubernetes, Slurm/LSF, networking, automation, observability and hybrid cloud environments. They should be comfortable operating in a fast-moving startup environment and have practical experience scaling infrastructure rather than only maintaining mature systems.

Key Responsibilities

  • Build, scale and operate large Linux-based infrastructure for AI/HPC workloads.
  • Manage GPU and compute clusters across bare metal, on-prem and cloud.
  • Work with Slurm, LSF, Kubernetes, or similar scheduling/orchestration tools.
  • Support hybrid cloud environments across AWS, Azure, GCP, or GPU cloud providers.
  • Automate infrastructure using Terraform, Ansible, Python, Bash, or similar.
  • Troubleshoot issues across Linux, GPUs, networking, storage, schedulers, containers and distributed systems.
  • Build monitoring and observability using Prometheus, Grafana, ELK, Datadog, Splunk, or similar.
  • Partner with hardware and software teams on cluster expansion, reliability, and performance.

Ideal Background

  • 7+ years in Infrastructure, DevOps, SRE, HPC, Systems Engineering, or Technical Operations.
  • Experience building or administering large-scale GPU, AI/ML, HPC, or hardware infrastructure clusters.
  • Strong Linux, networking, automation, and observability experience.
  • Exposure to NVIDIA/AMD GPUs, InfiniBand, NVLink, CUDA, NCCL, high-memory-bandwidth systems, or similar is highly desirable.
  • Experience in a startup, AI infrastructure, hyperscale, neocloud, semiconductor, research compute, or advanced data center environment would be a strong fit.

Location

Bay Area Based

Onsite 5 days a week

Skills

AI/ML, Ansible, AWS, Azure, Bash, CUDA, Datadog, ELK, GCP, Grafana, HPC, InfiniBand, Kubernetes, Linux, LSF, NCCL, NVIDIA, NVLink, Observability, Prometheus, Python, Slurm, Splunk, Terraform