Staff ML Infrastructure Engineer

Metric Geo

Fremont · On-site · Full-time · Lead · $220k – $320k/yr · Yesterday

About the role

We build general-purpose robots powered by a proprietary embodied AI foundation model that generalises and self-improves across varied environments with commercial-grade performance. Our robots are already deployed across multiple industries, and our frontier model leads the industry in generalisation and performance.

We're looking for a Staff ML Infrastructure Engineer to serve as the architect of our training engine, the person who bridges raw hardware and cutting-edge research to ensure our ML team can iterate at speed without friction. Your goal: maximise intelligence-per-watt by optimising every millisecond of the training and inference pipeline.

What you'll do

  • Architect and own the infrastructure for large-scale GPU clusters, implementing sharding, activation checkpointing, and memory optimisation (ZeRO, FSDP) to enable training of massive multimodal models
  • Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritises fast iteration, automated retries, and seamless failure recovery
  • Design high-throughput data pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, and 3D signals), ensuring dataloaders never starve the GPUs
  • Build low-latency inference pipelines for real-time robot control, applying quantisation, distillation, and model compilation (TensorRT, Triton) to move models from lab to physical deployment
  • Profile GPU utilisation, I/O bottlenecks, and memory fragmentation to squeeze maximum performance from an expanding compute fleet

What we're looking for

  • 7+ years of engineering experience with a track record of leading technical projects in high-performance computing or ML infrastructure
  • Deep experience with PyTorch and distributed training frameworks such as DeepSpeed and Accelerate, including mixed precision and gradient accumulation
  • Hands-on experience managing cloud GPU environments (GCP or AWS) and container orchestration with Kubernetes
  • Strong understanding of distributed systems fundamentals, including race conditions, memory management, and NCCL/inter-node communication
  • An ownership mindset: you design, build, and operate systems end-to-end rather than simply deploying code

Nice to have

  • Experience with robotics data formats such as MCAP or Protobuf, or with multimodal models (VLAs)
  • Deep ML systems work including custom kernels (Triton), compilers, or runtime optimisation
  • Experience as a founding or early-stage infrastructure hire

What we offer

  • Competitive base salary of $220,000 – $320,000 + Equity
  • The opportunity to work at the frontier of embodied AI and physical robotics
  • A fast-moving, research-driven environment where your infrastructure work directly shapes what the robots can do
  • Backing from top investors including CRV and First Round, with over $100M raised

Skills

AWS, Accelerate, DeepSpeed, GCP, Kubernetes, ML infrastructure, NCCL, Nvidia Triton, PyTorch, SLURM, TensorRT, ZeRO
