Staff Machine Learning Infrastructure Engineer
Metric Geo
About the role
We build general-purpose robots powered by a proprietary embodied AI foundation model that generalises and self‑improves across varied environments with commercial‑grade performance. Our robots are already deployed across multiple industries, and our frontier model leads the industry in generalisation and performance.
We're looking for a Staff ML Infrastructure Engineer to serve as the architect of our training engine, the person who bridges raw hardware and cutting‑edge research to ensure our ML team can iterate at speed without friction. Your goal: maximise intelligence‑per‑watt by optimising every millisecond of the training and inference pipeline.
What you'll do
- Architect and own the infrastructure for large‑scale GPU clusters, implementing sharding, activation checkpointing, and memory optimisation (ZeRO, FSDP) to enable training of massive multimodal models
- Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritises fast iteration, automated retries, and seamless failure recovery
- Design high‑throughput data pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, and 3D signals), ensuring dataloaders never starve the GPUs
- Build low‑latency inference pipelines for real‑time robot control, applying quantisation, distillation, and model compilation (TensorRT, Triton) to move models from lab to physical deployment
- Profile GPU utilisation, I/O bottlenecks, and memory fragmentation to squeeze maximum performance from an expanding compute fleet
What we're looking for
- 7+ years of engineering experience with a track record of leading technical projects in high‑performance computing or ML infrastructure
- Deep experience with PyTorch and distributed training frameworks such as DeepSpeed and Accelerate, including mixed precision and gradient accumulation
- Hands‑on experience managing cloud GPU environments (GCP or AWS) and container orchestration with Kubernetes
- Strong understanding of distributed systems fundamentals, including race conditions, memory management, and NCCL/inter‑node communication
- An ownership mindset — you design, build, and operate systems end‑to‑end rather than simply deploying code
Nice to have
- Experience with robotics data formats such as MCAP or Protobuf, or with multimodal models (VLAs)
- Deep ML systems work including custom kernels (Triton), compilers, or runtime optimisation
- Experience as a founding or early‑stage infrastructure hire
What we offer
- Competitive base salary of $220,000 – $320,000 + Equity
- The opportunity to work at the frontier of embodied AI and physical robotics
- A fast‑moving, research‑driven environment where your infrastructure work directly shapes what the robots can do
- Backing from top investors including CRV and First Round, with over $100M raised