Staff ML Infrastructure Engineer
Metric Geo
About the role
We build general-purpose robots powered by a proprietary embodied AI foundation model that generalises and self-improves across varied environments with commercial-grade performance. Our robots are already deployed across multiple industries, and our frontier model leads the industry in generalisation and performance.
We're looking for a Staff ML Infrastructure Engineer to serve as the architect of our training engine, the person who bridges raw hardware and cutting-edge research to ensure our ML team can iterate at speed without friction. Your goal: maximise intelligence-per-watt by optimising every millisecond of the training and inference pipeline.
What you'll do
- Architect and own the infrastructure for large-scale GPU clusters, implementing sharding, activation checkpointing, and memory optimisation (ZeRO, FSDP) to enable training of massive multimodal models
- Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritises fast iteration, automated retries, and seamless failure recovery
- Design high-throughput data pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, and 3D signals), ensuring dataloaders never starve the GPUs
- Build low-latency inference pipelines for real-time robot control, applying quantisation, distillation, and model compilation (TensorRT, Triton) to move models from lab to physical deployment
- Profile GPU utilisation, I/O bottlenecks, and memory fragmentation to squeeze maximum performance from an expanding compute fleet
What we're looking for
- 7+ years of engineering experience with a track record of leading technical projects in high-performance computing or ML infrastructure
- Deep experience with PyTorch and distributed training frameworks such as DeepSpeed and Accelerate, including mixed precision and gradient accumulation
- Hands-on experience managing cloud GPU environments (GCP or AWS) and container orchestration with Kubernetes
- Strong understanding of distributed systems fundamentals, including race conditions, memory management, and NCCL/inter-node communication
- An ownership mindset: you design, build, and operate systems end to end rather than simply deploying code
Nice to have
- Experience with robotics data formats such as MCAP or Protobuf, or with vision-language-action models (VLAs)
- Deep ML systems work including custom kernels (Triton), compilers, or runtime optimisation
- Experience as a founding or early-stage infrastructure hire
What we offer
- Competitive base salary of $220,000 – $320,000 + Equity
- The opportunity to work at the frontier of embodied AI and physical robotics
- A fast-moving, research-driven environment where your infrastructure work directly shapes what the robots can do
- Backing from top investors including CRV and First Round, with over $100M raised