Staff Machine Learning Infrastructure Engineer

Metric Geo

Santa Rosa · On-site · Full-time · Lead · $220k – $320k/yr · 1mo ago

About the role

We build general-purpose robots powered by a proprietary embodied AI foundation model that generalises and self‑improves across varied environments with commercial‑grade performance. Our robots are already deployed across multiple industries, and our frontier model leads the industry in generalisation and performance.

We're looking for a Staff ML Infrastructure Engineer to architect our training engine: someone who bridges raw hardware and cutting‑edge research so that our ML team can iterate quickly and without friction. Your goal is to maximise intelligence‑per‑watt by optimising every millisecond of the training and inference pipeline.

What you'll do

Architect and own the infrastructure for large‑scale GPU clusters, implementing sharding, activation checkpointing, and memory optimisation (ZeRO, FSDP) to enable training of massive multimodal models
Build a research codebase and job scheduling system (Kubernetes/SLURM) that prioritises fast iteration, automated retries, and seamless failure recovery
Design high‑throughput data pipelines to ingest and transform terabytes of multimodal robot data (video, proprioception, and 3D signals), ensuring dataloaders never starve the GPUs
Build low‑latency inference pipelines for real‑time robot control, applying quantisation, distillation, and model compilation (TensorRT, Triton) to move models from lab to physical deployment
Profile GPU utilisation, I/O bottlenecks, and memory fragmentation to squeeze maximum performance from an expanding compute fleet

What we're looking for

7+ years of engineering experience with a track record of leading technical projects in high‑performance computing or ML infrastructure
Deep experience with PyTorch and distributed training frameworks such as DeepSpeed and Accelerate, including mixed precision and gradient accumulation
Hands‑on experience managing cloud GPU environments (GCP or AWS) and container orchestration with Kubernetes
Strong understanding of distributed systems fundamentals, including race conditions, memory management, and NCCL/inter‑node communication
An ownership mindset: you design, build, and operate systems end‑to‑end rather than simply deploying code

Nice to have

Experience with robotics data formats such as MCAP or Protobuf, or with multimodal models such as vision‑language‑action models (VLAs)
Deep ML systems work including custom kernels (Triton), compilers, or runtime optimisation
Experience as a founding or early‑stage infrastructure hire

What we offer

Competitive base salary of $220,000 – $320,000, plus equity
The opportunity to work at the frontier of embodied AI and physical robotics
A fast‑moving, research‑driven environment where your infrastructure work directly shapes what the robots can do
Backing from top investors including CRV and First Round, with over $100M raised

Skills

Accelerate, AWS, DeepSpeed, GCP, Kubernetes, NCCL, PyTorch, SLURM, TensorRT, Triton, ZeRO
