Generalist Infrastructure and Systems Engineer
Thinking Machines
About the Role
Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open‑weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
We’re looking for generalist infrastructure and systems engineers to help build the systems that power our foundation models and enable internal research and product teams to create and ship products powered by those models. You’ll join a small, high‑impact team responsible for architecting and scaling the core infrastructure behind everything we do, working across the full technical stack to solve complex distributed systems problems and build robust, scalable platforms.
Infrastructure is critical to us: it’s the bedrock that enables every breakthrough. You’ll work directly with researchers to accelerate experiments, improve infrastructure efficiency, and enable key insights across our models, products, and data assets.
Note: This is an “evergreen role” that remains open on an ongoing basis. Applications are continuously reviewed, and candidates may be contacted as new opportunities arise.
What You’ll Do
Example areas you may contribute to, depending on your expertise and interest:
- Core Infrastructure – Support teams that train, research, and serve AI models; build and operate large Kubernetes clusters with GPU workloads; develop infrastructure to support projects such as Tinker.
- Data Infrastructure – Design and optimize data pipelines using tools like Spark and modern data‑infrastructure technologies; build scalable, reliable data systems while embedding governance best practices.
- Developer Productivity – Build tooling, frameworks, and systems to ensure well‑configured, optimized developer environments and maintain high engineering productivity.
Skills and Qualifications
Minimum qualifications
- Bachelor’s degree or equivalent experience in computer science, engineering, or a related field.
- Proficiency in at least one backend language (Python or Rust).
- Experience operating large‑scale clusters and container orchestration systems (e.g., Kubernetes or Slurm).
- Comfort operating across the stack and owning projects end‑to‑end.
- Ability to thrive in a highly collaborative environment with many cross‑functional partners and subject‑matter experts.
- Bias for action and initiative to work across different stacks and teams where opportunities arise.
Preferred qualifications
- Strong debugging skills across application, OS, and network layers.
- Proficiency with Python or Rust (or similar), containers, and modern CI pipelines.
- Experience with Kubernetes controllers/operators or performance profiling.
- Familiarity with GPU/ML workflows or large‑scale data/evaluation pipelines.
Logistics
- Location: San Francisco, California
- Compensation: Expected annual salary range $350,000 – $475,000 USD (depending on background, skills, and experience)
- Visa sponsorship: Available; we will work with the right candidate through the visa process.
- Benefits: Generous health, dental, and vision coverage; unlimited PTO; paid parental leave; relocation support as needed.
Thinking Machines is an equal‑opportunity employer and does not discriminate on the basis of any protected group status under applicable law.