Senior Production Engineer

VirtualVocations

Rockville · On-site · Full-time · Senior · 3w ago

About the role

To support the scaling of AI infrastructure, the full-time Senior Production Engineer will manage production systems for large GPU clusters, with a focus on custom software development, monitoring, and cross-team collaboration to ensure reliability and performance.

Responsibilities

  • Develop and maintain production systems for scalable GPU clusters used in AI workloads
  • Implement monitoring and health management to enhance reliability and scalability of GPU assets
  • Collaborate with cross-functional teams to evaluate system failures and improve incident management processes

Qualifications

  • 8+ years of experience in Production Engineering, DevOps, or SRE roles, with proven impact
  • Bachelor's degree in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
  • Proficiency in systems programming languages such as Go or Python
  • Experience with large-scale production systems and related engineering principles
  • Strong technical knowledge of cluster management systems like Kubernetes or Slurm

Skills

Go, Kubernetes, Python, Slurm
