Platform Support Engineer

VirtualVocations

Remote (Global) | Full-time | Posted 1 month ago

About the role

This is a remote, full-time Platform Support Engineer position. The role supports ML engineers running large-scale training and inference workloads across cloud infrastructure, Kubernetes, and GPU platforms, with a focus on diagnosing failures and improving reliability.

Key Responsibilities

  • Partner with customer engineering teams to resolve complex distributed systems and ML infrastructure issues
  • Investigate failures involving distributed training, Kubernetes orchestration, and GPU allocation
  • Identify patterns in customer issues to drive long-term reliability improvements and contribute to operational enhancements

Required Qualifications

  • Strong software engineering and systems troubleshooting background
  • Experience with Kubernetes and containerized environments
  • Hands-on experience operating machine learning workloads in production or research environments
  • Familiarity with GPU infrastructure and orchestration
  • Experience with observability and debugging tools such as Prometheus or Grafana

Skills

Docker, Grafana, GPU, Kubernetes, Prometheus
