Platform Support Engineer
VirtualVocations
Remote (Global), Full-time, posted 1mo ago
About the role
The Platform Support Engineer is a remote, full-time position supporting ML engineers who run large-scale training and inference workloads across cloud infrastructure, Kubernetes, and GPU platforms. The role involves diagnosing failures and improving reliability.
Key Responsibilities
- Partner with customer engineering teams to resolve complex distributed systems and ML infrastructure issues
- Investigate failures involving distributed training, Kubernetes orchestration, and GPU allocation
- Identify patterns in customer issues to drive long-term reliability improvements and contribute to operational enhancements
Required Qualifications
- Strong software engineering and systems troubleshooting background
- Experience with Kubernetes and containerized environments
- Hands-on experience operating machine learning workloads in production or research environments
- Familiarity with GPU infrastructure and orchestration
- Experience with observability and debugging tools such as Prometheus or Grafana
Skills
Docker, Grafana, GPU, Kubernetes, Prometheus