AI Platform SRE
HMG AMERICA LLC
US · Hybrid · Contract · Senior · Posted yesterday
About the role
Responsibilities
- Own end-to-end delivery of major platform initiatives—from architecture and design through deployment and post-launch optimization
- Lead deep technical ownership of Kubernetes environments, including cluster management, networking, operators, container lifecycle, and multi-tenant orchestration
- Design and build scalable, reliable distributed systems and cloud-native infrastructure on AWS and/or GCP
- Drive engineering excellence through code quality, design reviews, automation, and CI/CD best practices
- Collaborate cross-functionally with Product, AI, and Security teams to align technical solutions with business objectives
- Mentor engineers and guide architectural decisions, trade-offs, and delivery approaches
- Partner with leadership to shape engineering strategy, roadmap planning, and platform evolution
Required Qualifications
- 9+ years of experience in software engineering, with a strong focus on backend systems and infrastructure
- Proficiency in Python and/or Go, with a track record of delivering production-grade systems
- Deep, hands-on experience with Kubernetes, including building and operating clusters in production environments
- Proven expertise in designing and managing distributed systems at scale
- Strong experience with cloud platforms (AWS and/or GCP), including compute, networking, storage, and IAM
- Experience with Infrastructure as Code (Terraform or similar) and CI/CD pipelines
- Familiarity with applied AI tools and ecosystems, such as agent frameworks, AI gateways like LiteLLM, or models like Claude
- Strong system design skills and architectural decision-making capability
- Excellent communication and collaboration skills across engineering, product, and security teams
Preferred Qualifications
- Experience with observability tools such as Prometheus, Grafana, Datadog, and OpenTelemetry
- Exposure to multi-cloud or hybrid infrastructure environments
- Knowledge of API gateways, AI gateways, and policy-based access control (e.g., the ABAC model, policy engines such as OPA)
- Experience in service mesh architectures or platform-as-a-service design
- Demonstrated ability to improve engineering productivity and operational efficiency at scale
Skills
AWS · CI/CD · Cloud Native · Cloud Platforms · Container Orchestration · Docker · GCP · Go · Infrastructure as Code · Kubernetes · Python · Terraform