Lead Site Reliability Engineer
AIQ
About the role
AIQ is looking for a Lead Site Reliability Engineer to drive reliability, performance, and scalability across our infrastructure. This role will lead SRE initiatives, mentor team members, and collaborate with engineering and product teams to build robust systems that can scale globally.
Requirements
- 12 years of experience in relevant roles
- At least 1 year of experience leading a team
- Expertise in Kubernetes, CI/CD (e.g., GitLab, Argo), and infrastructure-as-code (Terraform/Helm)
- Strong experience in cloud (Azure, AWS, or GCP)
- Proven background in SRE principles, SLIs/SLOs, and reliability-focused engineering
- Programming proficiency in Python or Shell (nice to have)
- Deep understanding of distributed systems, networking, and incident management
Responsibilities
- Architect and lead reliability strategies across services and environments
- Define and enforce SLOs, SLIs, and error budgets with engineering leadership
- Lead incident response and root cause analysis
- Implement automation to reduce toil and improve system resilience
- Manage capacity planning, traffic forecasting, and cost optimization
- Mentor junior and senior SREs in technical and process excellence
- Collaborate with MLOps, DevSecOps, and CloudOps teams to enforce best practices
- Champion observability, metrics-driven decisions, and platform maturity