Lead Site Reliability Engineer

AIQ

UAE · On-site · Lead · Today

About the role

AIQ is looking for a Lead Site Reliability Engineer to drive reliability, performance, and scalability across our infrastructure. This role will lead SRE initiatives, mentor team members, and collaborate with engineering and product teams to build robust systems that can scale globally.

Requirements

12 years of experience in previous relevant roles
At least 1 year of experience leading a team
Expertise in Kubernetes, CI/CD (e.g., GitLab, Argo), and infrastructure-as-code (Terraform/Helm)
Strong experience in cloud (Azure, AWS, or GCP)
Proven background in SRE principles, SLIs/SLOs, and reliability-focused engineering
Programming proficiency in Python or Shell (nice to have)
Deep understanding of distributed systems, networking, and incident management

Responsibilities

Architect and lead reliability strategies across services and environments
Define and enforce SLOs, SLIs, and error budgets with engineering leadership
Lead incident response and root cause analysis
Implement automation to reduce toil and improve system resilience
Manage capacity planning, traffic forecasting, and cost optimization
Mentor junior and senior SREs in technical and process excellence
Collaborate with MLOps, DevSecOps, and CloudOps teams to enforce best practices
Champion observability, metrics-driven decisions, and platform maturity

Skills

Kubernetes
CI/CD
Infrastructure-as-code
Terraform
Helm
Azure
AWS
GCP
SRE principles
SLIs/SLOs
Python
Shell
