
Lead Site Reliability Engineer

AIQ

UAE · On-site · Lead · Today

About the role

AIQ is looking for a Lead Site Reliability Engineer to drive reliability, performance, and scalability across our infrastructure. This role will lead SRE initiatives, mentor team members, and collaborate with engineering and product teams to build robust systems that can scale globally.

Requirements

  • 12 years of experience in previous relevant roles
  • At least 1 year of experience leading a team
  • Expertise in Kubernetes, CI/CD (e.g., GitLab, Argo), and infrastructure-as-code (Terraform/Helm)
  • Strong experience in cloud (Azure, AWS, or GCP)
  • Proven background in SRE principles, SLIs/SLOs, and reliability-focused engineering
  • Programming proficiency in Python or Shell (nice to have)
  • Deep understanding of distributed systems, networking, and incident management

Responsibilities

  • Architect and lead reliability strategies across services and environments
  • Define and enforce SLOs, SLIs, and error budgets with engineering leadership
  • Lead incident response and root cause analysis
  • Implement automation to reduce toil and improve system resilience
  • Manage capacity planning, traffic forecasting, and cost optimization
  • Mentor junior and senior SREs in technical and process excellence
  • Collaborate with MLOps, DevSecOps, and CloudOps teams to enforce best practices
  • Champion observability, metrics-driven decisions, and platform maturity

Skills

Kubernetes, CI/CD, Infrastructure-as-code, Terraform, Helm, Azure, AWS, GCP, SRE principles, SLIs/SLOs, Python, Shell
