Site Reliability Architect/Lead

Coforge

Hyderabad · On-site · Full-time · Lead · Today

About the role

Title

Site Reliability Architect/Lead

Location

Hyderabad

Experience

12-16 Years

Responsibilities

  • The Site Reliability Architect/Lead will be responsible for implementing and operationalizing SRE practices across production systems, including defining and enforcing SLIs, SLOs, and error budgets.
  • The role involves active participation in system and architecture-level design decisions to ensure high availability, scalability, resilience, and performance.
  • The individual will own observability standards, including hands-on dashboard creation, alert design, and continuous tuning to reduce false alerting.
  • They will lead infrastructure and application deployments, ensure reliable CI/CD pipelines, and drive automation to eliminate operational toil.
  • They will manage incident responses and RCAs, act as an escalation point during critical outages, and mentor SREs while promoting a reliability-first engineering culture.

Skill Stack

  • Strong hands-on experience in observability and monitoring tools such as Prometheus, Grafana, Datadog, Dynatrace, New Relic, or ELK
  • Infrastructure and application deployment using Kubernetes and cloud platforms (AWS, Azure, or GCP)
  • CI/CD and GitOps tools such as Helm, Argo CD, Flux, Jenkins, GitHub Actions, or GitLab CI
  • Infrastructure as Code using Terraform, CloudFormation, or ARM templates
  • SRE automation using scripting languages such as Python, Go, or Bash/Shell

Requirements

  • Proven experience working with distributed systems, microservices, and large-scale production environments.

Skills

AWS, Argo CD, Azure, Bash/Shell, CloudFormation, Datadog, Docker, Dynatrace, ELK, GCP, GitOps, GitLab CI, Go, Grafana, Helm, Infrastructure as Code, Jenkins, Kubernetes, New Relic, Prometheus, Python, Terraform
