Site Reliability Architect/Lead
Coforge
Hyderabad · On-site · Full-time · Lead · Posted today
About the role
Title
Site Reliability Architect/Lead
Location
Hyderabad
Experience
12-16 Years
Responsibilities
- The Site Reliability Architect/Lead is responsible for implementing and operationalizing SRE practices across production systems, including defining and enforcing SLIs, SLOs, and error budgets.
- The role involves active participation in system- and architecture-level design decisions to ensure high availability, scalability, resilience, and performance.
- The individual owns observability standards, including hands-on dashboard creation, alert design, and continuous tuning to reduce false alerts.
- They lead infrastructure and application deployments, ensure reliable CI/CD pipelines, and drive automation to eliminate operational toil.
- They also manage incident response and RCAs, act as an escalation point during critical outages, and mentor SREs while promoting a reliability-first engineering culture.
Skill Stack
- Strong hands-on experience in observability and monitoring tools such as Prometheus, Grafana, Datadog, Dynatrace, New Relic, or ELK
- Infrastructure and application deployment using Kubernetes and cloud platforms (AWS, Azure, or GCP)
- CI/CD and GitOps tools such as Helm, Argo CD, Flux, Jenkins, GitHub Actions, or GitLab CI
- Infrastructure as Code using Terraform, CloudFormation, or ARM
- SRE automation using scripting languages such as Python, Go, or Bash/Shell
Requirements
- Proven experience working with distributed systems, microservices, and large-scale production environments.
Skills
AWS, Argo CD, Azure, Bash/Shell, CloudFormation, Datadog, Docker, Dynatrace, ELK, GCP, GitOps, GitLab CI, Go, Grafana, Helm, Infrastructure as Code, Jenkins, Kubernetes, New Relic, Prometheus, Python, Terraform