SRE Lead

Birlasoft Limited

Pimpri-Chinchwad · On-site · Full-time · Lead · 3d ago

About the role

Summary

Job description - SRE Lead (SRE & Platform Infra)

The SRE Lead is responsible for driving reliability, resiliency, and performance across Birlasoft’s Platform Engineering ecosystem, including microservices, cloud workloads, Cogito agentic operations, and enterprise applications. The role ensures high availability and predictable performance through SLO-driven engineering, observability, automation‑first operations, and incident excellence. The SRE Lead partners closely with DevOps, Cloud, Architecture, Security, and Delivery teams to embed reliability into the design, build, and run phases.

Roles & Responsibilities

Reliability & Performance

Define & maintain SLOs, SLIs, and error budgets for platform services.
Lead capacity planning, performance tuning, autoscaling strategies, and resilience testing.
Drive reliability patterns such as graceful degradation, retry logic, and distributed failover.

Observability & Monitoring

Own monitoring stack across Azure Monitor, App Insights, Log Analytics, OpenTelemetry, and AKS.
Design alerting standards: noise reduction, correlation, routing, and escalation.
Build health, reliability, and risk dashboards for leadership.

Incident & Problem Management

Lead incident response, on‑call processes, and blameless postmortems.
Drive MTTR reduction through automation, playbooks, and predictive analytics.
Establish proactive issue detection mechanisms using patterns, telemetry, and AIOps.

Automation & AIOps

Implement automation‑first operations for remediation, self‑healing, and repetitive tasks.
Integrate AI‑driven agent workflows with Cogito for troubleshooting, optimization, and cost‑ops.
Increase operational maturity through runbooks, autopilot actions, and integrated CI/CD reliability checks.

Collaboration & Governance

Partner with Platform Engineering pods (Infra, Core, Integration, DevEx, Security) to embed reliability by design.
Influence architecture for scalability, observability, and fault tolerance.
Mentor SRE engineers and lead the maturity of SRE practices across accounts.

Technical Skills

Mandatory

Azure Monitor, App Insights, Log Analytics, KQL
Kubernetes (AKS), autoscaling, HPA/KEDA
Distributed tracing (OpenTelemetry)
CI/CD pipelines & release engineering (Azure DevOps/Jenkins/GitOps)
Incident management, root‑cause analysis, and on‑call frameworks
Performance testing, load testing, and capacity planning
Infrastructure as Code (Terraform/ARM/Bicep)
Strong understanding of microservices & cloud‑native architecture
Python/PowerShell/Go scripting for automation

Qualifications

Bachelor’s degree in Computer Science, Engineering, or related field
8–14 years of experience in SRE, DevOps, Cloud, or Platform Engineering roles
Certifications preferred:
- Azure Administrator / Azure DevOps Engineer
- Kubernetes (CKA/CKAD)
- SRE Foundation / SRE Practitioner
Demonstrated leadership experience managing SRE/DevOps teams, reliability initiatives, or mission‑critical platforms

Skills

AKS, App Insights, ARM, Azure DevOps, Azure Monitor, Bicep, CI/CD, Cloud-native architecture, Cogito, Docker, GitOps, Go, HPA/KEDA, Infrastructure as Code, Jenkins, KQL, Kubernetes, Log Analytics, Microservices, OpenTelemetry, PowerShell, Python, Terraform
