Site Reliability Engineer
Ascendion
About the role
Job Title: Site Reliability Engineer
Location: Bengaluru (hybrid, 2-3 days onsite per week)
Minimum relevant experience: 10+ years
Role Overview
We are recruiting multiple Site Reliability Engineers to embed directly within engineering teams. These engineers will become expert practitioners of the systems they support: understanding how they are built, how they behave in production, and how to troubleshoot them forensically when issues arise. This is a hands-on, technical role that demands both engineering rigour and operational instinct.
Key Responsibilities
- Embed within engineering squads to build deep system knowledge — understanding architecture, data flows, failure modes, and dependencies.
- Instrument systems with comprehensive observability — metrics, logs, traces, and alerting — to provide a full forensic picture of production behaviour.
- Participate in on‑call rotas and lead technical incident response, using structured troubleshooting and tooling to diagnose and resolve production issues rapidly.
- Proactively identify reliability risks and work with engineering teams to address them before they impact production.
- Build and maintain runbooks, playbooks, and diagnostic tooling to support efficient incident management.
- Monitor system performance continuously, validating both infrastructure health and functional correctness of data pipelines and application behaviour.
- Support the SRE Lead in establishing team‑wide standards for monitoring, alerting, and incident response.
Essential Skills & Experience
- Solid software engineering or platform engineering background with production operations experience.
- Hands‑on experience with observability and monitoring tooling (e.g. Datadog, Grafana, ELK stack, Prometheus, or equivalent).
- Experience troubleshooting complex distributed systems, with strong diagnostic skills and a methodical approach to incident investigation.
- Comfortable reading and understanding application code as well as infrastructure configuration.
- Experience working in Agile engineering teams with shared ownership of reliability outcomes.
Desirable / Nice-to-Have
- Experience in financial services, particularly with data‑intensive or calculation‑heavy systems (e.g. index calculation, pricing engines, market data pipelines).
- Familiarity with AI‑assisted diagnostic tools or agentic troubleshooting workflows.
- Knowledge of data ingestion patterns and how to validate the accuracy and completeness of processed data.
- Experience writing tooling or automation to improve operational workflows and reduce toil.
About Ascendion
Ascendion is transforming the future of technology with AI-driven software engineering. Our global team accelerates innovation and delivers future-ready solutions for some of the world’s most important industry leaders. Our applied AI, software engineering, cloud, data, experience design, and talent transformation capabilities accelerate innovation for Global 2000 clients. Join us to build transformative experiences, pioneer cutting-edge solutions, and thrive in a vibrant, inclusive culture, powered by AI and driven by bold ideas.