Senior Site Reliability Engineer
Taazaa Inc
About the role
We’re looking for a Senior Site Reliability Engineer to join our Site Reliability Engineering (SRE) team in Noida.
As a Site Reliability Engineer (SRE), you will play a pivotal role in designing, implementing, and maintaining the infrastructure that supports our software development lifecycle. You will work closely with software engineers, QA, and IT teams to ensure the availability, reliability, and performance of our systems. Your primary focus will be on streamlining our deployment processes, improving system scalability, and keeping our infrastructure robust, secure, and cost‑efficient.
About Taazaa
Working at Taazaa involves engaging with cutting‑edge technology and innovative software solutions in a collaborative environment. We emphasize continuous professional growth, offering workshops and training. Our employees often interact with clients to tailor solutions to business needs, working on diverse projects across industries.
We promote work‑life balance with flexible hours and remote options, fostering a supportive and inclusive culture. Competitive salaries, health benefits, and various perks further enhance the work experience.
Looking ahead, we aim to expand our technological capabilities and market reach, investing in advanced technologies and expanding our service offerings. We plan to deepen our expertise in AI and machine learning, enhance our cloud services, and continue fostering a culture of innovation and excellence. Taazaa is committed to staying at the forefront of technology trends, ensuring it delivers impactful and transformative solutions for its clients.
Responsibilities (What you’ll do)
- Partner with product engineering squads to design, build, and operate highly reliable services
- Own and improve production reliability end‑to‑end:
  - Define and measure SLOs/SLIs, error budgets, and reliability goals
  - Lead incident response, postmortems, and follow‑up action items
  - Participate in on‑call rotation and drive rapid, effective resolution of production issues
- Build and maintain world‑class observability:
  - Create comprehensive dashboards, alerts, metrics, structured logging, and distributed tracing
  - Enable squads to understand system behavior and debug effectively
- Develop automation, tooling, and infrastructure as code to reduce toil and increase developer velocity
- Collaborate closely with Staff Engineers / Team Leads to:
  - Embed reliability best practices into the development lifecycle
  - Review architectural decisions with a production lens
  - Mentor engineers on operational excellence, observability, and on‑call mindset
- Champion modern engineering and DevOps practices:
  - CI/CD pipelines
  - Progressive delivery (feature flags, canaries, blue‑green)
  - Infrastructure as code (Terraform, Pulumi, CDK)
  - Effective use of AI‑assisted tools to accelerate scripting, debugging, and documentation
- Proactively identify and eliminate classes of failure through chaos engineering, capacity planning, and performance tuning
- Help evolve our technical strategy for reliability, scalability, and cost‑efficiency
Qualifications – Technical
- 5+ years of professional experience in SRE, DevOps, or software engineering with a strong focus on production systems
- Deep hands‑on experience operating distributed cloud systems (AWS / GCP / Azure — at least one in depth, preferably AWS)
- Proficiency in at least one modern programming language used for tooling & automation (Go, Python, TypeScript/JavaScript, Rust)
- Strong observability expertise:
  - Building dashboards and alerts (Grafana, Groundcover, Datadog, New Relic, Prometheus, etc.)
  - Distributed tracing (OpenTelemetry, Jaeger, Zipkin)
  - Structured logging and metrics at scale
- Proven track record of incident management, postmortems, and driving reliability improvements
- Experience defining and working with SLOs, SLIs, and error budgets
- Comfort with infrastructure as code and modern DevOps practices (CI/CD, GitOps, containers/Kubernetes)
- Excellent collaboration skills — you enjoy partnering with product engineers and teaching reliability concepts
- Bias toward automation and reducing manual toil
Nice‑to‑Haves
- Previous on‑call leadership or incident commander experience
- Background in performance engineering or capacity planning at scale
- Familiarity with service meshes, API gateways, or zero‑trust networking
- Contributions to open‑source reliability/observability tools
- Experience mentoring or embedding within product squads