Senior Site Reliability Engineer

Taazaa Inc

India · Hybrid · Full-time · Senior · Posted today

About the role

We’re looking for a Senior DevOps Engineer to join our Site Reliability Engineering (SRE) team in Noida.

As a Site Reliability Engineer (SRE), you will play a pivotal role in designing, implementing, and maintaining the infrastructure that supports our software development lifecycle. You will work closely with software engineers, QA, and IT teams to ensure the availability, reliability, and performance of our systems. Your primary focus will be streamlining our deployment processes, improving system scalability, and keeping our infrastructure robust, secure, and cost‑efficient.

About Taazaa

Working at Taazaa involves engaging with cutting‑edge technology and innovative software solutions in a collaborative environment. We emphasize continuous professional growth, offering workshops and training. Our employees often interact with clients to tailor solutions to business needs, working on diverse projects across industries.

We promote work‑life balance with flexible hours and remote options, fostering a supportive and inclusive culture. Competitive salaries, health benefits, and various perks further enhance the work experience.

Looking ahead, we aim to expand our technological capabilities and market reach, investing in advanced technologies and expanding our service offerings. We plan to deepen our expertise in AI and machine learning, enhance our cloud services, and continue fostering a culture of innovation and excellence. Taazaa is committed to staying at the forefront of technology trends, ensuring it delivers impactful and transformative solutions for its clients.

Responsibilities (What you’ll do)

Partner with product engineering squads to design, build, and operate highly reliable services
Own and improve production reliability end‑to‑end:
- Define and measure SLOs/SLIs, error budgets, and reliability goals
- Lead incident response, postmortems, and follow‑up action items
- Participate in on‑call rotation and drive rapid, effective resolution of production issues
Build and maintain world‑class observability:
- Create comprehensive dashboards, alerts, metrics, structured logging, and distributed tracing
- Enable squads to understand system behavior and debug effectively
Develop automation, tooling, and infrastructure as code to reduce toil and increase developer velocity
Collaborate closely with Staff Engineers / Team Leads to:
- Embed reliability best practices into the development lifecycle
- Review architectural decisions with a production lens
- Mentor engineers on operational excellence, observability, and on‑call mindset
Champion modern engineering and DevOps practices:
- CI/CD pipelines
- Progressive delivery (feature flags, canaries, blue‑green)
- Infrastructure as code (Terraform, Pulumi, CDK)
- Effective use of AI‑assisted tools to accelerate scripting, debugging, and documentation
Proactively identify and eliminate classes of failure through chaos engineering, capacity planning, and performance tuning
Help evolve our technical strategy for reliability, scalability, and cost‑efficiency

Qualifications – Technical

5+ years of professional experience in SRE, DevOps, or software engineering with a strong focus on production systems
Deep hands‑on experience operating distributed cloud systems (AWS / GCP / Azure — at least one in depth, preferably AWS)
Proficiency in at least one modern programming language used for tooling & automation (Go, Python, TypeScript/JavaScript, Rust)
Strong observability expertise:
- Building dashboards and alerts (Grafana, Groundcover, Datadog, New Relic, Prometheus, etc.)
- Distributed tracing (OpenTelemetry, Jaeger, Zipkin)
- Structured logging and metrics at scale
Proven track record of incident management, postmortems, and driving reliability improvements
Experience defining and working with SLOs, SLIs, and error budgets
Comfort with infrastructure as code and modern DevOps practices (CI/CD, GitOps, containers/Kubernetes)
Excellent collaboration skills — you enjoy partnering with product engineers and teaching reliability concepts
Bias toward automation and reducing manual toil

Nice‑to‑Haves

Previous on‑call leadership or incident commander experience
Background in performance engineering or capacity planning at scale
Familiarity with service meshes, API gateways, or zero‑trust networking
Contributions to open‑source reliability/observability tools
Experience mentoring or embedding within product squads

Benefits

Health benefits

Skills

AWS, CDK, CI/CD, Datadog, Docker, GCP, GitOps, Go, Grafana, Groundcover, Jaeger, JavaScript, Kubernetes, New Relic, OpenTelemetry, Prometheus, Pulumi, Python, Rust, Terraform, TypeScript, Zipkin
