Senior Site Reliability Engineer - Fully remote (Pune)

Nameless

Pune · On-site · Full-time · Senior · Today

About the role

Overview

We are looking for a Senior Site Reliability Engineer to join our Engineering Infrastructure team. In this role, you will own the reliability, performance, and operational excellence of Generac’s cloud‑native software platforms. You will bridge the gap between development and operations by embedding SRE practices across engineering squads, driving automation, and ensuring our systems meet the highest availability and performance standards. This role reports to the Manager, Site Reliability Engineering.

Responsibilities

Incident Response & On-Call Management

Maintain and evolve on-call runbooks, escalation paths, and post-mortem processes to build a culture of blameless learning.
Conduct thorough root cause analysis (RCA) and implement preventive measures to reduce mean time to recovery (MTTR) and increase mean time between failures (MTBF).
Define, track, and report on SLOs, SLIs, and error budgets, using Grafana dashboards to surface real-time reliability signals to engineering leadership.

Infrastructure Automation & IaC

Design, build, and maintain infrastructure-as-code (IaC) using Terraform and Ansible to provision and manage cloud resources across AWS (primary), GCP, and Azure.
Automate repeatable operational tasks—reducing toil and enabling engineering teams to move faster with confidence.
Lead Kubernetes cluster management and lifecycle operations, including upgrades, scaling, networking, and security hardening across environments.
Manage and optimize GitHub Actions CI/CD pipelines, ensuring reliable, rapid, and secure software delivery from code commit to production.

Performance & Capacity Planning

Lead capacity planning initiatives for multi-cloud infrastructure (AWS primary, GCP, Azure legacy), ensuring systems scale efficiently to meet business demand.
Develop load testing frameworks and performance benchmarking strategies to identify bottlenecks before they impact customers.
Analyze trends in system resource utilization and provide data‑driven recommendations for cost optimization and right‑sizing.
Collaborate with engineering leadership on architecture reviews to ensure systems are designed with scalability and reliability as first‑class concerns.
Build and maintain Grafana dashboards and alerting rules that provide end‑to‑end visibility into system performance and capacity headroom.

Developer Tooling & Platform Engineering

Build and maintain internal developer platforms that improve engineering velocity, standardize observability, and reduce operational complexity.
Partner with software engineering teams to embed reliability practices early in the SDLC—shift‑left on reliability, security, and performance.
Provide SRE consultation to product squads on service architecture, deployment patterns, and observability instrumentation.
Evangelize and implement best practices around feature flags, canary deployments, blue/green strategies, and rollback mechanisms.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Deep expertise in AWS (EC2, EKS, RDS, Lambda, S3, CloudWatch, IAM, VPC, Route 53).
Hands‑on experience with Kubernetes administration, including cluster upgrades, RBAC, networking (CNI plugins), and storage.
Experience with Ansible or similar configuration management tools.
Experience designing and managing GitHub Actions CI/CD pipelines at scale.
Strong observability skills—experience with Grafana, Prometheus, or equivalent monitoring and alerting stacks.
Solid programming or scripting skills in Python, Go, Bash, or similar languages for automation and tooling.
Experience defining and managing SLOs, SLIs, and error budgets in production environments.
Excellent communication skills—able to translate complex technical concepts for both engineering and business stakeholders.
Experience working in a multi‑cloud environment, particularly managing legacy workloads in GCP or Azure alongside AWS.
Familiarity with service mesh technologies (Istio, Linkerd) and advanced Kubernetes networking.
Experience with chaos engineering tools (Chaos Monkey, Gremlin) and fault‑injection testing.
Background in platform engineering or internal developer portal (IDP) development.
Knowledge of FinOps practices for cloud cost optimization and rightsizing.
AWS certifications (Solutions Architect, DevOps Engineer) or CKA/CKAD are advantageous.

Skills

AWS
Kubernetes
Terraform
Ansible
GitHub Actions
Grafana
Prometheus
Python
Go
Bash
SLO/SLI management
Multi‑cloud (AWS, GCP, Azure)
Service mesh (Istio, Linkerd)
Chaos engineering (Chaos Monkey, Gremlin)
Platform engineering
FinOps
