Senior Site Reliability Engineer - Fully remote (Pune)
Nameless
Pune · On-site · Full-time · Senior · Today
About the role
Overview
We are looking for a Senior Site Reliability Engineer to join our Engineering Infrastructure team. In this role, you will own the reliability, performance, and operational excellence of Generac’s cloud‑native software platforms. You will bridge the gap between development and operations by embedding SRE practices across engineering squads, driving automation, and ensuring our systems meet the highest availability and performance standards. Reports to: Manager, Site Reliability Engineering.
Responsibilities
Incident Response & On-Call Management
- Maintain and evolve on-call runbooks, escalation paths, and post-mortem processes to build a culture of blameless learning.
- Conduct thorough root cause analysis (RCA) and implement preventive measures to reduce mean time to recovery (MTTR) and increase mean time between failures (MTBF).
- Define, track, and report on SLOs, SLIs, and error budgets, using Grafana dashboards to surface real-time reliability signals to engineering leadership.
Infrastructure Automation & IaC
- Design, build, and maintain infrastructure-as-code (IaC) using Terraform and Ansible to provision and manage cloud resources across AWS (primary), GCP, and Azure.
- Automate repeatable operational tasks—reducing toil and enabling engineering teams to move faster with confidence.
- Lead Kubernetes cluster management and lifecycle operations, including upgrades, scaling, networking, and security hardening across environments.
- Manage and optimize GitHub Actions CI/CD pipelines, ensuring reliable, rapid, and secure software delivery from code commit to production.
Performance & Capacity Planning
- Lead capacity planning initiatives for multi-cloud infrastructure (AWS primary, GCP, Azure legacy), ensuring systems scale efficiently to meet business demand.
- Develop load testing frameworks and performance benchmarking strategies to identify bottlenecks before they impact customers.
- Analyze trends in system resource utilization and provide data‑driven recommendations for cost optimization and right‑sizing.
- Collaborate with engineering leadership on architecture reviews to ensure systems are designed with scalability and reliability as first‑class concerns.
- Build and maintain Grafana dashboards and alerting rules that provide end‑to‑end visibility into system performance and capacity headroom.
Developer Tooling & Platform Engineering
- Build and maintain internal developer platforms that improve engineering velocity, standardize observability, and reduce operational complexity.
- Partner with software engineering teams to embed reliability practices early in the SDLC—shift‑left on reliability, security, and performance.
- Provide SRE consultation to product squads on service architecture, deployment patterns, and observability instrumentation.
- Evangelize and implement best practices around feature flags, canary deployments, blue/green strategies, and rollback mechanisms.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- Deep expertise in AWS (EC2, EKS, RDS, Lambda, S3, CloudWatch, IAM, VPC, Route 53).
- Hands‑on experience with Kubernetes administration, including cluster upgrades, RBAC, networking (CNI plugins), and storage.
- Experience with Ansible or similar configuration management tools.
- Experience designing and managing GitHub Actions CI/CD pipelines at scale.
- Strong observability skills—experience with Grafana, Prometheus, or equivalent monitoring and alerting stacks.
- Solid programming or scripting skills in Python, Go, Bash, or similar languages for automation and tooling.
- Experience defining and managing SLOs, SLIs, and error budgets in production environments.
- Excellent communication skills—able to translate complex technical concepts for both engineering and business stakeholders.
- Experience working in a multi‑cloud environment, particularly managing legacy workloads in GCP or Azure alongside AWS.
- Familiarity with service mesh technologies (Istio, Linkerd) and advanced Kubernetes networking.
- Experience with chaos engineering tools (Chaos Monkey, Gremlin) and fault‑injection testing.
- Background in platform engineering or internal developer portal (IDP) development.
- Knowledge of FinOps practices for cloud cost optimization and rightsizing.
- AWS certifications (Solutions Architect, DevOps Engineer) or CKA/CKAD are advantageous.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
- Deep expertise in AWS (EC2, EKS, RDS, Lambda, S3, CloudWatch, IAM, VPC, Route 53)
- Hands‑on experience with Kubernetes administration (cluster upgrades, RBAC, networking, storage)
- Experience with Ansible or similar configuration‑management tools
- Experience designing and managing GitHub Actions CI/CD pipelines at scale
- Strong observability skills (Grafana, Prometheus, or equivalent monitoring/alerting stacks)
- Solid programming or scripting skills in Python, Go, Bash, or similar languages
- Experience defining and managing SLOs, SLIs, and error budgets in production
- Excellent communication skills for translating technical concepts to engineering and business stakeholders
- Experience working in a multi‑cloud environment (AWS primary, GCP and Azure legacy workloads)
- Familiarity with service‑mesh technologies (Istio, Linkerd) and advanced Kubernetes networking
- Experience with chaos‑engineering tools (Chaos Monkey, Gremlin) and fault‑injection testing
- Background in platform engineering or internal developer portal (IDP) development
- Knowledge of FinOps practices for cloud cost optimization and rightsizing
Responsibilities
- Own reliability, performance, and operational excellence of Generac’s cloud‑native software platforms
- Bridge development and operations by embedding SRE practices across engineering squads
- Drive automation and reduce toil through infrastructure‑as‑code (Terraform, Ansible) and tooling
- Maintain and evolve on‑call runbooks, escalation paths, and post‑mortem processes
- Conduct root‑cause analysis (RCA) and implement preventive measures to reduce MTTR and increase MTBF
- Define, track, and report on SLOs, SLIs, and error budgets; build Grafana dashboards for real‑time reliability signals
- Design, build, and maintain IaC for cloud resources across AWS, GCP, and Azure
- Lead Kubernetes cluster management and lifecycle operations (upgrades, scaling, networking, security hardening)
- Manage and optimize GitHub Actions CI/CD pipelines for reliable, rapid, and secure software delivery
- Lead capacity‑planning initiatives for multi‑cloud infrastructure and provide cost‑optimization recommendations
- Develop load‑testing frameworks and performance‑benchmarking strategies
- Collaborate with engineering leadership on architecture reviews to ensure scalability and reliability
- Build and maintain internal developer platforms to improve engineering velocity and standardize observability
- Partner with software engineering teams to embed reliability, security, and performance practices early in the SDLC
- Provide SRE consultation to product squads on service architecture, deployment patterns, and observability instrumentation
- Evangelize best practices around feature flags, canary deployments, blue/green strategies, and rollback mechanisms
Skills
AWS, Kubernetes, Terraform, Ansible, GitHub Actions, Grafana, Prometheus, Python, Go, Bash, SLO/SLI management, Multi‑cloud (AWS, GCP, Azure), Service mesh (Istio, Linkerd), Chaos engineering (Chaos Monkey, Gremlin), Platform engineering, FinOps