Sr. Staff Site Reliability Engineer

SolarWinds

Bengaluru · On-site · Full-time · Senior · 2d ago

About SolarWinds

At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve, including our employees, customers, shareholders, partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.

The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world‑class solutions in a fast‑paced environment and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us.

About the Role

As a Senior Staff Site Reliability Engineer, you will play a pivotal role in driving reliability and performance improvements across the SolarWinds Observability Platform. You will work closely with cross‑functional engineering teams to manage and reduce SaaS backlogs, ensuring that our platform scales effectively while maintaining the highest standards of reliability and performance. Your ability to drive initiatives, provide technical leadership, and optimize complex systems will be key to our success.

This role demands deep technical expertise, a collaborative mindset, and the ability to mentor a high‑performing team of engineers. You will be responsible for driving technical initiatives, overseeing incident response, and improving our platform’s infrastructure while focusing on the integration of emerging technologies such as ClickHouse, Kafka, Karpenter, and Buf.

Key Responsibilities

  • Lead and Drive Initiatives: Own and lead strategic initiatives to improve the reliability, scalability, and performance of the SolarWinds Observability Platform, with a strong focus on reducing SaaS backlogs.
  • SaaS Backlog Management: Collaborate with cross‑functional teams to identify, prioritize, and address outstanding backlog items, including incidents, infrastructure improvements, performance optimization, and automation.
  • Automation & Observability: Lead the development of automation strategies and observability tools to improve platform monitoring, reduce incidents, and enhance performance insights across the infrastructure.
  • Incident Response & Postmortems: Lead response efforts for production incidents, conduct thorough postmortems, drive continuous improvement initiatives, and ensure the team learns from each incident.
  • Platform Engineering Leadership: Drive initiatives related to platform engineering and scale infrastructure systems, ensuring they meet the reliability and performance standards necessary for the SolarWinds Observability Platform.
  • Mentorship & Team Leadership: Mentor and provide technical guidance to the Site Reliability Engineering (SRE) team, helping them grow their skills and driving a culture of continuous learning and collaboration.
  • Collaboration & Cross‑Functional Engagement: Collaborate closely with engineering, security, and product teams to ensure the seamless integration of new technologies and systems, improving platform reliability and scalability.

Ideal Candidate Attributes

  • Strong Leadership Skills: Proven ability to drive initiatives, manage SaaS backlogs, and lead cross‑functional teams to successful outcomes.
  • Collaborative Mindset: Comfortable working with diverse teams across different functions to solve complex problems and build scalable, high‑performance systems.
  • Customer‑Focused: A strong customer orientation, with the ability to translate technical challenges into business solutions.
  • Excellent Communication: Strong interpersonal and communication skills to effectively engage with both technical and non‑technical stakeholders.
  • Problem‑Solving & Ownership: A collaborative problem solver with a strong bias for ownership and decisive action.

Qualifications

  • 13+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles, with extensive experience managing SaaS environments.
  • 8+ years of experience designing, building, and maintaining AWS/Azure infrastructure, using Terraform and automation tools.
  • 5+ years of experience building, running, and scaling Kubernetes clusters in production environments.
  • Experience with Observability tools (e.g., monitoring, logging, tracing, metrics) and practices for high‑performance systems.
  • Strong expertise with Kafka for real‑time data processing, ClickHouse for OLAP workloads, and GitOps CI/CD processes.
  • Familiarity with Karpenter for Kubernetes autoscaling, and Buf for managing

Skills

AWS, Azure, Buf, ClickHouse, GitOps, Kafka, Karpenter, Kubernetes, Observability, Terraform
