Site Reliability Engineer

Clearwater Analytics

India · On-site · Full-time · Senior · Today

About the role

Responsibilities

  • Own the observability platform end-to-end (Prometheus, Grafana, and distributed tracing with OpenTelemetry); establish SLO/SLI frameworks and track SLAs across all systems and applications.
  • Lead major incident response as an incident commander; drive root‑cause analysis and systemic remediation programs.
  • Lead the evolution of Clearwater Analytics' (CWAN) cloud infrastructure on AWS, establishing scalability, resilience, and security standards across all services.
  • Serve as the primary owner of the Kubernetes (EKS) platform: design cluster topology, multi‑tenancy models, autoscaling strategies, and upgrade lifecycle.
  • Define and enforce the organization's Infrastructure‑as‑Code standards using Terraform and Ansible; drive adoption of GitOps workflows.
  • Build and maintain CI/CD and automated deployment pipelines for all services and applications across environments.
  • Evaluate and introduce emerging technologies (eBPF, WASM, service meshes) to improve platform capabilities.
  • Partner with engineering leadership to embed reliability requirements into the SDLC; champion chaos engineering and resilience testing programs.
  • Mentor and grow mid‑level and junior SREs across global teams through code reviews, pairing, and structured knowledge sharing.
  • Drive capacity planning, cost optimization, and FinOps practices across the AWS environment.
  • Contribute to the engineering roadmap and help define the long‑term reliability strategy for the CWAN platform.

Qualifications (Required)

  • 7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles.
  • Proven track record leading major incident response and driving post‑incident systemic improvements.
  • Strong experience building and operating observability stacks at scale.
  • Hands‑on experience with monitoring, logging, and tracing tools such as Grafana, Prometheus, Mimir, OpenSearch, Dynatrace/Datadog, and VictoriaMetrics.
  • Demonstrated ability to mentor engineers and influence technical direction across time zones.
  • Deep expertise with Kubernetes in large‑scale, multi‑cluster production environments.
  • Advanced proficiency with AWS (EKS, RDS/Aurora, ElastiCache, Direct Connect, IAM/SCP, Cost Explorer).
  • Expert‑level Infrastructure‑as‑Code skills with Terraform (modules, remote state, Atlantis or similar).
  • Hands‑on experience with CI/CD platforms: Jenkins, GitHub Actions, and GitLab CI.
  • Experience with GitOps workflows (ArgoCD, Rancher).
  • Proficiency in at least one general‑purpose programming language (Python, Go, Java) for building tooling and automation.
  • Experience with security best practices in cloud environments (IAM least privilege, secrets management, etc.).

Preferred Qualifications

  • Experience in financial services, FinTech, or other mission‑critical, regulated environments.
  • Hands‑on experience with service mesh (Istio) and eBPF‑based observability tools.
  • Prior staff or principal engineer experience with cross‑team technical influence in a global organization.
  • AWS and Kubernetes certifications at the Professional level.
  • Experience with multi‑region active‑active architectures and global load balancing.

Skills

Ansible, ArgoCD, AWS, AWS Cost Explorer, AWS Direct Connect, AWS EKS, AWS ElastiCache, AWS IAM, AWS RDS/Aurora, Datadog, Docker, eBPF, Git, GitOps, GitHub Actions, GitLab CI, Go, Grafana, Istio, Java, Jenkins, Kubernetes, Mimir, OpenTelemetry, OpenSearch, Prometheus, Python, Rancher, Service Mesh, Terraform, VictoriaMetrics, WASM