Site Reliability Engineer
Clearwater Analytics
India · On-site · Full-time · Senior · Posted today
About the role
Responsibilities
- Own the observability platform end-to-end (Prometheus, Grafana, and distributed tracing with OpenTelemetry); establish SLO/SLI frameworks and track SLAs across all systems and applications.
- Lead major incident response as an incident commander; drive root‑cause analysis and systemic remediation programs.
- Lead the evolution of CWAN's cloud infrastructure on AWS, establishing scalability, resilience, and security standards across all services.
- Serve as the primary owner of the Kubernetes (EKS) platform: design cluster topology, multi‑tenancy models, autoscaling strategies, and upgrade lifecycle.
- Define and enforce the organization's Infrastructure‑as‑Code standards using Terraform and Ansible; drive adoption of GitOps workflows.
- Build and maintain CI/CD and automated deployment pipelines for all services and applications across environments.
- Evaluate and introduce emerging technologies (eBPF, WASM, service meshes) to improve platform capabilities.
- Partner with engineering leadership to embed reliability requirements into the SDLC; champion chaos engineering and resilience testing programs.
- Mentor and grow mid‑level and junior SREs across global teams through code reviews, pairing, and structured knowledge sharing.
- Drive capacity planning, cost optimization, and FinOps practices across the AWS environment.
- Contribute to the engineering roadmap and help define the long‑term reliability strategy for the CWAN platform.
Qualifications (Required)
- 7+ years of experience in Site Reliability Engineering, Platform Engineering, or related roles.
- Proven track record leading major incident response and driving post‑incident systemic improvements.
- Strong experience building and operating observability stacks at scale.
- Hands‑on experience with monitoring, logging, and tracing tools such as Grafana, Prometheus, Mimir, OpenSearch, Dynatrace/Datadog, and VictoriaMetrics.
- Demonstrated ability to mentor engineers and influence technical direction across time zones.
- Deep expertise with Kubernetes in large‑scale, multi‑cluster production environments.
- Advanced proficiency with AWS (EKS, RDS/Aurora, ElastiCache, Direct Connect, IAM/SCP, Cost Explorer).
- Expert‑level Infrastructure‑as‑Code skills with Terraform (modules, remote state, Atlantis or similar).
- Hands‑on experience with CI/CD platforms: Jenkins, GitHub Actions, and GitLab CI.
- Experience with GitOps workflows (ArgoCD, Rancher).
- Proficiency in at least one general‑purpose programming language (Python, Go, Java) for building tooling and automation.
- Experience with security best practices in cloud environments (IAM least privilege, secrets management, etc.).
Preferred Qualifications
- Experience in financial services, FinTech, or other mission‑critical, regulated environments.
- Hands‑on experience with service mesh (Istio) and eBPF‑based observability tools.
- Prior staff or principal engineer experience with cross‑team technical influence in a global organization.
- AWS and Kubernetes certifications at the Professional level.
- Experience with multi‑region active‑active architectures and global load balancing.
Skills
Ansible, ArgoCD, AWS, AWS Cost Explorer, AWS Direct Connect, AWS EKS, AWS ElastiCache, AWS IAM, AWS RDS/Aurora, Datadog, Docker, eBPF, Git, GitOps, GitHub Actions, GitLab CI, Go, Grafana, Istio, Java, Jenkins, Kubernetes, Mimir, OpenTelemetry, OpenSearch, Prometheus, Python, Rancher, Service Mesh, Terraform, VictoriaMetrics, WASM