Lead Observability Platform Engineer
Hispanic Alliance for Career Enhancement
About the role
Position Summary
We're building a world of health around every individual - shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger - helping to simplify health care one person, one family and one community at a time.
Join CVS Health Enterprise Technology and help evolve observability at Fortune‑6 scale. The Enterprise Observability Platform (EOP) delivers standardized, frictionless instrumentation and telemetry pipelines for engineering teams across all CVS Health application environments, spanning on‑prem, hybrid, and multiple public clouds.
As a Lead Observability Platform Engineer, you will design, build, and operate large‑scale observability services that process billions of logs, metrics, and traces daily. You will develop high‑performance backend services using Go, Java, and Node.js, and lead the adoption of OpenTelemetry‑based instrumentation and standards across the enterprise.
In this role, you will partner closely with SRE, Cloud Engineering, CI/CD, Infrastructure, Security, and application teams to shape platform strategy, enhance developer experience, and ensure reliable, secure, and cost‑efficient observability. You will provide senior technical leadership, influence architectural direction, and help deliver a world‑class, self‑service observability ecosystem that accelerates engineering productivity and operational excellence.
Key Responsibilities
- Design, build, and operate core observability platform services using Go, Java (Spring Boot), and Node.js.
- Lead enterprise‑wide adoption of OpenTelemetry, including client libraries, semantic conventions, instrumentation patterns, and Collector/agent strategy.
- Architect and scale high‑throughput, fault‑tolerant telemetry pipelines (logs, metrics, traces) with a focus on performance, reliability, and cost efficiency.
- Develop self‑service observability capabilities that simplify onboarding, troubleshooting, and adoption for application teams.
- Implement end‑to‑end monitoring of the observability platform itself, defining SLOs, health checks, and alerting.
- Collaborate with SRE, Platform, and Cloud teams to establish reliability standards, error budgets, and incident response practices.
- Participate in on‑call rotations and lead incident mitigation, root‑cause analysis, and post‑incident reviews.
- Automate operational workflows and eliminate manual toil through tooling, CI/CD enhancements, and platform automation.
- Ensure secure telemetry pipelines through mTLS, secrets management, and zero‑trust design patterns.
- Produce and maintain high‑quality technical documentation, standards, and best practices.
- Engage with internal engineering teams to gather requirements, influence roadmap prioritization, and deliver platform improvements.
- Provide technical leadership through mentorship, design reviews, architectural guidance, and cross‑team collaboration with principal engineers and engineering leadership.
Required Qualifications
- 7+ years of experience in Software Engineering, Platform Engineering, or SRE.
- 5+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
- 5+ years building production‑grade backend services in Go and/or Java.
- 5+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
- 5+ years with cloud‑native and containerized platforms (Docker, Kubernetes, Argo CD).
- 5+ years working with public cloud platforms (AWS, GCP, or Azure).
- 3+ years designing and scaling distributed, high‑volume data pipelines.
- 3+ years working with the Grafana OSS stack (e.g., Grafana, Loki, Tempo, Mimir) or comparable observability backends.
- 3+ years with relational databases (PostgreSQL, MySQL).
Preferred Qualifications
- Experience with service meshes and networking technologies such as Envoy and Istio
- Experience integrating or operating commercial observability platforms (Datadog, New Relic, AppDynamics, etc.)
- Experience with streaming and data…