Staff Site Reliability Engineer
Publicis Groupe Holdings B.V.
About the role
Overview
About Business Unit:
At the core of all that Epsilon does is a team that sets the foundation of our IT infrastructure. The team drives innovation and efficiency through pioneering technology across Epsilon's platforms and business verticals. From being the first point of contact for infrastructure needs to final deployment, the team provides end-to-end solutions for our client-facing platforms. ETS supports all aspects of revenue-generating platforms for Epsilon and sets the architectural direction for our enterprise deployments. By adopting the newest technologies, such as Cloud, Automation, and Artificial Intelligence, the team is at the forefront of redefining our digital business and capturing new opportunities.
Role Overview:
We are looking for a highly experienced and forward-thinking Staff Site Reliability Engineer (SRE) to lead and evolve our infrastructure platforms, which span more than 15,000 on-premises servers and a growing multi-cloud environment.
The ideal candidate brings an automation-first mindset, a strong grasp of cloud engineering, and a passion for building AI-driven, agentic systems that can self-heal, self-optimize, and provide deep observability.
Responsibilities
- Lead SRE initiatives across a hybrid infrastructure (on-prem + AWS, Azure, GCP)
- Manage and optimize 15,000+ servers across Linux and Windows platforms
- Work on automation by creating n8n workflows and create integrations across our tech stack
- Build self-service platform using Backstage and write integrations across different products
- Architect and support scalable, resilient AWS infrastructure (EKS, EC2, S3, RDS, Lambda, etc.)
- Administer Kubernetes clusters at scale; ensure health, upgrades, and secure deployments
- Drive infrastructure automation using Python, Shell, and Infrastructure as Code (Terraform, Ansible)
- Design and implement AI agents for observability, RCA, and incident triage using modern MLOps/DevOps paradigms
- Collaborate with development, IT Ops, Command Center, cloud, and platform teams to strengthen CI/CD, security posture, and SLA adherence
- Build robust monitoring/alerting pipelines using Grafana, Prometheus, ELK, PagerDuty, or similar tools
- Participate in and improve on-call rotations, while building out self-healing systems
- Lead root cause analysis (RCA) exercises and post-incident reviews
- Evangelize best practices in reliability, scalability, and cost optimization
Qualifications
- 12+ years of experience in Platform/Cloud Engineering, SRE, DevOps
- Strong hands-on coding experience in Python, Shell
- Strong expertise in Cloud, Kubernetes, Linux Administration
- Hands-on experience with AWS services and Kubernetes
- Proficiency in IaC tools like Terraform, Ansible
- Extensive experience in delivering efficient developer experience
- Extensive knowledge in building CI/CD pipelines
- Familiarity with monitoring tools (Zabbix, PagerDuty, Grafana, ELK)
Additional Information
Epsilon is a global data, technology and services company that powers the marketing and advertising ecosystem. For decades, we’ve provided marketers from the world’s leading brands the data, technology and services they need to engage consumers with 1 View, 1 Vision and 1 Voice. 1 View of their universe of potential buyers. 1 Vision for engaging each individual. And 1 Voice to harmonize engagement across paid, owned and earned channels.
Epsilon’s comprehensive portfolio of capabilities across our suite of digital media, messaging and loyalty solutions bridges the divide between marketing and advertising technology. We process 400+ billion consumer actions each day using advanced AI and hold many patents for proprietary technology, including real-time modeling languages and consumer privacy advancements. Thanks to the work of every employee, Epsilon has been consistently recognized as industry-leading by Forrester, Adweek and the MRC. Epsilon is a global company with more than 9,000 employees around the world.
Our pillars aren't just words. They're how we show up every day.
- People centricity: We focus on employee well-being in an environment where colleagues truly care about each other.
- Collaboration: We work together, support one another, and collectively achieve goals.
- Growth: There are endless opportunities for growth through learning, development and career advancement.
- Innovation: We drive progress through cutting-edge solutions and forward-thinking approaches.
- Flexibility: We’ve created a balance between work and personal life, and we encourage adaptability to solve problems creatively.
Our values guide us to create value for our clients, our people and consumers.
- Act with integrity
- Work together to win together
- Innovate with purpose
- Respect all voices
- Empower with accountability
These pillars and values are our foundation—shaping our culture, guiding our decisions, and uniting us in common purpose.
Epsilon is an Equal Opportunity Employer. Epsilon is committed to promoting diversity, inclusion, and equal employment opportunities by using reasonable efforts to attract, recruit, engage and retain qualified individuals of all ethnicities and backgrounds, including, but not limited to, women, people of color, LGBTQ individuals, people with disabilities and any other underrepresented groups, traits or characteristics.