Senior Principal Reliability Engineer

The Hartford

Bismarck · Hybrid · Full-time · Senior · $153k – $229k/yr · 3mo ago

About the role

Join our innovative team as a Senior Principal Reliability Engineer responsible for shaping the reliability strategies across our Enterprise Data Services (EDS) organization. At our company, we are committed to making a significant impact beyond traditional insurance coverages and policies, and we want to empower you to achieve your professional aspirations while helping others along the way.

In this pivotal role, you will be the senior technical authority overseeing the reliability, resilience, availability, and performance of all our data platforms, cloud infrastructure, data products, and data pipelines. You'll establish and enhance Reliability Engineering (RE) practices, tools, automation, observability frameworks, and AI-driven operations, setting a strategic vision that influences organizational success.

This position offers a Hybrid or Remote work schedule. Candidates located near our office are expected to work in the office three days a week (Tuesday through Thursday). Those who are remote will have the flexibility to work from home, with occasional office visits as required by business needs. Must be eligible to work in the US without company sponsorship.

Key Responsibilities

Enterprise Reliability Strategy & Leadership

Collaborate with the AVP of RE & Production Support to define the reliability engineering strategy for data platforms and cloud environments.
Create long-term RE roadmaps and architectural patterns that facilitate organizational growth.
Act as the top technical escalation point for systemic reliability challenges, engaging with executive stakeholders and engineering leaders.

Platform & Cloud Reliability

Design and enhance reliable, efficient cloud-based platforms across AWS and GCP for EDS services.
Work with Platform Solution Architecture to enable new products through hyper-automation.
Manage reliability controls for critical data systems and oversee the establishment of SLO/SLI frameworks throughout the data lifecycle.

AI-Enabled Operations & Automation

Implement AI-driven anomaly detection, alert correlation, and predictive capacity management solutions.
Utilize LLMs and cloud-native AI tools to create intelligent operational resources.
Promote machine learning-based observability and reliability analytics throughout the organization.

End-to-End Observability & Operational Excellence

Establish a comprehensive data observability framework across all platforms.
Develop incident response protocols and continuous improvement processes to enhance operational excellence.
Focus on reducing repetitive tasks, fostering self-healing systems, and enabling proactive detection.

Data Pipeline & Product Reliability

Define best practices for modern data products and automated data pipeline reliability.
Ensure high-quality, timely data delivery through automated checks and alerts.
Work with Data Engineering to incorporate resilience patterns into pipeline designs.

Engineering Standards & Cross-Organizational Influence

Set standards for Infrastructure-as-Code, CI/CD, and operational readiness.
Mentor and lead teams in establishing a strong engineering culture across the organization.
Represent the RE function in architectural reviews and executive discussions.

Technical Experience

Over 10 years of experience in data, cloud, platform engineering, or large-scale distributed systems, preferably in leadership roles.
Proficiency in cloud platforms and resilient architectural patterns.
Experience with technologies such as Snowflake, EMR, Hadoop/Spark and cloud-native data ecosystems.
Strong programming skills, especially in Python, for automation and reliability frameworks.
Familiarity with Infrastructure-as-Code and enterprise CI/CD practices.

Preferred Qualifications

Experience in industries such as financial services, insurance, or healthcare.
Previous roles as a Senior Staff Engineer or Engineering/Architecture leader with hands-on experience.
Knowledge of data governance and data quality engineering.
Relevant certifications in AWS, GCP, Kubernetes, or SRE/DevOps frameworks.

AI & AIOps Expertise

Experience applying machine learning techniques for operational improvements.
Familiarity with AI-enabled tools for reliability enhancements.

Observability & Platform Operations

Expertise in enterprise observability tools such as Prometheus, Grafana, and Datadog.
Ability to design advanced SLI/SLO frameworks in complex data environments.

Leadership & Cross-Functional Influence

Proven ability to lead technical strategy, influence engineering leaders, and establish company-wide standards.
Strong mentoring skills and the ability to provide architectural guidance that enhances engineering excellence.
Outstanding communication skills for effective collaboration with executives and technical teams.

Compensation

The annualized base pay range for this role is $152,800 - $229,200. This range reflects external market analysis, and actual pay may vary based on performance and competencies. Base pay is just one component of our total compensation package, which includes bonuses, long-term incentives, and other rewards.

We are proud to be an Equal Opportunity Employer, welcoming individuals regardless of sex, race, color, veteran status, disability, sexual orientation, gender identity or expression, religion, or age.

Skills

AWS, CI/CD, Cloud Native, Datadog, EMR, GCP, Grafana, Hadoop, Infrastructure-as-Code, LLMs, Machine Learning, Python, Prometheus, Spark, Snowflake
