Site Reliability Engineer

Jefferson Lab

Newport News · Hybrid · Full-time · Lead · $116k – $205k/yr · 3mo ago

About the role

About Jefferson Lab

Join a community with a common purpose of solving the most challenging scientific and engineering problems of our time. The Jefferson Lab campus is located in southeastern Virginia amidst a vibrant and growing technology community.

A career at Jefferson Lab is more than a job. You will be part of “big science” and work alongside top scientists and engineers from around the world unlocking the secrets of our visible universe. Managed by Jefferson Science Associates, LLC, Thomas Jefferson National Accelerator Facility is entering an exciting period of mission growth and is seeking new team members ready to apply their skills and passion to have an impact. You could call it work, or you could call it a mission. We call it a challenge. We do things that will change the world.

What your job will be like:

You embed within the HPDF architecture team to make reliability, resilience, and observability first-class features of the facility's scientific data lifecycle systems — not afterthoughts. You define the initial Service Level Objectives (SLOs) and Service Level Indicators (SLIs), establish monitoring and alerting foundations, influence technology selections across compute, storage, and networking, and build the automation tooling that eliminates manual operations risk. When the facility transitions to operations, you lead the HPDF SRE team, owning availability metrics, incident response, and the continuous improvement processes that keep the facility performing to its design parameters.

In this job you will:

Work closely with the rest of the architecture team to review and influence technology choices and establish reliability and resilience parameters (e.g., expected availability, failure domain isolation, disaster recovery).
Ensure the selected software and hardware systems meet those parameters, while also meeting performance expectations and security requirements.
Evaluate vendor and open-source solutions against established reliability and resilience parameters, develop comparative assessments, and provide technically grounded recommendations to inform architecture decisions and support acquisitions.
Metrics & Observability: Establish the foundation for system observability by defining initial SLOs/SLIs, then architecting, prototyping, and implementing comprehensive monitoring, logging, and alerting solutions.
Lead the design, prototyping, and implementation of these solutions, including custom automation to eliminate manual operations and further improve facility resilience.
Performance Engineering: Participate in testing and performance analysis to validate reliability and resilience design decisions and to identify bottlenecks and alternative approaches.
SRE Team Framework: Define the operational framework, on-call structures, incident response, other operational processes, and staffing plans for the future SRE team, bridging the design-to-operations transition.

Experience

Required: 10 or more years in SRE (Site Reliability Engineering), DevOps, or Systems Engineering roles

Education

Required: Bachelor's Degree in Computer Science or a related field
Preferred: Master's Degree in Computer Science or a related field

Education above the minimum may be substituted for experience.

Knowledge, Skills, and Abilities

High: Deep experience with and understanding of distributed systems principles, failure modes, consensus protocols, and self-healing architectures.
High: Expertise in defining and implementing SLOs, SLIs, and comprehensive monitoring stacks, plus experience architecting observability frameworks in greenfield environments (e.g., Prometheus, ELK, OpenTelemetry).
High: Strong scripting and automation skills (Go, Python, Shell).
Medium: Deep experience with public cloud environments (AWS, Azure, GCP) and container orchestration (Kubernetes).
Medium: Experience with configuration management and IaC tools (e.g., Terraform, Puppet, Ansible).
Medium: Experience with IPv4 and IPv6 networking, high-speed interconnects, and data transfer protocols; familiarity with network reliability patterns and software-defined networking (preferred).
Low: Experience with HPC infrastructure and environments (preferred).
Low: Experience leading or mentoring small teams (preferred).

Total Rewards at Jefferson Lab

At Jefferson Lab, we believe that a comprehensive employee benefits program is an important and meaningful part of the compensation employees receive. Our benefits program includes, but is not limited to:

Medical, Dental, and Vision Care Plans
Flexible Spending Accounts
Paid Time-off and Leave Programs (paid parental, vacation, holiday, and sick leave)
401(k) Plan – 9% Lab Contribution; 100% vested
Flexible Work Arrangements (Remote & Alternate Work Schedules available)
Tuition Assistance, Training and Professional Development Programs
Live near the waterways of the Chesapeake Bay region with access to nearby beaches, mountains, and all major metropolitan centers on the East Coast

Skills

Ansible, AWS, Azure, ELK, GCP, Go, HPC, IaC, Kubernetes, OpenTelemetry, Puppet, Python, Shell, Terraform, Prometheus
