Site Reliability Engineer
Jefferson Lab
About the role
About Jefferson Lab
Join a community with a common purpose of solving the most challenging scientific and engineering problems of our time. The Jefferson Lab campus is located in southeastern Virginia amidst a vibrant and growing technology community.
A career at Jefferson Lab is more than a job. You will be part of “big science” and work alongside top scientists and engineers from around the world unlocking the secrets of our visible universe. Managed by Jefferson Science Associates, LLC, Thomas Jefferson National Accelerator Facility is entering an exciting period of mission growth and is seeking new team members ready to apply their skills and passion to have an impact. You could call it work, or you could call it a mission. We call it a challenge. We do things that will change the world.
What your job will be like:
You embed within the HPDF architecture team to make reliability, resilience, and observability first-class features of the facility's scientific data lifecycle systems — not afterthoughts. You define the initial Service Level Objectives (SLOs) and Service Level Indicators (SLIs), establish monitoring and alerting foundations, influence technology selections across compute, storage, and networking, and build the automation tooling that eliminates manual operations risk. When the facility transitions to operations, you lead the HPDF SRE team, owning availability metrics, incident response, and the continuous improvement processes that keep the facility performing to its design parameters.
In this job you will:
- Work closely with the rest of the architecture team to review and influence technology choices and to establish reliability and resilience parameters (e.g., meeting expected availability, failure domain isolation, disaster recovery).
- Ensure the selected software and hardware systems meet those parameters, while also meeting performance expectations and security requirements.
- Evaluate vendor and open-source solutions against established reliability and resilience parameters, develop comparative assessments, and provide technically grounded recommendations to inform architecture decisions and support acquisitions.
- Metrics & Observability: Establish the foundation for system observability by defining initial SLOs/SLIs, then architecting, prototyping, and implementing comprehensive monitoring, logging, and alerting solutions.
- Lead the design, prototyping and implementation of these solutions including custom automation to eliminate manual operations and further improve facility resilience.
- Performance Engineering: Participate in testing and performance analysis to validate reliability and resilience design decisions and to identify bottlenecks and alternative approaches.
- SRE Team Framework: Define the operational framework, on-call structures, incident response, other operational processes, and staffing plans for the future SRE team, bridging the design-to-operations transition.
Experience
- Required: 10 or more years in SRE (Site Reliability Engineering), DevOps, or Systems Engineering roles
Education
- Required: Bachelor's Degree in Computer Science or a related field
- Preferred: Master's Degree in Computer Science or a related field
Education above the minimum may be substituted for experience.
Knowledge, Skills, and Abilities
- High: Deep experience and understanding of distributed systems principles, failure modes, consensus protocols and self-healing architectures.
- High: Expertise in defining and implementing SLOs, SLIs, and comprehensive monitoring stacks, plus experience architecting observability frameworks in greenfield environments (e.g., Prometheus, ELK, OpenTelemetry).
- High: Strong scripting and automation skills (Go, Python, Shell).
- Medium: Deep experience with public cloud environments (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Medium: Experience with configuration management and IaC tools (e.g., Terraform, Puppet, Ansible).
- Medium: Experience with IPv4 and IPv6 networking, high-speed interconnects, and data transfer protocols; familiarity with network reliability patterns and software-defined networking (pref)
- Low: Experience with HPC infrastructure and environments (pref)
- Low: Experience leading or mentoring small teams (pref)
Total Rewards at Jefferson Lab
At Jefferson Lab, we believe that a comprehensive employee benefits program is an important and meaningful part of the compensation employees receive. Our benefits program includes, but is not limited to:
- Medical, Dental, and Vision Care Plans
- Flexible Spending Accounts
- Paid Time-off and Leave Programs (Paid Parental, vacation, holidays, and sick leave)
- 401(k) Plan – 9% Lab Contribution; 100% vested
- Flexible Work Arrangements (Remote & Alternate Work Schedules available)
- Tuition Assistance, Training and Professional Development Programs
- Live near the waterways of the Chesapeake Bay region with access to nearby beaches, mountains, and all major metropolitan centers on the East Coast