DevOps Engineer

Cadre5

Remote · US Full-time Mid Level 6d ago

About the role

Founded in 1999 in the beautiful Smoky Mountains of East Tennessee, Cadre5 provides innovative technical solutions to our customers locally and nationally. Our Cadre5 Lab Partners division has partnered with the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) to recruit a qualified DevOps Engineer for the American Science Cloud (AmSC) initiative.

AmSC is a first-of-its-kind, federally funded cloud infrastructure and API platform designed to accelerate AI model development, data sharing, and large-scale computational science across the U.S. Department of Energy (DOE). ORNL is a premier research institution delivering breakthroughs in energy, national security, and advanced computing.

ORNL delivers scientific discoveries and technical breakthroughs needed to realize solutions in energy and national security and provides economic benefit to the nation. This premier research institution located near Knoxville in Oak Ridge, TN, addresses national needs through impactful research and world-leading research centers.

This is a full-time, permanent position that allows telecommuting. Occasional travel to the Oak Ridge facility may be required.

Why Cadre5?

  • Working with highly talented team members
  • 3 weeks’ vacation
  • Excellent medical insurance, including employer-paid benefits

Project Overview (American Science Cloud: A Platform for Transformative Science):

AmSC is a secure, federated, and science-optimized cloud environment that integrates the DOE’s world-leading computing and experimental facilities, data resources, and high-performance networks.

The AmSC platform enables DOE scientists to create, access, and integrate world-class AI-ready datasets, run scalable model training on leadership-class systems, perform distributed simulations, control instruments, and move data efficiently across sites.

The project is a multi-Lab and Public-Private Partnership endeavor, working in tandem with the Models Consortium (ModCon), which will deploy transformative AI models and services to the platform. Key DOE capabilities, such as Frontier (ORNL), Aurora (ANL), Perlmutter (NERSC, at LBL), the Energy Sciences Network (ESnet, at LBL), and the High Performance Data Facility (HPDF, at JLab), will be directly integrated, allowing multi-site workflows.

The Team:

As a DevOps Engineer, you will work within the L2 Infrastructure Services group of AmSC to support all activities on our multi-cloud central hub infrastructure across the dev, staging, pre-production, and production environments. Other L2 science service teams deploy services on top of the infrastructure that our L2 manages – e.g. data catalogs and repositories, at-scale HPC compute services, user interface and API development, and intelligent operations (AI/MLOps). Most of the staff working on AmSC have only a part-time allocation to the project. You will be one of the first full-time hires dedicated exclusively to AmSC. Your primary job responsibilities will be to support the science teams by building foundational infrastructure and developing CI/CD pipelines to deploy services on that infrastructure.

Job Responsibilities:

  • The service stack is primarily Kubernetes-based. Perform cluster administration and assist users with application deployment.
  • Build and maintain pipelines for deploying cloud infrastructure and science services
  • Manage and use image registries such as Harbor
  • Write and update automation for resource provisioning and CI/CD pipelines – e.g. Terraform, GitOps, Python
  • Implement security controls as defined by Cybersecurity team (DevSecOps)
  • Configure basic instrumentation for infrastructure and core services, to feed into monitoring and alerting systems
  • Provide primary operational support and engineering for production applications.
  • Define and implement KPIs and processes, and drive continuous improvement.
  • Diagnose platform operational problems quickly and effectively.
  • Participate in on-call rotation providing 24-hour, 7-day support and off-hours maintenance windows.
  • Deploy, manage, and operate managed Kubernetes clusters (Amazon EKS, Azure AKS, Google GKE, or equivalent), including node group lifecycle management, cluster upgrades, cloud-native networking integrations (load balancer controllers, CNI plugins), and multi-environment promotion across dev, staging, and production.
  • Coordinate with vendors to resolve hardware and software problems.
  • Deliver AmSC’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote diversity, equity, inclusion, and accessibility by fostering a respectful workplace – in how we treat one another, work together, and measure success.

Basic Qualifications:

  • Bachelor’s Degree in computer science or closely related field and a minimum of 2 years of experience as a DevOps engineer and/or Cloud Engineer. An equivalent combination of education and experience may be considered.
  • The ability to obtain and maintain a Department of Energy "Q" clearance is required. This requires US Citizenship.

Preferred Qualifications:

  • Previous experience as a team lead – able to perform task management for DevOps or Cloud Engineering teams
  • Excellent interpersonal/communication skills, and the ability to work as part of a team.
  • Working knowledge of cloud application architecture patterns and a thorough grasp of common products and managed services for at least one Cloud Service Provider (e.g. AWS)
  • Working knowledge of Kubernetes cluster administration and concepts (CR/CRDs) and application deployment strategies (GitOps, Helm)
  • Working knowledge of Unix system fundamentals and common network protocols.
  • Solid understanding of cloud computing networking concepts.
  • Ability to proactively identify performance issues, problems, and areas for improvement.
  • Ability to identify requirements and to define, plan, and implement requisite solutions.
  • Ability to plan, organize, prioritize tasks, and complete assigned projects with minimal supervision.
  • Experience with continuous integration and continuous deployment software methodologies and strategies
  • An understanding of code review and familiarity with tools like GitHub and GitLab
  • Experience using tools such as Nagios, Grafana, and Prometheus to monitor systems and metrics and to create dashboards.
  • Experience with OpenTofu or Terraform in multi-account AWS environments, including AWS Organizations, SCPs, and IRSA (IAM Roles for Service Accounts)
  • Hands-on experience with ArgoCD including App of Apps patterns and ApplicationSets for multi-environment GitOps deployments
  • Familiarity with Tanka, Jsonnet, or equivalent configuration-as-code templating approaches beyond Helm
  • Experience with Kong Gateway or similar API gateway platforms in Kubernetes environments
  • Familiarity with secrets management patterns including AWS Secrets Manager, External Secrets Operator, or comparable Kubernetes-native solutions
  • Exposure to high-speed research networks such as ESnet or Internet2 is a plus

Benefits

Cadre5 offers excellent pay and benefits, to include full medical, dental, and vision coverage coupled with 401K match, 15 days PTO, and 10 holidays.

Cadre5 is an equal opportunity employer. All qualified applicants, including individuals with disabilities and protected veterans, are encouraged to apply. Cadre5 is an E-Verify Employer.

Skills

Amazon EKS, ArgoCD, AWS, AWS Organizations, AWS Secrets Manager, Azure AKS, CNI, DevSecOps, Docker, ESnet, External Secrets Operator, Git, GitHub, GitLab, GitOps, Google GKE, Grafana, Harbor, Helm, Internet2, Jsonnet, Kong Gateway, Kubernetes, Nagios, OpenTofu, Prometheus, Python, Terraform, Unix
