Site Reliability Engineer (SRE) / Platform Engineer

Tier4 Group

Reston, VA · Hybrid · Full-time · Mid Level · $170k – $200k/yr

About the Organization

Join a mission-driven, national financial services organization at the heart of the U.S. housing finance ecosystem. This is a mid-sized, highly regulated enterprise operating at market scale—supporting platforms and analytics that enable trillions of dollars in annual economic activity. You’ll work in a modern tech environment with strong engineering partners, clear business impact, and a mandate for reliability, security, and continuous improvement.

The Role

Our client is hiring a hands-on SRE / Platform Engineer to operate, tune, and scale their OpenShift/Kubernetes platforms while bridging on-prem infrastructure to Azure to power their analytics ecosystem. You’ll own reliability, automation, and observability across a hybrid estate, partnering closely with developers, data engineers, infrastructure operations, and security to deliver secure, performant platform services using modern DevSecOps practices.

Why This Role Stands Out

  • Hybrid impact: Operate critical OpenShift clusters and manage Azure services used by data and analytics teams.
  • Hybrid architecture: Help design and support the bridge from on-prem to cloud—migration, integration, and steady-state operations.
  • Real-world scale: Reliability work that directly supports high-volume financial market operations and enterprise analytics.
  • Automation-first: Lean into Terraform, Ansible, and GitOps to make reliability repeatable.

What You’ll Do in the First 180 Days

  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks.
  • Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
  • Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
  • Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
  • Advance the hybrid service model—migrations, integrations, reliability/latency tuning, cost and performance optimization.

Day-to-Day Responsibilities

  • Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
  • Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
  • Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
  • Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
  • Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
  • Provide platform tooling and enablement for application developers, data engineers, and operations teams.
  • Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
  • Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
  • Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.

Tech You’ll Work With

  • Kubernetes / OpenShift
  • Azure (compute, networking, storage, and data services)
  • Automation & IaC: Terraform, Ansible, GitOps
  • Observability: Datadog, Prometheus, Grafana
  • Networking & Ingress: Nginx, service meshes, container networking
  • Messaging: Kafka, AMQ
  • Secrets & Access: HashiCorp Vault
  • CI/CD: ArgoCD, Jenkins, GitHub Actions
  • Scripting/Coding: Bash, Python, Go

Must-Have Qualifications

  • 5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters.
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services).
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana).
  • Scripting/coding ability in Bash, Python, or Go.

Preferred / Stand-Out Skills

  • Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
  • Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
  • Background leading incident response and postmortems with strong RCA and continuous improvement practices.

Work Model & Team

  • Hybrid: 2 days onsite in Reston, VA to start, moving to 3 days onsite later this year.
  • You’ll be part of the IT organization, collaborating daily with developers, data engineers, infrastructure operations, and security.

How to Succeed in This Role

  • You’re a hands-on engineer who thrives in regulated, high-impact environments.
  • You favor automation over repetition, and observability over guesswork.
  • You collaborate openly, communicate clearly, and leave systems better than you found them.

Pay

$170,000 – $200,000 per year

Benefits

  • 401(k)
  • Dental insurance
  • Health insurance
  • Paid time off
  • Vision insurance

Work Location

Hybrid remote in Reston, VA 20191

Skills

AMQ, Ansible, ArgoCD, Azure, Bash, Datadog, GitOps, GitHub Actions, Go, Grafana, HashiCorp Vault, Jenkins, Kafka, Kubernetes, Nginx, OpenShift, Prometheus, Python, Terraform
