Site Reliability Engineer (SRE) / Platform Engineer

Perfict

Reston · On-site Contract 1w ago


About the role

Job Title: Site Reliability Engineer (SRE) / Platform Engineer

Location: Reston, VA (Hybrid — 2 days onsite / 3 days remote)

Type: 3-6 month contract-to-hire

Top Skills: OpenShift/Kubernetes

Roles & Responsibilities
• Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
• Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks.
• Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
• Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
• Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
• Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
• Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
• Advance the hybrid service model: migrations, integrations, reliability/latency tuning, cost and performance optimization.

Day-to-Day Responsibilities
• Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
• Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
• Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
• Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
• Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
• Provide platform tooling and enablement for application developers, data engineers, and operations teams.
• Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
• Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
• Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.

Tech You’ll Work With
• Kubernetes / OpenShift
• Azure (compute, networking, storage, and data services)
• Automation & IaC: Terraform, Ansible, GitOps
• Observability: Datadog, Prometheus, Grafana
• Networking & Ingress: Nginx, service meshes, container networking
• Messaging: Kafka, AMQ
• Secrets & Access: HashiCorp Vault
• CI/CD: ArgoCD, Jenkins, GitHub Actions
• Scripting/Coding: Bash, Python, Go

Must-Have Qualifications
• 5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters.
• Strong experience with Microsoft Azure (compute, networking, storage, and data services).
• Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
• Proficiency with observability tooling (Datadog, Prometheus, Grafana).
• Scripting/coding ability in Bash, Python, or Go.

Preferred / Stand-Out Skills
• Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
• Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
• Background leading incident response and postmortems with strong RCA and continuous improvement practices.
