Site Reliability Engineer (SRE)
Technology Ventures
About the role
We are seeking a highly skilled and hands-on Site Reliability Engineer (SRE) with deep Kubernetes expertise to support and enhance our enterprise platform engineering environment. This role is ideal for a self-starter who enjoys learning, solving complex infrastructure challenges, improving observability, and partnering closely with engineering teams to streamline CI/CD and platform operations.
The ideal candidate will have strong experience managing Kubernetes environments, preferably Red Hat OpenShift in an on-premises enterprise setting, along with a passion for automation, reliability engineering, and operational excellence.
Key Responsibilities
- Manage, maintain, and optimize Kubernetes/OpenShift platform environments to ensure high availability, scalability, and operational reliability.
- Provide ongoing “care and feeding” of Kubernetes clusters, including cluster administration, upgrades, troubleshooting, and performance tuning.
- Improve end-to-end observability across the platform using tools such as Grafana, Prometheus, and Datadog.
- Lead incident response efforts, root cause analysis, and postmortems to continuously improve platform reliability and resiliency.
- Partner closely with Scrum and development teams to support CI/CD pipelines, deployments, routing, configuration management, and troubleshooting.
- Build and maintain automation and deployment pipelines that support engineering and development teams.
- Develop scripts and automation solutions using Bash, Python, Go, or PowerShell to reduce manual intervention and improve operational efficiency.
- Support and maintain platform services such as HashiCorp Vault, AMQ/Kafka, Keycloak, and related infrastructure components.
- Create and maintain technical documentation, operational procedures, deployment guides, and incident response plans.
- Participate in an on-call rotation and support production environments as needed.
Required Qualifications
- 5–7+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related infrastructure engineering roles.
- Deep hands-on experience with Kubernetes administration and troubleshooting.
- Strong experience with Red Hat OpenShift, including operators, ingress/routing, and cluster management.
- Experience supporting enterprise infrastructure in on-premises environments.
- Strong scripting and automation skills using Bash and/or Python.
- Experience with observability and monitoring tools such as Grafana, Prometheus, and Datadog.
- Experience troubleshooting complex production issues using logs, metrics, traces, packet captures, and Kubernetes debugging tools.
- Experience working with CI/CD pipelines and collaborating directly with Agile/Scrum development teams.
- Familiarity with Azure cloud services and hybrid infrastructure environments.
- Experience with technologies such as HashiCorp Vault, Kafka/AMQ, Redis, and Keycloak is preferred.
- Strong communication skills and ability to work collaboratively across teams.
- Bachelor’s degree in computer science or related field, or equivalent practical experience.
Preferred Qualities
- Self-motivated engineer with a strong desire to learn and continuously improve.
- Ability to thrive in fast-paced, highly collaborative enterprise environments.
- Experience working in heavily audited or compliance-focused organizations is a plus.
Interview Process
- 3 rounds total:
  - Round 1: Virtual interview with Hiring Manager
  - Round 2: Virtual panel interview with team members
  - Round 3: Final onsite interview in Reston, VA
- The entire interview process is expected to be completed within 7–10 days.
Compensation
- Target base salary range: $175,000 – $185,000
- Performance-based bonus: 7.5% – 10%
- Exceptional candidates may be considered for higher compensation.