Senior Site Reliability Engineer
Future Fit
About the role
Role Summary
The Senior Site Reliability Engineer (RHEL Specialist) is a critical technical leadership role responsible for ensuring that our production environments are resilient, performant, and highly automated. Unlike traditional systems administration, this role treats infrastructure as a software problem. You will be the primary custodian of our Red Hat Enterprise Linux (RHEL) ecosystem, applying advanced engineering practices to manage thousands of nodes across on‑premises virtualization and public cloud platforms.
Your mission is to bridge the gap between software development and systems operations by designing self‑healing systems and robust Ansible‑based automation frameworks. You will be expected to proactively identify system inefficiencies, optimize kernel performance, and architect CI/CD pipelines that empower development teams while maintaining strict production stability.
Core Mission Statement: To engineer a world‑class RHEL environment where manual intervention is the exception, not the rule. Through advanced automation and deep observability, you will ensure our services achieve 99.99% availability while enabling rapid, low‑risk software delivery.
Ideal Candidate Profile
The ideal candidate is a proactive problem‑solver with a "software‑first" approach to infrastructure. We are looking for an individual who:
- Possesses deep expertise in the RHEL kernel, system internals, and performance tuning.
- Views Ansible and Python as their primary tools for managing complexity at scale.
- Demonstrates a proven track record of managing Docker and Kubernetes workloads in high‑traffic production settings.
- Is naturally curious and proactive, often identifying and resolving system bottlenecks before they trigger an alert.
- Thrives in a collaborative DevOps culture and is comfortable navigating the complexities of hybrid‑cloud environments (AWS, Azure, or GCP).
Key Responsibilities
The Senior Site Reliability Engineer (RHEL Specialist) is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of our enterprise Linux services. This role demands a unique blend of systems engineering expertise and software development skills to build and run large‑scale, distributed, fault‑tolerant systems.
Automation & Infrastructure Orchestration
- Ansible Framework Design: Architect, implement, and maintain enterprise‑grade automation solutions using Ansible for our Red Hat Enterprise Linux (RHEL) fleet. This includes developing custom Ansible roles, modules, and playbooks to automate system provisioning, configuration management, and patching.
- Standard Operating Environment (SOE): Maintain and evolve the RHEL SOE across hybrid‑cloud environments, ensuring consistency between on‑premises virtualization (VMware/KVM) and public cloud instances.
- Infrastructure as Code (IaC): Transform manual infrastructure workflows into automated code‑based processes, ensuring that every component of the RHEL environment is version‑controlled and reproducible.
Development & Toil Reduction
- Scripting & Tooling: Develop advanced scripts in Python and Bash to automate repetitive operational tasks (toil). You will be expected to build internal tools that enhance the productivity of the entire engineering organization.
- System Integration: Write code to integrate infrastructure components with internal APIs, monitoring tools, and service management platforms to create seamless, end‑to‑end automated workflows.
- Kernel & OS Optimization: Leverage deep Linux knowledge to tune system parameters and develop automated checks for system health and performance bottlenecks.
CI/CD & Release Engineering
- Pipeline Construction: Build and optimize robust CI/CD pipelines using Jenkins or
Requirements
- Deep expertise in the RHEL kernel, system internals, and performance tuning.
- Proven track record of managing Docker and Kubernetes workloads in high‑traffic production settings.
- Comfortable navigating the complexities of hybrid‑cloud environments (AWS, Azure, or GCP).
Responsibilities
- Ensure production environments are resilient, performant, and highly automated.
- Treat infrastructure as a software problem.
- Be the primary custodian of our Red Hat Enterprise Linux (RHEL) ecosystem.
- Apply advanced engineering practices to manage thousands of nodes across on‑premises virtualization and public cloud platforms.
- Bridge the gap between software development and systems operations by designing self‑healing systems and robust Ansible‑based automation frameworks.
- Proactively identify system inefficiencies, optimize kernel performance, and architect CI/CD pipelines.
- Engineer a world‑class RHEL environment where manual intervention is the exception, not the rule.
- Ensure services achieve 99.99% availability while enabling rapid, low‑risk software delivery.
- Architect, implement, and maintain enterprise‑grade automation solutions using Ansible for our Red Hat Enterprise Linux (RHEL) fleet.
- Develop custom Ansible roles, modules, and playbooks to automate system provisioning, configuration management, and patching.
- Maintain and evolve the RHEL SOE across hybrid‑cloud environments.
- Transform manual infrastructure workflows into automated code‑based processes.
- Develop advanced scripts in Python and Bash to automate repetitive operational tasks.
- Build internal tools that enhance the productivity of the entire engineering organization.
- Write code to integrate infrastructure components with internal APIs, monitoring tools, and service management platforms.
- Tune system parameters and develop automated checks for system health and performance bottlenecks.
- Build and optimize robust CI/CD pipelines using Jenkins.