Senior Site Reliability Engineer

Future Fit

Remote · South Africa · Full-time · Senior · 3mo ago

About the role

Role Summary

The Senior Site Reliability Engineer (RHEL Specialist) is a critical technical leadership role responsible for ensuring that our production environments are resilient, performant, and highly automated. Unlike traditional systems administration, this role treats infrastructure as a software problem. You will be the primary custodian of our Red Hat Enterprise Linux (RHEL) ecosystem, applying advanced engineering practices to manage thousands of nodes across on‑premise virtualization and public cloud platforms.

Your mission is to bridge the gap between software development and systems operations by designing self‑healing systems and robust Ansible‑based automation frameworks. You will be expected to proactively identify system inefficiencies, optimize kernel performance, and architect CI/CD pipelines that empower development teams while maintaining strict production stability.

Core Mission Statement: To engineer a world‑class RHEL environment where manual intervention is the exception, not the rule. Through advanced automation and deep observability, you will ensure our services achieve 99.99% availability while enabling rapid, low‑risk software delivery.

Ideal Candidate Profile

The ideal candidate is a proactive problem‑solver with a "software‑first" approach to infrastructure. We are looking for an individual who:

Possesses deep expertise in the RHEL kernel, system internals, and performance tuning.
Views Ansible and Python as their primary tools for managing complexity at scale.
Demonstrates a proven track record of managing Docker and Kubernetes workloads in high‑traffic production settings.
Is naturally curious and proactive, often identifying and resolving system bottlenecks before they trigger an alert.
Thrives in a collaborative DevOps culture and is comfortable navigating the complexities of hybrid‑cloud environments (AWS, Azure, or GCP).

Key Responsibilities

The Senior Site Reliability Engineer (RHEL Specialist) is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of our enterprise Linux services. This role demands a unique blend of systems engineering expertise and software development skills to build and run large‑scale, distributed, fault‑tolerant systems.

Automation & Infrastructure Orchestration

Ansible Framework Design: Architect, implement, and maintain enterprise‑grade automation solutions using Ansible for our Red Hat Enterprise Linux (RHEL) fleet. This includes developing custom Ansible roles, modules, and playbooks to automate system provisioning, configuration management, and patching.
Standard Operating Environment (SOE): Maintain and evolve the RHEL SOE across hybrid‑cloud environments, ensuring consistency between on‑premise virtualization (VMware/KVM) and public cloud instances.
Infrastructure as Code (IaC): Transform manual infrastructure workflows into automated code‑based processes, ensuring that every component of the RHEL environment is version‑controlled and reproducible.

Development & Toil Reduction

Scripting & Tooling: Develop advanced scripts in Python and Bash to automate repetitive operational tasks (toil). You will be expected to build internal tools that enhance the productivity of the entire engineering organization.
System Integration: Write code to integrate infrastructure components with internal APIs, monitoring tools, and service management platforms to create seamless, end‑to‑end automated workflows.
Kernel & OS Optimization: Leverage deep Linux knowledge to tune system parameters and develop automated checks for system health and performance bottlenecks.

CI/CD & Release Engineering

Pipeline Construction: Build and optimize robust CI/CD pipelines using Jenkins or

Skills

Ansible, AWS, Azure, Bash, CI/CD, Docker, GCP, Jenkins, Kubernetes, Linux, Python, RHEL
