Senior Site Reliability Engineer (SRE) - Infrastructure

WhatJobs Direct

Remote · South Africa · Full-time · Senior · 3mo ago

About the role

About

Our client is seeking a highly skilled Senior Site Reliability Engineer (SRE) to join their engineering team and enhance the reliability, scalability, and performance of their critical infrastructure. This is a fully remote position, offering a challenging and rewarding environment in which an experienced SRE can make a significant impact on the availability and efficiency of the client's systems.

Responsibilities

Design, build, and maintain robust and scalable infrastructure solutions, focusing on automation and reliability.
Implement and manage monitoring, alerting, and logging systems to ensure proactive identification and resolution of issues.
Develop automation tools and scripts to streamline deployment, configuration management, and operational tasks.
Collaborate with development teams to integrate reliability best practices into the software development lifecycle (SDLC).
Participate in on-call rotations to respond to system incidents and perform root cause analysis (RCA).
Define and track key performance indicators (KPIs) related to system availability, latency, and performance.
Conduct capacity planning and performance tuning to ensure systems can handle anticipated growth.
Implement and manage CI/CD pipelines to facilitate rapid and reliable software releases.
Troubleshoot complex production issues across distributed systems.
Contribute to security best practices and ensure infrastructure compliance.
Document infrastructure architecture, processes, and runbooks.
Mentor junior engineers and share knowledge on SRE principles and practices.
Evaluate and adopt new technologies and tools to improve infrastructure reliability and efficiency.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
Proven experience with cloud platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
Strong expertise in scripting languages such as Python, Bash, or Go.
Experience with infrastructure as code (IaC) tools like Terraform or Ansible.
Proficiency in monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
Deep understanding of operating systems (Linux/Unix), networking protocols, and distributed systems.
Experience with CI/CD tools and methodologies.
Excellent problem-solving and debugging skills.
Strong communication and collaboration skills, essential for a remote team.
Ability to work independently and manage complex projects.
Experience in performance tuning and capacity planning.
Relevant certifications (e.g., AWS Certified SysOps Administrator, CKA) are a plus.

Location

This is a fully remote role based out of our East London, Eastern Cape, ZA office, with complete flexibility for the employee.

Join our dedicated engineering team and help build the future of reliable systems.

Skills

Ansible, AWS, Azure, Bash, CI/CD, Docker, ELK Stack, GCP, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform, Unix
