Senior Site Reliability Engineer (SRE) - Infrastructure
WhatJobs Direct
Remote · South Africa · Full-time · Senior · 3w ago
About the role
Our client is seeking a highly skilled Senior Site Reliability Engineer (SRE) to join their engineering team and enhance the reliability, scalability, and performance of their critical infrastructure. This is a fully remote position, offering a challenging and rewarding environment in which an experienced SRE can make a significant impact on the availability and efficiency of the client's systems.
Responsibilities
- Design, build, and maintain robust and scalable infrastructure solutions, focusing on automation and reliability.
- Implement and manage monitoring, alerting, and logging systems to ensure proactive identification and resolution of issues.
- Develop automation tools and scripts to streamline deployment, configuration management, and operational tasks.
- Collaborate with development teams to integrate reliability best practices into the software development lifecycle (SDLC).
- Participate in on-call rotations to respond to system incidents and perform root cause analysis (RCA).
- Define and track key performance indicators (KPIs) related to system availability, latency, and performance.
- Conduct capacity planning and performance tuning to ensure systems can handle anticipated growth.
- Implement and manage CI/CD pipelines to facilitate rapid and reliable software releases.
- Troubleshoot complex production issues across distributed systems.
- Contribute to security best practices and ensure infrastructure compliance.
- Document infrastructure architecture, processes, and runbooks.
- Mentor junior engineers and share knowledge on SRE principles and practices.
- Evaluate and adopt new technologies and tools to improve infrastructure reliability and efficiency.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering.
- Proven experience with cloud platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
- Strong expertise in scripting languages such as Python, Bash, or Go.
- Experience with infrastructure as code (IaC) tools like Terraform or Ansible.
- Proficiency in monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
- Deep understanding of operating systems (Linux/Unix), networking protocols, and distributed systems.
- Experience with CI/CD tools and methodologies.
- Excellent problem-solving and debugging skills.
- Strong communication and collaboration skills, essential for a remote team.
- Ability to work independently and manage complex projects.
- Experience in performance tuning and capacity planning.
- Relevant certifications (e.g., AWS Certified SysOps Administrator, CKA) are a plus.
Location
This is a fully remote role based out of our East London, Eastern Cape, ZA office, with complete flexibility for the employee.
Join our dedicated engineering team and help build the future of reliable systems.
Skills
Ansible, AWS, Azure, Bash, CI/CD, Docker, ELK Stack, GCP, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform, Unix