Remote Site Reliability Engineer (SRE)
WhatJobs Direct
About the role
Our client, a leading SaaS provider, is seeking a highly skilled and experienced Remote Site Reliability Engineer (SRE) to join their growing infrastructure team. This is a fully remote position, offering the opportunity to work from anywhere while ensuring the availability, performance, and scalability of their critical production systems. You will be responsible for designing, implementing, and maintaining robust infrastructure, automating operational tasks, and responding to incidents to minimize downtime.
The ideal candidate will have a strong background in system administration, cloud computing (AWS, Azure, or GCP), and containerization technologies (Docker, Kubernetes). Proficiency in scripting languages (Python, Bash) and experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation are essential. You will work closely with development teams to build reliable and resilient systems, implement effective monitoring and alerting solutions, and develop disaster recovery strategies. Responsibilities include troubleshooting complex production issues, performing root cause analysis, implementing preventive measures, planning capacity, and ensuring that security best practices are followed. A proactive approach to identifying and addressing potential system weaknesses is crucial.
The successful candidate will possess excellent problem-solving skills, strong communication abilities for effective remote collaboration, and a deep commitment to operational excellence. A Bachelor's degree in Computer Science, Engineering, or a related field is preferred, along with a minimum of 5 years of experience in SRE, DevOps, or a similar role. Experience with CI/CD pipelines and a solid understanding of networking principles are highly valued.
Responsibilities:
- Design, build, and maintain scalable and reliable cloud infrastructure.
- Automate operational tasks using scripting and IaC tools.
- Monitor system performance and implement alerting solutions.
- Respond to production incidents and perform root cause analysis.
- Ensure high availability, performance, and scalability of services.
- Collaborate with development teams on system design and deployment.
- Implement and manage CI/CD pipelines.
- Develop and maintain disaster recovery and business continuity plans.
- Conduct capacity planning and performance tuning.
- Enforce security best practices across the infrastructure.
Qualifications:
- Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or system administration.
- Strong expertise in cloud platforms (AWS, Azure, GCP).
- Proficiency with containerization technologies (Docker, Kubernetes).
- Experience with scripting languages (Python, Bash) and IaC tools (Terraform, CloudFormation).
- Solid understanding of networking protocols and concepts.
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Proven ability to troubleshoot complex production issues.
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration skills for remote work.
- Familiarity with CI/CD concepts and tools.
This role is fundamental to ensuring the stability and success of our client's platform. The **job location** is **Kano, Kano, NG**, but the role is fully remote.