Senior Site Reliability Engineer
WhatJobs Direct
About the role
Our client, a leading innovator in cloud infrastructure solutions, is seeking a highly skilled Senior Site Reliability Engineer (SRE) to join their dedicated remote engineering team. This position is crucial for ensuring the scalability, availability, and performance of the client's critical services.
Responsibilities:
Design, build, and maintain reliable and scalable infrastructure, automating processes wherever possible.
Develop and implement monitoring, alerting, and logging systems to ensure proactive identification and resolution of issues.
Participate in on-call rotations to respond to and resolve production incidents, performing root cause analysis and implementing preventative measures.
Collaborate with development teams to integrate reliability best practices into the software development lifecycle (SDLC).
Manage cloud infrastructure (e.g., AWS, Azure, GCP), focusing on performance optimization, cost efficiency, and security.
Develop and maintain infrastructure-as-code (IaC) solutions using tools like Terraform or Ansible.
Contribute to disaster recovery and business continuity planning and execution.
Troubleshoot complex system issues across distributed environments.
Mentor junior SREs and contribute to the team's knowledge sharing and skill development.
Drive continuous improvement in system stability, performance, and operational efficiency.

Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
A minimum of 7 years of experience in Site Reliability Engineering, DevOps, or Systems Administration roles, with a strong focus on large-scale distributed systems.
Proficiency in at least one scripting language (e.g., Python, Go, Bash) and experience with configuration management tools (e.g., Ansible, Chef, Puppet).
Extensive experience with cloud platforms (AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
Deep understanding of networking concepts, operating systems (Linux), and database technologies.
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Proven ability to troubleshoot and resolve complex production issues under pressure.
Excellent problem-solving, analytical, and critical thinking skills.
Strong communication and collaboration skills, with the ability to work effectively in a remote team environment.
Commitment to automation, continuous improvement, and best practices in site reliability.

This is an exceptional opportunity for a talented SRE to play a pivotal role in maintaining and enhancing the reliability of cutting-edge cloud services. Join our client and contribute to mission-critical systems from your remote workspace. This role is based in Jos, Plateau, NG, operating fully remotely.