Remote Site Reliability Engineer

WhatJobs Direct

Stutterheim · On-site Full-time 4mo ago

About the role

Are you a highly motivated and experienced Site Reliability Engineer looking for a fully remote opportunity? Our client is seeking a dedicated professional to join their distributed engineering team. In this critical role, you will be responsible for ensuring the availability, performance, scalability, and reliability of their mission-critical systems and infrastructure. You will proactively identify and address potential issues, automate operational tasks, and collaborate closely with development teams to build robust and resilient services. This position requires a deep understanding of cloud infrastructure, containerization, CI/CD pipelines, and monitoring tools, all while working from home.

Responsibilities

Design, build, and maintain scalable and reliable infrastructure on cloud platforms (e.g., AWS, Azure, GCP).
Develop and implement automation for deployment, configuration management, and operational tasks using tools like Terraform, Ansible, or Kubernetes.
Monitor system performance, identify bottlenecks, and implement solutions to optimize efficiency and user experience.
Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services.
Develop and execute disaster recovery and business continuity plans.
Respond to and resolve incidents, performing root cause analysis and implementing preventative measures.
Collaborate with software engineers to improve the reliability and operability of new and existing services.
Implement robust monitoring, logging, and alerting solutions using tools such as Prometheus, Grafana, ELK stack, or Datadog.
Participate in on-call rotations to provide 24/7 support for critical systems.
Contribute to the development and maintenance of CI/CD pipelines to ensure efficient and reliable software delivery.
Document systems, processes, and incident response procedures.
Continuously evaluate and recommend new technologies and practices to enhance reliability and efficiency.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Proven experience (5+ years) in Site Reliability Engineering, DevOps, or Systems Administration with a focus on automation and reliability.
Strong proficiency in at least one cloud platform (AWS, Azure, or GCP).
Expertise in containerization technologies like Docker and orchestration platforms like Kubernetes.
Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation.
Proficiency in scripting languages (e.g., Python, Bash, Go).
Solid understanding of networking concepts (TCP/IP, DNS, HTTP/S, load balancing).
Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI).
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK, Splunk).
Excellent troubleshooting and debugging skills.
Strong communication and collaboration skills, essential for a remote team environment.
Self-motivated with the ability to manage time effectively and work autonomously.

This fully remote position offers an exceptional chance to contribute to a leading technology company from anywhere, focusing on building and maintaining highly reliable systems.
