
Remote Senior Site Reliability Engineer (SRE)

WhatJobs Direct

Meyerton · On-site Full-time Senior 1w ago

About the role

Our client, a rapidly scaling SaaS company revolutionizing the project management software industry, is seeking an experienced Remote Senior Site Reliability Engineer (SRE) to ensure the highest levels of availability, performance, and scalability for their critical cloud infrastructure. This is a fully remote position, offering the flexibility to work from anywhere while contributing to a mission-critical aspect of the business. You will be instrumental in designing, building, and operating robust systems that support millions of users worldwide.

Key Responsibilities:
Design, build, and maintain scalable and reliable infrastructure on cloud platforms (e.g., AWS, Azure, GCP).
Develop and implement automation strategies for deployment, monitoring, and incident response.
Manage and optimize container orchestration platforms (e.g., Kubernetes, Docker Swarm).
Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services.
Implement robust monitoring, logging, and alerting solutions to ensure proactive issue detection and rapid resolution.
Participate in an on-call rotation, responding to and resolving production incidents with a focus on minimizing downtime and impact.
Conduct post-mortems for incidents, identifying root causes and implementing preventative measures.
Collaborate with development teams to integrate SRE best practices into the software development lifecycle.
Develop and maintain infrastructure as code (IaC) using tools like Terraform or CloudFormation.
Perform capacity planning and performance tuning to ensure system scalability and efficiency.
Contribute to the design and architecture of new services and features with a focus on reliability and operability.
Automate routine operational tasks to improve efficiency and reduce manual effort.
Evaluate and integrate new technologies and tools to enhance the SRE platform.
Ensure security best practices are implemented and maintained across the infrastructure.
Mentor junior engineers and share knowledge across the team.

Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, Systems Engineering, or DevOps roles, with a strong emphasis on production environments.
Proven experience with cloud platforms such as AWS, Azure, or GCP.
Deep understanding of containerization technologies (Docker) and orchestration (Kubernetes).
Proficiency in at least one programming or scripting language (e.g., Python, Go, Bash).
Experience with IaC tools like Terraform, Ansible, or CloudFormation.
Solid understanding of networking concepts (TCP/IP, DNS, HTTP/S).
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
Familiarity with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
Strong troubleshooting and problem-solving skills.
Excellent communication and collaboration abilities.
Experience in designing and implementing disaster recovery and business continuity plans.
Understanding of database technologies and their operational aspects.
