Senior Site Reliability Engineer (SRE)
WhatJobs Direct
About the role
Our client is seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join their growing infrastructure team. This critical role focuses on ensuring the scalability, availability, and performance of the client's complex systems and services. You will be instrumental in designing, implementing, and automating infrastructure solutions, developing robust monitoring strategies, and leading incident response efforts. The ideal candidate has a strong background in systems engineering and cloud technologies, and a passion for building highly reliable and efficient systems. This is an opportunity to work with cutting‑edge technologies and contribute to a stable and high‑performing production environment.
Key Responsibilities
- Design, build, and maintain scalable, reliable, and secure infrastructure on cloud platforms (e.g., AWS, Azure, GCP).
- Develop and implement robust monitoring, alerting, and logging solutions to ensure system health and performance.
- Lead incident response efforts, perform root cause analysis, and implement preventative measures to avoid recurrence.
- Automate infrastructure provisioning, configuration management, and deployment processes using tools like Terraform, Ansible, or similar.
- Optimize system performance and resource utilization to enhance efficiency and reduce costs.
- Collaborate with development teams to ensure services are designed for reliability and operability.
- Develop and maintain comprehensive documentation for infrastructure, systems, and processes.
- Participate in on‑call rotations to provide 24/7 support for production systems.
- Identify and mitigate potential risks to system stability and availability.
- Contribute to capacity planning and performance testing initiatives.
- Implement and enforce security best practices across the infrastructure.
- Stay current with emerging technologies and industry trends in site reliability and cloud computing.
- Mentor junior engineers and share knowledge within the team.
- Troubleshoot complex system issues across the entire technology stack.
- Drive continuous improvement in infrastructure reliability and operational efficiency.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- Minimum of 5 years of experience in Site Reliability Engineering, Systems Engineering, or a related role.
- Proven expertise in cloud platforms (AWS, Azure, or GCP), including infrastructure as code (IaC) tools.
- Strong proficiency in scripting and programming languages (e.g., Python, Go, Bash).
- In‑depth knowledge of containerization technologies (e.g., Docker, Kubernetes).
- Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Solid understanding of networking concepts (TCP/IP, DNS, HTTP).
- Experience with database administration and optimization is a plus.
- Excellent problem‑solving and analytical skills.
- Strong communication and collaboration abilities.
- Ability to work effectively under pressure and manage critical incidents.
Location
- Halifax, Nova Scotia, CA
Application
If you are a seasoned SRE looking to contribute to a stable and scalable infrastructure, apply today.