Remote Site Reliability Engineer
Deltatre
About the Role
The Site Reliability Engineer (SRE) is responsible for improving the reliability, stability, and operational readiness of critical digital platforms. The role focuses on proactively reducing risk, strengthening system resilience, and enabling product and engineering teams to operate with confidence, particularly during live events, launches, and other high‑traffic periods. It requires flexibility to support live operations onsite (in the client’s operations center) and to provide regular on‑call support during evening and weekend live event windows and other key periods. If these requirements lead to more than 44 hours of work per week, overtime will be paid.
Outside of these event‑driven windows, the role supports flexible and remote working arrangements, provided there is some consistent onsite presence.
SREs will operate, monitor, and enhance the Deltatre OTT platform, which is built on the latest technologies and designed to withstand millions of concurrent users. SREs will collaborate with other engineering teams, service owners, and support teams to ensure services are highly available and performant.
Responsibilities
- Improve system availability, performance, and fault tolerance across production environments.
- Identify systemic risks and lead initiatives to reduce operational fragility.
- Lead or support incident response for high‑severity production issues, particularly during evenings, weekends, and live operations as required.
- Design and maintain monitoring, alerting, and logging strategies that prioritize actionable signals over noise.
- Partner with engineering teams to embed reliability considerations into system design.
- Reduce manual operational effort through automation, tooling, and improved deployment practices.
- Improve deployment safety, rollback mechanisms, and change management processes.
- Support capacity planning and performance testing.
Technical Experience
- Cloud platforms such as AWS and/or Azure
- MongoDB (including monitoring and operating in production) and Redis
- CI/CD pipelines using tools such as Bamboo, GitHub, and Octopus
- Observability and monitoring platforms such as New Relic and Datadog
Programming & Systems Expertise
- Proficiency in one or more general‑purpose programming languages, such as C#, JavaScript, Java, PowerShell, Go, or Python
- .NET / C# applications (a significant advantage, as our backend services are written in C#)
- Full‑stack troubleshooting capability, spanning network, application, infrastructure, and distributed services layers
- Familiarity with load and performance testing tools such as k6, Gatling, or JMeter
- Drive to push boundaries and to lead change and performance improvements
- Solid technical grounding, with the ability to advise both clients and internal teams
Culture & Values
Our people are key to our success and we pride ourselves on offering a dynamic, creative, innovative and supportive environment. Everyone has the opportunity to reach their full potential, and every team member is expected to treat everyone with dignity and respect, value different perspectives, use inclusive language and work in alignment with Deltatre's commitment to diversity and inclusion. Depending on the role, the hiring process normally includes a written test and an interview.
Compensation
The salary range for this position is CAD 110,000 – CAD 164,000.