Site Reliability Engineer; SRE
Deltatre
About the role
Position: Site Reliability Engineer (SRE)
About
The Site Reliability Engineer (SRE) is responsible for improving the reliability, stability, and operational readiness of critical digital platforms. The role focuses on proactively reducing risk, strengthening system resilience, and enabling product and engineering teams to operate with confidence—particularly during live events, launches, and other high‑traffic periods. This role is dedicated to a major downtown Toronto‑based client.
The role requires a degree of flexibility to support live operations onsite (in the client’s operations center) and regular on‑call support during evening and weekend live event windows and other key periods. If these requirements lead to work beyond 44 hours per week, overtime will be paid.
Outside of these event‑driven windows, the role supports flexible and remote working arrangements, provided there is some consistent onsite presence.
SREs will operate, monitor, and enhance the Deltatre OTT platform, which is designed to withstand millions of concurrent users and uses the latest cutting‑edge technologies. On a daily basis, SREs will innovate, automate, maintain, and secure our cloud‑based platform. SREs will collaborate with other engineering teams, service owners, and support teams to ensure services are highly available and performant.
Key Responsibilities
Reliability & Stability
- Improve system availability, performance, and fault tolerance across production environments.
- Define, measure, and track Service Level Objectives (SLOs), error budgets, and reliability metrics.
- Identify systemic risks and lead initiatives to reduce operational fragility.
Incident Management & Readiness
- Lead or support incident response for high‑severity production issues, particularly during evenings, weekends, and live operations as required.
- Establish and refine incident response processes, runbooks, and escalation paths, ensuring that B2B and Incident Management teams are informed of and trained on the procedures.
- Conduct post‑incident reviews (blameless retrospectives) and ensure follow‑up actions are completed.
Observability & Tooling
- Design and maintain monitoring, alerting, and logging strategies that prioritize actionable signals over noise.
- Improve visibility into system health to enable faster detection and resolution of issues.
- Partner with engineering teams to embed reliability considerations into system design.
Automation & Operational Efficiency
- Reduce manual operational effort through automation, tooling, and improved deployment practices.
- Improve deployment safety, rollback mechanisms, and change management processes.
- Support capacity planning and performance testing.
Core Technical Experience
- Cloud platforms such as AWS and/or Azure
- Containerized workloads using Docker and OCI‑compliant containers
- MongoDB (including monitoring and operating in production) and Redis
- CI/CD pipelines using tools such as Bamboo, GitHub, and Octopus
- Scripting and automation with PowerShell and/or Bash
- Observability and monitoring platforms such as New Relic and Datadog
- Infrastructure as Code using Terraform and/or CloudFormation
Programming & Systems Expertise
- Proficiency in one or more general‑purpose programming languages, such as C#, JavaScript, Java, PowerShell, Go, or Python
- Strong ability to read, understand, and debug .NET / C# applications (a significant advantage, as our backend services are written in C#)
- Experience developing or supporting highly scalable, distributed systems
- Hands‑on experience with microservices architectures, leveraging virtualization and/or containerization
- Full‑stack troubleshooting capability, spanning network, application, infrastructure, and distributed services layers
- Familiarity with load and performance testing tools such as k6, Gatling, or JMeter
Desired Traits
- Driven to push boundaries and lead change and performance
- Communicative to leave…
Requirements
- Hands‑on problem‑solver with ownership from first alert to permanent resolution
- Practical experience across most components of the technology stack