Senior Site Reliability Engineer (SRE) Ottawa, Ontario, Canada Product Development Posted 6 hours ago
Ericsson
About the role
Hybrid Senior Site Reliability Engineer (SRE) – Ottawa, Ontario, Canada
Grow with us
What you will do
- Serve as a technical leader ensuring production service reliability, scalability, and performance.
- Collaborate with development teams to embed operability and automation into system architecture.
- Lead high‑severity incident response, driving resolution and coordinating stakeholder communications.
- Champion root cause analysis and postmortems; ensure remediation is implemented and verified.
- Design and maintain sophisticated monitoring, alerting, deployment, and infrastructure automation systems.
- Oversee creation and regular review of operational runbooks/playbooks; lead resilience and chaos testing exercises.
- Drive service lifecycle processes, including operational readiness, onboarding, and decommissioning.
What you will bring
- Ericsson uses a merit‑based hiring approach that values people with different experiences, perspectives and skillsets. We truly believe this approach drives innovation, which is essential for our future growth.
- We encourage people from all backgrounds to apply and realize their full potential as part of our Ericsson team.
- Ericsson is proud to be an Equal Opportunity employer.
- If you need assistance or to request an accommodation due to a disability, please contact Ericsson at .
Disclaimer
The above statements are intended to describe the general nature and level of work being performed by employees in this position. They are not an exhaustive list of all responsibilities, duties and skills required for this position, and you may be required to perform additional job tasks as assigned.
Primary location
- Country: Canada (CA)
- City: Ottawa
Compensation and Benefits at Ericsson
Pay
- Salary range for this position (Ottawa): $129,500 – $170,100
- Short‑Term Variable Compensation Plan: opportunity for an annual bonus based on business performance, unit objectives, individual performance, and individual bonus target (certain eligibility and pro‑ration rules apply).
Health
- Excellent health benefits including the choice of 3 medical and dental plan options.
- Core level coverage is paid for fully by Ericsson.
Financial Security
- Automatic 2% company contribution into the Pension Plan.
- 50% match of the employee’s contribution into the Registered Retirement Savings Plan, on employee contributions of up to 8% (maximum match of 4%), for a total potential company contribution of 6%.
- Basic life insurance and basic accidental death and dismemberment coverage at two‑times annual base pay at no cost.
- Short‑term disability coverage.
- Option to participate in Ericsson’s Stock Purchase Plan.
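The pension and RRSP figures above can be worked through with a short sketch. It assumes the percentages apply to annual base pay, which the posting does not state explicitly; the function name and the example salary (the bottom of the posted Ottawa range) are only illustrative.

```python
def company_contribution(base_pay, employee_rrsp_rate):
    """Return (pension, rrsp_match, total) in dollars.

    Assumption: all percentages are taken against annual base pay.
    employee_rrsp_rate is the employee's own RRSP contribution as a
    fraction of pay, e.g. 0.08 for 8%.
    """
    pension_rate = 0.02    # automatic company pension contribution
    match_fraction = 0.50  # company matches half of the employee's contribution
    match_cap = 0.04       # the match never exceeds 4%

    pension = base_pay * pension_rate
    match_rate = min(employee_rrsp_rate * match_fraction, match_cap)
    rrsp_match = base_pay * match_rate
    return pension, rrsp_match, pension + rrsp_match


if __name__ == "__main__":
    # Illustrative only: lowest posted salary, employee contributing 8%.
    pension, match, total = company_contribution(129_500, 0.08)
    print(f"pension: ${pension:,.2f}  match: ${match:,.2f}  total: ${total:,.2f}")
```

Under these assumptions, an employee contributing 8% receives the full 6% from the company (2% pension plus 4% match), while one contributing 4% receives 4% (2% plus 2%). Contributing more than 8% does not raise the match.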
Time Off
- Minimum of 18 days of accrued vacation, plus at least 3 personal days, minimum 10 holidays, 1 volunteer day, and sick days (paid time off is pro‑rated based on start date).
- Up to 10 weeks of paid maternity leave and 6 weeks of parental or adoption leave at 100% of pay.
Additional Benefits
- Financial wellness programs, educational assistance, matching gifts, wellness account, and recognition programs.
Note: Ericsson Canada Inc. does not provide immigration assistance/sponsorship now or in the future for this position.
About this exciting opportunity
We are looking for an accomplished Senior Site Reliability Engineer to champion the reliability, availability, performance, and scalability of mission‑critical services. In this role, you will partner closely with development and operations teams, guide system design for operability and automation, and provide leadership in incident response and continuous improvement initiatives. You will set technical direction, mentor peers, and implement advanced tooling and practices that ensure our systems are robust, observable, and efficient.
Required qualifications
- B.Sc. or M.Sc. degree in a relevant area, or equivalent experience.
- 7‑10+ years in systems engineering, DevOps, or SRE roles, with at least 3 years in a senior/lead capacity driving reliability initiatives.
- Expert knowledge of SRE principles: SLIs, SLOs, error budgets, and reliability engineering methodologies.
- Advanced Linux systems administration and troubleshooting skills, spanning cloud (AWS/Azure/GCP) and on‑premises environments.
- Extensive production experience with Kubernetes and container ecosystems (Docker, CRI).
- Proficiency with Infrastructure as Code (Terraform, CloudFormation, Ansible) and automation scripting (Python, Go, Bash).
- Strong background in designing/operating CI/CD pipelines, automated deployments, and rollout strategies (canary, blue‑green).
- Expertise with observability tools such as Prometheus, Grafana, ELK/EFK, Splunk, plus distributed tracing frameworks (Jaeger, Zipkin, OpenTelemetry).
- Solid networking skills (TCP/IP, routing, load balancing) and security best practices (TLS, identity, secrets management).
- Demonstrated thought leadership in designing and operating complex distributed systems.
- Proven ability in capacity planning, performance tuning, profiling, and cost optimization at scale.
- Understanding of telecom architectures (IMS, 4G/5G core concepts) and carrier‑grade availability standards.
- Experience with OSS/BSS, network management tooling, and telecom protocols.
- Knowledge of regulatory/compliance constraints in telecom deployments.
- Reliability‑first, automation‑first, and risk‑aware approach; skilled at balancing speed and safety in delivery.
- Advanced cloud or Kubernetes certifications (AWS Professional, Azure Expert, GCP Professional, CKA/CKAD) beneficial.
- SRE leadership training, incident response, or chaos engineering certifications preferred.
Operational Leadership
- Command operational excellence during incidents, coordinating cross‑team responses in high‑pressure situations.
- Lead structured problem‑solving for deep root cause analysis with actionable follow‑through.
- Establish operational standards, best practices, and governance for reliability engineering across teams.
Soft Skills & Collaboration
- Exceptional communication to bridge technical and business contexts, influencing senior stakeholders.
- Mentorship and coaching for junior and mid‑level engineers; fostering a culture of reliability‑first thinking.
- Strategic decision‑making under pressure, balancing innovation with risk management.
- Initiative to identify systemic risks and champion enterprise‑grade improvements.
Top Skills
- Corrective Action
- Coordination
- Coaching
- Circuits
- Campaign Management
- Business Continuity Planning
- Budgeting
- Broadcasting
- Availability
- Automated Storage and Retrieval Systems