Site Reliability Engineer — Production

Astra-North Infoteck Inc. ~ Conquering today’s challenges, achieving tomorrow’s vision!

Toronto · Hybrid · Full-time · Senior

About the role

Job Title: Site Reliability Engineer (SRE) – 6‑8 Years Experience
Location: Toronto, ON (Hybrid – 3 days on‑site, 2 days remote)
Job ID: J‑18808‑Ljbffr
Company: Confidential Tech Firm – Toronto, Canada

About the Company

We are a fast‑growing technology firm based in the heart of Toronto, building scalable, cloud‑native platforms that power mission‑critical services for global customers. Our culture blends engineering excellence with a collaborative, “ownership‑first” mindset. As we expand, we need seasoned reliability experts to keep our services performant, resilient, and secure.

Role Overview

As a Site Reliability Engineer, you will bridge the gap between development and operations, ensuring that our production systems run smoothly at scale. You’ll own the reliability of complex, distributed services, drive incident response, and build automation that reduces toil. This is a hybrid role that demands strong analytical thinking, clear communication, and a proactive ownership attitude.

Key Responsibilities

Area	What You’ll Do
Production Support & Incident Management	• Lead on‑call rotations, triage alerts, and drive post‑mortems. • Coordinate cross‑functional response during high‑severity incidents, ensuring timely communication to stakeholders.
Reliability Engineering	• Design, implement, and maintain SLO/SLI frameworks. • Build and evolve monitoring, alerting, and observability pipelines (Prometheus, Grafana, OpenTelemetry, etc.).
Automation & Tooling	• Develop self‑service tooling and CI/CD pipelines (GitHub Actions, Jenkins, Argo CD). • Write reusable libraries and scripts to eliminate manual toil.
Infrastructure & Cloud	• Architect, provision, and manage cloud resources (AWS, Azure, GCP) using IaC (Terraform, CloudFormation, Pulumi). • Optimize cost, performance, and security of cloud workloads.
Distributed Systems	• Diagnose and resolve issues in micro‑service architectures, message queues, and data stores (Kafka, Redis, PostgreSQL, Cassandra).
Collaboration & Knowledge Sharing	• Partner with software engineers to embed reliability best practices early in the development lifecycle. • Run workshops and brown‑bag sessions, and write documentation, to spread SRE culture.
Continuous Improvement	• Identify patterns of recurring failures and drive systemic improvements. • Contribute to capacity planning, disaster‑recovery testing, and security hardening.

Required Qualifications

Skill	Minimum Requirement
Experience	6‑8 years in production support, incident management, or site reliability engineering.
Programming / Scripting	Proficient in at least one general‑purpose language (Go, Python, Java, or Rust) and strong UNIX shell scripting (bash, ksh, zsh).
Cloud Platforms	Hands‑on experience with AWS, Azure, or GCP (designing, deploying, and operating services).
Infrastructure as Code	Terraform, CloudFormation, Pulumi, or similar.
Observability	Prometheus, Grafana, ELK/EFK, OpenTelemetry, Splunk, or equivalent.
CI/CD	Jenkins, GitHub Actions, GitLab CI, Argo CD, or similar pipelines.
Distributed Systems	Understanding of micro‑services, message brokers, load balancers, and data stores.
Analytical Skills	Ability to dissect complex problems, root‑cause failures, and propose data‑driven solutions.
Communication	Clear written and verbal communication; comfortable presenting to technical and non‑technical audiences.
Ownership Mindset	Proactive self‑starter who takes end‑to‑end responsibility for reliability outcomes.
UNIX/Linux	Deep familiarity with Linux/UNIX environments, networking, and system internals.

Preferred (Nice‑to‑Have) Skills

Certifications: AWS Certified Solutions Architect, Google Cloud Professional Engineer, or Azure Solutions Architect.
Experience with container orchestration (Kubernetes, Docker Swarm).
Knowledge of service mesh technologies (Istio, Linkerd).
Familiarity with security frameworks (IAM, RBAC, secrets management).
Experience in a regulated industry (finance, healthcare, etc.).

What We Offer

Competitive salary + performance‑based bonuses.
Comprehensive health, dental, and vision benefits.
401(k)/RRSP matching program.
Generous paid time off + company holidays.
Professional development budget (certifications, conferences, courses).
Modern office in downtown Toronto with flexible hybrid schedule.
Collaborative, inclusive culture that values diversity of thought.

How to Apply

Prepare your résumé highlighting relevant SRE experience, tooling, and cloud projects.
Write a brief cover letter (max 300 words) explaining why you’re excited about this role and how your ownership mindset has driven reliability improvements in past positions.
Submit your application through our careers portal or email sre-recruit@yourcompany.com with the subject line:

[Job ID J‑18808‑Ljbffr] Site Reliability Engineer – Your Name

All applications will be reviewed confidentially. Only shortlisted candidates will be contacted for an interview.

Join us and help shape the future of reliable, cloud‑native services in Toronto and beyond! 🚀

Skills

cloud technologies, distributed systems, UNIX shell scripting
