Site Reliability Engineer (SRE)

H D

Remote · Canada · Full-time · Mid Level · $50k – $70k/yr · 1mo ago

About the role

About Us

We’re hiring an SRE who takes production personally: someone who loses sleep over p99 latency, gets excited about runbook automation, and believes on-call should be boring because systems are already resilient, observable, and well-designed.

This role is for an engineer who thrives at the intersection of infrastructure, automation, reliability, and developer experience. You’ll work across cloud platforms, CI/CD systems, distributed applications, and production operations to ensure our systems remain scalable, secure, and highly available.

What You’ll Do

Lead reliability and operational excellence initiatives across our cloud infrastructure spanning AWS, Azure, GCP, and hybrid/private cloud environments
Design, implement, and maintain scalable infrastructure using Terraform, Ansible, and infrastructure-as-code best practices
Own and improve production systems running on AWS services including ECS, Fargate, Lambda, Aurora MySQL, RDS, ElastiCache, and S3
Maintain healthy, observable deployments across Azure App Services and Netlify environments
Manage and optimize Cloudflare configurations including WAF, DNS, caching, Workers, and edge security policies
Build and improve CI/CD pipelines using GitHub Actions, Jenkins, and related tooling with a focus on deployment safety, rollback strategies, and release velocity
Define and enforce SLOs, SLAs, error budgets, monitoring standards, and incident response processes
Drive postmortems that produce measurable operational improvements — not just documentation
Develop automation tools and scripts using Python, Bash, Go, PowerShell, or Ruby to reduce manual operational work
Manage and support Kubernetes and Docker-based containerized environments for microservices architectures
Monitor system performance, troubleshoot production issues proactively, and optimize availability, latency, and scalability
Collaborate closely with engineering teams to design resilient systems and improve application reliability from development through production
Support secure cloud operations through implementation of access controls, firewalls, VPNs, and infrastructure security best practices
Maintain clear operational documentation, runbooks, and architecture standards
Participate in incident response rotations and reliability planning initiatives

What We’re Looking For

3–5+ years of hands-on experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering
Strong expertise in AWS and production experience with ECS, Lambda, managed databases, and cloud-native architectures
Experience working with Azure and/or GCP environments in production
Strong knowledge of Kubernetes, Docker, and microservices-based systems
Experience with Infrastructure as Code and configuration management tools such as Terraform, Ansible, or Puppet
Solid Linux/Unix systems administration skills; Windows Server experience is a plus
Strong scripting and automation experience with Python, Bash, Go, PowerShell, or Ruby
Experience building and maintaining CI/CD pipelines using GitHub Actions, Jenkins, or similar tools
Experience configuring and debugging Cloudflare in production environments — beyond basic DNS management
Familiarity with observability and monitoring practices including metrics, logging, tracing, and alerting systems
Experience with relational and NoSQL databases including MySQL, PostgreSQL, MongoDB, Cassandra, or similar technologies
Understanding of distributed systems, REST APIs, SOA, and modern application deployment practices
Ability to read and understand application codebases (Node.js, Next.js, or similar) and evaluate infrastructure implications
Strong communication skills across engineering teams, leadership stakeholders, and incident response channels

Nice to Have

Experience with private cloud or virtualization platforms such as OpenStack, VMware, Citrix, or VirtualBox
Familiarity with SaaS/PaaS environments and large-scale distributed systems
Exposure to security engineering, edge networking, or performance optimization
Experience supporting high-traffic production environments with strict uptime requirements
Background in Agile development environments and SDLC best practices

Why Join Us

You’ll have the opportunity to work on mission-critical systems using modern cloud-native technologies while shaping the reliability culture of the organization. We value engineers who automate relentlessly, think systematically, and care deeply about operational excellence.

Compensation

Pay: $50,000.00-$70,000.00 per year

Work Location

Remote

Skills

Ansible, AWS, AWS Aurora MySQL, AWS ECS, AWS ElastiCache, AWS Fargate, AWS Lambda, AWS RDS, AWS S3, Azure, Bash, CI/CD, Cloudflare, Docker, GCP, GitHub Actions, Go, Infrastructure as Code, Jenkins, Kubernetes, Linux, Microservices, MongoDB, MySQL, Netlify, Node.js, Observability, PostgreSQL, PowerShell, Python, Ruby, Terraform, Unix, Virtualization, VMware, Windows Server
