Site Reliability Engineer

Thinkific

Kootenay Bay · On-site · Full-time · Lead · 3mo ago

About the role

Position

Staff Site Reliability Engineer

Location

Kootenay Bay

About

Are you an experienced Site Reliability Engineer looking for a new challenge? We’re looking for a Staff Site Reliability Engineer to join us at Thinkific.

As a Staff Site Reliability Engineer (SRE), you will help us scale and secure the infrastructure that powers thousands of online course creators around the world.

In this position, you’ll play a critical role in improving the performance, reliability, and security of our platform. You’ll work cross‑functionally with engineers, product managers, and stakeholders to drive reliability‑focused initiatives, build scalable systems, and mentor others. You’ll also help shape our technical strategy, lead major infrastructure projects, and act as a domain expert in modern cloud‑native practices, with a specific emphasis on Kubernetes, cloud infrastructure (AWS), observability, and service reliability.

Your goal will be to help guide and execute on projects related to your technical domain.

Responsibilities

Own one or more technical domains across our infrastructure with accountability for system reliability, performance, scalability, and security
Lead projects to evolve our Kubernetes‑based platform, ensuring alignment with SLOs, security best practices, and long‑term maintainability
Contribute to the design and evolution of our infrastructure using Terraform, Helm, and cloud‑native tools, with an emphasis on modularity, reuse, and automation
Partner with engineering teams to design robust deployment pipelines, ensure operational readiness, and build secure‑by‑default patterns for new services
Lead incident response efforts and participate in on‑call rotation, driving a culture of blameless postmortems and learning
Write infrastructure and application code in Ruby, Node.js, Python, or Bash to automate operations and improve developer experience
Serve as a mentor and multiplier, raising the technical bar through coaching, knowledge sharing, and technical leadership
Promote observability, testing, and continuous improvement in everything you build, and advocate for them within your team
Participate in our on‑call rotation and incident response processes to help maintain a high level of service reliability

Requirements

Has 6+ years of experience in software or infrastructure engineering, including 4+ years working with Kubernetes in production environments
Holds a CKA certification or has equivalent hands‑on Kubernetes expertise (bonus for experience managing multi‑tenant clusters or complex networking in K8s)
Has deep knowledge of TLS, certificates, ciphers, and encryption protocols, and can explain how they secure communications in a distributed system
Has production experience with AWS infrastructure and services (EKS, RDS, IAM, ALB, S3, etc.)
Writes infrastructure‑as‑code using Terraform, and has built scalable and secure infrastructure following modular and reusable patterns
Is comfortable with monitoring and observability tooling (e.g., New Relic, Datadog, Prometheus, Grafana, Sentry) and building alerting based on meaningful SLOs
Has experience supporting distributed systems with relational and non‑relational databases (PostgreSQL, AWS Aurora), message queues (Sidekiq, SNS/SQS), and asynchronous architectures
Enjoys collaborating across teams and helping shape engineering roadmaps and architectural direction
Brings a strong ownership mentality, cares deeply about developer experience and operational excellence, and thrives in a fast‑paced environment
Loves to learn and grow. They’ve found (and keep looking for) ways to level up their skills in this field, whether that’s through formal education, gaining professional experience, or maybe even building their own business

Nice to Have (could be learned on the job)

Experience with Database Administration (DBA) practices, including performance tuning, replication strategies, backup and recovery planning, and operational support for PostgreSQL or AWS Aurora environments
Experience working with Ruby on Rails and/or Node.js applications in production
Familiarity with Cloudflare, load…

Skills

AWS, AWS Aurora, AWS Lambda, Bash, CKA, Cloudflare, Datadog, Docker, EKS, Grafana, Helm, IAM, Kubernetes, New Relic, Node.js, Observability, Prometheus, Python, RDS, React, Ruby, S3, Sentry, Sidekiq, SLOs, SNS/SQS, Terraform, TLS
