DevOps/SRE (Kubernetes)

Confidential

Remote · Canada · Full-time · Senior · 3mo ago

About the Company

We’re operating critical infrastructure that powers applications serving millions of users globally. Our platform runs on Kubernetes across multiple regions, handling high‑traffic workloads with strict SLAs for uptime and performance. We’re looking for experienced infrastructure engineers who can help us scale reliably while maintaining security and operational excellence.

The Role

We’re seeking a Senior DevOps/SRE Engineer to own and evolve our Kubernetes‑based infrastructure. You’ll be responsible for cluster operations, security hardening, performance optimization, and ensuring our platform can scale to meet growing demands. This role requires someone who can balance the operational needs of running production systems with the long‑term vision of building self‑healing, automated infrastructure.

You’ll work closely with product engineering teams to improve developer experience, implement robust CI/CD pipelines, and build the observability systems needed to maintain high reliability. This isn’t just about keeping the lights on: you’ll shape the infrastructure strategy and help establish best practices that let the entire engineering organization move faster without sacrificing safety.

What You’ll Do

Manage and optimize multi‑tenant Kubernetes clusters running hundreds of services across multiple AWS regions
Implement security hardening measures including network policies, pod security standards, RBAC, and secrets management
Design and maintain Infrastructure as Code using Terraform for all AWS resources and Kubernetes manifests
Build and improve CI/CD pipelines using GitHub Actions, ArgoCD, or similar tools for automated deployments
Implement comprehensive observability using Prometheus, Grafana, Loki, and distributed tracing
Design and implement autoscaling strategies (HPA, VPA, cluster autoscaling) to handle traffic patterns efficiently
Manage service mesh configurations (Istio, Linkerd) for traffic management and security
Build disaster recovery procedures and conduct regular failure scenario testing
Optimize cloud costs through right‑sizing, spot instance usage, and resource efficiency improvements
Establish and maintain SLOs/SLIs for critical services, implementing alerting that minimizes noise
Participate in the on‑call rotation, responding to incidents and conducting thorough post‑incident reviews
Create runbooks, documentation, and automation to reduce operational toil
Collaborate with development teams to optimize application performance and resource usage
Evaluate and integrate new infrastructure technologies that improve reliability or developer experience

What We’re Looking For

Required

5+ years of experience in DevOps, SRE, or platform engineering roles
Strong proficiency with Terraform for infrastructure as code across cloud providers
Expert‑level knowledge of AWS services: EC2, EKS, RDS, S3, VPC, IAM, CloudWatch, and more
Experience with container technologies (Docker, containerd) and container registries
Hands‑on experience implementing CI/CD pipelines with GitOps principles
Proficiency in scripting languages (Bash, Python, Go) for automation
Strong understanding of Linux systems administration and networking fundamentals
Production experience with monitoring and observability stacks (Prometheus, Grafana, ELK/Loki)
Understanding of security best practices including secrets management (Vault, SOPS, sealed‑secrets)
Experience with service mesh technologies and their operational challenges
Proven ability to debug complex distributed systems issues
Strong incident response and post‑mortem facilitation skills
Excellent documentation and communication abilities

Nice to Have

Experience with multi‑cloud or hybrid cloud architectures
Background with GitOps tools (ArgoCD, Flux)
Familiarity with Helm and Kustomize for Kubernetes application management
Knowledge of eBPF‑based tools (Cilium, Pixie)
Experience with chaos engineering practices and tools (Chaos Mesh, Litmus)
Understanding of FinOps and cloud cost optimization strategies
Experience with compliance requirements (SOC2, HIPAA, PCI‑DSS)
Background in performance engineering and load testing
Familiarity with service mesh architectures (Istio, Linkerd, Consul)
Experience building platform engineering teams or internal developer platforms
Contributions to Kubernetes or CNCF ecosystem projects

What We Offer

Competitive salary with equity in a growing infrastructure company
Comprehensive health, dental, and vision insurance
Fully remote work within Canada with flexible hours
Home office stipend for ergonomic setup and monitors
$2,000 annual learning budget for certifications (CKA, CKAD, AWS) and training
Access to cloud sandbox environments for learning and experimentation
Collaborative SRE team culture with blameless post‑mortems
Reasonable on‑call schedule with compensation for after‑hours work
Latest MacBook Pro or Linux laptop of your choice
Generous PTO policy and paid company holidays

About You

You have a strong bias toward automation and eliminating toil. You understand that the best infrastructure is invisible to developers but resilient under stress. You’re comfortable making trade‑offs between feature velocity and operational stability. You value simplicity and are skeptical of adding complexity without clear benefits. You enjoy teaching others and building tools that make the entire team more effective. You stay calm during incidents and focus on restoration first, analysis second.

Skills

AWS, ArgoCD, Bash, containerd, Docker, EC2, EKS, ELK, FinOps, Flux, GitOps, GitHub Actions, Grafana, Go, Helm, HIPAA, IAM, Istio, Kubernetes, Linkerd, Linux, Litmus, Loki, PCI-DSS, Prometheus, Python, RDS, S3, SOC2, SOPS, Terraform, Vault, VPC, VPA