DevOps/SRE (Kubernetes)
Confidential
About the role
About the Company
We operate critical infrastructure that powers applications serving millions of users globally. Our platform runs on Kubernetes across multiple regions, handling high‑traffic workloads with strict SLAs for uptime and performance. We’re looking for experienced infrastructure engineers who can help us scale reliably while maintaining security and operational excellence.
The Role
We’re seeking a Senior DevOps/SRE Engineer to own and evolve our Kubernetes‑based infrastructure. You’ll be responsible for cluster operations, security hardening, performance optimization, and ensuring our platform can scale to meet growing demands. This role requires someone who can balance the operational needs of running production systems with the long‑term vision of building self‑healing, automated infrastructure.
You’ll work closely with product engineering teams to improve developer experience, implement robust CI/CD pipelines, and build the observability systems needed to maintain high reliability. This isn’t just about keeping the lights on: you’ll shape the infrastructure strategy and help establish best practices that let the entire engineering organization move faster without compromising safety.
What You’ll Do
- Manage and optimize multi‑tenant Kubernetes clusters running hundreds of services across multiple AWS regions
- Implement security hardening measures including network policies, pod security standards, RBAC, and secrets management
- Design and maintain Infrastructure as Code using Terraform for all AWS resources and Kubernetes manifests
- Build and improve CI/CD pipelines using GitHub Actions, ArgoCD, or similar tools for automated deployments
- Implement comprehensive observability using Prometheus, Grafana, Loki, and distributed tracing
- Design and implement autoscaling strategies (HPA, VPA, cluster autoscaling) to handle traffic patterns efficiently
- Manage service mesh configurations (Istio, Linkerd) for traffic management and security
- Build disaster recovery procedures and conduct regular failure scenario testing
- Optimize cloud costs through right‑sizing, spot instance usage, and resource efficiency improvements
- Establish and maintain SLOs/SLIs for critical services, implementing alerting that minimizes noise
- Participate in on‑call rotation, responding to incidents and conducting thorough post‑incident reviews
- Create runbooks, documentation, and automation to reduce operational toil
- Collaborate with development teams to optimize application performance and resource usage
- Evaluate and integrate new infrastructure technologies that improve reliability or developer experience
What We’re Looking For
Required
- 5+ years of experience in DevOps, SRE, or platform engineering roles
- Strong proficiency with Terraform for infrastructure as code across cloud providers
- Expert‑level knowledge of AWS services: EC2, EKS, RDS, S3, VPC, IAM, CloudWatch, and more
- Experience with container technologies (Docker, containerd) and container registries
- Hands‑on experience implementing CI/CD pipelines with GitOps principles
- Proficiency in scripting languages (Bash, Python, Go) for automation
- Strong understanding of Linux systems administration and networking fundamentals
- Production experience with monitoring and observability stacks (Prometheus, Grafana, ELK/Loki)
- Understanding of security best practices including secrets management (Vault, SOPS, sealed‑secrets)
- Experience with service mesh technologies and their operational challenges
- Proven ability to debug complex distributed systems issues
- Strong incident response and post‑mortem facilitation skills
- Excellent documentation and communication abilities
Nice to Have
- Experience with multi‑cloud or hybrid cloud architectures
- Background with GitOps tools (ArgoCD, Flux)
- Familiarity with Helm and Kustomize for Kubernetes application management
- Knowledge of eBPF‑based tools (Cilium, Pixie)
- Experience with chaos engineering practices and tools (Chaos Mesh, Litmus)
- Understanding of FinOps and cloud cost optimization strategies
- Experience with compliance requirements (SOC2, HIPAA, PCI‑DSS)
- Background in performance engineering and load testing
- Familiarity with service mesh architectures (Istio, Linkerd, Consul)
- Experience building platform engineering teams or internal developer platforms
- Contributions to Kubernetes or CNCF ecosystem projects
What We Offer
- Competitive salary with equity in a growing infrastructure company
- Comprehensive health, dental, and vision insurance
- Fully remote work within Canada with flexible hours
- Home office stipend for ergonomic setup and monitors
- $2,000 annual learning budget for certifications (CKA, CKAD, AWS) and training
- Access to cloud sandbox environments for learning and experimentation
- Collaborative SRE team culture with blameless post‑mortems
- Reasonable on‑call schedule with compensation for after‑hours work
- Latest MacBook Pro or Linux laptop of your choice
- Generous PTO policy and paid company holidays
About You
You have a strong bias toward automation and eliminating toil. You understand that the best infrastructure is invisible to developers but resilient under stress. You’re comfortable making trade‑offs between feature velocity and operational stability. You value simplicity and are skeptical of adding complexity without clear benefits. You enjoy teaching others and building tools that make the entire team more effective. You stay calm during incidents and focus on restoration first, analysis second.