Staff Infrastructure & SRE Engineer
Intuitive.ai
About Intuitive
Intuitive is an innovation-led engineering company delivering business outcomes for hundreds of enterprises globally. With a reputation as a Tiger Team and a trusted partner of enterprise technology leaders, we help solve the most complex digital transformation challenges across the following Intuitive Superpowers:
Modernization & Migration
- Application & Database Modernization
- Cloud Native Engineering, Migration to Cloud, VMware Exit
- FinOps
Data & AI/ML
Cybersecurity
- Infrastructure Security
- Application Security
- Data Security
SDx & Digital Workspace (M365, G-suite)
- SDDC, SD-WAN, SDN, NetSec, Wireless/Mobility
- Email, Collaboration, Directory Services, Shared File Services
Intuitive Services:
- Professional and Advisory Services
- Elastic Engineering Services
- Managed Services
- Talent Acquisition & Platform Resell Services
About the job
Start Date: Immediately
# of Positions: 1
Position Type: Full Time / Contract
Location: Remote across Canada (occasional travel to USA)
About the Role
The Staff Infrastructure & SRE Engineer will own the full lifecycle of our cloud-native platform — from provisioning and sizing AWS and Kubernetes infrastructure, to maintaining reliability through observability, release engineering, and incident response. This is a deeply hands‑on engineering role with real production ownership, where you'll balance technical depth with operational leadership to keep our platform reliable and scalable.
You will write Terraform, Python, and Shell scripts daily, manage EKS clusters at scale, integrate applications into APM and monitoring systems, and enforce DevOps best practices including change control and uptime monitoring. Your focus will be on platform reliability and operational excellence: building the automation, observability, and infrastructure-as-code foundations that make our cloud platform programmable, observable, and resilient. We value engineers who automate relentlessly, own their systems end-to-end, and drive reliability improvements through data and discipline.
Key Responsibilities
- Own AWS infrastructure provisioning and operations ensuring production reliability across VPCs, EC2, RDS, S3, IAM, Route 53, ALB/NLB in multi-account environments following AWS Well-Architected Framework principles; implement cost optimization, right-sizing, and resource tagging strategies
- Lead Kubernetes platform operations end-to-end from provisioning EKS clusters from scratch through full lifecycle management — sizing and capacity planning with Cluster Autoscaler/Karpenter, version upgrades, node group rotations, and breaking-change migrations
- Drive infrastructure-as-code excellence setting standards for Terraform/OpenTofu module development with automated testing (Terratest, plan validation), reliable state management with remote backends, and governance enforcement through policy checks (OPA/Rego, tflint)
- Own end-to-end observability and APM integration ensuring full visibility across infrastructure and applications: design monitoring frameworks with Prometheus, Grafana, Loki, Tempo, and OpenTelemetry; instrument applications for distributed tracing and structured logging; define and track SLIs/SLOs for platform services
- Lead release engineering and change control from planning through production deployment — coordinate infrastructure and application releases with rollback plans, validation gates, maintenance windows, and audit trails for all production changes
- Drive incident response and platform reliability building on-call rotations, escalation paths, actionable runbooks, and blameless postmortem processes; implement chaos engineering practices to proactively identify platform weaknesses
- Own environment provisioning pipelines ensuring repeatable, automated infrastructure delivery from bare AWS accounts to fully operational platforms across dev, staging, and production
- Build GitOps workflows implementing ArgoCD or Flux for declarative cluster and application management, ensuring all changes flow through Git with PR-based review and automated validation
- Develop automation and tooling writing Python CLI tools, Bash scripts, and CI/CD pipelines (GitLab CI/GitHub Actions) for infrastructure provisioning