Site Reliability Developer
TechInsights Inc.
Remote · Canada · Full-time · Mid Level
About the role
What You'll Do
- Design, implement, and maintain highly available, scalable infrastructure systems across multi-region AWS deployments, ensuring production environments consistently meet availability and performance requirements.
- Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) in collaboration with development teams, using metrics to quantify and continuously improve system reliability.
- Monitor system performance, availability, and resource utilization using CloudWatch, DataDog, and Prometheus, proactively identifying optimization opportunities and conducting root cause analysis for outages and degradations.
- Implement capacity planning strategies using historical data analysis and growth projections to ensure infrastructure scales ahead of demand, balanced against cost optimization using AWS Cost Explorer and Kubecost.
- Create comprehensive infrastructure-as-code solutions using Terraform and GitOps methodologies to manage AWS resources consistently, securely, and repeatably.
- Develop and maintain CI/CD pipelines using Jenkins, GitLab CI, or GitHub Actions to automate deployment processes with built-in testing and validation.
- Implement and maintain containerization platforms using Docker and Kubernetes, establishing standards for container orchestration, cluster management, and reusable infrastructure patterns.
- Build automation tools and scripts in Python, Go, or Java to eliminate manual operational tasks, reduce toil, and automate routine maintenance procedures including patching, backups, and resource cleanup.
- Lead incident response for critical system outages and performance issues, coordinating cross-functional teams to diagnose and resolve problems with speed and precision.
- Implement comprehensive observability solutions (logging, monitoring, distributed tracing, and intelligent alerting via Grafana and PagerDuty) to ensure rapid response to genuine issues while minimizing alert fatigue.
- Conduct blameless postmortems and thorough post-incident reviews, documenting lessons learned and driving implementation of preventive measures and updated runbooks.
- Develop and maintain disaster recovery procedures and business continuity plans, including regular testing, and collaborate with Security and Compliance teams to ensure monitoring systems meet audit and regulatory requirements.
What You'll Bring
Technical Requirements
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience.
- 5-7 years of experience in Site Reliability Engineering, DevOps, or cloud operations.
- Strong AWS expertise (EC2, ECS/EKS, RDS, S3, Lambda, VPC) and experience with hybrid cloud environments.
- Proficiency in Python, Go, or Java; experience with Docker, Kubernetes, and container orchestration.
- Expertise in infrastructure as code (Terraform, Ansible, CloudFormation) and CI/CD pipeline development.
- Experience with observability tools (Prometheus, Grafana, DataDog, CloudWatch, PagerDuty).
- Solid foundation in Linux/Unix administration, networking, security, and database systems.
Professional Skills
- Independently solves complex problems and drives innovative infrastructure solutions with minimal guidance.
- Translates business challenges into infrastructure and process improvements.
- Communicates technical concepts effectively to both technical and non-technical audiences.
- Leads projects and mentors junior engineers.
Preferred Qualifications
- Experience in semiconductor or technology industry environments.
- AWS certifications (Solutions Architect, DevOps Engineer) or Kubernetes certifications (CKA, CKAD).
- Experience with microservices architecture and distributed systems design.
- Knowledge of security frameworks and compliance requirements (SOC 2, ISO 27001).
- Experience with database administration, performance tuning, and Agile/Scrum methodologies.
- Familiarity with service mesh technologies (Istio, Linkerd).
- Contributions to open source infrastructure projects.
Working Arrangement
- Remote position for candidates based in Canada.
- Occasional travel may be required.
TechInsights is an equal opportunity employer. We are committed to inclusion and provide accommodations for candidates with disabilities.
Skills
AWS Cost Explorer, Ansible, CloudFormation, CloudWatch, DataDog, Docker, EC2, ECS, EKS, GitLab CI, GitOps, Go, Grafana, GitHub Actions, Jenkins, Kubecost, Kubernetes, Lambda, Linux, PagerDuty, Prometheus, Python, S3, Terraform, Unix, VPC