Senior Site Reliability Engineer

Devopie Inc.

Hamilton · On-site · Full-time · Senior · 2w ago

About the role

**💡 What You’ll Do**

You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.

**🔎 Reliability Engineering**

- Define and manage **SLIs, SLOs, and error budgets**
- Reduce MTTD, MTTA, and MTTR through structured incident response
- Conduct blameless postmortems and drive preventative improvements
- Champion reliability in architectural reviews and production readiness

**📊 Observability & Monitoring**

- Design actionable, symptom-based alerts (not noise)
- Build dashboards and tracing systems using tools like **CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT**
- Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
- Ensure full observability coverage across critical paths

**☁️ Cloud & Infrastructure**

- Operate and optimize **AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)**
- Build resilient, multi-AZ and regionally replicated systems
- Implement autoscaling and fault-tolerant architecture
- Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)

**🤖 Automation & Toil Reduction**

- Eliminate manual processes through automation
- Build self-healing infrastructure
- Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
- Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java

**📈 Performance & Capacity Planning**

- Analyze system metrics and traffic patterns
- Conduct load testing, chaos testing, and capacity modeling
- Identify bottlenecks and proactively optimize systems

**🤝 Cross-Functional Collaboration**

You’ll work closely with:

- Engineering & Platform teams on scalable system design
- Security teams on IAM, KMS, GuardDuty, secrets management
- Product leaders to align reliability with roadmap priorities
- Cloud vendors and SaaS providers during critical incidents

**🧠 What You Bring**

**Must-Have Experience**

- Bachelor’s degree in Computer Science, Software Engineering, or related field
- Strong Linux/Unix systems knowledge
- Deep AWS experience
- Hands-on Kubernetes (EKS/ECS), Docker, and container orchestration
- Infrastructure as Code (Terraform, CDK, CloudFormation)
- Production on-call and incident management experience
- Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
- Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
- Experience with observability and monitoring platforms
- CI/CD pipeline experience (GitHub, Kubernetes, etc.)

**Nice-to-Have**

- Performance engineering and chaos testing
- Experience in fintech or regulated environments
- Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
- Familiarity with dynamic resource frameworks (Kubernetes, Mesos, Yarn)

Responsibilities

  • Define and manage SLIs, SLOs, and error budgets
  • Reduce MTTD, MTTA, and MTTR through structured incident response
  • Conduct blameless postmortems and drive preventative improvements
  • Champion reliability in architectural reviews and production readiness
  • Design actionable, symptom-based alerts
  • Build dashboards and tracing systems
  • Implement synthetic monitoring
  • Ensure full observability coverage
  • Operate and optimize AWS environments
  • Build resilient, multi-AZ and regionally replicated systems
  • Implement autoscaling and fault-tolerant architecture
  • Eliminate manual processes through automation
  • Build self-healing infrastructure
  • Improve CI/CD pipelines
  • Analyze system metrics and traffic patterns
  • Conduct load testing, chaos testing, and capacity modeling
  • Identify bottlenecks and proactively optimize systems

Skills

AWS, Kubernetes, Docker, Container Orchestration, Terraform, CDK, CloudFormation, Linux, Unix, MongoDB, PostgreSQL, Redis, RabbitMQ, CI/CD, GitHub, Observability, Monitoring
