Senior Site Reliability Engineer

Devopie Inc.

Hamilton · On-site · Full-time · Senior · 4mo ago

About the role

**💡 What You’ll Do**

You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.

**🔎 Reliability Engineering**

- Define and manage **SLIs, SLOs, and error budgets**
- Reduce MTTD, MTTA, and MTTR through structured incident response
- Conduct blameless postmortems and drive preventative improvements
- Champion reliability in architectural reviews and production readiness

**📊 Observability & Monitoring**

- Design actionable, symptom-based alerts (not noise)
- Build dashboards and tracing systems using tools like **CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT**
- Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
- Ensure full observability coverage across critical paths

**☁️ Cloud & Infrastructure**

- Operate and optimize **AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)**
- Build resilient, multi-AZ and regionally replicated systems
- Implement autoscaling and fault-tolerant architecture
- Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)

**🤖 Automation & Toil Reduction**

- Eliminate manual processes through automation
- Build self-healing infrastructure
- Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
- Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java

**📈 Performance & Capacity Planning**

- Analyze system metrics and traffic patterns
- Conduct load testing, chaos testing, and capacity modeling
- Identify bottlenecks and proactively optimize systems

**🤝 Cross-Functional Collaboration**

You’ll work closely with:

- Engineering & Platform teams on scalable system design
- Security teams on IAM, KMS, GuardDuty, secrets management
- Product leaders to align reliability with roadmap priorities
- Cloud vendors and SaaS providers during critical incidents

**🧠 What You Bring**

**Must-Have Experience**

- Bachelor’s degree in Computer Science, Software Engineering, or related field
- Strong Linux/Unix systems knowledge
- Deep AWS experience
- Hands-on Kubernetes (EKS/ECS), Docker, and container orchestration
- Infrastructure as Code (Terraform, CDK, CloudFormation)
- Production on-call and incident management experience
- Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
- Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
- Experience with observability and monitoring platforms
- CI/CD pipeline experience (GitHub, Kubernetes, etc.)

**Nice-to-Have**

- Performance engineering and chaos testing
- Experience in fintech or regulated environments
- Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
- Familiarity with dynamic resource frameworks (Kubernetes, Mesos, YARN)

Skills

AWS, Kubernetes, Docker, Container Orchestration, Terraform, CDK, CloudFormation, Linux, Unix, MongoDB, PostgreSQL, Redis, RabbitMQ, CI/CD, GitHub, Observability, Monitoring
