Senior Site Reliability Engineer
Devopie Inc.
About the role
**💡 What You’ll Do**
You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.

**🔎 Reliability Engineering**
- Define and manage **SLIs, SLOs, and error budgets**
- Reduce MTTD, MTTA, and MTTR through structured incident response
- Conduct blameless postmortems and drive preventative improvements
- Champion reliability in architectural reviews and production readiness

**📊 Observability & Monitoring**
- Design actionable, symptom-based alerts (not noise)
- Build dashboards and tracing systems using tools like **CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT**
- Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
- Ensure full observability coverage across critical paths

**☁️ Cloud & Infrastructure**
- Operate and optimize **AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)**
- Build resilient, multi-AZ and regionally replicated systems
- Implement autoscaling and fault-tolerant architecture
- Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)

**🤖 Automation & Toil Reduction**
- Eliminate manual processes through automation
- Build self-healing infrastructure
- Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
- Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java

**📈 Performance & Capacity Planning**
- Analyze system metrics and traffic patterns
- Conduct load testing, chaos testing, and capacity modeling
- Identify bottlenecks and proactively optimize systems

**🤝 Cross-Functional Collaboration**
You’ll work closely with:
- Engineering & Platform teams on scalable system design
- Security teams on IAM, KMS, GuardDuty, secrets management
- Product leaders to align reliability with roadmap priorities
- Cloud vendors and SaaS providers during critical incidents

**🧠 What You Bring**

**Must-Have Experience**
- Bachelor’s degree in Computer Science, Software Engineering, or related field
- Strong Linux/Unix systems knowledge
- Deep AWS experience
- Hands-on Kubernetes (EKS/ECS), Docker, and container orchestration
- Infrastructure as Code (Terraform, CDK, CloudFormation)
- Production on-call and incident management experience
- Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
- Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
- Experience with observability and monitoring platforms
- CI/CD pipeline experience (GitHub, Kubernetes, etc.)

**Nice-to-Have**
- Performance engineering and chaos testing
- Experience in fintech or regulated environments
- Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
- Familiarity with dynamic resource frameworks (Kubernetes, Mesos, Yarn)