Site Reliability Engineer
Avrioc Technologies
About the role
We’re looking for a seasoned DevOps & Site Reliability Engineer (SRE) Lead to design, scale, and enhance our cloud infrastructure and observability ecosystem. If you’re passionate about automation, resilience, and reliability, this role is for you!
Requirements
- 8+ years of experience as a DevOps/SRE professional, leading enterprise SRE implementations.
- Hands-on experience with AWS, GCP, or Azure (EC2, S3, RDS, Lambda, etc.).
- Strong with IaC tools (Terraform, CloudFormation, Ansible).
- Proven experience in CI/CD automation, monitoring, and incident response.
- Skilled in observability tools: Elastic Stack, Grafana, Prometheus, Dynatrace, New Relic.
- Experience with AWS managed & self-managed databases (MySQL, Cassandra, etc.).
- Skilled in Python, Bash, or Go scripting.
- Experience designing and testing BCP/DR strategies.
- Proactive in capacity planning, ensuring scalability and resilience across cloud environments.
- Excellent communication, documentation, and troubleshooting skills.
Responsibilities
- Architect and deploy scalable, highly available cloud infrastructure for production workloads.
- Lead and implement SRE best practices, ensuring system reliability, performance, and scalability.
- Oversee and optimize CI/CD pipelines (Jenkins, Argo CD, or similar) for seamless deployments.
- Define and monitor SLOs & SLIs to ensure service reliability and uptime.
- Design and manage observability frameworks for monitoring, logging, and alerting (Elastic Stack, Prometheus, Grafana, Dynatrace, New Relic).
- Manage and optimize Kubernetes clusters and Helm charts for efficient orchestration and streamlined releases.
- Implement auto-healing and proactive monitoring systems to prevent outages.
- Drive fault injection testing & chaos engineering (Chaos Mesh, Litmus, AWS FIS) for resilience validation.
- Collaborate with engineering and product teams to embed reliability into every phase of development.
- Maintain clear documentation on infrastructure, incidents, and operational processes.
- Comply with Avrioc’s Information Security & Service Management policies.
- Maintain the confidentiality and integrity of all information assets.
- Attend mandatory information security trainings.
- Report any security incidents through official channels.