DevOps SRE Engineer - Observability & Automation
TAT IT Technologies
About the role
Urgent requirement: a DevOps SRE Engineer - Observability & Automation is required for our banking clients in Abu Dhabi, UAE.
- Strong experience in Kafka, RabbitMQ, Redis, RDS/Aurora (must)
- Strong experience in observability (metrics, logs, traces, dashboards, and alerts) (must)
- Strong experience in Kubernetes, Docker, container orchestration, and microservices support (must)
- Strong experience in Terraform and IaC practice (must)
- Strong experience in Linux environments and performance troubleshooting (must)
- Strong experience in banking (must)
We’re looking for a talented Site Reliability Engineer (SRE) to keep our systems running smoothly, reliably, and at scale. Through smart automation, deep observability, and a calm head in a crisis, you’ll help us balance speed, compliance, and stability, working alongside DevOps, Cloud, Quality Engineering, and Product teams to drive continuous improvements in performance, security, and resilience.
- Define and implement SLIs / SLOs and error budgets for business-critical digital banking services.
- Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace, Prometheus, Grafana, and ELK, while reducing alert fatigue.
- Leverage AI-driven insights and anomaly detection (Dynatrace Davis AI or equivalent AIOps platform) to proactively predict and resolve reliability issues before impact.
- Lead incident management, from on-call triage and root-cause analysis to blameless postmortems with actionable follow-ups.
- Improve deployment safety with robust rollout / rollback strategies, canary and blue-green deployments, and production readiness reviews.
- Support and optimize microservices-based architectures, ensuring service reliability, scalability, and inter-service resilience.
- Conduct capacity planning, performance tuning, and resilience testing, optimizing for both reliability and cost efficiency.
- Automate operational toil, from runbooks and remediation scripts to proactive health checks and self-healing workflows.
- Collaborate with DevOps to embed reliability gates and validations into CI / CD pipelines (GitHub Actions, Jenkins, GitLab CI / CD or Azure DevOps).
- Own and evolve the observability and AIOps stack, driving intelligent automation and predictive alerting capabilities.
- Maintain high-quality documentation, playbooks, and operational standards across environments.
- Ensure operational compliance and security alignment with internal controls and regulatory standards.
- Analyze system performance, availability, and cost data to continually optimize operations.
- Provide reliability support and escalation guidance for critical production systems during major incidents.

- 5+ years of experience in SRE or DevOps roles, building and managing large-scale, high-availability systems across banking, fintech, e-commerce, or other data-intensive digital ecosystems.
- Bachelor’s degree in Computer Science or equivalent technical experience.
- Strong experience with Linux environments and performance troubleshooting.
- Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.
- Proficiency with Kubernetes and container orchestration in microservices environments.
- Hands-on experience with AWS (preferred); exposure to Azure or GCP is an advantage.
- Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.
- Experience implementing AI / ML-driven reliability or automation solutions (AIOps, anomaly detection, predictive alerting).
- Practical understanding of CI / CD pipelines (GitHub Actions, Jenkins, GitLab CI / CD or Azure DevOps).
- Experience with Kafka, RabbitMQ, Redis, Aurora, and RDS databases.
- Strong scripting or programming skills in Python, Bash, or Go.
Skills: automation, devops, sre
Requirements
- 5+ years of experience in SRE or DevOps roles
- Bachelor's degree in Computer Science or equivalent technical experience
- Strong experience with Linux environments and performance troubleshooting
- Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies
- Proficiency with Kubernetes and container orchestration in microservices environments
- Hands-on experience with AWS (preferred); exposure to Azure or GCP is an advantage
- Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack
- Experience implementing AI / ML-driven reliability or automation solutions
- Practical understanding of CI / CD pipelines
- Experience with Kafka, RabbitMQ, Redis, Aurora, and RDS databases
- Strong scripting or programming skills in Python, Bash, or Go
Responsibilities
- Define and implement SLIs / SLOs and error budgets for business-critical digital banking services
- Build actionable observability using Dynatrace, Prometheus, Grafana, and ELK
- Leverage AI-driven insights and anomaly detection to proactively predict and resolve reliability issues
- Lead incident management
- Improve deployment safety with robust rollout / rollback strategies
- Support and optimize microservices-based architectures
- Conduct capacity planning, performance tuning, and resilience testing
- Automate operational toil
- Collaborate with DevOps to embed reliability gates and validations into CI / CD pipelines
- Own and evolve the observability and AIOps stack
- Maintain high-quality documentation, playbooks, and operational standards
- Ensure operational compliance and security alignment
- Analyze system performance, availability, and cost data
- Provide reliability support and escalation guidance for critical production systems