Senior Configuration Engineer
Cubical Operations LLP
About the role
Job Title
SRE Observability Engineer / Senior Observability Engineer
Experience
5 – 10 Years
Location
Hyderabad - Madhapur
Employment Type
Full-Time
Notice Period
Immediate Joiners Preferred
Job Summary
We are looking for a highly skilled and forward-thinking SRE Observability Engineer to lead the design and implementation of observability solutions across complex, distributed systems. The ideal candidate should have strong expertise in monitoring, logging, and tracing tools, along with a vision for implementing AI-driven observability to enhance system reliability and performance.
This role requires close collaboration with cross-functional teams including Development, DevOps, Infrastructure, and SRE to improve system visibility, incident response, and overall platform stability.
Mandatory Skills
- Strong hands-on experience in Observability Engineering
- Expertise in Grafana for visualization and monitoring
- Advanced experience in Prometheus & Loki, including writing complex queries
- Proven experience in implementing AI-driven observability / anomaly detection systems
Key Responsibilities
- Lead the design and implementation of observability solutions (monitoring, logging, tracing) across cloud and on-prem environments
- Build and manage monitoring stacks using tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics
- Implement distributed tracing frameworks like OpenTelemetry, Jaeger, or Zipkin
- Optimize log management using tools like Elasticsearch, Splunk, Loki, and Fluentd
- Develop advanced alerting and anomaly detection mechanisms to reduce MTTR
- Collaborate with DevOps and SRE teams to integrate observability into CI/CD pipelines and microservices architecture
- Automate observability workflows using scripting languages (Python, Bash, Golang)
- Drive scalability and performance improvements across large-scale distributed systems
- Lead incident troubleshooting, root cause analysis, and system diagnostics
- Stay updated with the latest trends in observability, SRE, and AI-driven monitoring
Required Qualifications
- 5–10 years of experience in SRE, Observability, or DevOps roles
- Strong expertise in Prometheus, Grafana, and Loki (must-have)
- Experience with cloud platforms: Azure / AWS / GCP
- Hands-on experience with Kubernetes and containerized environments
- Strong scripting skills (Python, Bash, or Golang)
- Experience with Infrastructure as Code tools (Terraform, Ansible)
- Deep understanding of distributed systems, system performance, and reliability engineering
- Experience in incident management and production support environments
- Excellent communication and stakeholder management skills
Preferred Qualifications
- Experience with AI-driven observability tools and anomaly detection techniques
- Familiarity with microservices, serverless, and event-driven architectures
- Experience with on-call support and incident response workflows
- Relevant certifications in cloud platforms or SRE practices
Key Competencies
- Strong analytical and problem-solving skills
- Ownership and accountability
- Leadership and mentoring ability
- Ability to work in a fast-paced Agile environment