Senior Configuration Engineer

Cubical Operations LLP

Hyderabad · On-site · Full-time · Senior · Posted yesterday

About the role

Job Title

SRE Observability Engineer / Senior Observability Engineer

Experience

5–10 Years

Location

Hyderabad - Madhapur

Employment Type

Full-Time

Notice Period

Immediate Joiners Preferred

Job Summary

We are looking for a highly skilled and forward-thinking SRE Observability Engineer to lead the design and implementation of observability solutions across complex, distributed systems. The ideal candidate should have strong expertise in monitoring, logging, and tracing tools, along with a vision for implementing AI-driven observability to enhance system reliability and performance.

This role requires close collaboration with cross-functional teams including Development, DevOps, Infrastructure, and SRE to improve system visibility, incident response, and overall platform stability.

Mandatory Skills

  • Strong hands-on experience in Observability Engineering
  • Expertise in Grafana for visualization and monitoring
  • Advanced experience in Prometheus and Loki, including writing complex queries
  • Proven experience in implementing AI-driven observability / anomaly detection systems

Key Responsibilities

  • Lead the design and implementation of observability solutions (monitoring, logging, tracing) across cloud and on-prem environments
  • Build and manage monitoring tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics
  • Implement distributed tracing frameworks like OpenTelemetry, Jaeger, or Zipkin
  • Optimize log management using tools like Elasticsearch, Splunk, Loki, and Fluentd
  • Develop advanced alerting and anomaly detection mechanisms to reduce MTTR
  • Collaborate with DevOps and SRE teams to integrate observability into CI/CD pipelines and microservices architecture
  • Automate observability workflows using scripting languages (Python, Bash, Golang)
  • Drive scalability and performance improvements across large-scale distributed systems
  • Lead incident troubleshooting, root cause analysis, and system diagnostics
  • Stay updated with the latest trends in observability, SRE, and AI-driven monitoring

Required Qualifications

  • 5–10 years of experience in SRE, Observability, or DevOps roles
  • Strong expertise in Prometheus, Grafana, and Loki (must-have)
  • Experience with cloud platforms: Azure / AWS / GCP
  • Hands-on experience with Kubernetes and containerized environments
  • Strong scripting skills (Python, Bash, or Golang)
  • Experience with Infrastructure as Code tools (Terraform, Ansible)
  • Deep understanding of distributed systems, system performance, and reliability engineering
  • Experience in incident management and production support environments
  • Excellent communication and stakeholder management skills

Preferred Qualifications

  • Experience with AI-driven observability tools and anomaly detection techniques
  • Familiarity with microservices, serverless, and event-driven architectures
  • Experience with on-call support and incident response workflows
  • Relevant certifications in cloud platforms or SRE practices

Key Competencies

  • Strong analytical and problem-solving skills
  • Ownership and accountability
  • Leadership and mentoring ability
  • Ability to work in a fast-paced Agile environment

Skills

Ansible, AppDynamics, Bash, Datadog, Docker, Elasticsearch, Fluentd, Golang, Grafana, Jaeger, Kubernetes, Loki, New Relic, OpenTelemetry, Prometheus, Python, Splunk, Terraform, Zipkin
