Senior Configuration Engineer

Cubical Operations LLP

Hyderabad · On-site · Full-time · Senior · Yesterday

About the role

Job Title

SRE Observability Engineer / Senior Observability Engineer

Experience

5 – 10 Years

Location

Hyderabad - Madhapur

Employment Type

Full-Time

Notice Period

Immediate Joiners Preferred

Job Summary

We are looking for a highly skilled and forward-thinking SRE Observability Engineer to lead the design and implementation of observability solutions across complex, distributed systems. The ideal candidate should have strong expertise in monitoring, logging, and tracing tools, along with a vision for implementing AI-driven observability to enhance system reliability and performance.

This role requires close collaboration with cross-functional teams including Development, DevOps, Infrastructure, and SRE to improve system visibility, incident response, and overall platform stability.

Mandatory Skills

Strong hands-on experience in Observability Engineering
Expertise in Grafana for visualization and monitoring
Advanced experience in Prometheus and Loki, including writing complex queries
Proven experience in implementing AI-driven observability / anomaly detection systems

Key Responsibilities

Lead the design and implementation of observability solutions (monitoring, logging, tracing) across cloud and on-prem environments
Build and manage monitoring tools such as Prometheus, Grafana, Datadog, New Relic, and AppDynamics
Implement distributed tracing frameworks like OpenTelemetry, Jaeger, or Zipkin
Optimize log management using tools like Elasticsearch, Splunk, Loki, and Fluentd
Develop advanced alerting and anomaly detection mechanisms to reduce MTTR
Collaborate with DevOps and SRE teams to integrate observability into CI/CD pipelines and microservices architecture
Automate observability workflows using scripting languages (Python, Bash, Golang)
Drive scalability and performance improvements across large-scale distributed systems
Lead incident troubleshooting, root cause analysis, and system diagnostics
Stay updated with the latest trends in observability, SRE, and AI-driven monitoring

Required Qualifications

5–10 years of experience in SRE, Observability, or DevOps roles
Strong expertise in Prometheus, Grafana, and Loki (must-have)
Experience with cloud platforms: Azure / AWS / GCP
Hands-on experience with Kubernetes and containerized environments
Strong scripting skills (Python, Bash, or Golang)
Experience with Infrastructure as Code tools (Terraform, Ansible)
Deep understanding of distributed systems, system performance, and reliability engineering
Experience in incident management and production support environments
Excellent communication and stakeholder management skills

Preferred Qualifications

Experience with AI-driven observability tools and anomaly detection techniques
Familiarity with microservices, serverless, and event-driven architectures
Experience with on-call support and incident response workflows
Relevant certifications in cloud platforms or SRE practices

Key Competencies

Strong analytical and problem-solving skills
Ownership and accountability
Leadership and mentoring ability
Ability to work in a fast-paced Agile environment

Skills

Ansible, AppDynamics, Bash, Datadog, Docker, Elasticsearch, Fluentd, Golang, Grafana, Jaeger, Kubernetes, Loki, New Relic, OpenTelemetry, Prometheus, Python, Splunk, Terraform, Zipkin
