SME SRE Observability

Info Way Solutions

US · On-site · Full-time · Senior · 5d ago

About the role

Job Title

SME - SRE Observability Engineer

Location

Minnesota (Onsite - 4 to 5 days/week)

Job Summary

We are seeking an experienced Subject Matter Expert (SME) in Site Reliability Engineering (SRE) with a strong focus on Observability. The ideal candidate will be responsible for designing, implementing, and optimizing observability frameworks to ensure high system reliability, performance, and scalability in a production environment.

Key Responsibilities

  • Lead the design and implementation of observability solutions including metrics, logging, and tracing.
  • Act as an SME for SRE best practices, ensuring system reliability, availability, and performance.
  • Develop and maintain dashboards, alerts, and monitoring strategies.
  • Collaborate with development, DevOps, and infrastructure teams to improve system visibility.
  • Perform root cause analysis (RCA) and drive incident resolution.
  • Optimize system performance and reliability through proactive monitoring.
  • Implement automation to improve operational efficiency and reduce manual intervention.
  • Define and track SLIs, SLOs, and SLAs.

Required Skills & Qualifications

  • Strong experience in Site Reliability Engineering (SRE) concepts and practices.
  • Deep expertise in Observability tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, Splunk, or similar).
  • Experience with cloud platforms (AWS, Azure, or GCP).
  • Proficiency in scripting/programming (Python, Bash, or similar).
  • Hands-on experience with monitoring, alerting, and logging frameworks.
  • Strong troubleshooting and performance tuning skills.
  • Experience with CI/CD pipelines and automation tools.

Preferred Qualifications

  • Experience working on high-availability distributed systems.
  • Knowledge of containerization and orchestration tools (Docker, Kubernetes).
  • Prior experience as an SRE SME or Lead.

Skills

AWS, Azure, Bash, Datadog, Docker, ELK Stack, GCP, Grafana, Kubernetes, Prometheus, Python, Splunk
