SME SRE Observability
Info Way Solutions
US · On-site · Full-time · Senior · 5d ago
About the role
Job Title
SME - SRE Observability Engineer
Location
Minnesota (Onsite - 4 to 5 days/week)
Job Summary
We are seeking an experienced Subject Matter Expert (SME) in Site Reliability Engineering (SRE) with a strong focus on Observability. The ideal candidate will be responsible for designing, implementing, and optimizing observability frameworks to ensure high system reliability, performance, and scalability in a production environment.
Key Responsibilities
- Lead the design and implementation of observability solutions including metrics, logging, and tracing.
- Act as an SME for SRE best practices, ensuring system reliability, availability, and performance.
- Develop and maintain dashboards, alerts, and monitoring strategies.
- Collaborate with development, DevOps, and infrastructure teams to improve system visibility.
- Perform root cause analysis (RCA) and drive incident resolution.
- Optimize system performance and reliability through proactive monitoring.
- Implement automation to improve operational efficiency and reduce manual intervention.
- Define and track SLIs, SLOs, and SLAs.
Required Skills & Qualifications
- Strong experience in Site Reliability Engineering (SRE) concepts and practices.
- Deep expertise in Observability tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, Splunk, or similar).
- Experience with cloud platforms (AWS, Azure, or GCP).
- Proficiency in scripting/programming (Python, Bash, or similar).
- Hands-on experience with monitoring, alerting, and logging frameworks.
- Strong troubleshooting and performance tuning skills.
- Experience with CI/CD pipelines and automation tools.
Preferred Qualifications
- Experience working with high-availability distributed systems.
- Knowledge of containerization and orchestration tools (Docker, Kubernetes).
- Prior experience as an SRE SME or Lead.
Skills
AWS, Azure, Bash, Datadog, Docker, ELK Stack, GCP, Grafana, Kubernetes, Prometheus, Python, Splunk