Site Reliability Engineer (SRE) Observability & Performance

NovBliss

India · On-site · Full-time · 1w ago

About the role

As an Engineer focused on bridging the gap between development and operations, your role will involve:

- Performance Engineering: Analyzing VLT/Evo system bottlenecks and optimizing the kernel, network, or application stack for improved speed and efficiency.

- Service Level Management: Defining, implementing, and defending SLIs and SLOs to maintain the Error Budget balance between feature development and system stability.

- Observability Architecture: Designing a comprehensive monitoring strategy using tools like Prometheus, Loki, and ELK to transition from reactive alerting to proactive issue resolution.

- Toil Reduction: Identifying and automating repetitive manual tasks through code to streamline operational processes and enhance efficiency.

In terms of technical requirements, you should have expertise in:

- Observability Stack: Prometheus, Grafana, and ELK/Loki.

- Automation: Proficiency in Python or Go, with a preference for Go as the SRE standard, along with strong Bash scripting skills.

- Infrastructure: Experience with Kubernetes or specialized Data Center orchestration.

Additionally, a cultural fit for this role would involve embracing a Blameless Post-mortem philosophy, where each outage is seen as an opportunity to learn and improve system architecture.
