Staff SRE for AI Workloads and Observability

Sitetracker

Sunset House · On-site · Full-time · Lead · Posted 3w ago

About the role

Take on a critical role as a Staff Site Reliability Engineer focused on optimizing AI workloads. This position empowers you to define engineering standards and improve our reliability practices.

You will partner with existing engineers to move the organization toward a proactive approach to reliability. Your work will strengthen incident response and establish tooling that improves system observability and response metrics, enabling sustainable improvements.

Contribute to our mission by enhancing system reliability and creating a robust foundation for future engineering innovations.

Responsibilities

  • Set SLIs and SLOs based on critical user interactions
  • Direct production incident responses and lead effective postmortems
  • Develop meaningful dashboards that enhance system understanding
  • Mentor the engineering team to build technical skills
  • Lead architectural improvements for reliability tools

Requirements

  • Deep experience with Site Reliability Engineering and AWS
  • Background in handling production incidents effectively
  • Strong communication and influence within engineering environments
  • Ability to write instructive runbooks and alerts
  • Proven track record in strategic reliability implementations

Skills

  • AWS
  • Site Reliability Engineering
