Site Reliability Engineer (SRE) — Monstro US

Monstro

New York · On-site · Full-time · $142k – $215k/yr · Posted 4d ago

About the role

Monstro is building a secure, multi-tenant platform on Google Cloud, and we’re hiring a Site Reliability Engineer to own the reliability and observability of that platform end-to-end. This is a hands-on role for someone who wants to do real SRE work, not a rebrand of L1 support. You’ll write the dashboards, define the SLOs, build the automation that kills toil, and take your turn on the on-call rotation that proves it all works. When something breaks at 2 AM, you’re the person who keeps it running; when nothing’s breaking, you’re the person making sure the next break is smaller, shorter, or doesn’t happen at all.

Responsibilities

• Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
• Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
• Tune alert routing so every page is actionable; kill the rest
• Instrument services for distributed tracing and structured logging; push back on services that ship without it
• Own error budgets and use them to prioritize reliability work over feature work when burned
• Reduce toil: automate the top recurring page from the previous quarter
• Maintain runbooks so every page maps to one within a cycle of first occurrence
• First responder for production alerts across monitoring, API gateway, edge defense, and CI
• Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
• Own internal and external incident comms during your shift
• Drive postmortems to closure with action items tracked as audit evidence
• Clean written handoffs at end of shift

Requirements

• Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
• Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
• Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
• Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
• Scripting / coding fluency (Python, Go, Bash) for automation and tooling
• Good written communication: handoffs, postmortems, and runbooks are part of the job
• Bias toward fixing the system, not the symptoms

Nice-to-haves

• Apigee or another enterprise API gateway in production
• BigQuery for log analytics or audit
• Experience standing up observability from scratch, not just maintaining inherited dashboards
• Experience in SOC 2 or similar compliance environments

Benefits

• Competitive salary
• Equity
• Robust benefits package
• Paid health coverage
• Vision coverage
• Dental coverage
• Disability coverage
