Site Reliability Engineer

Monstro

New York · On-site · Full-time · Mid Level · $142k – $215k/yr · Posted 2mo ago

About the position

Monstro is building a secure, multi-tenant platform on Google Cloud, and we’re hiring a Site Reliability Engineer to own the reliability and observability of that platform end-to-end. This is a hands-on role for someone who wants to do real SRE work, not a rebrand of L1 support. You’ll write the dashboards, define the SLOs, build the automation that kills toil, and take your turn on the on-call rotation that proves it all works. When something breaks at 2 AM, you’re the person who gets the platform running again; when nothing’s breaking, you’re the person making sure the next break is smaller, shorter, or doesn’t happen at all.

Responsibilities

Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
Tune alert routing so every page is actionable, and remove the alerts that aren’t
Instrument services for distributed tracing and structured logging; push back on services that ship without it
Own error budgets and use them to prioritize reliability work over feature work when a budget is burned
Reduce toil: automate away the top recurring page from the previous quarter
Maintain runbooks so every page maps to a runbook within one cycle of its first occurrence
Act as first responder for production alerts across monitoring, API gateway, edge defense, and CI
Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
Own internal and external incident comms during your shift
Drive postmortems to closure with action items tracked as audit evidence
Write clean handoffs at the end of each shift

Requirements

Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
Scripting / coding fluency (Python, Go, Bash) for automation and tooling
Good written communication: handoffs, postmortems, and runbooks are part of the job
Bias toward fixing the system, not the symptoms

Nice-to-haves

Apigee or another enterprise API gateway in production
BigQuery for log analytics or audit
Experience standing up observability from scratch, not just maintaining inherited dashboards
Experience in SOC 2 or similar compliance environments

Benefits

Competitive salary
Equity
Robust benefits package
Paid health coverage
Vision coverage
Dental coverage
Disability coverage

Skills

API gateway, BigQuery, Bash, Go, Google Cloud Monitoring, GCP, IaC, Kubernetes, Python, structured logging
