Staff Site Reliability Engineer

Zefr

California · Hybrid · Full-time · Senior · $190k – $210k/yr · Posted 3mo ago

About the role

About

Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe.

Responsibilities

Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
Deploy and support a multi‑cloud microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
Collaborate with other engineers, particularly the Machine Learning team, to architect secure, resilient, scalable, and cost‑efficient applications and ML systems/pipelines in AWS and GCP.
Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
Participate in a 24/7 on‑call rotation and respond to system performance issues and outages.
Debug code at the application and infrastructure level.
Mature our CI/CD workflows and release process.
Maintain a forward‑thinking approach, actively researching and proposing new solutions.
Propose and review engineering Requests for Comments (RFCs) to drive engineering architecture and practices.

Technology Stack

Core Infrastructure & Cloud Platforms

Cloud Providers: Google Cloud Platform (primary), Amazon Web Services
Infrastructure as Code (IaC): Terraform, Terragrunt
Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize
Service Mesh: Istio

CI/CD & Automation

CI/CD Pipelines: GitHub Actions
GitOps / Continuous Delivery: Argo CD
Primary Scripting/Automation Language: Python

Observability & Monitoring

Monitoring & Alerting: Prometheus, Chronosphere, PagerDuty
Telemetry Standards: OpenTelemetry

Application & Data Ecosystem (Supporting)

Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React
Data Streaming: Apache Kafka
Data Processing/Transformation: Pandas, dbt
Workflow Orchestration: Apache Airflow, Ray

Data Stores & Databases

Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL)
NoSQL Databases: DynamoDB
Search Databases: OpenSearch
Vector Databases: Qdrant
Caching: Redis
Data Warehousing: Snowflake

Requirements

7+ years of experience designing, managing, deploying, and supporting cloud infrastructure in production environments on major public cloud providers (GCP experience is a huge bonus)
Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
Proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
Production experience architecting, managing, deploying, and supporting container‑based workloads into Kubernetes clusters
Strong problem‑solving experience, focusing on automation
Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning
Heavy production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems
Knowledge of cloud networking (service mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies
Strong written and verbal communication, organization, and documentation skills

Benefits (US‑based employees)

Flexible PTO
Medical, dental, and vision insurance with FSA options
Company‑paid life insurance
Paid parental leave
401(k) with company match
Professional development opportunities
10+ paid holidays
Summer Fridays (we leave early)
In‑office, hybrid, and fully‑remote work options available
In‑office lunches and lots of free food
Optional in‑person and virtual events (we like to celebrate!)

Compensation (US‑based employees)

The anticipated salary for this position is between $190,000 and $210,000. Within the range, individual pay is determined by factors such as job‑related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation.

Equal Opportunity

Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign‑born residents, and veterans to apply even if you do not meet 100% of the qualifications.

Skills

Apache Airflow, Apache Kafka, Argo CD, AWS, AWS Aurora, DBT, Docker, DynamoDB, FastAPI, Flask, GCP, GKE, Grafana, Helm, Istio, Kustomize, Kubernetes, Node.js, OpenSearch, OpenTelemetry, Pandas, PagerDuty, PostgreSQL, Prometheus, Python, Qdrant, Ray, React, Redis, Snowflake, Terraform, Terragrunt, TikTok, Traceable, Twitter, AWS Lambda, GCP Cloud SQL, GitHub Actions
