Senior / Staff DevOps & Site Reliability Engineer

Superscale

Baunatal · Hybrid · Senior · 3mo ago

About the Role

We're scaling fast — and we want to do it without the chaos that usually comes with it.

As our Senior/Staff DevOps & Site Reliability Engineer, you'll own the infrastructure that powers Superscale's AI platform. But this isn't a traditional "keep the lights on" SRE role. You'll be building an infrastructure layer designed for a new kind of engineering team: one where every developer works alongside multiple AI coding agents, and the infra itself is a force multiplier.

You'll be our first dedicated infrastructure hire, which means you get to set the standard — from observability and incident response to CI/CD pipelines and cloud architecture. You'll make sure we scale smoothly as load, team size, and AI workloads grow, and you'll be the counterpart engineers rely on to ship systems that are resilient from day one.

We believe in hiring for breadth and building leverage through AI tooling. We're not growing the team by stacking people in the same roles — we're hiring unique skill sets and amplifying everyone through best-in-class infrastructure and AI-native workflows. You'll be central to making that philosophy real.

Key Responsibilities

Own and evolve our AWS infrastructure: containerized services, networking, security, and cost optimization — building toward a setup that scales with both user load and AI workloads
Design and implement state-of-the-art monitoring, alerting, and observability with Datadog (no more "is this broken for everyone?" Slack messages — you'll know before anyone asks)
Build proactive systems for incident detection and response — shifting the team from reactive firefighting to confident, data-informed operations
Architect and deploy infrastructure for AI-native development: cloud-based coding agent environments where multiple agents per developer can build, test, and deploy in parallel
Prepare our infrastructure for AI-specific load patterns: bursty GPU/LLM workloads, intelligent request routing, and cost-efficient scaling strategies
Create a developer platform that treats coding agents as first-class citizens — giving them access to the same data, tools, secrets, and deployment pipelines that human engineers use
Design CI/CD pipelines and deployment workflows that are fast, reliable, and safe — optimized for high-frequency pushes from both humans and agents
Partner with the engineering team to build systems that are scalable and future-proof at the architecture level, not patched after the fact
Establish infrastructure-as-code practices, documentation, and runbooks that make the whole team more autonomous

Requirements

5+ years of experience in DevOps, SRE, or platform engineering, with deep hands-on AWS expertise
Strong experience with container orchestration (ECS or Kubernetes), infrastructure-as-code (Terraform, Pulumi), and modern CI/CD systems (e.g., GitHub Actions)
Proven track record of building observability stacks (Datadog, Grafana, Prometheus, CloudWatch, or similar) that actually prevent incidents, not just log them
Experience designing infrastructure for service-oriented architectures with relational databases and modern web frontends (Next.js experience is a plus)
You understand load balancing, auto-scaling, and cost optimization at a level where you can make real architectural trade-offs
Security-minded: you bake in least-privilege access, secrets management, and network segmentation without making developers hate their lives
AI-native working style: you actively use LLMs, coding agents, and automation tools in your own workflow. We're building toward 10x coding agents per developer — you'll be the one making that infrastructure possible
Strong communicator who can translate infrastructure decisions into language the product and engineering teams understand

Nice to Have

Experience building developer platforms or internal tooling that measurably improved team velocity
Background in managing AI/ML infrastructure: GPU scheduling, model serving, LLM gateway/proxy setups
Experience at an early-stage startup where you built infra foundations that lasted through 10x growth
Contributions to open-source infrastructure or DevOps tooling

What We Offer

Competitive salary and equity/stock options in a high-growth AI company
Flexible remote or hybrid work arrangement
Generous paid time off and company holidays
Professional development budget for conferences, courses, and certifications
Greenfield opportunity — you're setting the infrastructure standard
A team that values horizontal skill over narrow specialization, and is investing in AI tooling and agent infrastructure
Direct, visible impact — every engineer and every AI agent on the team will feel the quality of what you build

How to Apply

Please send your application to magnus@superscale.ai with your LinkedIn / GitHub profile and a short note on why this role excites you and what you'd change about our setup on day one.

We are an equal opportunity employer and welcome candidates of all backgrounds.

Skills

AWS, CloudWatch, Datadog, ECS, GitHub Actions, Grafana, Kubernetes, Next.js, Prometheus, Pulumi, Terraform
