
Senior Site Reliability Engineer – AI Operations

TechInsights Inc.

Egremont · On-site · Full-time · Senior · 1mo ago

About the role

About TechInsights

TechInsights is building the reliability and AI operations foundation for its next chapter: an AI-first intelligence platform that runs the most demanding semiconductor intelligence workflows in the world. We’re looking for a Senior Site Reliability Engineer who owns that foundation.

As a senior individual contributor at the technical leadership tier, you will own strategic reliability initiatives end‑to‑end: setting technical direction, defining SLOs and error budgets across our production platform, designing reliability patterns for AI agent pipelines, and enabling our development and AI Engineering teams to build and ship with confidence.

Platform Reliability & AI Operations

Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering.
Design reliability patterns for AI agent pipelines: LLM observability, tool‑use tracking, failure detection, and graceful degradation.
Architect blast radius containment so agent failures have bounded customer impact through isolation, circuit breaking, and rapid recovery.
Mature our Canada Central/West active‑active architecture toward 24‑hour RTO with full regional failover.
Lead incident response and post‑incident reviews that produce durable fixes; maintain DR procedures through regular testing.

Developer & AI Engineering Enablement

Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards.
Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation.
Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions): set standards, optimize deployment frequency, and ensure teams can ship confidently.
Drive IDP adoption and enable teams on SRE practices: on‑call readiness, SLO definition, runbook development, and self‑service tooling.
Represent reliability in architectural discussions; surface risk before it’s committed to design.

Observability, IDP & Service Catalog

Own the service catalog — a living inventory of all services, AI agents, dependencies, ownership, and SLOs.
Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry.
Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput.
Build golden path templates in Backstage or Atlassian Compass so teams ship reliably without routine SRE involvement.
Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations.

FinOps, IaC & Continuous Improvement

Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance.
Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale.
Formally mentor junior and intermediate SREs, with accountability for their technical growth and career progression.
Build AI‑assisted automation to progressively reduce toil and scale the team's operational capacity.

What You’ll Bring

Technical Requirements

Bachelor's degree in Computer Science, Engineering, or an equivalent combination of education and experience.
6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level.
Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi‑region architecture patterns.
Proficiency with Terraform and GitOps; experience with policy‑as‑code (Sentinel, OPA/Rego, or equivalent).
Hands‑on Datadog experience at operational depth: dashboards, SLO tracking, alerting, log management, distributed tracing.
Strong containerization expertise: Docker, Kubernetes (EKS preferred).
Proficiency in Python and/or Bash; experience building operational tooling; solid understanding of Java and Spring Boot microservice architecture for EKS‑hosted services.
Deep expertise in CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions.
Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) strongly preferred.
Experie…

Skills

AWS Config, AWS Lambda, AWS, Backstage, Bash, Bitbucket Pipelines, CloudWatch, Datadog, Docker, EKS, GitHub Actions, GitOps, Java, Kubernetes, OPA/Rego, Python, Sentinel, Spring Boot, Terraform


