
Platform Reliability Engineer; Agentic AI

Search Atlas

On-site · Full-time · Senior


[Your Name]
[City, State] • [Phone] • [Email] • [LinkedIn] • [GitHub]


Cover Letter – Platform Reliability Engineer (Agentic AI)

Dear Hiring Team,

I’m excited to apply for the Platform Reliability Engineer (Agentic AI) role on the Atlas Brain team. Over the past 7 years I have built and operated large‑scale, Kubernetes‑native platforms that power autonomous ML workloads for high‑growth SaaS companies. My career has been defined by turning “manual‑only” processes into zero‑touch, self‑healing systems—exactly the mindset required to give Atlas Brain the reliability backbone it needs to replace manual marketing execution.

Below is a concise map of how my experience aligns with every pillar of the mission you described.


1. Architecting the Autonomous Backbone

  • Kubernetes (EKS/GKE) at massive scale – Designed and ran a multi-region EKS fleet (~150 nodes, 30k pods) serving >10M requests/day for a real-time recommendation engine. Implemented custom CNI policies, pod security standards, and multi-tenant namespaces for isolated agency workloads.
  • Distributed agentic workers – Built a "worker-as-a-service" framework in which each worker runs a containerized LLM inference server plus a task router. The system executes 2M SEO crawls and 1.5M ad-bid adjustments per day with sub-second latency.
  • High-concurrency crawling & real-time bidding – Developed a Go-based crawler orchestrated by KEDA that auto-scales to 10k concurrent fetches, feeding a Kafka-based SERP stream consumed by the agent decision engine. Integrated with the Google Ads API for real-time bid updates, achieving p95 latency under 150 ms.
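As a sketch of the bounded-concurrency fan-out pattern behind that crawler (the production version is Go; this is an illustrative Python equivalent, with `fetch` standing in for the real HTTP client):

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrency=100):
    """Bounded-concurrency fetch fan-out: the semaphore caps in-flight
    requests per worker, while KEDA scales the workers horizontally.
    Results come back in the same order as the input URLs."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```

Each worker pod caps its own in-flight requests this way; scaling to 10k concurrent fetches is then a matter of replica count rather than per-process tuning.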

2. Zero‑Touch Automation

  • Terraform + Terragrunt + GitOps – All infrastructure (VPC, EKS, IAM, Karpenter, RDS, S3) lives in a single Terraform monorepo, version‑controlled and automatically promoted via ArgoCD pipelines. Every change is validated with terraform plan in CI, then applied without human touch.
  • Self‑service tooling – Built a Python‑CLI (atlasctl) that provisions a new agency tenant (namespace, RBAC, resource‑quota, secret‑store) in < 30 seconds. The CLI is used by internal devs and external partners alike.
  • Automated disaster‑recovery drills – Nightly ArgoCD‑driven chaos tests that simulate node loss, network partition, and model‑registry corruption; failures trigger auto‑remediation playbooks written in Go.
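To illustrate the shape of the provisioning step behind such a CLI (a minimal sketch: `tenant_manifests`, the quota values, and the group naming are illustrative, not the actual atlasctl internals):

```python
def tenant_manifests(tenant: str, cpu_limit: str = "16",
                     mem_limit: str = "64Gi") -> list[dict]:
    """Build the Kubernetes manifests for one agency tenant:
    a namespace, a resource quota, and a namespace-scoped admin binding."""
    ns_name = f"tenant-{tenant}"
    namespace = {"apiVersion": "v1", "kind": "Namespace",
                 "metadata": {"name": ns_name}}
    quota = {"apiVersion": "v1", "kind": "ResourceQuota",
             "metadata": {"name": "tenant-quota", "namespace": ns_name},
             "spec": {"hard": {"limits.cpu": cpu_limit,
                               "limits.memory": mem_limit}}}
    binding = {"apiVersion": "rbac.authorization.k8s.io/v1",
               "kind": "RoleBinding",
               "metadata": {"name": "tenant-admin", "namespace": ns_name},
               "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                           "kind": "ClusterRole", "name": "admin"},
               "subjects": [{"kind": "Group",
                             "name": f"{ns_name}-admins",
                             "apiGroup": "rbac.authorization.k8s.io"}]}
    return [namespace, quota, binding]
```

The CLI applies these manifests (plus secret-store wiring) in one pass, which is what keeps tenant creation under 30 seconds.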

3. Radical Reliability for AI Execution

  • SLO/SLI definition – Introduced AI Task Success Rate (target 99.99%) and Decision Latency (p99 < 300 ms) as primary SLOs, alongside classic uptime metrics. Built Prometheus alerts that fire when the success rate drops below 99.95% for 2 minutes, automatically rolling back to the previous stable model version.
  • Self‑healing – Implemented a Karpenter‑based auto‑scaler that reacts to inference queue length, and a custom controller that watches model health (error‑rate, latency) and performs a blue‑green rollout if thresholds are breached.
  • Chaos engineering – Integrated LitmusChaos to inject latency, pod failures, and API throttling into the agent pipeline; results feed directly into the SLO dashboard for continuous improvement.
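The rollback trigger can be sketched as a small guard over a sliding window of scrape intervals (illustrative class and interval count, mirroring the 99.95%-for-2-minutes condition above):

```python
from collections import deque


class SloRollbackGuard:
    """Track per-interval task outcomes and signal a rollback when the
    success rate stays below the alert threshold for a full window."""

    def __init__(self, threshold: float = 0.9995, window_intervals: int = 8):
        # e.g. 8 x 15s scrape intervals ~= the 2-minute alert window
        self.threshold = threshold
        self.window = deque(maxlen=window_intervals)

    def record(self, succeeded: int, total: int) -> None:
        self.window.append(succeeded / total if total else 1.0)

    def should_rollback(self) -> bool:
        # Fire only when every interval in a *full* window breaches the SLO;
        # a single healthy interval resets the condition.
        return (len(self.window) == self.window.maxlen
                and all(rate < self.threshold for rate in self.window))
```

Requiring a full breached window is what keeps a single noisy scrape from triggering an unnecessary model rollback.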

4. Observability for Agent Decisions

  • OpenTelemetry – Instrumented every stage of the agent workflow (crawling → embedding → policy evaluation → action) with trace IDs that propagate across services (Python FastAPI, Go workers, TensorFlow Serving).
  • Grafana/Prometheus dashboards – Real‑time view of Decision‑Tree depth, Model‑Version usage, and Budget‑Compliance metrics per agency.
  • Alerting on “why” – When a task fails, the trace is automatically enriched with the LLM prompt, model version, and policy rule that produced the decision, enabling a one‑click root‑cause analysis.
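A minimal sketch of that enrichment step (attribute names are illustrative; in the real pipeline the dict would be attached to the failing span via OpenTelemetry's `span.set_attributes`):

```python
def failure_attributes(prompt: str, model_version: str, policy_rule: str,
                       max_prompt_chars: int = 2000) -> dict[str, str]:
    """Build the span attributes attached to a failed agent task so the
    trace itself answers 'why': the prompt, model, and policy involved.
    Keys follow OpenTelemetry's dot-separated attribute-naming style."""
    return {
        # Truncate the prompt: span attributes have practical size limits.
        "agent.llm.prompt": prompt[:max_prompt_chars],
        "agent.llm.model_version": model_version,
        "agent.policy.rule": policy_rule,
    }
```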

5. Safety & Guardrails

  • Policy Engine – Developed a declarative policy language (YAML) that encodes budget caps, keyword black‑lists, and compliance rules. The engine evaluates every agent action before execution; violations are logged and optionally auto‑escalated to a human reviewer.
  • Human‑in‑the‑Loop (HITL) – Integrated Slack/Teams bots that surface “edge‑case” decisions (e.g., > 10 % budget shift) for manual approval. The system records the approval decision and feeds it back into the reinforcement‑learning loop.
  • Audit Trail – All agent actions are persisted to an immutable CloudTrail‑backed audit log, searchable via ElasticSearch for compliance reviews.
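A minimal sketch of how such a policy check can work, using plain dicts to mirror the YAML shape (the rule names and fields are illustrative, not the production policy language):

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    allowed: bool
    violations: list = field(default_factory=list)


def evaluate(action: dict, policy: dict) -> Verdict:
    """Check one proposed agent action against a declarative policy:
    a budget cap and a keyword blacklist, evaluated before execution."""
    violations = []
    cap = policy.get("budget_cap_usd")
    if cap is not None and action.get("spend_usd", 0) > cap:
        violations.append(f"budget_cap_usd: {action['spend_usd']} > {cap}")
    blacklist = set(policy.get("keyword_blacklist", []))
    hits = blacklist.intersection(action.get("keywords", []))
    if hits:
        violations.append(f"keyword_blacklist: {sorted(hits)}")
    return Verdict(allowed=not violations, violations=violations)
```

Every violation is logged with the rule that fired, which is what feeds the escalation path to a human reviewer.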

6. Cost & Performance Governance

  • KEDA + Spot Instances – Leveraged KEDA to scale inference pods on demand, combined with EC2 Spot fleets managed by Karpenter, cutting compute spend by ~45% while maintaining latency SLAs.
  • Resource‑Quota & Autoscaling – Implemented per‑tenant CPU/Memory quotas and a custom “budget‑aware” autoscaler that throttles agents when an agency’s spend approaches its limit.
  • Continuous Cost‑Optimization – Daily Cost‑Explorer reports feed a Python‑based optimizer that suggests right‑sizing of node groups; recommendations are auto‑applied via Terraform.
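The budget-aware throttle can be sketched as a pure function from spend ratio to a replica cap (the 80% soft limit and the linear taper are illustrative choices):

```python
def throttled_replicas(desired: int, spent_usd: float, budget_usd: float,
                       soft_limit: float = 0.8) -> int:
    """Budget-aware throttle: full scale below the soft limit, then taper
    replicas linearly toward 1 as spend approaches 100% of budget."""
    if budget_usd <= 0:
        return 0
    ratio = spent_usd / budget_usd
    if ratio >= 1.0:
        return 0          # budget exhausted: stop scheduling new agent work
    if ratio <= soft_limit:
        return desired    # plenty of headroom: honor the demand signal
    # Between the soft limit and 100%: scale down proportionally, keep >= 1
    # so in-flight campaigns can still drain gracefully.
    remaining = (1.0 - ratio) / (1.0 - soft_limit)
    return max(1, int(desired * remaining))
```

Keeping this as a pure function makes the autoscaler's behavior near the budget boundary trivially unit-testable.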

7. MLOps for Autonomous Agents

  • Model Registry & Versioning – Used MLflow + S3 to store LLM checkpoints, embeddings, and policy models. Automated CI pipelines run unit‑tests, performance benchmarks, and canary deployments for every new version.
  • Prompt Management – Built a Prompt‑Store service (PostgreSQL + Redis) with A/B testing capabilities; the agent selects a prompt variant based on real‑time CTR feedback.
  • A/B Testing of Behaviors – Deployed a traffic‑splitting controller that routes 5 % of requests to a new policy version, collects KPI metrics (conversion, cost per acquisition), and automatically promotes the version if it meets a predefined uplift threshold.
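The split-and-promote logic can be sketched as deterministic hashing plus an uplift check (the 5% share mirrors the description above; the 2% uplift threshold and function names are illustrative):

```python
import hashlib


def assign_variant(request_id: str, candidate_share: float = 0.05) -> str:
    """Deterministically route a fixed share of traffic to the candidate
    policy version; the same request ID always gets the same variant."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "stable"


def should_promote(stable_kpi: float, candidate_kpi: float,
                   min_uplift: float = 0.02) -> bool:
    """Promote only when the candidate beats stable by the uplift threshold."""
    return candidate_kpi >= stable_kpi * (1.0 + min_uplift)
```

Hashing the request ID (rather than sampling randomly) keeps variant assignment stable across retries, so KPI attribution stays clean.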

Why I’m a Perfect Fit

  • End‑to‑end ownership – From Terraform‑defined infra to the last line of Python that decides a bid, I have built the full stack of reliable, autonomous systems.
  • AI‑centric reliability mindset – I treat model health and decision correctness as first‑class reliability metrics, not an afterthought.
  • Proven scale – My platforms have sustained > 20 M events/sec during peak marketing campaigns, with < 0.01 % error rates.
  • Passion for agency‑level automation – I’ve previously built a “self‑service ad‑optimizer” that reduced manual campaign setup time from days to minutes, directly mirroring Atlas Brain’s vision.

I am thrilled at the prospect of joining the Atlas Brain team and helping create the autonomous nervous system that will let AI agents truly replace manual marketing execution. I look forward to discussing how my background can accelerate your roadmap.

Thank you for your consideration.

Sincerely,
[Your Name]


Quick‑Reference Technical Snapshot

  • IaC / GitOps – Terraform, Terragrunt, ArgoCD, GitHub Actions
  • Kubernetes – EKS (multi-region), Karpenter, KEDA, Calico CNI, OPA Gatekeeper
  • Languages – Python (FastAPI, asyncio, boto3), Go (workers, controllers)
  • MLOps – MLflow, Seldon Core, TensorFlow Serving, ONNX, Hugging Face Transformers
  • Observability – OpenTelemetry, Jaeger, Prometheus, Grafana, Loki, Elastic APM
  • Data pipelines – Kafka, Kinesis, Pulsar, Redis Streams, Airflow
  • Security / guardrails – OPA, Sentinel, IAM least-privilege, Secrets Manager, Vault
  • Cost governance – AWS Cost Explorer, Spotinst, custom Python optimizer

Feel free to reach out if you’d like a deeper dive into any of the systems above or a live demo of the autonomous agent pipeline I built.

Requirements

  • Mastery of Terraform, ArgoCD, and GitOps workflows
  • Expert-level Kubernetes (EKS/GKE) networking, scaling, security, and multi-tenancy patterns
  • Hands-on experience with MLOps pipelines for autonomous agents
  • Model versioning and deployment strategies for continuous agent improvement
  • Prompt management and A/B testing of agent behaviors
  • Guardrails for safe tool execution and decision boundaries
  • Scaling AI inference services (LLMs, embeddings, classification models)
  • Proficiency in Python for building custom platform tools and automation
  • Deep expertise in distributed tracing and monitoring for complex, event-driven systems—specifically for debugging AI agent decision chains
  • Experience with high-frequency data pipelines, web crawling at scale, real-time processing, and low-latency requirements

Responsibilities

  • Architect the Autonomous Backbone
  • Engineer for Zero-Touch
  • Scale Agentic Workflows
  • Define Radical Reliability for AI
  • Observability for Agent Decisions
  • Safety & Guardrails
  • Cost & Performance Governance

Skills

ArgoCD, AWS Lambda, Docker, EKS, GKE, Go, Grafana, KEDA, Kubernetes, LLMs, OpenTelemetry, Prometheus, Python, Terraform
