AWS Cloud Engineering Ops Lead (Application Support)
Conglomerate-IT
Atlanta · On-site · Contract · Lead · 1mo ago
About the role
Our Mission
Keep our AWS platforms and customer-facing apps available, observable, recoverable, secure, and cost‑sensible. Make the runbook path the easiest path, so on-call personnel feel calm and releases feel straightforward, in a good way.
Scope of the role
- AWS operations: EC2, EKS, RDS, ALB/CloudFront, IAM/OIDC, VPC/TGW/SGs, patching, and hygiene.
- Application support: release readiness, runbooks, post-deploy smoke checks, performance baselines, and clean rollback paths.
- Visibility: dashboards, logs, metrics, traces, synthetics, error budgets, and alert health.
- Backup & DR: policies, schedules, retention, cross-region copies, restore testing, and DR runbooks (RPO/RTO owned and measured).
- Incident leadership: run Sev‑1/2 bridges, keep comms clear, and land post‑mortems with actions that actually close.
- Cost hygiene: tagging, right-sizing, SP/RI coverage, lifecycle cleanups (EBS/EIP/AMIs).
- Team enablement: guardrails, golden runbooks, and small automations that remove toil.
Day‑to‑day (what this looks like)
- Triage overnight alerts and hot issues, set priorities, and make sure owners are clear.
- Keep dashboards honest; fix flapping or missing alerts before they wake people up.
- Check backups and recent restore points; open tickets for any gaps and track to done.
- Unblock releases; verify smoke checks; keep environments tidy and predictable.
- Lead or delegate break/fix; no lingering “mystery” incidents.
- Write down what we learned in the runbook so the next person can fix it faster.
Weekly rhythm
- Ops review: incidents, alerts, deploys, costs, capacity, and backup status in one short readout.
- Observability tune‑up: delete noise, add the missing signal, and test a synthetic from the edge.
- Backup/DR: run a small restore test and record RPO/RTO evidence.
- Patch and change review: what shipped, what rolled back, why.
Monthly outcomes
- Share availability/SLOs, MTTR, change failure rate, observability coverage, backup compliance, and costs in plain English.
- Close the top recurring issues (noisy alerts, flaky deploys).
- Refresh the most‑used runbooks; validate DR for one critical workload (tabletop or live restore).
Core responsibilities
- Own production readiness and stability for assigned AWS accounts and apps.
- Lead incidents and land post‑mortems; make the fixes stick.
- Keep monitoring/logging/tracing standards real; enforce SLOs and error budgets.
- Own backup strategy end-to-end, including monthly restore tests and DR docs.
- Keep access least‑privileged and auditable; rotate secrets and certs on time.
- Drive cost posture and mentor the team; make on-call humane.
What “good” looks like
- Visibility: one clear dashboard per service, clean alert routing, low false positives.
- Backups: 100% of jobs green (or successfully retried), documented RPO/RTO, and monthly restore tests that pass.
- Reliability: MTTR trending down; most issues solved by the first responder with a runbook.
- Change: predictable releases with smoke and rollback; fewer failed changes month over month.
- Cost: flat or down against growth; tagging at or above 95%.
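For candidates wondering how these targets might be measured, here is a minimal Python sketch. It is an illustration only, not the team's actual tooling: the `Incident` record, the function names, and the sample tag keys are all hypothetical, and real data would come from whatever incident, deploy, and inventory systems are in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: when it was opened and when it was resolved."""
    opened: datetime
    resolved: datetime


def mttr(incidents):
    """Mean time to restore across resolved incidents (zero if there are none)."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.opened for i in incidents), timedelta(0))
    return total / len(incidents)


def change_failure_rate(total_changes, failed_changes):
    """Fraction of changes that failed or were rolled back."""
    if total_changes == 0:
        return 0.0
    return failed_changes / total_changes


def tagging_coverage(resources, required_tags):
    """Share of resources that carry every required tag key.

    `resources` is a list of tag dictionaries, one per resource.
    """
    if not resources:
        return 1.0
    tagged = sum(1 for tags in resources if required_tags.issubset(tags))
    return tagged / len(resources)


if __name__ == "__main__":
    # Sample data, invented for illustration.
    start = datetime(2024, 1, 1, 9, 0)
    incidents = [
        Incident(start, start + timedelta(minutes=30)),
        Incident(start, start + timedelta(minutes=90)),
    ]
    resources = [
        {"owner": "team-a", "env": "prod"},
        {"owner": "team-b", "env": "dev"},
        {"owner": "team-c", "env": "prod"},
        {"env": "prod"},  # missing the "owner" tag
    ]
    coverage = tagging_coverage(resources, {"owner", "env"})
    print("MTTR:", mttr(incidents))
    print("Change failure rate:", change_failure_rate(20, 3))
    print("Tagging coverage:", coverage, "meets 95% target:", coverage >= 0.95)
```

Tracking the same three numbers month over month is what turns "trending down" and "at or above 95%" into something that can be shown in the monthly readout.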
Minimum Experience Required
- 8–10+ years in cloud/app operations with strong AWS hands-on experience.
- Comfortable leading incidents, shaping dashboards and alerts, and automating the boring bits (Terraform, Ansible, Python).
- Experience running backups/DR in AWS and proving it with real restore tests.
- Cloud networking experience (VPC, Transit Gateway, security groups).
Preferred Experience
- AWS Certified Solutions Architect certification
- Any professional networking certification
- ITIL certification
Benefits:
- Health insurance
Work Location:
In person
Skills
AWS, Ansible, Cloud network, EC2, EKS, IAM/OIDC, Python, RDS, Terraform, VPC/TGW/SGs