AWS Cloud Engineering Ops Lead (Application Support)

Conglomerate-IT

Atlanta · On-site Contract Lead

About the role

Our Mission

Keep our AWS platforms and customer-facing apps available, observable, recoverable, secure, and cost‑sensible. Make the runbook path the easiest path, so on-call staff stay calm and releases are straightforward and uneventful.

Scope of the role

AWS operations: EC2, EKS, RDS, ALB/CloudFront, IAM/OIDC, VPC/TGW/SGs, patching, and hygiene.
Application support: release readiness, runbooks, post-deploy smoke checks, performance baselines, and clean rollback paths.
Visibility: dashboards, logs, metrics, traces, synthetics, error budgets, and alert health.
Backup & DR: policies, schedules, retention, cross-region copies, restore testing, and DR runbooks (RPO/RTO owned and measured).
Incident leadership: run Sev‑1/2 bridges, keep comms clear, and land post‑mortems with actions that actually close.
Cost hygiene: tagging, right-sizing, SP/RI coverage, lifecycle cleanups (EBS/EIP/AMIs).
Team enablement: guardrails, golden runbooks, and small automations that remove toil.

Day‑to‑day (what this looks like)

Triage overnight alerts and hot issues, set priorities, and make sure owners are clear.
Keep dashboards honest; fix flapping or missing alerts before they wake people up.
Check backups and recent restore points; open tickets for any gaps and track to done.
Unblock releases; verify smoke checks; keep environments tidy and predictable.
Lead or delegate break/fix; no lingering “mystery” incidents.
Write down what we learned in the runbook so the next person can fix it faster.

Weekly rhythm

Ops review: incidents, alerts, deploys, costs, capacity, and backup status in one short readout.
Observability tune‑up: delete noise, add the missing signal, and test a synthetic from the edge.
Backup/DR: run a small restore test and record RPO/RTO evidence.
Patch and change review: what shipped, what rolled back, and why.

Monthly outcomes

Share availability/SLOs, MTTR, change failure rate, observability coverage, backup compliance, and costs in plain English.
Close the top recurring issues (noisy alerts, flaky deploys).
Refresh the most‑used runbooks; validate DR for one critical workload (tabletop or live restore).
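
As a rough illustration of how two of the monthly numbers above (MTTR and change failure rate) might be derived, here is a minimal Python sketch. The `Incident` and `Deploy` record shapes and their field names are hypothetical, chosen only for this example; they do not describe any specific tool used by the team.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened: datetime    # when the incident was declared (hypothetical field)
    resolved: datetime  # when service was restored (hypothetical field)


@dataclass
class Deploy:
    failed: bool  # True if the change was rolled back or caused an incident


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to restore: the average of (resolved - opened)."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.opened for i in incidents), timedelta(0))
    return total / len(incidents)


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Fraction of deploys in the period that were marked as failed."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)
```

Tracking both per month makes the trend visible: what matters in the readout is the direction of travel more than any single value.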

Core responsibilities

Own production readiness and stability for assigned AWS accounts and apps.
Lead incidents and land post‑mortems; make the fixes stick.
Keep monitoring/logging/tracing standards real; enforce SLOs and error budgets.
Own backup strategy end-to-end, including monthly restore tests and DR docs.
Keep access least‑privileged and auditable; rotate secrets and certs on time.
Drive cost posture and mentor the team; make on-call humane.
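
For readers less familiar with error budgets, the arithmetic behind "enforce SLOs and error budgets" can be sketched as follows. The 30-day window and the 99.9% target used in the usage note are illustrative assumptions, not targets stated in this posting.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction, e.g. 0.999 for 99.9% (an assumed example target).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of error budget left; a negative value means the budget is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

Under these assumptions, a 99.9% SLO over 30 days allows roughly 43 minutes of downtime, so a single 20-minute outage would consume close to half of the budget.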

What “good” looks like

Visibility: one clear dashboard per service, clean alert routing, low false positives.
Backups: 100% of jobs green (or successfully retried), documented RPO/RTO, and monthly restore tests that pass.
Reliability: MTTR trending down; most issues solved by the first responder with a runbook.
Change: predictable releases with smoke and rollback; fewer failed changes month over month.
Cost: flat or down against growth; tagging at or above 95%.
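
One way the tagging target above could be checked is sketched below in Python. The required tag keys (`owner`, `cost-center`) and the input shape (one dict of tags per resource) are hypothetical; a real check would read tags from whatever inventory source the team uses.

```python
REQUIRED_TAGS = {"owner", "cost-center"}  # hypothetical required tag keys


def tagging_coverage(resources: list[dict[str, str]]) -> float:
    """Fraction of resources that carry every required tag with a non-empty value."""
    if not resources:
        return 1.0  # nothing to tag, so nothing is out of compliance
    compliant = sum(
        1 for tags in resources
        if all(tags.get(key) for key in REQUIRED_TAGS)
    )
    return compliant / len(resources)


def meets_target(coverage: float, target: float = 0.95) -> bool:
    """True when coverage is at or above the target (95% by default)."""
    return coverage >= target
```

Treating an empty tag value as missing is a design choice in this sketch; it keeps placeholder tags from inflating the coverage figure.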

Minimum Experience Required

8–10+ years in cloud/app operations with strong AWS hands-on experience.
Comfortable leading incidents, shaping dashboards and alerts, and automating the boring bits (Terraform, Ansible, Python).
Experience running backups/DR in AWS and proving it with real restore tests.
Cloud networking experience (VPC, Transit Gateway, security groups).