Principal DevOps Engineer
Amtrak
About the role
At Amtrak, the Principal DevOps Engineer is a technical leader responsible for ensuring the resilience, scalability, and security of our digital platforms. This role combines software engineering, systems engineering, and a deep operational mindset to improve reliability through automation, observability, and proactive incident response. The successful candidate will drive architectural decisions around SLOs, error budgets, infrastructure as code, and deployment strategies while mentoring engineers and standardizing practices across teams. They will collaborate cross-functionally to implement scalable solutions that align with our goals for service health, security, and development velocity.
ESSENTIAL FUNCTIONS:
CI/CD & Release
- Architect progressive delivery (canary/blue-green/feature flags) for DevSecOps CI/CD pipelines.
- Automate rollback/fail-forward and release evidence capture.
- Standardize quality gates (tests, perf/chaos pre-prod).
Platform (IaC, Cloud, Containers)
- Publish hardened base images and golden IaC modules with guardrails.
- Enforce Kubernetes RBAC, network policies, quotas, and secrets standards.
- Design multi-env promotion workflows with policy checks.
Observability, SLOs & Incidents
- Establish SLOs/error budgets; drive cross-team reliability improvements.
- Bake runbooks into alerts; add synthetic/load tests to pipelines.
- Lead major incidents; land systemic fixes (not just patches).
Security & Compliance
- Enforce short-lived creds, zero-trust patterns, and attestation/signing.
- Automate compliance checks and evidence collection.
- Partner with security on threat-modeling for platform changes.
Automation & Tooling
- Create internal libraries/CLIs with telemetry and docs.
- Measure automation ROI (time saved, error-rate drop).
- Orchestrate complex workflows (e.g., Step Functions/Argo Workflows).
Platform DX, Docs & Collaboration
- Own a platform capability end-to-end (roadmap, SLAs, upgrades).
- Drive adoption of best practices across multiple teams.
- Write ADRs and decision logs that clarify trade-offs.
Networking, Data Resilience & FinOps
- Define/validate RPO/RTO; automate restore drills and reports.
- Tune critical paths for latency/throughput and cost.
- Forecast impacts of migrations; deliver measurable cost/perf wins.
MINIMUM QUALIFICATIONS:
- Bachelor’s degree in Computer Science, Engineering, or related technical discipline.
- At least 5 years of experience in DevOps, SRE, or Platform Engineering roles with leadership experience in automation and infrastructure reliability.
- 3+ years hands-on experience in high-availability production environments with cloud-native security and observability tooling.
PREFERRED QUALIFICATIONS:
- Master’s degree in Computer Science or equivalent.
- Certifications: AWS DevOps Engineer Pro, Terraform Associate, CKA, or SRE-focused credentials.
- Experience with developer portals (e.g., Backstage), service mesh (e.g., Istio), and security tooling (e.g., Vault, Falco, Trivy).
- Knowledge of DORA metrics, reliability KPIs, and engineering effectiveness measurement frameworks.
- Background in regulated environments (e.g., PCI, HIPAA, FedRAMP) with experience implementing security automation at scale.
KNOWLEDGE, SKILLS, AND ABILITIES:
- Deep expertise in AWS (or equivalent cloud platform), especially in compute, networking, IAM, and monitoring.
- Proficiency in Terraform, AWS CDK, CloudFormation, Docker, and Linux systems.
- Experience with ‘pipelines as code’ and setting up CI/CD with GitHub Actions, AWS CodeBuild/CodePipeline, and Jenkins.
- Experience implementing and managing CI/CD systems with security tollgates and rollback logic.
- Strong scripting skills in Python, Go, or Bash for automation and tooling.
- In-depth understanding of SRE practices including incident response, SLO/SLA management, chaos engineering, and capacity modeling.
- Familiarity with Git and GitOps patterns.
- Proven track record of creating shared tooling and documentation that promotes operational excellence.
WORK ENVIRONMENT:
- Onsite 4/5 days per week in Washington, DC; Philadelphia, PA; or Wilmington, DE.
- Occasional participation in on-call rotations and availability for high-severity incident response.