SR AWS Cloud Engineer/Architect
Tekaccel Inc
US · Hybrid · Contract · Senior · Posted today
About the role
Job Description
Keep AWS environments and customer applications stable, secure, cost-efficient, and resilient at all times. The focus is on making deployments routine, keeping incidents manageable, and ensuring operations run in a predictable, controlled manner.
Key Responsibilities
- Lead incident management end-to-end, including handling critical outages and ensuring long-term fixes are implemented.
- Ensure production environments across AWS accounts and applications remain stable and reliable.
- Continuously optimize cloud spend using tagging strategies, rightsizing, and lifecycle controls.
- Strengthen observability across systems by making logs, metrics, tracing, and alerts actionable and meaningful.
- Build and scale reusable automation, playbooks, and operational best practices to support the team.
- Enforce secure access through least-privilege principles, regular audits, and credential hygiene.
- Define and maintain robust backup and disaster recovery strategies with periodic validation and documentation.
Core Functional Areas
- Application Operations: Manage deployments, perform smoke validations, track performance baselines, and ensure reliable rollback mechanisms.
- Cloud Infrastructure Management: Oversee AWS services like EC2, EKS, RDS, networking (VPC, security groups, transit gateways), IAM/OIDC, and edge components such as CloudFront and load balancers.
- Incident Management: Run high-priority incident bridges, maintain clear stakeholder communication, and drive effective post-incident reviews.
- Monitoring & Observability: Develop dashboards, alerts, and synthetic monitoring while maintaining a strong signal-to-noise ratio.
- Operational Excellence: Standardize processes via runbooks and reduce manual effort through automation.
- Backup & Recovery: Manage backup strategies, retention policies, cross-region replication, and validate recovery through regular testing.
- Cost Optimization: Control cloud expenses using savings plans, reserved instances, tagging discipline, and cleanup of unused assets.
Daily Activities
- Keep monitoring systems sharp by reducing alert noise and fixing visibility gaps.
- Act on operational issues quickly: resolve or escalate without leaving anything unclear.
- Review incidents and alerts from the previous cycle, prioritize them, and assign ownership.
- Update runbooks and documentation with new fixes, learnings, and recurring patterns.
- Validate backup success and confirm that recovery points are usable.
- Assist in deployments by ensuring readiness and verifying post-release checks.
Weekly Focus
- Strengthen observability by refining alerts and filling in missing telemetry signals.
- Review patches, recent changes, and rollback scenarios to identify improvement areas.
- Conduct a consolidated operations review across incidents, deployments, cost trends, capacity, and backup health.
- Perform recovery drills or partial restore validations to ensure disaster readiness.
Monthly Deliverables
- Refresh and maintain critical runbooks while validating disaster recovery readiness through drills or actual restore tests.
- Publish key operational insights such as uptime/SLO adherence, MTTR, deployment reliability, monitoring coverage, backup compliance, and cost optimization metrics.
- Drive closure of recurring operational issues like unstable releases and excessive alert noise.
Success Indicators
- Efficient incident resolution with most issues handled via well-defined runbooks.
- Controlled and optimized cloud spending that aligns with system growth, supported by strong tagging discipline.
- Reliable backup systems with consistent restore validation and full compliance.
- Clean, dependable dashboards with accurate alerting, minimal noise, and proper escalation flows.
- Smooth and predictable release cycles with very few failures.
Preferred Qualifications
- AWS Certified Solutions Architect certification.
- Relevant certifications in networking.
- ITIL or similar service management certification.
Required Experience
- 8 to 10+ years of experience in cloud or application operations, with strong hands-on work in AWS environments.
- Proven ability to handle incidents, build effective monitoring systems, and automate operational workflows using tools like Terraform, Ansible, or Python.
- Hands-on exposure to AWS backup and disaster recovery setups, including real restore validations.
- Solid understanding of cloud networking concepts and architecture.
Skills
Ansible, AWS, CloudFront, EC2, EKS, IAM, OIDC, RDS, Terraform, VPC