Sr. AWS Cloud Engineer/Architect

Tekaccel Inc

US · Hybrid · Contract · Senior · Today

About the role

Job Description

This role keeps AWS environments and customer applications stable, secure, cost-efficient, and resilient at all times. The focus is on making deployments feel routine, keeping incidents manageable, and ensuring operations run in a predictable, controlled manner.

Key Responsibilities

  • Lead incident management end-to-end, including handling critical outages and ensuring long-term fixes are implemented.
  • Ensure production environments across AWS accounts and applications remain stable and reliable.
  • Continuously optimize cloud spend using tagging strategies, rightsizing, and lifecycle controls.
  • Strengthen observability across systems by making logs, metrics, tracing, and alerts actionable and meaningful.
  • Build and scale reusable automation, playbooks, and operational best practices to support the team.
  • Enforce secure access through least-privilege principles, regular audits, and credential hygiene.
  • Define and maintain robust backup and disaster recovery strategies with periodic validation and documentation.

Core Functional Areas

  • Application Operations: Manage deployments, perform smoke validations, track performance baselines, and ensure reliable rollback mechanisms.
  • Cloud Infrastructure Management: Oversee AWS services like EC2, EKS, RDS, networking (VPC, security groups, transit gateways), IAM/OIDC, and edge components such as CloudFront and load balancers.
  • Incident Management: Run high-priority incident bridges, maintain clear stakeholder communication, and drive effective post-incident reviews.
  • Monitoring & Observability: Develop dashboards, alerts, and synthetic monitoring while maintaining a strong signal-to-noise ratio.
  • Operational Excellence: Standardize processes via runbooks and reduce manual effort through automation.
  • Backup & Recovery: Manage backup strategies, retention policies, cross-region replication, and validate recovery through regular testing.
  • Cost Optimization: Control cloud expenses using savings plans, reserved instances, tagging discipline, and cleanup of unused assets.

Daily Activities

  • Keep monitoring systems sharp by reducing alert noise and fixing visibility gaps.
  • Act on operational issues quickly: resolve or escalate without leaving anything unclear.
  • Review incidents and alerts from the previous cycle, prioritize them, and assign ownership.
  • Update runbooks and documentation with new fixes, learnings, and recurring patterns.
  • Validate backup success and confirm that recovery points are usable.
  • Assist in deployments by ensuring readiness and verifying post-release checks.

Weekly Focus

  • Strengthen observability by refining alerts and filling in missing telemetry signals.
  • Review patches, recent changes, and rollback scenarios to identify improvement areas.
  • Conduct a consolidated operations review across incidents, deployments, cost trends, capacity, and backup health.
  • Perform recovery drills or partial restore validations to ensure disaster readiness.

Monthly Deliverables

  • Refresh and maintain critical runbooks while validating disaster recovery readiness through drills or actual restore tests.
  • Publish key operational insights such as uptime/SLO adherence, MTTR, deployment reliability, monitoring coverage, backup compliance, and cost optimization metrics.
  • Drive closure of recurring operational issues like unstable releases and excessive alert noise.

Success Indicators

  • Efficient incident resolution with most issues handled via well-defined runbooks.
  • Controlled and optimized cloud spending that aligns with system growth, supported by strong tagging discipline.
  • Reliable backup systems with consistent restore validation and full compliance.
  • Clean, dependable dashboards with accurate alerting, minimal noise, and proper escalation flows.
  • Smooth and predictable release cycles with very few failures.

Preferred Qualifications

  • Certification as an AWS Solutions Architect.
  • Relevant certifications in networking.
  • ITIL or similar service management certification.

Required Experience

  • 8-10+ years of experience in cloud or application operations, with strong hands-on work in AWS environments.
  • Proven ability to handle incidents, build effective monitoring systems, and automate operational workflows using tools like Terraform, Ansible, or Python.
  • Hands-on exposure to AWS backup and disaster recovery setups, including real restore validations.
  • Solid understanding of cloud networking concepts and architecture.

Skills

Ansible, AWS, CloudFront, EC2, EKS, IAM, OIDC, RDS, Terraform, VPC
