Senior AWS AgentCore Platform Engineer

InvestM Technology LLC

Exton · Hybrid · Contract · Senior · Today

About the role

This role focuses on enhancing the AgentCore platform with capabilities in observability, cost tracking and total cost of ownership (TCO), monitoring, incident management, security, and governance.

Key Focus Areas

  • Observability & Distributed Tracing
  • Cost Tracking & TCO (Total Cost of Ownership)
  • Monitoring & Incident Management
  • Security & Governance

Responsibilities

  • Observability & Distributed Tracing
    • Assess AWS CloudWatch, X-Ray, Bedrock logging, and AgentCore traces against agentic workflow requirements; produce a comprehensive gap analysis and lead the setup of observability within Dynatrace.
    • Design and implement post-deployment validation pipelines for agents and Model Context Protocol (MCP) servers, ensuring deployment health and successful tool registration.
    • Implement distributed tracing and structured logging to capture LLM decision logic, tool selections, sub-agent calls, and MCP interactions.
    • Evaluate LangFuse and LiteLLM proxies against AWS-native solutions; deliver a target-state observability architecture recommendation.
  • Cost Tracking & TCO
    • Extend tagging taxonomy to capture costs across agent runtimes, MCP servers, vector databases, and Bedrock token consumption per namespace.
    • Design a granular cost visibility model to aggregate expenses for agents, MCPs, and LLM tokens by team and department.
    • Build CloudWatch (or equivalent) dashboards for per-team spending; configure AWS Budgets with proactive alerting thresholds.
    • Automate cost reporting via email and Microsoft Teams, incorporating anomaly detection rules to identify spend spikes.
  • Monitoring & Incident Management
    • Define and implement P1 through P4 alerting rules covering deployment failures, runtime errors, tool invocation failures, and MCP connectivity issues.
    • Integrate alert notifications with Microsoft Teams and email, utilizing resource ownership tags for intelligent routing.
    • Author detailed runbooks for every alert; publish and maintain these in Confluence to facilitate developer self-service resolution.
  • Security & Governance
    • Compare AWS-native vs. third-party monitoring stacks to deliver a long-term recommendation aligned with the broader observability architecture.
    • Evaluate current IAM and tagging strategies for multi-team isolation; identify scalability gaps and potential security risks.
    • Assess the Cedar policy engine (AgentCore) for fine-grained tool access control and document gaps for enterprise-scale deployment.
    • Design a scalable Attribute-Based Access Control (ABAC) identity model to ensure multi-team isolation without IAM policy sprawl; deliver production-ready Terraform modules.

Skills

AWS · AWS Budgets · AWS CloudWatch · AWS IAM · AWS X-Ray · Attribute-Based Access Control · Bedrock · Cedar · Confluence · Docker · Dynatrace · LangFuse · LiteLLM · Microsoft Teams · Terraform
