AI Infrastructure Engineer

High Trail

Mundelein · On-site · Full-time · Yesterday

About the role

Overview

Build and own the observability and diagnostics layer for a real-time AI assistant platform. You’ll make complex AI systems transparent, debuggable, and reliable by enabling end-to-end tracing, rapid root-cause analysis, and real-time monitoring.

Responsibilities

  • Design event tracing across AI decisioning, workflows, and real-time communication systems
  • Build automated pipelines to detect, classify, and analyze system failures
  • Create dashboards for real-time and post-session visibility (timelines, decision paths, errors)
  • Monitor live sessions and surface alerts for anomalies (latency, loops, failed actions)
  • Build human-intervention tools for handling issues during live sessions
  • Identify recurring failure patterns and drive system improvements
  • Implement automated triage and alerting to route issues to the right teams

Requirements

  • Strong backend experience with distributed systems and observability
  • Proficiency in Python and event-driven architectures
  • Experience debugging complex systems
  • Familiarity with AI/LLM systems, workflow/state machines, and telemetry tools

Nice to Have

  • Experience with real-time/voice systems
  • Experience with observability tools (e.g., Grafana, OpenTelemetry)
  • Exposure to human-in-the-loop systems or operational tooling

Skills

AI/LLM, Python, distributed systems, event-driven architectures, observability, telemetry
