AI Infrastructure Engineer
High Trail
Mundelein · On-site · Full-time · Yesterday
About the role
Overview
Build and own the observability and diagnostics layer for a real-time AI assistant platform. You’ll make complex AI systems transparent, debuggable, and reliable by enabling end-to-end tracing, rapid root-cause analysis, and real-time monitoring.
Responsibilities
- Design event tracing across AI decisioning, workflows, and real-time communication systems
- Build automated pipelines to detect, classify, and analyze system failures
- Create dashboards for real-time and post-session visibility (timelines, decision paths, errors)
- Monitor live sessions and surface alerts for anomalies (latency, loops, failed actions)
- Build human intervention tools for handling issues during live sessions
- Identify recurring failure patterns and drive system improvements
- Implement automated triage and alerting to route issues to the right teams
Requirements
- Strong backend experience with distributed systems and observability
- Proficiency in Python and event-driven architectures
- Experience debugging complex systems
- Familiarity with AI/LLM systems, workflow/state machines, and telemetry tools
Nice to Have
- Experience with real-time/voice systems
- Experience with observability tools (e.g., Grafana, OpenTelemetry)
- Exposure to human-in-the-loop systems or operational tooling
Skills
AI/LLM, Python, distributed systems, event-driven architectures, observability, telemetry