AI/ML Engineer
Zenotis Group
About the Role
We are seeking an experienced AI/ML Engineer to build, scale, and maintain the critical infrastructure that powers our AI models and autonomous agents. In this role, you will act as the bridge between our AI research/development teams and our production environments. You will not just be deploying models; you will be designing the high-performance, distributed systems required to serve Large Language Models (LLMs), orchestrate multi-agent workflows, and optimize GPU compute at scale.
If you are passionate about turning complex AI capabilities into highly reliable, scalable, and cost-efficient production systems, this is the role for you.
Key Responsibilities
Machine Learning Infrastructure & Serving
- Design, build, and manage scalable infrastructure for training, fine-tuning, and serving LLMs and multimodal models.
- Optimize inference latency, throughput, and cost using modern serving frameworks (e.g., vLLM, Triton Inference Server, Ray Serve).
- Manage and orchestrate GPU/TPU clusters, ensuring high utilization and efficient resource allocation.
Building and Scaling Agentic Operations (AgentOps)
- Architect and deploy infrastructure to support autonomous AI agents and multi-agent systems.
- Integrate and maintain agent orchestration frameworks (e.g., LangGraph, CrewAI) within production environments.
- Build robust state management and memory systems (vector databases, graph databases) required for agentic workflows.
Observability, Evaluation, and Reliability
- Implement comprehensive observability stacks tailored for LLMs and agents (tracing, prompt logging, cost tracking) using tools like Langfuse, Arize, or Datadog.
- Design automated evaluation pipelines to monitor agent performance, safety, and reliability in real time (LLMOps/AgentOps).
- Act as the first line of defense for production AI systems, diagnosing and resolving issues related to memory limits, inference queues, and cluster failures.
Developer Platform & CI/CD for AI
- Build internal developer platforms and tooling that allow AI engineers and data scientists to easily deploy models and agents to production.
- Adapt traditional CI/CD pipelines to accommodate model versioning, prompt management, and continuous evaluation.
Qualifications
Required Skills:
- Systems Engineering: Strong background in distributed systems, backend engineering, or DevOps/SRE.
- Programming: Proficiency in Python (essential for the AI ecosystem) and systems languages like Go or Rust.
- Containerization & Orchestration: Deep expertise in Kubernetes (K8s), Docker, and infrastructure-as-code (Terraform, Pulumi).
- AI/ML Tooling: Hands-on experience with LLM serving engines (vLLM, TGI, Triton) and distributed computing frameworks (Ray).
- Agent Frameworks: Familiarity with modern agentic development frameworks like LangChain, LangGraph, or CrewAI.
- Cloud & Hardware: Experience managing high-performance compute (GPUs/TPUs) on major cloud providers (AWS, GCP, Azure).
Preferred Skills:
- Experience with vector databases (Pinecone, Milvus, Qdrant) and retrieval-augmented generation (RAG) pipelines.
- Understanding of model optimization techniques (quantization, LoRA, KV caching).
- Previous experience building platforms from the ground up in a high-growth