
AI/ML Engineer

Zenotis Group

Burlingame · Hybrid · Full-time · Senior · Posted today

About the Role

We are seeking an experienced AI/ML Engineer to build, scale, and maintain the critical infrastructure that powers our AI models and autonomous agents. In this role, you will act as the bridge between our AI research/development teams and our production environments. You will not just be deploying models; you will be designing the high-performance, distributed systems required to serve Large Language Models (LLMs), orchestrate multi-agent workflows, and optimize GPU compute at scale.

If you are passionate about turning complex AI capabilities into highly reliable, scalable, and cost-efficient production systems, this is the role for you.

Key Responsibilities

Machine Learning Infrastructure & Serving

  • Design, build, and manage scalable infrastructure for training, fine-tuning, and serving LLMs and multimodal models.
  • Optimize inference latency, throughput, and cost using modern serving frameworks (e.g., vLLM, Triton Inference Server, Ray Serve).
  • Manage and orchestrate GPU/TPU clusters, ensuring high utilization and efficient resource allocation.

Building and Scaling Agentic Operations (AgentOps)

  • Architect and deploy infrastructure to support autonomous AI agents and multi-agent systems.
  • Integrate and maintain agent orchestration frameworks (e.g., LangGraph, CrewAI) within production environments.
  • Build robust state management and memory systems (vector databases, graph databases) required for agentic workflows.
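For illustration, the core operation behind agent memory is a similarity lookup over stored embeddings. The toy sketch below uses hand-written 3-dimensional vectors and plain cosine similarity; a production system would use real embeddings and a vector database such as Qdrant or Milvus. The `ToyMemory` class and its vectors are invented for this example only.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class ToyMemory:
    """In-memory stand-in for a vector database backing agent memory."""

    def __init__(self):
        self.items = []  # (embedding, text) pairs

    def add(self, embedding, text):
        self.items.append((embedding, text))

    def recall(self, query_embedding, k=1):
        # Return the k stored texts most similar to the query.
        ranked = sorted(self.items,
                        key=lambda item: cosine(item[0], query_embedding),
                        reverse=True)
        return [text for _, text in ranked[:k]]

memory = ToyMemory()
memory.add([1.0, 0.0, 0.0], "user prefers concise answers")
memory.add([0.0, 1.0, 0.0], "user is deploying on GCP")
print(memory.recall([0.9, 0.1, 0.0]))  # → ['user prefers concise answers']
```

The same pattern (embed, store, rank by similarity, return top-k) is what the listed vector databases implement at scale with approximate nearest-neighbor indexes.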

Observability, Evaluation, and Reliability

  • Implement comprehensive observability stacks tailored for LLMs and agents (tracing, prompt logging, cost tracking) using tools like Langfuse, Arize, or Datadog.
  • Design automated evaluation pipelines to monitor agent performance, safety, and reliability in real time (LLMOps/AgentOps).
  • Act as the first line of defense for production AI systems, diagnosing and resolving issues related to memory limits, inference queues, and cluster failures.
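As a sketch of what LLM tracing involves, the decorator below wraps a model call and records latency plus crude token counts in an in-memory list. Tools like Langfuse, Arize, and Datadog persist, aggregate, and visualize such traces; the `fake_llm` function and whitespace token estimate here are stand-ins, not a real client or tokenizer.

```python
import functools
import time

TRACES = []  # in production, traces go to an observability backend

def traced(fn):
    """Record latency and rough token counts for each wrapped call."""
    @functools.wraps(fn)
    def wrapper(prompt, **kwargs):
        start = time.perf_counter()
        response = fn(prompt, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "latency_s": time.perf_counter() - start,
            "prompt_tokens": len(prompt.split()),    # crude estimate
            "response_tokens": len(response.split()),
        })
        return response
    return wrapper

@traced
def fake_llm(prompt):
    # Stand-in for a real model call.
    return "echo: " + prompt

fake_llm("summarize the incident report")
print(TRACES[0]["prompt_tokens"])  # → 4
```

Per-call records like these are what cost tracking and latency dashboards are built from.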

Developer Platform & CI/CD for AI

  • Build internal developer platforms and tooling that allow AI engineers and data scientists to easily deploy models and agents to production.
  • Adapt traditional CI/CD pipelines to accommodate model versioning, prompt management, and continuous evaluation.
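One way to picture "prompt management" in an AI CI/CD pipeline is treating prompts as versioned artifacts that can be pinned, diffed, and rolled back. The minimal sketch below uses an in-memory registry; a real setup would back this with git or a prompt-management service, and the `PromptRegistry` class is hypothetical.

```python
class PromptRegistry:
    """Toy prompt store with 1-based version numbers per prompt name."""

    def __init__(self):
        self.versions = {}  # name -> list of prompt strings

    def register(self, name, prompt):
        # Append a new version and return its version number.
        self.versions.setdefault(name, []).append(prompt)
        return len(self.versions[name])

    def get(self, name, version=None):
        # Latest version by default, or a pinned one for rollback.
        history = self.versions[name]
        return history[-1] if version is None else history[version - 1]

registry = PromptRegistry()
registry.register("summarizer", "Summarize: {text}")
v2 = registry.register("summarizer", "Summarize in two sentences: {text}")
print(v2, registry.get("summarizer", version=1))
```

Pinning a deployment to a specific prompt version is what lets continuous evaluation compare versions and roll back a regression the same way a model version would be rolled back.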

Qualifications

Required Skills:

  • Systems Engineering: Strong background in distributed systems, backend engineering, or DevOps/SRE.
  • Programming: Proficiency in Python (essential for the AI ecosystem) and systems languages like Go or Rust.
  • Containerization & Orchestration: Deep expertise in Kubernetes (K8s), Docker, and infrastructure-as-code (Terraform, Pulumi).
  • AI/ML Tooling: Hands-on experience with LLM serving engines (vLLM, TGI, Triton) and distributed computing frameworks (Ray).
  • Agent Frameworks: Familiarity with modern agentic development frameworks like LangChain, LangGraph, or CrewAI.
  • Cloud & Hardware: Experience managing high-performance compute (GPUs/TPUs) on major cloud providers (AWS, GCP, Azure).

Preferred Skills:

  • Experience with vector databases (Pinecone, Milvus, Qdrant) and retrieval-augmented generation (RAG) pipelines.
  • Understanding of model optimization techniques (quantization, LoRA, KV caching).
  • Previous experience building platforms from the ground up in a high-growth environment.
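To make the quantization item concrete: symmetric int8 quantization maps weights to 8-bit integers via a single scale factor, trading a small round-trip error for roughly 4x memory savings versus float32. The toy example below quantizes a short weight vector; real serving stacks apply this per-tensor or per-channel across model weights.

```python
def quantize_int8(weights):
    # Symmetric quantization: one scale maps the max |weight| to 127.
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)  # → [2, -50, 31, 127]
```

The maximum round-trip error is bounded by half the scale, which is the basic accuracy/memory trade-off the technique exploits.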

Skills

AWS · Azure · Docker · GCP · Go · Kubernetes · LangChain · LangGraph · LLM · MLOps · Python · Ray · Rust · Terraform · Triton Inference Server · vLLM
