Senior ML Engineer (Evaluation)

kaiko.ai

Zürich · Hybrid · Full-time · Senior · 3w ago

About the role

kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.

Healthcare decisions are rarely made by a single person or from a single data source. kaiko’s assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.

Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.

We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.

The Role

kaiko’s Multimodal Large Language Model (MLLM) is trained on domain-specific, high-complexity medical data. Reaching clinical-grade performance demands a comprehensive evaluation stack that is fast, reliable, and deeply integrated with our model development loop.

As a Senior ML Engineer (Evaluation), you will own the engineering stack that runs evaluations at scale: efficient inference across a growing set of frontier models, asynchronous evaluation against a wide array of clinical benchmarks, and automated orchestration of our pipelines, with a strong eye for observability and production-grade system organisation. You will work closely with ML researchers and the product team to translate research and clinical requirements into reliable, well-engineered eval signals.

As a Senior ML Engineer (Evaluation), you will:

  • Own AI factory orchestration for evaluation workloads, with Dagster as the primary orchestration layer: design, operate, and mature the pipelines and workflows that run large-scale evaluation jobs, and extend automation across the stack wherever possible.
  • Maintain and evolve the inference services that power evaluation runs, including cluster- and actor-level resource management, ensuring correctness, reproducibility, and throughput as the model and benchmark zoo grows.
  • Ensure the functional integrity of the eval stack through rigorous testing and validation: verify model integrations, confirm expected behaviour across configurations, and support ML researchers in understanding model outputs.
  • Own Eval/MLOps end-to-end: service deployments, model registry and artifact versioning, eval database organisation, rollout and rollback procedures, and post-deployment observability.
  • Develop towards a technical lead: set engineering direction, make architectural decisions, and support other engineers in execution.

You will be based in Zurich or Amsterdam, with the expectation of spending approximately 50% of your time in the office.

About you

  • Excellent Python skills and strong software engineering fundamentals: testing, modular design, CI/CD, code review, and monorepo tooling.
  • Proven experience building and operating ML inference services or MLOps infrastructure at scale, ideally for large language or multimodal models.
  • Hands‑on experience with distributed compute and GPU workloads: familiarity with frameworks such as Ray, CUDA toolchains, and container runtimes (Docker/Kubernetes or equivalent).
  • Experience with model serving frameworks such as vLLM, TensorRT-LLM, Triton Inference Server, or similar.
  • Experience with workflow orchestration tools, with a preference for Dagster; ability to design reliable, maintainable pipeline DAGs.
  • Familiarity with the full deployment lifecycle, from containerisation and config management to observability, alerting, and incident response.
  • Ability to read and reason about model internals at a low level: tokenisation, numerical precision, tensor shapes, and inference‑time behaviour.
  • Prior experience in the medical domain is not required, but a strong motivation to push the frontier of clinical…

Skills

CUDA · CI/CD · Docker · Kubernetes · MLOps · Python · Ray
