
AI Engineer – LLM Specialist

AlpineAI AG

On-site

About the role

We are looking for a skilled AI Engineer with a strong focus on testing, evaluating, and operationalizing Large Language Models (LLMs) to join our growing team. In this role, you will ensure that our language models meet high standards of accuracy, robustness, safety, and performance, and that they integrate seamlessly into our Speech-to-Text and AI-driven application landscape.

You will work closely with product, full-stack, and infrastructure engineers to transform state-of-the-art language models into reliable, production-ready systems that solve real customer problems. In short: you make prototypes production-ready.

Responsibilities

LLM Evaluation & Testing

  • Design and maintain systematic evaluation frameworks for LLMs, including:
    • Automated test suites
    • Golden datasets
    • Regression benchmarks
  • Define quantitative metrics (e.g., accuracy, latency, hallucination rate, task success) and qualitative evaluation protocols.
  • Perform error analysis and root‑cause investigations on model failures.

Task Alignment & Optimization

  • Focus on rapid prototyping and operationalization of customer use cases.
  • Improve model performance on specific tasks using a prompt‑first workflow (system prompts, few‑shot examples, tool instructions).
  • Build and iterate evaluation sets; run experiments to measure quality, latency, and cost.
  • Curate high‑signal datasets for automated prompt optimization (cleaning, labeling, filtering, augmentation).
  • Apply lightweight adaptation when beneficial (prompt tuning, parameter‑efficient methods like LoRA/adapters).
  • Use supervised fine‑tuning / instruction tuning when prompting and lightweight methods don’t reach the target.
  • Prepare and curate training datasets (cleaning, labeling, augmentation, filtering).
  • Evaluate and compare open‑source and commercial LLMs for specific use cases.
  • Design controlled experiments (A/B tests, offline evaluations).
  • Document results and recommend model choices.
  • Collaborate with full‑stack engineers to integrate prototypes into product, backend services and user‑facing applications.
  • Support API design for model inference and post‑processing.
  • Ensure models behave reliably in real‑time and batch workflows.

Quality & Safety Guardrails

  • Implement mechanisms to:
    • Reduce hallucinations
    • Enforce output formats
    • Apply content filters
    • Detect and handle unsafe or low‑confidence outputs

Performance & Cost Optimization

  • Optimize inference latency and throughput.
  • Balance model size, quantization, batching, and caching strategies.
  • Monitor and optimize inference costs.

MLOps & Lifecycle Management

  • Version models, datasets, prompts, and evaluation results.
  • Monitor model performance in production and detect drift.
  • Work closely with product managers to translate requirements into model behaviors.
  • Support internal teams with guidance on prompt design and model usage.
  • Contribute to documentation and internal best practices.
  • Define standards for dataset quality, labeling guidelines, and storage.
  • Maintain traceability between datasets, experiments, and deployed models.

Synthetic Data Generation

  • Use LLMs or other techniques to generate synthetic training data where real data is scarce.

Agentic LLMs & Human‑in‑the‑Loop Workflows

  • Design and test LLM workflows that call tools, functions, or external APIs.
  • Design feedback loops where human reviewers validate or correct model outputs.

Research Scouting

  • Track relevant papers, frameworks, and open‑source projects.

Internal Enablement

  • Create internal guidelines for prompt writing and evaluation.
  • Run occasional knowledge‑sharing sessions.

What You Bring

AI / ML Experience

  • At least 3–5 years of experience in machine learning or applied AI.
  • Practical experience working with LLMs in production or advanced prototypes.
  • Experience with PyTorch or TensorFlow.
  • Familiarity with fine‑tuning techniques and training pipelines.
  • Strong understanding of experimental design.

Programming Skills

  • Familiarity with REST APIs and backend integration.
  • Experience with dataset preprocessing, labeling pipelines, and versioning.
  • Familiarity with Docker, CI/CD, and model deployment.

Analytical Mindset

  • Ability to reason about model behavior and failure modes.

Communication

  • Good verbal and written communication in English and German.

Startup Mentality

  • Comfortable with ambiguity, fast iteration, and high ownership.

What We Are Offering

  • Opportunity to participate in AlpineAI’s company shares program after an initial period.
  • Dynamic, innovation‑driven culture.
  • High autonomy and real product impact.
  • Close collaboration with experts in speech, NLP, and applied AI.
  • Exposure to cutting‑edge AI technologies.

Don’t Apply If

  • You are not willing to work on‑site in Zurich or Davos.
  • You do not have a work permit for Switzerland.
  • You have never worked in a startup environment.

About Us

Learn more about AlpineAI at:

Ready to Help Customers Succeed with AI?

Apply now with your CV and a short cover letter. We look forward to hearing from you.
