Software Engineer, LLM Evaluation

Braintrust

Paris · flexible · Contract · Lead · 1mo ago

About the role

About Braintrust

Braintrust is a global talent network that connects top independent professionals with leading companies for high-quality, flexible work. We help organizations hire skilled talent faster while giving professionals access to vetted opportunities with innovative teams.

Job Description

This is a contract engagement, initially for 6 months, with potential for a long-term extension.

Location: Paris preferred; remote within Europe is possible for strong candidates.

We are building and evaluating state-of-the-art large language models (LLMs) and are looking for experienced software engineers to join our evaluation and annotation team. This role sits at the intersection of real-world software engineering, model evaluation, and applied AI, and is critical to improving model reliability, reasoning, and code quality.

You will design challenging coding tasks, evaluate model outputs against rigorous benchmarks, identify failure modes, and contribute to reinforcement learning and model improvement workflows.

This is not a junior annotation role. We are looking for practitioners with deep hands-on coding experience who can think like both an engineer and an evaluator.

What You’ll Do

Create high-quality coding prompts and reference answers (benchmark-style, e.g. SWE-bench-like problems).
Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
Identify and document model failures, edge cases, and reasoning gaps.
Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
Build or configure coding environments to support evaluation and reinforcement learning (RL).
Follow detailed annotation and evaluation guidelines with high consistency.

What We’re Looking For

10+ years of professional software development experience.
Strong Python skills (required).
Knowledge of at least one additional programming language (bonus).
1+ year of coding annotation and/or LLM evaluation experience (part-time OK) for a major frontier AI lab or AI infrastructure company.
Prior code review experience is a plus.
Proven ability to apply structured evaluation criteria and write clear technical feedback.
Fluent in English (written and spoken).
Team lead or mentoring experience is a strong plus.

Why This Role

Work hands-on with cutting-edge LLMs.
Apply real-world engineering judgment to model evaluation and improvement.
High-impact, technical work with a focused, senior team.

Skills

Python, RL
