Software Engineer, LLM Evaluation
Braintrust
About the role
About Braintrust
Braintrust is a global talent network that connects top independent professionals with leading companies for high-quality, flexible work. We help organizations hire skilled talent faster while giving professionals access to vetted opportunities with innovative teams.
Job Description
This is a contract engagement, initially for 6 months, with potential for a long-term extension.
Location: Paris-based preferred; remote within Europe is possible for strong candidates.
We are building and evaluating state-of-the-art large language models (LLMs) and are looking for experienced software engineers to join our evaluation and annotation team. This role sits at the intersection of real-world software engineering, model evaluation, and applied AI, and is critical to improving model reliability, reasoning, and code quality.
You will design challenging coding tasks, evaluate model outputs against rigorous benchmarks, identify failure modes, and contribute to reinforcement learning and model improvement workflows.
This is not a junior annotation role. We are looking for practitioners with deep hands-on coding experience who can think like both an engineer and an evaluator.
What You’ll Do
- Create high-quality coding prompts and reference answers (benchmark-style, e.g. SWE-bench-like problems).
- Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
- Identify and document model failures, edge cases, and reasoning gaps.
- Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
- Build or configure coding environments to support evaluation and reinforcement learning (RL).
- Follow detailed annotation and evaluation guidelines with high consistency.
What We’re Looking For
- 10+ years of professional software development experience.
- Strong Python skills (required).
- Knowledge of at least one additional programming language (bonus).
- 1+ year of coding annotation and/or LLM evaluation experience (part-time OK) at a major frontier AI lab or AI infrastructure company.
- Prior code review experience is a plus.
- Proven ability to apply structured evaluation criteria and write clear technical feedback.
- Fluent in English (written and spoken).
- Team lead or mentoring experience is a strong plus.
Why This Role
- Work hands-on with cutting-edge LLMs.
- Apply real-world engineering judgment to model evaluation and improvement.
- High-impact, technical work with a focused, senior team.