feat: evaluation of recipe extraction and LLM as a judge

## Goal
Use 4 LLMs to (1) extract synthesis recipes and (2) act as judges of those extracted recipes. Human annotators will score each extracted recipe. We will:
- select the **LLM judge** with the highest agreement with human scores, then
- select the **LLM extractor** that scores best under the chosen judge and under the human scores.
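
The two selection steps above could be sketched roughly as follows. This is a minimal illustration, not a decided design: the issue does not fix an agreement metric, so Pearson correlation is an assumption here (a rank correlation or Cohen's kappa may fit ordinal scores better), and the names `select_judge`, `select_extractor`, `judge_a`, `extractor_x`, as well as all scores, are hypothetical toy values rather than data from the repo.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores give no usable agreement signal
    return cov / sqrt(var_x * var_y)


def select_judge(human_scores, judge_scores):
    """Pick the judge whose scores agree most with the human scores.

    Assumption: "agreement" is measured as Pearson correlation.
    """
    agreement = {
        name: pearson(human_scores, scores)
        for name, scores in judge_scores.items()
    }
    return max(agreement, key=agreement.get), agreement


def select_extractor(scores_by_extractor):
    """Pick the extractor with the highest mean score from one scorer."""
    means = {name: mean(s) for name, s in scores_by_extractor.items()}
    return max(means, key=means.get), means


# Toy data (made up): one score per extracted recipe.
human = [1, 2, 3, 4, 5]
judges = {
    "judge_a": [1, 2, 3, 4, 5],
    "judge_b": [5, 4, 3, 2, 1],
    "judge_c": [3, 3, 3, 3, 3],
}
best_judge, agreement = select_judge(human, judges)

# Scores assigned by the chosen judge to each extractor's recipes.
extractor_scores = {
    "extractor_x": [4, 5, 3],
    "extractor_y": [2, 3, 3],
}
best_extractor, extractor_means = select_extractor(extractor_scores)
print(best_judge, best_extractor)
```

Running `select_extractor` a second time on the human scores would show whether the chosen judge and the annotators rank the extractors the same way.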

## Key requirement
The annotation interface must be **seamless for annotators** (low-friction workflow, clear rubric, minimal clicks, reliable progress tracking).

## Evaluation data
Use the existing annotations in this repo:
- https://github.com/LeMaterial/lematerial-llm-synthesis/tree/main/annotations

## Notes
Subsequent tasks are tracked as linked sub-issues.
