Goal
Use 4 LLMs to (1) extract synthesis recipes and (2) act as judges of those extracted recipes. Human annotators will score each extracted recipe. We will:
- select the LLM judge with the highest agreement with human scores, then
- select the LLM extractor that performs best under the chosen judge (and human scores).
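The two selection steps above could be sketched as follows. This is a minimal illustration, not a decided design: it assumes scores are numeric and keyed by (extractor, recipe id), uses Pearson correlation as the agreement measure (a rank correlation or a kappa statistic may suit the final rubric better), and ranks extractors by mean score. All names (`select_judge`, `select_extractor`, `judge_x`, `extractor_a`, the toy scores) are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined; treated here as no agreement
    return cov / sqrt(var_x * var_y)


def select_judge(human, judges):
    """Return the judge whose scores correlate best with human scores."""
    keys = sorted(human)  # fixed item order shared by all score vectors
    human_vec = [human[k] for k in keys]
    agreement = {
        name: pearson(human_vec, [scores[k] for k in keys])
        for name, scores in judges.items()
    }
    return max(agreement, key=agreement.get), agreement


def select_extractor(scores):
    """Return the extractor with the highest mean score."""
    by_extractor = {}
    for (extractor, _recipe_id), score in scores.items():
        by_extractor.setdefault(extractor, []).append(score)
    means = {name: mean(vals) for name, vals in by_extractor.items()}
    return max(means, key=means.get), means


# Toy data (made up for illustration only).
human = {
    ("extractor_a", "r1"): 4, ("extractor_a", "r2"): 5,
    ("extractor_b", "r1"): 2, ("extractor_b", "r2"): 3,
}
judges = {
    "judge_x": {("extractor_a", "r1"): 4, ("extractor_a", "r2"): 5,
                ("extractor_b", "r1"): 1, ("extractor_b", "r2"): 3},
    "judge_y": {("extractor_a", "r1"): 3, ("extractor_a", "r2"): 2,
                ("extractor_b", "r1"): 4, ("extractor_b", "r2"): 3},
}

best_judge, agreement = select_judge(human, judges)
best_extractor, means = select_extractor(judges[best_judge])
human_best_extractor, _ = select_extractor(human)
```

Comparing `best_extractor` with `human_best_extractor` gives a quick check that the chosen judge and the human annotators rank the extractors the same way.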
Key requirement
The annotation interface must be seamless for annotators (low-friction workflow, clear rubric, minimal clicks, reliable progress tracking).
Evaluation data
Use the existing annotations in this repo:
Notes
Subsequent tasks are tracked as linked sub-issues.