feat: evaluation of recipe extraction and LLM as a judge #197

Description

@sid-betalol

Goal

Use 4 LLMs to (1) extract synthesis recipes and (2) act as judges of those extracted recipes. Human annotators will score each extracted recipe. We will:

  • select the LLM judge whose scores agree most closely with the human scores, then
  • select the LLM extractor that scores best under the chosen judge (and under the human scores).
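
The two selection steps above can be sketched as follows. This is only an illustration under stated assumptions: the issue does not specify an agreement metric, so Pearson correlation is used here as a stand-in (Spearman correlation or Cohen's kappa may suit ordinal rubric scores better). The function names, data layout, and toy scores are all hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined without variation; treat as no agreement.
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def select_judge(human_scores, judge_scores):
    """Pick the judge whose scores correlate best with the human scores.

    human_scores: {recipe_id: score}
    judge_scores: {judge_name: {recipe_id: score}}
    """
    ids = sorted(human_scores)
    human = [human_scores[i] for i in ids]
    agreement = {
        judge: pearson(human, [scores[i] for i in ids])
        for judge, scores in judge_scores.items()
    }
    return max(agreement, key=agreement.get), agreement


def select_extractor(recipe_extractor, scores):
    """Pick the extractor with the highest mean score.

    recipe_extractor: {recipe_id: extractor_name}
    scores: {recipe_id: score}, from the chosen judge or from humans
    """
    by_extractor = {}
    for recipe_id, extractor in recipe_extractor.items():
        by_extractor.setdefault(extractor, []).append(scores[recipe_id])
    means = {name: mean(vals) for name, vals in by_extractor.items()}
    return max(means, key=means.get), means


# Hypothetical toy data: four extracted recipes, two judges, two extractors.
human = {"r1": 5, "r2": 3, "r3": 1, "r4": 4}
judges = {
    "judge_a": {"r1": 4, "r2": 3, "r3": 2, "r4": 4},
    "judge_b": {"r1": 2, "r2": 4, "r3": 5, "r4": 1},
}
produced_by = {"r1": "ext_x", "r2": "ext_x", "r3": "ext_y", "r4": "ext_y"}

best_judge, agreement = select_judge(human, judges)
best_by_judge, _ = select_extractor(produced_by, judges[best_judge])
best_by_human, _ = select_extractor(produced_by, human)
```

Ranking extractors under both the chosen judge and the human scores, as in the last two lines, makes it easy to check whether the two rankings agree before relying on the judge alone.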

Key requirement

The annotation interface must be seamless for annotators (low-friction workflow, clear rubric, minimal clicks, reliable progress tracking).

Evaluation data

Use the existing annotations in this repo:

Notes

Subsequent tasks are tracked as linked sub-issues.
