model-graded evaluation

AI / ML · external call

(Generation, ReferenceAnswer, Rubric) -> Score

Grade a target model's output using a separate referee model prompted with a grading rubric, target output, and optional reference answer.
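
A minimal sketch of this mechanism, assuming the referee model is reachable through a plain `prompt -> reply` callable. The names `build_grading_prompt`, `parse_score`, `grade`, and `stub_referee`, the prompt wording, and the 1-5 scale are all illustrative assumptions, not the openai/evals API. The stub stands in for the external model call so the example runs offline.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt layout: rubric first, optional reference, then the output to grade.
PROMPT_TEMPLATE = (
    "You are grading a model's answer.\n\n"
    "Rubric:\n{rubric}\n\n"
    "{reference_block}"
    "Answer to grade:\n{generation}\n\n"
    "Explain your reasoning, then end with a line of the form 'Score: <integer 1-5>'."
)


def build_grading_prompt(
    generation: str, rubric: str, reference_answer: Optional[str] = None
) -> str:
    """Assemble the referee prompt; the reference section is omitted when absent."""
    reference_block = (
        f"Reference answer:\n{reference_answer}\n\n" if reference_answer else ""
    )
    return PROMPT_TEMPLATE.format(
        rubric=rubric, reference_block=reference_block, generation=generation
    )


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> int:
    """Extract the last 'Score: N' in the reply and check it is on the scale."""
    matches = re.findall(r"Score:\s*(\d+)", reply)
    if not matches:
        raise ValueError("referee reply contains no 'Score: N' line")
    score = int(matches[-1])
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside the range {lo}-{hi}")
    return score


def grade(
    generation: str,
    rubric: str,
    referee: Callable[[str], str],
    reference_answer: Optional[str] = None,
) -> int:
    """(Generation, ReferenceAnswer, Rubric) -> Score, via one referee call."""
    prompt = build_grading_prompt(generation, rubric, reference_answer)
    return parse_score(referee(prompt))


def stub_referee(prompt: str) -> str:
    """Stand-in for the external referee model call (assumption for this sketch)."""
    return "The answer agrees with the reference.\nScore: 5"


if __name__ == "__main__":
    rubric = "Give 5 if the answer is factually equivalent to the reference, 1 if not."
    print(grade("The capital of France is Paris.", rubric, stub_referee, "Paris"))
```

In practice `stub_referee` would be replaced by a function that sends the prompt to the referee model and returns its text reply. Parsing the last `Score:` line lets the referee write its reasoning first, and rejecting out-of-range or missing scores keeps malformed replies from being counted silently.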

Problem it solves

Programmatic metrics like exact match or BLEU fail to capture semantic accuracy in open-ended generations, while human evaluation is slow and expensive.

Consumes

Generation, ReferenceAnswer, Rubric

Emits

Score

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

openai/evals (GitHub)