elo-ranking-evaluator

AI / ML

transform

List<ComparisonResult> -> Map<ModelID, EloScore>

Calculate relative Elo rating changes for LLMs based on win/loss outcomes from blind A/B comparisons.
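A minimal sketch of how such an evaluator might work, assuming the standard Elo update rule. The field names on `ComparisonResult` (`model_a`, `model_b`, `score_a`), the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, not details taken from the source project.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ComparisonResult:
    """One blind A/B comparison. Field names are hypothetical."""
    model_a: str
    model_b: str
    score_a: float  # 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(
    results: Iterable[ComparisonResult],
    k: float = 32.0,         # assumed K-factor
    initial: float = 1000.0  # assumed starting rating for unseen models
) -> Dict[str, float]:
    """Apply sequential Elo updates and return a rating per model ID."""
    ratings: Dict[str, float] = {}
    for r in results:
        ra = ratings.setdefault(r.model_a, initial)
        rb = ratings.setdefault(r.model_b, initial)
        # Rating change is proportional to how surprising the outcome was.
        delta = k * (r.score_a - expected_score(ra, rb))
        ratings[r.model_a] = ra + delta
        ratings[r.model_b] = rb - delta  # zero-sum: B loses what A gains
    return ratings


if __name__ == "__main__":
    demo = [
        ComparisonResult("model-x", "model-y", 1.0),
        ComparisonResult("model-y", "model-z", 0.5),
    ]
    # Sort descending by rating to produce a leaderboard view.
    for model, rating in sorted(elo_ratings(demo).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.1f}")
```

Sequential Elo updates depend on the order of comparisons, so a production evaluator may shuffle the results and average over several passes, or fit all ratings jointly instead.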

Problem it solves

Subjective human evaluation of generative text outputs is difficult to quantify consistently.

Consumes

ComparisonResults

Emits

ModelLeaderboard

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.