Eval-Framework — LLM-as-a-Judge
Our framework for systematic evaluation of LLM outputs — with pairwise comparisons, bias corrections, and calibration against human baselines.
Python 3.13 · SQLite · Pydantic · Prometheus2-Judge
Why LLM-as-a-Judge matters
Human review of subjective output quality doesn't scale. Our framework uses prometheus2:7b as the judge; it reaches Spearman ρ = 0.90 against human relevance ratings, which is enough for productive quality gates.
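The pairwise comparisons with bias correction mentioned above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual API: `judge` stands in for any scoring call against the judge model (here a stub, since no model is invoked), and the position-bias correction is the common swap-and-average trick of querying both answer orders.

```python
from typing import Callable

def pairwise_verdict(judge: Callable[[str, str, str], float],
                     prompt: str, a: str, b: str) -> float:
    """Return a preference score in [0, 1] for answer `a` over `b`.

    The judge is queried in both presentation orders and the results are
    averaged, so an additive positional preference (e.g. always slightly
    favouring whichever answer is shown first) cancels out.
    """
    forward = judge(prompt, a, b)           # a shown in the first slot
    backward = 1.0 - judge(prompt, b, a)    # b shown first, score flipped back
    return (forward + backward) / 2.0

if __name__ == "__main__":
    # Hypothetical stub judge with a deliberate +0.1 first-slot bias.
    def biased_judge(prompt: str, first: str, second: str) -> float:
        base = 0.7 if "relevant" in first else 0.3
        return base + 0.1

    score = pairwise_verdict(biased_judge, "Q?", "relevant answer", "off-topic")
    print(round(score, 2))  # the +0.1 bias cancels; the base 0.7 remains
```

The same idea extends to other order-sensitive biases (e.g. verbosity) by averaging over matched presentations rather than correcting a single query.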
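Calibration against human baselines, as reported above, can be checked with Spearman rank correlation. A self-contained sketch using only the standard library (no tie correction, for illustration; the example data is invented):

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation, assuming no tied values."""
    def ranks(values: list[float]) -> list[int]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

if __name__ == "__main__":
    human = [1, 2, 3, 4, 5]   # hypothetical human relevance ratings
    judge = [1, 3, 2, 4, 5]   # judge agrees except for one swapped pair
    print(round(spearman_rho(human, judge), 2))  # → 0.9
```

In practice one would use `scipy.stats.spearmanr`, which also handles ties; the point here is only what the ρ = 0.90 figure measures: rank agreement between judge scores and human ratings.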