
Eval-Framework — LLM-as-a-Judge

Our framework for systematic evaluation of LLM outputs — with pairwise comparisons, bias corrections, and calibration against human baselines.
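One of the named bias corrections can be sketched in a few lines: querying the judge twice with the answer order swapped, and only counting a verdict that survives the swap. The function `debiased_pairwise` and the judge protocol here are illustrative assumptions, not the framework's actual API:

```python
from typing import Callable

def debiased_pairwise(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Query the judge twice with swapped positions to cancel position bias.

    `judge(x, y)` returns "first" or "second", naming the preferred answer.
    Only a verdict that is consistent across both orderings counts; an
    order-dependent verdict is treated as a tie.
    """
    v1 = judge(a, b)                  # a shown in the first position
    v2 = judge(b, a)                  # b shown in the first position
    a_wins_first_run = v1 == "first"
    a_wins_second_run = v2 == "second"
    if a_wins_first_run and a_wins_second_run:
        return "a"
    if not a_wins_first_run and not a_wins_second_run:
        return "b"
    return "tie"                      # verdict flipped with order: position bias

# A maximally position-biased judge always prefers whatever comes first,
# so the swap test exposes it as a tie:
biased = lambda x, y: "first"
print(debiased_pairwise(biased, "answer A", "answer B"))  # → tie
```

A real judge call would render both answers into a comparison prompt; the swap logic stays the same.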

Python 3.13 · SQLite · Pydantic · Prometheus2-Judge

Why LLM-as-a-Judge matters

Assessing subjective output quality by hand doesn't scale. Our framework uses prometheus2:7b as the judge, reaching Spearman ρ = 0.90 against human ratings on relevance, which is enough for productive quality gates.