Eval-Framework — LLM-as-a-Judge with bias correction
Our framework for systematic evaluation of LLM outputs: pairwise comparisons, bias corrections (position swap, verbosity, self-preference), and calibration against human baselines, with Spearman's ρ and Krippendorff's α as acceptance criteria. We use it for RAG-Wissen quality gates, Nexbid content reviews, and advisory mandates where customers need objective AI quality measurement.
Why LLM-as-a-Judge — and why with bias correction
Subjective output quality (helpfulness, completeness, tone) does not scale with human review: a sample of 200 answers needs a day of reviewer time, which makes daily quality gates economically unworkable. LLM-as-a-Judge solves the scaling problem but introduces three systematic biases: position bias (the answer listed first in a comparison is preferred), verbosity bias (longer answers are overrated), and self-preference (a model prefers its own outputs). Our framework corrects all three with documented methods: position-swap averaging, verbosity normalisation, and generator/judge separation.
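As an illustration of position-swap averaging, here is a minimal Python sketch. It is not the framework's actual code: the `Judge` signature (a callable that returns a score in [0, 1] for how strongly the first-listed answer is preferred) and the function name `position_swapped_preference` are assumptions made for this example.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# (question, first_answer, second_answer) -> score in [0, 1],
# where 1.0 means "the FIRST listed answer is clearly better".
Judge = Callable[[str, str, str], float]


def position_swapped_preference(
    judge: Judge, question: str, answer_a: str, answer_b: str
) -> float:
    """Return the preference for answer_a in [0, 1], averaged over both orders."""
    a_first = judge(question, answer_a, answer_b)  # A is shown first
    b_first = judge(question, answer_b, answer_a)  # B is shown first
    # In the swapped call the score refers to B, so A's share is 1 - b_first.
    return (a_first + (1.0 - b_first)) / 2.0


if __name__ == "__main__":
    # A purely position-biased judge that always favours the first slot.
    def biased_judge(question: str, first: str, second: str) -> float:
        return 0.8

    print(position_swapped_preference(biased_judge, "q", "answer A", "answer B"))
```

A judge that only reacts to position ends up at a neutral 0.5 after averaging, while a judge that consistently prefers one answer in both orders keeps its verdict. Verbosity normalisation and generator/judge separation are not covered by this sketch.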
Calibration against human baselines
A judge model is only usable if its judgments correlate with human judgments. We measure this with Spearman's rank correlation (acceptance threshold ρ ≥ 0.7) and Krippendorff's alpha (acceptance threshold α ≥ 0.67). Current state: prometheus2:7b reaches ρ = 0.90 for relevance and ρ = 0.80 for faithfulness in our setup, which is good enough for production quality gates without a human in the loop. To detect model drift before it affects production, we inject 5 calibration samples every 50 evaluations.
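The calibration check and the drift injection can be sketched in Python roughly as follows. This is a hedged illustration using only the standard library, not the framework's implementation: the function names are hypothetical, Krippendorff's alpha is left out, and the scheduling rule (5 extra calibration items appended after each block of 50 production items) is one possible reading of "5 calibration samples every 50 evaluations".

```python
from statistics import mean

RHO_THRESHOLD = 0.7  # acceptance threshold for Spearman's rho stated in the text


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def passes_calibration(judge_scores, human_scores, threshold=RHO_THRESHOLD):
    """True if the judge's ranking agrees with the human ranking well enough."""
    return spearman_rho(judge_scores, human_scores) >= threshold


def interleave_calibration(items, calibration_pool, batch=50, n_cal=5):
    """Yield (kind, item) pairs, adding n_cal calibration items after each batch.

    Assumption: calibration items come on top of the production items and are
    drawn round-robin from a pool with known human scores.
    """
    cal_idx = 0
    for count, item in enumerate(items, start=1):
        yield ("production", item)
        if count % batch == 0:
            for _ in range(n_cal):
                yield ("calibration", calibration_pool[cal_idx % len(calibration_pool)])
                cal_idx += 1
```

In such a setup, the judge's scores on the calibration items would be compared against their stored human scores with `passes_calibration`, and a failed check would flag possible drift.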
Practice proof for AI quality mandates
Anyone running AI pipelines in SME production needs objective quality measurement, and subjective reviews stop scaling by the second production pipeline at the latest. Our eval framework gives customers a starting point: pre-configured rubrics for common use cases (customer service, knowledge answering, content generation), example datasets, and documented bias corrections. We set the framework up against our customers' use cases and hand it over within the mandate as an in-house capability. The goal is not vendor lock-in but internal AI competence.