Prüfstand — AI Testing Framework
Prüfstand is our framework for systematically testing AI systems: prompt variations, model comparisons, quality scoring via LLM-as-a-Judge, and regression detection over time. We use it for QA of our own products and in advisory work for customers who want to harden their AI pipelines before going to production.
Why systematic testing looks different for AI systems
Classical software is deterministic: same input, same output. AI systems are not. The same prompt can yield different answers from one day to the next, a model update can quietly shift quality, and subjective dimensions (tone, helpfulness, completeness) cannot be covered with assert statements. Prüfstand orchestrates test suites with prompt variants, runs them against multiple models in parallel, and scores the answers via LLM-as-a-Judge against a defined rubric. The result: AI drift becomes a measurable quantity instead of a gut feeling.
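To make the loop concrete, here is a minimal sketch of a suite run: every prompt variant goes to every model, and a judge scores each answer against a rubric. All names (run_suite, Result, the stand-in models and the toy judge) are hypothetical and are not Prüfstand's actual API. The sketch runs sequentially for readability, whereas a real run would dispatch the model calls in parallel and use an LLM as the judge.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean
from typing import Callable

# Hypothetical types: a model maps a prompt to an answer,
# a judge maps (prompt, answer, rubric) to a score from 1 to 5.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], int]


@dataclass
class Result:
    model: str
    prompt: str
    answer: str
    score: int


def run_suite(prompts: list[str], models: dict[str, Model],
              judge: Judge, rubric: str) -> list[Result]:
    """Run every prompt variant against every model and score each answer."""
    results = []
    for (name, model), prompt in product(models.items(), prompts):
        answer = model(prompt)
        results.append(Result(name, prompt, answer, judge(prompt, answer, rubric)))
    return results


def mean_score_by_model(results: list[Result]) -> dict[str, float]:
    """Aggregate judge scores into one comparable number per model."""
    names = {r.model for r in results}
    return {n: mean(r.score for r in results if r.model == n) for n in names}


# Stand-ins so the sketch runs without any model backend.
def terse_model(prompt: str) -> str:
    return "Yes."


def verbose_model(prompt: str) -> str:
    return "Yes, and here is the reasoning behind that answer in detail."


def length_judge(prompt: str, answer: str, rubric: str) -> int:
    # Toy rubric: longer answers count as more complete.
    return 5 if len(answer) > 20 else 2


results = run_suite(
    prompts=["Is the invoice overdue?", "Tell me whether the invoice is overdue."],
    models={"terse": terse_model, "verbose": verbose_model},
    judge=length_judge,
    rubric="Completeness, 1 (poor) to 5 (excellent)",
)
scores = mean_score_by_model(results)
```

Because the judge is just a callable, the same loop works with a toy function in a unit test and with a locally hosted judge model in a real run.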
Architecture — GUI for sparring, CLI for CI
Prüfstand has two tiers: an Electron GUI for the exploratory phase (trying out prompts, comparing models, adjusting rubrics) and a Python backend with pytest hooks for the CI/CD phase (quality gates on deploy, drift alarms over time). The same test definition runs in both tiers, so whatever proves itself in sparring moves into the CI pipeline without a translation step. Results are stored in a local SQLite database, and the judge is prometheus2:7b running locally via Ollama, which keeps test data private: no cloud calls are made with it.
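As an illustration of the CI tier, the following sketch stores judge scores in SQLite and expresses a quality gate as an ordinary pytest test function. The table layout, the helper names (open_db, record, mean_score) and the 3.5 threshold are assumptions made for this example, not Prüfstand's real schema or hooks. The scores are hard-coded here, whereas in practice they would come from the judge.

```python
import sqlite3

# Hypothetical schema: one row per judged answer.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    run_id TEXT NOT NULL,
    model  TEXT NOT NULL,
    prompt TEXT NOT NULL,
    score  REAL NOT NULL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local result database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn: sqlite3.Connection, run_id: str, model: str,
           prompt: str, score: float) -> None:
    """Persist one judged answer."""
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                 (run_id, model, prompt, score))
    conn.commit()


def mean_score(conn: sqlite3.Connection, run_id: str, model: str) -> float:
    """Average judge score of one model within one run."""
    row = conn.execute(
        "SELECT AVG(score) FROM results WHERE run_id = ? AND model = ?",
        (run_id, model),
    ).fetchone()
    if row[0] is None:
        raise LookupError(f"no results for run {run_id!r}, model {model!r}")
    return row[0]


def test_quality_gate() -> None:
    """pytest collects this function; a failing assert blocks the deploy."""
    conn = open_db()
    for prompt, score in [("variant-a", 4.0), ("variant-b", 5.0), ("variant-c", 3.0)]:
        record(conn, "run-1", "candidate", prompt, score)
    # Assumed gate: the mean judge score must not fall below 3.5.
    assert mean_score(conn, "run-1", "candidate") >= 3.5
```

Keeping every run in the database, rather than only the latest one, is what makes comparisons over time possible later on.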
Proven in practice for AI quality mandates
When we support customers on their first AI projects, the question 'how do we measure whether this is good enough' usually comes up too late: the pilot is already running and subjective impressions diverge. With Prüfstand we can establish a shared measurement setup from day one that makes the question objective: what is the baseline, what counts as 'good enough', and how do we know when quality gets worse? It translates directly between engineering vocabulary ('latency, throughput, p95') and business vocabulary ('quality, trust, brand consistency').
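The three questions above can be written down as explicit numbers. The sketch below is one possible way to do that, assuming a mean judge score on a 1 to 5 scale. QualityTargets, assess and all threshold values are hypothetical and would be agreed with the customer rather than taken from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityTargets:
    baseline: float         # mean judge score measured at project start
    good_enough: float      # minimum score the business accepts
    drift_tolerance: float  # allowed drop below baseline before raising an alarm


def assess(score: float, targets: QualityTargets) -> str:
    """Classify a mean judge score against the agreed targets."""
    if score < targets.good_enough:
        return "below_threshold"
    if score < targets.baseline - targets.drift_tolerance:
        return "drifting"
    return "ok"


# Illustrative values only.
targets = QualityTargets(baseline=4.2, good_enough=3.5, drift_tolerance=0.3)

status_today = assess(4.1, targets)          # close to baseline
status_after_update = assess(3.7, targets)   # still acceptable, but worse than before
status_broken = assess(3.2, targets)         # no longer acceptable
```

Separating "drifting" from "below_threshold" lets a team react to a gradual decline before quality drops under the level the business has accepted.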