Production evaluation, not vibes. Fifteen curated tests run in parallel against the model — seven scoring methods, three of which are deterministic, four of which use the model itself or pattern-matching. Pass/fail per test, accuracy by category, hallucination flags, regression vs a saved baseline. The shape of the eval methodology that turns "the AI works" from a guess into a measurement.
Test runner (pure JS, no library; sketched below):
- 15 tests defined as { id, category, prompt, scoring }
- Concurrency pool: 6 workers process the queue
- Each worker:
1. POST /api/lab/chat (the test prompt)
2. score the response client-side (deterministic) OR
POST /api/lab/chat again with a judge rubric (LLM-as-judge)
3. emit { pass, score, detail } and update the row
- Aggregate panel updates after every test completes
- Total wall time: ~10-15s for 15 tests (vs ~60s sequential)
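A minimal sketch of that runner, assuming the chat endpoint returns a `{ text }` payload and that `scoreResponse` and `renderRow` are hypothetical helpers standing in for step 2 and the per-row UI update:

```js
// Hypothetical names throughout (TESTS shape, scoreResponse, renderRow):
// a sketch of the worker-pool pattern, not the demo's exact code.
async function runEvals(tests, concurrency = 6) {
  const queue = [...tests];          // tests: [{ id, category, prompt, scoring }, ...]
  const results = [];

  async function worker() {
    while (queue.length > 0) {
      const test = queue.shift();    // claim the next test (shift is sync, so no races)
      const started = performance.now();
      const res = await fetch("/api/lab/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt: test.prompt }),
      });
      const { text } = await res.json();          // assumed { text } response shape
      // Deterministic tests score client-side; rubric tests call the model again as judge.
      const result = await scoreResponse(test, text);
      result.latencyMs = performance.now() - started;
      results.push({ id: test.id, category: test.category, ...result });
      renderRow(results.at(-1));                   // update the per-test row as it lands
    }
  }

  // Six workers drain the shared queue in parallel.
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

With 15 tests and 6 workers, the slowest worker handles about three tests back to back, which is roughly where the ~10-15s wall time comes from.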
Baseline (localStorage; sketched below):
- Save current run as baseline → stored under "lab.eval.baseline.v1"
- Subsequent runs show ↑/↓ deltas vs baseline accuracy, latency, cost
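A sketch of that round-trip, assuming the run is aggregated into a summary object with `accuracy`, `avgLatencyMs`, and `costUsd` fields (names are illustrative):

```js
const BASELINE_KEY = "lab.eval.baseline.v1";

function saveBaseline(summary) {
  // summary: { accuracy, avgLatencyMs, costUsd } aggregated from the current run
  localStorage.setItem(BASELINE_KEY, JSON.stringify(summary));
}

function deltasVsBaseline(summary) {
  const raw = localStorage.getItem(BASELINE_KEY);
  if (!raw) return null;                           // no baseline saved yet
  const baseline = JSON.parse(raw);
  return {
    accuracy: summary.accuracy - baseline.accuracy,
    avgLatencyMs: summary.avgLatencyMs - baseline.avgLatencyMs,
    costUsd: summary.costUsd - baseline.costUsd,
  };
}
```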
17-18 chat calls per full run (15 tests + 2-3 LLM-judge passes for the rubric-scored tests). Cache hits make repeat runs ~free. The concurrency cap of 6 keeps us under the per-IP rate limit while still being fast.
Real production eval suites mix these the same way: most checks are deterministic (cheap, reproducible, fast); the small number that need judgement use LLM-as-judge with a calibrated rubric. Pure LLM-as-judge on every test is slow and expensive, and its variance can mask real regressions.
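For concreteness, here is roughly what the two scoring paths can look like. The JSON-shape check, the rubric wording, and the pass threshold are all illustrative, not the demo's actual criteria:

```js
// Deterministic check: a structural assertion, no model call needed.
function scoreJsonShape(test, text) {
  try {
    const parsed = JSON.parse(text);
    const pass = typeof parsed.answer === "string";   // assumed expected shape
    return { pass, score: pass ? 1 : 0, detail: "valid JSON with answer field" };
  } catch {
    return { pass: false, score: 0, detail: "response was not valid JSON" };
  }
}

// LLM-as-judge: a second /api/lab/chat call carrying a rubric.
async function scoreWithJudge(test, text) {
  // test.criteria is a hypothetical field holding the per-test rubric criteria.
  const rubric =
    "Score the ANSWER from 0-10 against the CRITERIA. " +
    'Reply with JSON only: {"score": n, "reason": "..."}.\n' +
    `CRITERIA: ${test.criteria}\nANSWER: ${text}`;
  const res = await fetch("/api/lab/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: rubric }),
  });
  const { text: judgeText } = await res.json();        // assumed { text } response shape
  try {
    const { score, reason } = JSON.parse(judgeText);
    return { pass: score >= 7, score: score / 10, detail: reason };  // threshold is illustrative
  } catch {
    return { pass: false, score: 0, detail: "judge reply was not valid JSON" };
  }
}
```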
This is a 15-test demo, not a production eval system; real production eval suites go considerably further.
What this demo gets right: the shape of the harness, the mix of scoring methods, and the baseline-comparison loop. Most teams don't have any eval at all — this is what the first useful one looks like.