Eval Harness Visualizer

Production evaluation, not vibes. Fifteen curated tests run in parallel against the model, scored by seven methods: six deterministic checks (string matching, JSON parsing, sentence counting, numeric tolerance) plus one LLM-as-judge. Pass/fail per test, accuracy by category, hallucination flags, regression vs a saved baseline. The shape of the eval methodology that turns "the AI works" from a guess into a measurement.

15 tests · 7 categories · 7 scoring methods · ready
Architecture — what just happened
Test runner (pure JS, no library):
  - 15 tests defined as { id, category, prompt, scoring }
  - Concurrency pool: 6 workers process the queue
  - Each worker:
       1. POST /api/lab/chat (the test prompt)
       2. score the response client-side (deterministic) OR
          POST /api/lab/chat again with a judge rubric (LLM-as-judge)
       3. emit { pass, score, detail } and update the row
  - Aggregate panel updates after every test completes
  - Total wall time: ~10-15s for 15 tests (vs ~60s sequential)
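
A minimal sketch of that runner loop. The test object shape and the concurrency cap of 6 come from the list above; runTest, scoreResponse, and the exact request payload are illustrative assumptions, not the demo's actual source:

```js
// Sketch: concurrency-pool runner. Names and payload shape are illustrative.
const CONCURRENCY = 6;

const TESTS = [
  { id: "fact-1", category: "factual", prompt: "What is the capital of France?",
    scoring: { method: "contains_any", targets: ["paris"] } },
  // ...14 more test definitions
];

async function runTest(test) {
  const started = performance.now();
  const res = await fetch("/api/lab/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: test.prompt }),
  });
  const { text } = await res.json();
  // scoreResponse() applies the deterministic check, or makes a second
  // /api/lab/chat call with a judge rubric for llm_judge tests.
  const scored = await scoreResponse(test, text);
  return { id: test.id, latencyMs: performance.now() - started, ...scored };
}

async function runAll(tests, onResult) {
  const queue = [...tests];
  const results = [];
  // Six "workers": async loops pulling from a shared queue. JS is single-threaded,
  // so shift() between awaits is safe without locking.
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    while (queue.length) {
      const result = await runTest(queue.shift());
      results.push(result);
      onResult(result); // update the row and the aggregate panel
    }
  });
  await Promise.all(workers);
  return results;
}
```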

Baseline (localStorage):
  - Save current run as baseline → stored under "lab.eval.baseline.v1"
  - Subsequent runs show ↑/↓ deltas vs baseline accuracy, latency, cost
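
The round-trip is small enough to sketch in a few lines. The storage key is the one above; the summary field names (accuracy, avgLatencyMs, totalCost) are assumptions for illustration:

```js
// Sketch: save a run summary as the baseline and diff later runs against it.
const BASELINE_KEY = "lab.eval.baseline.v1";

function saveBaseline(summary) {
  localStorage.setItem(BASELINE_KEY, JSON.stringify(summary));
}

function deltasVsBaseline(summary) {
  const raw = localStorage.getItem(BASELINE_KEY);
  if (!raw) return null; // no baseline saved yet
  const baseline = JSON.parse(raw);
  return {
    accuracy: summary.accuracy - baseline.accuracy,             // rendered as ↑/↓
    avgLatencyMs: summary.avgLatencyMs - baseline.avgLatencyMs,
    totalCost: summary.totalCost - baseline.totalCost,
  };
}
```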

17-18 chat calls per full run (15 tests + 2-3 LLM-judge passes for the rubric-scored tests). Cache hits make repeat runs ~free. Concurrency cap of 6 keeps us under the per-IP rate limit while still being fast.

The seven scoring methods
  • contains_any — pass if response contains any of the listed strings (case-insensitive). Fastest factual check.
  • contains_all — pass if response contains all listed strings. Used for required terms.
  • refusal_check — composite: response must NOT match a forbidden pattern AND must contain at least one refusal phrase. Used for safety tests.
  • json_parse — strip code fences, parse, verify required keys / type. Used for structured-output tests.
  • sentence_count — split on sentence boundaries, count, check inside [min, max]. Used for format-compliance tests.
  • numeric_close — extract numbers, find closest to expected, pass if within tolerance. Used for math tests.
  • llm_judge — separate LLM call with a versioned rubric, returns structured JSON {pass, score, reason}. Used where deterministic checks aren't enough.
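
To make the deterministic side concrete, here are plausible versions of three of the scorers. The return shape follows the { pass, score, detail } contract above; thresholds and helper names are assumptions:

```js
// Sketch: three deterministic scorers. Option names and details are illustrative.
function containsAny(response, targets) {
  const text = response.toLowerCase();
  const hit = targets.find((t) => text.includes(t.toLowerCase()));
  return { pass: Boolean(hit), score: hit ? 1 : 0, detail: hit ? `matched "${hit}"` : "no match" };
}

function jsonParse(response, requiredKeys) {
  const stripped = response.replace(/```(?:json)?/g, "").trim(); // strip code fences
  try {
    const obj = JSON.parse(stripped);
    const missing = requiredKeys.filter((k) => !(k in obj));
    return { pass: missing.length === 0, score: missing.length ? 0 : 1,
             detail: missing.length ? `missing keys: ${missing.join(", ")}` : "ok" };
  } catch (err) {
    return { pass: false, score: 0, detail: `parse error: ${err.message}` };
  }
}

function numericClose(response, expected, tolerance) {
  const numbers = (response.match(/-?\d+(\.\d+)?/g) || []).map(Number);
  if (!numbers.length) return { pass: false, score: 0, detail: "no number found" };
  const closest = numbers.reduce((a, b) =>
    Math.abs(b - expected) < Math.abs(a - expected) ? b : a);
  const pass = Math.abs(closest - expected) <= tolerance;
  return { pass, score: pass ? 1 : 0, detail: `closest value ${closest}` };
}
```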

Real production eval suites mix these the same way — most checks are deterministic (cheap, repeatable, fast); a small number that need judgement use LLM-as-judge with a calibrated rubric. Pure LLM-as-judge on every test is slow and expensive, and the variance can mask real regressions.
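
The judge path, sketched under the same caveat: a second call to the chat endpoint carrying a versioned rubric, with unparseable judge output treated as a failed check rather than a crash. The rubric id and prompt wording are placeholders:

```js
// Sketch: LLM-as-judge scorer. Rubric id and prompt wording are placeholders.
const RUBRIC_VERSION = "lab-judge-rubric-v1";

async function llmJudge(test, response) {
  const judgePrompt = [
    `You are grading a model response against rubric ${RUBRIC_VERSION}.`,
    `Task prompt: ${test.prompt}`,
    `Response to grade: ${response}`,
    `Reply with only JSON: {"pass": boolean, "score": number from 0 to 1, "reason": string}.`,
  ].join("\n\n");

  const res = await fetch("/api/lab/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: judgePrompt }),
  });
  const { text } = await res.json();
  try {
    const { pass, score, reason } = JSON.parse(text.replace(/```(?:json)?/g, "").trim());
    return { pass, score, detail: reason };
  } catch {
    // A judge that can't produce valid JSON is a failed check, not a thrown error.
    return { pass: false, score: 0, detail: "judge returned unparseable output" };
  }
}
```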

Honest caveat

This is a 15-test demo, not a production eval system. Real production evals include:

  • Hundreds to thousands of tests per task type
  • Adversarial test generation (red-team prompts evolved to find weak spots)
  • Stratified sampling across customer segments / use cases
  • Calibration curves linking eval scores to user satisfaction
  • Continuous eval running on a sample of production traffic
  • Drift detection comparing this week's scores to a rolling baseline (a toy version is sketched after this list)
  • Cost-aware sampling (don't re-run expensive tests every commit)
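
For the drift-detection item, the simplest useful shape is a threshold against a rolling window of recent accuracy scores; the window size and margin below are arbitrary placeholders:

```js
// Toy drift check: flag when current accuracy drops more than `margin`
// below the rolling mean of the last `window` runs. Both values are placeholders.
function detectDrift(history, current, { window = 8, margin = 0.03 } = {}) {
  const recent = history.slice(-window);
  if (!recent.length) return { drift: false, detail: "no history yet" };
  const rollingMean = recent.reduce((sum, x) => sum + x, 0) / recent.length;
  return {
    drift: current < rollingMean - margin,
    detail: `current ${current.toFixed(3)} vs rolling mean ${rollingMean.toFixed(3)}`,
  };
}
```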

What this demo gets right: the shape of the harness, the mix of scoring methods, and the baseline-comparison loop. Most teams don't have any eval at all — this is what the first useful one looks like.