Same question, three voices. One user prompt drives three concurrent calls to the same model — each with a different system prompt. After they finish, an LLM-as-judge call scores all three on adherence to their own constraints and specificity. The whole stack is visible: edit the personas, inspect the request bodies, see the rubric.
| Persona | Adherence | Specificity | Speed | Cost* | Verdict |
|---|---|---|---|---|---|
Send a prompt to populate telemetry.
Three concurrent SSE streams plus one judge call:
```
Browser
  ├─→ POST /api/lab/chat (persona A · Concise Analyst)    ┐
  ├─→ POST /api/lab/chat (persona B · Skeptical Reviewer) │  concurrent
  └─→ POST /api/lab/chat (persona C · Strategic Advisor)  ┘  SSE streams
                │
                ↓ all three complete
                │
  └─→ POST /api/lab/judge {user_prompt, candidates: [A,B,C]}
        → single chat completion with response_format=json_object
        → scored against the versioned rubric (judge.v1)
        ← {results: [...], rubric_version, latency_ms, …}
```
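A minimal sketch of that fan-out/join, written in Python with `httpx` (the live page does this from browser JavaScript). The `persona` and `prompt` field names and the base URL are assumptions, and for brevity the sketch collects whole responses rather than reading the SSE streams incrementally:

```python
import asyncio

import httpx

PERSONAS = ("concise_analyst", "skeptical_reviewer", "strategic_advisor")

async def run_lab(prompt: str, base_url: str) -> dict:
    async with httpx.AsyncClient(base_url=base_url, timeout=60) as client:
        # Fan out: three concurrent calls to the same endpoint, varying
        # only the persona (i.e. which system prompt the backend applies).
        replies = await asyncio.gather(*[
            client.post("/api/lab/chat", json={"persona": p, "prompt": prompt})
            for p in PERSONAS
        ])
        candidates = [r.json() for r in replies]

        # Join: once all three complete, one judge call scores them together.
        judged = await client.post(
            "/api/lab/judge",
            json={"user_prompt": prompt, "candidates": candidates},
        )
        return judged.json()  # {results: [...], rubric_version, latency_ms, ...}

# Example: asyncio.run(run_lab("Should we ship the beta?", "https://lab.example"))
```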
Backend on Oracle: FastAPI on 127.0.0.1:8000 behind an Nginx reverse proxy, with a per-IP rate limit (10 req/min) and an in-memory LRU cache keyed on the full request body; identical re-runs return instantly without burning quota.
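The cache is easy to reconstruct. A sketch, assuming the key is a SHA-256 of the canonicalised request JSON; the class name and default size are illustrative:

```python
import hashlib
import json
from collections import OrderedDict

class RequestLRU:
    """In-memory LRU keyed on the full request body: an identical
    re-run hits the cache instead of spending model quota."""

    def __init__(self, max_entries: int = 256):
        self._store: OrderedDict[str, dict] = OrderedDict()
        self._max = max_entries

    @staticmethod
    def key(body: dict) -> str:
        # Canonicalise the JSON so key order doesn't change the hash.
        raw = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, body: dict) -> dict | None:
        k = self.key(body)
        if k not in self._store:
            return None
        self._store.move_to_end(k)  # mark as most recently used
        return self._store[k]

    def put(self, body: dict, response: dict) -> None:
        self._store[self.key(body)] = response
        self._store.move_to_end(self.key(body))
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # evict least recently used
```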
One model, three voices. The differentiation is entirely in the system prompts — no fine-tuning, no model routing, no client-side AI. That's the whole point of the Foundation tier.
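To make that concrete, here is a hedged sketch of how one request could be built per persona. The prompt texts are invented placeholders, not the lab's actual personas, and the model identifier is hypothetical:

```python
# Illustrative only: the real persona prompts live on the backend.
PERSONAS = {
    "Concise Analyst":    "Answer in at most three sentences. No hedging.",
    "Skeptical Reviewer": "Challenge the premise. List risks before benefits.",
    "Strategic Advisor":  "Frame the answer around long-term trade-offs.",
}

def build_request(user_prompt: str, persona: str) -> dict:
    # Same model every time; only the system message differs.
    return {
        "model": "kimi-k2",  # hypothetical identifier
        "messages": [
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": user_prompt},
        ],
        "stream": True,  # streamed back to the browser as SSE
    }
```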
The rubric, visible exactly as the judge model receives it (versioned judge.v1):
Run an evaluation to load the rubric from the backend.
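And a sketch of what the judge step might look like on the backend, assuming an OpenAI-compatible client. The rubric string itself is loaded server-side and not reproduced here, and the model identifier is hypothetical:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving the model

RUBRIC_VERSION = "judge.v1"

def judge(user_prompt: str, candidates: list[dict], rubric: str) -> dict:
    # One chat completion, forced into JSON mode, scoring all three
    # candidates against the versioned rubric in a single pass.
    completion = client.chat.completions.create(
        model="kimi-k2",  # hypothetical; same model that produced the candidates
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": json.dumps(
                {"user_prompt": user_prompt, "candidates": candidates}
            )},
        ],
    )
    scores = json.loads(completion.choices[0].message.content)
    return {"results": scores, "rubric_version": RUBRIC_VERSION}
```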
Honest caveat: this is self-judgement — the same Kimi K2 model evaluates its own outputs. Useful for relative ranking, weak as an absolute quality signal. Production evals use a stronger judge model and a human-graded test set. The rubric is exposed so you can see exactly what's being measured rather than trusting an opaque score.