Prompt Lab

Same question, three voices. One user prompt drives three concurrent calls to the same model — each with a different system prompt. After they finish, an LLM-as-judge call scores all three on adherence to their own constraints and specificity. The whole stack is visible: edit the personas, inspect the request bodies, see the rubric.

Try: "Should we build an internal helpdesk on RAG or fine-tune?" · "How would you sequence an EU AI Act compliance program?"
One extra LLM call ranks the three outputs against their own system prompts.
Scoreboard (judge.v1)

Persona · Adherence · Specificity · Speed · Cost* · Verdict

Telemetry — per-column request, latency, tokens. Send a prompt to populate it.

Architecture — what just happened

Three concurrent SSE streams plus one judge call:

Browser
  ├─→ POST /api/lab/chat (persona A · Concise Analyst)   ┐
  ├─→ POST /api/lab/chat (persona B · Skeptical Reviewer)│  concurrent
  └─→ POST /api/lab/chat (persona C · Strategic Advisor) ┘  SSE streams
        │
        ↓ all three complete
        │
        └─→ POST /api/lab/judge {user_prompt, candidates: [A,B,C]}
              → single chat completion with response_format=json_object
              → scored against the versioned rubric (judge.v1)
              ← {results: [...], rubric_version, latency_ms, …}
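The fan-out/fan-in above can be sketched in a few lines of asyncio. This is a hypothetical client-side sketch, not the page's actual code: the two network calls are stubbed (a real client would read SSE deltas from `POST /api/lab/chat` and then post to `/api/lab/judge`), so only the orchestration logic — three concurrent streams, judge fires after all three complete — is shown.

```python
import asyncio

PERSONAS = {
    "A": "Concise Analyst",
    "B": "Skeptical Reviewer",
    "C": "Strategic Advisor",
}

async def stream_persona(persona_id: str, user_prompt: str) -> dict:
    # Stand-in for the SSE stream from POST /api/lab/chat; a real client
    # would read server-sent events and concatenate the token deltas.
    await asyncio.sleep(0)  # yield control, as a network await would
    return {"persona": persona_id, "text": f"[{PERSONAS[persona_id]}] answer"}

async def judge(user_prompt: str, candidates: list[dict]) -> dict:
    # Stand-in for POST /api/lab/judge: one chat completion with
    # response_format=json_object, scored against rubric judge.v1.
    return {"rubric_version": "judge.v1",
            "results": [c["persona"] for c in candidates]}

async def run(user_prompt: str) -> dict:
    # Fan out: three concurrent calls, same model, different system prompts.
    candidates = await asyncio.gather(
        *(stream_persona(pid, user_prompt) for pid in PERSONAS)
    )
    # Fan in: the judge only fires once all three have completed.
    return await judge(user_prompt, list(candidates))

verdict = asyncio.run(run("RAG or fine-tune for an internal helpdesk?"))
```

`asyncio.gather` preserves argument order, so candidate A is always `results[0]` regardless of which stream finishes first.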

Backend on Oracle: FastAPI on 127.0.0.1:8000, Nginx reverse proxy,
per-IP rate limit (10 req/min), in-memory LRU cache keyed on full
request — identical re-runs return instantly without burning quota.
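A minimal sketch of that cache, assuming a plain in-process `OrderedDict` keyed on a hash of the canonicalized request body. Capacity and key shape here are illustrative assumptions, not the deployed values.

```python
import hashlib
import json
from collections import OrderedDict

class RequestLRU:
    """In-memory LRU keyed on the full request body, so identical
    re-runs return instantly without burning upstream quota."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._store: OrderedDict[str, object] = OrderedDict()

    @staticmethod
    def key(body: dict) -> str:
        # Canonical JSON so semantically identical bodies hash the same.
        return hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()

    def get(self, body: dict):
        k = self.key(body)
        if k in self._store:
            self._store.move_to_end(k)  # mark as most recently used
            return self._store[k]
        return None

    def put(self, body: dict, response: object) -> None:
        k = self.key(body)
        self._store[k] = response
        self._store.move_to_end(k)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```

Because the key covers the whole body, changing a single persona word is a cache miss — which is exactly what you want when the page lets you edit the system prompts.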

One model, three voices. The differentiation is entirely in the system prompts — no fine-tuning, no model routing, no client-side AI. That's the whole point of the Foundation tier.
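Concretely, "differentiation entirely in the system prompts" means the three request bodies are byte-identical except for `messages[0]`. A sketch under stated assumptions — the persona texts and the model identifier below are illustrative placeholders, not the live values (the page lets you edit the real personas):

```python
MODEL = "kimi-k2"  # assumed model identifier, not necessarily the deployed id

# Illustrative persona prompts (hypothetical wording).
PERSONAS = {
    "A": "You are a Concise Analyst. Answer in under 150 words, bullets only.",
    "B": "You are a Skeptical Reviewer. Surface risks and hidden assumptions.",
    "C": "You are a Strategic Advisor. Recommend a sequence with trade-offs.",
}

def build_request(persona: str, user_prompt: str) -> dict:
    # Same model, same user prompt, same streaming flag; only the
    # system message differs across the three concurrent calls.
    return {
        "model": MODEL,
        "stream": True,
        "messages": [
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": user_prompt},
        ],
    }
```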

How this is measured (the rubric)

Visible exactly as the judge model receives it (versioned judge.v1):

Run an evaluation to load the rubric from the backend.

Honest caveat: this is self-judgement — the same Kimi K2 model evaluates its own outputs. Useful for relative ranking, weak as an absolute quality signal. Production evals use a stronger judge model and a human-graded test set. The rubric is exposed so you can see exactly what's being measured rather than trusting an opaque score.