The simplest reliability pattern beyond single-call AI: same model, three sequential passes. The first writes a draft (v1). The second critiques that draft. The third produces v2 by addressing the critique. Watch each pass stream live, then see a word-level diff of what actually changed. The optional judge call evaluates both versions honestly — sometimes the revision is worse, and the page will say so.
Pass 1 — Generate (streaming)
POST /api/lab/chat
system: "You are an enterprise consultant. Concise. Cite frameworks."
user: {question}
→ v1 draft
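A minimal sketch of what one streaming pass can look like on the client, assuming /api/lab/chat returns a plain text stream; `streamPass` and the injectable `fetchFn` are hypothetical names for illustration, not the page's actual code.

```typescript
type ChatRequest = { system: string; user: string };

// One streamed pass: POST the prompts, read the body chunk by chunk,
// surface each chunk via onToken for live rendering, return the full text.
async function streamPass(
  req: ChatRequest,
  onToken: (t: string) => void,
  fetchFn: (url: string, init?: RequestInit) => Promise<Response> = fetch,
): Promise<string> {
  const res = await fetchFn("/api/lab/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok || !res.body) throw new Error(`chat failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let full = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const text = decoder.decode(value, { stream: true });
    full += text; // accumulate the draft
    onToken(text); // render the chunk live
  }
  return full;
}
```

The same helper serves all three passes; only the system/user strings change.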
Pass 2 — Critique (streaming)
POST /api/lab/chat
system: "Identify weaknesses honestly. 3-5 numbered critiques + verdict."
user: "Question: … Draft: {v1}"
→ critique text
Pass 3 — Refine (streaming)
POST /api/lab/chat
system: "Address the critique. Same constraints as v1. Be tighter."
user: "Question + v1 + critique"
→ v2 draft
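The three passes above are plain sequencing. A sketch, with `chat` standing in for one streamed /api/lab/chat call and the prompt strings taken from the specs above (`draftCritiqueRefine` is a hypothetical name):

```typescript
type Chat = (system: string, user: string) => Promise<string>;

// Role-by-prompt sequencing: same model, three calls, each fed the
// previous pass's output. No tools, no memory, no framework.
async function draftCritiqueRefine(question: string, chat: Chat) {
  const v1 = await chat(
    "You are an enterprise consultant. Concise. Cite frameworks.",
    question,
  );
  const critique = await chat(
    "Identify weaknesses honestly. 3-5 numbered critiques + verdict.",
    `Question: ${question}\nDraft: ${v1}`,
  );
  const v2 = await chat(
    "Address the critique. Same constraints as v1. Be tighter.",
    `Question: ${question}\nv1: ${v1}\nCritique: ${critique}`,
  );
  return { v1, critique, v2 };
}
```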
Diff (client-side)
word-level LCS-based diff between v1 and v2
→ del / ins spans rendered with color-coded highlights
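The diff step can be a straightforward word-level LCS, computed entirely client-side as described. `wordDiff` and the `Span` shape here are illustrative, not the page's actual implementation:

```typescript
type Span = { type: "same" | "del" | "ins"; text: string };

// Word-level diff via longest-common-subsequence: dp[i][j] holds the LCS
// length of aw[i..] and bw[j..]; walking the table emits same/del/ins spans.
function wordDiff(a: string, b: string): Span[] {
  const aw = a.split(/\s+/).filter(Boolean);
  const bw = b.split(/\s+/).filter(Boolean);
  const dp: number[][] = Array.from({ length: aw.length + 1 }, () =>
    new Array(bw.length + 1).fill(0),
  );
  for (let i = aw.length - 1; i >= 0; i--) {
    for (let j = bw.length - 1; j >= 0; j--) {
      dp[i][j] =
        aw[i] === bw[j]
          ? dp[i + 1][j + 1] + 1
          : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  const spans: Span[] = [];
  let i = 0;
  let j = 0;
  while (i < aw.length && j < bw.length) {
    if (aw[i] === bw[j]) {
      spans.push({ type: "same", text: aw[i] });
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      spans.push({ type: "del", text: aw[i] }); // word dropped from v1
      i++;
    } else {
      spans.push({ type: "ins", text: bw[j] }); // word added in v2
      j++;
    }
  }
  while (i < aw.length) spans.push({ type: "del", text: aw[i++] });
  while (j < bw.length) spans.push({ type: "ins", text: bw[j++] });
  return spans;
}
```

The spans map directly onto color-coded del/ins highlights; the renderer just re-inserts whitespace between them.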
Optional Pass 4 — Judge (single call, JSON)
POST /api/lab/chat (response_format=json_object)
system: "Score v1 and v2 on accuracy / specificity / structure 0-10."
user: "Question + v1 + v2"
→ { v1: {...}, v2: {...}, winner, rationale }
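Because the judge returns JSON, the client still has to validate it before trusting the scores. A minimal sketch, assuming the score shape shown above; `parseJudge` is a hypothetical helper:

```typescript
type Scores = { accuracy: number; specificity: number; structure: number };
type Judgment = { v1: Scores; v2: Scores; winner: "v1" | "v2"; rationale: string };

// Parse and validate the judge's JSON: every score must be a number in
// 0-10, and winner must name one of the two versions. Throw otherwise.
function parseJudge(raw: string): Judgment {
  const j = JSON.parse(raw);
  const checkScores = (s: any): Scores => {
    for (const k of ["accuracy", "specificity", "structure"]) {
      const v = s?.[k];
      if (typeof v !== "number" || v < 0 || v > 10) {
        throw new Error(`bad score ${k}: ${v}`);
      }
    }
    return s as Scores;
  };
  if (j.winner !== "v1" && j.winner !== "v2") throw new Error("bad winner");
  return {
    v1: checkScores(j.v1),
    v2: checkScores(j.v2),
    winner: j.winner,
    rationale: String(j.rationale ?? ""),
  };
}
```

Rejecting malformed output here is what lets the page honestly report "the revision is worse" instead of rendering garbage scores.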
Same model across all passes. The "agent" is just role-by-prompt sequencing. The diff is pure JS — no LLM is asked to compute it, because that would be slow, expensive, and unreliable.
Single-call AI gets you to "good enough" most of the time. But the same model, given a chance to review its own work, can usually catch where it was thin: too vague, too hedged, missing a framework reference, or drifting off-topic. Ask it to revise, and the output gets tighter.
When this pattern shines: production copy generation, structured analysis, anywhere "draft and edit" beats "one shot." When it doesn't: short factual answers, deterministic extraction, anything where extra latency or cost isn't worth a marginal quality bump.
Honest caveat: the critique-and-revise loop sometimes makes things worse. Critique can hallucinate weaknesses; refine can over-correct. The judge call here is the empirical check — and in production you'd run it on a sample of revisions to decide whether the loop is worth keeping.