The simplest reliability pattern beyond single-call AI: same model, three sequential passes. The first writes a draft (v1). The second critiques that draft. The third produces v2 by addressing the critique. Watch each pass stream live, then see a word-level diff of what actually changed. The optional judge call evaluates both versions honestly — sometimes the revision is worse, and the page will say so.
Pass 1 — Generate (streaming)
POST /api/lab/chat
system: "You are an enterprise consultant. Concise. Cite frameworks."
user: {question}
→ v1 draft
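A minimal sketch of what one streaming pass can look like on the client, assuming /api/lab/chat returns a plain text stream; `streamPass` and the injectable `fetchFn` are hypothetical names for illustration, not the page's actual code.

```typescript
type ChatRequest = { system: string; user: string };

// One streamed pass: POST the prompts, read the body chunk by chunk,
// surface each chunk via onToken for live rendering, return the full text.
async function streamPass(
  req: ChatRequest,
  onToken: (t: string) => void,
  fetchFn: (url: string, init?: RequestInit) => Promise<Response> = fetch,
): Promise<string> {
  const res = await fetchFn("/api/lab/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok || !res.body) throw new Error(`chat failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let full = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const text = decoder.decode(value, { stream: true });
    full += text; // accumulate the draft
    onToken(text); // render the chunk live
  }
  return full;
}
```

The same helper serves all three passes; only the system/user strings change.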
Pass 2 — Critique (streaming)
POST /api/lab/chat
system: "Identify weaknesses honestly. 3-5 numbered critiques + verdict."
user: "Question: … Draft: {v1}"
→ critique text
Pass 3 — Refine (streaming)
POST /api/lab/chat
system: "Address the critique. Same constraints as v1. Be tighter."
user: "Question + v1 + critique"
→ v2 draft
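The three passes above are plain sequencing. A sketch, with `chat` standing in for one streamed /api/lab/chat call and the prompt strings taken from the specs above (`draftCritiqueRefine` is a hypothetical name):

```typescript
type Chat = (system: string, user: string) => Promise<string>;

// Role-by-prompt sequencing: same model, three calls, each fed the
// previous pass's output. No tools, no memory, no framework.
async function draftCritiqueRefine(question: string, chat: Chat) {
  const v1 = await chat(
    "You are an enterprise consultant. Concise. Cite frameworks.",
    question,
  );
  const critique = await chat(
    "Identify weaknesses honestly. 3-5 numbered critiques + verdict.",
    `Question: ${question}\nDraft: ${v1}`,
  );
  const v2 = await chat(
    "Address the critique. Same constraints as v1. Be tighter.",
    `Question: ${question}\nv1: ${v1}\nCritique: ${critique}`,
  );
  return { v1, critique, v2 };
}
```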
Diff (client-side)
word-level LCS-based diff between v1 and v2
→ del / ins spans rendered with color-coded highlights
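The diff step can be a straightforward word-level LCS, computed entirely client-side as described. `wordDiff` and the `Span` shape here are illustrative, not the page's actual implementation:

```typescript
type Span = { type: "same" | "del" | "ins"; text: string };

// Word-level diff via longest-common-subsequence: dp[i][j] holds the LCS
// length of aw[i..] and bw[j..]; walking the table emits same/del/ins spans.
function wordDiff(a: string, b: string): Span[] {
  const aw = a.split(/\s+/).filter(Boolean);
  const bw = b.split(/\s+/).filter(Boolean);
  const dp: number[][] = Array.from({ length: aw.length + 1 }, () =>
    new Array(bw.length + 1).fill(0),
  );
  for (let i = aw.length - 1; i >= 0; i--) {
    for (let j = bw.length - 1; j >= 0; j--) {
      dp[i][j] =
        aw[i] === bw[j]
          ? dp[i + 1][j + 1] + 1
          : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  const spans: Span[] = [];
  let i = 0;
  let j = 0;
  while (i < aw.length && j < bw.length) {
    if (aw[i] === bw[j]) {
      spans.push({ type: "same", text: aw[i] });
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      spans.push({ type: "del", text: aw[i] }); // word dropped from v1
      i++;
    } else {
      spans.push({ type: "ins", text: bw[j] }); // word added in v2
      j++;
    }
  }
  while (i < aw.length) spans.push({ type: "del", text: aw[i++] });
  while (j < bw.length) spans.push({ type: "ins", text: bw[j++] });
  return spans;
}
```

The spans map directly onto color-coded del/ins highlights; the renderer just re-inserts whitespace between them.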
Optional Pass 4 — Judge (single call, JSON)
POST /api/lab/chat (response_format=json_object)
system: "Score v1 and v2 on accuracy / specificity / structure 0-10."
user: "Question + v1 + v2"
→ { v1: {...}, v2: {...}, winner, rationale }
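Because the judge returns JSON, the client still has to validate it before trusting the scores. A minimal sketch, assuming the score shape shown above; `parseJudge` is a hypothetical helper:

```typescript
type Scores = { accuracy: number; specificity: number; structure: number };
type Judgment = { v1: Scores; v2: Scores; winner: "v1" | "v2"; rationale: string };

// Parse and validate the judge's JSON: every score must be a number in
// 0-10, and winner must name one of the two versions. Throw otherwise.
function parseJudge(raw: string): Judgment {
  const j = JSON.parse(raw);
  const checkScores = (s: any): Scores => {
    for (const k of ["accuracy", "specificity", "structure"]) {
      const v = s?.[k];
      if (typeof v !== "number" || v < 0 || v > 10) {
        throw new Error(`bad score ${k}: ${v}`);
      }
    }
    return s as Scores;
  };
  if (j.winner !== "v1" && j.winner !== "v2") throw new Error("bad winner");
  return {
    v1: checkScores(j.v1),
    v2: checkScores(j.v2),
    winner: j.winner,
    rationale: String(j.rationale ?? ""),
  };
}
```

Rejecting malformed output here is what lets the page honestly report "the revision is worse" instead of rendering garbage scores.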
Same model across all passes. The "agent" is just role-by-prompt sequencing. The diff is pure JS — no LLM is asked to compute it, because that would be slow, expensive, and unreliable.
Single-call AI gets you to "good enough" most of the time. But the same model, given a chance to review its own work, can usually catch where it was thin: too vague, too hedged, missing a framework reference, or drifting off-topic. Ask it to revise, and the output gets tighter.
When this pattern shines: production copy generation, structured analysis, anywhere "draft and edit" beats "one shot." When it doesn't: short factual answers, deterministic extraction, anything where extra latency or cost isn't worth a marginal quality bump.
Honest caveat: the critique-and-revise loop sometimes makes things worse. Critique can hallucinate weaknesses; refine can over-correct. The judge call here is the empirical check — and in production you'd run it on a sample of revisions to decide whether the loop is worth keeping.