Agent Studio

Three agents, one model, one shared corpus. The Researcher iterates over the knowledge base — deciding what to look up next based on what it has found — then hands its notes to the Analyst, who weighs trade-offs and risks. The Writer synthesizes the final answer with citations. Everything streams live so you can watch the system think.

Architecture — what just happened

A real iterative agent loop, orchestrated entirely client-side, with the corpus as the shared knowledge base (a code sketch of one Researcher iteration follows the outline):

Researcher (up to 3 iterations):
  loop:
    POST /api/lab/chat
      "Decide: next search query, or stop?"
      → {next_query: "...", rationale: "...", done: false}
    if done or iteration ≥ 3: break
    POST /api/lab/embed (input_type=query)
      → 1024-dim vector
    cosineSimilarity over corpus.json (client-side)
      → top-5 chunks (deduped vs prior iterations)
    feed retrieved chunks back into next iteration's context

  POST /api/lab/chat (streaming)
    "Synthesize research notes from the chunks gathered, with [n] citations"
    → research notes

Analyst:
  POST /api/lab/chat (streaming)
    "Identify trade-offs, risks, decision factors. Be direct."
    context: question + research notes
    → analysis

Writer:
  POST /api/lab/chat (streaming)
    "Synthesize a concise executive-ready answer with [n] citations."
    context: question + research notes + analyst's analysis
    → final answer
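
Below is a minimal TypeScript sketch of one Researcher iteration under the assumptions above. The request and response shapes, the corpus layout, and the helper names (Chunk, Plan, researchIteration, cosineSimilarity) are illustrative, not the Lab's actual source:

  type Chunk = { id: string; text: string; embedding: number[] };
  type Plan = { next_query: string; rationale: string; done: boolean };

  function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // One pass of the Researcher loop: plan, embed, retrieve, dedupe.
  async function researchIteration(
    question: string,
    notesSoFar: string,
    corpus: Chunk[],
    seen: Set<string>,            // chunk ids retrieved in earlier iterations
  ): Promise<{ query: string; chunks: Chunk[] } | null> {
    // 1. Ask the model to plan: a next search query, or a stop signal.
    const plan: Plan = await fetch("/api/lab/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ role: "researcher-plan", question, notes: notesSoFar }),
    }).then(r => r.json());

    if (plan.done) return null;

    // 2. Embed the planned query (input_type=query, 1024-dim vector).
    const { embedding } = await fetch("/api/lab/embed", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input: plan.next_query, input_type: "query" }),
    }).then(r => r.json());

    // 3. Client-side retrieval: rank corpus chunks by cosine similarity,
    //    skipping chunks already surfaced in earlier iterations.
    const top5 = corpus
      .filter(c => !seen.has(c.id))
      .map(c => ({ chunk: c, score: cosineSimilarity(embedding, c.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, 5);

    top5.forEach(t => seen.add(t.chunk.id));
    return { query: plan.next_query, chunks: top5.map(t => t.chunk) };
  }

The orchestrator calls this up to three times, threading the returned chunks into the next plan prompt, then makes the streamed synthesis call.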

One model (Kimi K2). Three roles, differentiated only by their system prompt and the context they receive. The Researcher's iteration loop is the most "agentic" part: on each pass the model sees what it has already learned and decides whether to dig deeper.
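
In code, the role split can be as small as a table mapping each role to a system prompt and a context selector. A sketch with invented field names, not the Lab's actual state object:

  type PipelineState = {
    question: string;
    retrievedChunks: string[];   // text of chunks gathered by the Researcher loop
    researchNotes: string;       // Researcher's synthesized notes
    analysis: string;            // Analyst's output
  };

  const ROLES = {
    researcher: {
      system: "Research the question against the corpus. Plan searches, then write cited notes.",
      context: (s: PipelineState) => [s.question, ...s.retrievedChunks],
    },
    analyst: {
      system: "Identify trade-offs, risks, decision factors. Be direct.",
      context: (s: PipelineState) => [s.question, s.researchNotes],
    },
    writer: {
      system: "Synthesize a concise executive-ready answer with [n] citations.",
      context: (s: PipelineState) => [s.question, s.researchNotes, s.analysis],
    },
  } as const;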

Pipeline budget — calls, latency, simulated cost
  • 1–3 plan calls (Researcher iterations) · ~1–2 s each, JSON output
  • 1–3 embed calls (one per iteration that did a search) · ~0.4 s each
  • 1 synthesis call (Researcher writes notes) · ~3–5 s, streamed
  • 1 Analyst call · ~3–5 s, streamed
  • 1 Writer call · ~3–5 s, streamed

Total: 4–6 LLM calls + 1–3 embed calls per question, roughly 15–25 s end-to-end. Every request is cached in an LRU keyed on the full request, so re-running the same question is instant and incurs no upstream cost.
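
The cache can be as simple as an insertion-ordered Map with eviction, keyed on the serialized request so identical questions hit the same entries. A sketch of the idea, not the actual implementation:

  // Minimal LRU keyed on the full serialized request.
  class LruCache<V> {
    private map = new Map<string, V>();
    constructor(private maxEntries = 256) {}

    get(key: string): V | undefined {
      const hit = this.map.get(key);
      if (hit !== undefined) {
        this.map.delete(key);      // re-insert to mark as most recently used
        this.map.set(key, hit);
      }
      return hit;
    }

    set(key: string, value: V): void {
      if (this.map.has(key)) this.map.delete(key);
      this.map.set(key, value);
      if (this.map.size > this.maxEntries) {
        // Map iterates in insertion order, so the first key is the least recently used.
        this.map.delete(this.map.keys().next().value as string);
      }
    }
  }

  const responseCache = new LruCache<string>();
  const body = { role: "analyst", question: "Which database should we pick?" };  // hypothetical request
  const key = JSON.stringify({ url: "/api/lab/chat", body });
  // On a hit, replay the cached response; on a miss, call upstream and store the result.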

Why this is the right shape for the demo

A real production agent system would use formal function calling — the model emits structured tool requests, the orchestrator runs them, results flow back. That's how Claude, GPT, Gemini, and Kimi all support tools today.

This demo simulates the same pattern with plain prompts and JSON outputs. The visible structure is the same: plan a query, get results, decide what's next. The visitor sees the agent reason between calls rather than watching opaque tool-call traces.
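
Concretely, the simulated "tool call" is just a system prompt that demands a strict JSON reply, plus defensive parsing on the client. A sketch under those assumptions (the prompt text and fallback behavior are illustrative):

  type Plan = { next_query: string; rationale: string; done: boolean };

  const PLAN_SYSTEM_PROMPT =
    "Decide the next research step. Reply with JSON only: " +
    '{"next_query": string, "rationale": string, "done": boolean}';

  function parsePlan(modelReply: string): Plan {
    try {
      const parsed = JSON.parse(modelReply);
      if (typeof parsed.done === "boolean" && typeof parsed.next_query === "string") {
        return parsed as Plan;
      }
    } catch {
      // fall through to the safe default
    }
    // If the model wrapped the JSON in prose or broke the schema, stop the loop rather than guess.
    return { next_query: "", rationale: "unparseable plan", done: true };
  }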

In production you'd also add: parallel branches (multiple researchers working in parallel), critic agents (one agent reviews another's output), human-in-the-loop gates, and observability via OpenTelemetry. The Lab keeps it focused — three roles in series — to make the choreography legible.
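
The parallel-branch extension, for instance, is mostly a Promise.all over independent Researcher runs. A hypothetical sketch, where runResearcher stands in for the whole loop above:

  // Fan the Researcher out over several angles of the same question, then merge the notes.
  async function runParallelBranches(
    question: string,
    runResearcher: (q: string) => Promise<string>,    // one full Researcher pipeline per branch
  ): Promise<string> {
    const angles = ["cost", "latency", "operational risk"];   // hypothetical sub-questions
    const notes = await Promise.all(
      angles.map(a => runResearcher(`${question} (focus: ${a})`)),
    );
    // A critic agent could review the merged notes here before the Analyst runs.
    return notes.join("\n\n");
  }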