Retrieval-augmented generation, with the wires showing. Type a question — the corpus panel lights up the retrieved chunks, the answer streams in with inline citations, and every step is inspectable: the embedding call, the similarity scores, the prompt sent to the model with the retrieved context folded in.
Send a question to see telemetry.
A two-call retrieval pipeline, with the math visible:
Browser
├─ corpus.json (77 chunks × 1024-dim embeddings) loaded once on page open
│
├─→ POST /api/lab/embed {text, input_type:"query"}
│     → NVIDIA NIM (nv-embedqa-e5-v5)
│     ← {embedding: [1024 floats], latency_ms, …}
│
├─ cosineSimilarity(query_vec, chunk_vec) for every chunk in corpus.json
├─ sort desc, take top-k → highlight in corpus pane
│
└─→ POST /api/lab/chat {messages: [system+user], …}
      - system: "Answer using ONLY these chunks. Cite as [1], [2]…"
      - user: the question
      - context: top-k chunks injected into the system prompt
      ← SSE stream of tokens with [n] citation markers
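To make the in-browser half concrete, here is a minimal TypeScript sketch of the retrieval step. It assumes corpus.json is an array of {id, text, embedding} records and that /api/lab/embed answers with {embedding: number[]}; any field or helper name beyond what the diagram shows is an assumption, not the exact app code.

```ts
// Sketch of the browser-side retrieval step, under assumed shapes:
//   corpus.json  -> Array<{ id: string; text: string; embedding: number[] }>
//   /api/lab/embed -> { embedding: number[] }   (1024 floats)
type Chunk = { id: string; text: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function retrieve(question: string, corpus: Chunk[], k = 4) {
  // One cheap backend call: embed the query with input_type "query".
  const res = await fetch("/api/lab/embed", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: question, input_type: "query" }),
  });
  const { embedding } = await res.json();

  // Score every chunk locally, sort descending, keep top-k.
  return corpus
    .map((c) => ({ chunk: c, score: cosineSimilarity(embedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```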
Retrieval runs in your browser. The backend never touches the corpus — it only embeds your query (one cheap call) and runs the grounded chat (one ordinary chat call). corpus.json is a static asset on the same CDN edge as the page.
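A matching sketch of the second call: assembling the grounded system prompt and reading the SSE stream. The messages shape, the literal "data: " line format, and the function names here are assumptions; the real endpoint may differ.

```ts
// Sketch: build the grounded prompt and stream the answer.
// Assumes /api/lab/chat streams SSE lines like "data: <token>".
async function askGrounded(
  question: string,
  top: { chunk: { text: string }; score: number }[],
  onToken: (t: string) => void,
) {
  // Inject the top-k chunks into the system prompt, numbered for [n] citations.
  const context = top.map((t, i) => `[${i + 1}] ${t.chunk.text}`).join("\n\n");
  const system = `Answer using ONLY these chunks. Cite as [1], [2]…\n\n${context}`;

  const res = await fetch("/api/lab/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [
        { role: "system", content: system },
        { role: "user", content: question },
      ],
    }),
  });

  // Read the SSE stream token by token (line buffering across chunk
  // boundaries is omitted here for brevity).
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    for (const line of decoder.decode(value, { stream: true }).split("\n")) {
      if (line.startsWith("data: ")) onToken(line.slice(6));
    }
  }
}
```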
Seven sources, paragraph-aware chunking with a soft 500-char cap. Standards are summarized in original prose, clearly labeled. The corpus is rebuilt via scripts/build_corpus.py any time the source files change.
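The real chunker lives in scripts/build_corpus.py; purely as an illustration of paragraph-aware splitting with a soft 500-char cap, here is the same idea sketched in TypeScript. The merge rule below is a guess at the behavior, not a port of the script.

```ts
// Illustrative sketch of paragraph-aware chunking with a soft 500-char cap.
// The actual pipeline is scripts/build_corpus.py; this only mirrors the idea.
function chunkDocument(text: string, softCap = 500): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    // Merge paragraphs until adding the next one would pass the soft cap;
    // a single long paragraph is kept whole rather than split mid-sentence.
    if (current && current.length + para.length + 2 > softCap) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```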