april 14, 2025 · 2 min read

how docgen turns a repo into docs

it's not chatgpt on your repo — it's ingestion, chunking, retrieval, caching, and a frontend that doesn't feel broken while you wait.

people ask if docgen is "just chatgpt on your repo." not quite. the model is one step. the product is everything around it — ingestion, chunking, context selection, caching, and a frontend that doesn't feel broken while the expensive part runs.

stage 1 — ingestion

clone or pull the repo. walk the file tree. skip binaries and node_modules. respect .gitignore. hash the tree so you know when nothing changed.

if the hash matches the last run, skip re-processing. users click "generate" twice — you shouldn't pay twice.
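
a minimal sketch of the "hash the tree" check, assuming files are already read into memory. `FileEntry`, `hashTree`, and `needsRegeneration` are hypothetical names for illustration, not docgen's actual code.

```typescript
import { createHash } from "node:crypto";

// hypothetical shape: one entry per file that survived ingestion filtering
interface FileEntry {
  path: string;    // repo-relative path
  content: string; // file text
}

// hash paths and contents in sorted order so the result does not depend on walk order
function hashTree(files: FileEntry[]): string {
  const hash = createHash("sha256");
  const sorted = [...files].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0
  );
  for (const file of sorted) {
    // length-prefix each field so adjacent fields cannot blur into each other
    hash.update(`${file.path.length}:${file.path}`);
    hash.update(`${file.content.length}:${file.content}`);
  }
  return hash.digest("hex");
}

// true when the tree changed since the last run, or when there was no last run
function needsRegeneration(files: FileEntry[], lastHash: string | undefined): boolean {
  return hashTree(files) !== lastHash;
}
```

store the returned hash next to the finished run. the second click on "generate" then costs one hash comparison instead of a model call.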

skipped without apology:

  • node_modules/, dist/, .git/
  • images, pdfs, compiled assets
  • files over a configurable size threshold
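
the skip list above could look roughly like this. the extension list and the 256 kb cap are assumptions (the only given is that the threshold is configurable), and .gitignore handling is left out of this sketch.

```typescript
// hypothetical config; none of these defaults come from docgen itself
interface IngestConfig {
  skipDirs: string[];
  skipExtensions: string[];
  maxBytes: number;
}

const defaultConfig: IngestConfig = {
  skipDirs: ["node_modules", "dist", ".git"],
  skipExtensions: [".png", ".jpg", ".jpeg", ".gif", ".pdf", ".wasm", ".min.js"],
  maxBytes: 256 * 1024, // assumed cap
};

// decide from the repo-relative path and file size alone, before reading the file
function shouldSkip(
  path: string,
  sizeBytes: number,
  config: IngestConfig = defaultConfig
): boolean {
  // only directory segments count, so a file that happens to be named "dist" is kept
  const dirs = path.split("/").slice(0, -1);
  if (dirs.some((dir) => config.skipDirs.includes(dir))) return true;

  const lower = path.toLowerCase();
  if (config.skipExtensions.some((ext) => lower.endsWith(ext))) return true;

  return sizeBytes > config.maxBytes;
}
```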

stage 2 — chunking

a 4,000-line file doesn't fit in one prompt. split by function or section. keep metadata so answers can cite where they came from.

bad chunking gives confident wrong answers. good chunking is librarian work, not ml magic.

every chunk carries:

  1. path — src/auth/middleware.ts
  2. language — typescript
  3. line range — 42–89
  4. content — the text sent to the model
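
a naive sketch of a chunker that keeps that metadata. it cuts at blank lines rather than at real function boundaries (that would need a parser), and the `minLines` / `maxLines` defaults are assumptions. field names follow the prompt snippet later in this post.

```typescript
interface Chunk {
  path: string;
  language: string;
  start: number; // first line, 1-indexed
  end: number;   // last line, inclusive
  text: string;  // the content sent to the model
}

// close a chunk at a blank line once it holds at least minLines lines,
// or force a cut when it reaches maxLines
function chunkFile(
  path: string,
  language: string,
  source: string,
  minLines = 20,
  maxLines = 80
): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  let first = 0; // 0-indexed first line of the chunk being built

  lines.forEach((line, i) => {
    const size = i - first + 1;
    const isLast = i === lines.length - 1;
    const atBlank = line.trim() === "";
    if (isLast || size >= maxLines || (atBlank && size >= minLines)) {
      chunks.push({
        path,
        language,
        start: first + 1,
        end: i + 1,
        text: lines.slice(first, i + 1).join("\n"),
      });
      first = i + 1;
    }
  });

  return chunks;
}
```

because every line lands in exactly one chunk, joining the chunks back together with newlines reproduces the original file. that makes the line ranges safe to cite.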

stage 3 — generation

for each question: retrieve relevant chunks (keyword search first, embeddings when you have them), then assemble a prompt with strict instructions: answer from context only, and say you don't know if the answer is missing. keep temperature low.

hallucinated api methods are worse than no doc.
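
"keyword search first" can be as small as the sketch below: score each chunk by how many distinct question terms appear in its path or text, then keep the top few. `tokenize`, `retrieve`, and the `topK` default are hypothetical. a real setup would likely drop stopwords or switch to embeddings.

```typescript
interface RetrievableChunk {
  path: string;
  text: string;
}

// lowercase identifier-like tokens; splits on anything that is not a-z, 0-9, or _
function tokenize(input: string): string[] {
  return input.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

// rank chunks by the number of distinct question terms they contain
function retrieve<T extends RetrievableChunk>(
  question: string,
  chunks: T[],
  topK = 5
): T[] {
  const terms = Array.from(new Set(tokenize(question)));
  return chunks
    .map((chunk) => {
      const tokens = new Set(tokenize(`${chunk.path} ${chunk.text}`));
      const score = terms.filter((term) => tokens.has(term)).length;
      return { chunk, score };
    })
    .filter((scored) => scored.score > 0) // no overlap means no context, not weak context
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}
```

an empty result is useful on its own: if nothing matches, skip the model call and answer "i don't know" directly.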

tip

cited or silent

users forgive "i don't know." they don't forgive fake function names. force the model to cite paths or refuse.

typescript
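// assumed chunk shape; field names match the template below
interface Chunk {
  path: string;
  language: string;
  start: number; // first line of the chunk
  end: number;   // last line of the chunk
  text: string;  // the content sent to the model
}

function buildPrompt(chunks: Chunk[], question: string): string {
  // label each chunk with its path and line range so the model can cite it
  const context = chunks
    .map((c) => `[${c.path}:${c.start}-${c.end}]\n${c.text}`)
    .join('\n\n');

  return `
Answer only from the context below.
If the answer is not in the context, say you don't know.
Cite file paths when possible.

Context:
${context}

Question: ${question}
`;
}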
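
the prompt asks for citations. it doesn't enforce them. one way to get "cite or refuse" is to check the answer after the fact. the sketch below assumes the model echoes the `[path:start-end]` labels from the context, which is an assumption about model behavior, and `checkCitations` is a hypothetical helper.

```typescript
interface CitationCheck {
  ok: boolean;       // at least one citation, and every cited path is known
  cited: string[];   // paths found in the answer
  unknown: string[]; // cited paths that were never in the context
}

function checkCitations(answer: string, knownPaths: string[]): CitationCheck {
  // matches labels like [src/auth/middleware.ts:42-89]
  const pattern = /\[([^\]\s:]+):\d+-\d+\]/g;
  const known = new Set(knownPaths);
  const cited: string[] = [];

  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    cited.push(match[1]);
  }

  const unknown = cited.filter((path) => !known.has(path));
  return { ok: cited.length > 0 && unknown.length === 0, cited, unknown };
}
```

when `ok` is false, show the refusal copy instead of the answer. an uncited answer and a made-up path get the same treatment.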

stage 4 — delivery

the react frontend streams partial text so five seconds feels shorter. error states matter as much as happy paths:

  • streaming text — chunks retrieved, model responding
  • "repo too large" — ingestion hit size cap
  • cached badge — hash match, skipped llm
  • retry button — rate limit or transient failure

boring copy. fewer support pings.
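
one way to keep those states honest is a single discriminated union that the ui switches on. the state names and copy strings below are placeholders, not docgen's real ones, and the sketch is framework-agnostic rather than react-specific.

```typescript
type RunState =
  | { kind: "streaming"; text: string }
  | { kind: "cached"; text: string }
  | { kind: "too_large" }
  | { kind: "failed"; retryable: boolean };

// append a streamed delta; any non-streaming state is returned unchanged
function appendDelta(state: RunState, delta: string): RunState {
  return state.kind === "streaming"
    ? { kind: "streaming", text: state.text + delta }
    : state;
}

// every state gets copy; the compiler complains if a new kind is added and forgotten here
function statusCopy(state: RunState): string {
  switch (state.kind) {
    case "streaming":
      return "generating docs...";
    case "cached":
      return "cached: nothing changed since the last run";
    case "too_large":
      return "repo too large: ingestion hit the size cap";
    case "failed":
      return state.retryable
        ? "something went wrong. retry?"
        : "something went wrong.";
  }
}
```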

where redis fits

redis caches completed doc runs keyed by repo hash + config version. that's how we hit sub-five-second responses on repeats and cut inference spend ~35%. the llm is the expensive guest. everything else is hosting.
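
a sketch of that cache-aside flow. the key format and the `KeyValueStore` interface are assumptions. the in-memory store stands in for redis so the example runs on its own; a real redis client would sit behind the same two methods.

```typescript
// hypothetical minimal interface; a redis client wrapper would implement the same shape
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// in-memory stand-in for redis, for illustration and tests only
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }

  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// repo hash + config version: changing either one misses the cache
function cacheKey(repoHash: string, configVersion: string): string {
  return `docgen:${configVersion}:${repoHash}`;
}

async function getOrGenerate(
  store: KeyValueStore,
  repoHash: string,
  configVersion: string,
  generate: () => Promise<string> // the expensive llm run
): Promise<{ docs: string; cached: boolean }> {
  const key = cacheKey(repoHash, configVersion);
  const hit = await store.get(key);
  if (hit !== null) return { docs: hit, cached: true };

  const docs = await generate();
  await store.set(key, docs);
  return { docs, cached: false };
}
```

the `cached` flag is what drives the cached badge in the frontend. including the config version in the key means a prompt or chunking change invalidates old runs without a manual flush.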


building ai products in 2025: spend your sprint on the pipeline, not the model badge on the landing page. users remember whether it worked — not which gpt version you shipped.
