redis before hero architecture
docgen was burning tokens on identical questions. cache-aside fixed it: ~35% lower cost on repeated inference, sub-second repeats, zero heroics.
docgen generates documentation from your codebase using an llm. first version: every request hit the model. perfect for demos. expensive for production. same repo, same branch, same question — we paid for the same answer twice, then a third time, then wondered why the bill hurt.
the fix wasn't a smarter model. it was redis and a pattern so boring it has a name: cache-aside. check cache first. on miss, compute. store with a ttl. return. next identical request — redis answers in milliseconds and the model sleeps.
cache-aside in four lines
- read from cache
- on miss, compute
- write to cache with ttl
- return
everything else is key design and expiry policy.
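as a rough sketch, the four steps fit in one generic helper. `cacheAside` and `memoryCache` are hypothetical names, not docgen code, and the in-memory store only stands in for redis so the example runs without a server. it assumes values are json-serializable.

```javascript
// Generic cache-aside helper (hypothetical). `cache` is any object with
// async get(key) -> string|null and async set(key, value, ttlSeconds).
async function cacheAside(cache, key, ttlSeconds, compute) {
  const cached = await cache.get(key);                      // 1. read from cache
  if (cached !== null) return JSON.parse(cached);
  const value = await compute();                            // 2. on miss, compute
  await cache.set(key, JSON.stringify(value), ttlSeconds);  // 3. write with ttl
  return value;                                             // 4. return
}

// In-memory stand-in for redis. `now` is injectable so expiry can be tested.
function memoryCache(now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    async get(key) {
      const entry = entries.get(key);
      if (!entry) return null;
      if (entry.expiresAt <= now()) {
        entries.delete(key); // expired: treat as a miss
        return null;
      }
      return entry.value;
    },
    async set(key, value, ttlSeconds) {
      entries.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
    },
  };
}
```

swapping `memoryCache()` for a thin wrapper around a real redis client should leave `cacheAside` unchanged, which is most of the appeal.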
what goes in the key
not just user id. for docgen:
- repo hash — code changed? cache miss. fresh docs.
- file path — scope the answer
- prompt version — change the prompt? don't serve yesterday's tone
forget prompt version and you serve stale answers that look confident. that's worse than slow.
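one way to make the prompt version hard to forget is to derive it from the prompt text. this is a sketch under assumptions: `promptVersionOf` and `docsKey` are hypothetical helpers, the prompts are made up, and the 12-character truncation is an arbitrary choice.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: hash the prompt template so that editing the prompt
// changes the version without anyone remembering to bump a number.
function promptVersionOf(promptTemplate) {
  return createHash('sha256').update(promptTemplate).digest('hex').slice(0, 12);
}

// Key layout used by docgen: docs:<repoHash>:<path>:<promptVersion>
function docsKey(repoHash, path, promptVersion) {
  return `docs:${repoHash}:${path}:${promptVersion}`;
}

// Made-up prompts and repo hash, for illustration only.
const v1 = promptVersionOf('Summarize this file for new contributors.');
const v2 = promptVersionOf('Summarize this file for new contributors. Be terse.');
console.log(docsKey('abc123', 'src/index.js', v1));
console.log(docsKey('abc123', 'src/index.js', v2)); // different key, so a cache miss
```

same repo and path, different prompt, different key. the old entries simply age out under their ttl.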
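// assumes `redis` is a connected client whose set() accepts ('EX', seconds), as ioredis does
async function getDocs(repo, repoHash, path, promptVersion) {
  const key = `docs:${repoHash}:${path}:${promptVersion}`;
  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached); // hit: the model never runs
  const docs = await generateDocs(repo, path); // miss: pay for inference once
  await redis.set(key, JSON.stringify(docs), 'EX', 60 * 60 * 24); // expire after 24h
  return docs;
}
when to cache, and when not to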
good fit
- repeated expensive reads
- derived summaries and aggregations
- rate-limit counters
- session-ish state with a ttl
bad fit
- one-off admin exports
- user-specific secrets
- anything where a stale answer carries legal or safety risk
- balances that must be exact every millisecond
we cut repeated inference cost by ~35% on real usage. not because redis is magic — because most users don't change their repo every five minutes. they re-run docs while iterating on the same branch. cache hits are a product feature disguised as infrastructure.
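a quick way to sanity-check a number like that: if a hit costs nothing at the model and every miss costs about the same, the saving is just the hit rate. `inferenceSavings` is a hypothetical helper and the request counts below are illustrative, not docgen's measurements.

```javascript
// Fraction of model calls avoided, assuming each miss costs one model call
// of roughly equal price and each hit costs nothing at the model.
function inferenceSavings(hits, misses) {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Illustrative counts only: 350 hits out of 1000 requests.
const savings = inferenceSavings(350, 650);
console.log(`${(savings * 100).toFixed(0)}% fewer model calls`);
```

real requests vary in token count, so tracking hits and misses per request is only a first approximation of the bill.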
ttl is a product decision
expiry isn't just infra. it's how wrong you're willing to be, for how long.
- short (minutes) — near-live data, higher cost
- medium (hours) — docgen default for completed runs
- long (days) — stable public content, manual invalidation on deploy
shorten ttl too aggressively → cost goes up. lengthen without prompt versioning → quality complaints go up. tune both together.
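the tiers above can live in one small table so the choice is visible in code review. a minimal sketch: `TTL_SECONDS` and `ttlFor` are hypothetical names, the short and long durations are assumptions, and medium reuses the 24 hours from the docgen snippet.

```javascript
// TTL tiers in seconds. Durations for `short` and `long` are assumed values.
const TTL_SECONDS = Object.freeze({
  short: 5 * 60,           // near-live data
  medium: 60 * 60 * 24,    // completed docgen runs
  long: 60 * 60 * 24 * 7,  // stable public content, invalidated on deploy
});

// Look up a tier and fail loudly on typos instead of caching forever.
function ttlFor(tier) {
  if (!Object.prototype.hasOwnProperty.call(TTL_SECONDS, tier)) {
    throw new Error(`unknown ttl tier: ${tier}`);
  }
  return TTL_SECONDS[tier];
}

console.log(ttlFor('medium'));
```

a call site would then pass `ttlFor('medium')` where the snippet hard-codes `60 * 60 * 24`, which keeps the product decision in one place.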
my order now:
- make it correct
- measure what's actually slow
- cache the hot path
- document how entries expire
- hero architecture waits until you know what hurts