Wiki scaling — search architecture and growth path

TL;DR: The Hussh wiki now uses a hosted serving plane: Drive is the private source workspace, Cloud Storage stores derived public/private snapshots, Firestore stores metadata and indexes, and MCP is the read/write capability boundary. The scaling story still follows clean tiers: keep content editing human-friendly in Drive, keep public reads off Drive fanout, preserve Git as a scheduled audit/export mirror during the stability window, and move heavier search/rerank work to dedicated services only when usage justifies it.

Status as of 2026-05-26: Production is live on Cloud Run. The current corpus sync publishes the wiki corpus into GCS/Firestore. Ordinary content CRUD should not require a reader redeploy or a per-write Git commit; code/runtime changes still do.

Relations

Hussh wiki — scalable app + MCP architecture — the foundational architecture this scales.
LLM Wiki pattern — landscape & R&D — Karpathy's three-layer pattern.
Wiki search — dual-audience UX + AX — the human-vs-agent UX side.

1. How search works today

The wiki has three layers — storage / index / serving — each replaceable independently.

Storage layer — Drive source plus hosted snapshots

The source workspace is the private Shared Drive. It stores markdown, raw captures, private files, and private artifacts under a permission boundary limited to owners and the runtime account. Every page still has:

name, description, type (user/feedback/project/reference), visibility (public/private)
A ## Relations block listing curated outgoing links
A **Status as of YYYY-MM-DD** line for time-decay tracking
A ## Sources block citing source provenance
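
The per-page conventions above lend themselves to a mechanical check. The following is a minimal sketch, assuming each page is available as a raw markdown string; `checkPageConventions` and its report shape are hypothetical names for illustration and are not taken from the actual lint tool.

```typescript
// Hypothetical helper: reports which per-page conventions are present.
// Field parsing (name, description, type, visibility) is out of scope here.
interface ConventionReport {
  hasRelations: boolean;
  hasSources: boolean;
  statusDate: string | null; // date from the "Status as of" line, if found
}

function checkPageConventions(markdown: string): ConventionReport {
  const lines = markdown.split(/\r?\n/);
  const hasHeading = (title: string): boolean =>
    lines.some((line) => line.trim() === `## ${title}`);

  // Accepts "**Status as of 2026-05-26**" and "**Status as of 2026-05-26:**".
  const statusMatch = markdown.match(/\*\*Status as of (\d{4}-\d{2}-\d{2})/);

  return {
    hasRelations: hasHeading("Relations"),
    hasSources: hasHeading("Sources"),
    statusDate: statusMatch ? statusMatch[1] : null,
  };
}
```

A sync job could quarantine or flag any page whose report has a missing block or a null date before publishing its snapshot.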

The runtime serving layer is not Drive. Validated public/private snapshots are written to Cloud Storage, and Firestore stores path, title, visibility, keywords, chronology, aliases, checksums, publish state, and sensitivity state. Git remains the scheduled audit/export/bootstrap mirror during the stability window.

Index layer — Firestore/search indexes first, heavier retrieval later

The current serving plane builds public and private indexes from the same validated page snapshots:

Metadata index — Firestore stores visibility, type, keywords, aliases, sort dates, and publish state.
Search index — Cloud Storage stores public/private search payloads generated from title, description, TL;DR, body, relations, and keywords.
Display relevance — raw search scores stay useful for ranking, but reader-facing relevance is capped so the UI never reports above 100%.
Future retrieval — qmd/vector/rerank or managed vector stores remain an upgrade path when corpus size and query volume justify the extra operational cost.

                     ┌─ BM25 lexical scores  ─┐
   query   ─────────►│                        │── RRF ──► top-30 ──► LLM rerank ──► final ranking
                     └─ Vector cosine scores ─┘

The current principle is boring and fast: use hosted indexes for normal reads, avoid Drive fanout on public traffic, and add semantic retrieval only when the simpler index stops being enough.
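
If the hybrid path in the diagram above is ever adopted, the fusion step itself is small. Below is a minimal sketch of reciprocal rank fusion (RRF), assuming each retriever returns unique page paths ordered best-first. The function name, the sample paths, and the constant k = 60 (a commonly used default) are illustrative assumptions, not details of the current implementation.

```typescript
// Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank),
// where rank is the 1-based position of the doc in each ranked list.
function reciprocalRankFusion(rankedLists: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((docId, index) => {
      const rank = index + 1;
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id so the order is deterministic.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([docId]) => docId);
}

// Hypothetical rankings for one query from the lexical and vector retrievers.
const bm25Ranking = ["wiki/a.md", "wiki/b.md", "wiki/c.md"];
const vectorRanking = ["wiki/b.md", "wiki/d.md", "wiki/a.md"];

// Keep the top 30 fused candidates for the rerank stage.
const fused = reciprocalRankFusion([bm25Ranking, vectorRanking]).slice(0, 30);
```

RRF only needs ranks, not comparable scores, which is why it can merge BM25 and cosine results without normalizing either scale.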

Serving layer — MCP server + reader

MCP server: TypeScript / Express / StreamableHTTPServerTransport. It enforces auth, visibility, read/write tools, Drive sync, and GCS/Firestore publishing.
Tools: anonymous users see 8 public read tools; authenticated owners see 17 tools including write, capture, and artifact actions.
Reader: Next.js 16, calls MCP from server components and route handlers. The reader does not read markdown from disk.
Sync: MCP writes and Drive changes publish derived snapshots into Cloud Storage/Firestore. Scheduled reconcile catches missed webhooks and keeps indexes fresh.
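
The scheduled reconcile can be thought of as a diff between what the source workspace contains and what the serving plane has published. The sketch below assumes both sides can be reduced to a map from page path to content checksum (Firestore is described as storing checksums). `planReconcile` and `ReconcilePlan` are hypothetical names, and the real sync also handles validation and quarantine, which are omitted here.

```typescript
interface ReconcilePlan {
  publish: string[]; // new or changed pages to (re)publish
  retract: string[]; // published pages that no longer exist in the source
  unchanged: string[]; // checksums match, nothing to do
}

// Hypothetical planner: compares source checksums against published checksums.
function planReconcile(
  source: Map<string, string>,
  published: Map<string, string>,
): ReconcilePlan {
  const plan: ReconcilePlan = { publish: [], retract: [], unchanged: [] };

  source.forEach((checksum, path) => {
    if (published.get(path) === checksum) {
      plan.unchanged.push(path);
    } else {
      plan.publish.push(path);
    }
  });

  published.forEach((_checksum, path) => {
    if (!source.has(path)) {
      plan.retract.push(path);
    }
  });

  // Sorted output keeps logs and tests stable across runs.
  plan.publish.sort();
  plan.retract.sort();
  plan.unchanged.sort();
  return plan;
}
```

Because the plan is computed from checksums alone, a missed webhook only delays a publish until the next reconcile instead of leaving the index permanently stale.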

How `wiki_search` answers a query end-to-end

Client calls wiki_search { query, max_results, type_filter } over MCP HTTP.
MCP chooses the public or private index based on auth tier.
Search runs over the hosted index and applies visibility/type filters.
Results return clean reader URLs, public-safe descriptions, keywords, and capped display relevance.
Private pages, private Drive IDs, private aliases, and private-only relations stay out of anonymous responses.

The critical performance property: page/list/search reads come from GCS/Firestore indexes and cached MCP responses, not Drive listing or repo scans per request.
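
The wiki_search steps listed above can be sketched as a single shaping function. This is an illustration under stated assumptions, not the server's code: query matching is elided (the indexes are assumed to hold already-scored hits), raw scores are assumed to sit on a scale where 1.0 maps to 100% but may exceed it, and the type names, the `driveId` field, and the URL scheme are all hypothetical.

```typescript
type Visibility = "public" | "private";
type AuthTier = "anonymous" | "owner";

interface IndexedHit {
  path: string;
  title: string;
  type: string;
  visibility: Visibility;
  description: string;
  keywords: string[];
  rawScore: number; // engine score, used for ranking; may exceed 1.0
  driveId?: string; // private-only field that must never leave the server
}

interface SearchResult {
  url: string;
  title: string;
  description: string;
  keywords: string[];
  relevancePercent: number;
}

interface SearchInput {
  query: string;
  max_results: number;
  type_filter?: string;
}

function shapeResults(
  indexes: { publicIndex: IndexedHit[]; privateIndex: IndexedHit[] },
  tier: AuthTier,
  input: SearchInput,
): SearchResult[] {
  // Pick the index by auth tier.
  const hits = tier === "owner" ? indexes.privateIndex : indexes.publicIndex;

  // Apply visibility/type filters; the visibility check is defence in depth.
  const allowed = hits.filter(
    (hit) =>
      (tier === "owner" || hit.visibility === "public") &&
      (input.type_filter === undefined || hit.type === input.type_filter),
  );

  // Rank by raw score, then copy only public-safe fields into the response.
  return allowed
    .slice()
    .sort((a, b) => b.rawScore - a.rawScore)
    .slice(0, input.max_results)
    .map((hit) => ({
      url: "/" + hit.path.replace(/\.md$/, ""),
      title: hit.title,
      description: hit.description,
      keywords: hit.keywords,
      // Display relevance is capped so the UI never shows more than 100%.
      relevancePercent: Math.max(0, Math.min(100, Math.round(hit.rawScore * 100))),
    }));
}
```

Building the response field by field, rather than spreading the index record, means a newly added private field cannot leak into anonymous responses by accident.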

2. Scaling axes — what grows independently

Four axes grow at different rates. Plan against each separately:

Axis	Today	Likely 6 months	2 years
Data size (pages)	~60	~600	6,000+
Read throughput (queries/sec)	~0.01 (you, occasional)	~1 (team of 10 active)	~50+ (multi-tenant)
Write throughput (edits/min)	~1 (Claude appends after sources)	~10 (team + agent ingestion)	~100+ (cron pipelines)
Concurrent users	1	10–50	100s–1000s

The current architecture handles today's column comfortably. Each tier below describes what breaks first as you cross a column boundary.

3. Scaling tiers — what breaks, what to do

Tier 0 — single user, ≤ 500 pages (today)

Status: working. Nothing to change.

qmd reindex completes in <5 s for the whole corpus.
Searches return in 200–400 ms cold, <50 ms warm.
Reader serves all routes from a single Next.js dev server.
One MCP server process is plenty.

Investments: none structural. Keep the schema discipline tight (Status as of, ## Sources, visibility:) so that scaling later is cheap.

Tier 1 — team-shared, ≤ 5,000 pages

What breaks first:

wiki_list returns the whole directory in one shot — at 5K pages this is a 1–5 MB JSON payload per call.
wiki_lint rebuilds the entire edge index every call; O(N²) for cross-references.
qmd embed on save scales with chunk count; debounced rebuild gets noticeable.
Single localhost server can't be shared across machines.

Fixes (in order of leverage):

Pagination on wiki_list — add limit + offset to the input schema. Default to first 50.
Async incremental indexing — chokidar already debounces; switch to qmd update --incremental (qmd 2.x supports it) instead of full re-embed.
Lint partitioning — run wiki_lint per-section (scope: "wiki/people") by default; add a wiki_lint_full for the cross-section graph pass that's run nightly via a scheduled job.
MCP server moves off localhost — deploy to a small VM behind HTTPS. Add session-aware transport (sessionIdGenerator) so each user gets stateful streaming. Add per-user auth (API keys → OAuth).
Reader gets a CDN edge cache — Next.js static export to Vercel/Cloudflare Pages for the public wiki; SSR-on-edge for the private wiki; the MCP endpoint is the only origin call.
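
The pagination fix at the top of this list is the cheapest to prototype. Here is a minimal sketch of limit/offset paging with a default of 50, assuming inputs have already passed schema validation as finite numbers. The `MAX_LIMIT` bound and the `nextOffset` field are assumptions added for illustration; the text above only specifies limit, offset, and the default.

```typescript
interface ListInput {
  limit?: number;
  offset?: number;
}

interface ListPage<T> {
  items: T[];
  total: number;
  nextOffset: number | null; // null when there are no further pages
}

const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200; // hypothetical upper bound to keep payloads small

function paginate<T>(all: T[], input: ListInput = {}): ListPage<T> {
  // Clamp limit into [1, MAX_LIMIT] and offset into [0, infinity).
  const limit = Math.min(Math.max(1, Math.floor(input.limit ?? DEFAULT_LIMIT)), MAX_LIMIT);
  const offset = Math.max(0, Math.floor(input.offset ?? 0));

  const items = all.slice(offset, offset + limit);
  const end = offset + items.length;
  return {
    items,
    total: all.length,
    nextOffset: end < all.length ? end : null,
  };
}
```

Returning `total` and `nextOffset` lets an agent decide whether to keep paging without issuing a second call just to count pages.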

What stays the same: qmd is still the search engine. BM25 + vector + LLM rerank still works at 5K pages on a single VM with 8 GB RAM.

Tier 2 — multi-tenant, ≤ 50,000 pages

What breaks first:

Local GGUF embeddings on a single machine: latency creeps to 1–2 s per query as the corpus grows.
BM25 + vector indices on disk get large; cold-start of a new server replica means re-embedding.
Write contention: many agents writing simultaneously through wiki_write/wiki_patch need coordination.
Memory pressure: the LLM reranker holds a model in RAM; can't horizontally scale read replicas without separating it.

Fixes (in order of leverage):

Move embeddings to a managed vector store — pgvector / Qdrant / Turbopuffer / Pinecone. Keep BM25 local (cheap, fast) and call out for vector + rerank. Embeddings get computed once at ingest and cached in the vector DB; servers become stateless.
Separate read/write paths — read replicas (multiple stateless MCP servers behind a load balancer) hit a shared vector store + a read-only mirror of the markdown. Write path goes through a single coordinator that handles wiki_write/patch/link and publishes change events to a queue. Read replicas pick up changes async.
Hot/cold tiering — pages whose Status as of date is less than 90 days old and that have a high access count stay in the hot index (full BM25 + vector + rerank). Cold pages move to a single keyword-only BM25 index. The wiki_search tool unions the results from the two.
Move the LLM reranker out-of-process — run as a dedicated GPU service or use a hosted reranker (Cohere Rerank, Voyage, etc.). Read replicas call it over RPC.
Move append-only log.md — at this scale, log.md is gigabytes. Move to a proper append-only store (SQLite WAL, S3 object-per-day, or a time-series DB). wiki_log_query becomes a query against that store.
Backups + DR — git repo gets snapshotted to S3 daily. Vector DB gets its own backup cadence. Test restore quarterly.
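
The hot/cold rule in this list reduces to a small classifier. The sketch below assumes each page exposes its Status as of date as an ISO string and some recent access count. The 90-day cutoff comes from the text above, while `HOT_MIN_ACCESS_COUNT` and the type names are hypothetical placeholders to be tuned against real analytics.

```typescript
type IndexTier = "hot" | "cold";

interface PageActivity {
  path: string;
  statusAsOf: string; // ISO date (YYYY-MM-DD) from the page's "Status as of" line
  accessCount: number; // accesses within some recent window
}

const HOT_MAX_AGE_DAYS = 90;
const HOT_MIN_ACCESS_COUNT = 20; // hypothetical threshold for "high access count"
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function classifyTier(page: PageActivity, now: Date): IndexTier {
  // An unparseable date yields NaN, which fails the freshness test and lands in "cold".
  const ageDays = (now.getTime() - Date.parse(page.statusAsOf)) / MS_PER_DAY;
  const isFresh = ageDays < HOT_MAX_AGE_DAYS;
  const isPopular = page.accessCount >= HOT_MIN_ACCESS_COUNT;
  return isFresh && isPopular ? "hot" : "cold";
}
```

Passing `now` in explicitly keeps the classifier deterministic, so a nightly job can re-tier the whole corpus against one consistent timestamp.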

Tier 3 — large-scale public, 500,000+ pages

What breaks first:

Single git repo can't hold 500K markdown files efficiently.
Lint and relations graph are O(N²) without partitioning.
Search rerank latency is the user-facing bottleneck even with managed vector stores.

Fixes (these are real distributed-systems decisions; sketches not specs):

Shard the corpus — by directory tree (wiki/people/* shard, wiki/entities/* shard) or by tenant if multi-customer. Each shard has its own qmd-equivalent and write coordinator.
Search aggregator — a thin layer that fans search out to all shards, applies a final RRF + rerank across shard top-Ks, and returns the merged ranking. This scatter-gather pattern is how distributed search engines such as Elasticsearch query a sharded index.
Caching tier in front — Redis / Upstash for hot queries. The 60s client cache becomes a 60s server cache, keyed by (query, type_filter, visibility, user).
CDN-fronted reader — every wiki page gets a static-export prerender; private pages render via signed URL on the CDN edge.
Streaming ingest — agents writing pages publish to a queue; a writer pool consumes, validates schema, writes to git, indexes incrementally. Backpressure visible in queue depth.
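
The server-side cache keyed by (query, type_filter, visibility, user) can be prototyped in memory before committing to Redis or Upstash. The sketch below is an in-process stand-in under that assumption. `cacheKey` and `TtlCache` are hypothetical names, and the query normalization (trim plus lowercase) is an illustrative choice.

```typescript
interface CacheKeyParts {
  query: string;
  typeFilter: string | null;
  visibility: "public" | "private";
  user: string | null; // null for anonymous callers
}

// JSON array encoding avoids the collisions that naive string joining can cause.
function cacheKey(parts: CacheKeyParts): string {
  return JSON.stringify([
    parts.query.trim().toLowerCase(),
    parts.typeFilter,
    parts.visibility,
    parts.user,
  ]);
}

class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Including visibility and user in the key matters more than the TTL: it prevents a cached owner response from ever being served to an anonymous caller who issues the same query.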

4. Deployment topology

The current production topology is a hosted wiki app, not a laptop-only wiki:

Humans + MCP-aware agents
        |
        v
wiki.hushh.ai reader  ---->  mcp.hushh.ai MCP server
                                  |
                                  +--> Drive private source workspace
                                  +--> Cloud Storage public/private snapshots
                                  +--> Firestore metadata and indexes
                                  +--> GitHub daily audit/export/bootstrap mirror

The important split:

Drive is for authoring and review.
Cloud Storage/Firestore is for fast serving and search/list indexes.
MCP owns auth, tool capability, write governance, and sync.
Git remains recovery and provenance, not the intended long-term hot database.

Future topologies should evolve this model rather than return to Drive fanout or repo scans on every request.

5. Concrete next-steps menu

In leverage order:

Keep content CRUD on MCP/Drive and derived publishing. Do not redeploy the reader for page edits.
Keep public reads on GCS/Firestore indexes. Do not make Drive the public database.
Keep repo markdown until export, rollback, diff/history, and owner review are fully hosted and proven.
Add richer search only when the hosted index stops satisfying real queries.
Add write queues only if concurrent MCP writes become a measurable problem.
Add analytics-driven content dashboards before adding heavyweight retrieval infrastructure.

6. What you don't have to worry about

A few things scale gracefully without intervention:

Markdown files — the source format is lightweight. The bottleneck is not file size; it is permissioning, indexing, sync, and retrieval semantics.
Schema correctness — the lint tool catches drift early; Status as of discipline + visibility: tags + ## Sources blocks make every page self-describing.
MCP protocol — Streamable HTTP scales identically to any HTTPS API. No special handling needed.
Authoring throughput — Claude (or any LLM agent) writing pages is bounded by your review cycle, not the system. The wiki can absorb writes faster than you can validate them.

7. Cost model (rough orders of magnitude)

Tier	Compute	Storage	Bandwidth	Net /month
Current — Cloud Run + GCS/Firestore	Cloud Run services	GCS + Firestore + Drive	Low public traffic	Low, dominated by Cloud Run and sync calls
Richer search	same + search worker	vector/search index	moderate	increases only when semantic retrieval is added
Large public	autoscaled read path	larger indexes + CDN/cache	higher	dominated by public traffic and search/rerank

These are rough; actual cost is dominated by the vector DB at Tier 2+ and by traffic at Tier 3.

8. Observability checklist

MCP server: structured logs (request ID, tool name, latency, error code).
Serving sync: Drive scan count, pages published, pages retracted, quarantines, and duration.
Firestore/GCS: index freshness, object generations, and self-repair count.
Reader: client-side error reporting (Sentry) + Web Vitals (LCP, INP, CLS).
Health endpoint: /readyz and MCP tool checks verify serving mode and connectivity.
Audit trail: MCP writes emit structured audit logs immediately; the backup export job preserves Git audit continuity while the hosted metadata plane stabilizes.
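
As an illustration of the first checklist item, a structured log entry for one MCP tool call might be built as follows. The field names mirror the checklist (request ID, tool name, latency, error code), but the exact shape, the `buildToolLogEntry` helper, and the sample request ID are assumptions rather than the server's actual log schema.

```typescript
interface ToolLogEntry {
  timestamp: string; // ISO 8601, taken from the completion time
  requestId: string;
  tool: string;
  latencyMs: number;
  errorCode: string | null; // null on success
}

// Hypothetical helper: timestamps are passed in so the function stays pure.
function buildToolLogEntry(
  requestId: string,
  tool: string,
  startedAtMs: number,
  finishedAtMs: number,
  errorCode: string | null = null,
): ToolLogEntry {
  return {
    timestamp: new Date(finishedAtMs).toISOString(),
    requestId,
    tool,
    latencyMs: finishedAtMs - startedAtMs,
    errorCode,
  };
}

// One JSON object per line is easy for log pipelines to parse and filter.
const logLine = JSON.stringify(buildToolLogEntry("req-123", "wiki_search", 1000, 1042));
```

Keeping one JSON object per line means latency and error-rate dashboards per tool can be derived from the logs without a separate metrics pipeline.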

Sources

MCP implementation notes — current server architecture, capabilities, and transport.
Serving implementation notes — GCS/Firestore helper layer.
Serving sync implementation notes — Drive and filesystem sync into hosted snapshots.
Hussh wiki architecture — current architecture.
Karpathy LLM Wiki gist (raw / wiki / schema layering) — referenced in wiki/about/llm-wiki-pattern.md.
