PR #4235: [Conrad]: ARC-2356: Update Brain corpus from document changes

PR Summary

What problems was I solving

Before this change, ordinary document uploads and edits in the Documents Service did not automatically flow into Brain's knowledge corpus updates; Brain corpora were only rebuilt through domain onboarding. Additionally, Brain RAG-Fusion retrieval was not filtered to a published corpus version, so chunks from superseded or incomplete corpus states could be returned.

What user-facing changes did I ship

Documents uploads and edits now automatically update Brain's default knowledge corpus. Brain-grounded retrieval respects the latest published corpus version, reducing stale or partial corpus results. During the transition, tenants without a published Brain corpus fall back to document retrieval so answers remain available.

How I implemented it

The PR wires Documents Service ingestion to best-effort trigger a new tenant-scoped documentCorpusUpdateBatchWorkflow after a document is marked ready. The batch workflow coalesces document-version signals, starts per-document claim extraction early while the batch is still open, reuses existing pending/processing/completed extractions to avoid duplicate model work, bounds batches by idle debounce/max duration/max count, and continues-as-new to limit history. Completed batch extractions feed the existing document-to-corpus pipeline, which now publishes complete corpus manifests by merging changed document sources with unchanged current corpus sources. Corpus publication now stages default_corpus_version_ids membership in Turbopuffer before flipping knowledge_corpora.current_version_id. Brain RAG-Fusion retrieval loads the latest published corpus version and filters keyword/semantic Turbopuffer searches to that version, falling back to Documents retrieval when no corpus has been published.

Description for the changelog

Implemented ARC-2356 so document uploads and edits automatically feed Brain's default knowledge corpus through a tenant-scoped, coalesced batch workflow with early, idempotent claim extraction. Corpus publishes now merge changed document sources into the current published manifest, stage Turbopuffer corpus-membership metadata before flipping the current version, and Brain RAG-Fusion retrieval is filtered to the latest published corpus version. A post-release source-chunk re-index is required for tenants with existing Brain corpora.

Brain Service Repository & Store Layer

medium7 files

Adds corpus-version staging queries and reusable document extraction lookup to Brain's repository layer.

Brain Service Corpus Publication Semantics

medium3 files

Makes corpus publishes complete-manifest merges and stages version membership in Turbopuffer before flipping the current version.

Brain Service Document Corpus Update API & Workflow

medium6 files

Adds `startDocumentCorpusUpdate` RPC and a tenant-scoped Temporal batch workflow that coalesces document changes into Brain corpus updates.

Brain Service RAG-Fusion Retrieval Filtering

medium2 files

Filters Brain RAG-Fusion retrieval to the latest published default corpus version and falls back to Documents when no corpus exists.

Documents Service Trigger for Brain Corpus Updates

medium5 files

Wires document ingestion to best-effort trigger Brain corpus updates after document readiness.

Brain Service Reusable Extraction & Onboarding Wiring

medium4 files

Adds reusable document extraction activity and wires Turbopuffer config through the knowledge-build runtime.

Brain Service Test Harness Updates

low3 files

Adds the new required `findReusableDocumentExtractionBuild` stub to existing Brain unit tests.

Post-Release Re-index Task

low1 file

Documents the required post-deploy Brain source-chunk re-index to backfill corpus-membership metadata.