ADR-0039: Use SQLite FTS5 as read-only search index for governance artifacts

Status: accepted | Date: 2026-04-09

Tags: cli

References: RFC-0002, RFC-0004, ADR-0048

Context

govctl manages governance artifacts (RFCs, ADRs, clauses, work items, guards) as TOML files in gov/. As the corpus grows (currently 200+ artifacts, projected to reach 1000+ in active projects), finding artifacts by content becomes increasingly difficult.

Problem Statement

Users need to answer questions like “which ADR discussed caching?”, “which RFC clause mentions backward compatibility?”, or “which work items reference RFC-0002?”. Currently this requires:

  • grep over raw TOML files (poor UX, no ranking, no stemming)
  • govctl list + manual inspection (only searches titles)
  • Memorizing artifact IDs

None of these scale or provide relevance-ranked results.

Constraints

  • RFC-0002 establishes TOML files as the source of truth — any index must be derived, not authoritative
  • RFC-0004 governs concurrent write safety — the index must not interfere with the file locking protocol
  • The index must work offline with no external services
  • Bringing the index up to date must be fast enough to run transparently on every search query

Decision

Use SQLite FTS5 with lazy incremental sync as the search backend for govctl search, stored as derived local state under .govctl/index.db.

Design

  1. Index location: .govctl/index.db, with SQLite sidecars such as .govctl/index.db-wal and .govctl/index.db-shm when WAL mode is active. The database is disposable local state and is covered by the existing .govctl/ gitignore invariant.

  2. Catalog separation: The database may contain shared artifact catalog tables for ID-to-path lookup and freshness metadata per ADR-0048. Those catalog tables are shared lookup infrastructure. Search-specific FTS tables, ranking data, and snippets remain derived search data.

  3. Indexed content: RFCs, clauses, ADRs, work items, and guards. Each search document stores artifact ID, type, title, source path, stable status metadata where applicable, tags, refs, and a curated concatenation of searchable content fields. Work item descriptions, acceptance criteria, and notes are searchable; legacy inline journal entries remain render-only compatibility data and are not indexed. Raw TOML is not the search document.

  4. Sync strategy — lazy incremental: On every govctl search, compare current artifact path and freshness metadata against the local index manifest. New or changed files are parsed and upserted into the search projection. Deleted files are removed. Missing, corrupt, or incompatible index state is rebuilt.

  5. Freshness rule: govctl search must not return results from an index whose freshness cannot be established. If freshness cannot be established, it must rebuild, fall back to an uncached scan where possible, or return a diagnostic instead of silently returning stale results.

  6. No write-through optimization: Artifact write commands do not need to update the search FTS tables directly. Lazy sync remains correct for manual edits, branch switches, and govctl writes.

  7. Concurrency: SQLite WAL mode and transactions protect the local index from corruption during concurrent search invocations. The search index is not a governed artifact and does not participate in the RFC-0004 gov-root write lock.

  8. Explicit escape hatch: govctl search --reindex forces a full rebuild before querying.
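
The lazy incremental sync in step 4 can be sketched roughly as follows. This is an illustration only, written in Python with the standard-library sqlite3 module rather than rusqlite, and it assumes the SQLite build exposes FTS5. The table names (manifest, docs), the use of a SHA-256 digest as the freshness signal, and the choice to index raw file text are assumptions made to keep the sketch short; the design above indexes curated fields, not raw TOML.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema: a manifest of known files plus one FTS5 table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS manifest (
    path   TEXT PRIMARY KEY,
    digest TEXT NOT NULL
);
CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(
    path UNINDEXED, body, tokenize = 'porter unicode61'
);
"""


def sync(conn: sqlite3.Connection, gov_root: Path) -> dict:
    """Bring the index in line with the files under gov_root; return change counts."""
    conn.executescript(SCHEMA)
    indexed = dict(conn.execute("SELECT path, digest FROM manifest"))
    on_disk = {
        p.relative_to(gov_root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(gov_root.rglob("*.toml"))
    }
    stats = {"upserted": 0, "removed": 0}
    with conn:  # one transaction, so other readers never see a half-synced index
        for path, digest in on_disk.items():
            if indexed.get(path) == digest:
                continue  # unchanged since the last sync
            # Assumption: raw text stands in for the curated search document.
            body = (gov_root / path).read_text(encoding="utf-8")
            conn.execute("DELETE FROM docs WHERE path = ?", (path,))
            conn.execute("INSERT INTO docs (path, body) VALUES (?, ?)", (path, body))
            conn.execute(
                "INSERT OR REPLACE INTO manifest (path, digest) VALUES (?, ?)",
                (path, digest),
            )
            stats["upserted"] += 1
        for path in indexed.keys() - on_disk.keys():  # files deleted from disk
            conn.execute("DELETE FROM docs WHERE path = ?", (path,))
            conn.execute("DELETE FROM manifest WHERE path = ?", (path,))
            stats["removed"] += 1
    return stats
```

Calling sync before every query is what makes a write-through hook unnecessary: manual edits, branch switches, and govctl writes all appear simply as changed or missing files on the next search.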

Why This Design

  • .govctl/ is the existing local-state boundary for loop execution and other derived state.
  • Keeping the index out of gov/ avoids treating disposable search data as a governed artifact mutation.
  • Lazy sync avoids coupling between artifact write paths and search indexing.
  • SQLite FTS5 provides BM25 ranking and snippets without a daemon or external service.
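
The BM25 ranking and snippet support mentioned in the last point can be tried with a few lines of Python's standard-library sqlite3 module, assuming the SQLite build exposes FTS5. The table layout, artifact IDs, and document text below are made-up sample data, not govctl's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE docs USING fts5("
    "artifact_id UNINDEXED, title, body, tokenize = 'porter unicode61')"
)
conn.executemany(
    "INSERT INTO docs (artifact_id, title, body) VALUES (?, ?, ?)",
    [
        ("ADR-9001", "Cache rendered output",
         "Caching rendered output avoids repeated work. The cache is keyed by content hash."),
        ("ADR-9002", "Adopt file locking",
         "Writers take a lock on the gov root. Caching is out of scope here."),
        ("RFC-9001", "Artifact storage", "Artifacts are stored as TOML files."),
        ("WI-9001", "Track acceptance", "Work items track acceptance criteria."),
        ("GUARD-9001", "Release checks", "Guards run checks before a release."),
    ],
)


def search(query: str, limit: int = 10) -> list:
    """Return (artifact_id, score, excerpt) rows, best match first."""
    # bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the most relevant document first.
    # snippet() arguments: table, column index (2 = body), open marker,
    # close marker, ellipsis text, maximum number of tokens.
    return conn.execute(
        "SELECT artifact_id, bm25(docs), snippet(docs, 2, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    ).fetchall()


for artifact_id, score, excerpt in search("caching"):
    print(artifact_id, round(score, 3), excerpt)
```

The document that mentions the term in both title and body ranks ahead of the one that mentions it once, and the excerpt marks the matched words with brackets.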

Consequences

Positive

  • Users can find artifacts by content with relevance ranking instead of relying on raw grep, title-only lists, or memorized IDs.
  • The index is disposable and self-healing: deleting .govctl/index.db only removes local cache state, and the next search can rebuild it.
  • Lazy sync keeps search correct across manual edits, branch switches, and govctl writes without coupling every artifact mutation to the search backend.
  • Using .govctl/ keeps derived search data out of governed artifacts and rendered outputs.
  • Shared catalog metadata from ADR-0048 lets search avoid duplicating path and freshness discovery logic.
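
The "disposable and self-healing" property can be pictured as an open-or-rebuild routine. The following is a minimal Python sketch using the standard-library sqlite3 module, assuming an FTS5-enabled SQLite build; open_index, SCHEMA_VERSION, and the docs table are hypothetical names, and PRAGMA user_version is just one possible way to detect an incompatible index.

```python
import sqlite3
from pathlib import Path

SCHEMA_VERSION = 1  # hypothetical; bumped whenever the index layout changes


def _discard(db_path: Path) -> None:
    """Delete the database file and its WAL sidecars, if present."""
    for suffix in ("", "-wal", "-shm"):
        Path(str(db_path) + suffix).unlink(missing_ok=True)


def open_index(db_path: Path, force_rebuild: bool = False) -> sqlite3.Connection:
    """Open the index, recreating it if it is missing, corrupt, or incompatible."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    if force_rebuild:  # the role played by an explicit reindex request
        _discard(db_path)
    for _attempt in range(2):
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("PRAGMA journal_mode = WAL")
            version = conn.execute("PRAGMA user_version").fetchone()[0]
            if version == SCHEMA_VERSION:
                conn.execute("SELECT count(*) FROM docs")  # probe the table
                return conn
            if version == 0:  # brand-new file: create the schema
                conn.execute(
                    "CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)"
                )
                conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
                conn.commit()
                return conn
        except sqlite3.DatabaseError:
            pass  # corrupt or unreadable: fall through and start over
        conn.close()
        _discard(db_path)
    # Surface a diagnostic rather than serving results from a bad index.
    raise RuntimeError(f"could not create a usable index at {db_path}")
```

A freshly created index is empty, so a caller would follow open_index with a sync pass before answering the query.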

Negative

  • Adds rusqlite with bundled SQLite, increasing binary size and build complexity.
  • First search after a large branch switch or cache deletion may be slower while the local index rebuilds.
  • CJK text may require tokenizer improvements beyond the default English-oriented stemming setup.
  • Search must guard freshness carefully; returning stale results would be worse than a slower rebuild or diagnostic.

Neutral

  • The local SQLite database is a cache, not a new artifact storage format. TOML files remain the source of truth.

Alternatives Considered

SQLite FTS5 with lazy incremental sync: single-file read-only index using rusqlite (bundled), Porter stemming, BM25 ranking, and content-hash-based incremental updates on each search query. (accepted)

  • Pros: battle-tested BM25 ranking out of the box; single-file index, no daemon or external service; Porter stemming handles English morphology (cache/caching/cached); rusqlite is mature, and bundled compilation removes the system SQLite dependency; lazy sync means no separate build step or cache invalidation protocol
  • Cons: adds ~3MB to binary size from bundled SQLite; CJK segmentation requires additional tokenizer configuration
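
The stemming benefit listed above is easy to observe by indexing the same text under two FTS5 tokenizer settings. This is a small Python sketch with the standard-library sqlite3 module, assuming an FTS5-enabled SQLite build; the document IDs and text are invented for the comparison.

```python
import sqlite3

SAMPLES = [
    ("DOC-A", "Results are cached on disk."),
    ("DOC-B", "Add a cache for rendered pages."),
    ("DOC-C", "Caching is disabled by default."),
]


def matching_ids(tokenizer: str, query: str) -> list:
    """Index SAMPLES with the given FTS5 tokenizer and return the matching IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE docs USING fts5("
        f"artifact_id UNINDEXED, body, tokenize = '{tokenizer}')"
    )
    conn.executemany("INSERT INTO docs (artifact_id, body) VALUES (?, ?)", SAMPLES)
    rows = conn.execute(
        "SELECT artifact_id FROM docs WHERE docs MATCH ? ORDER BY artifact_id",
        (query,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Without stemming, only the exact word form matches.
print(matching_ids("unicode61", "caching"))         # ['DOC-C']
# With the Porter wrapper, cache, cached, and caching share one stem.
print(matching_ids("porter unicode61", "caching"))  # ['DOC-A', 'DOC-B', 'DOC-C']
```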

Tantivy (Rust-native full-text search): Use the tantivy crate, a Lucene-inspired search engine written in Rust. Supports BM25, tokenizers, and schema-defined fields natively. (rejected)

  • Pros: pure Rust, no C dependency; more powerful query language (boolean, phrase, fuzzy); purpose-built for search, with better performance at scale
  • Cons: much heavier dependency (~50 crates in dependency tree); index is a directory of segment files, not a single file; overkill for <1000 documents
  • Rejected because: Dependency weight and complexity are disproportionate to the scale of govctl’s artifact corpus. SQLite FTS5 covers the requirements with a single well-understood dependency.

In-memory inverted index with no persistence: Build a simple inverted index on every search invocation by scanning all TOML files, tokenizing content, and ranking by term frequency. No disk cache. (rejected)

  • Pros: zero dependencies (no SQLite, no new crates); no cache invalidation problem, since the index is always fresh
  • Cons: full rebuild on every query (~100ms at 200 files, grows linearly); no stemming or advanced tokenization without additional code; no BM25, so it would need a custom ranking implementation
  • Rejected because: Lacks stemming and BM25 ranking out of the box. Rebuild cost scales linearly and becomes noticeable beyond 500 artifacts. The UX gap versus FTS5 is significant for the marginal dependency savings.
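
For comparison, the rejected in-memory approach amounts to something like the sketch below: a dictionary-based inverted index rebuilt on each invocation and ranked by raw term frequency. It is written in Python purely for illustration; the function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def build_index(docs: dict) -> dict:
    """Map each lowercased token to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token][doc_id] += 1
    return index


def search(index: dict, query: str) -> list:
    """Rank documents by summed raw term frequency over the query tokens."""
    scores = Counter()
    for token in re.findall(r"\w+", query.lower()):
        scores.update(index.get(token, {}))
    return [doc_id for doc_id, _ in scores.most_common()]


docs = {
    "DOC-A": "cache cache lock",
    "DOC-B": "cache",
    "DOC-C": "caching",
}
index = build_index(docs)  # would be rebuilt from the TOML files on every query
print(search(index, "cache"))    # ['DOC-A', 'DOC-B']
print(search(index, "caching"))  # ['DOC-C']
```

The second query shows the gap named in the cons: without a stemmer, "caching" does not find documents that only say "cache", and raw term frequency ignores document length and term rarity in a way BM25 does not.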