RFC-0004: Search Protocol

Version: 0.2.0 | Status: normative | Phase: test


1. Summary

[RFC-0004:C-OVERVIEW] Overview (Informative)

Search enables agents to find content within skills using full-text queries.

The search system builds an index during compilation and supports BM25-ranked queries with snippet extraction. Rather than requiring agents to know exact section headings, search allows natural language queries like “how to configure authentication” to find relevant content.

Design principles:

  1. Progressive disclosure — Search returns section references; agents use show to retrieve content
  2. Offline-first — Uses SQLite FTS5, no external dependencies
  3. Extensible formats — File format support is added incrementally
  4. Observable — All searches are logged for analytics

Commands:

  • skc search <skill> <query> — Find matching sections

Since: v0.1.0


2. Specification

[RFC-0004:C-FORMATS] Supported Formats (Normative)

The search index MUST support the following file formats:

v0.1.0:

  • .md — Markdown files, segmented by headings. Each section (heading + content until next heading of equal or higher level) becomes a searchable document. The section field contains the heading text.
  • .txt — Plain text files, indexed as a single document. The section field MUST be an empty string ("").

Retrieval: Search results from any indexed format can be retrieved using skc open <skill> <path>. The open command is not restricted to .md files (see RFC-0002:C-OPEN).

Additional formats MAY be added in future versions. The implementation MUST silently skip unsupported file types without error.

Planned (future versions):

  • .csv — Each row as a searchable document
  • .json, .yaml, .toml — Flattened key-value pairs
  • Code files (.py, .js, .ts, .rs) — Plain text with comment weighting

Since: v0.1.0

[RFC-0004:C-INDEX] Index Storage (Normative)

The search index MUST be stored as a SQLite FTS5 database in the skill’s runtime directory.

Index file naming: To avoid collisions when multiple source directories share a runtime store, the index file MUST be named using a hash of the source path:

.skillc-meta/search-<hash16>.db

Where <hash16> is the first 16 characters (64 bits) of the SHA-256 hash of the canonicalized source directory path.
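The naming rule can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name index_filename is hypothetical, and it assumes that canonicalization means resolving symlinks and relative components, that the path is hashed as UTF-8 bytes, and that the 16 characters are lowercase hexadecimal digits.

```python
import hashlib
import os


def index_filename(source_dir: str) -> str:
    """Return the index filename for a source directory (hypothetical helper)."""
    # Assumption: canonicalization resolves symlinks and relative components.
    canonical = os.path.realpath(source_dir)
    # Assumption: the canonical path is hashed as UTF-8 bytes.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # 16 hex characters carry the first 64 bits of the digest.
    return f"search-{digest[:16]}.db"
```

Two spellings of the same directory (for example "." and "./") canonicalize to the same path and therefore map to the same index file.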

Tokenizer preference: The current tokenizer preference is determined at runtime:

  1. Attempt porter unicode61
  2. If unavailable, fall back to unicode61

This preference is used for both index creation (build) and checking (search).
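One way to determine the preference at runtime is to probe an in-memory database. The sketch below is illustrative only: detect_tokenizer is a hypothetical name, and the strings it returns are the FTS5 tokenize arguments, which may differ from the exact value an implementation records in index_meta.

```python
import sqlite3


def detect_tokenizer() -> str:
    """Return 'porter unicode61' if FTS5 accepts it, else 'unicode61'."""
    conn = sqlite3.connect(":memory:")
    try:
        # Probe by creating a throwaway FTS5 table with the preferred tokenizer.
        conn.execute(
            "CREATE VIRTUAL TABLE probe USING fts5(x, tokenize='porter unicode61')"
        )
        return "porter unicode61"
    except sqlite3.OperationalError:
        # Note: this branch is also taken if FTS5 itself is unavailable;
        # a real implementation would need to report that case separately.
        return "unicode61"
    finally:
        conn.close()
```

Because the same probe runs for build and search, both commands arrive at the same preference on a given SQLite build.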

Corrupt index definition: An index is corrupt if any of:

  • The database file cannot be opened as a SQLite database
  • The index_meta table does not exist
  • Any required key is missing (skill_path, source_hash, schema_version, tokenizer)
  • Any required key value cannot be parsed (e.g., non-integer schema_version)

Any read or parse failure during index access MUST be treated as corruption. Implementations MUST NOT attempt to distinguish transient errors from permanent corruption; all failures map to the same handling (E002 for search, delete+rebuild for build).

Index file selection (search): When searching, the implementation MUST follow these steps in order:

  1. Compute the expected filename search-<hash16>.db
  2. If file does not exist: exit with error E002 (missing)
  3. Open the database and read required keys from index_meta; if corrupt (per definition above): exit with error E002
  4. If skill_path does not match current source path: exit with error E003 (collision)
  5. Check staleness conditions (see below): if stale, exit with error E002
  6. Proceed with search

E002 conditions (search): E002 (“unusable”) is the umbrella error covering three distinct failure modes:

  • Missing: file does not exist (step 2)
  • Corrupt: any read/parse failure (step 3)
  • Stale: metadata mismatch (step 5) — a subset of “unusable”

All three require skc build to fix.

See RFC-0005:C-CODES for canonical error messages.

Staleness conditions (search-only): “Stale” is a subset of “unusable” that applies when the index can be read but its metadata does not match the current skill state.

After confirming no collision (step 4), the index is stale if any of:

  • source_hash does not match current manifest hash
  • schema_version < current schema version (currently 2)
  • tokenizer does not match current tokenizer preference

These three fields are the only staleness conditions. Missing file and corrupt index are handled earlier (steps 2-3). The skill_path field is used for collision detection (step 4), not staleness.

If stale, skc search MUST exit with error E002.
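The search-side steps and staleness conditions can be sketched as a single check. This is a hedged illustration: read_index_meta and check_for_search are hypothetical names, the error codes are returned as plain strings, and the caller is assumed to supply the current source path, manifest hash, and tokenizer preference.

```python
import os
import sqlite3
from pathlib import Path

REQUIRED_KEYS = ("skill_path", "source_hash", "schema_version", "tokenizer")
CURRENT_SCHEMA_VERSION = 2


def read_index_meta(db_path):
    """Return required metadata as a dict, or None if the index is corrupt."""
    try:
        # mode=ro keeps SQLite from creating or modifying the file.
        uri = Path(db_path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = dict(conn.execute("SELECT key, value FROM index_meta"))
        finally:
            conn.close()
        meta = {key: rows[key] for key in REQUIRED_KEYS}
        meta["schema_version"] = int(meta["schema_version"])
        return meta
    except (sqlite3.Error, KeyError, TypeError, ValueError):
        # Any read or parse failure is treated as corruption.
        return None


def check_for_search(db_path, skill_path, source_hash, tokenizer):
    """Return None if the index is usable, otherwise 'E002' or 'E003'."""
    if not os.path.exists(db_path):
        return "E002"  # step 2: missing
    meta = read_index_meta(db_path)
    if meta is None:
        return "E002"  # step 3: corrupt
    if meta["skill_path"] != skill_path:
        return "E003"  # step 4: collision
    stale = (
        meta["source_hash"] != source_hash
        or meta["schema_version"] < CURRENT_SCHEMA_VERSION
        or meta["tokenizer"] != tokenizer
    )
    return "E002" if stale else None  # step 5: stale, else step 6
```

The collision check runs before the staleness check, so an index belonging to a different source path is never reported as merely stale.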

Index lifecycle (build): When skc build runs:

  1. Compute the expected filename search-<hash16>.db
  2. If file does not exist: proceed to step 6 to create new index
  3. Open the database and read required keys from index_meta; if any read/parse failure: delete file, proceed to step 6 to create new index
  4. If skill_path does NOT match current source path: exit with error E003 (collision). Stop.
  5. Compare source_hash, schema_version, and tokenizer:
    • If all match: skip rebuild. Done.
    • Else: delete existing file, proceed to step 6 to create new index.
  6. Create new index with current tokenizer preference

Build does not error on unusable indexes; it rebuilds them. Corrupt indexes are deleted without collision detection since skill_path cannot be reliably read.

Build behavior summary:

  • Missing: create new index (step 2 → 6)
  • Corrupt: delete and rebuild without collision check (step 3 → 6)
  • Collision (skill_path mismatch): error E003 (step 4)
  • Up-to-date (all metadata matches): skip rebuild (step 5)
  • Stale (metadata differs): rebuild (step 5 → 6)
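The build-side decision can be sketched in the same style. The names prepare_build and _read_meta are hypothetical, and the return values 'create', 'skip', and 'E003' are illustrative labels rather than part of the specification. The function deletes corrupt and stale files itself and leaves actual index creation (step 6) to the caller.

```python
import os
import sqlite3
from pathlib import Path

REQUIRED_KEYS = ("skill_path", "source_hash", "schema_version", "tokenizer")
CURRENT_SCHEMA_VERSION = 2


def _read_meta(db_path):
    """Return required metadata as a dict, or None on any read/parse failure."""
    try:
        uri = Path(db_path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = dict(conn.execute("SELECT key, value FROM index_meta"))
        finally:
            conn.close()
        meta = {key: rows[key] for key in REQUIRED_KEYS}
        meta["schema_version"] = int(meta["schema_version"])
        return meta
    except (sqlite3.Error, KeyError, TypeError, ValueError):
        return None


def prepare_build(db_path, skill_path, source_hash, tokenizer):
    """Return 'create', 'skip', or 'E003'; removes unusable index files."""
    if not os.path.exists(db_path):
        return "create"  # step 2 -> 6: missing
    meta = _read_meta(db_path)
    if meta is None:
        os.remove(db_path)  # step 3 -> 6: corrupt, no collision check possible
        return "create"
    if meta["skill_path"] != skill_path:
        return "E003"  # step 4: collision, file is left untouched
    current = (source_hash, CURRENT_SCHEMA_VERSION, tokenizer)
    stored = (meta["source_hash"], meta["schema_version"], meta["tokenizer"])
    if stored == current:
        return "skip"  # step 5: up to date
    os.remove(db_path)  # step 5 -> 6: stale
    return "create"
```

Only the file at db_path is ever removed, which is consistent with the rule that other search-*.db files must not be deleted.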

No automatic cleanup: The implementation MUST NOT delete other search-*.db files. Multiple skills may share a runtime directory.

Runtime directory resolution: The runtime directory MUST be resolved using the same logic as RFC-0007:C-RESOLUTION.

Index schema: The database MUST contain a virtual table using FTS5 for full-text search:

CREATE VIRTUAL TABLE sections USING fts5(
    file,
    section,
    content,
    tokenize='porter unicode61'
);

The database MUST contain a headings table for section lookup by RFC-0002:C-SHOW:

CREATE TABLE headings (
    id INTEGER PRIMARY KEY,
    file TEXT NOT NULL,
    text TEXT NOT NULL,
    level INTEGER NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL
);

CREATE INDEX idx_headings_text ON headings(text COLLATE NOCASE);

Fields:

  • file — relative path from skill root
  • text — heading text (without # prefix)
  • level — heading level (1-6)
  • start_line — 1-based line number of heading
  • end_line — 1-based line number of next heading (or EOF+1)
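To make the line-number conventions concrete, the sketch below derives heading rows from Markdown text. It is an assumption-laden illustration: extract_headings is a hypothetical name, only ATX-style headings (lines starting with one to six # characters) are recognized, fenced code blocks are not skipped, and end_line is taken literally as the line of the next heading of any level.

```python
import re

# One to six '#' characters, at least one space or tab, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+)$")


def extract_headings(file, text):
    """Return a list of dicts shaped like rows of the headings table."""
    lines = text.splitlines()
    found = []
    for lineno, line in enumerate(lines, start=1):  # 1-based line numbers
        match = HEADING_RE.match(line)
        if match:
            found.append((lineno, len(match.group(1)), match.group(2).strip()))
    rows = []
    for i, (start, level, title) in enumerate(found):
        # end_line is the next heading's line, or EOF+1 for the last heading.
        end = found[i + 1][0] if i + 1 < len(found) else len(lines) + 1
        rows.append({
            "file": file,
            "text": title,
            "level": level,
            "start_line": start,
            "end_line": end,
        })
    return rows
```

For a four-line file whose headings are on lines 1 and 3, the first row spans lines 1 to 3 and the second spans lines 3 to 5, where 5 is EOF+1.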

The database MUST also contain a metadata table:

CREATE TABLE index_meta (
    key TEXT PRIMARY KEY,
    value TEXT
);

Required metadata keys:

  • source_hash — Hash from RFC-0001:C-MANIFEST
  • skill_path — Canonicalized absolute path to source directory
  • schema_version — Integer (currently 2)
  • indexed_at — RFC 3339 UTC timestamp
  • tokenizer — porter or unicode61
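Creating the three tables and writing the metadata could look like the sketch below. The function name create_index and the allow-list of tokenizer strings are assumptions made for illustration; in particular, the value stored under the tokenizer key here is simply the string passed in, which may not match the exact spelling an implementation records.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA_VERSION = 2
# Assumption: the only tokenizer arguments this sketch accepts.
ALLOWED_TOKENIZERS = ("porter unicode61", "unicode61")


def create_index(conn, skill_path, source_hash, tokenizer):
    """Create the FTS5, headings, and index_meta tables on an empty database."""
    if tokenizer not in ALLOWED_TOKENIZERS:
        raise ValueError(f"unsupported tokenizer: {tokenizer!r}")
    # The tokenizer cannot be bound as a parameter, hence the allow-list above.
    conn.execute(
        "CREATE VIRTUAL TABLE sections USING fts5("
        f"file, section, content, tokenize='{tokenizer}')"
    )
    conn.execute(
        "CREATE TABLE headings ("
        "id INTEGER PRIMARY KEY, file TEXT NOT NULL, text TEXT NOT NULL, "
        "level INTEGER NOT NULL, start_line INTEGER NOT NULL, "
        "end_line INTEGER NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_headings_text ON headings(text COLLATE NOCASE)")
    conn.execute("CREATE TABLE index_meta (key TEXT PRIMARY KEY, value TEXT)")
    meta = {
        "source_hash": source_hash,
        "skill_path": skill_path,
        "schema_version": str(SCHEMA_VERSION),
        # RFC 3339 UTC timestamp, e.g. 2026-01-31T12:00:00Z
        "indexed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tokenizer": tokenizer,
    }
    conn.executemany(
        "INSERT INTO index_meta (key, value) VALUES (?, ?)", list(meta.items())
    )
    conn.commit()
```

All values are stored as text, so readers must parse schema_version back to an integer, which is why a non-integer value counts as corruption.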

Schema migration: In-place migration is not supported. When the schema version changes, the user must run skc build to recreate the index.

Updated in v0.2.0: Added headings table for index-based section lookup. Bumped schema_version to 2.

Since: v0.1.0

[RFC-0004:C-SEARCH] Search Command (Normative)

Syntax: skc search <skill> <query> [options]

The search command MUST query the FTS5 index and return ranked results.

Options:

  • --limit N — Maximum results (default: 10)
  • --format <text|json> — Output format (default: text)

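Ranking: Results MUST be ranked using BM25. Scores MUST be negated for display, because the FTS5 bm25() function returns negative values in which a numerically smaller value indicates a better match.

Snippet extraction: Snippets are generated with the FTS5 snippet() function using the following parameters: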

  • Column: content (index 2)
  • Start marker: [MATCH]
  • End marker: [/MATCH]
  • Ellipsis: ...
  • Token limit: 32
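The ranking and snippet parameters above translate into a single FTS5 query. The sketch below is one possible shape, not the canonical implementation: run_search and SEARCH_SQL are hypothetical names, and the fts_query argument is assumed to be an already-escaped FTS5 query string.

```python
import sqlite3

SEARCH_SQL = """
SELECT file,
       section,
       snippet(sections, 2, '[MATCH]', '[/MATCH]', '...', 32) AS snippet,
       -bm25(sections) AS score          -- negate so higher is better
FROM sections
WHERE sections MATCH ?
ORDER BY bm25(sections)                  -- most negative (best) first
LIMIT ?
"""


def run_search(conn, fts_query, limit=10):
    """Run a ranked search and return result dicts with display scores."""
    rows = conn.execute(SEARCH_SQL, (fts_query, limit)).fetchall()
    return [
        {"file": file, "section": section, "snippet": snippet, "score": score}
        for file, section, snippet, score in rows
    ]
```

Sorting on the raw bm25() value and negating only in the select list keeps the ordering correct while presenting positive scores.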

Output format guarantees:

JSON is canonical (Normative). Use JSON for machine parsing.

JSON output (--format json):

{
  "query": "<original-query>",
  "results": [
    {
      "file": "<relative-path>",
      "section": "<heading-or-identifier>",
      "snippet": "...text with [MATCH]term[/MATCH]...",
      "score": <float>
    }
  ]
}

Text output (--format text) — Informative only: Human-readable, NOT a stable contract. Do NOT parse.

<file>#<section> (score: <score>)
  <snippet>

No results: Return empty result set, exit 0. JSON: {"query": "...", "results": []}.
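Serializing the canonical JSON shape, including the empty-result case, might look like this. The helper name render_json is hypothetical, and the sketch assumes each result is a dict that already carries the four documented fields.

```python
import json


def render_json(query, results):
    """Serialize search results in the documented JSON shape."""
    payload = {
        "query": query,  # the original, unescaped user query
        "results": [
            {
                "file": r["file"],
                "section": r["section"],
                "snippet": r["snippet"],
                "score": float(r["score"]),
            }
            for r in results
        ],
    }
    # ensure_ascii=False keeps non-ASCII text readable in the output.
    return json.dumps(payload, ensure_ascii=False)
```

An empty result list still produces a complete object with a "results" array, so consumers never need to special-case a missing key.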

Skill resolution: Per RFC-0007:C-RESOLUTION.

Error handling: All errors exit with status 1. See RFC-0005:C-CODES for canonical error messages.

Condition                 Error Code
Skill resolution failed   E001 or E010
Index unusable            E002
Index hash collision      E003
Empty query               E004
Invalid CLI option        E100

Note: “Index unusable” (E002) covers three cases: missing file, corrupt index, or stale metadata. See RFC-0004:C-INDEX for details.

Since: v0.1.0

[RFC-0004:C-LOGGING] Search Logging (Normative)

Search commands MUST be logged per RFC-0007:C-LOGGING.

Command name: search

Args format:

{
  "query": "<search-query>",
  "result_count": <number-of-results-returned>
}

Error field: If the search fails (e.g., stale index), the error message MUST be recorded in the error field.

Analytics extension (future): The stats command (per RFC-0003) SHOULD be extended to support a searches query type that aggregates:

  • Query strings and their frequency
  • Average result counts
  • Zero-result query patterns

This extension is NOT part of v0.1.0. Until implemented, --query searches is not a valid query type and will result in error E030 per RFC-0005:C-CODES.

Since: v0.1.0

[RFC-0004:C-QUERY-SYNTAX] Query Syntax (Normative)

Query semantics: The <query> argument is treated as a bag-of-words query with implicit AND. Each word in the query must appear somewhere in the document for a match; word order and adjacency are NOT required.

Example: Query configure authentication matches documents containing both words anywhere in the content, regardless of order or proximity.

Tokenizer-dependent matching: Query matching behavior depends on the tokenizer used to build the index (recorded in index_meta.tokenizer):

  • porter: Terms are stemmed. Query configuring matches indexed configure.
  • unicode61: No stemming. Query configuring does NOT match configure.
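The difference between the two tokenizers can be observed directly with in-memory FTS5 tables. This is a small demonstration under stated assumptions: the helper name matches is hypothetical, and the tokenizer string is interpolated into SQL, so it must only ever be a trusted constant.

```python
import sqlite3


def matches(tokenizer, indexed_text, query_term):
    """Return True if query_term matches indexed_text under the tokenizer."""
    conn = sqlite3.connect(":memory:")
    try:
        # tokenizer must be a trusted constant; it cannot be a bound parameter.
        conn.execute(
            f"CREATE VIRTUAL TABLE t USING fts5(content, tokenize='{tokenizer}')"
        )
        conn.execute("INSERT INTO t (content) VALUES (?)", (indexed_text,))
        # Quote the term so it is treated as a literal FTS5 string.
        row = conn.execute(
            "SELECT count(*) FROM t WHERE t MATCH ?", (f'"{query_term}"',)
        ).fetchone()
        return row[0] > 0
    finally:
        conn.close()
```

With "porter unicode61", a query for configuring finds text containing configure because both reduce to the same stem; with plain "unicode61" the same query finds nothing.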

FTS5 handles tokenizer selection internally based on how the index was created. The implementation does not need to branch on tokenizer type; FTS5 applies the correct tokenizer automatically. However, the recorded tokenizer value is useful for debugging, diagnostics, and user understanding of matching behavior.

Query tokenization: To construct a bag-of-words AND query, the implementation MUST:

  1. Split the query on ASCII whitespace only (space U+0020, tab U+0009, newline U+000A, carriage return U+000D)
  2. Remove empty tokens
  3. For each non-empty token:
       a. Escape each internal " by doubling it (" becomes "")
       b. Wrap the token in double quotes to make it a literal FTS5 term
  4. Join all quoted tokens with spaces (implicit AND in FTS5)

Unicode whitespace limitation: Non-ASCII whitespace characters (e.g., non-breaking space U+00A0, ideographic space U+3000) are NOT treated as token separators. They are included as part of the token. This is a known limitation. Users should use ASCII spaces in queries.

Example: User input configure authentication becomes FTS5 query "configure" "authentication".

Example with quotes: User input my "special" app becomes FTS5 query "my" """special""" "app".
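The tokenization steps and both examples can be reproduced with a short function. This is a sketch rather than the reference implementation: build_fts_query and EmptyQueryError are hypothetical names, and the exception merely stands in for exiting with error E004.

```python
import re

# ASCII whitespace only: space, tab, newline, carriage return.
ASCII_WS = re.compile(r"[ \t\n\r]+")


class EmptyQueryError(ValueError):
    """Hypothetical exception standing in for error E004."""


def build_fts_query(raw):
    """Turn a user query into a bag-of-words FTS5 query with implicit AND."""
    # Steps 1 and 2: split on ASCII whitespace and drop empty tokens.
    tokens = [token for token in ASCII_WS.split(raw) if token]
    if not tokens:
        raise EmptyQueryError("E004: empty query")
    # Step 3: double internal quotes, then wrap each token in quotes.
    quoted = ['"' + token.replace('"', '""') + '"' for token in tokens]
    # Step 4: join with spaces, which FTS5 treats as AND.
    return " ".join(quoted)
```

Because every token is quoted, characters that would otherwise be FTS5 operators or syntax are matched literally instead of being interpreted.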

Why this works: Each quoted token is passed to FTS5 as a single-term phrase. FTS5 applies its internal tokenizer (matching the index) to each term, handling Unicode normalization and stemming as appropriate. Multiple quoted terms joined by spaces create an implicit AND query.

Note on punctuation: Punctuation attached to words (e.g., hello, or (world)) is passed to FTS5 as-is. FTS5’s tokenizer will strip it during matching.

Shell quoting: The query is passed as a single shell argument. Users MUST quote multi-word queries:

skc search my-skill "configure authentication"  # Correct
skc search my-skill configure authentication    # Wrong: two positional args

Empty query: If the query is empty or contains only ASCII whitespace, the command MUST exit with error E004. See RFC-0005:C-CODES for the canonical message.

Future extension:

  • --raw — Pass raw FTS5 syntax for advanced users
  • --phrase — Require adjacency (true phrase matching)

These flags are NOT part of v0.1.0.

Since: v0.1.0

[RFC-0004:C-ERRORS] Error Messages (Normative)

All search-related errors MUST exit with status 1 and print an error message to stderr.

Error codes: See RFC-0005:C-CODES for canonical error messages. This RFC uses:

Condition                 Error Code
Skill resolution failed   E001 or E010
Index unusable            E002
Index hash collision      E003
Empty query               E004
Invalid CLI option        E100

Usage:

  • E001/E010: Skill resolution failed per RFC-0007:C-RESOLUTION. See RFC-0005:C-CODES for when to use each.
  • E002: Index unusable (missing, corrupt, or stale). See RFC-0004:C-INDEX for details.
  • E003: Index filename exists but skill_path does not match (hash collision)
  • E004: Query is empty or contains only whitespace
  • E100: Unknown flag, missing required value, or other CLI parsing failure

Collision handling: Hash collisions are rare but possible. When detected, the user MUST manually delete the conflicting index file. The error message includes the filename pattern to delete. Automatic deletion is not performed because it could destroy another skill’s valid index.

Corrupt index handling: An index is corrupt if the database cannot be opened, index_meta table is missing, required keys are missing, or key values cannot be parsed. See RFC-0004:C-INDEX for the full definition. During search, corruption maps to E002. During build, corrupt files are deleted and rebuilt without collision detection (since skill_path cannot be verified).

Since: v0.1.0


Changelog

v0.2.0 (2026-01-31)

C-INDEX updated with headings table, schema version bumped to 2

v0.1.0 (2026-01-30)

Initial release