RFC-0004: Search Protocol
Version: 0.2.0 | Status: normative | Phase: test
1. Summary
[RFC-0004:C-OVERVIEW] Overview (Informative)
Search enables agents to find content within skills using full-text queries.
The search system builds an index during compilation and supports BM25-ranked queries with snippet extraction. Rather than requiring agents to know exact section headings, search allows natural language queries like “how to configure authentication” to find relevant content.
Design principles:
- Progressive disclosure — Search returns section references; agents use show to retrieve content
- Offline-first — Uses SQLite FTS5, no external dependencies
- Extensible formats — File format support is added incrementally
- Observable — All searches are logged for analytics
Commands:
- skc search <skill> <query> — Find matching sections
Since: v0.1.0
2. Specification
[RFC-0004:C-FORMATS] Supported Formats (Normative)
The search index MUST support the following file formats:
v0.1.0:
- .md — Markdown files, segmented by headings. Each section (heading + content until next heading of equal or higher level) becomes a searchable document. The section field contains the heading text.
- .txt — Plain text files, indexed as a single document. The section field MUST be an empty string ("").
Retrieval:
Search results from any indexed format can be retrieved using skc open <skill> <path>. The open command is not restricted to .md files (see RFC-0002:C-OPEN).
Additional formats MAY be added in future versions. The implementation MUST silently skip unsupported file types without error.
Planned (future versions):
- .csv — Each row as a searchable document
- .json, .yaml, .toml — Flattened key-value pairs
- Code files (.py, .js, .ts, .rs) — Plain text with comment weighting
Since: v0.1.0
[RFC-0004:C-INDEX] Index Storage (Normative)
The search index MUST be stored as a SQLite FTS5 database in the skill’s runtime directory.
Index file naming: To avoid collisions when multiple source directories share a runtime store, the index file MUST be named using a hash of the source path:
.skillc-meta/search-<hash16>.db
Where <hash16> is the first 16 hexadecimal characters (64 bits) of the SHA-256 hash of the canonicalized source directory path.
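Example (Informative): a minimal sketch of the filename derivation. It assumes that canonicalization means resolving the path with os.path.realpath, that the path is hashed as UTF-8 bytes, and that .skillc-meta sits directly under the runtime directory; this RFC does not fix any of these details. The function names are hypothetical.

```python
import hashlib
import os


def index_filename(source_dir: str) -> str:
    """Return the index file name for a source directory."""
    # Assumption: canonicalization = absolute path with symlinks resolved.
    canonical = os.path.realpath(source_dir)
    # Assumption: the canonical path is hashed as UTF-8 bytes.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # The first 16 hex characters carry 64 bits of the digest.
    return f"search-{digest[:16]}.db"


def index_path(runtime_dir: str, source_dir: str) -> str:
    """Join the runtime directory, .skillc-meta, and the index file name."""
    return os.path.join(runtime_dir, ".skillc-meta", index_filename(source_dir))
```

Because the name depends only on the canonical path, two spellings of the same directory (for example . and its absolute path) resolve to the same index file.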
Tokenizer preference: The current tokenizer preference is determined at runtime:
- Attempt porter unicode61
- If unavailable, fall back to unicode61
This preference is used for both index creation (build) and checking (search).
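Example (Informative): one possible way to determine the preference at runtime is to probe an in-memory database. This is a sketch only; the probing strategy and the function name detect_tokenizer are assumptions, not part of this RFC.

```python
import sqlite3


def detect_tokenizer() -> str:
    """Return the first tokenizer spec that this SQLite build accepts."""
    for spec in ("porter unicode61", "unicode61"):
        conn = sqlite3.connect(":memory:")
        try:
            # Creating a throwaway FTS5 table fails if the tokenizer
            # (or FTS5 itself) is unavailable in this build.
            conn.execute(
                f"CREATE VIRTUAL TABLE probe USING fts5(content, tokenize='{spec}')"
            )
            return spec
        except sqlite3.OperationalError:
            continue
        finally:
            conn.close()
    raise RuntimeError("FTS5 is not available in this SQLite build")
```

Calling the same function from both build and search keeps the two code paths in agreement about the current preference.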
Corrupt index definition: An index is corrupt if any of:
- The database file cannot be opened as a SQLite database
- The index_meta table does not exist
- Any required key is missing (skill_path, source_hash, schema_version, tokenizer)
- Any required key value cannot be parsed (e.g., non-integer schema_version)
Any read or parse failure during index access MUST be treated as corruption. Implementations MUST NOT attempt to distinguish transient errors from permanent corruption; all failures map to the same handling (E002 for search, delete+rebuild for build).
Index file selection (search): When searching, the implementation MUST follow these steps in order:
1. Compute the expected filename search-<hash16>.db
2. If file does not exist: exit with error E002 (missing)
3. Open the database and read required keys from index_meta; if corrupt (per definition above): exit with error E002
4. If skill_path does not match current source path: exit with error E003 (collision)
5. Check staleness conditions (see below): if stale, exit with error E002
6. Proceed with search
E002 conditions (search): E002 (“unusable”) is the umbrella error covering three distinct failure modes:
- Missing: file does not exist (step 2)
- Corrupt: any read/parse failure (step 3)
- Stale: metadata mismatch (step 5) — a subset of “unusable”
All three require skc build to fix.
See RFC-0005:C-CODES for canonical error messages.
Staleness conditions (search-only): “Stale” is a subset of “unusable” that applies when the index can be read but its metadata does not match the current skill state.
After confirming no collision (step 4), the index is stale if any of:
- source_hash does not match current manifest hash
- schema_version < current schema version (currently 2)
- tokenizer does not match current tokenizer preference
These three fields are the only staleness conditions. Missing file and corrupt index are handled earlier (steps 2-3). The skill_path field is used for collision detection (step 4), not staleness.
If stale, skc search MUST exit with error E002.
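Example (Informative): the search-side selection steps and staleness conditions can be sketched as a single check that returns an error code. The function names and the convention of returning the code as a string are hypothetical; only the ordering of the checks is taken from this section.

```python
import os
import sqlite3

CURRENT_SCHEMA_VERSION = 2
REQUIRED_KEYS = ("skill_path", "source_hash", "schema_version", "tokenizer")


def read_meta(db_path: str) -> dict:
    """Read required keys from index_meta; raise ValueError on any failure."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = dict(conn.execute("SELECT key, value FROM index_meta"))
        finally:
            conn.close()
        meta = {key: rows[key] for key in REQUIRED_KEYS}
        meta["schema_version"] = int(meta["schema_version"])
        return meta
    except (sqlite3.Error, KeyError, TypeError, ValueError) as exc:
        # Every read or parse failure is treated as corruption.
        raise ValueError(f"corrupt index: {exc}") from exc


def check_index_for_search(db_path, source_path, source_hash, tokenizer):
    """Return None if the index is usable, else "E002" or "E003"."""
    if not os.path.exists(db_path):        # step 2: missing
        return "E002"
    try:
        meta = read_meta(db_path)          # step 3: corrupt
    except ValueError:
        return "E002"
    if meta["skill_path"] != source_path:  # step 4: collision
        return "E003"
    stale = (                              # step 5: staleness conditions
        meta["source_hash"] != source_hash
        or meta["schema_version"] < CURRENT_SCHEMA_VERSION
        or meta["tokenizer"] != tokenizer
    )
    return "E002" if stale else None       # step 6: proceed when None
```

The collision check runs before the staleness check, so an index belonging to a different source path is reported as E003 even when its other metadata also differs.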
Index lifecycle (build):
When skc build runs:
1. Compute the expected filename search-<hash16>.db
2. If file does not exist: proceed to step 6 to create new index
3. Open the database and read required keys from index_meta; if any read/parse failure: delete file, proceed to step 6 to create new index
4. If skill_path does NOT match current source path: exit with error E003 (collision). Stop.
5. Compare source_hash, schema_version, and tokenizer:
   - If all match: skip rebuild. Done.
   - Else: delete existing file, proceed to step 6 to create new index.
6. Create new index with current tokenizer preference
Build does not error on unusable indexes; it rebuilds them. Corrupt indexes are deleted without collision detection since skill_path cannot be reliably read.
Build behavior summary:
- Missing: create new index (step 2 → 6)
- Corrupt: delete and rebuild without collision check (step 3 → 6)
- Collision (skill_path mismatch): error E003 (step 4)
- Up-to-date (all metadata matches): skip rebuild (step 5)
- Stale (metadata differs): rebuild (step 5 → 6)
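Example (Informative): a sketch of the build-side decision, returning what the caller should do next. The names load_meta and plan_build and the return values "create", "skip", and "E003" are hypothetical. The sketch reads "all match" as strict equality on the three compared fields.

```python
import os
import sqlite3

CURRENT_SCHEMA_VERSION = 2


def load_meta(db_path):
    """Return index_meta as a dict, or None on any read/parse failure."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            meta = dict(conn.execute("SELECT key, value FROM index_meta"))
        finally:
            conn.close()
        return {
            "skill_path": meta["skill_path"],
            "source_hash": meta["source_hash"],
            "schema_version": int(meta["schema_version"]),
            "tokenizer": meta["tokenizer"],
        }
    except (sqlite3.Error, KeyError, TypeError, ValueError):
        return None


def plan_build(db_path, source_path, source_hash, tokenizer):
    """Return "create", "skip", or "E003"; unusable files are deleted first."""
    if not os.path.exists(db_path):            # step 2: missing
        return "create"
    meta = load_meta(db_path)
    if meta is None:                           # step 3: corrupt, no collision check
        os.remove(db_path)
        return "create"
    if meta["skill_path"] != source_path:      # step 4: collision, file is kept
        return "E003"
    current = (source_hash, CURRENT_SCHEMA_VERSION, tokenizer)
    found = (meta["source_hash"], meta["schema_version"], meta["tokenizer"])
    if found == current:                       # step 5: up to date
        return "skip"
    os.remove(db_path)                         # step 5: stale
    return "create"
```

Note that the collision branch never deletes the file, which matches the rule that another skill's index must not be destroyed automatically.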
No automatic cleanup:
The implementation MUST NOT delete other search-*.db files. Multiple skills may share a runtime directory.
Runtime directory resolution: The runtime directory MUST be resolved using the same logic as RFC-0007:C-RESOLUTION.
Index schema: The database MUST contain a virtual table using FTS5 for full-text search:
CREATE VIRTUAL TABLE sections USING fts5(
file,
section,
content,
tokenize='porter unicode61'
);
The database MUST contain a headings table for section lookup by RFC-0002:C-SHOW:
CREATE TABLE headings (
id INTEGER PRIMARY KEY,
file TEXT NOT NULL,
text TEXT NOT NULL,
level INTEGER NOT NULL,
start_line INTEGER NOT NULL,
end_line INTEGER NOT NULL
);
CREATE INDEX idx_headings_text ON headings(text COLLATE NOCASE);
Fields:
- file — relative path from skill root
- text — heading text (without # prefix)
- level — heading level (1-6)
- start_line — 1-based line number of heading
- end_line — 1-based line number of next heading (or EOF+1)
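Example (Informative): a sketch of how rows for the headings table might be derived from a Markdown file. It reads end_line literally as the line of the next heading of any level, and it is deliberately simplified: it recognizes only ATX headings (# through ######) and does not skip fenced code blocks or strip closing # sequences. The function name extract_headings is hypothetical.

```python
import re

# One to six '#' characters, at least one space, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def extract_headings(file: str, text: str) -> list[dict]:
    """Return one row per heading with 1-based start_line and end_line."""
    lines = text.splitlines()
    rows = []
    for number, line in enumerate(lines, start=1):
        match = HEADING_RE.match(line)
        if match:
            rows.append({
                "file": file,
                "text": match.group(2),
                "level": len(match.group(1)),
                "start_line": number,
            })
    # end_line is the line of the next heading, or EOF+1 for the last one.
    for row, following in zip(rows, rows[1:] + [None]):
        row["end_line"] = following["start_line"] if following else len(lines) + 1
    return rows
```

With this convention the range start_line to end_line is half-open, so consecutive headings never overlap.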
The database MUST also contain a metadata table:
CREATE TABLE index_meta (
key TEXT PRIMARY KEY,
value TEXT
);
Required metadata keys:
- source_hash — Hash from RFC-0001:C-MANIFEST
- skill_path — Canonicalized absolute path to source directory
- schema_version — Integer (currently 2)
- indexed_at — RFC 3339 UTC timestamp
- tokenizer — porter or unicode61
Schema migration:
No in-place migration. User must run skc build.
Updated in v0.2.0: Added headings table for index-based section lookup. Bumped schema_version to 2.
Since: v0.1.0
[RFC-0004:C-SEARCH] Search Command (Normative)
Syntax: skc search <skill> <query> [options]
The search command MUST query the FTS5 index and return ranked results.
Options:
- --limit N — Maximum results (default: 10)
- --format <text|json> — Output format (default: text)
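Ranking: Results MUST be ranked using BM25. Scores MUST be negated for display, because the FTS5 bm25() function returns negative values in which better matches are numerically lower; after negation, a higher score means a better match.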
Snippet extraction:
Snippet parameters (FTS5 snippet() function):
- Column: content (index 2)
- Start marker: [MATCH]
- End marker: [/MATCH]
- Ellipsis: ...
- Token limit: 32
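Example (Informative): a sketch of a ranked query that uses the snippet parameters listed above. It assumes an FTS5-enabled SQLite build and a sections table created as in RFC-0004:C-INDEX; the function name search and the sample rows are hypothetical.

```python
import sqlite3


def search(conn: sqlite3.Connection, fts_query: str, limit: int = 10) -> list[dict]:
    """Run an FTS5 query and return results ordered best-first."""
    sql = """
        SELECT file, section,
               snippet(sections, 2, '[MATCH]', '[/MATCH]', '...', 32) AS snip,
               -bm25(sections) AS score
        FROM sections
        WHERE sections MATCH ?
        ORDER BY bm25(sections)
        LIMIT ?
    """
    rows = conn.execute(sql, (fts_query, limit)).fetchall()
    return [
        {"file": f, "section": s, "snippet": snip, "score": score}
        for f, s, snip, score in rows
    ]


# Usage: a throwaway in-memory index with three sections.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE sections USING fts5("
    "file, section, content, tokenize='porter unicode61')"
)
conn.executemany(
    "INSERT INTO sections (file, section, content) VALUES (?, ?, ?)",
    [
        ("auth.md", "Setup", "How to configure authentication for the service."),
        ("intro.md", "Overview", "General notes about the project."),
        ("faq.md", "Questions", "Answers to common questions."),
    ],
)
results = search(conn, '"configure" "authentication"')
```

Ordering by the raw bm25() value (ascending) and displaying its negation yields the same order as sorting the displayed scores in descending order.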
Output format guarantees:
JSON is canonical (Normative). Use JSON for machine parsing.
JSON output (--format json):
{
"query": "<original-query>",
"results": [
{
"file": "<relative-path>",
"section": "<heading-or-identifier>",
"snippet": "...text with [MATCH]term[/MATCH]...",
"score": <float>
}
]
}
Text output (--format text) — Informative only:
Human-readable, NOT a stable contract. Do NOT parse.
<file>#<section> (score: <score>)
<snippet>
No results:
Return empty result set, exit 0. JSON: {"query": "...", "results": []}.
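Example (Informative): a sketch of serializing results into the JSON shape described in this section, including the empty-result case. The function name to_json and the input dictionaries are hypothetical; key order and whitespace in the output are not specified by this RFC.

```python
import json


def to_json(query: str, results: list[dict]) -> str:
    """Serialize search results in the documented JSON shape."""
    payload = {
        "query": query,
        "results": [
            {
                "file": r["file"],
                "section": r["section"],
                "snippet": r["snippet"],
                "score": float(r["score"]),
            }
            for r in results
        ],
    }
    # ensure_ascii=False keeps non-ASCII snippet text readable.
    return json.dumps(payload, ensure_ascii=False)
```

An empty result list produces an object whose results key is an empty array, which is the no-results form shown above.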
Skill resolution: Per RFC-0007:C-RESOLUTION.
Error handling: All errors exit with status 1. See RFC-0005:C-CODES for canonical error messages.
| Condition | Error Code |
|---|---|
| Skill resolution failed | E001 or E010 |
| Index unusable | E002 |
| Index hash collision | E003 |
| Empty query | E004 |
| Invalid CLI option | E100 |
Note: “Index unusable” (E002) covers three cases: missing file, corrupt index, or stale metadata. See RFC-0004:C-INDEX for details.
Since: v0.1.0
[RFC-0004:C-LOGGING] Search Logging (Normative)
Search commands MUST be logged per RFC-0007:C-LOGGING.
Command name: search
Args format:
{
"query": "<search-query>",
"result_count": <number-of-results-returned>
}
Error field:
If the search fails (e.g., stale index), the error message MUST be recorded in the error field.
Analytics extension (future):
The stats command (per RFC-0003) SHOULD be extended to support a searches query type that aggregates:
- Query strings and their frequency
- Average result counts
- Zero-result query patterns
This extension is NOT part of v0.1.0. Until implemented, --query searches is not a valid query type and will result in error E030 per RFC-0005:C-CODES.
Since: v0.1.0
[RFC-0004:C-QUERY-SYNTAX] Query Syntax (Normative)
Query semantics:
The <query> argument is treated as a bag-of-words query with implicit AND. Each word in the query must appear somewhere in the document for a match; word order and adjacency are NOT required.
Example: Query configure authentication matches documents containing both words anywhere in the content, regardless of order or proximity.
Tokenizer-dependent matching:
Query matching behavior depends on the tokenizer used to build the index (recorded in index_meta.tokenizer):
- porter: Terms are stemmed. Query configuring matches indexed configure.
- unicode61: No stemming. Query configuring does NOT match configure.
FTS5 handles tokenizer selection internally based on how the index was created. The implementation does not need to branch on tokenizer type; FTS5 applies the correct tokenizer automatically. However, the recorded tokenizer value is useful for debugging, diagnostics, and user understanding of matching behavior.
Query tokenization: To construct a bag-of-words AND query, the implementation MUST:
1. Split the query on ASCII whitespace only (space U+0020, tab U+0009, newline U+000A, carriage return U+000D)
2. Remove empty tokens
3. For each non-empty token:
   a. Escape internal " by doubling (" → "")
   b. Wrap in double quotes to make it a literal FTS5 term
4. Join all quoted tokens with spaces (implicit AND in FTS5)
Unicode whitespace limitation: Non-ASCII whitespace characters (e.g., non-breaking space U+00A0, ideographic space U+3000) are NOT treated as token separators. They are included as part of the token. This is a known limitation. Users should use ASCII spaces in queries.
Example: User input configure authentication becomes FTS5 query "configure" "authentication".
Example with quotes: User input my "special" app becomes FTS5 query "my" """special""" "app".
Why this works: Each quoted token is passed to FTS5 as a single-term phrase. FTS5 applies its internal tokenizer (matching the index) to each term, handling Unicode normalization and stemming as appropriate. Multiple quoted terms joined by spaces create an implicit AND query.
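Example (Informative): the query tokenization rules can be sketched as follows. The names build_fts_query and EmptyQueryError are hypothetical, and the exception message is a placeholder rather than the canonical E004 text from RFC-0005:C-CODES. An explicit character class is used because Python's str.split() with no arguments also splits on non-ASCII whitespace such as U+00A0.

```python
import re

# ASCII whitespace only: space, tab, newline, carriage return.
ASCII_WS = re.compile(r"[ \t\n\r]+")


class EmptyQueryError(ValueError):
    """Hypothetical exception standing in for error E004."""


def build_fts_query(raw: str) -> str:
    """Turn user input into a quoted, implicit-AND FTS5 query string."""
    # Split on ASCII whitespace and drop empty tokens.
    tokens = [t for t in ASCII_WS.split(raw) if t]
    if not tokens:
        raise EmptyQueryError("E004: empty query")  # placeholder message
    # Double any internal quotes, then wrap each token in double quotes.
    quoted = ['"' + t.replace('"', '""') + '"' for t in tokens]
    # Space-separated terms are an implicit AND in FTS5.
    return " ".join(quoted)
```

Because every token is quoted, characters that would otherwise be FTS5 operators or syntax (for example * or a leading -) are passed through as literal text.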
Note on punctuation:
Punctuation attached to words (e.g., hello, or (world)) is passed to FTS5 as-is. FTS5’s tokenizer will strip it during matching.
Shell quoting: The query is passed as a single shell argument. Users MUST quote multi-word queries:
skc search my-skill "configure authentication" # Correct
skc search my-skill configure authentication # Wrong: two positional args
Empty query: If the query is empty or contains only ASCII whitespace, the command MUST exit with error E004. See RFC-0005:C-CODES for the canonical message.
Future extension:
- --raw — Pass raw FTS5 syntax for advanced users
- --phrase — Require adjacency (true phrase matching)
These flags are NOT part of v0.1.0.
Since: v0.1.0
[RFC-0004:C-ERRORS] Error Messages (Normative)
All search-related errors MUST exit with status 1 and print an error message to stderr.
Error codes: See RFC-0005:C-CODES for canonical error messages. This RFC uses:
| Condition | Error Code |
|---|---|
| Skill resolution failed | E001 or E010 |
| Index unusable | E002 |
| Index hash collision | E003 |
| Empty query | E004 |
| Invalid CLI option | E100 |
Usage:
- E001/E010: Skill resolution failed per RFC-0007:C-RESOLUTION. See RFC-0005:C-CODES for when to use each.
- E002: Index unusable (missing, corrupt, or stale). See RFC-0004:C-INDEX for details.
- E003: Index filename exists but skill_path does not match (hash collision)
- E004: Query is empty or contains only whitespace
- E100: Unknown flag, missing required value, or other CLI parsing failure
Collision handling: Hash collisions are rare but possible. When detected, the user MUST manually delete the conflicting index file. The error message includes the filename pattern to delete. Automatic deletion is not performed because it could destroy another skill’s valid index.
Corrupt index handling:
An index is corrupt if the database cannot be opened, index_meta table is missing, required keys are missing, or key values cannot be parsed. See RFC-0004:C-INDEX for the full definition. During search, corruption maps to E002. During build, corrupt files are deleted and rebuilt without collision detection (since skill_path cannot be verified).
Since: v0.1.0
Changelog
v0.2.0 (2026-01-31)
C-INDEX updated with headings table, schema version bumped to 2
v0.1.0 (2026-01-30)
Initial release