Full Text Search Index¶
The full text search (FTS) index (a.k.a. inverted index) provides efficient text search by mapping terms to the documents that contain them. It is designed for high-performance search and supports multiple scoring algorithms as well as phrase queries.
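To make the mapping concrete, here is a minimal, self-contained Python sketch of an inverted index over a toy two-document corpus. It is purely illustrative and not the on-disk Lance layout; it only shows the information the index files described below persist: for each term, which rows contain it, how often, and at which positions.

```python
from collections import defaultdict

# Toy corpus: row id -> text. Purely illustrative, not the Lance on-disk format.
docs = {
    0: "the quick brown fox",
    1: "the lazy brown dog",
}

# Inverted index: term -> posting list of (row_id, term_frequency, positions).
index = defaultdict(list)
for row_id, text in docs.items():
    tokens = text.lower().split()      # stand-in for the real tokenizer pipeline
    positions = defaultdict(list)
    for pos, tok in enumerate(tokens):
        positions[tok].append(pos)
    for tok, pos_list in positions.items():
        index[tok].append((row_id, len(pos_list), pos_list))

print(index["brown"])   # [(0, 1, [2]), (1, 1, [2])] -> "brown" occurs once in each row
```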
Index Details¶
```protobuf
message InvertedIndexDetails {
  // Marking this field as optional as old versions of the index store blank details and we
  // need to make sure we have a proper optional field to detect this.
  optional string base_tokenizer = 1;
  // Language used for stemming and stop-word removal.
  string language = 2;
  // Whether term positions are stored (required for phrase queries).
  bool with_position = 3;
  // Maximum token length; longer tokens are removed.
  optional uint32 max_token_length = 4;
  // Whether tokens are lowercased.
  bool lower_case = 5;
  // Whether language-specific stemming is applied.
  bool stem = 6;
  // Whether common stop words are removed.
  bool remove_stop_words = 7;
  // Whether accented characters are folded to ASCII equivalents.
  bool ascii_folding = 8;
  // N-gram length bounds (ngram tokenizer only).
  uint32 min_ngram_length = 9;
  uint32 max_ngram_length = 10;
  // Whether only prefix n-grams are generated (ngram tokenizer only).
  bool prefix_only = 11;
}
```
Storage Layout¶
The FTS index consists of multiple files storing the token dictionary, document information, and posting lists:
- `tokens.lance` - Token dictionary mapping tokens to token IDs
- `docs.lance` - Document metadata including token counts
- `invert.lance` - Compressed posting lists for each token
- `metadata.lance` - Index metadata and configuration
Token Dictionary File Schema¶
Column | Type | Nullable | Description |
---|---|---|---|
`_token` | Utf8 | false | The token string |
`_token_id` | UInt32 | false | Unique identifier for the token |
Document File Schema¶
Column | Type | Nullable | Description |
---|---|---|---|
`_rowid` | UInt64 | false | Document row ID |
`_num_tokens` | UInt32 | false | Number of tokens in the document |
FTS List File Schema¶
Column | Type | Nullable | Description |
---|---|---|---|
`_posting` | List | false | Compressed posting lists (delta-encoded row IDs and frequencies) |
`_max_score` | Float32 | false | Maximum score for the token (for query optimization) |
`_length` | UInt32 | false | Number of documents containing the token |
`_compressed_position` | List | true | Optional compressed position lists for phrase queries |
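As a rough illustration of how the three data files relate, the sketch below builds toy Arrow tables with the same column names for a two-document corpus. The values and the exact packing of `_posting` (here, delta-encoded row IDs followed by frequencies) are simplified assumptions; the real files store block-compressed binary postings.

```python
import pyarrow as pa

# Toy token dictionary (tokens.lance): token string -> token id.
tokens = pa.table({
    "_token": ["brown", "fox"],
    "_token_id": pa.array([0, 1], type=pa.uint32()),
})

# Toy document table (docs.lance): row id -> number of tokens in that document.
docs = pa.table({
    "_rowid": pa.array([10, 17], type=pa.uint64()),
    "_num_tokens": pa.array([4, 4], type=pa.uint32()),
})

# Toy posting table (invert.lance), one row per token in token-id order.
# "_posting" here packs delta-encoded row ids followed by frequencies:
# "brown" -> rows 10 and 10+7=17, each with frequency 1; "fox" -> row 10 only.
# The real format stores these as compressed binary blocks, not plain lists.
postings = pa.table({
    "_posting": [[10, 7, 1, 1], [10, 1]],
    "_max_score": pa.array([1.2, 0.9], type=pa.float32()),
    "_length": pa.array([2, 1], type=pa.uint32()),
    "_compressed_position": [[2, 2], [3]],   # token offsets, only kept when with_position=true
})

print(tokens.schema, docs.schema, postings.schema, sep="\n\n")
```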
Metadata File Schema¶
The metadata file contains JSON-serialized configuration and partition information:
Key | Type | Description |
---|---|---|
`partitions` | Array | List of partition IDs for distributed index organization |
`params` | JSON Object | Serialized `InvertedIndexParams` with tokenizer config |
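For orientation, a serialized `params` object might look roughly like the following. This is a hypothetical rendering based on the defaults documented in the next section; the exact JSON field names depend on the SDK version that wrote the index.

```python
import json

# Hypothetical example of the JSON-serialized tokenizer configuration stored
# under the "params" key; field names and defaults follow the table below.
params = {
    "base_tokenizer": "simple",
    "language": "English",
    "with_position": False,
    "max_token_length": None,
    "lower_case": True,
    "stem": False,
    "remove_stop_words": False,
    "ascii_folding": True,
}
print(json.dumps(params, indent=2))
```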
InvertedIndexParams Structure¶
Field | Type | Default | Description |
---|---|---|---|
`base_tokenizer` | String | "simple" | Base tokenizer type (see Tokenizers section) |
`language` | String | "English" | Language for stemming and stop words |
`with_position` | Boolean | false | Store term positions for phrase queries (increases index size) |
`max_token_length` | UInt32? | None | Maximum token length (tokens longer than this are removed) |
`lower_case` | Boolean | true | Convert tokens to lowercase |
`stem` | Boolean | false | Apply language-specific stemming |
`remove_stop_words` | Boolean | false | Remove common stop words for the specified language |
`ascii_folding` | Boolean | true | Convert accented characters to ASCII equivalents |
`min_gram` | UInt32 | 2 | Minimum n-gram length (only for ngram tokenizer) |
`max_gram` | UInt32 | 15 | Maximum n-gram length (only for ngram tokenizer) |
`prefix_only` | Boolean | false | Generate only prefix n-grams (only for ngram tokenizer) |
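As a usage sketch, building an inverted index from the Python SDK might look like the following. Treat it as an approximation: `create_scalar_index` with `index_type="INVERTED"` exists in recent pylance releases, but the dataset path is made up and the keyword arguments are assumed to mirror the fields above, which can vary between versions.

```python
import lance

# Assumes a dataset with a Utf8 "text" column already exists at this (made-up) path.
ds = lance.dataset("/tmp/corpus.lance")

# Build an FTS (inverted) index; keyword arguments are assumed to mirror
# InvertedIndexParams and may differ between pylance versions.
ds.create_scalar_index(
    "text",
    index_type="INVERTED",
    with_position=True,        # enable phrase queries (larger index)
    base_tokenizer="simple",
    language="English",
    stem=True,
    remove_stop_words=True,
)
```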
Tokenizers¶
The full text search index supports multiple tokenizer types for different text processing needs:
Base Tokenizers¶
Tokenizer | Description | Use Case |
---|---|---|
simple | Splits on whitespace and punctuation, removes non-alphanumeric characters | General text (default) |
whitespace | Splits only on whitespace characters | Preserve punctuation |
raw | No tokenization, treats entire text as single token | Exact matching |
ngram | Breaks text into overlapping character sequences | Substring/fuzzy search |
jieba/* | Chinese text tokenizer with word segmentation | Chinese text |
lindera/* | Japanese text tokenizer with morphological analysis | Japanese text |
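To clarify the `ngram` tokenizer's parameters, here is a small stand-alone sketch of the `min_gram`/`max_gram`/`prefix_only` semantics. It is a conceptual illustration, not the tokenizer implementation Lance ships.

```python
def ngram_tokenize(text: str, min_gram: int = 2, max_gram: int = 15,
                   prefix_only: bool = False) -> list[str]:
    """Emit character n-grams of length min_gram..max_gram.
    With prefix_only, only n-grams anchored at the start of the text are kept."""
    grams = []
    starts = [0] if prefix_only else range(len(text))
    for start in starts:
        for length in range(min_gram, max_gram + 1):
            if start + length > len(text):
                break
            grams.append(text[start:start + length])
    return grams

print(ngram_tokenize("lance", max_gram=3))                    # ['la', 'lan', 'an', 'anc', 'nc', 'nce', 'ce']
print(ngram_tokenize("lance", max_gram=3, prefix_only=True))  # ['la', 'lan']
```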
Jieba Tokenizer (Chinese)¶
Jieba is a popular Chinese text segmentation library that uses a dictionary-based approach with statistical methods for word segmentation.
- Configuration: Uses a `config.json` file in the model directory
- Models: Must be downloaded and placed in the Lance home directory under `jieba/`
- Usage: Specify as `jieba/<model_name>` or just `jieba` for the default model
- Config Structure:
- Features:
    - Accurate word segmentation for Simplified and Traditional Chinese
    - Support for custom user dictionaries
    - Multiple segmentation modes (precise, full, search engine)
Lindera Tokenizer (Japanese)¶
Lindera is a morphological analysis tokenizer specifically designed for Japanese text. It provides proper word segmentation for Japanese, which doesn't use spaces between words.
- Configuration: Uses a `config.yml` file in the model directory
- Models: Must be downloaded and placed in the Lance home directory under `lindera/`
- Usage: Specify as `lindera/<model_name>` where `<model_name>` is the subdirectory containing the model files
- Features:
    - Morphological analysis with part-of-speech tagging
    - Dictionary-based tokenization
    - Support for custom user dictionaries
Token Filters¶
Token filters are applied in sequence after the base tokenizer:
Filter | Description | Configuration |
---|---|---|
RemoveLong | Removes tokens exceeding `max_token_length` | `max_token_length` |
LowerCase | Converts tokens to lowercase | `lower_case` (default: true) |
Stemmer | Reduces words to their root form | `stem`, `language` |
StopWords | Removes common words like "the", "is", "at" | `remove_stop_words`, `language` |
AsciiFolding | Converts accented characters to ASCII | `ascii_folding` (default: true) |
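The ordering matters because each filter sees the output of the previous one. A simplified pure-Python rendering of the chain (stemming omitted, with an illustrative stop-word set) could look like this:

```python
import unicodedata

def apply_filters(tokens, max_token_length=None, lower_case=True,
                  remove_stop_words=False, stop_words=frozenset({"the", "is", "at"}),
                  ascii_folding=True):
    """Conceptual sketch of the filter chain applied after the base tokenizer.
    Stemming is omitted; a real implementation uses a language-specific stemmer."""
    out = []
    for tok in tokens:
        if max_token_length is not None and len(tok) > max_token_length:
            continue                       # RemoveLong
        if lower_case:
            tok = tok.lower()              # LowerCase
        if remove_stop_words and tok in stop_words:
            continue                       # StopWords
        if ascii_folding:
            # AsciiFolding: strip diacritics, e.g. "café" -> "cafe"
            tok = unicodedata.normalize("NFKD", tok).encode("ascii", "ignore").decode()
        out.append(tok)
    return out

print(apply_filters(["The", "Café", "is", "open"], remove_stop_words=True))
# ['cafe', 'open']
```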
Supported Languages¶
For stemming and stop word removal, the following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish
Accelerated Queries¶
Lance SDKs provide dedicated full text search APIs to leverage the FTS index capabilities. These APIs support complex query types beyond simple token matching, enabling sophisticated text search operations. Here are the query types enabled by the FTS index:
Query Type | Description | Example Usage | Result Type |
---|---|---|---|
`contains_tokens` | Basic token-based search (UDF) with BM25 scoring and automatic result ranking | SQL: `contains_tokens(column, 'search terms')` | AtMost |
`match` | Match query with configurable AND/OR operators and relevance scoring | `{"match": {"query": "text", "operator": "and/or"}}` | AtMost |
`phrase` | Exact phrase matching with position information (requires `with_position: true`) | `{"phrase": {"query": "exact phrase"}}` | AtMost |
`boolean` | Complex boolean queries with must/should/must_not clauses for sophisticated search logic | `{"boolean": {"must": [...], "should": [...]}}` | AtMost |
`multi_match` | Search across multiple fields simultaneously with unified scoring | `{"multi_match": [{"field1": "query"}, ...]}` | AtMost |
`boost` | Boost relevance scores for specific terms or queries by a configurable factor | `{"boost": {"query": {...}, "factor": 2.0}}` | AtMost |
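As a hedged end-to-end example, recent pylance versions expose FTS through a `full_text_query` argument on the scanner/`to_table` APIs; the dataset path below is made up, and the accepted query shapes (plain string vs. structured queries like those in the table) vary by SDK version, so treat this as illustrative rather than definitive.

```python
import lance

# Assumes a dataset at this (made-up) path with an INVERTED index on its "text" column.
ds = lance.dataset("/tmp/corpus.lance")

# Match-style query: rows containing the terms, ranked by BM25 relevance.
# The full_text_query argument is assumed to be available in your pylance version.
hits = ds.to_table(full_text_query="quick brown fox", limit=10)
print(hits)
```

Phrase queries additionally require the index to have been built with `with_position: true`, as noted in the table above.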