Full Text Search Index

The full text search (FTS) index (a.k.a. inverted index) provides efficient text search by mapping terms to the documents that contain them. It is designed for high-performance search, with support for relevance scoring (such as BM25) and phrase queries.
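
As a quick illustration of the inverted-index idea, here is a toy sketch (conceptual only; Lance's on-disk layout is described in the sections below):

```python
from collections import defaultdict

# Toy corpus: row id -> document text.
docs = {
    0: "lance is a columnar data format",
    1: "full text search over lance datasets",
}

# Map each term to {row id -> term frequency}.
postings = defaultdict(lambda: defaultdict(int))
for row_id, text in docs.items():
    for term in text.split():
        postings[term][row_id] += 1

# A query term resolves directly to its posting list.
print(dict(postings["lance"]))  # {0: 1, 1: 1}
```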

Index Details

```protobuf
message InvertedIndexDetails {
  // Marked as optional because old versions of the index store blank details,
  // and a proper optional field is needed to detect this.
  optional string base_tokenizer = 1;
  string language = 2;
  bool with_position = 3;
  optional uint32 max_token_length = 4;
  bool lower_case = 5;
  bool stem = 6;
  bool remove_stop_words = 7;
  bool ascii_folding = 8;
  uint32 min_ngram_length = 9;
  uint32 max_ngram_length = 10;
  bool prefix_only = 11;
}
```
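
These fields correspond to the options accepted when the index is created. A minimal sketch with the Python SDK, assuming a dataset with a string column named text (the keyword names mirror the parameter table below and may vary by SDK version):

```python
import lance

ds = lance.dataset("/path/to/dataset.lance")

# Build an inverted (FTS) index over the "text" column; the options
# mirror the InvertedIndexDetails fields above.
ds.create_scalar_index(
    "text",
    index_type="INVERTED",
    with_position=True,        # store positions to enable phrase queries
    base_tokenizer="simple",
    language="English",
    lower_case=True,
    stem=False,
    remove_stop_words=False,
    ascii_folding=True,
)
```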

Storage Layout

The FTS index consists of multiple files storing the token dictionary, document information, and posting lists:

  1. tokens.lance - Token dictionary mapping tokens to token IDs
  2. docs.lance - Document metadata including token counts
  3. invert.lance - Compressed posting lists for each token
  4. metadata.lance - Index metadata and configuration
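
Conceptually, for a tiny corpus the four files carry information like the following (an illustrative sketch; the real files are Lance-format tables with the schemas below):

```python
# tokens.lance: token -> token id
tokens = {"lance": 0, "search": 1}

# docs.lance: row id -> number of tokens in the document
docs = {100: 5, 101: 7}

# invert.lance: token id -> posting list of (row id, term frequency)
postings = {
    0: [(100, 2), (101, 1)],
    1: [(101, 3)],
}

# metadata.lance: index configuration (see Metadata File Schema)
metadata = {"partitions": [0], "params": {"base_tokenizer": "simple"}}
```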

Token Dictionary File Schema

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| _token | Utf8 | false | The token string |
| _token_id | UInt32 | false | Unique identifier for the token |

Document File Schema

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| _rowid | UInt64 | false | Document row ID |
| _num_tokens | UInt32 | false | Number of tokens in the document |

FTS List File Schema

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| _posting | List | false | Compressed posting lists (delta-encoded row IDs and frequencies) |
| _max_score | Float32 | false | Maximum score for the token (for query optimization) |
| _length | UInt32 | false | Number of documents containing the token |
| _compressed_position | List | true | Optional compressed position lists for phrase queries |
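
The row IDs in a posting list are sorted, so they can be stored as deltas, which compress well. A sketch of the idea (illustrative only; the actual compression scheme is internal to Lance):

```python
def delta_encode(row_ids):
    """Replace each sorted row id with its gap from the previous one."""
    out, prev = [], 0
    for rid in row_ids:
        out.append(rid - prev)
        prev = rid
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating the gaps."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ids = [3, 7, 8, 15]
assert delta_encode(ids) == [3, 4, 1, 7]   # small gaps compress well
assert delta_decode(delta_encode(ids)) == ids
```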

Metadata File Schema

The metadata file contains JSON-serialized configuration and partition information:

| Key | Type | Description |
|-----|------|-------------|
| partitions | Array | List of partition IDs for distributed index organization |
| params | JSON Object | Serialized InvertedIndexParams with tokenizer config |

InvertedIndexParams Structure

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| base_tokenizer | String | "simple" | Base tokenizer type (see Tokenizers section) |
| language | String | "English" | Language for stemming and stop words |
| with_position | Boolean | false | Store term positions for phrase queries (increases index size) |
| max_token_length | UInt32? | None | Maximum token length (tokens longer than this are removed) |
| lower_case | Boolean | true | Convert tokens to lowercase |
| stem | Boolean | false | Apply language-specific stemming |
| remove_stop_words | Boolean | false | Remove common stop words for the specified language |
| ascii_folding | Boolean | true | Convert accented characters to ASCII equivalents |
| min_gram | UInt32 | 2 | Minimum n-gram length (only for ngram tokenizer) |
| max_gram | UInt32 | 15 | Maximum n-gram length (only for ngram tokenizer) |
| prefix_only | Boolean | false | Generate only prefix n-grams (only for ngram tokenizer) |
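
To make the n-gram parameters concrete, here is an illustrative expansion (a sketch of the semantics, not Lance's tokenizer code):

```python
def ngrams(token, min_gram=2, max_gram=15, prefix_only=False):
    """Expand a token into character n-grams per the parameters above."""
    starts = [0] if prefix_only else range(len(token))
    out = []
    for i in starts:
        for n in range(min_gram, max_gram + 1):
            if i + n <= len(token):
                out.append(token[i:i + n])
    return out

print(ngrams("lance", 2, 3))
# ['la', 'lan', 'an', 'anc', 'nc', 'nce', 'ce']
print(ngrams("lance", 2, 3, prefix_only=True))
# ['la', 'lan']
```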

Tokenizers

The full text search index supports multiple tokenizer types for different text processing needs:

Base Tokenizers

| Tokenizer | Description | Use Case |
|-----------|-------------|----------|
| simple | Splits on whitespace and punctuation, removes non-alphanumeric characters | General text (default) |
| whitespace | Splits only on whitespace characters | Preserve punctuation |
| raw | No tokenization; treats the entire text as a single token | Exact matching |
| ngram | Breaks text into overlapping character sequences | Substring/fuzzy search |
| jieba/* | Chinese text tokenizer with word segmentation | Chinese text |
| lindera/* | Japanese text tokenizer with morphological analysis | Japanese text |
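
For example, the default simple tokenizer behaves roughly like this (an approximation for illustration):

```python
import re

def simple_tokenize(text):
    """Approximation of the 'simple' tokenizer: split on runs of
    non-alphanumeric characters and drop empty pieces."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(simple_tokenize("Full-text search, fast!"))
# ['Full', 'text', 'search', 'fast']
# The 'whitespace' tokenizer would instead yield
# ['Full-text', 'search,', 'fast!'], and 'raw' keeps the whole string.
```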

Jieba Tokenizer (Chinese)

Jieba is a popular Chinese text segmentation library that uses a dictionary-based approach with statistical methods for word segmentation.

  • Configuration: Uses a config.json file in the model directory
  • Models: Must be downloaded and placed in the Lance home directory under jieba/
  • Usage: Specify as jieba/<model_name> or just jieba for the default model
  • Config Structure:
    {
      "main": "path/to/main/dictionary",
      "users": ["path/to/user/dict1", "path/to/user/dict2"]
    }
    
  • Features:
    • Accurate word segmentation for Simplified and Traditional Chinese
    • Support for custom user dictionaries
    • Multiple segmentation modes (precise, full, search engine)
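
Continuing the index-creation sketch above, selecting jieba only changes the tokenizer name (assuming a model has been downloaded into the Lance home directory):

```python
# Assumes a jieba model directory exists under the Lance home dir (jieba/).
ds.create_scalar_index(
    "text",
    index_type="INVERTED",
    base_tokenizer="jieba",   # or "jieba/<model_name>" for a specific model
)
```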

Lindera Tokenizer (Japanese)

Lindera is a morphological analysis tokenizer specifically designed for Japanese text. It provides proper word segmentation for Japanese, which doesn't use spaces between words.

  • Configuration: Uses a config.yml file in the model directory
  • Models: Must be downloaded and placed in the Lance home directory under lindera/
  • Usage: Specify as lindera/<model_name> where <model_name> is the subdirectory containing the model files
  • Features:
    • Morphological analysis with part-of-speech tagging
    • Dictionary-based tokenization
    • Support for custom user dictionaries
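
Likewise for lindera; the model name below is hypothetical, so substitute whichever subdirectory holds your downloaded model files:

```python
# Assumes a model directory under the Lance home dir at lindera/<model_name>.
ds.create_scalar_index(
    "text",
    index_type="INVERTED",
    base_tokenizer="lindera/ipadic",  # hypothetical model name
)
```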

Token Filters

Token filters are applied in sequence after the base tokenizer:

| Filter | Description | Configuration |
|--------|-------------|---------------|
| RemoveLong | Removes tokens exceeding max_token_length | max_token_length |
| LowerCase | Converts tokens to lowercase | lower_case (default: true) |
| Stemmer | Reduces words to their root form | stem, language |
| StopWords | Removes common words like "the", "is", "at" | remove_stop_words, language |
| AsciiFolding | Converts accented characters to ASCII | ascii_folding (default: true) |
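
A sketch of how the chain composes (illustrative only; stemming and ASCII folding are stubbed out for brevity):

```python
def apply_filters(tokens, max_token_length=40, lower_case=True,
                  remove_stop_words=True, stop_words=("the", "is", "at")):
    """Apply the filters in the order listed above."""
    out = []
    for t in tokens:
        if max_token_length is not None and len(t) > max_token_length:
            continue                      # RemoveLong
        if lower_case:
            t = t.lower()                 # LowerCase
        # Stemmer and AsciiFolding would run here in the full chain.
        if remove_stop_words and t in stop_words:
            continue                      # StopWords
        out.append(t)
    return out

print(apply_filters(["The", "Quick", "Fox"]))
# ['quick', 'fox']
```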

Supported Languages

For stemming and stop word removal, the following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish

Accelerated Queries

Lance SDKs provide dedicated full text search APIs that leverage the FTS index. Beyond simple token matching, the index enables the following query types:

| Query Type | Description | Example Usage | Result Type |
|------------|-------------|---------------|-------------|
| contains_tokens | Basic token-based search (UDF) with BM25 scoring and automatic result ranking | SQL: contains_tokens(column, 'search terms') | AtMost |
| match | Match query with configurable AND/OR operators and relevance scoring | {"match": {"query": "text", "operator": "and/or"}} | AtMost |
| phrase | Exact phrase matching with position information (requires with_position: true) | {"phrase": {"query": "exact phrase"}} | AtMost |
| boolean | Complex boolean queries with must/should/must_not clauses for sophisticated search logic | {"boolean": {"must": [...], "should": [...]}} | AtMost |
| multi_match | Search across multiple fields simultaneously with unified scoring | {"multi_match": [{"field1": "query"}, ...]} | AtMost |
| boost | Boost relevance scores for specific terms or queries by a configurable factor | {"boost": {"query": {...}, "factor": 2.0}} | AtMost |
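
A hedged sketch of issuing these queries from the Python SDK; the string form performs a BM25-ranked token search, while structured queries follow the JSON shapes in the table (exact plumbing varies across SDK versions):

```python
import lance

ds = lance.dataset("/path/to/dataset.lance")

# Basic token search against the inverted index, ranked by BM25.
results = ds.to_table(full_text_query="search terms")
print(results)

# Structured queries (match / phrase / boolean / multi_match / boost)
# use the JSON shapes shown in the table above; note that phrase
# queries require an index built with with_position=True.
```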