Full-text search (Native FTS)
LanceDB provides support for full-text search via Lance, allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.
Note
The Python SDK uses tantivy-based FTS by default; pass use_tantivy=False when creating the index to use native FTS.
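For intuition about the BM25 ranking mentioned above, here is a minimal pure-Python sketch of the textbook BM25 formula. It is not Lance's implementation: the whitespace tokenizer, the function names, and the constants k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplification: lowercase and split on whitespace only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized if term in t)
            if df == 0 or tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents score lower.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["Frodo was a happy puppy", "There are several kittens playing"]
scores = bm25_scores("puppy", docs)
# scores[0] is roughly 0.693 (ln 2); scores[1] is 0.0
```

Under these assumed defaults, the sketch yields ln 2 ≈ 0.693 for the "puppy" row, which agrees with the _score shown in the example below.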
Example
Consider a LanceDB table named my_table whose string column text we want to index and query via keyword search. The FTS index must be created before you can search by keyword.
import lancedb
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table(
    "my_table",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)
# passing `use_tantivy=False` to use lance FTS index
# `use_tantivy=True` by default
table.create_fts_index("text", use_tantivy=False)
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
import * as lancedb from "@lancedb/lancedb";
const uri = "data/sample-lancedb"
const db = await lancedb.connect(uri);
const data = [
  { vector: [3.1, 4.1], text: "Frodo was a happy puppy" },
  { vector: [5.9, 26.5], text: "There are several kittens playing" },
];
const tbl = await db.createTable("my_table", data, { mode: "overwrite" });
await tbl.createIndex("text", {
  config: lancedb.Index.fts(),
});
// The query type is passed as the second positional argument.
await tbl
  .search("puppy", "fts")
  .select(["text"])
  .limit(10)
  .toArray();
let uri = "data/sample-lancedb";
let db = connect(uri).execute().await?;
let initial_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
let tbl = db
    .create_table("my_table", initial_data)
    .execute()
    .await?;
tbl
    .create_index(&["text"], Index::FTS(FtsIndexBuilder::default()))
    .execute()
    .await?;
tbl
    .query()
    .full_text_search(FullTextSearchQuery::new("puppy".to_owned()))
    .select(lancedb::query::Select::Columns(vec!["text".to_owned()]))
    .limit(10)
    .execute()
    .await?;
By default, the search runs over all indexed columns, which is useful when multiple columns are indexed.
Pass fts_columns="text" if you want to restrict the search to specific columns.
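As a minimal sketch of restricting the search to one column, the helper below wraps the call. The helper name is hypothetical, `table` is assumed to be a LanceDB table like the one created in the example above, and passing `fts_columns` as a keyword argument of `search` alongside `query_type="fts"` is an assumption about the Python SDK.

```python
def search_text_column(table, query, limit=10):
    """Run a keyword search restricted to the `text` column.

    `table` is assumed to be a LanceDB table with an FTS index on `text`.
    """
    return (
        table.search(query, query_type="fts", fts_columns="text")
        .limit(limit)
        .to_list()
    )
```

With the sample table, `search_text_column(table, "puppy")` would search only the text column, even if other columns were indexed as well.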
Note
LanceDB automatically searches the existing FTS index if the input to the search is of type str. If you provide a vector as input, LanceDB searches the ANN index instead.
Tokenization
By default, the text is tokenized by splitting on punctuation and whitespace; tokens longer than 40 characters are filtered out, and all tokens are lowercased.
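The default behavior described above can be approximated in a few lines of plain Python. This is an illustrative sketch, not the tokenizer Lance ships: the function name is hypothetical, and treating every non-alphanumeric character as a separator is an assumption.

```python
import re

MAX_TOKEN_LENGTH = 40  # tokens longer than this are dropped


def simple_tokenize(text, max_len=MAX_TOKEN_LENGTH):
    """Split on punctuation/whitespace, drop over-long tokens, lowercase."""
    # Runs of letters and digits are tokens; everything else is a separator.
    raw = re.findall(r"[^\W_]+", text)
    return [tok.lower() for tok in raw if len(tok) <= max_len]


tokens = simple_tokenize("Frodo was a HAPPY puppy, " + "x" * 41 + "!")
# ['frodo', 'was', 'a', 'happy', 'puppy']
```

The 41-character run of "x" is discarded because it exceeds the length limit, and "HAPPY" is indexed as "happy", so queries match regardless of case.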
Stemming improves search results by reducing words to their root form, e.g. "running" to "run". LanceDB supports stemming for multiple languages; to enable it, specify the tokenizer name using the pattern tokenizer_name="{language_code}_stem", e.g. en_stem for English.
For example, to enable stemming for English:
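A minimal sketch of that call, combining the use_tantivy=False flag from the example above with the en_stem tokenizer name. The helper name is hypothetical, and `table` is assumed to be a LanceDB table with a string column to index.

```python
def create_english_stemmed_index(table, column="text"):
    """Create a native FTS index on `column` with English stemming.

    `use_tantivy=False` selects the native (Lance) index, as in the
    example above; `tokenizer_name="en_stem"` follows the
    "{language_code}_stem" pattern.
    """
    table.create_fts_index(column, use_tantivy=False, tokenizer_name="en_stem")
```

With such an index, a query for "running" would be expected to also match rows containing "run", since both reduce to the same stem.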
The following languages are currently supported.
The tokenizer is customizable: you can specify how it splits the text, how it filters out words, and so on.
For example, for languages with accents, you can configure the tokenizer to use ascii_folding to remove accents, e.g. 'é' to 'e':
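To show what folding does to the indexed text, here is a small standard-library sketch. It only approximates the effect using Unicode decomposition and is not the filter Lance applies; the function name is hypothetical. In the Python SDK, the option is assumed to be requested through an ascii_folding=True argument when creating the index.

```python
import unicodedata


def ascii_fold(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = ascii_fold("café crème")
# 'cafe creme'
```

A full ASCII-folding filter typically also maps characters that have no Unicode decomposition (for example 'ø'), which this sketch leaves unchanged.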
Filtering
LanceDB full-text search supports filtering the search results by a condition; both pre-filtering and post-filtering are supported.
This can be invoked via the familiar where syntax.
With pre-filtering:
With post-filtering:
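The practical difference between the two modes is when the condition is applied relative to the result limit. The plain-Python sketch below is illustrative only, not LanceDB's implementation; the function, the `year` field and the scoring rule are hypothetical. In the Python SDK, the choice is assumed to correspond to the prefilter argument of where.

```python
def filtered_search(rows, score, predicate, limit, prefilter):
    """Rank rows by score and apply `predicate` before or after the limit."""
    ranked = sorted(rows, key=score, reverse=True)
    if prefilter:
        # Pre-filtering: discard non-matching rows first, then take the top `limit`.
        return [r for r in ranked if predicate(r)][:limit]
    # Post-filtering: take the top `limit` rows first, then discard non-matching ones.
    return [r for r in ranked[:limit] if predicate(r)]


rows = [
    {"text": "puppy puppy puppy", "year": 2019},
    {"text": "puppy puppy", "year": 2023},
    {"text": "puppy", "year": 2024},
]


def score(row):
    # Toy relevance: number of occurrences of "puppy".
    return row["text"].split().count("puppy")


def recent(row):
    return row["year"] >= 2020


pre = filtered_search(rows, score, recent, limit=2, prefilter=True)
post = filtered_search(rows, score, recent, limit=2, prefilter=False)
# pre contains the 2023 and 2024 rows; post contains only the 2023 row
```

Post-filtering can therefore return fewer rows than the limit, because the top-ranked 2019 row takes one of the two slots before it is filtered out.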
Phrase queries vs. terms queries
Warn
Lance-based FTS doesn't support queries using the boolean operators OR and AND.
For full-text search you can specify either a phrase query like "the old man and the sea", or a terms search query like old man sea. For more details on the terms query syntax, see Tantivy's query parser rules.
To search for a phrase, the index must be created with with_position=True:
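To see why phrase queries need positions in the index, consider the toy inverted index below. It is a conceptual sketch in plain Python, not Lance's index format; all names are hypothetical, and a terms query is assumed to match any row containing at least one of the terms.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def terms_query(index, query):
    """Documents containing at least one query term; positions are ignored."""
    hits = set()
    for term in query.lower().split():
        hits |= set(index.get(term, {}))
    return sorted(hits)


def phrase_query(index, phrase):
    """Documents containing the terms consecutively and in order."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # Every following term must sit exactly one position further on.
            if all(
                start + i in index.get(t, {}).get(doc_id, [])
                for i, t in enumerate(terms)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = ["the old man and the sea", "the sea was old before the man"]
index = build_index(docs)
term_hits = terms_query(index, "old man sea")   # [0, 1]
phrase_hits = phrase_query(index, "old man")    # [0]
```

Both documents contain the individual terms, but only the first has "old" immediately followed by "man". Without stored positions, the index cannot tell the two apart.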
Incremental indexing
LanceDB supports incremental indexing, which means you can add new records to the table without reindexing the entire table.
This can make queries more efficient, especially when the table is large and the number of new records is relatively small.
Note
New data added after creating the FTS index will appear in search results while incremental indexing is still in progress, but with increased latency due to a flat search on the unindexed portion. LanceDB Cloud automates this merging process, minimizing the impact on search speed.
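The behavior in this note can be modeled with a toy class: rows already merged into the index are found by lookup, while rows added since the last merge are found by scanning them directly. This is a conceptual sketch only, not how Lance stores or merges its index, and every name in it is hypothetical.

```python
class IncrementalIndex:
    """Toy keyword index with an unindexed tail of recently added rows."""

    def __init__(self):
        self.rows = []
        self.index = {}        # term -> set of row ids
        self.indexed_upto = 0  # rows[:indexed_upto] are covered by the index

    def add(self, text):
        # New rows are stored immediately but not indexed yet.
        self.rows.append(text)

    def merge(self):
        # Index only the rows added since the last merge.
        for row_id in range(self.indexed_upto, len(self.rows)):
            for term in self.rows[row_id].lower().split():
                self.index.setdefault(term, set()).add(row_id)
        self.indexed_upto = len(self.rows)

    def search(self, term):
        term = term.lower()
        # Fast path: lookup in the index.
        hits = set(self.index.get(term, set()))
        # Slow path: flat scan over the unindexed tail.
        for row_id in range(self.indexed_upto, len(self.rows)):
            if term in self.rows[row_id].lower().split():
                hits.add(row_id)
        return sorted(hits)


idx = IncrementalIndex()
idx.add("Frodo was a happy puppy")
idx.add("There are several kittens playing")
idx.merge()
idx.add("Another puppy arrived")
before_merge = idx.search("puppy")  # [0, 2], row 2 found by the flat scan
idx.merge()
after_merge = idx.search("puppy")   # [0, 2], row 2 now found via the index
```

The results are the same before and after the merge; only the cost differs, since the flat scan grows with the number of unindexed rows.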