Filtering

Pre and post-filtering

LanceDB supports filtering of query results based on metadata fields. By default, post-filtering is performed on the top-k results returned by the vector search. However, pre-filtering is also an option that performs the filter prior to vector search. This can be useful to narrow down the search space of a very large dataset to reduce query latency.

Note that both pre-filtering and post-filtering can yield false positives. For pre-filtering, if the filter is too selective, it might eliminate relevant items that the vector search would have otherwise identified as a good match. In this case, increasing nprobes parameter will help reduce such false positives. It is recommended to call bypass_vector_index() if you know that the filter is highly selective.

Similarly, a highly selective post-filter can lead to false positives. Increasing both nprobes and refine_factor can mitigate this issue. When deciding between pre-filtering and post-filtering, pre-filtering is generally the safer choice if you're uncertain.

PythonTypeScript

# Synchronous client
result = tbl.search([0.5, 0.2]).where("id = 10", prefilter=True).limit(1).to_arrow()
# Asynchronous client
result = await async_tbl.query().where("id = 10").nearest_to([0.5, 0.2]).limit(1).to_arrow()

@lancedb/lancedbvectordb (deprecated)

const _result = await tbl
  .search(Array(1536).fill(0.5))
  .limit(1)
  .where("id = 10")
  .toArray();

let result = await tbl
  .search(Array(1536).fill(0.5))
  .limit(1)
  .filter("id = 10")
  .prefilter(true)
  .execute();

Note

Creating a scalar index accelerates filtering.

SQL filters

Because it's built on top of DataFusion, LanceDB embraces the utilization of standard SQL expressions as predicates for filtering operations. SQL can be used during vector search, update, and deletion operations.

LanceDB supports a growing list of SQL expressions:

>, >=, <, <=, =
AND, OR, NOT
IS NULL, IS NOT NULL
IS TRUE, IS NOT TRUE, IS FALSE, IS NOT FALSE
IN
LIKE, NOT LIKE
CAST
regexp_match(column, pattern)
DataFusion Functions

For example, the following filter string is acceptable:

PythonTypeScript

# Synchronous client
tbl.search([100, 102]).where(
    "(item IN ('item 0', 'item 2')) AND (id > 10)"
).to_arrow()
# Asynchronous client
await (
    async_tbl.query()
    .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
    .nearest_to([100, 102])
    .to_arrow()
)

@lancedb/lancedbvectordb (deprecated)

const result = await (
  tbl.search(Array(1536).fill(0)) as lancedb.VectorQuery
)
  .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
  .postfilter()
  .toArray();

await tbl
  .search(Array(1536).fill(0))
  .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
  .execute();

If your column name contains special characters, upper-case characters, or is a SQL Keyword, you can use backtick (`) to escape it. For nested fields, each segment of the path must be wrapped in backticks.

SQL

`CUBE` = 10 AND `UpperCaseName` = '3' AND `column name with space` IS NOT NULL
  AND `nested with space`.`inner with space` < 2

Field names containing periods (.) are not supported.

Literals for dates, timestamps, and decimals can be written by writing the string value after the type name. For example:

SQL

date_col = date '2021-01-01'
and timestamp_col = timestamp '2021-01-01 00:00:00'
and decimal_col = decimal(8,3) '1.000'

For timestamp columns, the precision can be specified as a number in the type parameter. Microsecond precision (6) is the default.

SQL	Time unit
`timestamp(0)`	Seconds
`timestamp(3)`	Milliseconds
`timestamp(6)`	Microseconds
`timestamp(9)`	Nanoseconds

LanceDB internally stores data in Apache Arrow format. The mapping from SQL types to Arrow types is:

SQL type	Arrow type
`boolean`	`Boolean`
`tinyint` / `tinyint unsigned`	`Int8` / `UInt8`
`smallint` / `smallint unsigned`	`Int16` / `UInt16`
`int` or `integer` / `int unsigned` or `integer unsigned`	`Int32` / `UInt32`
`bigint` / `bigint unsigned`	`Int64` / `UInt64`
`float`	`Float32`
`double`	`Float64`
`decimal(precision, scale)`	`Decimal128`
`date`	`Date32`
`timestamp`	`Timestamp` ¹
`string`	`Utf8`
`binary`	`Binary`

Filtering without Vector Search

You can also filter your data without search:

PythonTypeScript

# Synchronous client
tbl.search().where("id = 10").limit(10).to_arrow()
# Asynchronous client
await async_tbl.query().where("id = 10").limit(10).to_arrow()

@lancedb/lancedbvectordb (deprecated)

await tbl.query().where("id = 10").limit(10).toArray();

await tbl.filter("id = 10").limit(10).execute();

If your table is large, this could potentially return a very large amount of data. Please be sure to use a limit clause unless you're sure you want to return the whole result set.

See precision mapping in previous table. ↩