Python API Reference

This section contains the reference for the LanceDB Python API, which provides both a synchronous and an asynchronous client.

The general flow of using the API is:

  1. Use lancedb.connect or lancedb.connect_async to connect to a database.
  2. Use the returned lancedb.DBConnection or lancedb.AsyncConnection to create or open tables.
  3. Use the returned lancedb.table.Table or lancedb.AsyncTable to query or modify tables.
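
The three steps above can be sketched as a single function. This is a minimal sketch, assuming the lancedb package is installed and that uri points at a writable local directory; the function name quickstart and the sample rows are illustrative, not part of the library.

```python
from typing import Any, Dict, List


def quickstart(uri: str) -> List[Dict[str, Any]]:
    """Connect, create a table, and run one nearest-neighbor query."""
    import lancedb  # imported lazily so the sketch loads without lancedb

    # Step 1: connect to a database (a local directory in this sketch).
    db = lancedb.connect(uri)

    # Step 2: use the connection to create a table.
    table = db.create_table(
        "my_table",
        data=[
            {"vector": [1.1, 1.2], "b": 2},
            {"vector": [0.5, 1.3], "b": 4},
        ],
    )

    # Step 3: use the table to query for the single closest row.
    return table.search([0.4, 0.4]).limit(1).to_list()
```

Calling quickstart on a fresh temporary directory should return one row as a dictionary, including a _distance field.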

Installation

pip install lancedb

The following methods describe the synchronous API client. There is also an asynchronous API client.

Connections (Synchronous)

lancedb.connect

connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, client_config: Union[ClientConfig, Dict[str, Any], None] = None, storage_options: Optional[Dict[str, str]] = None, **kwargs: Any) -> DBConnection

Connect to a LanceDB database.

Parameters:

  • uri (URI) –

    The uri of the database.

  • api_key (Optional[str], default: None ) –

    If provided, connect to LanceDB Cloud. Otherwise, connect to a database on the file system or in cloud storage. Can also be set via the environment variable LANCEDB_API_KEY.

  • region (str, default: 'us-east-1' ) –

    The region to use for LanceDB Cloud.

  • host_override (Optional[str], default: None ) –

    The override url for LanceDB Cloud.

  • read_consistency_interval (Optional[timedelta], default: None ) –

    (For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None (the default, chosen for performance), consistency is not checked. For strong consistency, set this to zero seconds, so that every read checks for updates from other processes. As a compromise, set this to a non-zero timedelta for eventual consistency: the table is checked for updates once more than that interval has passed since the last check. Note: this setting only applies to read operations. Write operations are always consistent.

  • client_config (Union[ClientConfig, Dict[str, Any], None], default: None ) –

    Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/
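
The three read_consistency_interval modes described above can be modeled with a small predicate. This is a sketch of the documented rule only, not LanceDB's implementation; the function name should_check_for_updates and its arguments are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_check_for_updates(
    interval: Optional[timedelta],
    last_check: datetime,
    now: datetime,
) -> bool:
    """Model of the documented read-consistency rule (illustrative only)."""
    if interval is None:
        # Default: consistency is never checked.
        return False
    if interval == timedelta(0):
        # Strong consistency: every read checks for updates.
        return True
    # Eventual consistency: check once more than `interval` has passed.
    return now - last_check > interval
```

In practice the chosen value is passed straight to connect, for example read_consistency_interval=timedelta(seconds=5) for eventual consistency or timedelta(0) for strong consistency.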

Examples:

For a local directory, provide a path for the database:

>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")

For object storage, use a URI prefix:

>>> db = lancedb.connect("s3://my-bucket/lancedb",
...                      storage_options={"aws_access_key_id": "***"})

Connect to LanceDB Cloud:

>>> db = lancedb.connect("db://my_database", api_key="ldb_...",
...                      client_config={"retry_config": {"retries": 5}})

Returns:

  • conn ( DBConnection ) –

    A connection to a LanceDB database.

Source code in lancedb/__init__.py
def connect(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
    client_config: Union[ClientConfig, Dict[str, Any], None] = None,
    storage_options: Optional[Dict[str, str]] = None,
    **kwargs: Any,
) -> DBConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The uri of the database.
    api_key: str, optional
        If presented, connect to LanceDB cloud.
        Otherwise, connect to a database on file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override url for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    client_config: ClientConfig or dict, optional
        Configuration options for the LanceDB Cloud HTTP client. If a dict, then
        the keys are the attributes of the ClientConfig class. If None, then the
        default configuration is used.
    storage_options: dict, optional
        Additional options for the storage backend. See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Examples
    --------

    For a local directory, provide a path for the database:

    >>> import lancedb
    >>> db = lancedb.connect("~/.lancedb")

    For object storage, use a URI prefix:

    >>> db = lancedb.connect("s3://my-bucket/lancedb",
    ...                      storage_options={"aws_access_key_id": "***"})

    Connect to LanceDB cloud:

    >>> db = lancedb.connect("db://my_database", api_key="ldb_...",
    ...                      client_config={"retry_config": {"retries": 5}})

    Returns
    -------
    conn : DBConnection
        A connection to a LanceDB database.
    """
    if isinstance(uri, str) and uri.startswith("db://"):
        if api_key is None:
            api_key = os.environ.get("LANCEDB_API_KEY")
        if api_key is None:
            raise ValueError(f"api_key is required to connected LanceDB cloud: {uri}")
        if isinstance(request_thread_pool, int):
            request_thread_pool = ThreadPoolExecutor(request_thread_pool)
        return RemoteDBConnection(
            uri,
            api_key,
            region,
            host_override,
            # TODO: remove this (deprecation warning downstream)
            request_thread_pool=request_thread_pool,
            client_config=client_config,
            storage_options=storage_options,
            **kwargs,
        )

    if kwargs:
        raise ValueError(f"Unknown keyword arguments: {kwargs}")
    return LanceDBConnection(
        uri,
        read_consistency_interval=read_consistency_interval,
        storage_options=storage_options,
    )
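
The branch order in the source above can be exercised in isolation. The sketch below mirrors only the decision logic: a db:// URI selects the remote connection and requires an API key, taken from the argument or else from LANCEDB_API_KEY. The function name resolve_backend and the explicit env mapping (used here in place of os.environ so the logic is easy to test) are assumptions for illustration.

```python
from typing import Mapping, Optional


def resolve_backend(
    uri: str,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "remote" or "local", following the branch order in connect()."""
    env = env if env is not None else {}
    if uri.startswith("db://"):
        # Fall back to the environment variable when no key is passed.
        key = api_key if api_key is not None else env.get("LANCEDB_API_KEY")
        if key is None:
            raise ValueError(f"api_key is required to connect to LanceDB Cloud: {uri}")
        return "remote"
    # Any other URI (local path or object storage) uses the local connection.
    return "local"
```

Note that in the source above the URI prefix, not the presence of api_key, decides which connection class is returned.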

lancedb.db.DBConnection

Bases: EnforceOverrides

An active LanceDB connection interface.

Source code in lancedb/db.py
class DBConnection(EnforceOverrides):
    """An active LanceDB connection interface."""

    @abstractmethod
    def table_names(
        self, page_token: Optional[str] = None, limit: int = 10
    ) -> Iterable[str]:
        """List all tables in this database, in sorted order

        Parameters
        ----------
        page_token: str, optional
            The token to use for pagination. If not present, start from the beginning.
            Typically, this token is last table name from the previous page.
            Only supported by LanceDb Cloud.
        limit: int, default 10
            The size of the page to return.
            Only supported by LanceDb Cloud.

        Returns
        -------
        Iterable of str
        """
        pass

    @abstractmethod
    def create_table(
        self,
        name: str,
        data: Optional[DATA] = None,
        schema: Optional[Union[pa.Schema, LanceModel]] = None,
        mode: str = "create",
        exist_ok: bool = False,
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
        embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
        *,
        storage_options: Optional[Dict[str, str]] = None,
        data_storage_version: Optional[str] = None,
        enable_v2_manifest_paths: Optional[bool] = None,
    ) -> Table:
        """Create a [Table][lancedb.table.Table] in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: The data to initialize the table, *optional*
            User must provide at least one of `data` or `schema`.
            Acceptable types are:

            - list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        schema: The schema of the table, *optional*
            Acceptable types are:

            - pyarrow.Schema

            - [LanceModel][lancedb.pydantic.LanceModel]
        mode: str; default "create"
            The mode to use when creating the table.
            Can be either "create" or "overwrite".
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".
        exist_ok: bool, default False
            If a table by the same name already exists, then raise an exception
            if exist_ok=False. If exist_ok=True, then open the existing table;
            it will not add the provided data but will validate against any
            schema that's specified.
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            <https://lancedb.github.io/lancedb/guides/storage/>
        data_storage_version: optional, str, default "stable"
            Deprecated.  Set `storage_options` when connecting to the database and set
            `new_table_data_storage_version` in the options.
        enable_v2_manifest_paths: optional, bool, default False
            Deprecated.  Set `storage_options` when connecting to the database and set
            `new_table_enable_v2_manifest_paths` in the options.
        Returns
        -------
        LanceTable
            A reference to the newly created table.

        !!! note

            The vector index won't be created by default.
            To create the index, call the `create_index` method on the table.

        Examples
        --------

        Can create with list of tuples or dictionaries:

        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        >>> db.create_table("my_table", data)
        LanceTable(name='my_table', version=1, ...)
        >>> db["my_table"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> db.create_table("table2", data)
        LanceTable(name='table2', version=1, ...)
        >>> db["table2"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

        >>> import pyarrow as pa
        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> db.create_table("table3", data, schema = custom_schema)
        LanceTable(name='table3', version=1, ...)
        >>> db["table3"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]


        It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


        >>> import pyarrow as pa
        >>> def make_batches():
        ...     for i in range(5):
        ...         yield pa.RecordBatch.from_arrays(
        ...             [
        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
        ...                     pa.list_(pa.float32(), 2)),
        ...                 pa.array(["foo", "bar"]),
        ...                 pa.array([10.0, 20.0]),
        ...             ],
        ...             ["vector", "item", "price"],
        ...         )
        >>> schema=pa.schema([
        ...     pa.field("vector", pa.list_(pa.float32(), 2)),
        ...     pa.field("item", pa.utf8()),
        ...     pa.field("price", pa.float32()),
        ... ])
        >>> db.create_table("table4", make_batches(), schema=schema)
        LanceTable(name='table4', version=1, ...)

        """
        raise NotImplementedError

    def __getitem__(self, name: str) -> LanceTable:
        return self.open_table(name)

    def open_table(
        self,
        name: str,
        *,
        storage_options: Optional[Dict[str, str]] = None,
        index_cache_size: Optional[int] = None,
    ) -> Table:
        """Open a Lance Table in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        index_cache_size: int, default 256
            Set the size of the index cache, specified as a number of entries

            The exact meaning of an "entry" will depend on the type of index:
            * IVF - there is one entry for each IVF partition
            * BTREE - there is one entry for the entire index

            This cache applies to the entire opened table, across all indices.
            Setting this value higher will increase performance on larger datasets
            at the expense of more RAM
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            <https://lancedb.github.io/lancedb/guides/storage/>

        Returns
        -------
        A LanceTable object representing the table.
        """
        raise NotImplementedError

    def drop_table(self, name: str):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        """
        raise NotImplementedError

    def rename_table(self, cur_name: str, new_name: str):
        """Rename a table in the database.

        Parameters
        ----------
        cur_name: str
            The current name of the table.
        new_name: str
            The new name of the table.
        """
        raise NotImplementedError

    def drop_database(self):
        """
        Drop database
        This is the same thing as dropping all the tables
        """
        raise NotImplementedError

    def drop_all_tables(self):
        """
        Drop all tables from the database
        """
        raise NotImplementedError

    @property
    def uri(self) -> str:
        return self._uri

table_names abstractmethod

table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str]

List all tables in this database, in sorted order.

Parameters:

  • page_token (Optional[str], default: None ) –

    The token to use for pagination. If not present, start from the beginning. Typically, this token is the last table name from the previous page. Only supported by LanceDB Cloud.

  • limit (int, default: 10 ) –

    The size of the page to return. Only supported by LanceDB Cloud.

Returns:

  • Iterable of str –

    The table names, in sorted order.

Source code in lancedb/db.py
@abstractmethod
def table_names(
    self, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    page_token: str, optional
        The token to use for pagination. If not present, start from the beginning.
        Typically, this token is last table name from the previous page.
        Only supported by LanceDb Cloud.
    limit: int, default 10
        The size of the page to return.
        Only supported by LanceDb Cloud.

    Returns
    -------
    Iterable of str
    """
    pass
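
The page_token and limit parameters of table_names suggest a simple loop for collecting every name. The sketch below assumes the documented convention that the token is the last table name from the previous page; FakeConnection is a hypothetical stand-in used so the loop can run without a LanceDB Cloud account, and iter_all_table_names is likewise an illustrative helper.

```python
from typing import Iterable, Iterator, List, Optional


class FakeConnection:
    """Hypothetical stand-in that pages through a sorted list of names."""

    def __init__(self, names: List[str]):
        self._names = sorted(names)

    def table_names(
        self, page_token: Optional[str] = None, limit: int = 10
    ) -> Iterable[str]:
        # The token is the last table name from the previous page.
        remaining = [
            n for n in self._names if page_token is None or n > page_token
        ]
        return remaining[:limit]


def iter_all_table_names(db, page_size: int = 10) -> Iterator[str]:
    """Yield every table name by following page tokens until a page is empty."""
    token: Optional[str] = None
    while True:
        page = list(db.table_names(page_token=token, limit=page_size))
        if not page:
            return
        yield from page
        token = page[-1]
```

Because pagination is documented as supported only by LanceDB Cloud, this loop is appropriate only where page_token is honored; against a backend that ignores the token it would not terminate.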

create_table abstractmethod

create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None, *, storage_options: Optional[Dict[str, str]] = None, data_storage_version: Optional[str] = None, enable_v2_manifest_paths: Optional[bool] = None) -> Table

Create a Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • data (Optional[DATA], default: None ) –

    The data to initialize the table. At least one of data or schema must be provided. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • schema (Optional[Union[Schema, LanceModel]], default: None ) –

    Acceptable types are:

    • pyarrow.Schema

    • LanceModel

  • mode (str, default: 'create' ) –

    The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".

  • exist_ok (bool, default: False ) –

    If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.

  • on_bad_vectors (str, default: 'error' ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default: 0.0 ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

  • data_storage_version (Optional[str], default: None ) –

    Deprecated. Set storage_options when connecting to the database and set new_table_data_storage_version in the options.

  • enable_v2_manifest_paths (Optional[bool], default: None ) –

    Deprecated. Set storage_options when connecting to the database and set new_table_enable_v2_manifest_paths in the options.
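
The three on_bad_vectors policies listed above can be modeled on plain Python lists. This is an illustrative sketch, not LanceDB's implementation: the helper name handle_bad_vectors and the dim argument are hypothetical, and it assumes that "fill" replaces the whole bad vector with fill_value.

```python
import math
from typing import List, Sequence


def handle_bad_vectors(
    vectors: Sequence[Sequence[float]],
    dim: int,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
) -> List[List[float]]:
    """Apply an on_bad_vectors policy to plain lists (illustrative model)."""
    if on_bad_vectors not in ("error", "drop", "fill"):
        raise ValueError(f"Unknown on_bad_vectors policy: {on_bad_vectors!r}")

    cleaned: List[List[float]] = []
    for vec in vectors:
        # A vector is "bad" if it has the wrong size or contains a NaN.
        is_bad = len(vec) != dim or any(math.isnan(x) for x in vec)
        if not is_bad:
            cleaned.append(list(vec))
        elif on_bad_vectors == "error":
            raise ValueError(f"Bad vector encountered: {list(vec)!r}")
        elif on_bad_vectors == "fill":
            # Assumption: the whole vector is replaced by fill_value.
            cleaned.append([fill_value] * dim)
        # "drop": skip the bad vector entirely.
    return cleaned
```

With "drop" the result can contain fewer rows than the input, whereas "fill" preserves the row count.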

Returns:

  • LanceTable –

    A reference to the newly created table.

Note: The vector index is not created by default. To create the index, call the create_index method on the table.

Examples:

A table can be created from a list of tuples or dictionaries:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
>>> db.create_table("my_table", data)
LanceTable(name='my_table', version=1, ...)
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(name='table2', version=1, ...)
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(name='table3', version=1, ...)
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create a table from an Iterable[pa.RecordBatch]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema=pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(name='table4', version=1, ...)
Source code in lancedb/db.py
@abstractmethod
def create_table(
    self,
    name: str,
    data: Optional[DATA] = None,
    schema: Optional[Union[pa.Schema, LanceModel]] = None,
    mode: str = "create",
    exist_ok: bool = False,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
    embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
    *,
    storage_options: Optional[Dict[str, str]] = None,
    data_storage_version: Optional[str] = None,
    enable_v2_manifest_paths: Optional[bool] = None,
) -> Table:
    """Create a [Table][lancedb.table.Table] in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    data: The data to initialize the table, *optional*
        User must provide at least one of `data` or `schema`.
        Acceptable types are:

        - list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    schema: The schema of the table, *optional*
        Acceptable types are:

        - pyarrow.Schema

        - [LanceModel][lancedb.pydantic.LanceModel]
    mode: str; default "create"
        The mode to use when creating the table.
        Can be either "create" or "overwrite".
        By default, if the table already exists, an exception is raised.
        If you want to overwrite the table, use mode="overwrite".
    exist_ok: bool, default False
        If a table by the same name already exists, then raise an exception
        if exist_ok=False. If exist_ok=True, then open the existing table;
        it will not add the provided data but will validate against any
        schema that's specified.
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float
        The value to use when filling vectors. Only used if on_bad_vectors="fill".
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>
    data_storage_version: optional, str, default "stable"
        Deprecated.  Set `storage_options` when connecting to the database and set
        `new_table_data_storage_version` in the options.
    enable_v2_manifest_paths: optional, bool, default False
        Deprecated.  Set `storage_options` when connecting to the database and set
        `new_table_enable_v2_manifest_paths` in the options.
    Returns
    -------
    LanceTable
        A reference to the newly created table.

    !!! note

        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

    Examples
    --------

    Can create with list of tuples or dictionaries:

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
    >>> db.create_table("my_table", data)
    LanceTable(name='my_table', version=1, ...)
    >>> db["my_table"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    You can also pass a pandas DataFrame:

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
    ...    "lat": [45.5, 40.1],
    ...    "long": [-122.7, -74.1]
    ... })
    >>> db.create_table("table2", data)
    LanceTable(name='table2', version=1, ...)
    >>> db["table2"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    Data is converted to Arrow before being written to disk. For maximum
    control over how data is saved, either provide the PyArrow schema to
    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

    >>> import pyarrow as pa
    >>> custom_schema = pa.schema([
    ...   pa.field("vector", pa.list_(pa.float32(), 2)),
    ...   pa.field("lat", pa.float32()),
    ...   pa.field("long", pa.float32())
    ... ])
    >>> db.create_table("table3", data, schema = custom_schema)
    LanceTable(name='table3', version=1, ...)
    >>> db["table3"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: float
    long: float
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]


    It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


    >>> import pyarrow as pa
    >>> def make_batches():
    ...     for i in range(5):
    ...         yield pa.RecordBatch.from_arrays(
    ...             [
    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
    ...                     pa.list_(pa.float32(), 2)),
    ...                 pa.array(["foo", "bar"]),
    ...                 pa.array([10.0, 20.0]),
    ...             ],
    ...             ["vector", "item", "price"],
    ...         )
    >>> schema=pa.schema([
    ...     pa.field("vector", pa.list_(pa.float32(), 2)),
    ...     pa.field("item", pa.utf8()),
    ...     pa.field("price", pa.float32()),
    ... ])
    >>> db.create_table("table4", make_batches(), schema=schema)
    LanceTable(name='table4', version=1, ...)

    """
    raise NotImplementedError

open_table

open_table(name: str, *, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> Table

Open a Lance Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • index_cache_size (Optional[int], default: None ) –

    Set the size of the index cache, specified as a number of entries.

    The exact meaning of an "entry" will depend on the type of index:

    • IVF: there is one entry for each IVF partition

    • BTREE: there is one entry for the entire index

    This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM.

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/
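
The inheritance rule for storage_options (connection-level options are inherited, table-level options override them) behaves like a dictionary merge. The sketch below models that rule only; the helper name effective_storage_options is hypothetical, and the option keys in the usage note are placeholders rather than a list of supported options.

```python
from typing import Dict, Optional


def effective_storage_options(
    connection_options: Optional[Dict[str, str]],
    table_options: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Merge options so that table-level values override connection-level ones."""
    merged = dict(connection_options or {})
    merged.update(table_options or {})
    return merged
```

For example, merging {"aws_access_key_id": "conn-key", "timeout": "30s"} with {"timeout": "60s"} keeps the connection's access key and uses the table's timeout.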

Returns:

  • Table –

    A LanceTable object representing the table.

Source code in lancedb/db.py
def open_table(
    self,
    name: str,
    *,
    storage_options: Optional[Dict[str, str]] = None,
    index_cache_size: Optional[int] = None,
) -> Table:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Returns
    -------
    A LanceTable object representing the table.
    """
    raise NotImplementedError
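
Since index_cache_size is counted in entries, a rough lower bound for holding every index of a table in the cache can be computed from the entry rules in the docstring above. This is a back-of-the-envelope sketch; the helper name estimate_cache_entries and the (index_type, num_partitions) tuple layout are assumptions for illustration.

```python
from typing import Iterable, Tuple


def estimate_cache_entries(indices: Iterable[Tuple[str, int]]) -> int:
    """Sum cache entries: one per IVF partition, one per BTREE index."""
    total = 0
    for index_type, num_partitions in indices:
        if index_type == "IVF":
            # One entry for each IVF partition.
            total += num_partitions
        elif index_type == "BTREE":
            # One entry for the entire index; num_partitions is ignored.
            total += 1
        else:
            raise ValueError(f"Unsupported index type in this sketch: {index_type!r}")
    return total
```

A table with one IVF index of 256 partitions and two BTREE indices would need 258 entries under these rules.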

drop_table

drop_table(name: str)

Drop a table from the database.

Parameters:

  • name (str) –

    The name of the table.

Source code in lancedb/db.py
def drop_table(self, name: str):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    """
    raise NotImplementedError

rename_table

rename_table(cur_name: str, new_name: str)

Rename a table in the database.

Parameters:

  • cur_name (str) –

    The current name of the table.

  • new_name (str) –

    The new name of the table.

Source code in lancedb/db.py
def rename_table(self, cur_name: str, new_name: str):
    """Rename a table in the database.

    Parameters
    ----------
    cur_name: str
        The current name of the table.
    new_name: str
        The new name of the table.
    """
    raise NotImplementedError

drop_database

drop_database()

Drop the database. This is the same as dropping all the tables.

Source code in lancedb/db.py
def drop_database(self):
    """
    Drop database
    This is the same thing as dropping all the tables
    """
    raise NotImplementedError

drop_all_tables

drop_all_tables()

Drop all tables from the database.

Source code in lancedb/db.py
def drop_all_tables(self):
    """
    Drop all tables from the database
    """
    raise NotImplementedError

Tables (Synchronous)

lancedb.table.Table

Bases: ABC

A Table is a collection of Records in a LanceDB Database.

Examples:

Create using DBConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with Table.add().

>>> table.add([{"vector": [0.5, 1.3], "b": 4}])
AddResult(version=2)

Can query the table with Table.search.

>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See Table.create_index.
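
The _distance values printed in the search example above are consistent with a squared Euclidean (L2) distance, with no square root taken. That reading is an inference from the printed numbers, not a statement about the library's internals. The following pure-Python sketch recomputes them by brute force; the helper squared_l2 is hypothetical and not part of lancedb.

```python
def squared_l2(query, vector):
    """Sum of squared element-wise differences between two equal-length vectors."""
    return sum((q - v) ** 2 for q, v in zip(query, vector))


query = [0.4, 0.4]
# The two rows held by the table in the example above.
rows = [{"vector": [1.1, 1.2], "b": 2}, {"vector": [0.5, 1.3], "b": 4}]

# Brute-force scan: score every row, then sort by ascending distance.
ranked = sorted(
    (
        {"b": row["b"], "_distance": round(squared_l2(query, row["vector"]), 2)}
        for row in rows
    ),
    key=lambda r: r["_distance"],
)
print(ranked)
```

A scan like this touches every row, which is why an index helps on larger tables.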

Source code in lancedb/table.py
class Table(ABC):
    """
    A Table is a collection of Records in a LanceDB Database.

    Examples
    --------

    Create using [DBConnection.create_table][lancedb.DBConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
    >>> table.head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    Can append new data with [Table.add()][lancedb.table.Table.add].

    >>> table.add([{"vector": [0.5, 1.3], "b": 4}])
    AddResult(version=2)

    Can query the table with [Table.search][lancedb.table.Table.search].

    >>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
       b      vector  _distance
    0  4  [0.5, 1.3]       0.82
    1  2  [1.1, 1.2]       1.13

    Search queries are much faster when an index is created. See
    [Table.create_index][lancedb.table.Table.create_index].
    """

    @property
    @abstractmethod
    def name(self) -> str:
        """The name of this Table"""
        raise NotImplementedError

    @property
    @abstractmethod
    def version(self) -> int:
        """The version of this Table"""
        raise NotImplementedError

    @property
    @abstractmethod
    def schema(self) -> pa.Schema:
        """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
        of this Table

        """
        raise NotImplementedError

    @property
    @abstractmethod
    def tags(self) -> Tags:
        """Tag management for the table.

        Similar to Git, tags are a way to add metadata to a specific version of the
        table.

        .. warning::

            Tagged versions are exempted from the :py:meth:`cleanup_old_versions()`
            process.

            To remove a version that has been tagged, you must first
            :py:meth:`~Tags.delete` the associated tag.

        Examples
        --------

        .. code-block:: python

            table = db.open_table("my_table")
            table.tags.create("v2-prod-20250203", 10)

            tags = table.tags.list()

        """
        raise NotImplementedError

    def __len__(self) -> int:
        """The number of rows in this Table"""
        return self.count_rows(None)

    @property
    @abstractmethod
    def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
        """
        Get a mapping from vector column name to its configured embedding function.
        """

    @abstractmethod
    def count_rows(self, filter: Optional[str] = None) -> int:
        """
        Count the number of rows in the table.

        Parameters
        ----------
        filter: str, optional
            A SQL where clause to filter the rows to count.
        """
        raise NotImplementedError

    def to_pandas(self) -> "pandas.DataFrame":
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return self.to_arrow().to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table
        """
        raise NotImplementedError

    def create_index(
        self,
        metric="l2",
        num_partitions=256,
        num_sub_vectors=96,
        vector_column_name: str = VECTOR_COLUMN_NAME,
        replace: bool = True,
        accelerator: Optional[str] = None,
        index_cache_size: Optional[int] = None,
        *,
        index_type: VectorIndexType = "IVF_PQ",
        wait_timeout: Optional[timedelta] = None,
        num_bits: int = 8,
        max_iterations: int = 50,
        sample_rate: int = 256,
        m: int = 20,
        ef_construction: int = 300,
    ):
        """Create an index on the table.

        Parameters
        ----------
        metric: str, default "l2"
            The distance metric to use when creating the index.
            Valid values are "l2", "cosine", "dot", or "hamming".
            l2 is euclidean distance.
            Hamming is available only for binary vectors.
        num_partitions: int, default 256
            The number of IVF partitions to use when creating the index.
            Default is 256.
        num_sub_vectors: int, default 96
            The number of PQ sub-vectors to use when creating the index.
            Default is 96.
        vector_column_name: str, default "vector"
            The vector column name to create the index.
        replace: bool, default True
            - If True, replace the existing index if it exists.

            - If False, raise an error if duplicate index exists.
        accelerator: str, default None
            If set, use the given accelerator to create the index.
            Only "cuda" is supported for now.
        index_cache_size : int, optional
            The size of the index cache in number of entries. Default value is 256.
        num_bits: int
            The number of bits to encode sub-vectors. Only used with the IVF_PQ index.
            Only 4 and 8 are supported.
        wait_timeout: timedelta, optional
            The timeout to wait if indexing is asynchronous.
        """
        raise NotImplementedError

    def drop_index(self, name: str) -> None:
        """
        Drop an index from the table.

        Parameters
        ----------
        name: str
            The name of the index to drop.

        Notes
        -----
        This does not delete the index from disk, it just removes it from the table.
        To delete the index, run [optimize][lancedb.table.Table.optimize]
        after dropping the index.

        Use [list_indices][lancedb.table.Table.list_indices] to find the names of
        the indices.
        """
        raise NotImplementedError

    def wait_for_index(
        self, index_names: Iterable[str], timeout: timedelta = timedelta(seconds=300)
    ) -> None:
        """
        Wait for indexing to complete for the given index names.
        This will poll the table until all the indices are fully indexed,
        or raise a timeout exception if the timeout is reached.

        Parameters
        ----------
        index_names: Iterable[str]
            The names of the indices to poll.
        timeout: timedelta
            Timeout to wait for asynchronous indexing. The default is 5 minutes.
        """
        raise NotImplementedError

    @abstractmethod
    def stats(self) -> TableStatistics:
        """
        Retrieve table and fragment statistics.
        """
        raise NotImplementedError

    @abstractmethod
    def create_scalar_index(
        self,
        column: str,
        *,
        replace: bool = True,
        index_type: ScalarIndexType = "BTREE",
        wait_timeout: Optional[timedelta] = None,
    ):
        """Create a scalar index on a column.

        Parameters
        ----------
        column : str
            The column to be indexed.  Must be a boolean, integer, float,
            or string column.
        replace : bool, default True
            Replace the existing index if it exists.
        index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"], default "BTREE"
            The type of index to create.
        wait_timeout: timedelta, optional
            The timeout to wait if indexing is asynchronous.

        Examples
        --------

        Scalar indices, like vector indices, can be used to speed up scans.  A scalar
        index can speed up scans that contain filter expressions on the indexed column.
        For example, the following scan will be faster if the column ``my_col`` has
        a scalar index:

        >>> import lancedb # doctest: +SKIP
        >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
        >>> img_table = db.open_table("images") # doctest: +SKIP
        >>> my_df = img_table.search().where("my_col = 7", # doctest: +SKIP
        ...                                  prefilter=True).to_pandas()

        Scalar indices can also speed up scans containing a vector search and a
        prefilter:

        >>> import lancedb # doctest: +SKIP
        >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
        >>> img_table = db.open_table("images") # doctest: +SKIP
        >>> (img_table.search([1, 2, 3, 4], vector_column_name="vector") # doctest: +SKIP
        ...     .where("my_col != 7", prefilter=True)
        ...     .to_pandas())

        Scalar indices can only speed up scans for basic filters using
        equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
        membership (e.g. ``my_col IN (0, 1, 2)``).

        Scalar indices can be used if the filter contains multiple indexed columns and
        the filter criteria are AND'd or OR'd together
        (e.g. ``my_col < 0 AND other_col > 100``).

        Scalar indices may be used if the filter contains non-indexed columns but,
        depending on the structure of the filter, they may not be usable.  For example,
        if the column ``not_indexed`` does not have a scalar index then the filter
        ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
        ``my_col``.
        """
        raise NotImplementedError

    def create_fts_index(
        self,
        field_names: Union[str, List[str]],
        *,
        ordering_field_names: Optional[Union[str, List[str]]] = None,
        replace: bool = False,
        writer_heap_size: Optional[int] = 1024 * 1024 * 1024,
        use_tantivy: bool = True,
        tokenizer_name: Optional[str] = None,
        with_position: bool = True,
        # tokenizer configs:
        base_tokenizer: BaseTokenizerType = "simple",
        language: str = "English",
        max_token_length: Optional[int] = 40,
        lower_case: bool = True,
        stem: bool = False,
        remove_stop_words: bool = False,
        ascii_folding: bool = False,
        wait_timeout: Optional[timedelta] = None,
    ):
        """Create a full-text search index on the table.

        Warning - this API is highly experimental and is highly likely to change
        in the future.

        Parameters
        ----------
        field_names: str or list of str
            The name(s) of the field to index.
            Can only be a str if use_tantivy=True for now.
        replace: bool, default False
            If True, replace the existing index if it exists. Note that this is
            not yet an atomic operation; the index will be temporarily
            unavailable while the new index is being created.
        writer_heap_size: int, default 1GB
            Only available with use_tantivy=True
        ordering_field_names:
            A list of unsigned-type fields to index so that results can
            optionally be ordered by them at search time.
            Only available with use_tantivy=True
        tokenizer_name: str, default "default"
            The tokenizer to use for the index. Can be "raw", "default", or the
            two-letter language code followed by "_stem". For English, this
            would be "en_stem".
            For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
        use_tantivy: bool, default True
            If True, use the legacy full-text search implementation based on tantivy.
            If False, use the new full-text search implementation based on lance-index.
        with_position: bool, default True
            Only available with use_tantivy=False
            If False, do not store the positions of the terms in the text.
            This can reduce the size of the index and improve indexing speed.
            But it will raise an exception for phrase queries.
        base_tokenizer : str, default "simple"
            The base tokenizer to use for tokenization. Options are:
            - "simple": Splits text by whitespace and punctuation.
            - "whitespace": Split text by whitespace, but not punctuation.
            - "raw": No tokenization. The entire text is treated as a single token.
        language : str, default "English"
            The language to use for tokenization.
        max_token_length : int, default 40
            The maximum token length to index. Tokens longer than this length will be
            ignored.
        lower_case : bool, default True
            Whether to convert the token to lower case. This makes queries
            case-insensitive.
        stem : bool, default False
            Whether to stem the token. Stemming reduces words to their root form.
            For example, in English "running" and "runs" would both be reduced to "run".
        remove_stop_words : bool, default False
            Whether to remove stop words. Stop words are common words that are often
            removed from text before indexing. For example, in English "the" and "and".
        ascii_folding : bool, default False
            Whether to fold ASCII characters. This converts accented characters to
            their ASCII equivalent. For example, "café" would be converted to "cafe".
        wait_timeout: timedelta, optional
            The timeout to wait if indexing is asynchronous.
        """
        raise NotImplementedError

    @abstractmethod
    def add(
        self,
        data: DATA,
        mode: AddMode = "append",
        on_bad_vectors: OnBadVectorsType = "error",
        fill_value: float = 0.0,
    ) -> AddResult:
        """Add more data to the [Table](Table).

        Parameters
        ----------
        data: DATA
            The data to insert into the table. Acceptable types are:

            - list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contain NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        Returns
        -------
        AddResult
            An object containing the new version number of the table after adding data.
        """
        raise NotImplementedError

    def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
        """
        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
        that can be used to create a "merge insert" operation.

        This operation can add rows, update rows, and remove rows all in a single
        transaction. It is a very generic tool that can be used to create
        behaviors like "insert if not exists", "update or insert (i.e. upsert)",
        or even replace a portion of existing data with new data (e.g. replace
        all data where month="january").

        The merge insert operation works by combining new data from a
        **source table** with existing data in a **target table** by using a
        join.  There are three categories of records.

        "Matched" records are records that exist in both the source table and
        the target table. "Not matched" records exist only in the source table
        (i.e. this is new data). "Not matched by source" records exist only
        in the target table (i.e. this is old data).

        The builder returned by this method can be used to customize what
        should happen for each category of data.

        Please note that the data may appear to be reordered as part of this
        operation.  This is because updated rows will be deleted from the
        dataset and then reinserted at the end with the new values.

        Parameters
        ----------

        on: Union[str, Iterable[str]]
            A column (or columns) to join on.  This is how records from the
            source table and target table are matched.  Typically this is some
            kind of key or id column.

        Examples
        --------
        >>> import lancedb
        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
        >>> # Perform an "upsert" operation
        >>> res = table.merge_insert("a")     \\
        ...      .when_matched_update_all()     \\
        ...      .when_not_matched_insert_all() \\
        ...      .execute(new_data)
        >>> res
        MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
        >>> # The order of new rows is non-deterministic since we use
        >>> # a hash-join as part of this operation and so we sort here
        >>> table.to_arrow().sort_by("a").to_pandas()
           a  b
        0  1  b
        1  2  x
        2  3  y
        3  4  z
        """  # noqa: E501
        on = [on] if isinstance(on, str) else list(iter(on))

        return LanceMergeInsertBuilder(self, on)

    @abstractmethod
    def search(
        self,
        query: Optional[
            Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]
        ] = None,
        vector_column_name: Optional[str] = None,
        query_type: QueryType = "auto",
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> LanceQueryBuilder:
        """Create a search query to find the nearest neighbors
        of the given query vector. We currently support [vector search][search]
        and [full-text search][experimental-full-text-search].

        All query options are defined in
        [LanceQueryBuilder][lancedb.query.LanceQueryBuilder].

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [
        ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
        ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
        ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
        ... ]
        >>> table = db.create_table("my_table", data)
        >>> query = [0.4, 1.4, 2.4]
        >>> (table.search(query)
        ...     .where("original_width > 1000", prefilter=True)
        ...     .select(["caption", "original_width", "vector"])
        ...     .limit(2)
        ...     .to_pandas())
          caption  original_width           vector  _distance
        0     foo            2000  [0.5, 3.4, 1.3]   5.220000
        1    test            3000  [0.3, 6.2, 2.6]  23.089996

        Parameters
        ----------
        query: list/np.ndarray/str/PIL.Image.Image, default None
            The targeted vector to search for.

            - *default None*.
            Acceptable types are: list, np.ndarray, PIL.Image.Image

            - If None then the select/where/limit clauses are applied to filter
            the table
        vector_column_name: str, optional
            The name of the vector column to search.

            The vector column needs to be a pyarrow fixed size list type

            - If not specified then the vector column is inferred from
            the table schema

            - If the table has multiple vector columns then the *vector_column_name*
            needs to be specified. Otherwise, an error is raised.
        query_type: str
            *default "auto"*.
            Acceptable types are: "vector", "fts", "hybrid", or "auto"

            - If "auto" then the query type is inferred from the query;

                - If `query` is a list/np.ndarray then the query type is
                "vector";

                - If `query` is a PIL.Image.Image then either do vector search,
                or raise an error if no corresponding embedding function is found.

                - If `query` is a string, then the query type is "vector" if the
                table has embedding functions; otherwise the query type is "fts".

        Returns
        -------
        LanceQueryBuilder
            A query builder object representing the query.
            Once executed, the query returns

            - selected columns

            - the vector

            - and also the "_distance" column which is the distance between the query
            vector and the returned vector.
        """
        raise NotImplementedError

    @abstractmethod
    def _execute_query(
        self,
        query: Query,
        *,
        batch_size: Optional[int] = None,
        timeout: Optional[timedelta] = None,
    ) -> pa.RecordBatchReader: ...

    @abstractmethod
    def _explain_plan(self, query: Query, verbose: Optional[bool] = False) -> str: ...

    @abstractmethod
    def _analyze_plan(self, query: Query) -> str: ...

    @abstractmethod
    def _do_merge(
        self,
        merge: LanceMergeInsertBuilder,
        new_data: DATA,
        on_bad_vectors: OnBadVectorsType,
        fill_value: float,
    ) -> MergeResult: ...

    @abstractmethod
    def delete(self, where: str) -> DeleteResult:
        """Delete rows from the table.

        This can be used to delete a single row, many rows, all rows, or
        sometimes no rows (if your predicate matches nothing).

        Parameters
        ----------
        where: str
            The SQL where clause to use when deleting rows.

            - For example, 'x = 2' or 'x IN (1, 2, 3)'.

            The filter must not be empty, or it will error.

        Returns
        -------
        DeleteResult
            An object containing the new version number of the table after deletion.

        Examples
        --------
        >>> import lancedb
        >>> data = [
        ...    {"x": 1, "vector": [1.0, 2]},
        ...    {"x": 2, "vector": [3.0, 4]},
        ...    {"x": 3, "vector": [5.0, 6]}
        ... ]
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.delete("x = 2")
        DeleteResult(version=2)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  3  [5.0, 6.0]

        If you have a list of values to delete, you can combine them into a
        stringified list and use the `IN` operator:

        >>> to_remove = [1, 5]
        >>> to_remove = ", ".join([str(v) for v in to_remove])
        >>> to_remove
        '1, 5'
        >>> table.delete(f"x IN ({to_remove})")
        DeleteResult(version=3)
        >>> table.to_pandas()
           x      vector
        0  3  [5.0, 6.0]
        """
        raise NotImplementedError

    @abstractmethod
    def update(
        self,
        where: Optional[str] = None,
        values: Optional[dict] = None,
        *,
        values_sql: Optional[Dict[str, str]] = None,
    ) -> UpdateResult:
        """
        This can be used to update zero to all rows depending on how many
        rows match the where clause. If no where clause is provided, then
        all rows will be updated.

        Either `values` or `values_sql` must be provided. You cannot provide
        both.

        Parameters
        ----------
        where: str, optional
            The SQL where clause to use when updating rows. For example, 'x = 2'
            or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
        values: dict, optional
            The values to update. The keys are the column names and the values
            are the values to set.
        values_sql: dict, optional
            The values to update, expressed as SQL expression strings. These can
            reference existing columns. For example, {"x": "x + 1"} will increment
            the x column by 1.

        Returns
        -------
        UpdateResult
            - rows_updated: The number of rows that were updated
            - version: The new version number of the table after the update

        Examples
        --------
        >>> import lancedb
        >>> import pandas as pd
        >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.update(where="x = 2", values={"vector": [10.0, 10]})
        UpdateResult(rows_updated=1, version=2)
        >>> table.to_pandas()
           x        vector
        0  1    [1.0, 2.0]
        1  3    [5.0, 6.0]
        2  2  [10.0, 10.0]
        >>> table.update(values_sql={"x": "x + 1"})
        UpdateResult(rows_updated=3, version=3)
        >>> table.to_pandas()
           x        vector
        0  2    [1.0, 2.0]
        1  4    [5.0, 6.0]
        2  3  [10.0, 10.0]
        """
        raise NotImplementedError

    @abstractmethod
    def cleanup_old_versions(
        self,
        older_than: Optional[timedelta] = None,
        *,
        delete_unverified: bool = False,
    ) -> "CleanupStats":
        """
        Clean up old versions of the table, freeing disk space.

        Parameters
        ----------
        older_than: timedelta, default None
            The minimum age of the version to delete. If None, then this defaults
            to two weeks.
        delete_unverified: bool, default False
            Because they may be part of an in-progress transaction, files less
            than 7 days old are not deleted by default. If you are sure that
            there are no in-progress transactions, then you can set this to True
            to delete all files older than `older_than`.

        Returns
        -------
        CleanupStats
            The stats of the cleanup operation, including how many bytes were
            freed.

        See Also
        --------
        [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
            optimization operation that includes cleanup as well as other operations.

        Notes
        -----
        This function is not available in LanceDB Cloud (since LanceDB
        Cloud manages cleanup for you automatically)
        """

    @abstractmethod
    def compact_files(self, *args, **kwargs):
        """
        Run the compaction process on the table.
        This can be run after making several small appends to optimize the table
        for faster reads.

        Arguments are passed onto Lance's
        [compact_files][lance.dataset.DatasetOptimizer.compact_files].
        For most cases, the default should be fine.

        See Also
        --------
        [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
            optimization operation that includes cleanup as well as other operations.

        Notes
        -----
        This function is not available in LanceDB Cloud (since LanceDB
        Cloud manages compaction for you automatically)
        """

    @abstractmethod
    def optimize(
        self,
        *,
        cleanup_older_than: Optional[timedelta] = None,
        delete_unverified: bool = False,
        retrain: bool = False,
    ):
        """
        Optimize the on-disk data and indices for better performance.

        Modeled after ``VACUUM`` in PostgreSQL.

        Optimization covers three operations:

         * Compaction: Merges small files into larger ones
         * Prune: Removes old versions of the dataset
         * Index: Optimizes the indices, adding new data to existing indices

        Parameters
        ----------
        cleanup_older_than: timedelta, optional, default 7 days
            All files belonging to versions older than this will be removed.  Set
            to 0 days to remove all versions except the latest.  The latest version
            is never removed.
        delete_unverified: bool, default False
            Files leftover from a failed transaction may appear to be part of an
            in-progress operation (e.g. appending new data) and these files will not
            be deleted unless they are at least 7 days old. If delete_unverified is True
            then these files will be deleted regardless of their age.
        retrain: bool, default False
            If True, retrain the vector indices. This refines the IVF clustering
            and quantization, which may improve search accuracy. It is faster than
            re-creating the index from scratch, so it is worth trying first when
            the data distribution has changed significantly.

        Experimental API
        ----------------

        The optimization process is undergoing active development and may change.
        Our goal with these changes is to improve the performance of optimization and
        reduce the complexity.

        That being said, it is essential today to run optimize if you want the best
        performance.  It should be stable and safe to use in production, but it is our
        hope that the API may be simplified (or not even need to be called) in the
        future.

        How often an application should call optimize depends on the frequency of
        data modifications.  If data is frequently added, deleted, or updated then
        optimize should be run frequently.  A good rule of thumb is to run optimize if
        you have added or modified 100,000 or more records or run more than 20 data
        modification operations.
        """

    @abstractmethod
    def list_indices(self) -> Iterable[IndexConfig]:
        """
        List all indices that have been created with
        [Table.create_index][lancedb.table.Table.create_index]
        """

    @abstractmethod
    def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
        """
        Retrieve statistics about an index

        Parameters
        ----------
        index_name: str
            The name of the index to retrieve statistics for

        Returns
        -------
        IndexStatistics or None
            The statistics about the index. Returns None if the index does not exist.
        """

    @abstractmethod
    def add_columns(
        self, transforms: Dict[str, str] | pa.Field | List[pa.Field] | pa.Schema
    ):
        """
        Add new columns with defined values.

        Parameters
        ----------
        transforms: Dict[str, str], pa.Field, List[pa.Field], pa.Schema
            A map of column name to a SQL expression to use to calculate the
            value of the new column. These expressions will be evaluated for
            each row in the table, and can reference existing columns.
            Alternatively, a pyarrow Field or Schema can be provided to add
            new columns with the specified data types. The new columns will
            be initialized with null values.

        Returns
        -------
        AddColumnsResult
            version: the new version number of the table after adding columns.
        """

    @abstractmethod
    def alter_columns(self, *alterations: Iterable[Dict[str, str]]):
        """
        Alter column names and nullability.

        Parameters
        ----------
        alterations : Iterable[Dict[str, Any]]
            A sequence of dictionaries, each with the following keys:
            - "path": str
                The column path to alter. For a top-level column, this is the name.
                For a nested column, this is the dot-separated path, e.g. "a.b.c".
            - "rename": str, optional
                The new name of the column. If not specified, the column name is
                not changed.
            - "data_type": pyarrow.DataType, optional
                The new data type of the column. Existing values will be cast
                to this type. If not specified, the column data type is not changed.
            - "nullable": bool, optional
                Whether the column should be nullable. If not specified, the column
                nullability is not changed. Only non-nullable columns can be changed
                to nullable. Currently, you cannot change a nullable column to
                non-nullable.

        Returns
        -------
        AlterColumnsResult
            version: the new version number of the table after the alteration.
        """

    @abstractmethod
    def drop_columns(self, columns: Iterable[str]) -> DropColumnsResult:
        """
        Drop columns from the table.

        Parameters
        ----------
        columns : Iterable[str]
            The names of the columns to drop.

        Returns
        -------
        DropColumnsResult
            version: the new version number of the table after dropping the columns.
        """

    @abstractmethod
    def checkout(self, version: Union[int, str]):
        """
        Checks out a specific version of the Table

        Any read operation on the table will now access the data at the checked out
        version. As a consequence, calling this method will disable any read consistency
        interval that was previously set.

        This is a read-only operation that turns the table into a sort of "view"
        or "detached head".  Other table instances will not be affected.  To make the
        change permanent you can use the [restore][lancedb.table.Table.restore] method.

        Any operation that modifies the table will fail while the table is in a checked
        out state.

        Parameters
        ----------
        version: int | str,
            The version to check out. A version number (`int`) or a tag
            (`str`) can be provided.

        To return the table to a normal state use
        [checkout_latest][lancedb.table.Table.checkout_latest].
        """

    @abstractmethod
    def checkout_latest(self):
        """
        Ensures the table is pointing at the latest version

        This can be used to manually update a table when the read_consistency_interval
        is None.
        It can also be used to undo a [checkout][lancedb.table.Table.checkout]
        operation.
        """

    @abstractmethod
    def restore(self, version: Optional[Union[int, str]] = None):
        """Restore a version of the table. This is an in-place operation.

        This creates a new version where the data is equivalent to the
        specified previous version. Data is not copied (as of python-v0.2.1).

        Parameters
        ----------
        version : int or str, default None
            The version number or version tag to restore.
            If unspecified then restores the currently checked out version.
            If the currently checked out version is the
            latest version then this is a no-op.
        """

    @abstractmethod
    def list_versions(self) -> List[Dict[str, Any]]:
        """List all versions of the table"""

    @cached_property
    def _dataset_uri(self) -> str:
        return _table_uri(self._conn.uri, self.name)

    def _get_fts_index_path(self) -> Tuple[str, pa_fs.FileSystem, bool]:
        from .remote.table import RemoteTable

        if isinstance(self, RemoteTable) or get_uri_scheme(self._dataset_uri) != "file":
            return ("", None, False)
        path = join_uri(self._dataset_uri, "_indices", "fts")
        fs, path = fs_from_uri(path)
        index_exists = fs.get_file_info(path).type != pa_fs.FileType.NotFound
        return (path, fs, index_exists)

    @abstractmethod
    def uses_v2_manifest_paths(self) -> bool:
        """
        Check if the table is using the new v2 manifest paths.

        Returns
        -------
        bool
            True if the table is using the new v2 manifest paths, False otherwise.
        """

    @abstractmethod
    def migrate_v2_manifest_paths(self):
        """
        Migrate the manifest paths to the new format.

        This will update the manifest to use the new v2 format for paths.

        This function is idempotent, and can be run multiple times without
        changing the state of the object store.

        !!! danger

            This should not be run while other concurrent operations are happening.
            It should also run to completion before other operations resume.

        You can use
        [Table.uses_v2_manifest_paths][lancedb.table.Table.uses_v2_manifest_paths]
        to check if the table is already using the new path style.
        """

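The optimize docstring in the listing above gives a rule of thumb for when to run it: after 100,000 or more added or modified records, or after more than 20 data modification operations. A minimal sketch of that check follows. The helper name `should_optimize` and its two counters are hypothetical; an application would have to track them itself, since nothing in the documented API is assumed to expose them.

```python
def should_optimize(rows_modified: int, num_operations: int) -> bool:
    """Return True once either threshold from the rule of thumb is reached.

    Both counters are assumed to be tracked by the application since the
    last optimize call; they are not read from the table.
    """
    return rows_modified >= 100_000 or num_operations > 20


print(should_optimize(rows_modified=5_000, num_operations=3))    # False
print(should_optimize(rows_modified=150_000, num_operations=1))  # True
print(should_optimize(rows_modified=5_000, num_operations=21))   # True
```

When the check returns True, the application would call the table's optimize method and reset both counters.
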
name abstractmethod property

name: str

The name of this Table

version abstractmethod property

version: int

The version of this Table
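The checkout, checkout_latest, and restore docstrings in the source listing describe how the version changes: a checkout pins reads to an older version and blocks writes, while restore appends a new version whose data equals the old one. The toy model below illustrates those rules in plain Python. The class `VersionedRows` is hypothetical and is not how LanceDB stores versions; in particular, it assumes the table tracks the latest version again after a restore.

```python
class VersionedRows:
    """Toy model of table versioning; not LanceDB's implementation."""

    def __init__(self, rows):
        self._versions = [list(rows)]  # version n lives at index n - 1
        self._checked_out = None       # None means "tracking the latest version"

    @property
    def latest(self) -> int:
        return len(self._versions)

    @property
    def version(self) -> int:
        return self._checked_out if self._checked_out is not None else self.latest

    def rows(self):
        return list(self._versions[self.version - 1])

    def add(self, new_rows):
        # Modifying operations fail while a version is checked out.
        if self._checked_out is not None:
            raise RuntimeError("table is checked out; writes are rejected")
        self._versions.append(self.rows() + list(new_rows))

    def checkout(self, version: int):
        if not 1 <= version <= self.latest:
            raise ValueError(f"unknown version: {version}")
        self._checked_out = version

    def checkout_latest(self):
        self._checked_out = None

    def restore(self, version=None):
        # With no argument, restore the currently checked out version.
        target = self.version if version is None else version
        if target != self.latest:
            # A new version is appended; history is never rewritten.
            self._versions.append(list(self._versions[target - 1]))
        self._checked_out = None  # assumption: restore leaves the table at latest


t = VersionedRows([1, 2])
t.add([3])                  # creates version 2
t.checkout(1)
print(t.version, t.rows())  # 1 [1, 2]
t.restore()
print(t.version, t.rows())  # 3 [1, 2]
```
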

schema abstractmethod property

schema: Schema

The Arrow Schema of this Table

tags abstractmethod property

tags: Tags

Tag management for the table.

Similar to Git, tags are a way to add metadata to a specific version of the table.

Warning: Tagged versions are exempt from the cleanup_old_versions() process.

To remove a version that has been tagged, you must first delete the associated tag with Tags.delete().

Examples:

table = db.open_table("my_table")
table.tags.create("v2-prod-20250203", 10)

tags = table.tags.list()
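The exemption for tagged versions can be pictured with a small model of the cleanup decision. This is a sketch under stated assumptions, not the library's algorithm: the function `versions_to_remove` and its inputs are hypothetical, the two-week default comes from the cleanup_old_versions docstring, and the rule that the latest version is never removed comes from the optimize docstring.

```python
from datetime import datetime, timedelta


def versions_to_remove(versions, tags, now, older_than=timedelta(weeks=2)):
    """Pick the versions a cleanup pass would be allowed to delete.

    versions: dict mapping version number -> commit time (hypothetical input)
    tags:     dict mapping tag name -> version number
    """
    latest = max(versions)
    tagged = set(tags.values())
    return sorted(
        v
        for v, committed in versions.items()
        if now - committed > older_than  # old enough
        and v != latest                  # the latest version is never removed
        and v not in tagged              # tagged versions are exempt
    )


now = datetime(2025, 3, 1)
versions = {
    1: datetime(2025, 1, 1),
    2: datetime(2025, 1, 15),
    3: datetime(2025, 2, 25),
    4: datetime(2025, 2, 28),
}
print(versions_to_remove(versions, {"v2-prod": 2}, now))  # [1]
print(versions_to_remove(versions, {}, now))              # [1, 2]
```

Deleting the tag is what makes version 2 eligible for removal in the second call.
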

embedding_functions abstractmethod property

embedding_functions: Dict[str, EmbeddingFunctionConfig]

Get a mapping from vector column name to its configured embedding function.

__len__

__len__() -> int

The number of rows in this Table

Source code in lancedb/table.py
def __len__(self) -> int:
    """The number of rows in this Table"""
    return self.count_rows(None)

count_rows abstractmethod

count_rows(filter: Optional[str] = None) -> int

Count the number of rows in the table.

Parameters:

  • filter (Optional[str], default: None ) –

    A SQL where clause to filter the rows to count.

Source code in lancedb/table.py
@abstractmethod
def count_rows(self, filter: Optional[str] = None) -> int:
    """
    Count the number of rows in the table.

    Parameters
    ----------
    filter: str, optional
        A SQL where clause to filter the rows to count.
    """
    raise NotImplementedError

to_pandas

to_pandas() -> 'pandas.DataFrame'

Return the table as a pandas DataFrame.

Returns:

  • DataFrame –
Source code in lancedb/table.py
def to_pandas(self) -> "pandas.DataFrame":
    """Return the table as a pandas DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    return self.to_arrow().to_pandas()

to_arrow abstractmethod

to_arrow() -> Table

Return the table as a pyarrow Table.

Returns:

Source code in lancedb/table.py
@abstractmethod
def to_arrow(self) -> pa.Table:
    """Return the table as a pyarrow Table.

    Returns
    -------
    pa.Table
    """
    raise NotImplementedError

create_index

create_index(metric='l2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None, *, index_type: VectorIndexType = 'IVF_PQ', wait_timeout: Optional[timedelta] = None, num_bits: int = 8, max_iterations: int = 50, sample_rate: int = 256, m: int = 20, ef_construction: int = 300)

Create an index on the table.

Parameters:

  • metric –

    The distance metric to use when creating the index. Valid values are "l2", "cosine", "dot", or "hamming". l2 is euclidean distance. Hamming is available only for binary vectors.

  • num_partitions –

    The number of IVF partitions to use when creating the index. Default is 256.

  • num_sub_vectors –

    The number of PQ sub-vectors to use when creating the index. Default is 96.

  • vector_column_name (str, default: VECTOR_COLUMN_NAME ) –

    The vector column name to create the index.

  • replace (bool, default: True ) –
    • If True, replace the existing index if it exists.

    • If False, raise an error if duplicate index exists.

  • accelerator (Optional[str], default: None ) –

    If set, use the given accelerator to create the index. Only "cuda" is supported for now.

  • index_cache_size (int, default: None ) –

    The size of the index cache in number of entries. Default value is 256.

  • num_bits (int, default: 8 ) –

    The number of bits to encode sub-vectors. Only used with the IVF_PQ index. Only 4 and 8 are supported.

  • wait_timeout (Optional[timedelta], default: None ) –

    The timeout to wait if indexing is asynchronous.
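As a rough guide to how num_sub_vectors and num_bits interact, the sketch below applies standard product-quantization arithmetic: each vector is encoded as num_sub_vectors codes of num_bits bits each. The helper `pq_code_bytes` is hypothetical and the 768-dimensional float32 vector is an assumed example; the numbers say nothing about Lance's actual on-disk layout or index overhead.

```python
def pq_code_bytes(num_sub_vectors: int = 96, num_bits: int = 8) -> float:
    """Size in bytes of one PQ-encoded vector (codes only, no overhead)."""
    if num_bits not in (4, 8):
        raise ValueError("only 4 and 8 bits are supported")
    return num_sub_vectors * num_bits / 8


dim = 768            # assumed embedding dimension
raw_bytes = dim * 4  # float32 takes 4 bytes per component

print(pq_code_bytes())                        # 96.0
print(pq_code_bytes(num_bits=4))              # 48.0
print(raw_bytes / pq_code_bytes())            # 32.0
```

With the defaults, an assumed 3,072-byte vector shrinks to 96 bytes of codes, a 32x reduction in this idealized calculation.
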

Source code in lancedb/table.py
def create_index(
    self,
    metric="l2",
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name: str = VECTOR_COLUMN_NAME,
    replace: bool = True,
    accelerator: Optional[str] = None,
    index_cache_size: Optional[int] = None,
    *,
    index_type: VectorIndexType = "IVF_PQ",
    wait_timeout: Optional[timedelta] = None,
    num_bits: int = 8,
    max_iterations: int = 50,
    sample_rate: int = 256,
    m: int = 20,
    ef_construction: int = 300,
):
    """Create an index on the table.

    Parameters
    ----------
    metric: str, default "l2"
        The distance metric to use when creating the index.
        Valid values are "l2", "cosine", "dot", or "hamming".
        l2 is euclidean distance.
        Hamming is available only for binary vectors.
    num_partitions: int, default 256
        The number of IVF partitions to use when creating the index.
        Default is 256.
    num_sub_vectors: int, default 96
        The number of PQ sub-vectors to use when creating the index.
        Default is 96.
    vector_column_name: str, default "vector"
        The vector column name to create the index.
    replace: bool, default True
        - If True, replace the existing index if it exists.

        - If False, raise an error if duplicate index exists.
    accelerator: str, default None
        If set, use the given accelerator to create the index.
        Only "cuda" is supported for now.
    index_cache_size : int, optional
        The size of the index cache in number of entries. Default value is 256.
    num_bits: int
        The number of bits to encode sub-vectors. Only used with the IVF_PQ index.
        Only 4 and 8 are supported.
    wait_timeout: timedelta, optional
        The timeout to wait if indexing is asynchronous.
    """
    raise NotImplementedError

drop_index

drop_index(name: str) -> None

Drop an index from the table.

Parameters:

  • name (str) –

    The name of the index to drop.

Notes

This does not delete the index from disk; it only removes it from the table. To delete the index, run optimize after dropping the index.

Use list_indices to find the names of the indices.

Source code in lancedb/table.py
def drop_index(self, name: str) -> None:
    """
    Drop an index from the table.

    Parameters
    ----------
    name: str
        The name of the index to drop.

    Notes
    -----
    This does not delete the index from disk; it only removes it from the table.
    To delete the index, run [optimize][lancedb.table.Table.optimize]
    after dropping the index.

    Use [list_indices][lancedb.table.Table.list_indices] to find the names of
    the indices.
    """
    raise NotImplementedError

wait_for_index

wait_for_index(index_names: Iterable[str], timeout: timedelta = timedelta(seconds=300)) -> None

Wait for indexing to complete for the given index names. This will poll the table until all the indices are fully indexed, or raise a timeout exception if the timeout is reached.

Parameters:

  • index_names (Iterable[str]) –

    The names of the indices to poll

  • timeout (timedelta, default: timedelta(seconds=300) ) –

    Timeout to wait for asynchronous indexing. The default is 5 minutes.
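The poll-until-done behavior described above follows a common pattern, sketched here in plain Python. The function `wait_until`, the `is_ready` callback, and the poll interval are all hypothetical; how the library actually checks index status, and which exception type it raises, are not shown in this section.

```python
import time
from datetime import timedelta


def wait_until(is_ready, timeout=timedelta(seconds=300), poll_interval=0.01):
    """Call is_ready() repeatedly until it returns True or the timeout passes."""
    deadline = time.monotonic() + timeout.total_seconds()
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError("indexing did not finish before the timeout")
        time.sleep(poll_interval)


# Stand-in for "are all indices fully built?": ready on the third check.
checks = iter([False, False, True])
wait_until(lambda: next(checks), timeout=timedelta(seconds=5))
print("ready")
```
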

Source code in lancedb/table.py
def wait_for_index(
    self, index_names: Iterable[str], timeout: timedelta = timedelta(seconds=300)
) -> None:
    """
    Wait for indexing to complete for the given index names.
    This will poll the table until all the indices are fully indexed,
    or raise a timeout exception if the timeout is reached.

    Parameters
    ----------
    index_names: Iterable[str]
        The names of the indices to poll
    timeout: timedelta
        Timeout to wait for asynchronous indexing. The default is 5 minutes.
    """
    raise NotImplementedError

stats abstractmethod

stats() -> TableStatistics

Retrieve table and fragment statistics.

Source code in lancedb/table.py
@abstractmethod
def stats(self) -> TableStatistics:
    """
    Retrieve table and fragment statistics.
    """
    raise NotImplementedError

create_scalar_index abstractmethod

create_scalar_index(column: str, *, replace: bool = True, index_type: ScalarIndexType = 'BTREE', wait_timeout: Optional[timedelta] = None)

Create a scalar index on a column.

Parameters:

  • column (str) –

    The column to be indexed. Must be a boolean, integer, float, or string column.

  • replace (bool, default: True ) –

    Replace the existing index if it exists.

  • index_type (ScalarIndexType, default: 'BTREE' ) –

    The type of index to create.

  • wait_timeout (Optional[timedelta], default: None ) –

    The timeout to wait if indexing is asynchronous.

Examples:

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> my_df = img_table.search().where("my_col = 7",
...                                  prefilter=True).to_pandas()

Scalar indices can also speed up scans containing a vector search and a prefilter:

>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> (
...     img_table.search([1, 2, 3, 4], vector_column_name="vector")
...     .where("my_col != 7", prefilter=True)
...     .to_pandas()
... )

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2)).

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col > 100).

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.
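The rules above can be summarized with a deliberately simplified model of a filter tree. This is only an illustration: the tuple encoding and the function `can_use_scalar_index` are hypothetical, and the AND case assumes that an index on one side is enough to narrow the scan, which the text implies but does not state outright. The real query planner may decide differently.

```python
def can_use_scalar_index(expr, indexed):
    """Decide whether a filter can benefit from scalar indices (toy model).

    expr is ("col", name), ("and", left, right) or ("or", left, right).
    indexed is the set of column names that have a scalar index.
    """
    kind = expr[0]
    if kind == "col":
        return expr[1] in indexed
    left = can_use_scalar_index(expr[1], indexed)
    right = can_use_scalar_index(expr[2], indexed)
    if kind == "and":
        # Assumption: one indexed side can narrow the rows to check.
        return left or right
    if kind == "or":
        # A non-indexed side forces a full scan, so the index cannot help.
        return left and right
    raise ValueError(f"unknown node type: {kind!r}")


indexed = {"my_col", "other_col"}
both = ("and", ("col", "my_col"), ("col", "other_col"))
mixed_or = ("or", ("col", "my_col"), ("col", "not_indexed"))
mixed_and = ("and", ("col", "my_col"), ("col", "not_indexed"))

print(can_use_scalar_index(both, indexed))       # True
print(can_use_scalar_index(mixed_or, indexed))   # False
print(can_use_scalar_index(mixed_and, indexed))  # True
```
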

Source code in lancedb/table.py
@abstractmethod
def create_scalar_index(
    self,
    column: str,
    *,
    replace: bool = True,
    index_type: ScalarIndexType = "BTREE",
    wait_timeout: Optional[timedelta] = None,
):
    """Create a scalar index on a column.

    Parameters
    ----------
    column : str
        The column to be indexed.  Must be a boolean, integer, float,
        or string column.
    replace : bool, default True
        Replace the existing index if it exists.
    index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"], default "BTREE"
        The type of index to create.
    wait_timeout: timedelta, optional
        The timeout to wait if indexing is asynchronous.
    Examples
    --------

    Scalar indices, like vector indices, can be used to speed up scans.  A scalar
    index can speed up scans that contain filter expressions on the indexed column.
    For example, the following scan will be faster if the column ``my_col`` has
    a scalar index:

    >>> import lancedb # doctest: +SKIP
    >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
    >>> img_table = db.open_table("images") # doctest: +SKIP
    >>> my_df = img_table.search().where("my_col = 7", # doctest: +SKIP
    ...                                  prefilter=True).to_pandas()

    Scalar indices can also speed up scans containing a vector search and a
    prefilter:

    >>> import lancedb # doctest: +SKIP
    >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
    >>> img_table = db.open_table("images") # doctest: +SKIP
    >>> (  # doctest: +SKIP
    ...     img_table.search([1, 2, 3, 4], vector_column_name="vector")
    ...     .where("my_col != 7", prefilter=True)
    ...     .to_pandas()
    ... )

    Scalar indices can only speed up scans for basic filters using
    equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
    membership (e.g. ``my_col IN (0, 1, 2)``).

    Scalar indices can be used if the filter contains multiple indexed columns and
    the filter criteria are AND'd or OR'd together
    (e.g. ``my_col < 0 AND other_col > 100``).

    Scalar indices may be used if the filter contains non-indexed columns but,
    depending on the structure of the filter, they may not be usable.  For example,
    if the column ``not_indexed`` does not have a scalar index then the filter
    ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
    ``my_col``.
    """
    raise NotImplementedError

create_fts_index

create_fts_index(field_names: Union[str, List[str]], *, ordering_field_names: Optional[Union[str, List[str]]] = None, replace: bool = False, writer_heap_size: Optional[int] = 1024 * 1024 * 1024, use_tantivy: bool = True, tokenizer_name: Optional[str] = None, with_position: bool = True, base_tokenizer: BaseTokenizerType = 'simple', language: str = 'English', max_token_length: Optional[int] = 40, lower_case: bool = True, stem: bool = False, remove_stop_words: bool = False, ascii_folding: bool = False, wait_timeout: Optional[timedelta] = None)

Create a full-text search index on the table.

Warning - this API is highly experimental and is highly likely to change in the future.

Parameters:

  • field_names (Union[str, List[str]]) –

    The name(s) of the field(s) to index. For now, only a single str is accepted when use_tantivy=True.

  • replace (bool, default: False ) –

    If True, replace the existing index if it exists. Note that this is not yet an atomic operation; the index will be temporarily unavailable while the new index is being created.

  • writer_heap_size (Optional[int], default: 1024 * 1024 * 1024 ) –

    Only available with use_tantivy=True

  • ordering_field_names (Optional[Union[str, List[str]]], default: None ) –

    A list of fields of unsigned type to index, which can optionally be used to order results at search time. Only available with use_tantivy=True.

  • tokenizer_name (Optional[str], default: None ) –

    The tokenizer to use for the index. Can be "raw", "default" or the two-letter language code followed by "_stem". For English, this would be "en_stem". For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html

  • use_tantivy (bool, default: True ) –

    If True, use the legacy full-text search implementation based on tantivy. If False, use the new full-text search implementation based on lance-index.

  • with_position (bool, default: True ) –

    Only available with use_tantivy=False. If False, do not store the positions of the terms in the text. This can reduce the size of the index and improve indexing speed, but phrase queries will raise an exception.

  • base_tokenizer (str, default: "simple" ) –

    The base tokenizer to use for tokenization. Options are:

    • "simple": Splits text by whitespace and punctuation.

    • "whitespace": Split text by whitespace, but not punctuation.

    • "raw": No tokenization. The entire text is treated as a single token.

  • language (str, default: "English" ) –

    The language to use for tokenization.

  • max_token_length (int, default: 40 ) –

    The maximum token length to index. Tokens longer than this length will be ignored.

  • lower_case (bool, default: True ) –

    Whether to convert the token to lower case. This makes queries case-insensitive.

  • stem (bool, default: False ) –

    Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".

  • remove_stop_words (bool, default: False ) –

    Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".

  • ascii_folding (bool, default: False ) –

    Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".

  • wait_timeout (Optional[timedelta], default: None ) –

    The timeout to wait if indexing is asynchronous.
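To make the tokenizer options concrete, the following sketch approximates what the "simple" base tokenizer combined with max_token_length, lower_case, remove_stop_words, and ascii_folding might do to a string. It is a plain-Python approximation, not the lance-index or tantivy implementation: the function `tokenize`, the order of the steps, and the two-word stop list are all assumptions, and stemming is left out.

```python
import re
import unicodedata

STOP_WORDS = {"the", "and"}  # tiny illustrative list, not the library's


def tokenize(text, *, max_token_length=40, lower_case=True,
             remove_stop_words=False, ascii_folding=False):
    """Approximate the documented token filters on top of a simple split."""
    tokens = []
    # "simple": split on whitespace and punctuation (\w+ also keeps underscores).
    for tok in re.findall(r"\w+", text):
        if max_token_length is not None and len(tok) > max_token_length:
            continue  # overly long tokens are ignored
        if lower_case:
            tok = tok.lower()
        if ascii_folding:
            # Decompose accented characters, then drop the combining marks.
            decomposed = unicodedata.normalize("NFKD", tok)
            tok = decomposed.encode("ascii", "ignore").decode("ascii")
        if remove_stop_words and tok.lower() in STOP_WORDS:
            continue
        tokens.append(tok)
    return tokens


print(tokenize("The café and the Bar"))
# ['the', 'café', 'and', 'the', 'bar']
print(tokenize("The café and the Bar", remove_stop_words=True, ascii_folding=True))
# ['cafe', 'bar']
```
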

Source code in lancedb/table.py
def create_fts_index(
    self,
    field_names: Union[str, List[str]],
    *,
    ordering_field_names: Optional[Union[str, List[str]]] = None,
    replace: bool = False,
    writer_heap_size: Optional[int] = 1024 * 1024 * 1024,
    use_tantivy: bool = True,
    tokenizer_name: Optional[str] = None,
    with_position: bool = True,
    # tokenizer configs:
    base_tokenizer: BaseTokenizerType = "simple",
    language: str = "English",
    max_token_length: Optional[int] = 40,
    lower_case: bool = True,
    stem: bool = False,
    remove_stop_words: bool = False,
    ascii_folding: bool = False,
    wait_timeout: Optional[timedelta] = None,
):
    """Create a full-text search index on the table.

    Warning - this API is highly experimental and is highly likely to change
    in the future.

    Parameters
    ----------
    field_names: str or list of str
        The name(s) of the field(s) to index.
        For now, only a single str is accepted when use_tantivy=True.
    replace: bool, default False
        If True, replace the existing index if it exists. Note that this is
        not yet an atomic operation; the index will be temporarily
        unavailable while the new index is being created.
    writer_heap_size: int, default 1GB
        Only available with use_tantivy=True
    ordering_field_names:
        A list of fields of unsigned type to index, which can optionally be
        used to order results at search time.
        Only available with use_tantivy=True.
    tokenizer_name: str, default "default"
        The tokenizer to use for the index. Can be "raw", "default" or the two-letter
        language code followed by "_stem". For English, this would be "en_stem".
        For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
    use_tantivy: bool, default True
        If True, use the legacy full-text search implementation based on tantivy.
        If False, use the new full-text search implementation based on lance-index.
    with_position: bool, default True
        Only available with use_tantivy=False.
        If False, do not store the positions of the terms in the text.
        This can reduce the size of the index and improve indexing speed,
        but phrase queries will raise an exception.
    base_tokenizer : str, default "simple"
        The base tokenizer to use for tokenization. Options are:
        - "simple": Splits text by whitespace and punctuation.
        - "whitespace": Split text by whitespace, but not punctuation.
        - "raw": No tokenization. The entire text is treated as a single token.
    language : str, default "English"
        The language to use for tokenization.
    max_token_length : int, default 40
        The maximum token length to index. Tokens longer than this length will be
        ignored.
    lower_case : bool, default True
        Whether to convert the token to lower case. This makes queries
        case-insensitive.
    stem : bool, default False
        Whether to stem the token. Stemming reduces words to their root form.
        For example, in English "running" and "runs" would both be reduced to "run".
    remove_stop_words : bool, default False
        Whether to remove stop words. Stop words are common words that are often
        removed from text before indexing. For example, in English "the" and "and".
    ascii_folding : bool, default False
        Whether to fold ASCII characters. This converts accented characters to
        their ASCII equivalent. For example, "café" would be converted to "cafe".
    wait_timeout: timedelta, optional
        The timeout to wait if indexing is asynchronous.
    """
    raise NotImplementedError

add abstractmethod

add(data: DATA, mode: AddMode = 'append', on_bad_vectors: OnBadVectorsType = 'error', fill_value: float = 0.0) -> AddResult

Add more data to the Table.

Parameters:

  • data (DATA) –

    The data to insert into the table. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • mode (AddMode, default: 'append' ) –

    The mode to use when writing the data. Valid values are "append" and "overwrite".

  • on_bad_vectors (OnBadVectorsType, default: 'error' ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default: 0.0 ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

Returns:

  • AddResult –

    An object containing the new version number of the table after adding data.
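The three on_bad_vectors modes can be illustrated with a small stand-alone function. This is a sketch of the documented behavior rather than the library's code: `sanitize_vectors` is hypothetical, it handles bare vectors instead of whole rows, and it assumes that "fill" replaces a bad vector with one made entirely of fill_value.

```python
import math


def sanitize_vectors(vectors, dim, on_bad_vectors="error", fill_value=0.0):
    """Apply an on_bad_vectors policy to a list of vectors (toy model)."""
    if on_bad_vectors not in ("error", "drop", "fill"):
        raise ValueError(f"unknown on_bad_vectors mode: {on_bad_vectors!r}")

    def is_bad(vec):
        # "Bad" means the wrong length or at least one NaN component.
        return len(vec) != dim or any(math.isnan(x) for x in vec)

    result = []
    for vec in vectors:
        if not is_bad(vec):
            result.append(list(vec))
        elif on_bad_vectors == "error":
            raise ValueError(f"bad vector: {vec!r}")
        elif on_bad_vectors == "fill":
            result.append([fill_value] * dim)
        # "drop": skip the bad vector entirely
    return result


vectors = [[1.0, 2.0], [3.0], [float("nan"), 4.0]]
print(sanitize_vectors(vectors, dim=2, on_bad_vectors="drop"))
# [[1.0, 2.0]]
print(sanitize_vectors(vectors, dim=2, on_bad_vectors="fill"))
# [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
```
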

Source code in lancedb/table.py
@abstractmethod
def add(
    self,
    data: DATA,
    mode: AddMode = "append",
    on_bad_vectors: OnBadVectorsType = "error",
    fill_value: float = 0.0,
) -> AddResult:
    """Add more data to the [Table](Table).

    Parameters
    ----------
    data: DATA
        The data to insert into the table. Acceptable types are:

        - list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    mode: str
        The mode to use when writing the data. Valid values are
        "append" and "overwrite".
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contain NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    Returns
    -------
    AddResult
        An object containing the new version number of the table after adding data.
    """
    raise NotImplementedError

merge_insert

merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation.

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january").

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records:

  • "Matched" records exist in both the source table and the target table.
  • "Not matched" records exist only in the source table (i.e. they are new data).
  • "Not matched by source" records exist only in the target table (i.e. they are old data).

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

  • on (Union[str, Iterable[str]]) –

    A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.
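
The three record categories can be illustrated with plain Python sets. This is a sketch of the join semantics described above, not the hash join LanceDB actually runs, and classify_rows is a hypothetical helper name.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def classify_rows(
    source: Sequence[Row], target: Sequence[Row], on: Sequence[str]
) -> Dict[str, List[Tuple]]:
    """Group join keys into the three merge-insert categories (toy model)."""

    def key(row: Row) -> Tuple:
        # The join key is the tuple of values in the `on` columns.
        return tuple(row[col] for col in on)

    source_keys = {key(r) for r in source}
    target_keys = {key(r) for r in target}
    return {
        "matched": sorted(source_keys & target_keys),
        "not_matched": sorted(source_keys - target_keys),
        "not_matched_by_source": sorted(target_keys - source_keys),
    }


# Existing rows (target table) and incoming rows (source table).
target = [{"a": 2, "b": "a"}, {"a": 1, "b": "b"}, {"a": 3, "b": "c"}]
source = [{"a": 2, "b": "x"}, {"a": 3, "b": "y"}, {"a": 4, "b": "z"}]

categories = classify_rows(source, target, on=["a"])
```

With this data, keys 2 and 3 are matched, key 4 is not matched, and key 1 is not matched by source. An upsert therefore updates two rows and inserts one.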

Examples:

>>> import lancedb
>>> import pyarrow as pa
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation
>>> res = table.merge_insert("a")     \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
Source code in lancedb/table.py
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
    """
    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
    that can be used to create a "merge insert" operation

    This operation can add rows, update rows, and remove rows all in a single
    transaction. It is a very generic tool that can be used to create
    behaviors like "insert if not exists", "update or insert (i.e. upsert)",
    or even replace a portion of existing data with new data (e.g. replace
    all data where month="january")

    The merge insert operation works by combining new data from a
    **source table** with existing data in a **target table** by using a
    join.  There are three categories of records.

    "Matched" records are records that exist in both the source table and
    the target table. "Not matched" records exist only in the source table
    (e.g. these are new data) "Not matched by source" records exist only
    in the target table (this is old data)

    The builder returned by this method can be used to customize what
    should happen for each category of data.

    Please note that the data may appear to be reordered as part of this
    operation.  This is because updated rows will be deleted from the
    dataset and then reinserted at the end with the new values.

    Parameters
    ----------

    on: Union[str, Iterable[str]]
        A column (or columns) to join on.  This is how records from the
        source table and target table are matched.  Typically this is some
        kind of key or id column.

    Examples
    --------
    >>> import lancedb
    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
    >>> # Perform a "upsert" operation
    >>> res = table.merge_insert("a")     \\
    ...      .when_matched_update_all()     \\
    ...      .when_not_matched_insert_all() \\
    ...      .execute(new_data)
    >>> res
    MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
    >>> # The order of new rows is non-deterministic since we use
    >>> # a hash-join as part of this operation and so we sort here
    >>> table.to_arrow().sort_by("a").to_pandas()
       a  b
    0  1  b
    1  2  x
    2  3  y
    3  4  z
    """  # noqa: E501
    on = [on] if isinstance(on, str) else list(iter(on))

    return LanceMergeInsertBuilder(self, on)

search abstractmethod

search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple, FullTextQuery]] = None, vector_column_name: Optional[str] = None, query_type: QueryType = 'auto', ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None) -> LanceQueryBuilder

Create a search query to find the nearest neighbors of the given query vector. Both vector search and full-text search are currently supported.

All query options are defined in LanceQueryBuilder.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
...     .where("original_width > 1000", prefilter=True)
...     .select(["caption", "original_width", "vector"])
...     .limit(2)
...     .to_pandas())
  caption  original_width           vector  _distance
0     foo            2000  [0.5, 3.4, 1.3]   5.220000
1    test            3000  [0.3, 6.2, 2.6]  23.089996

Parameters:

  • query (Optional[Union[VEC, str, 'PIL.Image.Image', Tuple, FullTextQuery]], default: None ) –

    The target vector to search for.

    • Defaults to None. Acceptable types are: list, np.ndarray, str, PIL.Image.Image.

    • If None, then the select/where/limit clauses are applied to filter the table.

  • vector_column_name (Optional[str], default: None ) –

    The name of the vector column to search.

    The vector column needs to be a pyarrow fixed size list type

    • If not specified then the vector column is inferred from the table schema

    • If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.

  • query_type (QueryType, default: 'auto' ) –

    Defaults to "auto". Acceptable types are: "vector", "fts", "hybrid", or "auto".

    • If "auto", then the query type is inferred from the query:

      • If query is a list/np.ndarray, then the query type is "vector".

      • If query is a PIL.Image.Image, then either do vector search or raise an error if no corresponding embedding function is found.

      • If query is a string, then the query type is "vector" if the table has embedding functions; otherwise the query type is "fts".
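
The "auto" inference rules can be written down as a small decision function. This is a simplified sketch of the documented rules, not the library's code: infer_query_type and its has_embedding_function flag are hypothetical, and only the list and string cases are modelled (np.ndarray and PIL images are left out to keep the example free of third-party imports).

```python
from typing import Any


def infer_query_type(query: Any, has_embedding_function: bool) -> str:
    """Toy version of the documented "auto" rules (hypothetical helper)."""
    if isinstance(query, list):
        # A raw vector always means vector search.
        return "vector"
    if isinstance(query, str):
        # A string is embedded if the table can do so, else it is a text query.
        return "vector" if has_embedding_function else "fts"
    raise TypeError(f"query type not covered by this sketch: {type(query).__name__}")


vector_case = infer_query_type([0.4, 1.4, 2.4], has_embedding_function=False)
embedded_text_case = infer_query_type("a cat", has_embedding_function=True)
plain_text_case = infer_query_type("a cat", has_embedding_function=False)
```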

Returns:

  • LanceQueryBuilder –

    A query builder object representing the query. Once executed, the query returns:

    • the selected columns

    • the vector

    • the "_distance" column, which is the distance between the query vector and the returned vector.
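
The _distance values in the example above are consistent with squared Euclidean (L2) distance. Assuming that metric is in effect, they can be reproduced by hand with the sketch below; squared_l2 is a hypothetical helper, not part of the LanceDB API.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [0.4, 1.4, 2.4]
d_foo = squared_l2(query, [0.5, 3.4, 1.3])   # close to 5.22
d_test = squared_l2(query, [0.3, 6.2, 2.6])  # close to 23.09
```

The small difference from the 23.089996 shown in the example output is what one would expect if the stored vectors use 32-bit floats.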

Source code in lancedb/table.py
@abstractmethod
def search(
    self,
    query: Optional[
        Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]
    ] = None,
    vector_column_name: Optional[str] = None,
    query_type: QueryType = "auto",
    ordering_field_name: Optional[str] = None,
    fts_columns: Optional[Union[str, List[str]]] = None,
) -> LanceQueryBuilder:
    """Create a search query to find the nearest neighbors
    of the given query vector. We currently support [vector search][search]
    and [full-text search][experimental-full-text-search].

    All query options are defined in
    [LanceQueryBuilder][lancedb.query.LanceQueryBuilder].

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [
    ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
    ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
    ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
    ... ]
    >>> table = db.create_table("my_table", data)
    >>> query = [0.4, 1.4, 2.4]
    >>> (table.search(query)
    ...     .where("original_width > 1000", prefilter=True)
    ...     .select(["caption", "original_width", "vector"])
    ...     .limit(2)
    ...     .to_pandas())
      caption  original_width           vector  _distance
    0     foo            2000  [0.5, 3.4, 1.3]   5.220000
    1    test            3000  [0.3, 6.2, 2.6]  23.089996

    Parameters
    ----------
    query: list/np.ndarray/str/PIL.Image.Image, default None
        The targeted vector to search for.

        - *default None*.
        Acceptable types are: list, np.ndarray, PIL.Image.Image

        - If None then the select/where/limit clauses are applied to filter
        the table
    vector_column_name: str, optional
        The name of the vector column to search.

        The vector column needs to be a pyarrow fixed size list type

        - If not specified then the vector column is inferred from
        the table schema

        - If the table has multiple vector columns then the *vector_column_name*
        needs to be specified. Otherwise, an error is raised.
    query_type: str
        *default "auto"*.
        Acceptable types are: "vector", "fts", "hybrid", or "auto"

        - If "auto" then the query type is inferred from the query;

            - If `query` is a list/np.ndarray then the query type is
            "vector";

            - If `query` is a PIL.Image.Image then either do vector search,
            or raise an error if no corresponding embedding function is found.

        - If `query` is a string, then the query type is "vector" if the
        table has embedding functions else the query type is "fts"

    Returns
    -------
    LanceQueryBuilder
        A query builder object representing the query.
        Once executed, the query returns

        - selected columns

        - the vector

        - and also the "_distance" column which is the distance between the query
        vector and the returned vector.
    """
    raise NotImplementedError

delete abstractmethod

delete(where: str) -> DeleteResult

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

  • where (str) –

    The SQL where clause to use when deleting rows.

    • For example, 'x = 2' or 'x IN (1, 2, 3)'.

    The filter must not be empty, or it will error.

Returns:

  • DeleteResult –

    An object containing the new version number of the table after deletion.

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1.0, 2]},
...    {"x": 2, "vector": [3.0, 4]},
...    {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
DeleteResult(version=2)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
DeleteResult(version=3)
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
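
When the values to delete are strings rather than numbers, they need to be quoted before being joined into the IN list. The following is a minimal sketch, assuming standard SQL string-literal escaping (a single quote is doubled); sql_in_predicate is a hypothetical helper and not part of LanceDB.

```python
from typing import Iterable, Union


def sql_in_predicate(column: str, values: Iterable[Union[int, float, str]]) -> str:
    """Build a `column IN (...)` predicate string (hypothetical helper)."""

    def literal(value: Union[int, float, str]) -> str:
        if isinstance(value, bool):
            raise TypeError("booleans are not handled by this sketch")
        if isinstance(value, (int, float)):
            return str(value)
        if isinstance(value, str):
            # Assumption: standard SQL escaping, where ' becomes ''.
            return "'" + value.replace("'", "''") + "'"
        raise TypeError(f"unsupported value: {value!r}")

    rendered = [literal(v) for v in values]
    if not rendered:
        # An empty list cannot produce a valid, non-empty filter.
        raise ValueError("at least one value is required")
    return f"{column} IN ({', '.join(rendered)})"


numeric_filter = sql_in_predicate("x", [1, 5])
string_filter = sql_in_predicate("name", ["a", "O'Brien"])
```

The resulting string can then be passed as the where argument of delete.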
Source code in lancedb/table.py
@abstractmethod
def delete(self, where: str) -> DeleteResult:
    """Delete rows from the table.

    This can be used to delete a single row, many rows, all rows, or
    sometimes no rows (if your predicate matches nothing).

    Parameters
    ----------
    where: str
        The SQL where clause to use when deleting rows.

        - For example, 'x = 2' or 'x IN (1, 2, 3)'.

        The filter must not be empty, or it will error.

    Returns
    -------
    DeleteResult
        An object containing the new version number of the table after deletion.

    Examples
    --------
    >>> import lancedb
    >>> data = [
    ...    {"x": 1, "vector": [1.0, 2]},
    ...    {"x": 2, "vector": [3.0, 4]},
    ...    {"x": 3, "vector": [5.0, 6]}
    ... ]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.delete("x = 2")
    DeleteResult(version=2)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  3  [5.0, 6.0]

    If you have a list of values to delete, you can combine them into a
    stringified list and use the `IN` operator:

    >>> to_remove = [1, 5]
    >>> to_remove = ", ".join([str(v) for v in to_remove])
    >>> to_remove
    '1, 5'
    >>> table.delete(f"x IN ({to_remove})")
    DeleteResult(version=3)
    >>> table.to_pandas()
       x      vector
    0  3  [5.0, 6.0]
    """
    raise NotImplementedError

update abstractmethod

update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None) -> UpdateResult

Update rows in the table. This can update anywhere from zero to all rows, depending on how many rows match the where clause. If no where clause is provided, all rows are updated.

Either values or values_sql must be provided. You cannot provide both.

Parameters:

  • where (Optional[str], default: None ) –

    The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.

  • values (Optional[dict], default: None ) –

    The values to update. The keys are the column names and the values are the values to set.

  • values_sql (Optional[Dict[str, str]], default: None ) –

    The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.

Returns:

  • UpdateResult –
    • rows_updated: The number of rows that were updated
    • version: The new version number of the table after the update

Examples:

>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10.0, 10]})
UpdateResult(rows_updated=1, version=2)
>>> table.to_pandas()
   x        vector
0  1    [1.0, 2.0]
1  3    [5.0, 6.0]
2  2  [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
UpdateResult(rows_updated=3, version=3)
>>> table.to_pandas()
   x        vector
0  2    [1.0, 2.0]
1  4    [5.0, 6.0]
2  3  [10.0, 10.0]
Source code in lancedb/table.py
@abstractmethod
def update(
    self,
    where: Optional[str] = None,
    values: Optional[dict] = None,
    *,
    values_sql: Optional[Dict[str, str]] = None,
) -> UpdateResult:
    """
    This can be used to update zero to all rows depending on how many
    rows match the where clause. If no where clause is provided, then
    all rows will be updated.

    Either `values` or `values_sql` must be provided. You cannot provide
    both.

    Parameters
    ----------
    where: str, optional
        The SQL where clause to use when updating rows. For example, 'x = 2'
        or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
    values: dict, optional
        The values to update. The keys are the column names and the values
        are the values to set.
    values_sql: dict, optional
        The values to update, expressed as SQL expression strings. These can
        reference existing columns. For example, {"x": "x + 1"} will increment
        the x column by 1.

    Returns
    -------
    UpdateResult
        - rows_updated: The number of rows that were updated
        - version: The new version number of the table after the update

    Examples
    --------
    >>> import lancedb
    >>> import pandas as pd
    >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.update(where="x = 2", values={"vector": [10.0, 10]})
    UpdateResult(rows_updated=1, version=2)
    >>> table.to_pandas()
       x        vector
    0  1    [1.0, 2.0]
    1  3    [5.0, 6.0]
    2  2  [10.0, 10.0]
    >>> table.update(values_sql={"x": "x + 1"})
    UpdateResult(rows_updated=3, version=3)
    >>> table.to_pandas()
       x        vector
    0  2    [1.0, 2.0]
    1  4    [5.0, 6.0]
    2  3  [10.0, 10.0]
    """
    raise NotImplementedError

cleanup_old_versions abstractmethod

cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> 'CleanupStats'

Clean up old versions of the table, freeing disk space.

Parameters:

  • older_than (Optional[timedelta], default: None ) –

    The minimum age of the version to delete. If None, then this defaults to two weeks.

  • delete_unverified (bool, default: False ) –

    Because they may be part of an in-progress transaction, files less than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, you can set this to True to delete all files older than older_than.

Returns:

  • CleanupStats –

    The stats of the cleanup operation, including how many bytes were freed.
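
The effect of older_than can be sketched as a simple age filter over version timestamps. This is a toy model and not how Lance tracks files: versions_eligible_for_cleanup is a hypothetical helper, the 7-day rule for unverified files is not modelled, and keeping the latest version is an assumption carried over from the optimize documentation.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def versions_eligible_for_cleanup(
    versions: Dict[int, datetime],
    now: datetime,
    older_than: Optional[timedelta] = None,
) -> List[int]:
    """Return version numbers older than the cutoff (toy model)."""
    if older_than is None:
        older_than = timedelta(weeks=2)  # documented default
    if not versions:
        return []
    latest = max(versions)  # assumption: the latest version is always kept
    cutoff = now - older_than
    return sorted(
        v for v, created in versions.items() if v != latest and created < cutoff
    )


now = datetime(2024, 1, 31)
versions = {
    1: datetime(2024, 1, 1),
    2: datetime(2024, 1, 20),
    3: datetime(2024, 1, 30),
}
default_cleanup = versions_eligible_for_cleanup(versions, now)
aggressive_cleanup = versions_eligible_for_cleanup(versions, now, timedelta(0))
```

With the default two-week window only version 1 qualifies; with a zero timedelta every version except the latest does.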

See Also

Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.

Notes

This function is not available in LanceDB Cloud (since LanceDB Cloud manages cleanup for you automatically).

Source code in lancedb/table.py
@abstractmethod
def cleanup_old_versions(
    self,
    older_than: Optional[timedelta] = None,
    *,
    delete_unverified: bool = False,
) -> "CleanupStats":
    """
    Clean up old versions of the table, freeing disk space.

    Parameters
    ----------
    older_than: timedelta, default None
        The minimum age of the version to delete. If None, then this defaults
        to two weeks.
    delete_unverified: bool, default False
        Because they may be part of an in-progress transaction, files newer
        than 7 days old are not deleted by default. If you are sure that
        there are no in-progress transactions, then you can set this to True
        to delete all files older than `older_than`.

    Returns
    -------
    CleanupStats
        The stats of the cleanup operation, including how many bytes were
        freed.

    See Also
    --------
    [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
        optimization operation that includes cleanup as well as other operations.

    Notes
    -----
    This function is not available in LanceDB Cloud (since LanceDB
    Cloud manages cleanup for you automatically)
    """

compact_files abstractmethod

compact_files(*args, **kwargs)

Run the compaction process on the table. This can be run after making several small appends to optimize the table for faster reads.

Arguments are passed onto Lance's compact_files. For most cases, the default should be fine.

See Also

Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.

Notes

This function is not available in LanceDB Cloud (since LanceDB Cloud manages compaction for you automatically)

Source code in lancedb/table.py
@abstractmethod
def compact_files(self, *args, **kwargs):
    """
    Run the compaction process on the table.
    This can be run after making several small appends to optimize the table
    for faster reads.

    Arguments are passed onto Lance's
    [compact_files][lance.dataset.DatasetOptimizer.compact_files].
    For most cases, the default should be fine.

    See Also
    --------
    [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
        optimization operation that includes cleanup as well as other operations.

    Notes
    -----
    This function is not available in LanceDB Cloud (since LanceDB
    Cloud manages compaction for you automatically)
    """

optimize abstractmethod

optimize(*, cleanup_older_than: Optional[timedelta] = None, delete_unverified: bool = False, retrain: bool = False)

Optimize the on-disk data and indices for better performance.

Modeled after VACUUM in PostgreSQL.

Optimization covers three operations:

  • Compaction: Merges small files into larger ones
  • Prune: Removes old versions of the dataset
  • Index: Optimizes the indices, adding new data to existing indices

Parameters:

  • cleanup_older_than (Optional[timedelta], default: None ) –

    All files belonging to versions older than this will be removed. If None, this defaults to 7 days. Set to 0 days to remove all versions except the latest. The latest version is never removed.

  • delete_unverified (bool, default: False ) –

    Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.

  • retrain (bool, default: False ) –

    If True, retrain the vector indices. This refines the IVF clustering and quantization, which may improve search accuracy. Retraining is faster than re-creating the index from scratch, so it is recommended to try this first when the data distribution has changed significantly.

Experimental API

The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.

That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.

How often an application should call optimize depends on the frequency of data modifications. If data is frequently added, deleted, or updated, then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.
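
The rule of thumb above can be encoded as a small check that an application consults after each write. This is only a sketch: should_optimize and its counters are hypothetical, and the application is assumed to track them itself.

```python
def should_optimize(records_modified: int, modification_ops: int) -> bool:
    """Apply the documented rule of thumb (hypothetical helper).

    records_modified: rows added or modified since the last optimize call.
    modification_ops: data modification operations since the last optimize call.
    """
    return records_modified >= 100_000 or modification_ops > 20


few_small_writes = should_optimize(records_modified=5_000, modification_ops=3)
one_large_write = should_optimize(records_modified=100_000, modification_ops=1)
many_small_writes = should_optimize(records_modified=2_100, modification_ops=21)
```

When the check returns True, the application would call optimize on the table and reset both counters.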

Source code in lancedb/table.py
@abstractmethod
def optimize(
    self,
    *,
    cleanup_older_than: Optional[timedelta] = None,
    delete_unverified: bool = False,
    retrain: bool = False,
):
    """
    Optimize the on-disk data and indices for better performance.

    Modeled after ``VACUUM`` in PostgreSQL.

    Optimization covers three operations:

     * Compaction: Merges small files into larger ones
     * Prune: Removes old versions of the dataset
     * Index: Optimizes the indices, adding new data to existing indices

    Parameters
    ----------
    cleanup_older_than: timedelta, optional default 7 days
        All files belonging to versions older than this will be removed.  Set
        to 0 days to remove all versions except the latest.  The latest version
        is never removed.
    delete_unverified: bool, default False
        Files leftover from a failed transaction may appear to be part of an
        in-progress operation (e.g. appending new data) and these files will not
        be deleted unless they are at least 7 days old. If delete_unverified is True
        then these files will be deleted regardless of their age.
    retrain: bool, default False
        If True, retrain the vector indices, this would refine the IVF clustering
        and quantization, which may improve the search accuracy. It's faster than
        re-creating the index from scratch, so it's recommended to try this first,
        when the data distribution has changed significantly.

    Experimental API
    ----------------

    The optimization process is undergoing active development and may change.
    Our goal with these changes is to improve the performance of optimization and
    reduce the complexity.

    That being said, it is essential today to run optimize if you want the best
    performance.  It should be stable and safe to use in production, but it is our
    hope that the API may be simplified (or not even need to be called) in the
    future.

    The frequency an application should call optimize is based on the frequency of
    data modifications.  If data is frequently added, deleted, or updated then
    optimize should be run frequently.  A good rule of thumb is to run optimize if
    you have added or modified 100,000 or more records or run more than 20 data
    modification operations.
    """

list_indices abstractmethod

list_indices() -> Iterable[IndexConfig]

List all indices that have been created with Table.create_index

Source code in lancedb/table.py
@abstractmethod
def list_indices(self) -> Iterable[IndexConfig]:
    """
    List all indices that have been created with
    [Table.create_index][lancedb.table.Table.create_index]
    """

index_stats abstractmethod

index_stats(index_name: str) -> Optional[IndexStatistics]

Retrieve statistics about an index

Parameters:

  • index_name (str) –

    The name of the index to retrieve statistics for

Returns:

  • IndexStatistics or None –

    The statistics about the index. Returns None if the index does not exist.

Source code in lancedb/table.py
@abstractmethod
def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
    """
    Retrieve statistics about an index

    Parameters
    ----------
    index_name: str
        The name of the index to retrieve statistics for

    Returns
    -------
    IndexStatistics or None
        The statistics about the index. Returns None if the index does not exist.
    """

add_columns abstractmethod

add_columns(transforms: Dict[str, str] | Field | List[Field] | Schema)

Add new columns with defined values.

Parameters:

  • transforms (Dict[str, str] | Field | List[Field] | Schema) –

    A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns. Alternatively, a pyarrow Field or Schema can be provided to add new columns with the specified data types. The new columns will be initialized with null values.

Returns:

  • AddColumnsResult –

    version: the new version number of the table after adding columns.

Source code in lancedb/table.py
@abstractmethod
def add_columns(
    self, transforms: Dict[str, str] | pa.Field | List[pa.Field] | pa.Schema
):
    """
    Add new columns with defined values.

    Parameters
    ----------
    transforms: Dict[str, str], pa.Field, List[pa.Field], pa.Schema
        A map of column name to a SQL expression to use to calculate the
        value of the new column. These expressions will be evaluated for
        each row in the table, and can reference existing columns.
        Alternatively, a pyarrow Field or Schema can be provided to add
        new columns with the specified data types. The new columns will
        be initialized with null values.

    Returns
    -------
    AddColumnsResult
        version: the new version number of the table after adding columns.
    """

alter_columns abstractmethod

alter_columns(*alterations: Iterable[Dict[str, str]])

Alter column names, data types, and nullability.

Parameters:

  • alterations (Iterable[Dict[str, Any]], default: () ) –

    A sequence of dictionaries, each with the following keys:

    • "path": str

      The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".

    • "rename": str, optional

      The new name of the column. If not specified, the column name is not changed.

    • "data_type": pyarrow.DataType, optional

      The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.

    • "nullable": bool, optional

      Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.
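
The shape of the alteration dictionaries can be checked before they are passed to alter_columns. The sketch below only validates the documented keys; validate_alteration is a hypothetical helper, and the column names in the example are made up.

```python
from typing import Any, Dict

ALLOWED_KEYS = {"path", "rename", "data_type", "nullable"}


def validate_alteration(alteration: Dict[str, Any]) -> None:
    """Check one alteration dict against the documented keys (hypothetical)."""
    if not isinstance(alteration.get("path"), str):
        raise ValueError('each alteration needs a string "path"')
    unknown = set(alteration) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if "rename" in alteration and not isinstance(alteration["rename"], str):
        raise TypeError('"rename" must be a string')
    if "nullable" in alteration and not isinstance(alteration["nullable"], bool):
        raise TypeError('"nullable" must be a bool')


# Rename a nested column and make a top-level column nullable.
alterations = [
    {"path": "a.b.c", "rename": "d"},
    {"path": "x", "nullable": True},
]
for alteration in alterations:
    validate_alteration(alteration)
```

Each validated dictionary would then be passed as a separate positional argument, for example table.alter_columns(*alterations).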

Returns:

  • AlterColumnsResult –

    version: the new version number of the table after the alteration.

Source code in lancedb/table.py
@abstractmethod
def alter_columns(self, *alterations: Iterable[Dict[str, str]]):
    """
    Alter column names and nullability.

    Parameters
    ----------
    alterations : Iterable[Dict[str, Any]]
        A sequence of dictionaries, each with the following keys:
        - "path": str
            The column path to alter. For a top-level column, this is the name.
            For a nested column, this is the dot-separated path, e.g. "a.b.c".
        - "rename": str, optional
            The new name of the column. If not specified, the column name is
            not changed.
        - "data_type": pyarrow.DataType, optional
           The new data type of the column. Existing values will be casted
           to this type. If not specified, the column data type is not changed.
        - "nullable": bool, optional
            Whether the column should be nullable. If not specified, the column
            nullability is not changed. Only non-nullable columns can be changed
            to nullable. Currently, you cannot change a nullable column to
            non-nullable.

    Returns
    -------
    AlterColumnsResult
        version: the new version number of the table after the alteration.
    """

drop_columns abstractmethod

drop_columns(columns: Iterable[str]) -> DropColumnsResult

Drop columns from the table.

Parameters:

  • columns (Iterable[str]) –

    The names of the columns to drop.

Returns:

  • DropColumnsResult –

    version: the new version number of the table after dropping the columns.

Source code in lancedb/table.py
@abstractmethod
def drop_columns(self, columns: Iterable[str]) -> DropColumnsResult:
    """
    Drop columns from the table.

    Parameters
    ----------
    columns : Iterable[str]
        The names of the columns to drop.

    Returns
    -------
    DropColumnsResult
        version: the new version number of the table dropping the columns.
    """

checkout abstractmethod

checkout(version: Union[int, str])

Checks out a specific version of the Table

Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.

This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent, use the restore method.

Any operation that modifies the table will fail while the table is in a checked out state.

Parameters:

  • version (Union[int, str]) –

    The version to check out. A version number (int) or a tag (str) can be provided.

To return the table to a normal state, use the checkout_latest method.

Source code in lancedb/table.py
@abstractmethod
def checkout(self, version: Union[int, str]):
    """
    Checks out a specific version of the Table

    Any read operation on the table will now access the data at the checked out
    version. As a consequence, calling this method will disable any read consistency
    interval that was previously set.

    This is a read-only operation that turns the table into a sort of "view"
    or "detached head".  Other table instances will not be affected.  To make the
    change permanent you can use the `[Self::restore]` method.

    Any operation that modifies the table will fail while the table is in a checked
    out state.

    Parameters
    ----------
    version: int | str,
        The version to check out. A version number (`int`) or a tag
        (`str`) can be provided.

    To return the table to a normal state use `[Self::checkout_latest]`
    """

checkout_latest abstractmethod

checkout_latest()

Ensures the table is pointing at the latest version

This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a checkout operation.

Source code in lancedb/table.py
@abstractmethod
def checkout_latest(self):
    """
    Ensures the table is pointing at the latest version

    This can be used to manually update a table when the read_consistency_interval
    is None.
    It can also be used to undo a `checkout` operation.
    """

restore abstractmethod

restore(version: Optional[Union[int, str]] = None)

Restore a version of the table. This is an in-place operation.

This creates a new version where the data is equivalent to the specified previous version. Data is not copied (as of python-v0.2.1).

Parameters:

  • version (int or str, default: None ) –

    The version number or version tag to restore. If unspecified then restores the currently checked out version. If the currently checked out version is the latest version then this is a no-op.

Source code in lancedb/table.py
@abstractmethod
def restore(self, version: Optional[Union[int, str]] = None):
    """Restore a version of the table. This is an in-place operation.

    This creates a new version where the data is equivalent to the
    specified previous version. Data is not copied (as of python-v0.2.1).

    Parameters
    ----------
    version : int or str, default None
        The version number or version tag to restore.
        If unspecified then restores the currently checked out version.
        If the currently checked out version is the
        latest version then this is a no-op.
    """

list_versions abstractmethod

list_versions() -> List[Dict[str, Any]]

List all versions of the table

Source code in lancedb/table.py
@abstractmethod
def list_versions(self) -> List[Dict[str, Any]]:
    """List all versions of the table"""

uses_v2_manifest_paths abstractmethod

uses_v2_manifest_paths() -> bool

Check if the table is using the new v2 manifest paths.

Returns:

  • bool –

    True if the table is using the new v2 manifest paths, False otherwise.

Source code in lancedb/table.py
@abstractmethod
def uses_v2_manifest_paths(self) -> bool:
    """
    Check if the table is using the new v2 manifest paths.

    Returns
    -------
    bool
        True if the table is using the new v2 manifest paths, False otherwise.
    """

migrate_v2_manifest_paths abstractmethod

migrate_v2_manifest_paths()

Migrate the manifest paths to the new format.

This will update the manifest to use the new v2 format for paths.

This function is idempotent, and can be run multiple times without changing the state of the object store.

Danger

This should not be run while other concurrent operations are happening. And it should also run until completion before resuming other operations.

You can use Table.uses_v2_manifest_paths to check if the table is already using the new path style.

Source code in lancedb/table.py
@abstractmethod
def migrate_v2_manifest_paths(self):
    """
    Migrate the manifest paths to the new format.

    This will update the manifest to use the new v2 format for paths.

    This function is idempotent, and can be run multiple times without
    changing the state of the object store.

    !!! danger

        This should not be run while other concurrent operations are happening.
        And it should also run until completion before resuming other operations.

    You can use
    [Table.uses_v2_manifest_paths][lancedb.table.Table.uses_v2_manifest_paths]
    to check if the table is already using the new path style.
    """

Querying (Synchronous)

lancedb.query.Query

Bases: BaseModel

A LanceDB Query

Queries are constructed by the Table.search method. This class is a python representation of the query. Normally you will not need to interact with this class directly. You can build up a query and execute it using collection methods such as to_batches(), to_arrow(), to_pandas(), etc.

However, you can use the to_query() method to get the underlying query object. This can be useful for serializing a query or using it in a different context.

Attributes:

  • filter (Optional[str]) –

    sql filter to refine the query with

  • limit (Optional[int]) –

    The limit on the number of results to return. If this is a vector or FTS query, then this is required. If this is a plain SQL query, then this is optional.

  • offset (Optional[int]) –

    The offset to start fetching results from

    This is ignored for vector / FTS search (will be None).

  • columns (Optional[Union[List[str], Dict[str, str]]]) –

    which columns to return in the results

    This can be a list of column names or a dictionary. If it is a dictionary, then the keys are the column names and the values are sql expressions to use to calculate the result.

    If this is None then all columns are returned. This can be expensive.

  • with_row_id (Optional[bool]) –

    if True then include the row id in the results

  • vector (Optional[Union[List[float], List[List[float]], Array, List[Array]]]) –

    the vector to search for, if this is a vector search or hybrid search. It will be None for full text search and plain SQL filtering.

  • vector_column (Optional[str]) –

    the name of the vector column to use for vector search

    If this is None then a default vector column will be used.

  • distance_type (Optional[str]) –

    the distance type to use for vector search

    This can be l2 (default), cosine, or dot. See metric definitions for more details.

    If this is not a vector search this will be None.

  • postfilter (bool) –

    if True then apply the filter after vector / FTS search. This is ignored for plain SQL filtering.

  • nprobes (Optional[int]) –

    The number of IVF partitions to search. If this is None then a default number of partitions will be used.

    • A higher number makes search more accurate but also slower.

    • See discussion in Querying an ANN Index for tuning advice.

    Will be None if this is not a vector search.

  • refine_factor (Optional[int]) –

    Refine the results by reading extra elements and re-ranking them in memory.

    • A higher number makes search more accurate but also slower.

    • See discussion in Querying an ANN Index for tuning advice.

    Will be None if this is not a vector search.

  • lower_bound (Optional[float]) –

    The lower bound for distance search

    Only results with a distance greater than or equal to this value will be returned.

    This will only be set on vector search.

  • upper_bound (Optional[float]) –

    The upper bound for distance search

    Only results with a distance less than or equal to this value will be returned.

    This will only be set on vector search.

  • ef (Optional[int]) –

    The size of the nearest neighbor list maintained during HNSW search

    This will only be set on vector search.

  • full_text_query (Optional[Union[str, dict]]) –

    The full text search query

    This can be a string or a dictionary. A dictionary will be used to search multiple columns. The keys are the column names and the values are the search queries.

    This will only be set on FTS or hybrid queries.

  • fast_search (Optional[bool]) –

    Skip a flat search of unindexed data. This will improve search performance but search results will not include unindexed data.

    The default is False
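To make the interplay of limit, lower_bound, and upper_bound concrete, here is a toy brute-force search in plain Python. It is a sketch under assumptions, not LanceDB code: toy_vector_search is a hypothetical function, it uses ordinary Euclidean distance (the scaling of LanceDB's l2 metric may differ), and it treats both bounds as inclusive, following the wording of the attribute descriptions above.

```python
import math


def toy_vector_search(rows, query, limit, lower_bound=None, upper_bound=None):
    """Brute-force nearest-neighbour search over a list of dict rows."""
    scored = []
    for row in rows:
        distance = math.dist(row["vector"], query)
        # Keep only rows whose distance falls inside the requested range.
        if lower_bound is not None and distance < lower_bound:
            continue
        if upper_bound is not None and distance > upper_bound:
            continue
        scored.append({**row, "_distance": distance})
    # Closest rows first, then keep at most `limit` results.
    scored.sort(key=lambda r: r["_distance"])
    return scored[:limit]


rows = [
    {"id": 1, "vector": [0.0, 0.0]},
    {"id": 2, "vector": [3.0, 4.0]},
    {"id": 3, "vector": [6.0, 8.0]},
]

nearest = toy_vector_search(rows, [0.0, 0.0], limit=2)
in_range = toy_vector_search(
    rows, [0.0, 0.0], limit=10, lower_bound=1.0, upper_bound=5.0
)
```

Here nearest contains the two closest rows, while in_range keeps only the row whose distance lies between the two bounds, regardless of the larger limit.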

Source code in lancedb/query.py
class Query(pydantic.BaseModel):
    """A LanceDB Query

    Queries are constructed by the `Table.search` method.  This class is a
    python representation of the query.  Normally you will not need to interact
    with this class directly.  You can build up a query and execute it using
    collection methods such as `to_batches()`, `to_arrow()`, `to_pandas()`,
    etc.

    However, you can use the `to_query()` method to get the underlying query object.
    This can be useful for serializing a query or using it in a different context.

    Attributes
    ----------
    filter : Optional[str]
        sql filter to refine the query with
    limit : Optional[int]
        The limit on the number of results to return.  If this is a vector or FTS query,
        then this is required.  If this is a plain SQL query, then this is optional.
    offset: Optional[int]
        The offset to start fetching results from

        This is ignored for vector / FTS search (will be None).
    columns : Optional[Union[List[str], Dict[str, str]]]
        which columns to return in the results

        This can be a list of column names or a dictionary.  If it is a dictionary,
        then the keys are the column names and the values are sql expressions to
        use to calculate the result.

        If this is None then all columns are returned.  This can be expensive.
    with_row_id : Optional[bool]
        if True then include the row id in the results
    vector : Optional[Union[List[float], List[List[float]], pa.Array, List[pa.Array]]]
        the vector to search for, if this is a vector search or hybrid search.  It will
        be None for full text search and plain SQL filtering.
    vector_column : Optional[str]
        the name of the vector column to use for vector search

        If this is None then a default vector column will be used.
    distance_type : Optional[str]
        the distance type to use for vector search

        This can be l2 (default), cosine and dot.  See [metric definitions][search] for
        more details.

        If this is not a vector search this will be None.
    postfilter : bool
        if True then apply the filter after vector / FTS search.  This is ignored for
        plain SQL filtering.
    nprobes : Optional[int]
        The number of IVF partitions to search.  If this is None then a default
        number of partitions will be used.

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.

        Will be None if this is not a vector search.
    refine_factor : Optional[int]
        Refine the results by reading extra elements and re-ranking them in memory.

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.

        Will be None if this is not a vector search.
    lower_bound : Optional[float]
        The lower bound for distance search

        Only results with a distance greater than or equal to this value
        will be returned.

        This will only be set on vector search.
    upper_bound : Optional[float]
        The upper bound for distance search

        Only results with a distance less than or equal to this value
        will be returned.

        This will only be set on vector search.
    ef : Optional[int]
        The size of the nearest neighbor list maintained during HNSW search

        This will only be set on vector search.
    full_text_query : Optional[Union[str, dict]]
        The full text search query

        This can be a string or a dictionary.  A dictionary will be used to search
        multiple columns.  The keys are the column names and the values are the
        search queries.

        This will only be set on FTS or hybrid queries.
    fast_search: Optional[bool]
        Skip a flat search of unindexed data. This will improve
        search performance but search results will not include unindexed data.

        The default is False
    """

    # The name of the vector column to use for vector search.
    vector_column: Optional[str] = None

    # vector to search for
    #
    # Note: today this will be floats on the sync path and pa.Array on the async
    # path though in the future we should unify this to pa.Array everywhere
    vector: Annotated[
        Optional[Union[List[float], List[List[float]], pa.Array, List[pa.Array]]],
        ensure_vector_query,
    ] = None

    # sql filter to refine the query with
    filter: Optional[str] = None

    # if True then apply the filter after vector search
    postfilter: Optional[bool] = None

    # full text search query
    full_text_query: Optional[FullTextSearchQuery] = None

    # top k results to return
    limit: Optional[int] = None

    # distance type to use for vector search
    distance_type: Optional[str] = None

    # which columns to return in the results
    columns: Optional[Union[List[str], Dict[str, str]]] = None

    # number of IVF partitions to search
    nprobes: Optional[int] = None

    # lower bound for distance search
    lower_bound: Optional[float] = None

    # upper bound for distance search
    upper_bound: Optional[float] = None

    # multiplier for the number of results to inspect for reranking
    refine_factor: Optional[int] = None

    # if true, include the row id in the results
    with_row_id: Optional[bool] = None

    # offset to start fetching results from
    offset: Optional[int] = None

    # if true, will only search the indexed data
    fast_search: Optional[bool] = None

    # size of the nearest neighbor list maintained during HNSW search
    ef: Optional[int] = None

    # Bypass the vector index and use a brute force search
    bypass_vector_index: Optional[bool] = None

    @classmethod
    def from_inner(cls, req: PyQueryRequest) -> Self:
        query = cls()
        query.limit = req.limit
        query.offset = req.offset
        query.filter = req.filter
        query.full_text_query = req.full_text_search
        query.columns = req.select
        query.with_row_id = req.with_row_id
        query.vector_column = req.column
        query.vector = req.query_vector
        query.distance_type = req.distance_type
        query.nprobes = req.nprobes
        query.lower_bound = req.lower_bound
        query.upper_bound = req.upper_bound
        query.ef = req.ef
        query.refine_factor = req.refine_factor
        query.bypass_vector_index = req.bypass_vector_index
        query.postfilter = req.postfilter
        if req.full_text_search is not None:
            query.full_text_query = FullTextSearchQuery(
                columns=req.full_text_search.columns,
                query=req.full_text_search.query,
                limit=req.full_text_search.limit,
                wand_factor=req.full_text_search.wand_factor,
            )
        return query

    # This tells pydantic to allow custom types (needed for the `vector` query since
    # pa.Array wouldn't be allowed otherwise)
    if PYDANTIC_VERSION.major < 2:  # Pydantic 1.x compat

        class Config:
            arbitrary_types_allowed = True
    else:
        model_config = {"arbitrary_types_allowed": True}

lancedb.query.LanceQueryBuilder

Bases: ABC

An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.
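The choice of subclass follows from the query value and the requested query type, as the create classmethod in the source listing shows. The function below is a simplified, standalone restatement of that decision logic for illustration. resolve_query_type is a hypothetical name, has_embedding_function stands in for the table's embedding-function lookup, the label "empty" stands for plain SQL filtering, and only Python lists are treated as vectors (the real code also accepts NumPy arrays and FullTextQuery objects).

```python
def resolve_query_type(query, query_type="auto", has_embedding_function=False):
    """Return which kind of builder a query would be routed to (toy version)."""
    # Hybrid is checked first because it supports an empty query.
    if query_type == "hybrid":
        return "hybrid"
    # No query at all means plain SQL filtering.
    if query is None:
        return "empty"
    if query_type == "fts":
        if not isinstance(query, str):
            raise TypeError(f"'fts' query must be a string: {type(query)}")
        return "fts"
    if query_type == "vector":
        # A non-vector query needs an embedding function to become a vector.
        if isinstance(query, list) or has_embedding_function:
            return "vector"
        raise ValueError("No embedding function for the vector column")
    if query_type == "auto":
        # Vectors search as vectors; text is embedded when possible,
        # otherwise it falls back to full text search.
        if isinstance(query, list) or has_embedding_function:
            return "vector"
        return "fts"
    raise ValueError(f"Invalid query_type: {query_type}")


kinds = [
    resolve_query_type([0.1, 0.2]),
    resolve_query_type("cats"),
    resolve_query_type("cats", has_embedding_function=True),
    resolve_query_type(None),
    resolve_query_type(None, "hybrid"),
]
```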

Source code in lancedb/query.py
class LanceQueryBuilder(ABC):
    """An abstract query builder. Subclasses are defined for vector search,
    full text search, hybrid, and plain SQL filtering.
    """

    @classmethod
    def create(
        cls,
        table: "Table",
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
        query_type: str,
        vector_column_name: str,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
        fast_search: Optional[bool] = None,
    ) -> Self:
        """
        Create a query builder based on the given query and query type.

        Parameters
        ----------
        table: Table
            The table to query.
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
            The query to use. If None, an empty query builder is returned
            which performs simple SQL filtering.
        query_type: str
            The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
            If "auto", the query type is inferred based on the query.
        vector_column_name: str
            The name of the vector column to use for vector search.
        fast_search: bool
            Skip flat search of unindexed data.
        """
        # Check hybrid search first as it supports empty query pattern
        if query_type == "hybrid":
            # hybrid fts and vector query
            return LanceHybridQueryBuilder(
                table, query, vector_column_name, fts_columns=fts_columns
            )

        if query is None:
            return LanceEmptyQueryBuilder(table)

        # remember the string query for reranking purpose
        str_query = query if isinstance(query, str) else None

        # convert "auto" query_type to "vector", "fts"
        # or "hybrid" and convert the query to vector if needed
        query, query_type = cls._resolve_query(
            table, query, query_type, vector_column_name
        )

        if query_type == "hybrid":
            return LanceHybridQueryBuilder(
                table, query, vector_column_name, fts_columns=fts_columns
            )

        if isinstance(query, (str, FullTextQuery)):
            # fts
            return LanceFtsQueryBuilder(
                table,
                query,
                ordering_field_name=ordering_field_name,
                fts_columns=fts_columns,
            )

        if isinstance(query, list):
            query = np.array(query, dtype=np.float32)
        elif isinstance(query, np.ndarray):
            query = query.astype(np.float32)
        else:
            raise TypeError(f"Unsupported query type: {type(query)}")

        return LanceVectorQueryBuilder(
            table, query, vector_column_name, str_query, fast_search
        )

    @classmethod
    def _resolve_query(cls, table, query, query_type, vector_column_name):
        # If query_type is fts, then query must be a string.
        # otherwise raise TypeError
        if query_type == "fts":
            if not isinstance(query, (str, FullTextQuery)):
                raise TypeError(
                    f"'fts' query must be a string or FullTextQuery: {type(query)}"
                )
            return query, query_type
        elif query_type == "vector":
            query = cls._query_to_vector(table, query, vector_column_name)
            return query, query_type
        elif query_type == "auto":
            if isinstance(query, (list, np.ndarray)):
                return query, "vector"
            else:
                conf = table.embedding_functions.get(vector_column_name)
                if conf is not None:
                    query = conf.function.compute_query_embeddings_with_retry(query)[0]
                    return query, "vector"
                else:
                    return query, "fts"
        else:
            raise ValueError(
                f"Invalid query_type, must be 'vector', 'fts', or 'auto': {query_type}"
            )

    @classmethod
    def _query_to_vector(cls, table, query, vector_column_name):
        if isinstance(query, (list, np.ndarray)):
            return query
        conf = table.embedding_functions.get(vector_column_name)
        if conf is not None:
            return conf.function.compute_query_embeddings_with_retry(query)[0]
        else:
            msg = f"No embedding function for {vector_column_name}"
            raise ValueError(msg)

    def __init__(self, table: "Table"):
        self._table = table
        self._limit = None
        self._offset = None
        self._columns = None
        self._where = None
        self._postfilter = None
        self._with_row_id = None
        self._vector = None
        self._text = None
        self._ef = None
        self._bypass_vector_index = None

    @deprecation.deprecated(
        deprecated_in="0.3.1",
        removed_in="0.4.0",
        current_version=__version__,
        details="Use to_pandas() instead",
    )
    def to_df(self) -> "pd.DataFrame":
        """
        *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        return self.to_pandas()

    def to_pandas(
        self,
        flatten: Optional[Union[int, bool]] = None,
        *,
        timeout: Optional[timedelta] = None,
    ) -> "pd.DataFrame":
        """
        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.

        Parameters
        ----------
        flatten: Optional[Union[int, bool]]
            If flatten is True, flatten all nested columns.
            If flatten is an integer, flatten the nested columns up to the
            specified depth.
            If unspecified, do not flatten the nested columns.
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.
        """
        tbl = flatten_columns(self.to_arrow(timeout=timeout), flatten)
        return tbl.to_pandas()

    @abstractmethod
    def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.

        Parameters
        ----------
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.
        """
        raise NotImplementedError

    @abstractmethod
    def to_batches(
        self,
        /,
        batch_size: Optional[int] = None,
        *,
        timeout: Optional[timedelta] = None,
    ) -> pa.RecordBatchReader:
        """
        Execute the query and return the results as a pyarrow
        [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html)

        Parameters
        ----------
        batch_size: int
            The maximum number of selected records in a RecordBatch object.
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.
        """
        raise NotImplementedError

    def to_list(self, *, timeout: Optional[timedelta] = None) -> List[dict]:
        """
        Execute the query and return the results as a list of dictionaries.

        Each list entry is a dictionary with the selected column names as keys,
        or all table columns if `select` is not called. The vector and the "_distance"
        fields are returned whether or not they're explicitly selected.

        Parameters
        ----------
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.
        """
        return self.to_arrow(timeout=timeout).to_pylist()

    def to_pydantic(
        self, model: Type[LanceModel], *, timeout: Optional[timedelta] = None
    ) -> List[LanceModel]:
        """Return the table as a list of pydantic models.

        Parameters
        ----------
        model: Type[LanceModel]
            The pydantic model to use.
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.

        Returns
        -------
        List[LanceModel]
        """
        return [
            model(**{k: v for k, v in row.items() if k in model.field_names()})
            for row in self.to_arrow(timeout=timeout).to_pylist()
        ]

    def to_polars(self, *, timeout: Optional[timedelta] = None) -> "pl.DataFrame":
        """
        Execute the query and return the results as a Polars DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.

        Parameters
        ----------
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.
        """
        import polars as pl

        return pl.from_arrow(self.to_arrow(timeout=timeout))

    def limit(self, limit: Union[int, None]) -> Self:
        """Set the maximum number of results to return.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.
            The default query limit is 10 results.
            For ANN/KNN queries, you must specify a limit.
            For plain searches, all records are returned if limit not set.
            *WARNING* if you have a large dataset, setting
            the limit to a large number, e.g. the table size,
            can potentially result in reading a
            large amount of data into memory and cause
            out of memory issues.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if limit is None or limit <= 0:
            if isinstance(self, LanceVectorQueryBuilder):
                raise ValueError("Limit is required for ANN/KNN queries")
            else:
                self._limit = None
        else:
            self._limit = limit
        return self

    def offset(self, offset: int) -> Self:
        """Set the offset for the results.

        Parameters
        ----------
        offset: int
            The offset to start fetching results from.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if offset is None or offset <= 0:
            self._offset = 0
        else:
            self._offset = offset
        return self

    def select(self, columns: Union[list[str], dict[str, str]]) -> Self:
        """Set the columns to return.

        Parameters
        ----------
        columns: list of str, or dict of str to str default None
            List of column names to be fetched.
            Or a dictionary of column names to SQL expressions.
            All columns are fetched if None or unspecified.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if isinstance(columns, list) or isinstance(columns, dict):
            self._columns = columns
        else:
            raise ValueError("columns must be a list or a dictionary")
        return self

    def where(self, where: str, prefilter: bool = True) -> Self:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default True
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.
            This feature is **EXPERIMENTAL** and may be removed and modified
            without warning in the future.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        self._postfilter = not prefilter
        return self

    def with_row_id(self, with_row_id: bool) -> Self:
        """Set whether to return row ids.

        Parameters
        ----------
        with_row_id: bool
            If True, return _rowid column in the results.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._with_row_id = with_row_id
        return self

    def explain_plan(self, verbose: Optional[bool] = False) -> str:
        """Return the execution plan for this query.

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
        >>> query = [100, 100]
        >>> plan = table.search(query).explain_plan(True)
        >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
        GlobalLimitExec: skip=0, fetch=10
          FilterExec: _distance@2 IS NOT NULL
            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
              KNNVectorDistance: metric=l2
                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

        Parameters
        ----------
        verbose : bool, default False
            Use a verbose output format.

        Returns
        -------
        plan : str
        """  # noqa: E501
        return self._table._explain_plan(self.to_query_object(), verbose=verbose)

    def analyze_plan(self) -> str:
        """
        Run the query and return its execution plan with runtime metrics.

        This returns detailed metrics for each step, such as elapsed time,
        rows processed, bytes read, and I/O stats. It is useful for debugging
        and performance tuning.

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
        >>> query = [100, 100]
        >>> plan = table.search(query).analyze_plan()
        >>> print(plan)  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        AnalyzeExec verbose=true, metrics=[]
          ProjectionExec: expr=[...], metrics=[...]
            GlobalLimitExec: skip=0, fetch=10, metrics=[...]
              FilterExec: _distance@2 IS NOT NULL,
              metrics=[output_rows=..., elapsed_compute=...]
                SortExec: TopK(fetch=10), expr=[...],
                preserve_partitioning=[...],
                metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]
                  KNNVectorDistance: metric=l2,
                  metrics=[output_rows=..., elapsed_compute=..., output_batches=...]
                    LanceScan: uri=..., projection=[vector], row_id=true,
                    row_addr=false, ordered=false,
                    metrics=[output_rows=..., elapsed_compute=...,
                    bytes_read=..., iops=..., requests=...]

        Returns
        -------
        plan : str
            The physical query execution plan with runtime metrics.
        """
        return self._table._analyze_plan(self.to_query_object())

    def vector(self, vector: Union[np.ndarray, list]) -> Self:
        """Set the vector to search for.

        Parameters
        ----------
        vector: np.ndarray or list
            The vector to search for.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        raise NotImplementedError

    def text(self, text: str | FullTextQuery) -> Self:
        """Set the text to search for.

        Parameters
        ----------
        text: str | FullTextQuery
            If a string, it is treated as a MatchQuery.
            If a FullTextQuery object, it is used directly.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        raise NotImplementedError

    @abstractmethod
    def rerank(self, reranker: Reranker) -> Self:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        raise NotImplementedError

    @abstractmethod
    def to_query_object(self) -> Query:
        """Return a serializable representation of the query

        Returns
        -------
        Query
            The serializable representation of the query
        """
        raise NotImplementedError

create classmethod

create(table: 'Table', query: Optional[Union[ndarray, str, 'PIL.Image.Image', Tuple]], query_type: str, vector_column_name: str, ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None, fast_search: bool = None) -> Self

Create a query builder based on the given query and query type.

Parameters:

  • table ('Table') –

    The table to query.

  • query (Optional[Union[ndarray, str, 'PIL.Image.Image', Tuple]]) –

    The query to use. If None, an empty query builder is returned which performs simple SQL filtering.

  • query_type (str) –

    The type of query to perform. One of "vector", "fts", "hybrid", or "auto". If "auto", the query type is inferred based on the query.

  • vector_column_name (str) –

    The name of the vector column to use for vector search.

  • fast_search (bool, default: None ) –

    Skip flat search of unindexed data.

Source code in lancedb/query.py
@classmethod
def create(
    cls,
    table: "Table",
    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
    query_type: str,
    vector_column_name: str,
    ordering_field_name: Optional[str] = None,
    fts_columns: Optional[Union[str, List[str]]] = None,
    fast_search: bool = None,
) -> Self:
    """
    Create a query builder based on the given query and query type.

    Parameters
    ----------
    table: Table
        The table to query.
    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
        The query to use. If None, an empty query builder is returned
        which performs simple SQL filtering.
    query_type: str
        The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
        If "auto", the query type is inferred based on the query.
    vector_column_name: str
        The name of the vector column to use for vector search.
    fast_search: bool
        Skip flat search of unindexed data.
    """
    # Check hybrid search first as it supports empty query pattern
    if query_type == "hybrid":
        # hybrid fts and vector query
        return LanceHybridQueryBuilder(
            table, query, vector_column_name, fts_columns=fts_columns
        )

    if query is None:
        return LanceEmptyQueryBuilder(table)

    # remember the string query for reranking purpose
    str_query = query if isinstance(query, str) else None

    # convert "auto" query_type to "vector", "fts"
    # or "hybrid" and convert the query to vector if needed
    query, query_type = cls._resolve_query(
        table, query, query_type, vector_column_name
    )

    if query_type == "hybrid":
        return LanceHybridQueryBuilder(
            table, query, vector_column_name, fts_columns=fts_columns
        )

    if isinstance(query, (str, FullTextQuery)):
        # fts
        return LanceFtsQueryBuilder(
            table,
            query,
            ordering_field_name=ordering_field_name,
            fts_columns=fts_columns,
        )

    if isinstance(query, list):
        query = np.array(query, dtype=np.float32)
    elif isinstance(query, np.ndarray):
        query = query.astype(np.float32)
    else:
        raise TypeError(f"Unsupported query type: {type(query)}")

    return LanceVectorQueryBuilder(
        table, query, vector_column_name, str_query, fast_search
    )
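
The branching in the listing above can be summarised with a small standalone sketch. `pick_builder` is a hypothetical helper, not part of LanceDB; it returns builder names as strings, assumes the query type has already been resolved (it skips the `_resolve_query` step, so strings are not embedded into vectors), and leaves out the NumPy branch so that it runs with the standard library only.

```python
from typing import Any, Optional


def pick_builder(query: Optional[Any], query_type: str) -> str:
    """Return the name of the builder that the branching above would select.

    Hypothetical helper for illustration only; it does not call LanceDB.
    """
    # Hybrid is checked first because it accepts an empty query.
    if query_type == "hybrid":
        return "LanceHybridQueryBuilder"
    # No query at all: only SQL-style filtering is performed.
    if query is None:
        return "LanceEmptyQueryBuilder"
    # A string that survives resolution is a full-text query.
    if isinstance(query, str):
        return "LanceFtsQueryBuilder"
    # Lists are converted to float32 vectors in the real method.
    if isinstance(query, list):
        return "LanceVectorQueryBuilder"
    raise TypeError(f"Unsupported query type: {type(query)}")
```

The order of the checks matters: a hybrid query with `query=None` still yields a hybrid builder, whereas any other query type with `query=None` falls back to the empty builder.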

to_df

to_df() -> 'pd.DataFrame'

Deprecated alias for to_pandas(). Please use to_pandas() instead.

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns the vector column and the "_distance" column, which holds the distance between the query vector and each returned vector.

Source code in lancedb/query.py
@deprecation.deprecated(
    deprecated_in="0.3.1",
    removed_in="0.4.0",
    current_version=__version__,
    details="Use to_pandas() instead",
)
def to_df(self) -> "pd.DataFrame":
    """
    *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

    Execute the query and return the results as a pandas DataFrame.
    In addition to the selected columns, LanceDB also returns the vector
    column and the "_distance" column, which holds the distance between the
    query vector and each returned vector.
    """
    return self.to_pandas()

to_pandas

to_pandas(flatten: Optional[Union[int, bool]] = None, *, timeout: Optional[timedelta] = None) -> 'pd.DataFrame'

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns the vector column and the "_distance" column, which holds the distance between the query vector and each returned vector.

Parameters:

  • flatten (Optional[Union[int, bool]], default: None ) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code in lancedb/query.py
def to_pandas(
    self,
    flatten: Optional[Union[int, bool]] = None,
    *,
    timeout: Optional[timedelta] = None,
) -> "pd.DataFrame":
    """
    Execute the query and return the results as a pandas DataFrame.
    In addition to the selected columns, LanceDB also returns the vector
    column and the "_distance" column, which holds the distance between the
    query vector and each returned vector.

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.
    """
    tbl = flatten_columns(self.to_arrow(timeout=timeout), flatten)
    return tbl.to_pandas()

to_arrow abstractmethod

to_arrow(*, timeout: Optional[timedelta] = None) -> Table

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns the vector column and the "_distance" column, which holds the distance between the query vector and each returned vector.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code in lancedb/query.py
@abstractmethod
def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and return the results as an
    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

    In addition to the selected columns, LanceDB also returns the vector
    column and the "_distance" column, which holds the distance between the
    query vector and each returned vector.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.
    """
    raise NotImplementedError

to_batches abstractmethod

to_batches(batch_size: Optional[int] = None, *, timeout: Optional[timedelta] = None) -> RecordBatchReader

Execute the query and return the results as a pyarrow RecordBatchReader.

Parameters:

  • batch_size (Optional[int], default: None ) –

    The maximum number of selected records in a RecordBatch object.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code in lancedb/query.py
@abstractmethod
def to_batches(
    self,
    /,
    batch_size: Optional[int] = None,
    *,
    timeout: Optional[timedelta] = None,
) -> pa.RecordBatchReader:
    """
    Execute the query and return the results as a pyarrow
    [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html)

    Parameters
    ----------
    batch_size: int
        The maximum number of selected records in a RecordBatch object.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.
    """
    raise NotImplementedError

to_list

to_list(*, timeout: Optional[timedelta] = None) -> List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code in lancedb/query.py
def to_list(self, *, timeout: Optional[timedelta] = None) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.
    """
    return self.to_arrow(timeout=timeout).to_pylist()

to_pydantic

to_pydantic(model: Type[LanceModel], *, timeout: Optional[timedelta] = None) -> List[LanceModel]

Execute the query and return the results as a list of pydantic models.

Parameters:

  • model (Type[LanceModel]) –

    The pydantic model to use.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Returns:

  • List[LanceModel] –

Source code in lancedb/query.py
def to_pydantic(
    self, model: Type[LanceModel], *, timeout: Optional[timedelta] = None
) -> List[LanceModel]:
    """Return the table as a list of pydantic models.

    Parameters
    ----------
    model: Type[LanceModel]
        The pydantic model to use.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.

    Returns
    -------
    List[LanceModel]
    """
    return [
        model(**{k: v for k, v in row.items() if k in model.field_names()})
        for row in self.to_arrow(timeout=timeout).to_pylist()
    ]
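
The comprehension above keeps only the keys that the model declares, so extra result columns such as "_distance" are dropped before the model is constructed. A minimal sketch of that filtering step, using a standard-library dataclass as a stand-in for a `LanceModel` subclass (the `Item` class and `rows_to_models` function are hypothetical names for illustration):

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class Item:
    """Stand-in for a LanceModel subclass (hypothetical)."""

    b: int
    vector: List[float]


def rows_to_models(rows: List[Dict[str, Any]]) -> List[Item]:
    # Keep only the keys the model declares, mirroring the
    # `k in model.field_names()` check in the listing above.
    names = {f.name for f in fields(Item)}
    return [Item(**{k: v for k, v in row.items() if k in names}) for row in rows]


# A result row carries "_distance", which Item does not declare.
rows = [{"b": 6, "vector": [0.4, 0.4], "_distance": 0.0}]
items = rows_to_models(rows)
```

If the model should expose the distance, one option under this scheme is to declare a matching field on the model so the key is no longer filtered out.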

to_polars

to_polars(*, timeout: Optional[timedelta] = None) -> 'pl.DataFrame'

Execute the query and return the results as a Polars DataFrame. In addition to the selected columns, LanceDB also returns the vector column and the "_distance" column, which holds the distance between the query vector and each returned vector.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code in lancedb/query.py
def to_polars(self, *, timeout: Optional[timedelta] = None) -> "pl.DataFrame":
    """
    Execute the query and return the results as a Polars DataFrame.
    In addition to the selected columns, LanceDB also returns the vector
    column and the "_distance" column, which holds the distance between the
    query vector and each returned vector.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.
    """
    import polars as pl

    return pl.from_arrow(self.to_arrow(timeout=timeout))

limit

limit(limit: Union[int, None]) -> Self

Set the maximum number of results to return.

Parameters:

  • limit (Union[int, None]) –

    The maximum number of results to return. The default query limit is 10 results. For ANN/KNN queries, you must specify a limit. For plain searches, all records are returned if the limit is not set; None or a non-positive value is treated as not set. WARNING: if you have a large dataset, setting the limit to a large number (for example, the table size) can read a large amount of data into memory and cause out-of-memory errors.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def limit(self, limit: Union[int, None]) -> Self:
    """Set the maximum number of results to return.

    Parameters
    ----------
    limit: int
        The maximum number of results to return.
        The default query limit is 10 results.
        For ANN/KNN queries, you must specify a limit.
        For plain searches, all records are returned if limit not set.
        *WARNING* if you have a large dataset, setting
        the limit to a large number, e.g. the table size,
        can potentially result in reading a
        large amount of data into memory and cause
        out of memory issues.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if limit is None or limit <= 0:
        if isinstance(self, LanceVectorQueryBuilder):
            raise ValueError("Limit is required for ANN/KNN queries")
        else:
            self._limit = None
    else:
        self._limit = limit
    return self
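
The two branches above behave differently for vector and non-vector builders. The sketch below restates that rule as a free function so it can be tried in isolation; `normalize_limit` and its `is_vector_query` flag are hypothetical names standing in for the `isinstance(self, LanceVectorQueryBuilder)` check.

```python
from typing import Optional


def normalize_limit(limit: Optional[int], is_vector_query: bool) -> Optional[int]:
    """Mirror the branching in `limit` (hypothetical free function)."""
    if limit is None or limit <= 0:
        if is_vector_query:
            # ANN/KNN queries need an explicit positive limit.
            raise ValueError("Limit is required for ANN/KNN queries")
        # Plain searches: no cap is stored.
        return None
    return limit
```

Note that `0` and negative values are handled exactly like `None`, so they cannot be used to request an empty result.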

offset

offset(offset: int) -> Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def offset(self, offset: int) -> Self:
    """Set the offset for the results.

    Parameters
    ----------
    offset: int
        The offset to start fetching results from.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if offset is None or offset <= 0:
        self._offset = 0
    else:
        self._offset = offset
    return self
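
A short sketch of how an offset and a limit combine for paging. The normalisation follows the listing above; the `page` helper is hypothetical and assumes the usual skip-then-fetch semantics on an already ordered result, which is consistent with the `GlobalLimitExec: skip=..., fetch=...` step shown in the plan examples elsewhere in this reference.

```python
from typing import List, Optional


def normalize_offset(offset: Optional[int]) -> int:
    # None, zero, and negative offsets all mean "start from the first row".
    if offset is None or offset <= 0:
        return 0
    return offset


def page(rows: List[int], offset: Optional[int], limit: int) -> List[int]:
    """Skip `offset` rows, then take up to `limit` rows (illustrative only)."""
    start = normalize_offset(offset)
    return rows[start:start + limit]


ordered = list(range(10))
second_page = page(ordered, offset=3, limit=3)
```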

select

select(columns: Union[list[str], dict[str, str]]) -> Self

Set the columns to return.

Parameters:

  • columns (Union[list[str], dict[str, str]]) –

    List of column names to be fetched, or a dictionary mapping column names to SQL expressions. All columns are fetched if select is not called. Passing anything other than a list or a dictionary raises a ValueError.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def select(self, columns: Union[list[str], dict[str, str]]) -> Self:
    """Set the columns to return.

    Parameters
    ----------
    columns: list of str, or dict of str to str
        List of column names to be fetched.
        Or a dictionary of column names to SQL expressions.
        All columns are fetched if select is not called.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if isinstance(columns, (list, dict)):
        self._columns = columns
    else:
        raise ValueError("columns must be a list or a dictionary")
    return self

where

where(where: str, prefilter: bool = True) -> Self

Set the where clause.

Parameters:

  • where (str) –

    The where clause, which must be a valid SQL where clause. See Lance filter pushdown (https://lancedb.github.io/lance/read_and_write.html#filter-push-down) for valid SQL expressions.

  • prefilter (bool, default: True ) –

    If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed or modified without warning in the future.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def where(self, where: str, prefilter: bool = True) -> Self:
    """Set the where clause.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.
    prefilter: bool, default True
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.
        This feature is **EXPERIMENTAL** and may be removed or modified
        without warning in the future.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._where = where
    self._postfilter = not prefilter
    return self
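
The difference between pre-filtering and post-filtering is easiest to see on a tiny brute-force search. The sketch below is not how LanceDB executes queries (it uses no index and a plain Python predicate instead of a SQL where clause); `squared_l2`, `top_k`, and `search` are hypothetical helpers written only to show why a post-filter can return fewer rows than the requested limit.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(rows: List[Row], query: Sequence[float], k: int) -> List[Row]:
    # Brute-force nearest neighbours; sorted() keeps ties in input order.
    return sorted(rows, key=lambda r: squared_l2(r["vector"], query))[:k]


def search(
    rows: List[Row],
    query: Sequence[float],
    k: int,
    predicate: Callable[[Row], bool],
    prefilter: bool = True,
) -> List[Row]:
    if prefilter:
        # Filter first, then rank: up to k matching rows come back.
        return top_k([r for r in rows if predicate(r)], query, k)
    # Rank first, then filter: fewer than k rows may survive.
    return [r for r in top_k(rows, query, k) if predicate(r)]


rows = [
    {"vector": [1.1, 1.2], "b": 2},
    {"vector": [0.5, 1.3], "b": 4},
    {"vector": [0.4, 0.4], "b": 6},
    {"vector": [0.4, 0.4], "b": 10},
]
pre = search(rows, [0.4, 0.4], 2, lambda r: r["b"] < 6, prefilter=True)
post = search(rows, [0.4, 0.4], 2, lambda r: r["b"] < 6, prefilter=False)
pre_ids = [r["b"] for r in pre]
post_ids = [r["b"] for r in post]
```

With this data the pre-filtered search returns the rows with `b` equal to 4 and 2, while the post-filtered search returns nothing, because the two nearest rows (`b` equal to 6 and 10) both fail the predicate.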

with_row_id

with_row_id(with_row_id: bool) -> Self

Set whether to return row ids.

Parameters:

  • with_row_id (bool) –

    If True, return _rowid column in the results.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def with_row_id(self, with_row_id: bool) -> Self:
    """Set whether to return row ids.

    Parameters
    ----------
    with_row_id: bool
        If True, return _rowid column in the results.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._with_row_id = with_row_id
    return self

explain_plan

explain_plan(verbose: Optional[bool] = False) -> str

Return the execution plan for this query.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).explain_plan(True)
>>> print(plan)
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
GlobalLimitExec: skip=0, fetch=10
  FilterExec: _distance@2 IS NOT NULL
    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
      KNNVectorDistance: metric=l2
        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default: False ) –

    Use a verbose output format.

Returns:

  • plan ( str ) –

    The execution plan for this query.

Source code in lancedb/query.py
def explain_plan(self, verbose: Optional[bool] = False) -> str:
    """Return the execution plan for this query.

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
    >>> query = [100, 100]
    >>> plan = table.search(query).explain_plan(True)
    >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
    GlobalLimitExec: skip=0, fetch=10
      FilterExec: _distance@2 IS NOT NULL
        SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
          KNNVectorDistance: metric=l2
            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    return self._table._explain_plan(self.to_query_object(), verbose=verbose)

analyze_plan

analyze_plan() -> str

Run the query and return its execution plan with runtime metrics.

This returns detailed metrics for each step, such as elapsed time, rows processed, bytes read, and I/O stats. It is useful for debugging and performance tuning.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).analyze_plan()
>>> print(plan)
AnalyzeExec verbose=true, metrics=[]
  ProjectionExec: expr=[...], metrics=[...]
    GlobalLimitExec: skip=0, fetch=10, metrics=[...]
      FilterExec: _distance@2 IS NOT NULL,
      metrics=[output_rows=..., elapsed_compute=...]
        SortExec: TopK(fetch=10), expr=[...],
        preserve_partitioning=[...],
        metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]
          KNNVectorDistance: metric=l2,
          metrics=[output_rows=..., elapsed_compute=..., output_batches=...]
            LanceScan: uri=..., projection=[vector], row_id=true,
            row_addr=false, ordered=false,
            metrics=[output_rows=..., elapsed_compute=...,
            bytes_read=..., iops=..., requests=...]

Returns:

  • plan ( str ) –

    The physical query execution plan with runtime metrics.

Source code in lancedb/query.py
def analyze_plan(self) -> str:
    """
    Run the query and return its execution plan with runtime metrics.

    This returns detailed metrics for each step, such as elapsed time,
    rows processed, bytes read, and I/O stats. It is useful for debugging
    and performance tuning.

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
    >>> query = [100, 100]
    >>> plan = table.search(query).analyze_plan()
    >>> print(plan)  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    AnalyzeExec verbose=true, metrics=[]
      ProjectionExec: expr=[...], metrics=[...]
        GlobalLimitExec: skip=0, fetch=10, metrics=[...]
          FilterExec: _distance@2 IS NOT NULL,
          metrics=[output_rows=..., elapsed_compute=...]
            SortExec: TopK(fetch=10), expr=[...],
            preserve_partitioning=[...],
            metrics=[output_rows=..., elapsed_compute=..., row_replacements=...]
              KNNVectorDistance: metric=l2,
              metrics=[output_rows=..., elapsed_compute=..., output_batches=...]
                LanceScan: uri=..., projection=[vector], row_id=true,
                row_addr=false, ordered=false,
                metrics=[output_rows=..., elapsed_compute=...,
                bytes_read=..., iops=..., requests=...]

    Returns
    -------
    plan : str
        The physical query execution plan with runtime metrics.
    """
    return self._table._analyze_plan(self.to_query_object())

vector

vector(vector: Union[ndarray, list]) -> Self

Set the vector to search for.

Parameters:

  • vector (Union[ndarray, list]) –

    The vector to search for.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def vector(self, vector: Union[np.ndarray, list]) -> Self:
    """Set the vector to search for.

    Parameters
    ----------
    vector: np.ndarray or list
        The vector to search for.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    raise NotImplementedError

text

text(text: str | FullTextQuery) -> Self

Set the text to search for.

Parameters:

  • text (str | FullTextQuery) –

    If a string, it is treated as a MatchQuery. If a FullTextQuery object, it is used directly.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
def text(self, text: str | FullTextQuery) -> Self:
    """Set the text to search for.

    Parameters
    ----------
    text: str | FullTextQuery
        If a string, it is treated as a MatchQuery.
        If a FullTextQuery object, it is used directly.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    raise NotImplementedError

rerank abstractmethod

rerank(reranker: Reranker) -> Self

Rerank the results using the specified reranker.

Parameters:

  • reranker (Reranker) –

    The reranker to use.

Returns:

  • LanceQueryBuilder –

    The LanceQueryBuilder object.

Source code in lancedb/query.py
@abstractmethod
def rerank(self, reranker: Reranker) -> Self:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    raise NotImplementedError

to_query_object abstractmethod

to_query_object() -> Query

Return a serializable representation of the query.

Returns:

  • Query –

    The serializable representation of the query

Source code in lancedb/query.py
@abstractmethod
def to_query_object(self) -> Query:
    """Return a serializable representation of the query

    Returns
    -------
    Query
        The serializable representation of the query
    """
    raise NotImplementedError

lancedb.query.LanceVectorQueryBuilder

Bases: LanceQueryBuilder

Examples:

>>> import lancedb
>>> data = [{"vector": [1.1, 1.2], "b": 2},
...         {"vector": [0.5, 1.3], "b": 4},
...         {"vector": [0.4, 0.4], "b": 6},
...         {"vector": [0.4, 0.4], "b": 10}]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=data)
>>> (table.search([0.4, 0.4])
...       .distance_type("cosine")
...       .where("b < 10")
...       .select(["b", "vector"])
...       .limit(2)
...       .to_pandas())
   b      vector  _distance
0  6  [0.4, 0.4]   0.000000
1  2  [1.1, 1.2]   0.000944
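
The `_distance` values in the example above can be checked by hand. The sketch below assumes the usual definition of cosine distance (one minus cosine similarity); `cosine_distance` is a hypothetical standalone helper, not a LanceDB function. Under that assumption the second row's value should round to the 0.000944 shown in the table.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed definition, for illustration)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [0.4, 0.4]
d_same = cosine_distance(query, [0.4, 0.4])   # identical direction
d_other = cosine_distance(query, [1.1, 1.2])  # nearly the same direction
d_far = cosine_distance(query, [0.5, 1.3])    # the row left out of the top 2
```

Because cosine distance ignores vector length, `[1.1, 1.2]` ranks ahead of `[0.5, 1.3]` even though neither is close to the query in Euclidean terms.
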
Source code in lancedb/query.py
class LanceVectorQueryBuilder(LanceQueryBuilder):
    """
    Examples
    --------
    >>> import lancedb
    >>> data = [{"vector": [1.1, 1.2], "b": 2},
    ...         {"vector": [0.5, 1.3], "b": 4},
    ...         {"vector": [0.4, 0.4], "b": 6},
    ...         {"vector": [0.4, 0.4], "b": 10}]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=data)
    >>> (table.search([0.4, 0.4])
    ...       .distance_type("cosine")
    ...       .where("b < 10")
    ...       .select(["b", "vector"])
    ...       .limit(2)
    ...       .to_pandas())
       b      vector  _distance
    0  6  [0.4, 0.4]   0.000000
    1  2  [1.1, 1.2]   0.000944
    """

    def __init__(
        self,
        table: "Table",
        query: Union[np.ndarray, list, "PIL.Image.Image"],
        vector_column: str,
        str_query: Optional[str] = None,
        fast_search: bool = None,
    ):
        super().__init__(table)
        self._query = query
        self._distance_type = None
        self._nprobes = None
        self._lower_bound = None
        self._upper_bound = None
        self._refine_factor = None
        self._vector_column = vector_column
        self._postfilter = None
        self._reranker = None
        self._str_query = str_query
        self._fast_search = fast_search

    def metric(self, metric: Literal["l2", "cosine", "dot"]) -> LanceVectorQueryBuilder:
        """Set the distance metric to use.

        This is an alias for distance_type() and may be deprecated in the future.
        Please use distance_type() instead.

        Parameters
        ----------
        metric: "l2" or "cosine" or "dot"
            The distance metric to use. By default "l2" is used.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        return self.distance_type(metric)

    def distance_type(
        self, distance_type: Literal["l2", "cosine", "dot"]
    ) -> "LanceVectorQueryBuilder":
        """Set the distance metric to use.

        When performing a vector search we try to find the "nearest" vectors according
        to some kind of distance metric. This parameter controls which distance metric
        to use.

        Note: if there is a vector index then the distance type used MUST match the
        distance type used to train the vector index. If this is not done then the
        results will be invalid.

        Parameters
        ----------
        distance_type: "l2" or "cosine" or "dot"
            The distance metric to use. By default "l2" is used.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._distance_type = distance_type.lower()
        return self

    def nprobes(self, nprobes: int) -> LanceVectorQueryBuilder:
        """Set the number of probes to use.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        nprobes: int
            The number of probes to use.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._nprobes = nprobes
        return self

    def distance_range(
        self, lower_bound: Optional[float] = None, upper_bound: Optional[float] = None
    ) -> LanceVectorQueryBuilder:
        """Set the distance range to use.

        Only rows with distances within range [lower_bound, upper_bound)
        will be returned.

        Parameters
        ----------
        lower_bound: Optional[float]
            The lower bound of the distance range.
        upper_bound: Optional[float]
            The upper bound of the distance range.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._lower_bound = lower_bound
        self._upper_bound = upper_bound
        return self

    def ef(self, ef: int) -> LanceVectorQueryBuilder:
        """Set the number of candidates to consider during search.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        This only applies to the HNSW-related index.
        The default value is 1.5 * limit.

        Parameters
        ----------
        ef: int
            The number of candidates to consider during search.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._ef = ef
        return self

    def refine_factor(self, refine_factor: int) -> LanceVectorQueryBuilder:
        """Set the refine factor to use, increasing the number of vectors sampled.

        As an example, a refine factor of 2 will sample 2x as many vectors as
        requested, re-ranks them, and returns the top half most relevant results.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        refine_factor: int
            The refine factor to use.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._refine_factor = refine_factor
        return self

    def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.

        Parameters
        ----------
        timeout: Optional[timedelta]
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.
        """
        return self.to_batches(timeout=timeout).read_all()

    def to_query_object(self) -> Query:
        """
        Build a Query object

        This can be used to serialize a query
        """
        vector = self._query if isinstance(self._query, list) else self._query.tolist()
        if isinstance(vector[0], np.ndarray):
            vector = [v.tolist() for v in vector]
        return Query(
            vector=vector,
            filter=self._where,
            postfilter=self._postfilter,
            limit=self._limit,
            distance_type=self._distance_type,
            columns=self._columns,
            nprobes=self._nprobes,
            lower_bound=self._lower_bound,
            upper_bound=self._upper_bound,
            refine_factor=self._refine_factor,
            vector_column=self._vector_column,
            with_row_id=self._with_row_id,
            offset=self._offset,
            fast_search=self._fast_search,
            ef=self._ef,
            bypass_vector_index=self._bypass_vector_index,
        )

    def to_batches(
        self,
        /,
        batch_size: Optional[int] = None,
        *,
        timeout: Optional[timedelta] = None,
    ) -> pa.RecordBatchReader:
        """
        Execute the query and return the result as a RecordBatchReader object.

        Parameters
        ----------
        batch_size: int
            The maximum number of selected records in a RecordBatch object.
        timeout: timedelta, default None
            The maximum time to wait for the query to complete.
            If None, wait indefinitely.

        Returns
        -------
        pa.RecordBatchReader
        """
        vector = self._query if isinstance(self._query, list) else self._query.tolist()
        if isinstance(vector[0], np.ndarray):
            vector = [v.tolist() for v in vector]
        query = self.to_query_object()
        result_set = self._table._execute_query(
            query, batch_size=batch_size, timeout=timeout
        )
        if self._reranker is not None:
            rs_table = result_set.read_all()
            result_set = self._reranker.rerank_vector(self._str_query, rs_table)
            check_reranker_result(result_set)
            # convert result_set back to RecordBatchReader
            result_set = pa.RecordBatchReader.from_batches(
                result_set.schema, result_set.to_batches()
            )

        return result_set

    def where(self, where: str, prefilter: bool = None) -> LanceVectorQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default True
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        if prefilter is not None:
            self._postfilter = not prefilter
        return self

    def rerank(
        self, reranker: Reranker, query_string: Optional[str] = None
    ) -> LanceVectorQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        query_string: Optional[str]
            The query to use for reranking. This needs to be specified explicitly here
            as the query used for vector search may already be vectorized and the
            reranker requires a string query.
            This is only required if the query used for vector search is not a string.
            Note: This doesn't yet support the case where the query is multimodal or a
            list of vectors.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._reranker = reranker
        if self._str_query is None and query_string is None:
            raise ValueError(
                """
                The query used for vector search is not a string.
                In this case, the reranker query needs to be specified explicitly.
                """
            )
        if query_string is not None and not isinstance(query_string, str):
            raise ValueError("Reranking currently only supports string queries")
        self._str_query = query_string if query_string is not None else self._str_query
        return self

    def bypass_vector_index(self) -> LanceVectorQueryBuilder:
        """
        If this is called then any vector index is skipped

        An exhaustive (flat) search will be performed.  The query vector will
        be compared to every vector in the table.  At high scales this can be
        expensive.  However, this is often still useful.  For example, skipping
        the vector index can give you ground truth results which you can use to
        calculate your recall to select an appropriate value for nprobes.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceVectorQueryBuilder object.
        """
        self._bypass_vector_index = True
        return self

metric

metric(metric: Literal['l2', 'cosine', 'dot']) -> LanceVectorQueryBuilder

Set the distance metric to use.

This is an alias for distance_type() and may be deprecated in the future. Please use distance_type() instead.

Parameters:

  • metric (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def metric(self, metric: Literal["l2", "cosine", "dot"]) -> LanceVectorQueryBuilder:
    """Set the distance metric to use.

    This is an alias for distance_type() and may be deprecated in the future.
    Please use distance_type() instead.

    Parameters
    ----------
    metric: "l2" or "cosine" or "dot"
        The distance metric to use. By default "l2" is used.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    return self.distance_type(metric)

distance_type

distance_type(distance_type: Literal['l2', 'cosine', 'dot']) -> 'LanceVectorQueryBuilder'

Set the distance metric to use.

A vector search finds the "nearest" vectors according to a distance metric. This parameter controls which distance metric is used.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def distance_type(
    self, distance_type: Literal["l2", "cosine", "dot"]
) -> "LanceVectorQueryBuilder":
    """Set the distance metric to use.

    When performing a vector search we try and find the "nearest" vectors according
    to some kind of distance metric. This parameter controls which distance metric
    to use.

    Note: if there is a vector index then the distance type used MUST match the
    distance type used to train the vector index. If this is not done then the
    results will be invalid.

    Parameters
    ----------
    distance_type: "l2" or "cosine" or "dot"
        The distance metric to use. By default "l2" is used.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._distance_type = distance_type.lower()
    return self
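
To make the three metric names concrete, here is a minimal pure-Python sketch using textbook definitions of each distance. It is not LanceDB's implementation, and the values LanceDB reports in "_distance" may be scaled or offset differently; the helper names (l2_distance, cosine_distance, dot_distance, nearest) are hypothetical. It shows that the same query can have a different nearest neighbor depending on the metric.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: depends only on direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def dot_distance(a, b):
    # A larger dot product means "more similar", so negate it to get
    # something that sorts like a distance (assumed convention).
    return -sum(x * y for x, y in zip(a, b))

METRICS = {"l2": l2_distance, "cosine": cosine_distance, "dot": dot_distance}

def nearest(query, vectors, distance_type="l2"):
    # Lower-case the name, mirroring what distance_type() does with its input.
    dist = METRICS[distance_type.lower()]
    return min(range(len(vectors)), key=lambda i: dist(query, vectors[i]))

query = [1.0, 0.0]
vectors = [[0.9, 0.5], [3.0, 0.0]]

print(nearest(query, vectors, "l2"))      # index of the nearest vector under l2
print(nearest(query, vectors, "cosine"))  # index of the nearest vector under cosine
```

Because the ranking changes with the metric, a query that uses a different metric from the one the index was trained with compares vectors on a different basis, which is why the note above requires the two to match.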

nprobes

nprobes(nprobes: int) -> LanceVectorQueryBuilder

Set the number of probes to use.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

  • nprobes (int) –

    The number of probes to use.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def nprobes(self, nprobes: int) -> LanceVectorQueryBuilder:
    """Set the number of probes to use.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    See discussion in [Querying an ANN Index][querying-an-ann-index] for
    tuning advice.

    Parameters
    ----------
    nprobes: int
        The number of probes to use.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._nprobes = nprobes
    return self
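
The recall and latency trade-off can be illustrated with a toy partitioned (IVF-style) index. This is a sketch of the general idea, not LanceDB's implementation: vectors are grouped by their nearest centroid, and a query only scans the nprobes partitions whose centroids are closest. All names and data here are hypothetical.

```python
def sq_dist(a, b):
    # Squared Euclidean distance; enough for ranking.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_partitions(vectors, centroids):
    # Assign each vector index to the partition of its nearest centroid.
    partitions = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest_c = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        partitions[nearest_c].append(idx)
    return partitions

def ivf_search(query, vectors, centroids, partitions, nprobes, limit):
    # Scan only the nprobes partitions whose centroids are closest to the query.
    probe_order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in probe_order[:nprobes] for i in partitions[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:limit]

centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[0.0, 1.0], [1.0, 0.0], [5.5, 0.0], [9.0, 0.0]]
partitions = build_partitions(vectors, centroids)
query = [4.5, 0.0]

# With one probe the true nearest neighbor (index 2) sits in an unscanned
# partition and is missed; with two probes it is found.
print(ivf_search(query, vectors, centroids, partitions, nprobes=1, limit=1))
print(ivf_search(query, vectors, centroids, partitions, nprobes=2, limit=1))
```

Each extra probe scans more vectors, which is where the added latency comes from.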

distance_range

distance_range(lower_bound: Optional[float] = None, upper_bound: Optional[float] = None) -> LanceVectorQueryBuilder

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default: None ) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default: None ) –

    The upper bound of the distance range.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def distance_range(
    self, lower_bound: Optional[float] = None, upper_bound: Optional[float] = None
) -> LanceVectorQueryBuilder:
    """Set the distance range to use.

    Only rows with distances within range [lower_bound, upper_bound)
    will be returned.

    Parameters
    ----------
    lower_bound: Optional[float]
        The lower bound of the distance range.
    upper_bound: Optional[float]
        The upper bound of the distance range.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._lower_bound = lower_bound
    self._upper_bound = upper_bound
    return self
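
The range is half-open: the lower bound is included and the upper bound is excluded. The sketch below spells that out in plain Python. It assumes that a bound left as None means "unbounded on that side", which the text above does not state explicitly; in_distance_range is a hypothetical helper, not part of LanceDB.

```python
def in_distance_range(distance, lower_bound=None, upper_bound=None):
    # Half-open interval [lower_bound, upper_bound).
    # Assumption: a bound of None places no limit on that side.
    if lower_bound is not None and distance < lower_bound:
        return False
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True

distances = [0.0, 0.25, 0.5, 0.75]
kept = [d for d in distances if in_distance_range(d, lower_bound=0.25, upper_bound=0.75)]
print(kept)  # 0.25 is kept (inclusive), 0.75 is dropped (exclusive)
```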

ef

ef(ef: int) -> LanceVectorQueryBuilder

Set the number of candidates to consider during search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

This only applies to HNSW-based indexes. The default value is 1.5 * limit.

Parameters:

  • ef (int) –

    The number of candidates to consider during search.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def ef(self, ef: int) -> LanceVectorQueryBuilder:
    """Set the number of candidates to consider during search.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    This only applies to the HNSW-related index.
    The default value is 1.5 * limit.

    Parameters
    ----------
    ef: int
        The number of candidates to consider during search.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._ef = ef
    return self

refine_factor

refine_factor(refine_factor: int) -> LanceVectorQueryBuilder

Set the refine factor to use, increasing the number of vectors sampled.

For example, a refine factor of 2 samples twice as many vectors as requested, re-ranks them, and returns the most relevant half.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

  • refine_factor (int) –

    The refine factor to use.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def refine_factor(self, refine_factor: int) -> LanceVectorQueryBuilder:
    """Set the refine factor to use, increasing the number of vectors sampled.

    As an example, a refine factor of 2 will sample 2x as many vectors as
    requested, re-ranks them, and returns the top half most relevant results.

    See discussion in [Querying an ANN Index][querying-an-ann-index] for
    tuning advice.

    Parameters
    ----------
    refine_factor: int
        The refine factor to use.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._refine_factor = refine_factor
    return self
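
The over-fetch and re-rank step can be sketched in plain Python. This is an illustration under stated assumptions, not LanceDB's implementation: the quantize function is a crude, hypothetical stand-in for the lossy vector representation an index might use, and search is a hypothetical helper.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vec):
    # Hypothetical lossy compression: round each component to an integer.
    return [round(x) for x in vec]

def search(query, vectors, limit, refine_factor=None):
    # First pass: rank by distance to the lossy (quantized) vectors.
    approx_order = sorted(
        range(len(vectors)), key=lambda i: sq_dist(query, quantize(vectors[i]))
    )
    if refine_factor is None:
        return approx_order[:limit]
    # Refine: fetch limit * refine_factor candidates, re-rank them with
    # exact distances, and keep only the best `limit`.
    candidates = approx_order[: limit * refine_factor]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:limit]

query = [1.0, 0.0]
vectors = [[1.4, 0.0], [0.9, 0.0], [3.0, 0.0]]

# Both of the first two vectors quantize to [1, 0], so the approximate pass
# cannot tell them apart; the exact re-rank picks the truly closer one.
print(search(query, vectors, limit=1))
print(search(query, vectors, limit=1, refine_factor=2))
```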

to_arrow

to_arrow(*, timeout: Optional[timedelta] = None) -> Table

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns the vector column and a "_distance" column, which holds the distance between the query vector and each returned vector.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Source code in lancedb/query.py
def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and return the results as an
    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vectors.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.
    """
    return self.to_batches(timeout=timeout).read_all()

to_query_object

to_query_object() -> Query

Build a Query object.

This can be used to serialize a query.

Source code in lancedb/query.py
def to_query_object(self) -> Query:
    """
    Build a Query object

    This can be used to serialize a query
    """
    vector = self._query if isinstance(self._query, list) else self._query.tolist()
    if isinstance(vector[0], np.ndarray):
        vector = [v.tolist() for v in vector]
    return Query(
        vector=vector,
        filter=self._where,
        postfilter=self._postfilter,
        limit=self._limit,
        distance_type=self._distance_type,
        columns=self._columns,
        nprobes=self._nprobes,
        lower_bound=self._lower_bound,
        upper_bound=self._upper_bound,
        refine_factor=self._refine_factor,
        vector_column=self._vector_column,
        with_row_id=self._with_row_id,
        offset=self._offset,
        fast_search=self._fast_search,
        ef=self._ef,
        bypass_vector_index=self._bypass_vector_index,
    )

to_batches

to_batches(batch_size: Optional[int] = None, *, timeout: Optional[timedelta] = None) -> RecordBatchReader

Execute the query and return the result as a RecordBatchReader object.

Parameters:

  • batch_size (Optional[int], default: None ) –

    The maximum number of selected records in a RecordBatch object.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If None, wait indefinitely.

Returns:

  • RecordBatchReader –

    A reader over the query results.

Source code in lancedb/query.py
def to_batches(
    self,
    /,
    batch_size: Optional[int] = None,
    *,
    timeout: Optional[timedelta] = None,
) -> pa.RecordBatchReader:
    """
    Execute the query and return the result as a RecordBatchReader object.

    Parameters
    ----------
    batch_size: int
        The maximum number of selected records in a RecordBatch object.
    timeout: timedelta, default None
        The maximum time to wait for the query to complete.
        If None, wait indefinitely.

    Returns
    -------
    pa.RecordBatchReader
    """
    vector = self._query if isinstance(self._query, list) else self._query.tolist()
    if isinstance(vector[0], np.ndarray):
        vector = [v.tolist() for v in vector]
    query = self.to_query_object()
    result_set = self._table._execute_query(
        query, batch_size=batch_size, timeout=timeout
    )
    if self._reranker is not None:
        rs_table = result_set.read_all()
        result_set = self._reranker.rerank_vector(self._str_query, rs_table)
        check_reranker_result(result_set)
        # convert result_set back to RecordBatchReader
        result_set = pa.RecordBatchReader.from_batches(
            result_set.schema, result_set.to_batches()
        )

    return result_set

where

where(where: str, prefilter: bool = None) -> LanceVectorQueryBuilder

Set the where clause.

Parameters:

  • where (str) –

    The where clause, which must be a valid SQL where clause. See Lance filter pushdown (https://lancedb.github.io/lance/read_and_write.html#filter-push-down) for valid SQL expressions.

  • prefilter (bool, default: None ) –

    If True, apply the filter before the vector search; if False, apply it to the results of the vector search. If None, the builder's current setting is left unchanged.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def where(self, where: str, prefilter: bool = None) -> LanceVectorQueryBuilder:
    """Set the where clause.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.
    prefilter: bool, default True
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._where = where
    if prefilter is not None:
        self._postfilter = not prefilter
    return self
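
The difference between prefiltering and postfiltering is easiest to see on a tiny example. The sketch below is plain Python, not LanceDB code: it uses one-dimensional "vectors" and a Python callable in place of a SQL where clause, and search_1d is a hypothetical helper.

```python
def search_1d(query, rows, limit, predicate=None, prefilter=True):
    candidates = rows
    if predicate is not None and prefilter:
        # Prefilter: drop non-matching rows first, then find the nearest.
        candidates = [row for row in rows if predicate(row)]
    nearest = sorted(candidates, key=lambda row: abs(row["x"] - query))[:limit]
    if predicate is not None and not prefilter:
        # Postfilter: find the nearest first, then drop non-matching rows.
        nearest = [row for row in nearest if predicate(row)]
    return [row["id"] for row in nearest]

rows = [
    {"id": 1, "x": 0.1, "category": "a"},
    {"id": 2, "x": 0.2, "category": "a"},
    {"id": 3, "x": 0.9, "category": "b"},
]

def is_b(row):
    return row["category"] == "b"

# Prefiltering still finds the only "b" row; postfiltering returns nothing,
# because neither of the two nearest rows survives the filter.
print(search_1d(0.0, rows, limit=2, predicate=is_b, prefilter=True))
print(search_1d(0.0, rows, limit=2, predicate=is_b, prefilter=False))
```

A postfiltered query can therefore return fewer rows than the requested limit, even when matching rows exist in the table.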

rerank

rerank(reranker: Reranker, query_string: Optional[str] = None) -> LanceVectorQueryBuilder

Rerank the results using the specified reranker.

Parameters:

  • reranker (Reranker) –

    The reranker to use.

  • query_string (Optional[str], default: None ) –

    The query to use for reranking. This needs to be specified explicitly here as the query used for vector search may already be vectorized and the reranker requires a string query. This is only required if the query used for vector search is not a string. Note: This doesn't yet support the case where the query is multimodal or a list of vectors.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def rerank(
    self, reranker: Reranker, query_string: Optional[str] = None
) -> LanceVectorQueryBuilder:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    query_string: Optional[str]
        The query to use for reranking. This needs to be specified explicitly here
        as the query used for vector search may already be vectorized and the
        reranker requires a string query.
        This is only required if the query used for vector search is not a string.
        Note: This doesn't yet support the case where the query is multimodal or a
        list of vectors.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._reranker = reranker
    if self._str_query is None and query_string is None:
        raise ValueError(
            """
            The query used for vector search is not a string.
            In this case, the reranker query needs to be specified explicitly.
            """
        )
    if query_string is not None and not isinstance(query_string, str):
        raise ValueError("Reranking currently only supports string queries")
    self._str_query = query_string if query_string is not None else self._str_query
    return self

bypass_vector_index

bypass_vector_index() -> LanceVectorQueryBuilder

If this is called, any vector index is skipped.

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Returns:

  • LanceVectorQueryBuilder –

    The same LanceVectorQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def bypass_vector_index(self) -> LanceVectorQueryBuilder:
    """
    If this is called then any vector index is skipped

    An exhaustive (flat) search will be performed.  The query vector will
    be compared to every vector in the table.  At high scales this can be
    expensive.  However, this is often still useful.  For example, skipping
    the vector index can give you ground truth results which you can use to
    calculate your recall to select an appropriate value for nprobes.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceVectorQueryBuilder object.
    """
    self._bypass_vector_index = True
    return self
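
The recall calculation mentioned above can be sketched as follows. The row ids are made-up illustration data and recall_at_k is a hypothetical helper: exact_ids stands for the ids returned by a query that called bypass_vector_index(), and approx_ids for the ids returned by the same query through the index.

```python
def recall_at_k(approx_ids, exact_ids):
    # Fraction of the ground-truth results that the indexed search also found.
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact_ids = [7, 3, 9, 1]    # ground truth from an exhaustive (flat) search
approx_ids = [7, 9, 4, 1]   # results from the indexed search

print(recall_at_k(approx_ids, exact_ids))
```

Repeating the indexed query with increasing nprobes values and recomputing the recall each time is one way to find the smallest value that meets a recall target.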

lancedb.query.LanceFtsQueryBuilder

Bases: LanceQueryBuilder

A builder for full text search for LanceDB.

Source code in lancedb/query.py
class LanceFtsQueryBuilder(LanceQueryBuilder):
    """A builder for full text search for LanceDB."""

    def __init__(
        self,
        table: "Table",
        query: str | FullTextQuery,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ):
        super().__init__(table)
        self._query = query
        self._phrase_query = False
        self.ordering_field_name = ordering_field_name
        self._reranker = None
        if isinstance(fts_columns, str):
            fts_columns = [fts_columns]
        self._fts_columns = fts_columns

    def phrase_query(self, phrase_query: bool = True) -> LanceFtsQueryBuilder:
        """Set whether to use phrase query.

        Parameters
        ----------
        phrase_query: bool, default True
            If True, then the query will be wrapped in quotes and
            double quotes replaced by single quotes.

        Returns
        -------
        LanceFtsQueryBuilder
            The LanceFtsQueryBuilder object.
        """
        self._phrase_query = phrase_query
        return self

    def to_query_object(self) -> Query:
        return Query(
            columns=self._columns,
            filter=self._where,
            limit=self._limit,
            postfilter=self._postfilter,
            with_row_id=self._with_row_id,
            full_text_query=FullTextSearchQuery(
                query=self._query, columns=self._fts_columns
            ),
            offset=self._offset,
        )

    def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
        path, fs, exist = self._table._get_fts_index_path()
        if exist:
            return self.tantivy_to_arrow()

        query = self._query
        if self._phrase_query:
            raise NotImplementedError(
                "Phrase query is not yet supported in Lance FTS. "
                "Use tantivy-based index instead for now."
            )
        query = self.to_query_object()
        results = self._table._execute_query(query, timeout=timeout)
        results = results.read_all()
        if self._reranker is not None:
            results = self._reranker.rerank_fts(self._query, results)
            check_reranker_result(results)
        return results

    def to_batches(
        self, /, batch_size: Optional[int] = None, timeout: Optional[timedelta] = None
    ):
        raise NotImplementedError("to_batches on an FTS query")

    def tantivy_to_arrow(self) -> pa.Table:
        try:
            import tantivy
        except ImportError:
            raise ImportError(
                "Please install tantivy-py `pip install tantivy` to use the full text search feature."  # noqa: E501
            )

        from .fts import search_index

        # get the index path
        path, fs, exist = self._table._get_fts_index_path()

        # check if the index exist
        if not exist:
            raise FileNotFoundError(
                "Fts index does not exist. "
                "Please first call table.create_fts_index(['<field_names>']) to "
                "create the fts index."
            )

        # Check that we are on local filesystem
        if not isinstance(fs, pa_fs.LocalFileSystem):
            raise NotImplementedError(
                "Tantivy-based full text search "
                "is only supported on the local filesystem"
            )
        # open the index
        index = tantivy.Index.open(path)
        # get the scores and doc ids
        query = self._query
        if self._phrase_query:
            query = query.replace('"', "'")
            query = f'"{query}"'
        limit = self._limit if self._limit is not None else 10
        row_ids, scores = search_index(
            index, query, limit, ordering_field=self.ordering_field_name
        )
        if len(row_ids) == 0:
            empty_schema = pa.schema([pa.field("_score", pa.float32())])
            return pa.Table.from_batches([], schema=empty_schema)
        scores = pa.array(scores)
        output_tbl = self._table.to_lance().take(row_ids, columns=self._columns)
        output_tbl = output_tbl.append_column("_score", scores)
        # this needs to match vector search results which are uint64
        row_ids = pa.array(row_ids, type=pa.uint64())

        if self._where is not None:
            tmp_name = "__lancedb__duckdb__indexer__"
            output_tbl = output_tbl.append_column(
                tmp_name, pa.array(range(len(output_tbl)))
            )
            try:
                # TODO would be great to have Substrait generate pyarrow compute
                # expressions or conversely have pyarrow support SQL expressions
                # using Substrait
                import duckdb

                indexer = duckdb.sql(
                    f"SELECT {tmp_name} FROM output_tbl WHERE {self._where}"
                ).to_arrow_table()[tmp_name]
                output_tbl = output_tbl.take(indexer).drop([tmp_name])
                row_ids = row_ids.take(indexer)

            except ImportError:
                import tempfile

                import lance

                # TODO Use "memory://" instead once that's supported
                with tempfile.TemporaryDirectory() as tmp:
                    ds = lance.write_dataset(output_tbl, tmp)
                    output_tbl = ds.to_table(filter=self._where)
                    indexer = output_tbl[tmp_name]
                    row_ids = row_ids.take(indexer)
                    output_tbl = output_tbl.drop([tmp_name])

        if self._with_row_id:
            output_tbl = output_tbl.append_column("_rowid", row_ids)

        if self._reranker is not None:
            output_tbl = self._reranker.rerank_fts(self._query, output_tbl)
        return output_tbl

    def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        Returns
        -------
        LanceFtsQueryBuilder
            The LanceQueryBuilder object.
        """
        self._reranker = reranker
        return self

phrase_query

phrase_query(phrase_query: bool = True) -> LanceFtsQueryBuilder

Set whether to use phrase query.

Parameters:

  • phrase_query (bool, default: True ) –

    If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.

Returns:

  • LanceFtsQueryBuilder –

    The same LanceFtsQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def phrase_query(self, phrase_query: bool = True) -> LanceFtsQueryBuilder:
    """Set whether to use phrase query.

    Parameters
    ----------
    phrase_query: bool, default True
        If True, then the query will be wrapped in quotes and
        double quotes replaced by single quotes.

    Returns
    -------
    LanceFtsQueryBuilder
        The LanceFtsQueryBuilder object.
    """
    self._phrase_query = phrase_query
    return self
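
The quoting rule described above matches what the tantivy_to_arrow source listing does with the query string. A standalone sketch of that transformation, with as_phrase_query as a hypothetical helper name:

```python
def as_phrase_query(query: str) -> str:
    # Replace embedded double quotes with single quotes, then wrap the
    # whole query in double quotes so it is treated as one phrase.
    return '"' + query.replace('"', "'") + '"'

print(as_phrase_query("puppy food"))
print(as_phrase_query('puppy "food"'))
```

Per the to_arrow source listing, this only applies when a tantivy-based index exists; with the Lance FTS index, a phrase query raises NotImplementedError.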

rerank

rerank(reranker: Reranker) -> LanceFtsQueryBuilder

Rerank the results using the specified reranker.

Parameters:

  • reranker (Reranker) –

    The reranker to use.

Returns:

  • LanceFtsQueryBuilder –

    The same LanceFtsQueryBuilder, so calls can be chained.

Source code in lancedb/query.py
def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    Returns
    -------
    LanceFtsQueryBuilder
        The LanceQueryBuilder object.
    """
    self._reranker = reranker
    return self

lancedb.query.LanceHybridQueryBuilder

Bases: LanceQueryBuilder

A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the RRFReranker, which uses reciprocal rank fusion score for reranking.

To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.
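
As a rough illustration of reciprocal rank fusion, the sketch below combines two ranked lists of row ids in plain Python. It follows the textbook formula, where each row scores the sum of 1 / (k + rank) over the lists it appears in; it is not the RRFReranker implementation. The constant k=60 is a commonly used value and is an assumption here, as are the function name and the sample rankings.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of row ids ordered from best to worst.
    scores = {}
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda row_id: scores[row_id], reverse=True)

vector_ranking = ["a", "b", "c"]  # hypothetical vector search order
fts_ranking = ["c", "a", "d"]     # hypothetical full text search order

fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
print(fused)
```

Because the formula uses only ranks, it does not need the vector distances and full text scores to be on the same scale.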

Source code in lancedb/query.py
class LanceHybridQueryBuilder(LanceQueryBuilder):
    """
    A query builder that performs hybrid vector and full text search.
    Results are combined and reranked based on the specified reranker.
    By default, the results are reranked using the RRFReranker, which
    uses reciprocal rank fusion score for reranking.

    To make the vector and fts results comparable, the scores are normalized.
    Instead of normalizing scores, the `normalize` parameter can be set to "rank"
    in the `rerank` method to convert the scores to ranks and then normalize them.
    """

    def __init__(
        self,
        table: "Table",
        query: Optional[Union[str, FullTextQuery]] = None,
        vector_column: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ):
        super().__init__(table)
        self._query = query
        self._vector_column = vector_column
        self._fts_columns = fts_columns
        self._norm = None
        self._reranker = None
        self._nprobes = None
        self._refine_factor = None
        self._distance_type = None
        self._phrase_query = None
        self._lower_bound = None
        self._upper_bound = None

    def _validate_query(self, query, vector=None, text=None):
        if query is not None and (vector is not None or text is not None):
            raise ValueError(
                "You can either provide a string query in search() method "
                "or set `vector()` and `text()` explicitly for hybrid search. "
                "But not both."
            )

        vector_query = vector if vector is not None else query
        if not isinstance(vector_query, (str, list, np.ndarray)):
            raise ValueError("Vector query must be either a string or a vector")

        text_query = text or query
        if text_query is None:
            raise ValueError("Text query must be provided for hybrid search.")
        if not isinstance(text_query, (str, FullTextQuery)):
            raise ValueError("Text query must be a string or FullTextQuery")

        return vector_query, text_query

    def phrase_query(self, phrase_query: bool = None) -> LanceHybridQueryBuilder:
        """Set whether to use phrase query.

        Parameters
        ----------
        phrase_query: bool, default None
            If True, then the query will be wrapped in quotes and
            double quotes replaced by single quotes.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._phrase_query = phrase_query
        return self

    def to_query_object(self) -> Query:
        raise NotImplementedError("to_query_object not yet supported on a hybrid query")

    def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
        self._create_query_builders()
        with ThreadPoolExecutor() as executor:
            fts_future = executor.submit(
                self._fts_query.with_row_id(True).to_arrow, timeout=timeout
            )
            vector_future = executor.submit(
                self._vector_query.with_row_id(True).to_arrow, timeout=timeout
            )
            fts_results = fts_future.result()
            vector_results = vector_future.result()

        return self._combine_hybrid_results(
            fts_results=fts_results,
            vector_results=vector_results,
            norm=self._norm,
            fts_query=self._fts_query._query,
            reranker=self._reranker,
            limit=self._limit,
            with_row_ids=self._with_row_id,
        )

    @staticmethod
    def _combine_hybrid_results(
        fts_results: pa.Table,
        vector_results: pa.Table,
        norm: str,
        fts_query: str,
        reranker,
        limit: int,
        with_row_ids: bool,
    ) -> pa.Table:
        if norm == "rank":
            vector_results = LanceHybridQueryBuilder._rank(vector_results, "_distance")
            fts_results = LanceHybridQueryBuilder._rank(fts_results, "_score")

        original_distances = None
        original_scores = None
        original_distance_row_ids = None
        original_score_row_ids = None
        # normalize the scores to be between 0 and 1, 0 being most relevant
        # We check whether the results (vector and FTS) are empty, because when
        # they are, they often are missing the _rowid column, which causes an error
        if vector_results.num_rows > 0:
            distance_i = vector_results.column_names.index("_distance")
            original_distances = vector_results.column(distance_i)
            original_distance_row_ids = vector_results.column("_rowid")
            vector_results = vector_results.set_column(
                distance_i,
                vector_results.field(distance_i),
                LanceHybridQueryBuilder._normalize_scores(original_distances),
            )

        # In fts higher scores represent relevance. Not inverting them here as
        # rerankers might need to preserve this score to support `return_score="all"`
        if fts_results.num_rows > 0:
            score_i = fts_results.column_names.index("_score")
            original_scores = fts_results.column(score_i)
            original_score_row_ids = fts_results.column("_rowid")
            fts_results = fts_results.set_column(
                score_i,
                fts_results.field(score_i),
                LanceHybridQueryBuilder._normalize_scores(original_scores),
            )

        results = reranker.rerank_hybrid(fts_query, vector_results, fts_results)

        check_reranker_result(results)

        if "_distance" in results.column_names and original_distances is not None:
            # restore the original distances
            indices = pc.index_in(
                results["_rowid"], original_distance_row_ids, skip_nulls=True
            )
            original_distances = pc.take(original_distances, indices)
            distance_i = results.column_names.index("_distance")
            results = results.set_column(distance_i, "_distance", original_distances)

        if "_score" in results.column_names and original_scores is not None:
            # restore the original scores
            indices = pc.index_in(
                results["_rowid"], original_score_row_ids, skip_nulls=True
            )
            original_scores = pc.take(original_scores, indices)
            score_i = results.column_names.index("_score")
            results = results.set_column(score_i, "_score", original_scores)

        results = results.slice(length=limit)

        if not with_row_ids:
            results = results.drop(["_rowid"])

        return results

    def to_batches(
        self, /, batch_size: Optional[int] = None, timeout: Optional[timedelta] = None
    ):
        raise NotImplementedError("to_batches not yet supported on a hybrid query")

    @staticmethod
    def _rank(results: pa.Table, column: str, ascending: bool = True):
        if len(results) == 0:
            return results
        # Get the values of the requested column from results
        scores = results.column(column).to_numpy()
        sort_indices = np.argsort(scores)
        if not ascending:
            sort_indices = sort_indices[::-1]
        ranks = np.empty_like(sort_indices)
        ranks[sort_indices] = np.arange(len(scores)) + 1
        # replace the requested column with the ranks
        _score_idx = results.column_names.index(column)
        results = results.set_column(
            _score_idx, column, pa.array(ranks, type=pa.float32())
        )
        return results

    @staticmethod
    def _normalize_scores(scores: pa.Array, invert=False) -> pa.Array:
        if len(scores) == 0:
            return scores
        # normalize the scores by subtracting the min and dividing by the range
        min_val, max_val = pc.min_max(scores).values()
        rng = pc.subtract(max_val, min_val)

        if not pc.equal(rng, pa.scalar(0.0)).as_py():
            scores = pc.divide(pc.subtract(scores, min_val), rng)
        elif not pc.equal(max_val, pa.scalar(0.0)).as_py():
            # If rng is 0, then we at least want the scores to be 0
            scores = pc.subtract(scores, min_val)

        if invert:
            scores = pc.subtract(1, scores)

        return scores

    def rerank(
        self,
        reranker: Reranker = RRFReranker(),
        normalize: str = "score",
    ) -> LanceHybridQueryBuilder:
        """
        Rerank the hybrid search results using the specified reranker. The reranker
        must be an instance of Reranker class.

        Parameters
        ----------
        reranker: Reranker, default RRFReranker()
            The reranker to use. Must be an instance of Reranker class.
        normalize: str, default "score"
            The method to normalize the scores. Can be "rank" or "score". If "rank",
            the scores are converted to ranks and then normalized. If "score", the
            scores are normalized directly.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        if normalize not in ["rank", "score"]:
            raise ValueError("normalize must be 'rank' or 'score'.")
        if reranker and not isinstance(reranker, Reranker):
            raise ValueError("reranker must be an instance of Reranker class.")

        self._norm = normalize
        self._reranker = reranker

        return self

    def nprobes(self, nprobes: int) -> LanceHybridQueryBuilder:
        """
        Set the number of probes to use for vector search.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        Parameters
        ----------
        nprobes: int
            The number of probes to use.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._nprobes = nprobes
        return self

    def distance_range(
        self, lower_bound: Optional[float] = None, upper_bound: Optional[float] = None
    ) -> LanceHybridQueryBuilder:
        """
        Set the distance range to use.

        Only rows with distances within range [lower_bound, upper_bound)
        will be returned.

        Parameters
        ----------
        lower_bound: Optional[float]
            The lower bound of the distance range.
        upper_bound: Optional[float]
            The upper bound of the distance range.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._lower_bound = lower_bound
        self._upper_bound = upper_bound
        return self

    def ef(self, ef: int) -> LanceHybridQueryBuilder:
        """
        Set the number of candidates to consider during search.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        This only applies to the HNSW-related index.
        The default value is 1.5 * limit.

        Parameters
        ----------
        ef: int
            The number of candidates to consider during search.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._ef = ef
        return self

    def metric(self, metric: Literal["l2", "cosine", "dot"]) -> LanceHybridQueryBuilder:
        """Set the distance metric to use.

        This is an alias for distance_type() and may be deprecated in the future.
        Please use distance_type() instead.

        Parameters
        ----------
        metric: "l2" or "cosine" or "dot"
            The distance metric to use. By default "l2" is used.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        return self.distance_type(metric)

    def distance_type(
        self, distance_type: Literal["l2", "cosine", "dot"]
    ) -> "LanceHybridQueryBuilder":
        """Set the distance metric to use.

        When performing a vector search we try to find the "nearest" vectors according
        to some kind of distance metric. This parameter controls which distance metric
        to use.

        Note: if there is a vector index then the distance type used MUST match the
        distance type used to train the vector index. If this is not done then the
        results will be invalid.

        Parameters
        ----------
        distance_type: "l2" or "cosine" or "dot"
            The distance metric to use. By default "l2" is used.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._distance_type = distance_type.lower()
        return self

    def refine_factor(self, refine_factor: int) -> LanceHybridQueryBuilder:
        """
        Refine the vector search results by reading extra elements and
        re-ranking them in memory.

        Parameters
        ----------
        refine_factor: int
            The refine factor to use.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._refine_factor = refine_factor
        return self

    def vector(self, vector: Union[np.ndarray, list]) -> LanceHybridQueryBuilder:
        self._vector = vector
        return self

    def text(self, text: str | FullTextQuery) -> LanceHybridQueryBuilder:
        self._text = text
        return self

    def bypass_vector_index(self) -> LanceHybridQueryBuilder:
        """
        If this is called then any vector index is skipped

        An exhaustive (flat) search will be performed.  The query vector will
        be compared to every vector in the table.  At high scales this can be
        expensive.  However, this is often still useful.  For example, skipping
        the vector index can give you ground truth results which you can use to
        calculate your recall to select an appropriate value for nprobes.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._bypass_vector_index = True
        return self

    def explain_plan(self, verbose: Optional[bool] = False) -> str:
        """Return the execution plan for this query.

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
        >>> query = [100, 100]
        >>> plan = table.search(query).explain_plan(True)
        >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
        GlobalLimitExec: skip=0, fetch=10
          FilterExec: _distance@2 IS NOT NULL
            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
              KNNVectorDistance: metric=l2
                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

        Parameters
        ----------
        verbose : bool, default False
            Use a verbose output format.

        Returns
        -------
        plan : str
        """  # noqa: E501
        self._create_query_builders()

        results = ["Vector Search Plan:"]
        results.append(
            self._table._explain_plan(
                self._vector_query.to_query_object(), verbose=verbose
            )
        )
        results.append("FTS Search Plan:")
        results.append(
            self._table._explain_plan(
                self._fts_query.to_query_object(), verbose=verbose
            )
        )
        return "\n".join(results)

    def analyze_plan(self):
        """Execute the query and return the execution plan with runtime metrics.

        Returns
        -------
        plan : str
        """
        self._create_query_builders()

        results = ["Vector Search Plan:"]
        results.append(self._table._analyze_plan(self._vector_query.to_query_object()))
        results.append("FTS Search Plan:")
        results.append(self._table._analyze_plan(self._fts_query.to_query_object()))
        return "\n".join(results)

    def _create_query_builders(self):
        """Set up and configure the vector and FTS query builders."""
        vector_query, fts_query = self._validate_query(
            self._query, self._vector, self._text
        )
        self._fts_query = LanceFtsQueryBuilder(
            self._table, fts_query, fts_columns=self._fts_columns
        )
        vector_query = self._query_to_vector(
            self._table, vector_query, self._vector_column
        )
        self._vector_query = LanceVectorQueryBuilder(
            self._table, vector_query, self._vector_column
        )

        # Apply common configurations
        if self._limit:
            self._vector_query.limit(self._limit)
            self._fts_query.limit(self._limit)
        if self._columns:
            self._vector_query.select(self._columns)
            self._fts_query.select(self._columns)
        if self._where:
            self._vector_query.where(self._where, self._postfilter)
            self._fts_query.where(self._where, self._postfilter)
        if self._with_row_id:
            self._vector_query.with_row_id(True)
            self._fts_query.with_row_id(True)
        if self._phrase_query:
            self._fts_query.phrase_query(True)
        if self._distance_type:
            self._vector_query.metric(self._distance_type)
        if self._nprobes:
            self._vector_query.nprobes(self._nprobes)
        if self._refine_factor:
            self._vector_query.refine_factor(self._refine_factor)
        if self._ef:
            self._vector_query.ef(self._ef)
        if self._bypass_vector_index:
            self._vector_query.bypass_vector_index()
        if self._lower_bound or self._upper_bound:
            self._vector_query.distance_range(
                lower_bound=self._lower_bound, upper_bound=self._upper_bound
            )

        if self._reranker is None:
            self._reranker = RRFReranker()

phrase_query

phrase_query(phrase_query: bool = None) -> LanceHybridQueryBuilder

Set whether to use phrase query.

Parameters:

  • phrase_query (bool, default: None ) –

    If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def phrase_query(self, phrase_query: bool = None) -> LanceHybridQueryBuilder:
    """Set whether to use phrase query.

    Parameters
    ----------
    phrase_query: bool, default None
        If True, then the query will be wrapped in quotes and
        double quotes replaced by single quotes.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._phrase_query = phrase_query
    return self
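The transformation described for phrase queries (wrap the query in quotes and replace embedded double quotes with single quotes) can be sketched in plain Python as follows. The helper name `as_phrase_query` is hypothetical and this is only a reading of the docstring, not the code path the library runs.

```python
def as_phrase_query(query: str) -> str:
    """Wrap a query in double quotes, turning embedded double quotes into single quotes.

    Hypothetical helper that mirrors the behaviour described in the docstring.
    """
    return '"' + query.replace('"', "'") + '"'


print(as_phrase_query('the "quick" fox'))  # "the 'quick' fox"
```

Replacing the inner double quotes keeps the wrapped string a single, well-delimited phrase.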

rerank

rerank(reranker: Reranker = RRFReranker(), normalize: str = 'score') -> LanceHybridQueryBuilder

Rerank the hybrid search results using the specified reranker. The reranker must be an instance of Reranker class.

Parameters:

  • reranker (Reranker, default: RRFReranker() ) –

    The reranker to use. Must be an instance of Reranker class.

  • normalize (str, default: 'score' ) –

    The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def rerank(
    self,
    reranker: Reranker = RRFReranker(),
    normalize: str = "score",
) -> LanceHybridQueryBuilder:
    """
    Rerank the hybrid search results using the specified reranker. The reranker
    must be an instance of Reranker class.

    Parameters
    ----------
    reranker: Reranker, default RRFReranker()
        The reranker to use. Must be an instance of Reranker class.
    normalize: str, default "score"
        The method to normalize the scores. Can be "rank" or "score". If "rank",
        the scores are converted to ranks and then normalized. If "score", the
        scores are normalized directly.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    if normalize not in ["rank", "score"]:
        raise ValueError("normalize must be 'rank' or 'score'.")
    if reranker and not isinstance(reranker, Reranker):
        raise ValueError("reranker must be an instance of Reranker class.")

    self._norm = normalize
    self._reranker = reranker

    return self
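To make the difference between the two normalize modes concrete, here is a minimal plain-Python sketch of min-max normalization and rank conversion. It follows the logic of the `_normalize_scores` and `_rank` helpers in the class listing but works on Python lists rather than Arrow arrays; the function names `min_max_normalize` and `to_ranks` and the sample distances are assumptions for illustration.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1] by subtracting the min and dividing by the range."""
    if not values:
        return values
    lo, hi = min(values), max(values)
    rng = hi - lo
    if rng != 0:
        return [(v - lo) / rng for v in values]
    if hi != 0:
        # All values are equal and non-zero: map everything to 0.
        return [v - lo for v in values]
    return values


def to_ranks(values: List[float]) -> List[float]:
    """Replace each value by its 1-based rank in ascending order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks


distances = [0.2, 5.0, 0.4]
by_score = min_max_normalize(distances)           # outlier 5.0 squeezes 0.4 towards 0
by_rank = min_max_normalize(to_ranks(distances))  # evenly spaced: [0.0, 1.0, 0.5]
print(by_score)
print(by_rank)
```

With "score", a single outlier compresses the remaining values; with "rank", only the ordering matters, so the normalized values are evenly spaced.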

nprobes

nprobes(nprobes: int) -> LanceHybridQueryBuilder

Set the number of probes to use for vector search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

Parameters:

  • nprobes (int) –

    The number of probes to use.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def nprobes(self, nprobes: int) -> LanceHybridQueryBuilder:
    """
    Set the number of probes to use for vector search.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    Parameters
    ----------
    nprobes: int
        The number of probes to use.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._nprobes = nprobes
    return self

distance_range

distance_range(lower_bound: Optional[float] = None, upper_bound: Optional[float] = None) -> LanceHybridQueryBuilder

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default: None ) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default: None ) –

    The upper bound of the distance range.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def distance_range(
    self, lower_bound: Optional[float] = None, upper_bound: Optional[float] = None
) -> LanceHybridQueryBuilder:
    """
    Set the distance range to use.

    Only rows with distances within range [lower_bound, upper_bound)
    will be returned.

    Parameters
    ----------
    lower_bound: Optional[float]
        The lower bound of the distance range.
    upper_bound: Optional[float]
        The upper bound of the distance range.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._lower_bound = lower_bound
    self._upper_bound = upper_bound
    return self
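The range [lower_bound, upper_bound) is half-open: the lower bound is included and the upper bound is excluded. A minimal sketch of that membership test, with a hypothetical helper `in_distance_range` and made-up distances, assuming that a bound of None means "unbounded on that side":

```python
from typing import Optional


def in_distance_range(
    distance: float,
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> bool:
    """Return True if distance lies in [lower_bound, upper_bound).

    Hypothetical helper; treating None as "no bound" is an assumption.
    """
    if lower_bound is not None and distance < lower_bound:
        return False
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True


distances = [0.0, 0.5, 1.0, 1.5]
kept = [d for d in distances if in_distance_range(d, lower_bound=0.5, upper_bound=1.5)]
print(kept)  # [0.5, 1.0]
```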

ef

ef(ef: int) -> LanceHybridQueryBuilder

Set the number of candidates to consider during search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

This only applies to the HNSW-related index. The default value is 1.5 * limit.

Parameters:

  • ef (int) –

    The number of candidates to consider during search.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def ef(self, ef: int) -> LanceHybridQueryBuilder:
    """
    Set the number of candidates to consider during search.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    This only applies to the HNSW-related index.
    The default value is 1.5 * limit.

    Parameters
    ----------
    ef: int
        The number of candidates to consider during search.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._ef = ef
    return self

metric

metric(metric: Literal['l2', 'cosine', 'dot']) -> LanceHybridQueryBuilder

Set the distance metric to use.

This is an alias for distance_type() and may be deprecated in the future. Please use distance_type() instead.

Parameters:

  • metric (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def metric(self, metric: Literal["l2", "cosine", "dot"]) -> LanceHybridQueryBuilder:
    """Set the distance metric to use.

    This is an alias for distance_type() and may be deprecated in the future.
    Please use distance_type() instead.

    Parameters
    ----------
    metric: "l2" or "cosine" or "dot"
        The distance metric to use. By default "l2" is used.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    return self.distance_type(metric)

distance_type

distance_type(distance_type: Literal['l2', 'cosine', 'dot']) -> 'LanceHybridQueryBuilder'

Set the distance metric to use.

When performing a vector search we try to find the "nearest" vectors according to a distance metric. This parameter controls which distance metric to use.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot']) –

    The distance metric to use. By default "l2" is used.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def distance_type(
    self, distance_type: Literal["l2", "cosine", "dot"]
) -> "LanceHybridQueryBuilder":
    """Set the distance metric to use.

    When performing a vector search we try to find the "nearest" vectors according
    to some kind of distance metric. This parameter controls which distance metric
    to use.

    Note: if there is a vector index then the distance type used MUST match the
    distance type used to train the vector index. If this is not done then the
    results will be invalid.

    Parameters
    ----------
    distance_type: "l2" or "cosine" or "dot"
        The distance metric to use. By default "l2" is used.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._distance_type = distance_type.lower()
    return self
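For orientation, the three metric names correspond to the following textbook quantities. This is a plain-Python sketch with hypothetical function names; the values the library actually reports in `_distance` may be scaled or offset differently (for example, a squared L2 distance), so treat these as definitions of the concepts rather than of the library output.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product; larger means more similar for comparable vector lengths."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity; 0 for vectors pointing the same way."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)


print(l2_distance([0.0, 0.0], [3.0, 4.0]))      # 5.0
print(dot_product([1.0, 2.0], [3.0, 4.0]))      # 11.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Because the metrics order neighbours differently, searching with a metric other than the one the index was trained with gives invalid results, as the note above states.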

refine_factor

refine_factor(refine_factor: int) -> LanceHybridQueryBuilder

Refine the vector search results by reading extra elements and re-ranking them in memory.

Parameters:

  • refine_factor (int) –

    The refine factor to use.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def refine_factor(self, refine_factor: int) -> LanceHybridQueryBuilder:
    """
    Refine the vector search results by reading extra elements and
    re-ranking them in memory.

    Parameters
    ----------
    refine_factor: int
        The refine factor to use.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._refine_factor = refine_factor
    return self
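The idea of "reading extra elements and re-ranking them in memory" can be sketched as an over-fetch followed by an exact re-sort. The example below is a conceptual illustration only: the helper `refine`, the assumption that `limit * refine_factor` candidates are fetched, and the sample distances are all made up for this sketch and are not taken from the library's implementation.

```python
from typing import Dict, List


def refine(
    approx: Dict[int, float],
    exact: Dict[int, float],
    limit: int,
    refine_factor: int,
) -> List[int]:
    """Over-fetch by approximate distance, then re-rank by exact distance.

    Hypothetical helper: `approx` and `exact` map row ids to distances.
    """
    # Step 1: take limit * refine_factor candidates using the cheap distances.
    candidates = sorted(approx, key=lambda rid: approx[rid])[: limit * refine_factor]
    # Step 2: re-rank only those candidates with the exact distances.
    return sorted(candidates, key=lambda rid: exact[rid])[:limit]


approx = {1: 0.10, 2: 0.12, 3: 0.30, 4: 0.50}  # e.g. from compressed vectors
exact = {1: 0.20, 2: 0.11, 3: 0.05, 4: 0.60}   # from the full vectors

print(refine(approx, exact, limit=2, refine_factor=1))  # [2, 1]
print(refine(approx, exact, limit=2, refine_factor=2))  # [3, 2]
```

With a factor of 1, row 3 never becomes a candidate even though it is the true nearest neighbour; a larger factor gives the exact re-ranking a chance to recover it, at the cost of reading more rows.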

bypass_vector_index

bypass_vector_index() -> LanceHybridQueryBuilder

If this is called, any vector index is skipped.

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Returns:

  • LanceHybridQueryBuilder –

    The LanceHybridQueryBuilder object.

Source code in lancedb/query.py
def bypass_vector_index(self) -> LanceHybridQueryBuilder:
    """
    If this is called then any vector index is skipped

    An exhaustive (flat) search will be performed.  The query vector will
    be compared to every vector in the table.  At high scales this can be
    expensive.  However, this is often still useful.  For example, skipping
    the vector index can give you ground truth results which you can use to
    calculate your recall to select an appropriate value for nprobes.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._bypass_vector_index = True
    return self
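The recall calculation mentioned above can be sketched as the fraction of ground-truth row ids that the indexed search also returned. The helper `recall_at_k` and the id lists are hypothetical; in practice the two lists would come from running the same query once with and once without `bypass_vector_index()`.

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of ground-truth ids that also appear in the approximate result."""
    if not exact_ids:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


exact_ids = [1, 2, 3, 4]   # assumed result of the exhaustive (flat) search
approx_ids = [1, 2, 3, 5]  # assumed result of the indexed search
print(recall_at_k(approx_ids, exact_ids))  # 0.75
```

Repeating the measurement for several nprobes values shows where additional probes stop improving recall.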

explain_plan

explain_plan(verbose: Optional[bool] = False) -> str

Return the execution plan for this query.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).explain_plan(True)
>>> print(plan)
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
GlobalLimitExec: skip=0, fetch=10
  FilterExec: _distance@2 IS NOT NULL
    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
      KNNVectorDistance: metric=l2
        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default: False ) –

    Use a verbose output format.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
def explain_plan(self, verbose: Optional[bool] = False) -> str:
    """Return the execution plan for this query.

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
    >>> query = [100, 100]
    >>> plan = table.search(query).explain_plan(True)
    >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
    GlobalLimitExec: skip=0, fetch=10
      FilterExec: _distance@2 IS NOT NULL
        SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
          KNNVectorDistance: metric=l2
            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    self._create_query_builders()

    results = ["Vector Search Plan:"]
    results.append(
        self._table._explain_plan(
            self._vector_query.to_query_object(), verbose=verbose
        )
    )
    results.append("FTS Search Plan:")
    results.append(
        self._table._explain_plan(
            self._fts_query.to_query_object(), verbose=verbose
        )
    )
    return "\n".join(results)
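As the listing shows, the returned string contains two sections, headed "Vector Search Plan:" and "FTS Search Plan:", joined by newlines. A small sketch of separating them again; the helper `split_hybrid_plan` and the placeholder plan text are assumptions for illustration, and only the two header strings come from the source.

```python
from typing import Dict

VECTOR_HEADER = "Vector Search Plan:"
FTS_HEADER = "FTS Search Plan:"


def split_hybrid_plan(plan: str) -> Dict[str, str]:
    """Split a combined hybrid plan string into its vector and FTS parts."""
    vector_part, _, fts_part = plan.partition(FTS_HEADER)
    return {
        "vector": vector_part.replace(VECTOR_HEADER, "", 1).strip(),
        "fts": fts_part.strip(),
    }


# Placeholder text standing in for real plan output.
sample_plan = "\n".join(
    [VECTOR_HEADER, "<vector plan lines>", FTS_HEADER, "<fts plan lines>"]
)
parts = split_hybrid_plan(sample_plan)
print(parts["vector"])  # <vector plan lines>
print(parts["fts"])     # <fts plan lines>
```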

analyze_plan

analyze_plan()

Execute the query and return the plan annotated with runtime metrics.

Returns:

  • plan ( str ) –
Source code in lancedb/query.py
def analyze_plan(self):
    """Execute the query and display with runtime metrics.

    Returns
    -------
    plan : str
    """
    self._create_query_builders()

    results = ["Vector Search Plan:"]
    results.append(self._table._analyze_plan(self._vector_query.to_query_object()))
    results.append("FTS Search Plan:")
    results.append(self._table._analyze_plan(self._fts_query.to_query_object()))
    return "\n".join(results)

Embeddings

lancedb.embeddings.registry.EmbeddingFunctionRegistry

This is a singleton class used to register embedding functions and fetch them by name. It also handles serializing and deserializing. You can implement your own embedding function by subclassing EmbeddingFunction or TextEmbeddingFunction and registering it with the registry.

NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array, pa.ChunkedArray, np.ndarray]

Examples:

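>>> import numpy as np
>>> from lancedb.embeddings.base import EmbeddingFunction
>>> from lancedb.embeddings.registry import EmbeddingFunctionRegistry
>>> registry = EmbeddingFunctionRegistry.get_instance()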
>>> @registry.register("my-embedding-function")
... class MyEmbeddingFunction(EmbeddingFunction):
...     def ndims(self) -> int:
...         return 128
...
...     def compute_query_embeddings(self, query: str, *args, **kwargs):
...         return self.compute_source_embeddings(query, *args, **kwargs)
...
...     def compute_source_embeddings(self, texts, *args, **kwargs):
...         return [np.random.rand(self.ndims()) for _ in range(len(texts))]
...
>>> registry.get("my-embedding-function")
<class 'lancedb.embeddings.registry.MyEmbeddingFunction'>
Source code in lancedb/embeddings/registry.py
class EmbeddingFunctionRegistry:
    """
    This is a singleton class used to register embedding functions
    and fetch them by name. It also handles serializing and deserializing.
    You can implement your own embedding function by subclassing EmbeddingFunction
    or TextEmbeddingFunction and registering it with the registry.

    NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array,
          pa.ChunkedArray, np.ndarray]

    Examples
    --------
    >>> registry = EmbeddingFunctionRegistry.get_instance()
    >>> @registry.register("my-embedding-function")
    ... class MyEmbeddingFunction(EmbeddingFunction):
    ...     def ndims(self) -> int:
    ...         return 128
    ...
    ...     def compute_query_embeddings(self, query: str, *args, **kwargs):
    ...         return self.compute_source_embeddings(query, *args, **kwargs)
    ...
    ...     def compute_source_embeddings(self, texts, *args, **kwargs):
    ...         return [np.random.rand(self.ndims()) for _ in range(len(texts))]
    ...
    >>> registry.get("my-embedding-function")
    <class 'lancedb.embeddings.registry.MyEmbeddingFunction'>
    """

    @classmethod
    def get_instance(cls):
        return __REGISTRY__

    def __init__(self):
        self._functions = {}
        self._variables = {}

    def register(self, alias: str = None):
        """
        This creates a decorator that can be used to register
        an EmbeddingFunction.

        Parameters
        ----------
        alias : Optional[str]
            a human friendly name for the embedding function. If not
            provided, the class name will be used.
        """

        # This is a decorator for a class that inherits from BaseModel
        # It adds the class to the registry
        def decorator(cls):
            if not issubclass(cls, EmbeddingFunction):
                raise TypeError("Must be a subclass of EmbeddingFunction")
            if cls.__name__ in self._functions:
                raise KeyError(f"{cls.__name__} was already registered")
            key = alias or cls.__name__
            self._functions[key] = cls
            cls.__embedding_function_registry_alias__ = alias
            return cls

        return decorator

    def reset(self):
        """
        Reset the registry to its initial state
        """
        self._functions = {}

    def get(self, name: str):
        """
        Fetch an embedding function class by name

        Parameters
        ----------
        name : str
            The name of the embedding function to fetch
            Either the alias or the class name if no alias was provided
            during registration
        """
        return self._functions[name]

    def parse_functions(
        self, metadata: Optional[Dict[bytes, bytes]]
    ) -> Dict[str, "EmbeddingFunctionConfig"]:
        """
        Parse the metadata from an arrow table and
        return a mapping of the vector column to the
        embedding function and source column

        Parameters
        ----------
        metadata : Optional[Dict[bytes, bytes]]
            The metadata from an arrow table. Note that
            the keys and values are bytes (pyarrow api)

        Returns
        -------
        functions : dict
            A mapping of vector column name to embedding function.
            An empty dict is returned if input is None or does not
            contain b"embedding_functions".
        """
        if metadata is None:
            return {}
        # Look at both bytes and string keys, since we might use either
        serialized = metadata.get(
            b"embedding_functions", metadata.get("embedding_functions")
        )
        if serialized is None:
            return {}
        raw_list = json.loads(serialized.decode("utf-8"))
        return {
            obj["vector_column"]: EmbeddingFunctionConfig(
                vector_column=obj["vector_column"],
                source_column=obj["source_column"],
                function=self.get(obj["name"])(**obj["model"]),
            )
            for obj in raw_list
        }

    def function_to_metadata(self, conf: "EmbeddingFunctionConfig"):
        """
        Convert the given embedding function and source / vector column configs
        into a config dictionary that can be serialized into arrow metadata
        """
        func = conf.function
        name = getattr(
            func, "__embedding_function_registry_alias__", func.__class__.__name__
        )
        json_data = func.safe_model_dump()
        return {
            "name": name,
            "model": json_data,
            "source_column": conf.source_column,
            "vector_column": conf.vector_column,
        }

    def get_table_metadata(self, func_list):
        """
        Convert a list of embedding functions and source / vector configs
        into a config dictionary that can be serialized into arrow metadata
        """
        if func_list is None or len(func_list) == 0:
            return None
        json_data = [self.function_to_metadata(func) for func in func_list]
        # Note that metadata dictionary values must be bytes
        # so we need to json dump then utf8 encode
        metadata = json.dumps(json_data, indent=2).encode("utf-8")
        return {"embedding_functions": metadata}

    def set_var(self, name: str, value: str) -> None:
        """
        Set a variable. These can be accessed in embedding configuration using
        the syntax `$var:variable_name`. If they are not set, an error will be
        thrown letting you know which variable is missing. If you want to supply
        a default value, you can add an additional part in the configuration
        like so: `$var:variable_name:default_value`. Default values can be
        used for runtime configurations that are not sensitive, such as
        whether to use a GPU for inference.

        The name must not contain a colon. Default values can contain colons.
        """
        if ":" in name:
            raise ValueError("Variable names cannot contain colons")
        self._variables[name] = value

    def get_var(self, name: str) -> str:
        """
        Get a variable.
        """
        return self._variables[name]

register

register(alias: str = None)

This creates a decorator that can be used to register an EmbeddingFunction.

Parameters:

  • alias (Optional[str], default: None ) –

    A human-friendly name for the embedding function. If not provided, the class name is used.

Source code in lancedb/embeddings/registry.py
def register(self, alias: str = None):
    """
    This creates a decorator that can be used to register
    an EmbeddingFunction.

    Parameters
    ----------
    alias : Optional[str]
        a human friendly name for the embedding function. If not
        provided, the class name will be used.
    """

    # This is a decorator for a class that inherits from BaseModel
    # It adds the class to the registry
    def decorator(cls):
        if not issubclass(cls, EmbeddingFunction):
            raise TypeError("Must be a subclass of EmbeddingFunction")
        if cls.__name__ in self._functions:
            raise KeyError(f"{cls.__name__} was already registered")
        key = alias or cls.__name__
        self._functions[key] = cls
        cls.__embedding_function_registry_alias__ = alias
        return cls

    return decorator
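
The decorator pattern in the listing above can be illustrated with a standard-library-only sketch. `MiniRegistry` is a hypothetical stand-in that mirrors only the alias-or-class-name keying and the duplicate check shown in the source; it omits the `EmbeddingFunction` subclass check and is not the lancedb class.

```python
class MiniRegistry:
    """Hypothetical stand-in for the registry's register/get behaviour."""

    def __init__(self):
        self._functions = {}

    def register(self, alias=None):
        # Returns a class decorator that stores the class under its key.
        def decorator(cls):
            if cls.__name__ in self._functions:
                raise KeyError(f"{cls.__name__} was already registered")
            key = alias or cls.__name__  # fall back to the class name
            self._functions[key] = cls
            return cls

        return decorator

    def get(self, name):
        return self._functions[name]


registry = MiniRegistry()


@registry.register("my-embedding-function")
class WithAlias:
    pass


@registry.register()
class WithoutAlias:
    pass


print(registry.get("my-embedding-function").__name__)
print(registry.get("WithoutAlias").__name__)
```

A class registered with an alias is reachable only through that alias, while a class registered without one is looked up by its class name.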

reset

reset()

Reset the registry by clearing all registered embedding functions. Variables set with set_var are not cleared.

Source code in lancedb/embeddings/registry.py
def reset(self):
    """
    Reset the registry to its initial state
    """
    self._functions = {}

get

get(name: str)

Fetch an embedding function class by name

Parameters:

  • name (str) –

    The name of the embedding function to fetch: either the alias, or the class name if no alias was provided during registration.

Source code in lancedb/embeddings/registry.py
def get(self, name: str):
    """
    Fetch an embedding function class by name

    Parameters
    ----------
    name : str
        The name of the embedding function to fetch
        Either the alias or the class name if no alias was provided
        during registration
    """
    return self._functions[name]

parse_functions

parse_functions(metadata: Optional[Dict[bytes, bytes]]) -> Dict[str, EmbeddingFunctionConfig]

Parse the metadata from an arrow table and return a mapping of the vector column to the embedding function and source column

Parameters:

  • metadata (Optional[Dict[bytes, bytes]]) –

    The metadata from an arrow table. Note that the keys and values are bytes (pyarrow api)

Returns:

  • functions ( dict ) –

    A mapping of vector column name to embedding function. An empty dict is returned if input is None or does not contain b"embedding_functions".

Source code in lancedb/embeddings/registry.py
def parse_functions(
    self, metadata: Optional[Dict[bytes, bytes]]
) -> Dict[str, "EmbeddingFunctionConfig"]:
    """
    Parse the metadata from an arrow table and
    return a mapping of the vector column to the
    embedding function and source column

    Parameters
    ----------
    metadata : Optional[Dict[bytes, bytes]]
        The metadata from an arrow table. Note that
        the keys and values are bytes (pyarrow api)

    Returns
    -------
    functions : dict
        A mapping of vector column name to embedding function.
        An empty dict is returned if input is None or does not
        contain b"embedding_functions".
    """
    if metadata is None:
        return {}
    # Look at both bytes and string keys, since we might use either
    serialized = metadata.get(
        b"embedding_functions", metadata.get("embedding_functions")
    )
    if serialized is None:
        return {}
    raw_list = json.loads(serialized.decode("utf-8"))
    return {
        obj["vector_column"]: EmbeddingFunctionConfig(
            vector_column=obj["vector_column"],
            source_column=obj["source_column"],
            function=self.get(obj["name"])(**obj["model"]),
        )
        for obj in raw_list
    }

function_to_metadata

function_to_metadata(conf: EmbeddingFunctionConfig)

Convert the given embedding function and source / vector column configs into a config dictionary that can be serialized into arrow metadata

Source code in lancedb/embeddings/registry.py
def function_to_metadata(self, conf: "EmbeddingFunctionConfig"):
    """
    Convert the given embedding function and source / vector column configs
    into a config dictionary that can be serialized into arrow metadata
    """
    func = conf.function
    name = getattr(
        func, "__embedding_function_registry_alias__", func.__class__.__name__
    )
    json_data = func.safe_model_dump()
    return {
        "name": name,
        "model": json_data,
        "source_column": conf.source_column,
        "vector_column": conf.vector_column,
    }

get_table_metadata

get_table_metadata(func_list)

Convert a list of embedding functions and source / vector configs into a config dictionary that can be serialized into arrow metadata

Source code in lancedb/embeddings/registry.py
def get_table_metadata(self, func_list):
    """
    Convert a list of embedding functions and source / vector configs
    into a config dictionary that can be serialized into arrow metadata
    """
    if func_list is None or len(func_list) == 0:
        return None
    json_data = [self.function_to_metadata(func) for func in func_list]
    # Note that metadata dictionary values must be bytes
    # so we need to json dump then utf8 encode
    metadata = json.dumps(json_data, indent=2).encode("utf-8")
    return {"embedding_functions": metadata}
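
The serialization path (JSON dump, then UTF-8 encode) and the matching parse step can be sketched with plain dictionaries. This is a simplified illustration: `to_table_metadata` and `from_table_metadata` are hypothetical names, the config values are made up, and the parse step returns the raw list rather than `EmbeddingFunctionConfig` objects.

```python
import json


def to_table_metadata(configs):
    """Serialize config dicts; metadata values must be bytes."""
    if not configs:
        return None
    return {"embedding_functions": json.dumps(configs, indent=2).encode("utf-8")}


def from_table_metadata(metadata):
    """Parse configs back, accepting either a bytes or a str key."""
    if metadata is None:
        return []
    serialized = metadata.get(
        b"embedding_functions", metadata.get("embedding_functions")
    )
    if serialized is None:
        return []
    return json.loads(serialized.decode("utf-8"))


# Illustrative config; the values here are assumptions, not real table metadata.
configs = [
    {
        "name": "my-embedding-function",
        "model": {"max_retries": 7},
        "source_column": "text",
        "vector_column": "vector",
    }
]
metadata = to_table_metadata(configs)
restored = from_table_metadata(metadata)
# Arrow-style metadata uses bytes keys; the parser accepts those as well.
restored_bytes = from_table_metadata(
    {b"embedding_functions": metadata["embedding_functions"]}
)
print(restored == configs, restored_bytes == configs)
```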

set_var

set_var(name: str, value: str) -> None

Set a variable. These can be accessed in embedding configuration using the syntax $var:variable_name. If they are not set, an error will be thrown letting you know which variable is missing. If you want to supply a default value, you can add an additional part in the configuration like so: $var:variable_name:default_value. Default values can be used for runtime configurations that are not sensitive, such as whether to use a GPU for inference.

The name must not contain a colon. Default values can contain colons.

Source code in lancedb/embeddings/registry.py
def set_var(self, name: str, value: str) -> None:
    """
    Set a variable. These can be accessed in embedding configuration using
    the syntax `$var:variable_name`. If they are not set, an error will be
    thrown letting you know which variable is missing. If you want to supply
    a default value, you can add an additional part in the configuration
    like so: `$var:variable_name:default_value`. Default values can be
    used for runtime configurations that are not sensitive, such as
    whether to use a GPU for inference.

    The name must not contain a colon. Default values can contain colons.
    """
    if ":" in name:
        raise ValueError("Variable names cannot contain colons")
    self._variables[name] = value
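
To make the `$var:variable_name` and `$var:variable_name:default_value` syntax concrete, here is a small sketch of how such a reference could be resolved. `resolve` and the `variables` dictionary are hypothetical stand-ins for the registry lookup; only the syntax rules come from the description above.

```python
def resolve(value, variables):
    """Resolve a "$var:name" or "$var:name:default" reference."""
    prefix = "$var:"
    if not (isinstance(value, str) and value.startswith(prefix)):
        return value  # plain values pass through unchanged
    # Split on the first colon only, so defaults may contain colons.
    name, sep, default = value[len(prefix):].partition(":")
    if name in variables:
        return variables[name]
    if sep:
        return default
    raise ValueError(f"Variable '{name}' not found in registry")


variables = {"api_key": "example-key"}

print(resolve("$var:api_key", variables))
print(resolve("$var:device:cpu", variables))
print(resolve("$var:endpoint:http://localhost:8080", variables))
```

Because only the first colon separates the name from the default, a default such as a URL survives intact.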

get_var

get_var(name: str) -> str

Get a variable.

Source code in lancedb/embeddings/registry.py
def get_var(self, name: str) -> str:
    """
    Get a variable.
    """
    return self._variables[name]

lancedb.embeddings.base.EmbeddingFunctionConfig

Bases: BaseModel

This model encapsulates the configuration for an embedding function in a lancedb table. It holds the embedding function, the source column, and the vector column.

Source code in lancedb/embeddings/base.py
class EmbeddingFunctionConfig(BaseModel):
    """
    This model encapsulates the configuration for a embedding function
    in a lancedb table. It holds the embedding function, the source column,
    and the vector column
    """

    vector_column: str
    source_column: str
    function: EmbeddingFunction

lancedb.embeddings.base.EmbeddingFunction

Bases: BaseModel, ABC

An ABC for embedding functions.

All concrete embedding functions must implement the following methods:

  1. compute_query_embeddings(), which takes a query and returns a list of embeddings.
  2. compute_source_embeddings(), which returns a list of embeddings for the source column.
  3. ndims(), which returns the number of dimensions of the vector column.

For text data, the first two will be the same. For multi-modal data, the source column might be images and the vector column might be text.
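
The three-method contract can be illustrated without lancedb or pydantic. In the sketch below, `EmbeddingContract` and `CharCountEmbedding` are hypothetical names, and the letter-counting "embedding" is a toy chosen only so the output is easy to verify.

```python
from abc import ABC, abstractmethod


class EmbeddingContract(ABC):
    """Hypothetical stand-in listing the three required methods."""

    @abstractmethod
    def compute_query_embeddings(self, query): ...

    @abstractmethod
    def compute_source_embeddings(self, texts): ...

    @abstractmethod
    def ndims(self): ...


class CharCountEmbedding(EmbeddingContract):
    """Toy embedding: counts of the letters a, b and c in each text."""

    LETTERS = "abc"

    def ndims(self):
        return len(self.LETTERS)

    def compute_source_embeddings(self, texts):
        return [[float(t.count(ch)) for ch in self.LETTERS] for t in texts]

    def compute_query_embeddings(self, query):
        # For text data, queries are embedded the same way as sources.
        return self.compute_source_embeddings([query])


func = CharCountEmbedding()
vectors = func.compute_source_embeddings(["abba", "cab"])
print(vectors)
```

Leaving out any of the three methods makes the subclass abstract, so instantiating it raises a `TypeError`.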

Source code in lancedb/embeddings/base.py
class EmbeddingFunction(BaseModel, ABC):
    """
    An ABC for embedding functions.

    All concrete embedding functions must implement the following methods:
    1. compute_query_embeddings() which takes a query and returns a list of embeddings
    2. compute_source_embeddings() which returns a list of embeddings for
       the source column
    For text data, the two will be the same. For multi-modal data, the source column
    might be images and the vector column might be text.
    3. ndims() which returns the number of dimensions of the vector column
    """

    __slots__ = ("__weakref__",)  # pydantic 1.x compatibility
    max_retries: int = (
        7  # Setting 0 disables retires. Maybe this should not be enabled by default,
    )
    _ndims: int = PrivateAttr()
    _original_args: dict = PrivateAttr()

    @classmethod
    def create(cls, **kwargs):
        """
        Create an instance of the embedding function
        """
        resolved_kwargs = cls.__resolveVariables(kwargs)
        instance = cls(**resolved_kwargs)
        instance._original_args = kwargs
        return instance

    @classmethod
    def __resolveVariables(cls, args: dict) -> dict:
        """
        Resolve variables in the args
        """
        from .registry import EmbeddingFunctionRegistry

        new_args = copy.deepcopy(args)

        registry = EmbeddingFunctionRegistry.get_instance()
        sensitive_keys = cls.sensitive_keys()
        for k, v in new_args.items():
            if isinstance(v, str) and not v.startswith("$var:") and k in sensitive_keys:
                exc = ValueError(
                    f"Sensitive key '{k}' cannot be set to a hardcoded value"
                )
                add_note(exc, "Help: Use $var: to set sensitive keys to variables")
                raise exc

            if isinstance(v, str) and v.startswith("$var:"):
                parts = v[5:].split(":", maxsplit=1)
                if len(parts) == 1:
                    try:
                        new_args[k] = registry.get_var(parts[0])
                    except KeyError:
                        exc = ValueError(
                            "Variable '{}' not found in registry".format(parts[0])
                        )
                        add_note(
                            exc,
                            "Help: Variables are reset in new Python sessions. "
                            "Use `registry.set_var` to set variables.",
                        )
                        raise exc
                else:
                    name, default = parts
                    try:
                        new_args[k] = registry.get_var(name)
                    except KeyError:
                        new_args[k] = default
        return new_args

    @staticmethod
    def sensitive_keys() -> List[str]:
        """
        Return a list of keys that are sensitive and should not be allowed
        to be set to hardcoded values in the config. For example, API keys.
        """
        return []

    @abstractmethod
    def compute_query_embeddings(self, *args, **kwargs) -> list[Union[np.array, None]]:
        """
        Compute the embeddings for a given user query

        Returns
        -------
        A list of embeddings for each input. The embedding of each input can be None
        when the embedding is not valid.
        """
        pass

    @abstractmethod
    def compute_source_embeddings(self, *args, **kwargs) -> list[Union[np.array, None]]:
        """Compute the embeddings for the source column in the database

        Returns
        -------
        A list of embeddings for each input. The embedding of each input can be None
        when the embedding is not valid.
        """
        pass

    def compute_query_embeddings_with_retry(
        self, *args, **kwargs
    ) -> list[Union[np.array, None]]:
        """Compute the embeddings for a given user query with retries

        Returns
        -------
        A list of embeddings for each input. The embedding of each input can be None
        when the embedding is not valid.
        """
        return retry_with_exponential_backoff(
            self.compute_query_embeddings, max_retries=self.max_retries
        )(
            *args,
            **kwargs,
        )

    def compute_source_embeddings_with_retry(
        self, *args, **kwargs
    ) -> list[Union[np.array, None]]:
        """Compute the embeddings for the source column in the database with retries.

        Returns
        -------
        A list of embeddings for each input. The embedding of each input can be None
        when the embedding is not valid.
        """
        return retry_with_exponential_backoff(
            self.compute_source_embeddings, max_retries=self.max_retries
        )(*args, **kwargs)

    def sanitize_input(self, texts: TEXT) -> Union[List[str], np.ndarray]:
        """
        Sanitize the input to the embedding function.
        """
        if isinstance(texts, str):
            texts = [texts]
        elif isinstance(texts, pa.Array):
            texts = texts.to_pylist()
        elif isinstance(texts, pa.ChunkedArray):
            texts = texts.combine_chunks().to_pylist()
        return texts

    def safe_model_dump(self):
        if not hasattr(self, "_original_args"):
            raise ValueError(
                "EmbeddingFunction was not created with EmbeddingFunction.create()"
            )
        return self._original_args

    @abstractmethod
    def ndims(self) -> int:
        """
        Return the dimensions of the vector column
        """
        pass

    def SourceField(self, **kwargs):
        """
        Creates a pydantic Field that can automatically annotate
        the source column for this embedding function
        """
        return Field(json_schema_extra={"source_column_for": self}, **kwargs)

    def VectorField(self, **kwargs):
        """
        Creates a pydantic Field that can automatically annotate
        the target vector column for this embedding function
        """
        return Field(json_schema_extra={"vector_column_for": self}, **kwargs)

    def __eq__(self, __value: object) -> bool:
        if not hasattr(__value, "__dict__"):
            return False
        return vars(self) == vars(__value)

    def __hash__(self) -> int:
        return hash(frozenset(vars(self).items()))

create classmethod

create(**kwargs)

Create an instance of the embedding function

Source code in lancedb/embeddings/base.py
@classmethod
def create(cls, **kwargs):
    """
    Create an instance of the embedding function
    """
    resolved_kwargs = cls.__resolveVariables(kwargs)
    instance = cls(**resolved_kwargs)
    instance._original_args = kwargs
    return instance

__resolveVariables classmethod

__resolveVariables(args: dict) -> dict

Resolve variables in the args

Source code in lancedb/embeddings/base.py
@classmethod
def __resolveVariables(cls, args: dict) -> dict:
    """
    Resolve variables in the args
    """
    from .registry import EmbeddingFunctionRegistry

    new_args = copy.deepcopy(args)

    registry = EmbeddingFunctionRegistry.get_instance()
    sensitive_keys = cls.sensitive_keys()
    for k, v in new_args.items():
        if isinstance(v, str) and not v.startswith("$var:") and k in sensitive_keys:
            exc = ValueError(
                f"Sensitive key '{k}' cannot be set to a hardcoded value"
            )
            add_note(exc, "Help: Use $var: to set sensitive keys to variables")
            raise exc

        if isinstance(v, str) and v.startswith("$var:"):
            parts = v[5:].split(":", maxsplit=1)
            if len(parts) == 1:
                try:
                    new_args[k] = registry.get_var(parts[0])
                except KeyError:
                    exc = ValueError(
                        "Variable '{}' not found in registry".format(parts[0])
                    )
                    add_note(
                        exc,
                        "Help: Variables are reset in new Python sessions. "
                        "Use `registry.set_var` to set variables.",
                    )
                    raise exc
            else:
                name, default = parts
                try:
                    new_args[k] = registry.get_var(name)
                except KeyError:
                    new_args[k] = default
    return new_args

sensitive_keys staticmethod

sensitive_keys() -> List[str]

Return a list of keys that are sensitive and should not be allowed to be set to hardcoded values in the config. For example, API keys.

Source code in lancedb/embeddings/base.py
@staticmethod
def sensitive_keys() -> List[str]:
    """
    Return a list of keys that are sensitive and should not be allowed
    to be set to hardcoded values in the config. For example, API keys.
    """
    return []
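
A subclass that overrides `sensitive_keys` causes hardcoded values for those keys to be rejected when the function is created. The sketch below reproduces only that check with standard-library code; `check_sensitive` is a hypothetical helper and `api_key` is an assumed example key.

```python
def check_sensitive(kwargs, sensitive_keys):
    """Reject hardcoded string values for keys marked as sensitive."""
    for key, value in kwargs.items():
        is_reference = isinstance(value, str) and value.startswith("$var:")
        if key in sensitive_keys and isinstance(value, str) and not is_reference:
            raise ValueError(
                f"Sensitive key '{key}' cannot be set to a hardcoded value"
            )


SENSITIVE = ["api_key"]  # assumed example; the base class returns []

# A variable reference is accepted.
check_sensitive({"api_key": "$var:api_key", "name": "some-model"}, SENSITIVE)

# A literal value is rejected.
try:
    check_sensitive({"api_key": "hardcoded"}, SENSITIVE)
    rejected = False
    message = ""
except ValueError as exc:
    rejected = True
    message = str(exc)
print(rejected, message)
```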

compute_query_embeddings abstractmethod

compute_query_embeddings(*args, **kwargs) -> list[Union[array, None]]

Compute the embeddings for a given user query

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code in lancedb/embeddings/base.py
@abstractmethod
def compute_query_embeddings(self, *args, **kwargs) -> list[Union[np.array, None]]:
    """
    Compute the embeddings for a given user query

    Returns
    -------
    A list of embeddings for each input. The embedding of each input can be None
    when the embedding is not valid.
    """
    pass

compute_source_embeddings abstractmethod

compute_source_embeddings(*args, **kwargs) -> list[Union[array, None]]

Compute the embeddings for the source column in the database

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code in lancedb/embeddings/base.py
@abstractmethod
def compute_source_embeddings(self, *args, **kwargs) -> list[Union[np.array, None]]:
    """Compute the embeddings for the source column in the database

    Returns
    -------
    A list of embeddings for each input. The embedding of each input can be None
    when the embedding is not valid.
    """
    pass

compute_query_embeddings_with_retry

compute_query_embeddings_with_retry(*args, **kwargs) -> list[Union[array, None]]

Compute the embeddings for a given user query with retries

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code in lancedb/embeddings/base.py
def compute_query_embeddings_with_retry(
    self, *args, **kwargs
) -> list[Union[np.array, None]]:
    """Compute the embeddings for a given user query with retries

    Returns
    -------
    A list of embeddings for each input. The embedding of each input can be None
    when the embedding is not valid.
    """
    return retry_with_exponential_backoff(
        self.compute_query_embeddings, max_retries=self.max_retries
    )(
        *args,
        **kwargs,
    )

compute_source_embeddings_with_retry

compute_source_embeddings_with_retry(*args, **kwargs) -> list[Union[array, None]]

Compute the embeddings for the source column in the database with retries.

Returns:

  • A list of embeddings for each input. The embedding of each input can be None when the embedding is not valid.
Source code in lancedb/embeddings/base.py
def compute_source_embeddings_with_retry(
    self, *args, **kwargs
) -> list[Union[np.array, None]]:
    """Compute the embeddings for the source column in the database with retries.

    Returns
    -------
    A list of embeddings for each input. The embedding of each input can be None
    when the embedding is not valid.
    """
    return retry_with_exponential_backoff(
        self.compute_source_embeddings, max_retries=self.max_retries
    )(*args, **kwargs)
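
The `_with_retry` variants wrap the plain methods in a retry helper with exponential backoff. The library's helper is not shown here, so the following is only a sketch of the general technique: `retry_with_backoff`, its `initial_delay`, `factor` and `sleep` parameters, and `flaky_embed` are all hypothetical.

```python
import time


def retry_with_backoff(
    func, max_retries=7, initial_delay=0.01, factor=2.0, sleep=time.sleep
):
    """Wrap func so failures are retried with exponentially growing delays."""

    def wrapper(*args, **kwargs):
        delay = initial_delay
        attempt = 0
        while True:
            try:
                return func(*args, **kwargs)
            except Exception:
                # With max_retries=0 the first failure is re-raised at once.
                if attempt >= max_retries:
                    raise
                sleep(delay)
                delay *= factor
                attempt += 1

    return wrapper


calls = {"count": 0}


def flaky_embed(texts):
    """Fails twice, then succeeds, to simulate a transient error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return [[0.0, 1.0] for _ in texts]


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_embed, max_retries=5, sleep=delays.append)(
    ["a", "b"]
)
print(result, delays, calls["count"])
```

Passing `sleep` as a parameter keeps the sketch fast to run and makes the delay sequence observable.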

sanitize_input

sanitize_input(texts: TEXT) -> Union[List[str], ndarray]

Sanitize the input to the embedding function.

Source code in lancedb/embeddings/base.py
def sanitize_input(self, texts: TEXT) -> Union[List[str], np.ndarray]:
    """
    Sanitize the input to the embedding function.
    """
    if isinstance(texts, str):
        texts = [texts]
    elif isinstance(texts, pa.Array):
        texts = texts.to_pylist()
    elif isinstance(texts, pa.ChunkedArray):
        texts = texts.combine_chunks().to_pylist()
    return texts

ndims abstractmethod

ndims() -> int

Return the dimensions of the vector column

Source code in lancedb/embeddings/base.py
@abstractmethod
def ndims(self) -> int:
    """
    Return the dimensions of the vector column
    """
    pass

SourceField

SourceField(**kwargs)

Creates a pydantic Field that can automatically annotate the source column for this embedding function

Source code in lancedb/embeddings/base.py
def SourceField(self, **kwargs):
    """
    Creates a pydantic Field that can automatically annotate
    the source column for this embedding function
    """
    return Field(json_schema_extra={"source_column_for": self}, **kwargs)

VectorField

VectorField(**kwargs)

Creates a pydantic Field that can automatically annotate the target vector column for this embedding function

Source code in lancedb/embeddings/base.py
def VectorField(self, **kwargs):
    """
    Creates a pydantic Field that can automatically annotate
    the target vector column for this embedding function
    """
    return Field(json_schema_extra={"vector_column_for": self}, **kwargs)

lancedb.embeddings.base.TextEmbeddingFunction

Bases: EmbeddingFunction

A callable ABC for embedding functions that take text as input

Source code in lancedb/embeddings/base.py
class TextEmbeddingFunction(EmbeddingFunction):
    """
    A callable ABC for embedding functions that take text as input
    """

    def compute_query_embeddings(
        self, query: str, *args, **kwargs
    ) -> list[Union[np.array, None]]:
        return self.compute_source_embeddings(query, *args, **kwargs)

    def compute_source_embeddings(
        self, texts: TEXT, *args, **kwargs
    ) -> list[Union[np.array, None]]:
        texts = self.sanitize_input(texts)
        return self.generate_embeddings(texts)

    @abstractmethod
    def generate_embeddings(
        self, texts: Union[List[str], np.ndarray], *args, **kwargs
    ) -> list[Union[np.array, None]]:
        """Generate the embeddings for the given texts"""
        pass

generate_embeddings abstractmethod

generate_embeddings(texts: Union[List[str], ndarray], *args, **kwargs) -> list[Union[array, None]]

Generate the embeddings for the given texts

Source code in lancedb/embeddings/base.py
@abstractmethod
def generate_embeddings(
    self, texts: Union[List[str], np.ndarray], *args, **kwargs
) -> list[Union[np.array, None]]:
    """Generate the embeddings for the given texts"""
    pass
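
In the class listing above, both compute methods funnel into `generate_embeddings`, so a text embedding function only needs to implement that one method. A standard-library sketch of this structure follows; `TextEmbedderSketch` and `LengthEmbedder` are hypothetical, and the input sanitizing is reduced to handling `str` and plain sequences (no pyarrow arrays).

```python
from abc import ABC, abstractmethod


class TextEmbedderSketch(ABC):
    """Hypothetical stand-in for the text embedding base class."""

    def sanitize_input(self, texts):
        # A single string becomes a one-element list.
        if isinstance(texts, str):
            return [texts]
        return list(texts)

    def compute_source_embeddings(self, texts):
        return self.generate_embeddings(self.sanitize_input(texts))

    def compute_query_embeddings(self, query):
        # Queries reuse the source path, as in the listing above.
        return self.compute_source_embeddings(query)

    @abstractmethod
    def generate_embeddings(self, texts): ...


class LengthEmbedder(TextEmbedderSketch):
    """Toy embedding: [character count, word count] per text."""

    def generate_embeddings(self, texts):
        return [[float(len(t)), float(len(t.split()))] for t in texts]


embedder = LengthEmbedder()
query_vectors = embedder.compute_query_embeddings("hello world")
source_vectors = embedder.compute_source_embeddings(["a", "b c"])
print(query_vectors, source_vectors)
```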

lancedb.embeddings.sentence_transformers.SentenceTransformerEmbeddings

Bases: TextEmbeddingFunction

An embedding function that uses the sentence-transformers library

https://huggingface.co/sentence-transformers

Parameters:

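  • name (str, default: 'all-MiniLM-L6-v2' ) –

    The name of the model to use.

  • device (str, default: 'cpu' ) –

    The device to use for the model.

  • normalize (bool, default: True ) –

    Whether to normalize the embeddings.

  • trust_remote_code (bool, default: True ) –

    Whether to trust the remote code.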

Source code in lancedb/embeddings/sentence_transformers.py
@register("sentence-transformers")
class SentenceTransformerEmbeddings(TextEmbeddingFunction):
    """
    An embedding function that uses the sentence-transformers library

    https://huggingface.co/sentence-transformers

    Parameters
    ----------
    name: str, default "all-MiniLM-L6-v2"
        The name of the model to use.
    device: str, default "cpu"
        The device to use for the model
    normalize: bool, default True
        Whether to normalize the embeddings
    trust_remote_code: bool, default True
        Whether to trust the remote code
    """

    name: str = "all-MiniLM-L6-v2"
    device: str = "cpu"
    normalize: bool = True
    trust_remote_code: bool = True

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._ndims = None

    @property
    def embedding_model(self):
        """
        Get the sentence-transformers embedding model specified by the
        name, device, and trust_remote_code. This is cached so that the
        model is only loaded once per process.
        """
        return self.get_embedding_model()

    def ndims(self):
        if self._ndims is None:
            self._ndims = len(self.generate_embeddings("foo")[0])
        return self._ndims

    def generate_embeddings(
        self, texts: Union[List[str], np.ndarray]
    ) -> List[np.array]:
        """
        Get the embeddings for the given texts

        Parameters
        ----------
        texts: list[str] or np.ndarray (of str)
            The texts to embed
        """
        return self.embedding_model.encode(
            list(texts),
            convert_to_numpy=True,
            normalize_embeddings=self.normalize,
        ).tolist()

    @weak_lru(maxsize=1)
    def get_embedding_model(self):
        """
        Get the sentence-transformers embedding model specified by the
        name, device, and trust_remote_code. This is cached so that the
        model is only loaded once per process.

        TODO: use lru_cache instead with a reasonable/configurable maxsize
        """
        sentence_transformers = attempt_import_or_raise(
            "sentence_transformers", "sentence-transformers"
        )
        return sentence_transformers.SentenceTransformer(
            self.name, device=self.device, trust_remote_code=self.trust_remote_code
        )

embedding_model property

embedding_model

Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.

generate_embeddings

generate_embeddings(texts: Union[List[str], ndarray]) -> List[array]

Get the embeddings for the given texts

Parameters:

  • texts (Union[List[str], ndarray]) –

    The texts to embed

Source code in lancedb/embeddings/sentence_transformers.py
def generate_embeddings(
    self, texts: Union[List[str], np.ndarray]
) -> List[np.array]:
    """
    Get the embeddings for the given texts

    Parameters
    ----------
    texts: list[str] or np.ndarray (of str)
        The texts to embed
    """
    return self.embedding_model.encode(
        list(texts),
        convert_to_numpy=True,
        normalize_embeddings=self.normalize,
    ).tolist()
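When `normalize` is true, the model is asked (through `normalize_embeddings`) to return unit-length vectors, so the dot product of two embeddings equals their cosine similarity. The following is a pure-Python sketch of that L2 normalization, not the library's implementation. `l2_normalize` is a hypothetical helper name.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```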

get_embedding_model

get_embedding_model()

Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.

TODO: use lru_cache instead with a reasonable/configurable maxsize

Source code in lancedb/embeddings/sentence_transformers.py
@weak_lru(maxsize=1)
def get_embedding_model(self):
    """
    Get the sentence-transformers embedding model specified by the
    name, device, and trust_remote_code. This is cached so that the
    model is only loaded once per process.

    TODO: use lru_cache instead with a reasonable/configurable maxsize
    """
    sentence_transformers = attempt_import_or_raise(
        "sentence_transformers", "sentence-transformers"
    )
    return sentence_transformers.SentenceTransformer(
        self.name, device=self.device, trust_remote_code=self.trust_remote_code
    )
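The effect of caching the model can be illustrated with the standard library. This sketch uses `functools.lru_cache` as a stand-in. It does not reproduce lancedb's `weak_lru` helper, and `load_model` plus the dict it returns are hypothetical placeholders for a real model load.

```python
from functools import lru_cache

load_calls = []  # records every time the "model" is actually built


@lru_cache(maxsize=1)
def load_model(name: str, device: str) -> dict:
    """Hypothetical loader; a dict stands in for a real model object."""
    load_calls.append((name, device))
    return {"name": name, "device": device}


first = load_model("all-MiniLM-L6-v2", "cpu")
second = load_model("all-MiniLM-L6-v2", "cpu")  # served from the cache
```

With `maxsize=1`, requesting a different name or device evicts the previous entry, so only the most recently used model stays cached.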

lancedb.embeddings.openai.OpenAIEmbeddings

Bases: TextEmbeddingFunction

An embedding function that uses the OpenAI API

https://platform.openai.com/docs/guides/embeddings

This can also be used for open source models that are compatible with the OpenAI API.

Notes

If you're running an Ollama server locally, you can just override the base_url parameter and provide the Ollama embedding model you want to use (https://ollama.com/library):

from lancedb.embeddings import get_registry
openai = get_registry().get("openai")
embedding_function = openai.create(
    name="<ollama-embedding-model-name>",
    base_url="http://localhost:11434",
)
Source code in lancedb/embeddings/openai.py
@register("openai")
class OpenAIEmbeddings(TextEmbeddingFunction):
    """
    An embedding function that uses the OpenAI API

    https://platform.openai.com/docs/guides/embeddings

    This can also be used for open source models that
    are compatible with the OpenAI API.

    Notes
    -----
    If you're running an Ollama server locally,
    you can just override the `base_url` parameter
    and provide the Ollama embedding model you want
    to use (https://ollama.com/library):

    ```python
    from lancedb.embeddings import get_registry
    openai = get_registry().get("openai")
    embedding_function = openai.create(
        name="<ollama-embedding-model-name>",
        base_url="http://localhost:11434",
        )
    ```

    """

    name: str = "text-embedding-ada-002"
    dim: Optional[int] = None
    base_url: Optional[str] = None
    default_headers: Optional[dict] = None
    organization: Optional[str] = None
    api_key: Optional[str] = None

    # Set true to use Azure OpenAI API
    use_azure: bool = False

    def ndims(self):
        return self._ndims

    @staticmethod
    def sensitive_keys():
        return ["api_key"]

    @staticmethod
    def model_names():
        return [
            "text-embedding-ada-002",
            "text-embedding-3-large",
            "text-embedding-3-small",
        ]

    @cached_property
    def _ndims(self):
        if self.name == "text-embedding-ada-002":
            return 1536
        elif self.name == "text-embedding-3-large":
            return self.dim or 3072
        elif self.name == "text-embedding-3-small":
            return self.dim or 1536
        else:
            raise ValueError(f"Unknown model name {self.name}")

    def generate_embeddings(
        self, texts: Union[List[str], "np.ndarray"]
    ) -> List["np.array"]:
        """
        Get the embeddings for the given texts

        Parameters
        ----------
        texts: list[str] or np.ndarray (of str)
            The texts to embed
        """
        openai = attempt_import_or_raise("openai")

        valid_texts = []
        valid_indices = []
        for idx, text in enumerate(texts):
            if text:
                valid_texts.append(text)
                valid_indices.append(idx)

        # TODO retry, rate limit, token limit
        try:
            kwargs = {
                "input": valid_texts,
                "model": self.name,
            }
            if self.name != "text-embedding-ada-002":
                kwargs["dimensions"] = self.dim

            rs = self._openai_client.embeddings.create(**kwargs)
            valid_embeddings = {
                idx: v.embedding for v, idx in zip(rs.data, valid_indices)
            }
        except openai.BadRequestError:
            logging.exception("Bad request: %s", texts)
            return [None] * len(texts)
        except Exception:
            logging.exception("OpenAI embeddings error")
            raise
        return [valid_embeddings.get(idx, None) for idx in range(len(texts))]

    @cached_property
    def _openai_client(self):
        openai = attempt_import_or_raise("openai")
        kwargs = {}
        if self.base_url:
            kwargs["base_url"] = self.base_url
        if self.default_headers:
            kwargs["default_headers"] = self.default_headers
        if self.organization:
            kwargs["organization"] = self.organization
        if self.api_key:
            kwargs["api_key"] = self.api_key

        if self.use_azure:
            return openai.AzureOpenAI(**kwargs)
        else:
            return openai.OpenAI(**kwargs)
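The dimension lookup in `_ndims` above can be restated as a small standalone function. This is a sketch of the logic shown in the listing, not a public API. `resolve_ndims` and `DEFAULT_DIMS` are hypothetical names.

```python
from typing import Optional

# Default output sizes, taken from the _ndims branches in the listing above.
DEFAULT_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}


def resolve_ndims(name: str, dim: Optional[int] = None) -> int:
    """Mirror of the branch logic: ada-002 ignores `dim`, the v3 models honor it."""
    if name not in DEFAULT_DIMS:
        raise ValueError(f"Unknown model name {name}")
    if name == "text-embedding-ada-002":
        return DEFAULT_DIMS[name]
    return dim or DEFAULT_DIMS[name]


large_default = resolve_ndims("text-embedding-3-large")
small_custom = resolve_ndims("text-embedding-3-small", dim=256)
```

In the listing as shown, a model name outside the three listed ones reaches the `else` branch and raises `ValueError` when `ndims()` is called.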

generate_embeddings

generate_embeddings(texts: Union[List[str], ndarray]) -> List[array]

Get the embeddings for the given texts

Parameters:

  • texts (Union[List[str], ndarray]) –

    The texts to embed

Source code in lancedb/embeddings/openai.py
def generate_embeddings(
    self, texts: Union[List[str], "np.ndarray"]
) -> List["np.array"]:
    """
    Get the embeddings for the given texts

    Parameters
    ----------
    texts: list[str] or np.ndarray (of str)
        The texts to embed
    """
    openai = attempt_import_or_raise("openai")

    valid_texts = []
    valid_indices = []
    for idx, text in enumerate(texts):
        if text:
            valid_texts.append(text)
            valid_indices.append(idx)

    # TODO retry, rate limit, token limit
    try:
        kwargs = {
            "input": valid_texts,
            "model": self.name,
        }
        if self.name != "text-embedding-ada-002":
            kwargs["dimensions"] = self.dim

        rs = self._openai_client.embeddings.create(**kwargs)
        valid_embeddings = {
            idx: v.embedding for v, idx in zip(rs.data, valid_indices)
        }
    except openai.BadRequestError:
        logging.exception("Bad request: %s", texts)
        return [None] * len(texts)
    except Exception:
        logging.exception("OpenAI embeddings error")
        raise
    return [valid_embeddings.get(idx, None) for idx in range(len(texts))]
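The listing above skips empty or missing texts before calling the API, then puts `None` back at their positions so the output lines up with the input. A minimal sketch of that bookkeeping with a fake embedding call follows. `embed_skipping_empty` and `fake_embed_batch` are hypothetical names, and no network request is made.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_skipping_empty(
    texts: Sequence[Optional[str]],
    embed_batch: Callable[[List[str]], List[Vector]],
) -> List[Optional[Vector]]:
    """Embed only the non-empty texts and keep results aligned with the input."""
    valid_texts, valid_indices = [], []
    for idx, text in enumerate(texts):
        if text:  # skips None and ""
            valid_texts.append(text)
            valid_indices.append(idx)
    vectors = embed_batch(valid_texts)
    by_index = dict(zip(valid_indices, vectors))
    # Positions that were skipped come back as None.
    return [by_index.get(idx) for idx in range(len(texts))]


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    # Stand-in for the remote call: one-dimensional "embedding" = text length.
    return [[float(len(t))] for t in batch]


result = embed_skipping_empty(["hi", "", "lance", None], fake_embed_batch)
```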

lancedb.embeddings.open_clip.OpenClipEmbeddings

Bases: EmbeddingFunction

An embedding function that uses the OpenClip API for multi-modal text-to-image search.

https://github.com/mlfoundations/open_clip

Source code in lancedb/embeddings/open_clip.py
@register("open-clip")
class OpenClipEmbeddings(EmbeddingFunction):
    """
    An embedding function that uses the OpenClip API
    For multi-modal text-to-image search

    https://github.com/mlfoundations/open_clip
    """

    name: str = "ViT-B-32"
    pretrained: str = "laion2b_s34b_b79k"
    device: str = "cpu"
    batch_size: int = 64
    normalize: bool = True
    _model = PrivateAttr()
    _preprocess = PrivateAttr()
    _tokenizer = PrivateAttr()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        open_clip = attempt_import_or_raise("open_clip", "open-clip")
        model, _, preprocess = open_clip.create_model_and_transforms(
            self.name, pretrained=self.pretrained
        )
        model.to(self.device)
        self._model, self._preprocess = model, preprocess
        self._tokenizer = open_clip.get_tokenizer(self.name)
        self._ndims = None

    def ndims(self):
        if self._ndims is None:
            self._ndims = self.generate_text_embeddings("foo").shape[0]
        return self._ndims

    def compute_query_embeddings(
        self, query: Union[str, "PIL.Image.Image"], *args, **kwargs
    ) -> List[np.ndarray]:
        """
        Compute the embeddings for a given user query

        Parameters
        ----------
        query : Union[str, PIL.Image.Image]
            The query to embed. A query can be either text or an image.
        """
        if isinstance(query, str):
            return [self.generate_text_embeddings(query)]
        else:
            PIL = attempt_import_or_raise("PIL", "pillow")
            if isinstance(query, PIL.Image.Image):
                return [self.generate_image_embedding(query)]
            else:
                raise TypeError("OpenClip supports str or PIL Image as query")

    def generate_text_embeddings(self, text: str) -> np.ndarray:
        torch = attempt_import_or_raise("torch")
        text = self.sanitize_input(text)
        text = self._tokenizer(text)
        text.to(self.device)
        with torch.no_grad():
            text_features = self._model.encode_text(text.to(self.device))
            if self.normalize:
                text_features /= text_features.norm(dim=-1, keepdim=True)
            return text_features.cpu().numpy().squeeze()

    def sanitize_input(self, images: IMAGES) -> Union[List[bytes], np.ndarray]:
        """
        Sanitize the input to the embedding function.
        """
        if isinstance(images, (str, bytes)):
            images = [images]
        elif isinstance(images, pa.Array):
            images = images.to_pylist()
        elif isinstance(images, pa.ChunkedArray):
            images = images.combine_chunks().to_pylist()
        return images

    def compute_source_embeddings(
        self, images: IMAGES, *args, **kwargs
    ) -> List[np.array]:
        """
        Get the embeddings for the given images
        """
        images = self.sanitize_input(images)
        embeddings = []
        for i in range(0, len(images), self.batch_size):
            j = min(i + self.batch_size, len(images))
            batch = images[i:j]
            embeddings.extend(self._parallel_get(batch))
        return embeddings

    def _parallel_get(self, images: Union[List[str], List[bytes]]) -> List[np.ndarray]:
        """
        Issue concurrent requests to retrieve the image data
        """
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(self.generate_image_embedding, image)
                for image in images
            ]
            return [future.result() for future in tqdm(futures)]

    def generate_image_embedding(
        self, image: Union[str, bytes, "PIL.Image.Image"]
    ) -> np.ndarray:
        """
        Generate the embedding for a single image

        Parameters
        ----------
        image : Union[str, bytes, PIL.Image.Image]
            The image to embed. If the image is a str, it is treated as a uri.
            If the image is bytes, it is treated as the raw image bytes.
        """
        torch = attempt_import_or_raise("torch")
        # TODO handle retry and errors for https
        image = self._to_pil(image)
        image = self._preprocess(image).unsqueeze(0)
        with torch.no_grad():
            return self._encode_and_normalize_image(image)

    def _to_pil(self, image: Union[str, bytes]):
        PIL = attempt_import_or_raise("PIL", "pillow")
        if isinstance(image, bytes):
            return PIL.Image.open(io.BytesIO(image))
        if isinstance(image, PIL.Image.Image):
            return image
        elif isinstance(image, str):
            parsed = urlparse.urlparse(image)
            # TODO handle drive letter on windows.
            if parsed.scheme == "file":
                return PIL.Image.open(parsed.path)
            elif parsed.scheme == "":
                return PIL.Image.open(image if os.name == "nt" else parsed.path)
            elif parsed.scheme.startswith("http"):
                return PIL.Image.open(io.BytesIO(url_retrieve(image)))
            else:
                raise NotImplementedError("Only local and http(s) urls are supported")

    def _encode_and_normalize_image(self, image_tensor: "torch.Tensor"):
        """
        encode a single image tensor and optionally normalize the output
        """
        image_features = self._model.encode_image(image_tensor.to(self.device))
        if self.normalize:
            image_features /= image_features.norm(dim=-1, keepdim=True)
        return image_features.cpu().numpy().squeeze()

compute_query_embeddings

compute_query_embeddings(query: Union[str, Image], *args, **kwargs) -> List[ndarray]

Compute the embeddings for a given user query

Parameters:

  • query (Union[str, Image]) –

    The query to embed. A query can be either text or an image.

Source code in lancedb/embeddings/open_clip.py
def compute_query_embeddings(
    self, query: Union[str, "PIL.Image.Image"], *args, **kwargs
) -> List[np.ndarray]:
    """
    Compute the embeddings for a given user query

    Parameters
    ----------
    query : Union[str, PIL.Image.Image]
        The query to embed. A query can be either text or an image.
    """
    if isinstance(query, str):
        return [self.generate_text_embeddings(query)]
    else:
        PIL = attempt_import_or_raise("PIL", "pillow")
        if isinstance(query, PIL.Image.Image):
            return [self.generate_image_embedding(query)]
        else:
            raise TypeError("OpenClip supports str or PIL Image as query")

sanitize_input

sanitize_input(images: IMAGES) -> Union[List[bytes], ndarray]

Sanitize the input to the embedding function.

Source code in lancedb/embeddings/open_clip.py
def sanitize_input(self, images: IMAGES) -> Union[List[bytes], np.ndarray]:
    """
    Sanitize the input to the embedding function.
    """
    if isinstance(images, (str, bytes)):
        images = [images]
    elif isinstance(images, pa.Array):
        images = images.to_pylist()
    elif isinstance(images, pa.ChunkedArray):
        images = images.combine_chunks().to_pylist()
    return images

compute_source_embeddings

compute_source_embeddings(images: IMAGES, *args, **kwargs) -> List[array]

Get the embeddings for the given images

Source code in lancedb/embeddings/open_clip.py
def compute_source_embeddings(
    self, images: IMAGES, *args, **kwargs
) -> List[np.array]:
    """
    Get the embeddings for the given images
    """
    images = self.sanitize_input(images)
    embeddings = []
    for i in range(0, len(images), self.batch_size):
        j = min(i + self.batch_size, len(images))
        batch = images[i:j]
        embeddings.extend(self._parallel_get(batch))
    return embeddings
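The batching pattern above (slice the input by `batch_size`, fan each batch out to a thread pool, collect results in order) can be sketched with the standard library alone. `embed_in_batches` is a hypothetical name, and the lambda stands in for `generate_image_embedding`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def embed_in_batches(
    items: Sequence[T], embed_one: Callable[[T], R], batch_size: int = 64
) -> List[R]:
    """Process items batch by batch; each batch runs concurrently in threads."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(embed_one, item) for item in batch]
            # Reading futures in submission order keeps output aligned with input.
            results.extend(future.result() for future in futures)
    return results


squares = embed_in_batches(list(range(10)), lambda x: x * x, batch_size=4)
```

Slicing past the end of a list is safe in Python, so the `min(...)` bound used in the listing is not needed in this sketch.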

generate_image_embedding

generate_image_embedding(image: Union[str, bytes, Image]) -> ndarray

Generate the embedding for a single image

Parameters:

  • image (Union[str, bytes, Image]) –

    The image to embed. If the image is a str, it is treated as a uri. If the image is bytes, it is treated as the raw image bytes.

Source code in lancedb/embeddings/open_clip.py
def generate_image_embedding(
    self, image: Union[str, bytes, "PIL.Image.Image"]
) -> np.ndarray:
    """
    Generate the embedding for a single image

    Parameters
    ----------
    image : Union[str, bytes, PIL.Image.Image]
        The image to embed. If the image is a str, it is treated as a uri.
        If the image is bytes, it is treated as the raw image bytes.
    """
    torch = attempt_import_or_raise("torch")
    # TODO handle retry and errors for https
    image = self._to_pil(image)
    image = self._preprocess(image).unsqueeze(0)
    with torch.no_grad():
        return self._encode_and_normalize_image(image)
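How a `str` or `bytes` input is resolved before embedding follows the `_to_pil` helper in the class listing. The sketch below reproduces only the scheme dispatch, using `urllib.parse`, and returns a label instead of opening an image. `classify_image_source` and its labels are hypothetical, and Windows drive-letter paths are not handled.

```python
from typing import Union
from urllib.parse import urlparse


def classify_image_source(image: Union[str, bytes]) -> str:
    """Decide how an image reference would be loaded, without loading it."""
    if isinstance(image, bytes):
        return "raw-bytes"
    parsed = urlparse(image)
    if parsed.scheme in ("", "file"):
        return "local-file"
    if parsed.scheme.startswith("http"):
        return "http-download"
    raise NotImplementedError("Only local and http(s) urls are supported")


sources = [
    classify_image_source(b"\x89PNG"),
    classify_image_source("images/cat.png"),
    classify_image_source("file:///tmp/cat.png"),
    classify_image_source("https://example.com/cat.png"),
]
```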

Context

lancedb.context.contextualize

contextualize(raw_df: 'pd.DataFrame') -> Contextualizer

Create a Contextualizer object for the given DataFrame.

Used to create context windows. Context windows are rolling subsets of text data.

The input text column should already be separated into rows that will be the unit of the window. So to create a context window over tokens, start with a DataFrame with one token per row. To create a context window over sentences, start with a DataFrame with one sentence per row.

Examples:

>>> from lancedb.context import contextualize
>>> import pandas as pd
>>> data = pd.DataFrame({
...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',
...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],
...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
... })

window determines how many rows to include in each window. In our case this is the number of tokens, but depending on the input data, it could be sentences, paragraphs, messages, etc.

>>> contextualize(data).window(3).stride(1).text_col('token').to_pandas()
                token  document_id
0     The quick brown            1
1     quick brown fox            1
2    brown fox jumped            1
3     fox jumped over            1
4     jumped over the            1
5       over the lazy            1
6        the lazy dog            1
7          lazy dog I            1
8          dog I love            1
9   I love sandwiches            2
10    love sandwiches            2
>>> (contextualize(data).window(7).stride(1).min_window_size(7)
...   .text_col('token').to_pandas())
                                  token  document_id
0   The quick brown fox jumped over the            1
1  quick brown fox jumped over the lazy            1
2    brown fox jumped over the lazy dog            1
3        fox jumped over the lazy dog I            1
4       jumped over the lazy dog I love            1
5   over the lazy dog I love sandwiches            1

stride determines how many rows to skip between each window start. This can be used to reduce the total number of windows generated.

>>> contextualize(data).window(4).stride(2).text_col('token').to_pandas()
                    token  document_id
0     The quick brown fox            1
2   brown fox jumped over            1
4    jumped over the lazy            1
6          the lazy dog I            1
8   dog I love sandwiches            1
10        love sandwiches            2

groupby determines how to group the rows. For example, we would like to have context windows that don't cross document boundaries. In this case, we can pass document_id as the group by.

>>> (contextualize(data)
...     .window(4).stride(2).text_col('token').groupby('document_id')
...     .to_pandas())
                   token  document_id
0    The quick brown fox            1
2  brown fox jumped over            1
4   jumped over the lazy            1
6           the lazy dog            1
9      I love sandwiches            2

min_window_size determines the minimum size of the context windows that are generated. This can be used to trim the last few context windows whose size is less than min_window_size. By default, context windows of size 1 are skipped.

>>> (contextualize(data)
...     .window(6).stride(3).text_col('token').groupby('document_id')
...     .to_pandas())
                             token  document_id
0  The quick brown fox jumped over            1
3     fox jumped over the lazy dog            1
6                     the lazy dog            1
9                I love sandwiches            2
>>> (contextualize(data)
...     .window(6).stride(3).min_window_size(4).text_col('token')
...     .groupby('document_id')
...     .to_pandas())
                             token  document_id
0  The quick brown fox jumped over            1
3     fox jumped over the lazy dog            1
Source code in lancedb/context.py
def contextualize(raw_df: "pd.DataFrame") -> Contextualizer:
    """Create a Contextualizer object for the given DataFrame.

    Used to create context windows. Context windows are rolling subsets of text
    data.

    The input text column should already be separated into rows that will be the
    unit of the window. So to create a context window over tokens, start with
    a DataFrame with one token per row. To create a context window over sentences,
    start with a DataFrame with one sentence per row.

    Examples
    --------
    >>> from lancedb.context import contextualize
    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',
    ...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],
    ...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
    ... })

    ``window`` determines how many rows to include in each window. In our case
    this is the number of tokens, but depending on the input data, it could be
    sentences, paragraphs, messages, etc.

    >>> contextualize(data).window(3).stride(1).text_col('token').to_pandas()
                    token  document_id
    0     The quick brown            1
    1     quick brown fox            1
    2    brown fox jumped            1
    3     fox jumped over            1
    4     jumped over the            1
    5       over the lazy            1
    6        the lazy dog            1
    7          lazy dog I            1
    8          dog I love            1
    9   I love sandwiches            2
    10    love sandwiches            2
    >>> (contextualize(data).window(7).stride(1).min_window_size(7)
    ...   .text_col('token').to_pandas())
                                      token  document_id
    0   The quick brown fox jumped over the            1
    1  quick brown fox jumped over the lazy            1
    2    brown fox jumped over the lazy dog            1
    3        fox jumped over the lazy dog I            1
    4       jumped over the lazy dog I love            1
    5   over the lazy dog I love sandwiches            1

    ``stride`` determines how many rows to skip between each window start. This can
    be used to reduce the total number of windows generated.

    >>> contextualize(data).window(4).stride(2).text_col('token').to_pandas()
                        token  document_id
    0     The quick brown fox            1
    2   brown fox jumped over            1
    4    jumped over the lazy            1
    6          the lazy dog I            1
    8   dog I love sandwiches            1
    10        love sandwiches            2

    ``groupby`` determines how to group the rows. For example, we would like to have
    context windows that don't cross document boundaries. In this case, we can
    pass ``document_id`` as the group by.

    >>> (contextualize(data)
    ...     .window(4).stride(2).text_col('token').groupby('document_id')
    ...     .to_pandas())
                       token  document_id
    0    The quick brown fox            1
    2  brown fox jumped over            1
    4   jumped over the lazy            1
    6           the lazy dog            1
    9      I love sandwiches            2

    ``min_window_size`` determines the minimum size of the context windows
    that are generated. This can be used to trim the last few context windows
    whose size is less than ``min_window_size``.
    By default, context windows of size 1 are skipped.

    >>> (contextualize(data)
    ...     .window(6).stride(3).text_col('token').groupby('document_id')
    ...     .to_pandas())
                                 token  document_id
    0  The quick brown fox jumped over            1
    3     fox jumped over the lazy dog            1
    6                     the lazy dog            1
    9                I love sandwiches            2

    >>> (contextualize(data)
    ...     .window(6).stride(3).min_window_size(4).text_col('token')
    ...     .groupby('document_id')
    ...     .to_pandas())
                                 token  document_id
    0  The quick brown fox jumped over            1
    3     fox jumped over the lazy dog            1

    """
    return Contextualizer(raw_df)

lancedb.context.Contextualizer

Create context windows from a DataFrame. See lancedb.context.contextualize.

Source code in lancedb/context.py
class Contextualizer:
    """Create context windows from a DataFrame.
    See [lancedb.context.contextualize][].
    """

    def __init__(self, raw_df):
        self._text_col = None
        self._groupby = None
        self._stride = None
        self._window = None
        self._min_window_size = 2
        self._raw_df = raw_df

    def window(self, window: int) -> Contextualizer:
        """Set the window size. i.e., how many rows to include in each window.

        Parameters
        ----------
        window: int
            The window size.
        """
        self._window = window
        return self

    def stride(self, stride: int) -> Contextualizer:
        """Set the stride. i.e., how many rows to skip between each window.

        Parameters
        ----------
        stride: int
            The stride.
        """
        self._stride = stride
        return self

    def groupby(self, groupby: str) -> Contextualizer:
        """Set the groupby column. i.e., how to group the rows.
        Windows don't cross groups

        Parameters
        ----------
        groupby: str
            The groupby column.
        """
        self._groupby = groupby
        return self

    def text_col(self, text_col: str) -> Contextualizer:
        """Set the text column used to make the context window.

        Parameters
        ----------
        text_col: str
            The text column.
        """
        self._text_col = text_col
        return self

    def min_window_size(self, min_window_size: int) -> Contextualizer:
        """Set the (optional) minimum size for the context window.

        Parameters
        ----------
        min_window_size: int
            The min_window_size.
        """
        self._min_window_size = min_window_size
        return self

    @deprecation.deprecated(
        deprecated_in="0.3.1",
        removed_in="0.4.0",
        current_version=__version__,
        details="Use to_pandas() instead",
    )
    def to_df(self) -> "pd.DataFrame":
        return self.to_pandas()

    def to_pandas(self) -> "pd.DataFrame":
        """Create the context windows and return a DataFrame."""
        if pd is None:
            raise ImportError(
                "pandas is required to create context windows using lancedb"
            )

        if self._text_col not in self._raw_df.columns.tolist():
            raise MissingColumnError(self._text_col)

        if self._window is None or self._window < 1:
            raise MissingValueError(
                "The value of window is None or less than 1. Specify the "
                "window size (number of rows to include in each window)"
            )

        if self._stride is None or self._stride < 1:
            raise MissingValueError(
                "The value of stride is None or less than 1. Specify the "
                "stride (number of rows to skip between each window)"
            )

        def process_group(grp):
            # For each group, create the text rolling window
            # with values of size >= min_window_size
            text = grp[self._text_col].values
            contexts = grp.iloc[:: self._stride, :].copy()
            windows = [
                " ".join(text[start_i : min(start_i + self._window, len(grp))])
                for start_i in range(0, len(grp), self._stride)
                if start_i + self._window <= len(grp)
                or len(grp) - start_i >= self._min_window_size
            ]
            # if last few rows dropped
            if len(windows) < len(contexts):
                contexts = contexts.iloc[: len(windows)]
            contexts[self._text_col] = windows
            return contexts

        if self._groupby is None:
            return process_group(self._raw_df)
        # concat result from all groups
        return pd.concat(
            [process_group(grp) for _, grp in self._raw_df.groupby(self._groupby)]
        )
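The window selection inside `process_group` can be followed without pandas. The sketch below applies the same rule to a plain list: a window is kept if it is full, or if at least `min_window_size` rows remain from its start. `context_windows` is a hypothetical name, and it returns `(start_index, text)` pairs rather than a DataFrame.

```python
from typing import List, Tuple


def context_windows(
    tokens: List[str], window: int, stride: int, min_window_size: int = 2
) -> List[Tuple[int, str]]:
    """Return (start index, joined text) for every window that is kept."""
    n = len(tokens)
    kept = []
    for start in range(0, n, stride):
        is_full = start + window <= n
        long_enough = n - start >= min_window_size
        if is_full or long_enough:
            kept.append((start, " ".join(tokens[start : start + window])))
    return kept


tokens = ["The", "quick", "brown", "fox", "jumped", "over",
          "the", "lazy", "dog", "I", "love", "sandwiches"]
windows = context_windows(tokens, window=4, stride=2)
strict = context_windows(tokens, window=7, stride=1, min_window_size=7)
```

On the twelve-token example from `contextualize`, `windows` holds six entries ending with the two-token tail `(10, "love sandwiches")`, and `strict` holds the six full seven-token windows.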

window

window(window: int) -> Contextualizer

Set the window size, i.e., how many rows to include in each window.

Parameters:

  • window (int) –

    The window size.

Source code in lancedb/context.py
def window(self, window: int) -> Contextualizer:
    """Set the window size. i.e., how many rows to include in each window.

    Parameters
    ----------
    window: int
        The window size.
    """
    self._window = window
    return self

stride

stride(stride: int) -> Contextualizer

Set the stride, i.e., how many rows to skip between each window start.

Parameters:

  • stride (int) –

    The stride.

Source code in lancedb/context.py
def stride(self, stride: int) -> Contextualizer:
    """Set the stride. i.e., how many rows to skip between each window.

    Parameters
    ----------
    stride: int
        The stride.
    """
    self._stride = stride
    return self

groupby

groupby(groupby: str) -> Contextualizer

Set the groupby column, i.e., how to group the rows. Windows don't cross groups.

Parameters:

  • groupby (str) –

    The groupby column.

Source code in lancedb/context.py
def groupby(self, groupby: str) -> Contextualizer:
    """Set the groupby column. i.e., how to group the rows.
    Windows don't cross groups

    Parameters
    ----------
    groupby: str
        The groupby column.
    """
    self._groupby = groupby
    return self
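To see why grouping keeps windows inside document boundaries, the sketch below windows each group separately, in plain Python. `grouped_windows` is a hypothetical name. It processes groups in first-appearance order, which is an assumption of this sketch and may differ from how pandas orders groups.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def grouped_windows(
    rows: List[Tuple[Hashable, str]],
    window: int,
    stride: int,
    min_window_size: int = 2,
) -> List[Tuple[Hashable, str]]:
    """Build context windows per group so that no window spans two groups."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for key, token in rows:
        groups[key].append(token)

    result = []
    for key, tokens in groups.items():
        n = len(tokens)
        for start in range(0, n, stride):
            # Keep full windows, or tails with at least min_window_size rows.
            if start + window <= n or n - start >= min_window_size:
                result.append((key, " ".join(tokens[start : start + window])))
    return result


rows = [(1, t) for t in "The quick brown fox jumped over the lazy dog".split()]
rows += [(2, t) for t in "I love sandwiches".split()]
windows = grouped_windows(rows, window=4, stride=2)
```

The window texts match the `groupby('document_id')` example for `contextualize`: the last window of document 1 stops at "the lazy dog" instead of running into document 2.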

text_col

text_col(text_col: str) -> Contextualizer

Set the text column used to make the context window.

Parameters:

  • text_col (str) –

    The text column.

Source code in lancedb/context.py
def text_col(self, text_col: str) -> Contextualizer:
    """Set the text column used to make the context window.

    Parameters
    ----------
    text_col: str
        The text column.
    """
    self._text_col = text_col
    return self

min_window_size

min_window_size(min_window_size: int) -> Contextualizer

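Set the (optional) minimum size for the context window. A trailing window that runs past the end of its group is kept only if it still contains at least min_window_size rows.

Parameters:

  • min_window_size (int) –

    The minimum number of rows a partial window must contain to be kept.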

Source code in lancedb/context.py
def min_window_size(self, min_window_size: int) -> Contextualizer:
    """Set the (optional) min_window_size size for the context window.

    Parameters
    ----------
    min_window_size: int
        The min_window_size.
    """
    self._min_window_size = min_window_size
    return self

to_pandas

to_pandas() -> 'pd.DataFrame'

Create the context windows and return a DataFrame.
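
The windowing rule applied to each group can be sketched in plain Python, without pandas. This is an illustrative restatement of the selection logic in process_group, not the library's code: make_windows and contextualize_rows are hypothetical names, rows are plain dicts, and min_window_size is passed explicitly because its default is not shown in this section. Groups are visited in order of first appearance here, which may differ from the order pandas produces.

```python
from typing import Dict, List, Optional


def make_windows(
    texts: List[str], window: int, stride: int, min_window_size: int
) -> List[str]:
    """Join up to `window` consecutive texts, starting a new window every `stride` rows."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must both be at least 1")
    windows = []
    for start in range(0, len(texts), stride):
        is_full = start + window <= len(texts)
        # A trailing, partial window survives only if it is long enough.
        is_long_enough = len(texts) - start >= min_window_size
        if is_full or is_long_enough:
            windows.append(" ".join(texts[start : start + window]))
    return windows


def contextualize_rows(
    rows: List[dict],
    text_col: str,
    window: int,
    stride: int,
    min_window_size: int,
    groupby: Optional[str] = None,
) -> List[str]:
    """Build windows over all rows, or separately per group when `groupby` is set."""
    if groupby is None:
        texts = [row[text_col] for row in rows]
        return make_windows(texts, window, stride, min_window_size)
    groups: Dict[object, List[str]] = {}
    for row in rows:
        groups.setdefault(row[groupby], []).append(row[text_col])
    result: List[str] = []
    for texts in groups.values():
        result.extend(make_windows(texts, window, stride, min_window_size))
    return result


rows = [
    {"doc": 1, "text": "a"},
    {"doc": 1, "text": "b"},
    {"doc": 1, "text": "c"},
    {"doc": 2, "text": "d"},
    {"doc": 2, "text": "e"},
]

ungrouped = contextualize_rows(rows, "text", window=2, stride=1, min_window_size=2)
grouped = contextualize_rows(
    rows, "text", window=2, stride=1, min_window_size=2, groupby="doc"
)
print(ungrouped)  # ['a b', 'b c', 'c d', 'd e']
print(grouped)    # ['a b', 'b c', 'd e']
```

With grouping, the window "c d" disappears because it would span two groups. Lowering min_window_size to 1 would additionally keep the single-row trailing windows.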

Source code in lancedb/context.py
def to_pandas(self) -> "pd.DataFrame":
    """Create the context windows and return a DataFrame."""
    if pd is None:
        raise ImportError(
            "pandas is required to create context windows using lancedb"
        )

    if self._text_col not in self._raw_df.columns.tolist():
        raise MissingColumnError(self._text_col)

    if self._window is None or self._window < 1:
        raise MissingValueError(
            "The value of window is None or less than 1. Specify the "
            "window size (number of rows to include in each window)"
        )

    if self._stride is None or self._stride < 1:
        raise MissingValueError(
            "The value of stride is None or less than 1. Specify the "
            "stride (number of rows to skip between each window)"
        )

    def process_group(grp):
        # For each group, create the text rolling window
        # with values of size >= min_window_size
        text = grp[self._text_col].values
        contexts = grp.iloc[:: self._stride, :].copy()
        windows = [
            " ".join(text[start_i : min(start_i + self._window, len(grp))])
            for start_i in range(0, len(grp), self._stride)
            if start_i + self._window <= len(grp)
            or len(grp) - start_i >= self._min_window_size
        ]
        # if last few rows dropped
        if len(windows) < len(contexts):
            contexts = contexts.iloc[: len(windows)]
        contexts[self._text_col] = windows
        return contexts

    if self._groupby is None:
        return process_group(self._raw_df)
    # concat result from all groups
    return pd.concat(
        [process_group(grp) for _, grp in self._raw_df.groupby(self._groupby)]
    )

lancedb.fts.create_index

create_index(index_path: str, text_fields: List[str], ordering_fields: Optional[List[str]] = None, tokenizer_name: str = 'default') -> Index

Create a new Index (not populated)

Parameters:

  • index_path (str) –

    Path to the index directory

  • text_fields (List[str]) –

    List of text fields to index

  • ordering_fields (Optional[List[str]], default: None ) –

    List of unsigned type fields to order by at search time

  • tokenizer_name (str, default: "default" ) –

    The tokenizer to use

Returns:

  • index ( Index ) –

    The index object (not yet populated)

Source code in lancedb/fts.py
def create_index(
    index_path: str,
    text_fields: List[str],
    ordering_fields: Optional[List[str]] = None,
    tokenizer_name: str = "default",
) -> tantivy.Index:
    """
    Create a new Index (not populated)

    Parameters
    ----------
    index_path : str
        Path to the index directory
    text_fields : List[str]
        List of text fields to index
    ordering_fields: List[str]
        List of unsigned type fields to order by at search time
    tokenizer_name : str, default "default"
        The tokenizer to use

    Returns
    -------
    index : tantivy.Index
        The index object (not yet populated)
    """
    if ordering_fields is None:
        ordering_fields = []
    # Declaring our schema.
    schema_builder = tantivy.SchemaBuilder()
    # special field that we'll populate with row_id
    schema_builder.add_integer_field("doc_id", stored=True)
    # data fields
    for name in text_fields:
        schema_builder.add_text_field(name, stored=True, tokenizer_name=tokenizer_name)
    if ordering_fields:
        for name in ordering_fields:
            schema_builder.add_unsigned_field(name, fast=True)
    schema = schema_builder.build()
    os.makedirs(index_path, exist_ok=True)
    index = tantivy.Index(schema, path=index_path)
    return index

lancedb.fts.populate_index

populate_index(index: Index, table: LanceTable, fields: List[str], writer_heap_size: Optional[int] = None, ordering_fields: Optional[List[str]] = None) -> int

Populate an index with data from a LanceTable

Parameters:

  • index (Index) –

    The index object

  • table (LanceTable) –

    The table to index

  • fields (List[str]) –

    List of fields to index

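  • writer_heap_size (Optional[int], default: None ) –

    The writer heap size in bytes. If None, 1 GB is used.

  • ordering_fields (Optional[List[str]], default: None ) –

    List of unsigned type fields to store in the index for ordering at search time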

Returns:

  • int –

    The number of rows indexed
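
One detail of the source listing below is easy to miss: the row counter advances for every row scanned, including rows whose indexed fields are all null and are therefore not written to the index. The sketch below models that bookkeeping with plain dicts instead of tantivy documents. assign_doc_ids is a hypothetical helper used only for illustration and is not part of lancedb.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[object]]


def assign_doc_ids(
    rows: List[Row], fields: Sequence[str], ordering_fields: Sequence[str] = ()
) -> Tuple[List[Row], int]:
    """Return the documents that would be written and the number of rows scanned."""
    documents: List[Row] = []
    row_id = 0
    for row in rows:
        # Null values are skipped, mirroring the `value is not None` checks.
        doc = {
            name: row[name]
            for name in list(fields) + list(ordering_fields)
            if row.get(name) is not None
        }
        if doc:  # only non-empty documents are written
            doc["doc_id"] = row_id
            documents.append(doc)
        row_id += 1  # the counter advances even when nothing was written
    return documents, row_id


rows = [{"text": "first"}, {"text": None}, {"text": "third"}]
documents, scanned = assign_doc_ids(rows, fields=["text"])
print([doc["doc_id"] for doc in documents])  # [0, 2]
print(scanned)  # 3
```

Under this model the returned count can exceed the number of documents actually written, and each doc_id is the row's position in scan order.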

Source code in lancedb/fts.py
def populate_index(
    index: tantivy.Index,
    table: LanceTable,
    fields: List[str],
    writer_heap_size: Optional[int] = None,
    ordering_fields: Optional[List[str]] = None,
) -> int:
    """
    Populate an index with data from a LanceTable

    Parameters
    ----------
    index : tantivy.Index
        The index object
    table : LanceTable
        The table to index
    fields : List[str]
        List of fields to index
    writer_heap_size : int
        The writer heap size in bytes, defaults to 1GB

    Returns
    -------
    int
        The number of rows indexed
    """
    if ordering_fields is None:
        ordering_fields = []
    writer_heap_size = writer_heap_size or 1024 * 1024 * 1024
    # first check the fields exist and are string or large string type
    nested = []

    for name in fields:
        try:
            f = table.schema.field(name)  # raises KeyError if not found
        except KeyError:
            f = resolve_path(table.schema, name)
            nested.append(name)

        if not pa.types.is_string(f.type) and not pa.types.is_large_string(f.type):
            raise TypeError(f"Field {name} is not a string type")

    # create a tantivy writer
    writer = index.writer(heap_size=writer_heap_size)
    # write data into index
    dataset = table.to_lance()
    row_id = 0

    max_nested_level = 0
    if len(nested) > 0:
        max_nested_level = max([len(name.split(".")) for name in nested])

    for b in dataset.to_batches(columns=fields + ordering_fields):
        if max_nested_level > 0:
            b = pa.Table.from_batches([b])
            for _ in range(max_nested_level - 1):
                b = b.flatten()
        for i in range(b.num_rows):
            doc = tantivy.Document()
            for name in fields:
                value = b[name][i].as_py()
                if value is not None:
                    doc.add_text(name, value)
            for name in ordering_fields:
                value = b[name][i].as_py()
                if value is not None:
                    doc.add_unsigned(name, value)
            if not doc.is_empty:
                doc.add_integer("doc_id", row_id)
                writer.add_document(doc)
            row_id += 1
    # commit changes
    writer.commit()
    return row_id

lancedb.fts.search_index

search_index(index: Index, query: str, limit: int = 10, ordering_field=None) -> Tuple[Tuple[int], Tuple[float]]

Search an index for a query

Parameters:

  • index (Index) –

    The index object

  • query (str) –

    The query string

  • limit (int, default: 10 ) –

    The maximum number of results to return

  • ordering_field (default: None ) –

    If set, the name of the field to order results by

Returns:

  • ids_and_score ( Tuple[Tuple[int], Tuple[float]] ) –

    A tuple of two tuples: the first contains the document ids and the second contains the corresponding scores. Both are empty if nothing matches.
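
The return shape comes from transposing a list of (doc_id, score) pairs with zip. Here is a minimal sketch of that idiom, using made-up hits in place of tantivy results. split_hits is a hypothetical name.

```python
from typing import List, Tuple


def split_hits(
    hits: List[Tuple[float, int]],
) -> Tuple[Tuple[int, ...], Tuple[float, ...]]:
    """Turn (score, doc_id) hits into (ids, scores), or two empty tuples."""
    if not hits:
        return tuple(), tuple()
    # zip(*pairs) transposes [(id, score), ...] into (ids...), (scores...)
    ids, scores = zip(*[(doc_id, score) for score, doc_id in hits])
    return ids, scores


ids, scores = split_hits([(0.9, 12), (0.5, 7)])
print(ids)     # (12, 7)
print(scores)  # (0.9, 0.5)
```

Position i in the first tuple and position i in the second describe the same hit, so callers can pair them back up with zip(ids, scores).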

Source code in lancedb/fts.py
def search_index(
    index: tantivy.Index, query: str, limit: int = 10, ordering_field=None
) -> Tuple[Tuple[int], Tuple[float]]:
    """
    Search an index for a query

    Parameters
    ----------
    index : tantivy.Index
        The index object
    query : str
        The query string
    limit : int
        The maximum number of results to return

    Returns
    -------
    ids_and_score: list[tuple[int], tuple[float]]
        A tuple of two tuples, the first containing the document ids
        and the second containing the scores
    """
    searcher = index.searcher()
    query = index.parse_query(query)
    # get top results
    if ordering_field:
        results = searcher.search(query, limit, order_by_field=ordering_field)
    else:
        results = searcher.search(query, limit)
    if results.count == 0:
        return tuple(), tuple()
    return tuple(
        zip(
            *[
                (searcher.doc(doc_address)["doc_id"][0], score)
                for score, doc_address in results.hits
            ]
        )
    )

Utilities

lancedb.schema.vector

vector(dimension: int, value_type: DataType = pa.float32()) -> DataType

A helper function to create a vector type.

Parameters:

  • dimension (int) –

    The dimension of the vector.

  • value_type (DataType, default: float32() ) –

    The type of the value in the vector.

Returns:

  • DataType –

    A PyArrow DataType for vectors.

Examples:

>>> import pyarrow as pa
>>> import lancedb
>>> schema = pa.schema([
...     pa.field("id", pa.int64()),
...     pa.field("vector", lancedb.vector(756)),
... ])
Source code in lancedb/schema.py
def vector(dimension: int, value_type: pa.DataType = pa.float32()) -> pa.DataType:
    """A helper function to create a vector type.

    Parameters
    ----------
    dimension: The dimension of the vector.
    value_type: pa.DataType, optional
        The type of the value in the vector.

    Returns
    -------
    A PyArrow DataType for vectors.

    Examples
    --------

    >>> import pyarrow as pa
    >>> import lancedb
    >>> schema = pa.schema([
    ...     pa.field("id", pa.int64()),
    ...     pa.field("vector", lancedb.vector(756)),
    ... ])
    """
    return pa.list_(value_type, dimension)

lancedb.merge.LanceMergeInsertBuilder

Bases: object

Builder for a LanceDB merge insert operation

See merge_insert for more context
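
The three clauses of the builder can be modelled on plain lists of dicts. The sketch below is only a mental model under simplifying assumptions: a single key column, unique keys on both sides (the documented behaviour for multiple matches is undefined), and no where or condition filters. merge_rows is a hypothetical function and does not call lancedb.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge_rows(
    target: List[Row],
    source: List[Row],
    on: str,
    *,
    when_matched_update_all: bool = False,
    when_not_matched_insert_all: bool = False,
    when_not_matched_by_source_delete: bool = False,
) -> List[Row]:
    """Apply the selected merge-insert clauses and return the new target rows."""
    source_by_key = {row[on]: row for row in source}
    target_keys = {row[on] for row in target}
    merged: List[Row] = []
    for row in target:
        match = source_by_key.get(row[on])
        if match is not None:
            # Row exists on both sides: replace it only if asked to.
            merged.append(dict(match) if when_matched_update_all else row)
        elif not when_not_matched_by_source_delete:
            # Row exists only in the target: keep it unless deletion is on.
            merged.append(row)
    if when_not_matched_insert_all:
        # Rows that exist only in the source are appended.
        merged.extend(dict(row) for row in source if row[on] not in target_keys)
    return merged


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]

upsert = merge_rows(
    target, source, "id",
    when_matched_update_all=True,
    when_not_matched_insert_all=True,
)
insert_if_absent = merge_rows(target, source, "id", when_not_matched_insert_all=True)
replace_all = merge_rows(
    target, source, "id",
    when_matched_update_all=True,
    when_not_matched_insert_all=True,
    when_not_matched_by_source_delete=True,
)
print(upsert)
print(insert_if_absent)
print(replace_all)
```

Combining the first two clauses gives an upsert, the second clause alone gives insert-if-not-exists, and adding the third removes target rows that the new data no longer contains.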

Source code in lancedb/merge.py
class LanceMergeInsertBuilder(object):
    """Builder for a LanceDB merge insert operation

    See [`merge_insert`][lancedb.table.Table.merge_insert] for
    more context
    """

    def __init__(self, table: "Table", on: List[str]):  # noqa: F821
        # Do not put a docstring here.  This method should be hidden
        # from API docs.  Users should use merge_insert to create
        # this object.
        self._table = table
        self._on = on
        self._when_matched_update_all = False
        self._when_matched_update_all_condition = None
        self._when_not_matched_insert_all = False
        self._when_not_matched_by_source_delete = False
        self._when_not_matched_by_source_condition = None
        self._timeout = None

    def when_matched_update_all(
        self, *, where: Optional[str] = None
    ) -> LanceMergeInsertBuilder:
        """
        Rows that exist in both the source table (new data) and
        the target table (old data) will be updated, replacing
        the old row with the corresponding matching row.

        If there are multiple matches then the behavior is undefined.
        Currently this causes multiple copies of the row to be created
        but that behavior is subject to change.
        """
        self._when_matched_update_all = True
        self._when_matched_update_all_condition = where
        return self

    def when_not_matched_insert_all(self) -> LanceMergeInsertBuilder:
        """
        Rows that exist only in the source table (new data) should
        be inserted into the target table.
        """
        self._when_not_matched_insert_all = True
        return self

    def when_not_matched_by_source_delete(
        self, condition: Optional[str] = None
    ) -> LanceMergeInsertBuilder:
        """
        Rows that exist only in the target table (old data) will be
        deleted.  An optional condition can be provided to limit what
        data is deleted.

        Parameters
        ----------
        condition: Optional[str], default None
            If None then all such rows will be deleted.  Otherwise the
            condition will be used as an SQL filter to limit what rows
            are deleted.
        """
        self._when_not_matched_by_source_delete = True
        if condition is not None:
            self._when_not_matched_by_source_condition = condition
        return self

    def execute(
        self,
        new_data: DATA,
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
        timeout: Optional[timedelta] = None,
    ) -> MergeInsertResult:
        """
        Executes the merge insert operation

        Updates the [`Table`][lancedb.table.Table] and returns a MergeInsertResult

        Parameters
        ----------
        new_data: DATA
            New records which will be matched against the existing records
            to potentially insert or update into the table.  This parameter
            can be anything you use for [`add`][lancedb.table.Table.add]
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contain NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
        timeout: Optional[timedelta], default None
            Maximum time to run the operation before cancelling it.

            By default, there is a 30-second timeout that is only enforced after the
            first attempt. This is to prevent spending too long retrying to resolve
            conflicts. For example, if a write attempt takes 20 seconds and fails,
            the second attempt will be cancelled after 10 seconds, hitting the
            30-second timeout. However, a write that takes one hour and succeeds on the
            first attempt will not be cancelled.

            When this is set, the timeout is enforced on all attempts, including
            the first.

        Returns
        -------
        MergeInsertResult
            version: the new version number of the table after doing merge insert.
        """
        if timeout is not None:
            self._timeout = timeout
        return self._table._do_merge(self, new_data, on_bad_vectors, fill_value)

when_matched_update_all

when_matched_update_all(*, where: Optional[str] = None) -> LanceMergeInsertBuilder

Rows that exist in both the source table (new data) and the target table (old data) will be updated, replacing the old row with the corresponding matching row.

If there are multiple matches then the behavior is undefined. Currently this causes multiple copies of the row to be created but that behavior is subject to change.

Source code in lancedb/merge.py
def when_matched_update_all(
    self, *, where: Optional[str] = None
) -> LanceMergeInsertBuilder:
    """
    Rows that exist in both the source table (new data) and
    the target table (old data) will be updated, replacing
    the old row with the corresponding matching row.

    If there are multiple matches then the behavior is undefined.
    Currently this causes multiple copies of the row to be created
    but that behavior is subject to change.
    """
    self._when_matched_update_all = True
    self._when_matched_update_all_condition = where
    return self

when_not_matched_insert_all

when_not_matched_insert_all() -> LanceMergeInsertBuilder

Rows that exist only in the source table (new data) should be inserted into the target table.

Source code in lancedb/merge.py
def when_not_matched_insert_all(self) -> LanceMergeInsertBuilder:
    """
    Rows that exist only in the source table (new data) should
    be inserted into the target table.
    """
    self._when_not_matched_insert_all = True
    return self

when_not_matched_by_source_delete

when_not_matched_by_source_delete(condition: Optional[str] = None) -> LanceMergeInsertBuilder

Rows that exist only in the target table (old data) will be deleted. An optional condition can be provided to limit what data is deleted.

Parameters:

  • condition (Optional[str], default: None ) –

    If None then all such rows will be deleted. Otherwise the condition will be used as an SQL filter to limit what rows are deleted.

Source code in lancedb/merge.py
def when_not_matched_by_source_delete(
    self, condition: Optional[str] = None
) -> LanceMergeInsertBuilder:
    """
    Rows that exist only in the target table (old data) will be
    deleted.  An optional condition can be provided to limit what
    data is deleted.

    Parameters
    ----------
    condition: Optional[str], default None
        If None then all such rows will be deleted.  Otherwise the
        condition will be used as an SQL filter to limit what rows
        are deleted.
    """
    self._when_not_matched_by_source_delete = True
    if condition is not None:
        self._when_not_matched_by_source_condition = condition
    return self

execute

execute(new_data: DATA, on_bad_vectors: str = 'error', fill_value: float = 0.0, timeout: Optional[timedelta] = None) -> MergeInsertResult

Executes the merge insert operation

Updates the Table and returns a MergeInsertResult

Parameters:

  • new_data (DATA) –

    New records which will be matched against the existing records to potentially insert or update into the table. This parameter can be anything you would pass to add.

  • on_bad_vectors (str, default: 'error' ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default: 0.0 ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • timeout (Optional[timedelta], default: None ) –

    Maximum time to run the operation before cancelling it.

    By default, there is a 30-second timeout that is only enforced after the first attempt. This is to prevent spending too long retrying to resolve conflicts. For example, if a write attempt takes 20 seconds and fails, the second attempt will be cancelled after 10 seconds, hitting the 30-second timeout. However, a write that takes one hour and succeeds on the first attempt will not be cancelled.

    When this is set, the timeout is enforced on all attempts, including the first.

Returns:

  • MergeInsertResult –

    Its version field holds the new version number of the table after the merge insert.
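
The timeout rule can be restated as a small function of the time already spent and the attempt number. This is only a model of the behaviour described for the timeout parameter. remaining_budget and DEFAULT_RETRY_BUDGET are hypothetical names, and the sketch says nothing about how lancedb actually schedules retries.

```python
from datetime import timedelta
from typing import Optional

# The 30-second default mentioned in the parameter description.
DEFAULT_RETRY_BUDGET = timedelta(seconds=30)


def remaining_budget(
    elapsed: timedelta, attempt: int, timeout: Optional[timedelta] = None
) -> Optional[timedelta]:
    """Time the given attempt may still run, or None if it is not time-limited."""
    if timeout is None:
        if attempt == 1:
            return None  # by default the first attempt is never cancelled
        budget = DEFAULT_RETRY_BUDGET
    else:
        budget = timeout  # an explicit timeout applies to every attempt
    return max(budget - elapsed, timedelta(0))


# First attempt, no explicit timeout: unlimited.
print(remaining_budget(timedelta(0), attempt=1))  # None
# A failed 20-second first attempt leaves 10 seconds for the second one.
print(remaining_budget(timedelta(seconds=20), attempt=2))  # 0:00:10
# With an explicit timeout, the first attempt is limited too.
print(remaining_budget(timedelta(0), attempt=1, timeout=timedelta(minutes=5)))  # 0:05:00
```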

Source code in lancedb/merge.py
def execute(
    self,
    new_data: DATA,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
    timeout: Optional[timedelta] = None,
) -> MergeInsertResult:
    """
    Executes the merge insert operation

    Updates the [`Table`][lancedb.table.Table] and returns a MergeInsertResult

    Parameters
    ----------
    new_data: DATA
        New records which will be matched against the existing records
        to potentially insert or update into the table.  This parameter
        can be anything you use for [`add`][lancedb.table.Table.add]
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contain NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".
    timeout: Optional[timedelta], default None
        Maximum time to run the operation before cancelling it.

        By default, there is a 30-second timeout that is only enforced after the
        first attempt. This is to prevent spending too long retrying to resolve
        conflicts. For example, if a write attempt takes 20 seconds and fails,
        the second attempt will be cancelled after 10 seconds, hitting the
        30-second timeout. However, a write that takes one hour and succeeds on the
        first attempt will not be cancelled.

        When this is set, the timeout is enforced on all attempts, including
        the first.

    Returns
    -------
    MergeInsertResult
        version: the new version number of the table after doing merge insert.
    """
    if timeout is not None:
        self._timeout = timeout
    return self._table._do_merge(self, new_data, on_bad_vectors, fill_value)

Integrations

Pydantic

lancedb.pydantic.pydantic_to_schema

pydantic_to_schema(model: Type[BaseModel]) -> Schema

Convert a Pydantic Model to a PyArrow Schema.

Parameters:

  • model (Type[BaseModel]) –

    The Pydantic BaseModel to convert to Arrow Schema.

Returns:

  • Schema –

    The Arrow Schema

Examples:

>>> from typing import List, Optional
>>> import pyarrow as pa
>>> import pydantic
>>> from lancedb.pydantic import pydantic_to_schema, Vector
>>> class FooModel(pydantic.BaseModel):
...     id: int
...     s: str
...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]
...     li: List[int]
...
>>> schema = pydantic_to_schema(FooModel)
>>> assert schema == pa.schema([
...     pa.field("id", pa.int64(), False),
...     pa.field("s", pa.utf8(), False),
...     pa.field("vec", pa.list_(pa.float32(), 1536)),
...     pa.field("li", pa.list_(pa.int64()), False),
... ])
Source code in lancedb/pydantic.py
def pydantic_to_schema(model: Type[pydantic.BaseModel]) -> pa.Schema:
    """Convert a [Pydantic Model][pydantic.BaseModel] to a
       [PyArrow Schema][pyarrow.Schema].

    Parameters
    ----------
    model : Type[pydantic.BaseModel]
        The Pydantic BaseModel to convert to Arrow Schema.

    Returns
    -------
    pyarrow.Schema
        The Arrow Schema

    Examples
    --------

    >>> from typing import List, Optional
    >>> import pyarrow as pa
    >>> import pydantic
    >>> from lancedb.pydantic import pydantic_to_schema, Vector
    >>> class FooModel(pydantic.BaseModel):
    ...     id: int
    ...     s: str
    ...     vec: Vector(1536)  # fixed_size_list<item: float32>[1536]
    ...     li: List[int]
    ...
    >>> schema = pydantic_to_schema(FooModel)
    >>> assert schema == pa.schema([
    ...     pa.field("id", pa.int64(), False),
    ...     pa.field("s", pa.utf8(), False),
    ...     pa.field("vec", pa.list_(pa.float32(), 1536)),
    ...     pa.field("li", pa.list_(pa.int64()), False),
    ... ])
    """
    fields = _pydantic_model_to_fields(model)
    return pa.schema(fields)

lancedb.pydantic.vector

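vector(dim: int, value_type: DataType = pa.float32())

Deprecated: use lancedb.pydantic.Vector instead. This function will be removed in a future release.

Source code in lancedb/pydantic.py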
def vector(dim: int, value_type: pa.DataType = pa.float32()):
    # TODO: remove in future release
    from warnings import warn

    warn(
        "lancedb.pydantic.vector() is deprecated, use lancedb.pydantic.Vector instead. "
        "This function will be removed in a future release",
        DeprecationWarning,
    )
    return Vector(dim, value_type)

lancedb.pydantic.LanceModel

Bases: BaseModel

A Pydantic Model base class that can be converted to a LanceDB Table.

Examples:

>>> import lancedb
>>> from lancedb.pydantic import LanceModel, Vector
>>>
>>> class TestModel(LanceModel):
...     name: str
...     vector: Vector(2)
...
>>> db = lancedb.connect("./example")
>>> table = db.create_table("test", schema=TestModel)
>>> table.add([
...     TestModel(name="test", vector=[1.0, 2.0])
... ])
AddResult(version=2)
>>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)
[TestModel(name='test', vector=FixedSizeList(dim=2))]
Source code in lancedb/pydantic.py
class LanceModel(pydantic.BaseModel):
    """
    A Pydantic Model base class that can be converted to a LanceDB Table.

    Examples
    --------
    >>> import lancedb
    >>> from lancedb.pydantic import LanceModel, Vector
    >>>
    >>> class TestModel(LanceModel):
    ...     name: str
    ...     vector: Vector(2)
    ...
    >>> db = lancedb.connect("./example")
    >>> table = db.create_table("test", schema=TestModel)
    >>> table.add([
    ...     TestModel(name="test", vector=[1.0, 2.0])
    ... ])
    AddResult(version=2)
    >>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)
    [TestModel(name='test', vector=FixedSizeList(dim=2))]
    """

    @classmethod
    def to_arrow_schema(cls):
        """
        Get the Arrow Schema for this model.
        """
        schema = pydantic_to_schema(cls)
        functions = cls.parse_embedding_functions()
        if len(functions) > 0:
            # Prevent circular import
            from .embeddings import EmbeddingFunctionRegistry

            metadata = EmbeddingFunctionRegistry.get_instance().get_table_metadata(
                functions
            )
            schema = schema.with_metadata(metadata)
        return schema

    @classmethod
    def field_names(cls) -> List[str]:
        """
        Get the field names of this model.
        """
        return list(cls.safe_get_fields().keys())

    @classmethod
    def safe_get_fields(cls):
        if PYDANTIC_VERSION.major < 2:
            return cls.__fields__
        return cls.model_fields

    @classmethod
    def parse_embedding_functions(cls) -> List["EmbeddingFunctionConfig"]:
        """
        Parse the embedding functions from this model.
        """
        from .embeddings import EmbeddingFunctionConfig

        vec_and_function = []
        for name, field_info in cls.safe_get_fields().items():
            func = get_extras(field_info, "vector_column_for")
            if func is not None:
                vec_and_function.append([name, func])

        configs = []
        for vec, func in vec_and_function:
            for source, field_info in cls.safe_get_fields().items():
                src_func = get_extras(field_info, "source_column_for")
                if src_func is func:
                    # note we can't use == here since the function is a pydantic
                    # model so two instances of the same function are ==, so if you
                    # have multiple vector columns from multiple sources, both will
                    # be mapped to the same source column
                    # GH594
                    configs.append(
                        EmbeddingFunctionConfig(
                            source_column=source, vector_column=vec, function=func
                        )
                    )
        return configs

to_arrow_schema classmethod

to_arrow_schema()

Get the Arrow Schema for this model.

Source code in lancedb/pydantic.py
@classmethod
def to_arrow_schema(cls):
    """
    Get the Arrow Schema for this model.
    """
    schema = pydantic_to_schema(cls)
    functions = cls.parse_embedding_functions()
    if len(functions) > 0:
        # Prevent circular import
        from .embeddings import EmbeddingFunctionRegistry

        metadata = EmbeddingFunctionRegistry.get_instance().get_table_metadata(
            functions
        )
        schema = schema.with_metadata(metadata)
    return schema

field_names classmethod

field_names() -> List[str]

Get the field names of this model.

Source code in lancedb/pydantic.py
@classmethod
def field_names(cls) -> List[str]:
    """
    Get the field names of this model.
    """
    return list(cls.safe_get_fields().keys())

parse_embedding_functions classmethod

parse_embedding_functions() -> List['EmbeddingFunctionConfig']

Parse the embedding functions from this model.
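
The source listing below pairs each vector column with its source column by comparing embedding function objects with `is` rather than `==`. The sketch below shows why that matters, using a dataclass as a stand-in for a pydantic model (both compare equal by field values). FakeEmbeddingFunction, the fields dict and pair_columns are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FakeEmbeddingFunction:
    """Stand-in for an embedding function; equal whenever the fields are equal."""
    name: str


# Two separate instances with identical configuration.
func_a = FakeEmbeddingFunction("model")
func_b = FakeEmbeddingFunction("model")

fields: Dict[str, dict] = {
    "text_a": {"source_column_for": func_a},
    "vec_a": {"vector_column_for": func_a},
    "text_b": {"source_column_for": func_b},
    "vec_b": {"vector_column_for": func_b},
}


def pair_columns(
    fields: Dict[str, dict], same: Callable[[object, object], bool]
) -> List[Tuple[str, str]]:
    """Return (source_column, vector_column) pairs whose functions match under `same`."""
    pairs = []
    for vec, info in fields.items():
        func = info.get("vector_column_for")
        if func is None:
            continue
        for source, source_info in fields.items():
            src_func = source_info.get("source_column_for")
            if src_func is not None and same(src_func, func):
                pairs.append((source, vec))
    return pairs


by_identity = pair_columns(fields, lambda a, b: a is b)
by_equality = pair_columns(fields, lambda a, b: a == b)
print(by_identity)  # [('text_a', 'vec_a'), ('text_b', 'vec_b')]
print(by_equality)  # four pairs: every source is matched to every vector column
```

Comparing by identity keeps each vector column tied to the one source column that was declared with the same function object.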

Source code in lancedb/pydantic.py
@classmethod
def parse_embedding_functions(cls) -> List["EmbeddingFunctionConfig"]:
    """
    Parse the embedding functions from this model.
    """
    from .embeddings import EmbeddingFunctionConfig

    vec_and_function = []
    for name, field_info in cls.safe_get_fields().items():
        func = get_extras(field_info, "vector_column_for")
        if func is not None:
            vec_and_function.append([name, func])

    configs = []
    for vec, func in vec_and_function:
        for source, field_info in cls.safe_get_fields().items():
            src_func = get_extras(field_info, "source_column_for")
            if src_func is func:
                # note we can't use == here since the function is a pydantic
                # model so two instances of the same function are ==, so if you
                # have multiple vector columns from multiple sources, both will
                # be mapped to the same source column
                # GH594
                configs.append(
                    EmbeddingFunctionConfig(
                        source_column=source, vector_column=vec, function=func
                    )
                )
    return configs

Reranking

lancedb.rerankers.linear_combination.LinearCombinationReranker

Bases: Reranker

Reranks the results using a linear combination of the scores from the vector and FTS search. For missing scores, fill with fill value.

Parameters:

  • weight (float, default: 0.7 ) –

    The weight to give to the vector score. Must be between 0 and 1.

  • fill (float, default: 1.0 ) –

    The score to give to results that are only in one of the two result sets. This is treated as a penalty, so a higher value means a lower score. TODO: We should just hardcode this; it's pretty confusing as we invert scores to calculate the final score.

  • return_score (str, default: "relevance" ) –

    The type of score to return. Options are "relevance" or "all". If "relevance", only the relevance score is returned. If "all", all scores from the vector and FTS search are returned along with the relevance score.
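
Before any scores are combined, the two result sets are unioned by _rowid, and fill stands in for whichever score a row lacks. The sketch below models only that stage with lists of dicts. union_with_fill is a hypothetical name, and the score inversion and weighted combination are deliberately left out because their definitions are not shown in this section.

```python
from typing import Any, Dict, List

Hit = Dict[str, Any]


def union_with_fill(
    vector_hits: List[Hit], fts_hits: List[Hit], fill: float = 1.0
) -> List[Hit]:
    """Union hits by `_rowid`; a score missing from one side is replaced by `fill`."""
    merged: Dict[int, Hit] = {}
    for hit in vector_hits:
        merged[hit["_rowid"]] = dict(hit)
    for hit in fts_hits:
        row_id = hit["_rowid"]
        if row_id in merged:
            # Row found by both searches: attach the FTS score to it.
            merged[row_id]["_score"] = hit["_score"]
        else:
            merged[row_id] = dict(hit)
    return [
        {
            "_rowid": row_id,
            "_distance": row.get("_distance", fill),
            "_score": row.get("_score", fill),
        }
        for row_id, row in merged.items()
    ]


vector_hits = [{"_rowid": 1, "_distance": 0.2}, {"_rowid": 2, "_distance": 0.5}]
fts_hits = [{"_rowid": 2, "_score": 0.9}, {"_rowid": 3, "_score": 0.4}]
combined = union_with_fill(vector_hits, fts_hits)
for row in combined:
    print(row)
```

Row 2 carries both of its real scores, while rows 1 and 3 each receive fill for the search that did not return them.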

Source code in lancedb/rerankers/linear_combination.py
class LinearCombinationReranker(Reranker):
    """
    Reranks the results using a linear combination of the scores from the
    vector and FTS search. For missing scores, fill with `fill` value.

    Parameters
    ----------
    weight : float, default 0.7
        The weight to give to the vector score. Must be between 0 and 1.
    fill : float, default 1.0
        The score to give to results that are only in one of the two result sets.
        This is treated as penalty, so a higher value means a lower score.
        TODO: We should just hardcode this;
        it's pretty confusing as we invert scores to calculate final score
    return_score : str, default "relevance"
        Options are "relevance" or "all".
        The type of score to return. If "relevance", will return only the relevance
        score. If "all", will return all scores from the vector and FTS search along
        with the relevance score.
    """

    def __init__(
        self, weight: float = 0.7, fill: float = 1.0, return_score="relevance"
    ):
        if weight < 0 or weight > 1:
            raise ValueError("weight must be between 0 and 1.")
        super().__init__(return_score)
        self.weight = weight
        self.fill = fill

    def rerank_hybrid(
        self,
        query: str,  # noqa: F821
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results, self.fill)

        return combined_results

    def merge_results(
        self, vector_results: pa.Table, fts_results: pa.Table, fill: float
    ):
        # If one is empty then return the other and add _relevance_score
        # column equal the existing vector or fts score
        if len(vector_results) == 0:
            results = fts_results.append_column(
                "_relevance_score",
                pa.array(fts_results["_score"], type=pa.float32()),
            )
            if self.score == "relevance":
                results = self._keep_relevance_score(results)
            elif self.score == "all":
                results = results.append_column(
                    "_distance",
                    pa.array([nan] * len(fts_results), type=pa.float32()),
                )
            return results

        if len(fts_results) == 0:
            # invert the distance to relevance score
            results = vector_results.append_column(
                "_relevance_score",
                pa.array(
                    [
                        self._invert_score(distance)
                        for distance in vector_results["_distance"].to_pylist()
                    ],
                    type=pa.float32(),
                ),
            )
            if self.score == "relevance":
                results = self._keep_relevance_score(results)
            elif self.score == "all":
                results = results.append_column(
                    "_score",
                    pa.array([nan] * len(vector_results), type=pa.float32()),
                )
            return results
        results = defaultdict()
        for vector_result in vector_results.to_pylist():
            results[vector_result["_rowid"]] = vector_result
        for fts_result in fts_results.to_pylist():
            row_id = fts_result["_rowid"]
            if row_id in results:
                results[row_id]["_score"] = fts_result["_score"]
            else:
                results[row_id] = fts_result

        combined_list = []
        for row_id, result in results.items():
            vector_score = self._invert_score(result.get("_distance", fill))
            fts_score = result.get("_score", fill)
            result["_relevance_score"] = self._combine_score(vector_score, fts_score)
            combined_list.append(result)

        relevance_score_schema = pa.schema(
            [
                pa.field("_relevance_score", pa.float32()),
            ]
        )
        combined_schema = pa.unify_schemas(
            [vector_results.schema, fts_results.schema, relevance_score_schema]
        )
        tbl = pa.Table.from_pylist(combined_list, schema=combined_schema).sort_by(
            [("_relevance_score", "descending")]
        )
        if self.score == "relevance":
            tbl = self._keep_relevance_score(tbl)
        return tbl

    def _combine_score(self, vector_score, fts_score):
        # these scores represent distance
        return 1 - (self.weight * vector_score + (1 - self.weight) * fts_score)

    def _invert_score(self, dist: float):
        # Invert the score between relevance and distance
        return 1 - dist
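
As a rough illustration of the listing above, the following sketch re-implements the merge and scoring steps on plain Python dictionaries instead of pyarrow tables. The function names `merge_rows`, `invert_score`, and `combine_score` and the sample rows are made up for this example; only the arithmetic is taken from the source shown above.

```python
def invert_score(dist: float) -> float:
    # Same arithmetic as _invert_score in the listing above.
    return 1 - dist


def combine_score(vector_score: float, fts_score: float, weight: float = 0.7) -> float:
    # Same arithmetic as _combine_score in the listing above.
    return 1 - (weight * vector_score + (1 - weight) * fts_score)


def merge_rows(vector_rows, fts_rows, weight: float = 0.7, fill: float = 1.0):
    """Hypothetical helper: merge two result lists keyed by `_rowid`."""
    merged = {}
    for row in vector_rows:
        merged[row["_rowid"]] = dict(row)
    for row in fts_rows:
        if row["_rowid"] in merged:
            # Row found by both searches: keep the distance, add the FTS score.
            merged[row["_rowid"]]["_score"] = row["_score"]
        else:
            merged[row["_rowid"]] = dict(row)

    for row in merged.values():
        # Missing values are replaced by `fill` before scoring.
        vector_score = invert_score(row.get("_distance", fill))
        fts_score = row.get("_score", fill)
        row["_relevance_score"] = combine_score(vector_score, fts_score, weight)

    return sorted(merged.values(), key=lambda r: r["_relevance_score"], reverse=True)


# Made-up sample rows for illustration only.
vector_rows = [{"_rowid": 1, "_distance": 0.2}, {"_rowid": 2, "_distance": 0.5}]
fts_rows = [{"_rowid": 2, "_score": 0.9}, {"_rowid": 3, "_score": 0.4}]

ranked = merge_rows(vector_rows, fts_rows)
ranked_ids = [row["_rowid"] for row in ranked]
```

With the default `weight=0.7` and `fill=1.0`, a row that appears only in the FTS results gets a vector score of `1 - fill = 0`, while a row that appears only in the vector results gets an FTS score of `fill`.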

lancedb.rerankers.cohere.CohereReranker

Bases: Reranker

Reranks the results using the Cohere Rerank API. https://docs.cohere.com/docs/rerank-guide

Parameters:

  • model_name (str, default: "rerank-english-v3.0" ) –

    The name of the cross encoder model to use. Available Cohere models include rerank-english-v2.0 and rerank-multilingual-v2.0.

  • column (str, default: "text" ) –

    The name of the column to use as input to the cross encoder model.

  • top_n (int, default: None ) –

    The number of results to return. If None, will return all results.

Source code in lancedb/rerankers/cohere.py
class CohereReranker(Reranker):
    """
    Reranks the results using the Cohere Rerank API.
    https://docs.cohere.com/docs/rerank-guide

    Parameters
    ----------
    model_name : str, default "rerank-english-v3.0"
        The name of the cross encoder model to use. Available cohere models are:
        - rerank-english-v2.0
        - rerank-multilingual-v2.0
    column : str, default "text"
        The name of the column to use as input to the cross encoder model.
    top_n : int, default None
        The number of results to return. If None, will return all results.
    """

    def __init__(
        self,
        model_name: str = "rerank-english-v3.0",
        column: str = "text",
        top_n: Union[int, None] = None,
        return_score="relevance",
        api_key: Union[str, None] = None,
    ):
        super().__init__(return_score)
        self.model_name = model_name
        self.column = column
        self.top_n = top_n
        self.api_key = api_key

    @cached_property
    def _client(self):
        cohere = attempt_import_or_raise("cohere")
        # ensure version is at least 0.5.0
        if hasattr(cohere, "__version__") and Version(cohere.__version__) < Version(
            "0.5.0"
        ):
            raise ValueError(
                f"cohere version must be at least 0.5.0, found {cohere.__version__}"
            )
        if os.environ.get("COHERE_API_KEY") is None and self.api_key is None:
            raise ValueError(
                "COHERE_API_KEY not set. Either set it in your environment or "
                "pass it as the `api_key` argument to the CohereReranker."
            )
        return cohere.Client(os.environ.get("COHERE_API_KEY") or self.api_key)

    def _rerank(self, result_set: pa.Table, query: str):
        result_set = self._handle_empty_results(result_set)
        if len(result_set) == 0:
            return result_set
        docs = result_set[self.column].to_pylist()
        response = self._client.rerank(
            query=query,
            documents=docs,
            top_n=self.top_n,
            model=self.model_name,
        )
        results = (
            response.results
        )  # returns list (text, idx, relevance) attributes sorted descending by score
        indices, scores = list(
            zip(*[(result.index, result.relevance_score) for result in results])
        )  # tuples
        result_set = result_set.take(list(indices))
        # add the scores
        result_set = result_set.append_column(
            "_relevance_score", pa.array(scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        if self.score == "relevance":
            combined_results = self._keep_relevance_score(combined_results)
        elif self.score == "all":
            raise NotImplementedError(
                "return_score='all' not implemented for cohere reranker"
            )
        return combined_results

    def rerank_vector(self, query: str, vector_results: pa.Table):
        vector_results = self._rerank(vector_results, query)
        if self.score == "relevance":
            vector_results = vector_results.drop_columns(["_distance"])
        return vector_results

    def rerank_fts(self, query: str, fts_results: pa.Table):
        fts_results = self._rerank(fts_results, query)
        if self.score == "relevance":
            fts_results = fts_results.drop_columns(["_score"])
        return fts_results
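
The reordering step in `_rerank` above can be sketched without the Cohere client. In this minimal example, `FakeResult` is a hypothetical stand-in for the objects in `response.results` (only the `index` and `relevance_score` attributes used by the listing are modeled), and a plain list replaces the pyarrow table.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of `response.results`.
FakeResult = namedtuple("FakeResult", ["index", "relevance_score"])

docs = ["alpha", "beta", "gamma"]

# Assumed to arrive sorted by descending score, as the comment in the listing says.
results = [
    FakeResult(index=2, relevance_score=0.91),
    FakeResult(index=0, relevance_score=0.40),
    FakeResult(index=1, relevance_score=0.05),
]

# Split the (index, score) pairs into two parallel tuples, as in the listing.
indices, scores = zip(*[(r.index, r.relevance_score) for r in results])

# Equivalent of `result_set.take(list(indices))` for a plain list.
reordered = [docs[i] for i in indices]
```

Because the rows are taken in the order of the returned indices, the scores line up with the reordered rows position by position.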

lancedb.rerankers.colbert.ColbertReranker

Bases: AnswerdotaiRerankers

Reranks the results using the ColBERT model.

Parameters:

  • model_name (str, default: "colbert-ir/colbertv2.0" ) –

    The name of the ColBERT model to use.

  • column (str, default: "text" ) –

    The name of the column to use as input to the model.

  • return_score (str, default: "relevance" ) –

    options are "relevance" or "all". Only "relevance" is supported for now.

  • **kwargs –

    Additional keyword arguments to pass to the model, for example, 'device'. See AnswerDotAI/rerankers for more information.

Source code in lancedb/rerankers/colbert.py
class ColbertReranker(AnswerdotaiRerankers):
    """
    Reranks the results using the ColBERT model.

    Parameters
    ----------
    model_name : str, default "colbert-ir/colbertv2.0"
        The name of the ColBERT model to use.
    column : str, default "text"
        The name of the column to use as input to the model.
    return_score : str, default "relevance"
        options are "relevance" or "all". Only "relevance" is supported for now.
    **kwargs
        Additional keyword arguments to pass to the model, for example, 'device'.
        See AnswerDotAI/rerankers for more information.
    """

    def __init__(
        self,
        model_name: str = "colbert-ir/colbertv2.0",
        column: str = "text",
        return_score="relevance",
        **kwargs,
    ):
        super().__init__(
            model_type="colbert",
            model_name=model_name,
            column=column,
            return_score=return_score,
            **kwargs,
        )

lancedb.rerankers.cross_encoder.CrossEncoderReranker

Bases: Reranker

Reranks the results using a cross encoder model. The cross encoder model is used to score the query and each result. The results are then sorted by the score.

Parameters:

  • model_name (str, default: "cross-encoder/ms-marco-TinyBERT-L-6" ) –

    The name of the cross encoder model to use. See the sentence transformers documentation for a list of available models.

  • column (str, default: "text" ) –

    The name of the column to use as input to the cross encoder model.

  • device (str, default: None ) –

    The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu".

  • return_score (str, default: "relevance" ) –

    options are "relevance" or "all". Only "relevance" is supported for now.

  • trust_remote_code (bool, default: True ) –

    If True, will trust the remote code to be safe. If False, will not trust the remote code and will not run it.

Source code in lancedb/rerankers/cross_encoder.py
class CrossEncoderReranker(Reranker):
    """
    Reranks the results using a cross encoder model. The cross encoder model is
    used to score the query and each result. The results are then sorted by the score.

    Parameters
    ----------
    model_name : str, default "cross-encoder/ms-marco-TinyBERT-L-6"
        The name of the cross encoder model to use. See the sentence transformers
        documentation for a list of available models.
    column : str, default "text"
        The name of the column to use as input to the cross encoder model.
    device : str, default None
        The device to use for the cross encoder model. If None, will use "cuda"
        if available, otherwise "cpu".
    return_score : str, default "relevance"
        options are "relevance" or "all". Only "relevance" is supported for now.
    trust_remote_code : bool, default True
        If True, will trust the remote code to be safe. If False, will not trust
        the remote code and will not run it.
    """

    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-TinyBERT-L-6",
        column: str = "text",
        device: Union[str, None] = None,
        return_score="relevance",
        trust_remote_code: bool = True,
    ):
        super().__init__(return_score)
        torch = attempt_import_or_raise("torch")
        self.model_name = model_name
        self.column = column
        self.device = device
        self.trust_remote_code = trust_remote_code
        if self.device is None:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"

    @cached_property
    def model(self):
        sbert = attempt_import_or_raise("sentence_transformers")
        # Allows overriding the automatically selected device
        cross_encoder = sbert.CrossEncoder(
            self.model_name,
            device=self.device,
            trust_remote_code=self.trust_remote_code,
        )

        return cross_encoder

    def _rerank(self, result_set: pa.Table, query: str):
        result_set = self._handle_empty_results(result_set)
        if len(result_set) == 0:
            return result_set
        passages = result_set[self.column].to_pylist()
        cross_inp = [[query, passage] for passage in passages]
        cross_scores = self.model.predict(cross_inp)
        result_set = result_set.append_column(
            "_relevance_score", pa.array(cross_scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        if self.score == "relevance":
            combined_results = self._keep_relevance_score(combined_results)
        elif self.score == "all":
            raise NotImplementedError(
                "return_score='all' not implemented for CrossEncoderReranker"
            )
        # sort the results by _relevance_score
        combined_results = combined_results.sort_by(
            [("_relevance_score", "descending")]
        )

        return combined_results

    def rerank_vector(self, query: str, vector_results: pa.Table):
        vector_results = self._rerank(vector_results, query)
        if self.score == "relevance":
            vector_results = vector_results.drop_columns(["_distance"])

        vector_results = vector_results.sort_by([("_relevance_score", "descending")])
        return vector_results

    def rerank_fts(self, query: str, fts_results: pa.Table):
        fts_results = self._rerank(fts_results, query)
        if self.score == "relevance":
            fts_results = fts_results.drop_columns(["_score"])

        fts_results = fts_results.sort_by([("_relevance_score", "descending")])
        return fts_results
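
The score-then-sort flow of the cross encoder reranker can be sketched with a toy scorer. Here `toy_score` is a hypothetical word-overlap function standing in for the model's `predict` call, and lists of dictionaries replace the pyarrow table; a real cross encoder would produce different scores.

```python
def toy_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross encoder: fraction of query words found."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words)


query = "vector database search"
passages = [
    "a vector database",
    "cooking pasta at home",
    "search a vector database quickly",
]

# Build [query, passage] pairs, as `_rerank` does before calling the model.
cross_inp = [[query, passage] for passage in passages]
cross_scores = [toy_score(q, p) for q, p in cross_inp]

# Attach scores in the original row order, then sort descending by score.
rows = [
    {"text": passage, "_relevance_score": score}
    for passage, score in zip(passages, cross_scores)
]
ranked = sorted(rows, key=lambda r: r["_relevance_score"], reverse=True)
```

Unlike the Cohere listing, the scores here come back in input order, which is why the sort is a separate final step.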

lancedb.rerankers.openai.OpenaiReranker

Bases: Reranker

Reranks the results using the OpenAI API. WARNING: This is a prompt-based reranker that uses a chat model rather than a dedicated reranker API. It should be treated as experimental.

Parameters:

  • model_name (str, default: "gpt-4-turbo-preview" ) –

    The name of the OpenAI chat model to use.

  • column (str, default: "text" ) –

    The name of the column to use as input to the model.

  • return_score (str, default: "relevance" ) –

    options are "relevance" or "all". Only "relevance" is supported for now.

  • api_key (str, default: None ) –

    The API key to use. If None, will use the OPENAI_API_KEY environment variable.

Source code in lancedb/rerankers/openai.py
class OpenaiReranker(Reranker):
    """
    Reranks the results using the OpenAI API.
    WARNING: This is a prompt-based reranker that uses a chat model rather than
    a dedicated reranker API. It should be treated as experimental.

    Parameters
    ----------
    model_name : str, default "gpt-4-turbo-preview"
        The name of the OpenAI chat model to use.
    column : str, default "text"
        The name of the column to use as input to the model.
    return_score : str, default "relevance"
        options are "relevance" or "all". Only "relevance" is supported for now.
    api_key : str, default None
        The API key to use. If None, will use the OPENAI_API_KEY environment variable.
    """

    def __init__(
        self,
        model_name: str = "gpt-4-turbo-preview",
        column: str = "text",
        return_score="relevance",
        api_key: Optional[str] = None,
    ):
        super().__init__(return_score)
        self.model_name = model_name
        self.column = column
        self.api_key = api_key

    def _rerank(self, result_set: pa.Table, query: str):
        result_set = self._handle_empty_results(result_set)
        if len(result_set) == 0:
            return result_set
        docs = result_set[self.column].to_pylist()
        response = self._client.chat.completions.create(
            model=self.model_name,
            response_format={"type": "json_object"},
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert relevance ranker. Given a list of\
                        documents and a query, your job is to determine the relevance\
                        each document is for answering the query. Your output is JSON,\
                        which is a list of documents. Each document has two fields,\
                        content and relevance_score.  relevance_score is from 0.0 to\
                        1.0 indicating the relevance of the text to the given query.\
                        Make sure to include all documents in the response.",
                },
                {"role": "user", "content": f"Query: {query} Docs: {docs}"},
            ],
        )
        results = json.loads(response.choices[0].message.content)["documents"]
        docs, scores = list(
            zip(*[(result["content"], result["relevance_score"]) for result in results])
        )  # tuples
        # replace the self.column column with the docs
        result_set = result_set.drop(self.column)
        result_set = result_set.append_column(
            self.column, pa.array(docs, type=pa.string())
        )
        # add the scores
        result_set = result_set.append_column(
            "_relevance_score", pa.array(scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        if self.score == "relevance":
            combined_results = self._keep_relevance_score(combined_results)
        elif self.score == "all":
            raise NotImplementedError(
                "return_score='all' not implemented for OpenaiReranker"
            )

        combined_results = combined_results.sort_by(
            [("_relevance_score", "descending")]
        )

        return combined_results

    def rerank_vector(self, query: str, vector_results: pa.Table):
        vector_results = self._rerank(vector_results, query)
        if self.score == "relevance":
            vector_results = vector_results.drop_columns(["_distance"])
        vector_results = vector_results.sort_by([("_relevance_score", "descending")])
        return vector_results

    def rerank_fts(self, query: str, fts_results: pa.Table):
        fts_results = self._rerank(fts_results, query)
        if self.score == "relevance":
            fts_results = fts_results.drop_columns(["_score"])
        fts_results = fts_results.sort_by([("_relevance_score", "descending")])
        return fts_results

    @cached_property
    def _client(self):
        openai = attempt_import_or_raise(
            "openai"
        )  # TODO: force version or handle versions < 1.0
        if os.environ.get("OPENAI_API_KEY") is None and self.api_key is None:
            raise ValueError(
                "OPENAI_API_KEY not set. Either set it in your environment or "
                "pass it as the `api_key` argument to the OpenaiReranker."
            )
        return openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY") or self.api_key)
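
To make the response handling in `_rerank` above more concrete, this sketch parses a JSON payload of the shape the system prompt asks for. The payload is hand-written for illustration; an actual chat completion may differ in content and ordering, which is part of why the reranker is described as experimental.

```python
import json

# Hand-written example of the JSON shape requested by the system prompt.
content = json.dumps(
    {
        "documents": [
            {"content": "doc a", "relevance_score": 0.2},
            {"content": "doc b", "relevance_score": 0.9},
        ]
    }
)

# Same parsing steps as the listing: load JSON, then split into two tuples.
results = json.loads(content)["documents"]
docs, scores = zip(*[(r["content"], r["relevance_score"]) for r in results])

# The rerank_* methods then sort by the relevance score, highest first.
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
```

Note that the listing writes the returned `content` values back into the table column, so the output rows reflect whatever text the model returned.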

Connections (Asynchronous)

Connections represent a connection to a LanceDB database and can be used to create, list, or open tables.

lancedb.connect_async async

connect_async(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, client_config: Optional[Union[ClientConfig, Dict[str, Any]]] = None, storage_options: Optional[Dict[str, str]] = None) -> AsyncConnection

Connect to a LanceDB database.

Parameters:

  • uri (URI) –

    The uri of the database.

  • api_key (Optional[str], default: None ) –

    If present, connect to LanceDB cloud. Otherwise, connect to a database on file system or cloud storage. Can be set via environment variable LANCEDB_API_KEY.

  • region (str, default: 'us-east-1' ) –

    The region to use for LanceDB Cloud.

  • host_override (Optional[str], default: None ) –

    The override url for LanceDB Cloud.

  • read_consistency_interval (Optional[timedelta], default: None ) –

    (For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.

  • client_config (Optional[Union[ClientConfig, Dict[str, Any]]], default: None ) –

    Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/
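
The three consistency modes described for `read_consistency_interval` map onto `timedelta` values as sketched below. The helper name `interval_to_secs` is invented for this example; it mirrors the `total_seconds()` conversion that `connect_async` performs in its source listing before handing the value to the native connection.

```python
from datetime import timedelta
from typing import Optional


def interval_to_secs(interval: Optional[timedelta]) -> Optional[float]:
    """Hypothetical helper mirroring the conversion done inside connect_async."""
    if interval is None:
        return None
    return interval.total_seconds()


# None: consistency is not checked (the default).
no_checks = interval_to_secs(None)

# Zero seconds: strong consistency, every read checks for updates.
strong = interval_to_secs(timedelta(0))

# Non-zero interval: eventual consistency, checked at most every 5 seconds.
eventual = interval_to_secs(timedelta(seconds=5))
```

Any of these `timedelta` values (or `None`) would be passed as the `read_consistency_interval` keyword argument.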

Examples:

>>> import lancedb
>>> async def doctest_example():
...     # For a local directory, provide a path to the database
...     db = await lancedb.connect_async("~/.lancedb")
...     # For object storage, use a URI prefix
...     db = await lancedb.connect_async("s3://my-bucket/lancedb",
...                                      storage_options={
...                                          "aws_access_key_id": "***"})
...     # Connect to LanceDB cloud
...     db = await lancedb.connect_async("db://my_database", api_key="ldb_...",
...                                      client_config={
...                                          "retry_config": {"retries": 5}})

Returns:

  • AsyncConnection –

    A connection to a LanceDB database.

Source code in lancedb/__init__.py
async def connect_async(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    client_config: Optional[Union[ClientConfig, Dict[str, Any]]] = None,
    storage_options: Optional[Dict[str, str]] = None,
) -> AsyncConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The uri of the database.
    api_key: str, optional
        If present, connect to LanceDB cloud.
        Otherwise, connect to a database on file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override url for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    client_config: ClientConfig or dict, optional
        Configuration options for the LanceDB Cloud HTTP client. If a dict, then
        the keys are the attributes of the ClientConfig class. If None, then the
        default configuration is used.
    storage_options: dict, optional
        Additional options for the storage backend. See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Examples
    --------

    >>> import lancedb
    >>> async def doctest_example():
    ...     # For a local directory, provide a path to the database
    ...     db = await lancedb.connect_async("~/.lancedb")
    ...     # For object storage, use a URI prefix
    ...     db = await lancedb.connect_async("s3://my-bucket/lancedb",
    ...                                      storage_options={
    ...                                          "aws_access_key_id": "***"})
    ...     # Connect to LanceDB cloud
    ...     db = await lancedb.connect_async("db://my_database", api_key="ldb_...",
    ...                                      client_config={
    ...                                          "retry_config": {"retries": 5}})

    Returns
    -------
    conn : AsyncConnection
        A connection to a LanceDB database.
    """
    if read_consistency_interval is not None:
        read_consistency_interval_secs = read_consistency_interval.total_seconds()
    else:
        read_consistency_interval_secs = None

    if isinstance(client_config, dict):
        client_config = ClientConfig(**client_config)

    return AsyncConnection(
        await lancedb_connect(
            sanitize_uri(uri),
            api_key,
            region,
            host_override,
            read_consistency_interval_secs,
            client_config,
            storage_options,
        )
    )

lancedb.db.AsyncConnection

Bases: object

An active LanceDB connection

To obtain a connection you can use the connect_async function.

This could be a native connection (using lance) or a remote connection (e.g. for connecting to LanceDB Cloud).

Local connections do not currently hold any open resources, but they may do so in the future (for example, for a shared cache or connections to catalog services). Remote connections represent an open connection to the remote server. The close method can be used to release any underlying resources eagerly. The connection can also be used as a context manager.

Connections can be shared across multiple threads and are expected to be long lived. In many cases a single connection can be used for the lifetime of the application, so using it as a context manager is often not needed. Closing a connection is optional. If it is not closed, it will be closed automatically when the connection object is deleted.
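
The context-manager behavior described here can be sketched with two small stand-in classes. `FakeInner` and `SketchConnection` are hypothetical names for this example; they only imitate the `__enter__`, `__exit__`, `is_open`, and `close` methods shown in the `AsyncConnection` source listing and do not touch a real database.

```python
class FakeInner:
    """Hypothetical stand-in for the native connection object."""

    def __init__(self):
        self._open = True

    def is_open(self) -> bool:
        return self._open

    def close(self) -> None:
        # Closing twice is harmless: the flag simply stays False.
        self._open = False


class SketchConnection:
    """Imitates the context-manager methods of AsyncConnection."""

    def __init__(self, inner: FakeInner):
        self._inner = inner

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()

    def is_open(self) -> bool:
        return self._inner.is_open()

    def close(self) -> None:
        self._inner.close()


with SketchConnection(FakeInner()) as conn:
    open_inside = conn.is_open()

# Leaving the `with` block calls close() through __exit__.
open_after = conn.is_open()

# Calling close() again is safe in this sketch.
conn.close()
```

The `with` statement is synchronous in this pattern, which matches the plain `__enter__` and `__exit__` methods in the listing.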

Examples:

>>> import lancedb
>>> async def doctest_example():
...   with await lancedb.connect_async("/tmp/my_dataset") as conn:
...     # do something with the connection
...     pass
...   # conn is closed here

Source code in lancedb/db.py
class AsyncConnection(object):
    """An active LanceDB connection

    To obtain a connection you can use the [connect_async][lancedb.connect_async]
    function.

    This could be a native connection (using lance) or a remote connection (e.g. for
    connecting to LanceDB Cloud).

    Local connections do not currently hold any open resources but they may do so in the
    future (for example, for shared cache or connections to catalog services). Remote
    connections represent an open connection to the remote server.  The
    [close][lancedb.db.AsyncConnection.close] method can be used to release any
    underlying resources eagerly.  The connection can also be used as a context manager.

    Connections can be shared on multiple threads and are expected to be long lived.
    Connections can also be used as a context manager, however, in many cases a single
    connection can be used for the lifetime of the application and so this is often
    not needed.  Closing a connection is optional.  If it is not closed then it will
    be automatically closed when the connection object is deleted.

    Examples
    --------

    >>> import lancedb
    >>> async def doctest_example():
    ...   with await lancedb.connect_async("/tmp/my_dataset") as conn:
    ...     # do something with the connection
    ...     pass
    ...   # conn is closed here
    """

    def __init__(self, connection: LanceDbConnection):
        self._inner = connection

    def __repr__(self):
        return self._inner.__repr__()

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()

    def is_open(self):
        """Return True if the connection is open."""
        return self._inner.is_open()

    def close(self):
        """Close the connection, releasing any underlying resources.

        It is safe to call this method multiple times.

        Any attempt to use the connection after it is closed will result in an error."""
        self._inner.close()

    @property
    def uri(self) -> str:
        return self._inner.uri

    async def table_names(
        self, *, start_after: Optional[str] = None, limit: Optional[int] = None
    ) -> Iterable[str]:
        """List all tables in this database, in sorted order

        Parameters
        ----------
        start_after: str, optional
            If present, only return names that come lexicographically after the supplied
            value.

            This can be combined with limit to implement pagination by setting this to
            the last table name from the previous page.
        limit: int, default 10
            The number of results to return.

        Returns
        -------
        Iterable of str
        """
        return await self._inner.table_names(start_after=start_after, limit=limit)

    async def create_table(
        self,
        name: str,
        data: Optional[DATA] = None,
        schema: Optional[Union[pa.Schema, LanceModel]] = None,
        mode: Optional[Literal["create", "overwrite"]] = None,
        exist_ok: Optional[bool] = None,
        on_bad_vectors: Optional[str] = None,
        fill_value: Optional[float] = None,
        storage_options: Optional[Dict[str, str]] = None,
        *,
        embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
    ) -> AsyncTable:
        """Create an [AsyncTable][lancedb.table.AsyncTable] in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: The data to initialize the table, *optional*
            User must provide at least one of `data` or `schema`.
            Acceptable types are:

            - list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        schema: The schema of the table, *optional*
            Acceptable types are:

            - pyarrow.Schema

            - [LanceModel][lancedb.pydantic.LanceModel]
        mode: Literal["create", "overwrite"]; default "create"
            The mode to use when creating the table.
            Can be either "create" or "overwrite".
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".
        exist_ok: bool, default False
            If a table by the same name already exists, then raise an exception
            if exist_ok=False. If exist_ok=True, then open the existing table;
            it will not add the provided data but will validate against any
            schema that's specified.
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contain NaNs.
            One of "error", "drop", "fill".
        fill_value: float
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            <https://lancedb.github.io/lancedb/guides/storage/>

        Returns
        -------
        AsyncTable
            A reference to the newly created table.

        !!! note

            The vector index won't be created by default.
            To create the index, call the `create_index` method on the table.

        Examples
        --------

        Can create with list of tuples or dictionaries:

        >>> import lancedb
        >>> async def doctest_example():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        ...     my_table = await db.create_table("my_table", data)
        ...     print(await my_table.query().limit(5).to_arrow())
        >>> import asyncio
        >>> asyncio.run(doctest_example())
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> async def pandas_example():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     my_table = await db.create_table("table2", data)
        ...     print(await my_table.query().limit(5).to_arrow())
        >>> asyncio.run(pandas_example())
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

        >>> import pyarrow as pa
        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> async def with_schema():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     my_table = await db.create_table("table3", data, schema = custom_schema)
        ...     print(await my_table.query().limit(5).to_arrow())
        >>> asyncio.run(with_schema())
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]


        It is also possible to create a table from `[Iterable[pa.RecordBatch]]`:


        >>> import pyarrow as pa
        >>> def make_batches():
        ...     for i in range(5):
        ...         yield pa.RecordBatch.from_arrays(
        ...             [
        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
        ...                     pa.list_(pa.float32(), 2)),
        ...                 pa.array(["foo", "bar"]),
        ...                 pa.array([10.0, 20.0]),
        ...             ],
        ...             ["vector", "item", "price"],
        ...         )
        >>> schema=pa.schema([
        ...     pa.field("vector", pa.list_(pa.float32(), 2)),
        ...     pa.field("item", pa.utf8()),
        ...     pa.field("price", pa.float32()),
        ... ])
        >>> async def iterable_example():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     await db.create_table("table4", make_batches(), schema=schema)
        >>> asyncio.run(iterable_example())
        """
        metadata = None

        if embedding_functions is not None:
            # If we passed in embedding functions explicitly
            # then we'll override any schema metadata that
            # may have been implicitly specified by the LanceModel schema
            registry = EmbeddingFunctionRegistry.get_instance()
            metadata = registry.get_table_metadata(embedding_functions)

        # Defining defaults here and not in function prototype.  In the future
        # these defaults will move into rust so better to keep them as None.
        if on_bad_vectors is None:
            on_bad_vectors = "error"

        if fill_value is None:
            fill_value = 0.0

        data, schema = sanitize_create_table(
            data, schema, metadata, on_bad_vectors, fill_value
        )
        validate_schema(schema)

        if exist_ok is None:
            exist_ok = False
        if mode is None:
            mode = "create"
        if mode == "create" and exist_ok:
            mode = "exist_ok"

        if data is None:
            new_table = await self._inner.create_empty_table(
                name,
                mode,
                schema,
                storage_options=storage_options,
            )
        else:
            data = data_to_reader(data, schema)
            new_table = await self._inner.create_table(
                name,
                mode,
                data,
                storage_options=storage_options,
            )

        return AsyncTable(new_table)

    async def open_table(
        self,
        name: str,
        storage_options: Optional[Dict[str, str]] = None,
        index_cache_size: Optional[int] = None,
    ) -> AsyncTable:
        """Open a Lance Table in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            <https://lancedb.github.io/lancedb/guides/storage/>
        index_cache_size: int, default 256
            Set the size of the index cache, specified as a number of entries

            The exact meaning of an "entry" will depend on the type of index:
            * IVF - there is one entry for each IVF partition
            * BTREE - there is one entry for the entire index

            This cache applies to the entire opened table, across all indices.
            Setting this value higher will increase performance on larger datasets
            at the expense of more RAM

        Returns
        -------
        An AsyncTable object representing the table.
        """
        table = await self._inner.open_table(name, storage_options, index_cache_size)
        return AsyncTable(table)

    async def rename_table(self, old_name: str, new_name: str):
        """Rename a table in the database.

        Parameters
        ----------
        old_name: str
            The current name of the table.
        new_name: str
            The new name of the table.
        """
        await self._inner.rename_table(old_name, new_name)

    async def drop_table(self, name: str, *, ignore_missing: bool = False):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        ignore_missing: bool, default False
            If True, ignore if the table does not exist.
        """
        try:
            await self._inner.drop_table(name)
        except ValueError as e:
            if not ignore_missing:
                raise e
            if f"Table '{name}' was not found" not in str(e):
                raise e

    async def drop_all_tables(self):
        """Drop all tables from the database."""
        await self._inner.drop_all_tables()

    @deprecation.deprecated(
        deprecated_in="0.15.1",
        removed_in="0.17",
        current_version=__version__,
        details="Use drop_all_tables() instead",
    )
    async def drop_database(self):
        """
        Drop database
        This is the same thing as dropping all the tables
        """
        await self._inner.drop_all_tables()

is_open

is_open()

Return True if the connection is open.

Source code in lancedb/db.py
def is_open(self):
    """Return True if the connection is open."""
    return self._inner.is_open()

close

close()

Close the connection, releasing any underlying resources.

It is safe to call this method multiple times.

Any attempt to use the connection after it is closed will result in an error.

Source code in lancedb/db.py
def close(self):
    """Close the connection, releasing any underlying resources.

    It is safe to call this method multiple times.

    Any attempt to use the connection after it is closed will result in an error."""
    self._inner.close()

table_names async

table_names(*, start_after: Optional[str] = None, limit: Optional[int] = None) -> Iterable[str]

List all tables in this database, in sorted order

Parameters:

  • start_after (Optional[str], default: None ) –

    If present, only return names that come lexicographically after the supplied value.

    This can be combined with limit to implement pagination by setting this to the last table name from the previous page.

  • limit (Optional[int], default: None ) –

    The number of results to return.

Returns:

  • Iterable of str –
Source code in lancedb/db.py
async def table_names(
    self, *, start_after: Optional[str] = None, limit: Optional[int] = None
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    start_after: str, optional
        If present, only return names that come lexicographically after the supplied
        value.

        This can be combined with limit to implement pagination by setting this to
        the last table name from the previous page.
    limit: int, default 10
        The number of results to return.

    Returns
    -------
    Iterable of str
    """
    return await self._inner.table_names(start_after=start_after, limit=limit)
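The pagination pattern described for start_after and limit can be sketched as follows. This is a minimal illustration, not part of LanceDB: `FakeConnection` is a hypothetical stand-in that only mimics the documented behavior (sorted names, names strictly after `start_after`, at most `limit` results), and `list_all_tables` is a helper name invented here. The helper only relies on an object exposing an async `table_names(start_after=..., limit=...)` method.

```python
import asyncio
from typing import List, Optional


class FakeConnection:
    """Hypothetical stand-in that mimics the documented table_names semantics."""

    def __init__(self, names):
        self._names = sorted(names)

    async def table_names(self, *, start_after: Optional[str] = None,
                          limit: Optional[int] = None):
        # Keep only names that sort strictly after the cursor, then cap the page.
        names = [n for n in self._names if start_after is None or n > start_after]
        return names if limit is None else names[:limit]


async def list_all_tables(db, page_size: int = 2) -> List[str]:
    """Collect every table name by requesting one page at a time."""
    collected: List[str] = []
    start_after = None
    while True:
        page = list(await db.table_names(start_after=start_after, limit=page_size))
        if not page:
            break
        collected.extend(page)
        # The last name of this page becomes the cursor for the next page.
        start_after = page[-1]
    return collected


all_names = asyncio.run(list_all_tables(FakeConnection(["b", "a", "d", "c", "e"])))
print(all_names)  # ['a', 'b', 'c', 'd', 'e']
```

The loop stops on the first empty page, so it issues one extra request after the final page.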

create_table async

create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[Schema, LanceModel]] = None, mode: Optional[Literal['create', 'overwrite']] = None, exist_ok: Optional[bool] = None, on_bad_vectors: Optional[str] = None, fill_value: Optional[float] = None, storage_options: Optional[Dict[str, str]] = None, *, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None) -> AsyncTable

Create an AsyncTable in the database.

Parameters:

  • name (str) –

    The name of the table.

  • data (Optional[DATA], default: None ) –

    User must provide at least one of data or schema. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • schema (Optional[Union[Schema, LanceModel]], default: None ) –

    Acceptable types are:

    • pyarrow.Schema

    • LanceModel

  • mode (Optional[Literal['create', 'overwrite']], default: None ) –

    The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".

  • exist_ok (Optional[bool], default: None ) –

    If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.

  • on_bad_vectors (Optional[str], default: None ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (Optional[float], default: None ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

Returns:

  • AsyncTable –

    A reference to the newly created table.

Note:

The vector index won't be created by default. To create the index, call the create_index method on the table.
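To make the on_bad_vectors and fill_value parameters concrete, the following sketch applies the three policies to a plain list of vectors. It is an illustration only and not LanceDB's implementation: the function name `handle_bad_vectors`, the explicit `dim` argument, and the choice to replace a bad vector entirely with `fill_value` are assumptions made for this example.

```python
import math
from typing import List, Sequence


def handle_bad_vectors(vectors: Sequence[Sequence[float]], dim: int,
                       on_bad_vectors: str = "error",
                       fill_value: float = 0.0) -> List[List[float]]:
    """Illustrative only: apply an "error" / "drop" / "fill" policy to vectors."""
    if on_bad_vectors not in ("error", "drop", "fill"):
        raise ValueError(f"unknown policy: {on_bad_vectors!r}")

    def is_bad(vec):
        # A vector is "bad" if it has the wrong length or contains a NaN.
        return len(vec) != dim or any(math.isnan(x) for x in vec)

    result = []
    for vec in vectors:
        if not is_bad(vec):
            result.append(list(vec))
        elif on_bad_vectors == "error":
            raise ValueError(f"bad vector: {vec!r}")
        elif on_bad_vectors == "fill":
            # Assumption: the whole vector is replaced by fill_value.
            result.append([fill_value] * dim)
        # "drop": skip the bad vector entirely.
    return result


vectors = [[1.1, 1.2], [0.2], [float("nan"), 1.8]]
dropped = handle_bad_vectors(vectors, dim=2, on_bad_vectors="drop")
filled = handle_bad_vectors(vectors, dim=2, on_bad_vectors="fill", fill_value=0.0)
print(dropped)  # [[1.1, 1.2]]
print(filled)   # [[1.1, 1.2], [0.0, 0.0], [0.0, 0.0]]
```

With the default policy, "error", the same input raises on the first bad vector.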

Examples:

Can create with list of tuples or dictionaries:

>>> import lancedb
>>> async def doctest_example():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
...     my_table = await db.create_table("my_table", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(doctest_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> async def pandas_example():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table2", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(pandas_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> async def with_schema():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table3", data, schema = custom_schema)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(with_schema())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create a table from [Iterable[pa.RecordBatch]]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema=pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> async def iterable_example():
...     db = await lancedb.connect_async("./.lancedb")
...     await db.create_table("table4", make_batches(), schema=schema)
>>> asyncio.run(iterable_example())
Source code in lancedb/db.py
async def create_table(
    self,
    name: str,
    data: Optional[DATA] = None,
    schema: Optional[Union[pa.Schema, LanceModel]] = None,
    mode: Optional[Literal["create", "overwrite"]] = None,
    exist_ok: Optional[bool] = None,
    on_bad_vectors: Optional[str] = None,
    fill_value: Optional[float] = None,
    storage_options: Optional[Dict[str, str]] = None,
    *,
    embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
) -> AsyncTable:
    """Create an [AsyncTable][lancedb.table.AsyncTable] in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    data: The data to initialize the table, *optional*
        User must provide at least one of `data` or `schema`.
        Acceptable types are:

        - list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    schema: The schema of the table, *optional*
        Acceptable types are:

        - pyarrow.Schema

        - [LanceModel][lancedb.pydantic.LanceModel]
    mode: Literal["create", "overwrite"]; default "create"
        The mode to use when creating the table.
        Can be either "create" or "overwrite".
        By default, if the table already exists, an exception is raised.
        If you want to overwrite the table, use mode="overwrite".
    exist_ok: bool, default False
        If a table by the same name already exists, then raise an exception
        if exist_ok=False. If exist_ok=True, then open the existing table;
        it will not add the provided data but will validate against any
        schema that's specified.
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contain NaNs.
        One of "error", "drop", "fill".
    fill_value: float
        The value to use when filling vectors. Only used if on_bad_vectors="fill".
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Returns
    -------
    AsyncTable
        A reference to the newly created table.

    !!! note

        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

    Examples
    --------

    Can create with list of tuples or dictionaries:

    >>> import lancedb
    >>> async def doctest_example():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    ...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
    ...     my_table = await db.create_table("my_table", data)
    ...     print(await my_table.query().limit(5).to_arrow())
    >>> import asyncio
    >>> asyncio.run(doctest_example())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    You can also pass a pandas DataFrame:

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
    ...    "lat": [45.5, 40.1],
    ...    "long": [-122.7, -74.1]
    ... })
    >>> async def pandas_example():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     my_table = await db.create_table("table2", data)
    ...     print(await my_table.query().limit(5).to_arrow())
    >>> asyncio.run(pandas_example())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    Data is converted to Arrow before being written to disk. For maximum
    control over how data is saved, either provide the PyArrow schema to
    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

    >>> import pyarrow as pa
    >>> custom_schema = pa.schema([
    ...   pa.field("vector", pa.list_(pa.float32(), 2)),
    ...   pa.field("lat", pa.float32()),
    ...   pa.field("long", pa.float32())
    ... ])
    >>> async def with_schema():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     my_table = await db.create_table("table3", data, schema = custom_schema)
    ...     print(await my_table.query().limit(5).to_arrow())
    >>> asyncio.run(with_schema())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: float
    long: float
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]


    It is also possible to create a table from `[Iterable[pa.RecordBatch]]`:


    >>> import pyarrow as pa
    >>> def make_batches():
    ...     for i in range(5):
    ...         yield pa.RecordBatch.from_arrays(
    ...             [
    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
    ...                     pa.list_(pa.float32(), 2)),
    ...                 pa.array(["foo", "bar"]),
    ...                 pa.array([10.0, 20.0]),
    ...             ],
    ...             ["vector", "item", "price"],
    ...         )
    >>> schema=pa.schema([
    ...     pa.field("vector", pa.list_(pa.float32(), 2)),
    ...     pa.field("item", pa.utf8()),
    ...     pa.field("price", pa.float32()),
    ... ])
    >>> async def iterable_example():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     await db.create_table("table4", make_batches(), schema=schema)
    >>> asyncio.run(iterable_example())
    """
    metadata = None

    if embedding_functions is not None:
        # If we passed in embedding functions explicitly
        # then we'll override any schema metadata that
        # may have been implicitly specified by the LanceModel schema
        registry = EmbeddingFunctionRegistry.get_instance()
        metadata = registry.get_table_metadata(embedding_functions)

    # Defining defaults here and not in function prototype.  In the future
    # these defaults will move into rust so better to keep them as None.
    if on_bad_vectors is None:
        on_bad_vectors = "error"

    if fill_value is None:
        fill_value = 0.0

    data, schema = sanitize_create_table(
        data, schema, metadata, on_bad_vectors, fill_value
    )
    validate_schema(schema)

    if exist_ok is None:
        exist_ok = False
    if mode is None:
        mode = "create"
    if mode == "create" and exist_ok:
        mode = "exist_ok"

    if data is None:
        new_table = await self._inner.create_empty_table(
            name,
            mode,
            schema,
            storage_options=storage_options,
        )
    else:
        data = data_to_reader(data, schema)
        new_table = await self._inner.create_table(
            name,
            mode,
            data,
            storage_options=storage_options,
        )

    return AsyncTable(new_table)
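The way mode and exist_ok combine in the source above can be isolated into a small function. This is a sketch that mirrors only the default handling shown in create_table; `resolve_mode` is a name introduced here for illustration and does not exist in LanceDB.

```python
from typing import Optional


def resolve_mode(mode: Optional[str] = None, exist_ok: Optional[bool] = None) -> str:
    """Mirror the default handling shown in the create_table source."""
    if exist_ok is None:
        exist_ok = False
    if mode is None:
        mode = "create"
    # exist_ok only changes the outcome when the mode is "create".
    if mode == "create" and exist_ok:
        mode = "exist_ok"
    return mode


print(resolve_mode())                            # create
print(resolve_mode(exist_ok=True))               # exist_ok
print(resolve_mode(mode="overwrite"))            # overwrite
print(resolve_mode("overwrite", exist_ok=True))  # overwrite
```

As the last call shows, exist_ok=True has no effect once mode="overwrite" is requested.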

open_table async

open_table(name: str, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> AsyncTable

Open a Lance Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

  • index_cache_size (Optional[int], default: None ) –

    Set the size of the index cache, specified as a number of entries.

    The exact meaning of an "entry" will depend on the type of index:

    • IVF - there is one entry for each IVF partition

    • BTREE - there is one entry for the entire index

    This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM.

Returns:

  • AsyncTable –

    An AsyncTable object representing the table.
Source code in lancedb/db.py
async def open_table(
    self,
    name: str,
    storage_options: Optional[Dict[str, str]] = None,
    index_cache_size: Optional[int] = None,
) -> AsyncTable:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM

    Returns
    -------
    An AsyncTable object representing the table.
    """
    table = await self._inner.open_table(name, storage_options, index_cache_size)
    return AsyncTable(table)

rename_table async

rename_table(old_name: str, new_name: str)

Rename a table in the database.

Parameters:

  • old_name (str) –

    The current name of the table.

  • new_name (str) –

    The new name of the table.

Source code in lancedb/db.py
async def rename_table(self, old_name: str, new_name: str):
    """Rename a table in the database.

    Parameters
    ----------
    old_name: str
        The current name of the table.
    new_name: str
        The new name of the table.
    """
    await self._inner.rename_table(old_name, new_name)

drop_table async

drop_table(name: str, *, ignore_missing: bool = False)

Drop a table from the database.

Parameters:

  • name (str) –

    The name of the table.

  • ignore_missing (bool, default: False ) –

    If True, ignore if the table does not exist.

Source code in lancedb/db.py
async def drop_table(self, name: str, *, ignore_missing: bool = False):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    ignore_missing: bool, default False
        If True, ignore if the table does not exist.
    """
    try:
        await self._inner.drop_table(name)
    except ValueError as e:
        if not ignore_missing:
            raise e
        if f"Table '{name}' was not found" not in str(e):
            raise e
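The effect of ignore_missing can be seen by running the same control flow against a stand-in. In this sketch, `FakeInner` is a hypothetical replacement for the native connection object, and the error text it raises is an assumption chosen to match the string that the source above checks for.

```python
import asyncio


class FakeInner:
    """Hypothetical stand-in for the native connection used by AsyncConnection."""

    def __init__(self, names):
        self.names = set(names)

    async def drop_table(self, name):
        if name not in self.names:
            # Assumed message, matching the substring checked in the source.
            raise ValueError(f"Table '{name}' was not found")
        self.names.remove(name)


async def drop_table(inner, name, *, ignore_missing=False):
    """Same control flow as the drop_table source."""
    try:
        await inner.drop_table(name)
    except ValueError as e:
        # Re-raise unless the caller opted in AND the error is "not found".
        if not ignore_missing or f"Table '{name}' was not found" not in str(e):
            raise


async def demo():
    inner = FakeInner(["my_table"])
    await drop_table(inner, "my_table")                       # removes the table
    await drop_table(inner, "my_table", ignore_missing=True)  # silently ignored
    try:
        await drop_table(inner, "my_table")                   # default: raises
    except ValueError as e:
        return str(e)


message = asyncio.run(demo())
print(message)  # Table 'my_table' was not found
```

Any ValueError that is not a "not found" error still propagates, even with ignore_missing=True.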

drop_all_tables async

drop_all_tables()

Drop all tables from the database.

Source code in lancedb/db.py
async def drop_all_tables(self):
    """Drop all tables from the database."""
    await self._inner.drop_all_tables()

drop_database async

drop_database()

Drop the database. This is the same as dropping all the tables. Deprecated in 0.15.1 and scheduled for removal in 0.17; use drop_all_tables() instead.

Source code in lancedb/db.py
@deprecation.deprecated(
    deprecated_in="0.15.1",
    removed_in="0.17",
    current_version=__version__,
    details="Use drop_all_tables() instead",
)
async def drop_database(self):
    """
    Drop database
    This is the same thing as dropping all the tables
    """
    await self._inner.drop_all_tables()

Tables (Asynchronous)

Tables hold your actual data as a collection of records (rows).

lancedb.table.AsyncTable

An AsyncTable is a collection of Records in a LanceDB Database.

An AsyncTable can be obtained from the AsyncConnection.create_table and AsyncConnection.open_table methods.

An AsyncTable object is expected to be long lived and reused for multiple operations. AsyncTable objects will cache a certain amount of index data in memory. This cache will be freed when the Table is garbage collected. To eagerly free the cache you can call the close method. Once the AsyncTable is closed, it cannot be used for any further operations.

An AsyncTable can also be used as a context manager, and will automatically close when the context is exited. Closing a table is optional. If you do not close the table, it will be closed when the AsyncTable object is garbage collected.

Examples:

Create using AsyncConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> async def create_a_table():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "b": 2}]
...     table = await db.create_table("my_table", data=data)
...     print(await table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(create_a_table())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with AsyncTable.add().

>>> async def add_to_table():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     await table.add([{"vector": [0.5, 1.3], "b": 4}])
>>> asyncio.run(add_to_table())

Can query the table with AsyncTable.vector_search.

>>> async def search_table_for_vector():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     results = (
...       await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
...     )
...     print(results)
>>> asyncio.run(search_table_for_vector())
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See AsyncTable.create_index.

Source code in lancedb/table.py
4061
4062
4063
4064
4065
4066
4067
4068
4069
4070
4071
4072
4073
4074
4075
4076
4077
4078
4079
4080
4081
4082
4083
4084
4085
4086
4087
4088
4089
4090
4091
4092
4093
4094
4095
4096
4097
4098
4099
4100
4101
4102
4103
4104
4105
4106
4107
4108
4109
4110
4111
4112
4113
4114
4115
4116
4117
4118
4119
4120
4121
4122
4123
4124
4125
4126
4127
4128
4129
4130
4131
4132
4133
4134
4135
4136
4137
4138
4139
4140
4141
4142
class AsyncTable:
    """
    An AsyncTable is a collection of Records in a LanceDB Database.

    An AsyncTable can be obtained from the
    [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and
    [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] methods.

    An AsyncTable object is expected to be long lived and reused for multiple
    operations. AsyncTable objects will cache a certain amount of index data in memory.
    This cache will be freed when the Table is garbage collected.  To eagerly free the
    cache you can call the [close][lancedb.AsyncTable.close] method.  Once the
    AsyncTable is closed, it cannot be used for any further operations.

    An AsyncTable can also be used as a context manager, and will automatically close
    when the context is exited.  Closing a table is optional.  If you do not close the
    table, it will be closed when the AsyncTable object is garbage collected.

    Examples
    --------

    Create using [AsyncConnection.create_table][lancedb.AsyncConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> async def create_a_table():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     data = [{"vector": [1.1, 1.2], "b": 2}]
    ...     table = await db.create_table("my_table", data=data)
    ...     print(await table.query().limit(5).to_arrow())
    >>> import asyncio
    >>> asyncio.run(create_a_table())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    You can append new data with [AsyncTable.add()][lancedb.table.AsyncTable.add].

    >>> async def add_to_table():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     table = await db.open_table("my_table")
    ...     await table.add([{"vector": [0.5, 1.3], "b": 4}])
    >>> asyncio.run(add_to_table())

    You can query the table with
    [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search].

    >>> async def search_table_for_vector():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     table = await db.open_table("my_table")
    ...     results = (
    ...       await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
    ...     )
    ...     print(results)
    >>> asyncio.run(search_table_for_vector())
       b      vector  _distance
    0  4  [0.5, 1.3]       0.82
    1  2  [1.1, 1.2]       1.13

    Search queries are much faster when an index is created. See
    [AsyncTable.create_index][lancedb.table.AsyncTable.create_index].
    """

    def __init__(self, table: LanceDBTable):
        """Create a new AsyncTable object.

        You should not create AsyncTable objects directly.

        Use [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and
        [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain
        AsyncTable objects."""
        self._inner = table

    def __repr__(self):
        return self._inner.__repr__()

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()

    def is_open(self) -> bool:
        """Return True if the table is open."""
        return self._inner.is_open()

    def close(self):
        """Close the table and free any resources associated with it.

        It is safe to call this method multiple times.

        Any attempt to use the table after it has been closed will raise an error."""
        return self._inner.close()

    @property
    def name(self) -> str:
        """The name of the table."""
        return self._inner.name()

    async def schema(self) -> pa.Schema:
        """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
        of this Table

        """
        return await self._inner.schema()

    async def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
        """
        Get the embedding functions for the table

        Returns
        -------
        funcs: Dict[str, EmbeddingFunctionConfig]
            A mapping of the vector column to the embedding function
            or empty dict if not configured.
        """
        schema = await self.schema()
        return EmbeddingFunctionRegistry.get_instance().parse_functions(schema.metadata)

    async def count_rows(self, filter: Optional[str] = None) -> int:
        """
        Count the number of rows in the table.

        Parameters
        ----------
        filter: str, optional
            A SQL where clause to filter the rows to count.
        """
        return await self._inner.count_rows(filter)

    async def head(self, n=5) -> pa.Table:
        """
        Return the first `n` rows of the table.

        Parameters
        ----------
        n: int, default 5
            The number of rows to return.
        """
        return await self.query().limit(n).to_arrow()

    def query(self) -> AsyncQuery:
        """
        Returns an [AsyncQuery][lancedb.query.AsyncQuery] that can be used
        to search the table.

        Use methods on the returned query to control query behavior.  The query
        can be executed with methods like [to_arrow][lancedb.query.AsyncQuery.to_arrow],
        [to_pandas][lancedb.query.AsyncQuery.to_pandas] and more.
        """
        return AsyncQuery(self._inner.query())

    async def to_pandas(self) -> "pd.DataFrame":
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return (await self.to_arrow()).to_pandas()

    async def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table
        """
        return await self.query().to_arrow()

    async def create_index(
        self,
        column: str,
        *,
        replace: Optional[bool] = None,
        config: Optional[
            Union[IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]
        ] = None,
        wait_timeout: Optional[timedelta] = None,
    ):
        """Create an index to speed up queries

        Indices can be created on vector columns or scalar columns.
        Indices on vector columns will speed up vector searches.
        Indices on scalar columns will speed up filtering (in both
        vector and non-vector searches)

        Parameters
        ----------
        column: str
            The column to index.
        replace: bool, default True
            Whether to replace the existing index

            If this is False and another index with the same name already exists
            on the same columns, then an error will be raised.  This is true even
            if that index is out of date.

            The default is True
        config: default None
            For advanced configuration you can specify the type of index you would
            like to create.   You can also specify index-specific parameters when
            creating an index object.
        wait_timeout: timedelta, optional
            The timeout to wait if indexing is asynchronous.
        """
        if config is not None:
            if not isinstance(
                config, (IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS)
            ):
                raise TypeError(
                    "config must be an instance of IvfFlat, IvfPq, HnswPq, HnswSq,"
                    " BTree, Bitmap, LabelList, or FTS"
                )
        try:
            await self._inner.create_index(
                column, index=config, replace=replace, wait_timeout=wait_timeout
            )
        except ValueError as e:
            if "not support the requested language" in str(e):
                supported_langs = ", ".join(lang_mapping.values())
                help_msg = f"Supported languages: {supported_langs}"
                add_note(e, help_msg)
            raise e

    async def drop_index(self, name: str) -> None:
        """
        Drop an index from the table.

        Parameters
        ----------
        name: str
            The name of the index to drop.

        Notes
        -----
        This does not delete the index from disk; it only removes it from the table.
        To delete the index, run [optimize][lancedb.table.AsyncTable.optimize]
        after dropping the index.

        Use [list_indices][lancedb.table.AsyncTable.list_indices] to find the names
        of the indices.
        """
        await self._inner.drop_index(name)

    async def prewarm_index(self, name: str) -> None:
        """
        Prewarm an index in the table.

        Parameters
        ----------
        name: str
            The name of the index to prewarm

        Notes
        -----
        This will load the index into memory.  This may reduce the cold-start time for
        future queries.  If the index does not fit in the cache then this call may be
        wasteful.
        """
        await self._inner.prewarm_index(name)

    async def wait_for_index(
        self, index_names: Iterable[str], timeout: timedelta = timedelta(seconds=300)
    ) -> None:
        """
        Wait for indexing to complete for the given index names.
        This will poll the table until all the indices are fully indexed,
        or raise a timeout exception if the timeout is reached.

        Parameters
        ----------
        index_names: Iterable[str]
            The names of the indices to poll.
        timeout: timedelta
            Timeout to wait for asynchronous indexing. The default is 5 minutes.
        """
        await self._inner.wait_for_index(index_names, timeout)

    async def stats(self) -> TableStatistics:
        """
        Retrieve table and fragment statistics.
        """
        return await self._inner.stats()

    async def add(
        self,
        data: DATA,
        *,
        mode: Optional[Literal["append", "overwrite"]] = "append",
        on_bad_vectors: Optional[OnBadVectorsType] = None,
        fill_value: Optional[float] = None,
    ) -> AddResult:
        """Add more data to the [Table](Table).

        Parameters
        ----------
        data: DATA
            The data to insert into the table. Acceptable types are:

            - list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contain NaNs.
            One of "error", "drop", "fill", "null".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        """
        schema = await self.schema()
        if on_bad_vectors is None:
            on_bad_vectors = "error"
        if fill_value is None:
            fill_value = 0.0
        data = _sanitize_data(
            data,
            schema,
            metadata=schema.metadata,
            on_bad_vectors=on_bad_vectors,
            fill_value=fill_value,
            allow_subschema=True,
        )
        if isinstance(data, pa.Table):
            data = data.to_reader()

        return await self._inner.add(data, mode or "append")

    def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
        """
        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
        that can be used to create a "merge insert" operation

        This operation can add rows, update rows, and remove rows all in a single
        transaction. It is a very generic tool that can be used to create
        behaviors like "insert if not exists", "update or insert (i.e. upsert)",
        or even replace a portion of existing data with new data (e.g. replace
        all data where month="january")

        The merge insert operation works by combining new data from a
        **source table** with existing data in a **target table** by using a
        join.  There are three categories of records.

        "Matched" records are records that exist in both the source table and
        the target table. "Not matched" records exist only in the source table
        (i.e. these are new data). "Not matched by source" records exist only
        in the target table (i.e. this is old data).

        The builder returned by this method can be used to customize what
        should happen for each category of data.

        Please note that the data may appear to be reordered as part of this
        operation.  This is because updated rows will be deleted from the
        dataset and then reinserted at the end with the new values.

        Parameters
        ----------

        on: Union[str, Iterable[str]]
            A column (or columns) to join on.  This is how records from the
            source table and target table are matched.  Typically this is some
            kind of key or id column.

        Examples
        --------
        >>> import lancedb
        >>> import pyarrow as pa
        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
        >>> # Perform an "upsert" operation
        >>> res = table.merge_insert("a")     \\
        ...      .when_matched_update_all()     \\
        ...      .when_not_matched_insert_all() \\
        ...      .execute(new_data)
        >>> res
        MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
        >>> # The order of new rows is non-deterministic since we use
        >>> # a hash-join as part of this operation and so we sort here
        >>> table.to_arrow().sort_by("a").to_pandas()
           a  b
        0  1  b
        1  2  x
        2  3  y
        3  4  z
        """  # noqa: E501
        on = [on] if isinstance(on, str) else list(iter(on))

        return LanceMergeInsertBuilder(self, on)

    @overload
    async def search(
        self,
        query: Optional[str] = None,
        vector_column_name: Optional[str] = None,
        query_type: Literal["auto"] = ...,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery]: ...

    @overload
    async def search(
        self,
        query: Optional[str] = None,
        vector_column_name: Optional[str] = None,
        query_type: Literal["hybrid"] = ...,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> AsyncHybridQuery: ...

    @overload
    async def search(
        self,
        query: Optional[Union[VEC, "PIL.Image.Image", Tuple]] = None,
        vector_column_name: Optional[str] = None,
        query_type: Literal["auto"] = ...,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> AsyncVectorQuery: ...

    @overload
    async def search(
        self,
        query: Optional[str] = None,
        vector_column_name: Optional[str] = None,
        query_type: Literal["fts"] = ...,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> AsyncFTSQuery: ...

    @overload
    async def search(
        self,
        query: Optional[
            Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]
        ] = None,
        vector_column_name: Optional[str] = None,
        query_type: Literal["vector"] = ...,
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> AsyncVectorQuery: ...

    async def search(
        self,
        query: Optional[
            Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]
        ] = None,
        vector_column_name: Optional[str] = None,
        query_type: QueryType = "auto",
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery]:
        """Create a search query to find the nearest neighbors
        of the given query vector. We currently support [vector search][search]
        and [full-text search][experimental-full-text-search].

        All query options are defined in [AsyncQuery][lancedb.query.AsyncQuery].

        Parameters
        ----------
        query: list/np.ndarray/str/PIL.Image.Image, default None
            The targeted vector to search for.

            - *default None*.
            Acceptable types are: list, np.ndarray, PIL.Image.Image

            - If None then the select/where/limit clauses are applied to filter
            the table
        vector_column_name: str, optional
            The name of the vector column to search.

            The vector column needs to be a pyarrow fixed size list type

            - If not specified then the vector column is inferred from
            the table schema

            - If the table has multiple vector columns then the *vector_column_name*
            needs to be specified. Otherwise, an error is raised.
        query_type: str
            *default "auto"*.
            Acceptable types are: "vector", "fts", "hybrid", or "auto"

            - If "auto" then the query type is inferred from the query:

                - If `query` is a list/np.ndarray then the query type is
                "vector";

                - If `query` is a PIL.Image.Image then either do vector search,
                or raise an error if no corresponding embedding function is found.

                - If `query` is a string, then the query type is "vector" if the
                table has embedding functions ("hybrid" if the embedded source
                column also has an FTS index), else the query type is "fts".

        Returns
        -------
        AsyncHybridQuery, AsyncFTSQuery, or AsyncVectorQuery
            A query builder object representing the query.
        """

        def is_embedding(query):
            return isinstance(query, (list, np.ndarray, pa.Array, pa.ChunkedArray))

        async def get_embedding_func(
            vector_column_name: Optional[str],
            query_type: QueryType,
            query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]],
        ) -> Tuple[str, EmbeddingFunctionConfig]:
            if isinstance(query, FullTextQuery):
                query_type = "fts"
            schema = await self.schema()
            vector_column_name = infer_vector_column_name(
                schema=schema,
                query_type=query_type,
                query=query,
                vector_column_name=vector_column_name,
            )
            funcs = EmbeddingFunctionRegistry.get_instance().parse_functions(
                schema.metadata
            )
            func = funcs.get(vector_column_name)
            if func is None:
                error = ValueError(
                    f"Column '{vector_column_name}' has no registered "
                    "embedding function."
                )
                if len(funcs) > 0:
                    add_note(
                        error,
                        "Embedding functions are registered for columns: "
                        f"{list(funcs.keys())}",
                    )
                else:
                    add_note(
                        error, "No embedding functions are registered for any columns."
                    )
                raise error
            return vector_column_name, func

        async def make_embedding(embedding, query):
            if embedding is not None:
                loop = asyncio.get_running_loop()
                # This function is likely to block, since it either calls an expensive
                # function or makes an HTTP request to an embeddings REST API.
                return (
                    await loop.run_in_executor(
                        None,
                        embedding.function.compute_query_embeddings_with_retry,
                        query,
                    )
                )[0]
            else:
                return None

        if query_type == "auto":
            # Infer the query type.
            if is_embedding(query):
                vector_query = query
                query_type = "vector"
            elif isinstance(query, FullTextQuery):
                query_type = "fts"
            elif isinstance(query, str):
                try:
                    (
                        indices,
                        (vector_column_name, embedding_conf),
                    ) = await asyncio.gather(
                        self.list_indices(),
                        get_embedding_func(vector_column_name, "auto", query),
                    )
                except ValueError as e:
                    if "Column" in str(
                        e
                    ) and "has no registered embedding function" in str(e):
                        # If the column has no registered embedding function,
                        # then it's an FTS query.
                        query_type = "fts"
                    else:
                        raise e
                else:
                    if embedding_conf is not None:
                        vector_query = await make_embedding(embedding_conf, query)
                        if any(
                            i.columns[0] == embedding_conf.source_column
                            and i.index_type == "FTS"
                            for i in indices
                        ):
                            query_type = "hybrid"
                        else:
                            query_type = "vector"
                    else:
                        query_type = "fts"
            else:
                # it's an image or something else embeddable.
                query_type = "vector"
        elif query_type == "vector":
            if is_embedding(query):
                vector_query = query
            else:
                vector_column_name, embedding_conf = await get_embedding_func(
                    vector_column_name, query_type, query
                )
                vector_query = await make_embedding(embedding_conf, query)
        elif query_type == "hybrid":
            if is_embedding(query):
                raise ValueError("Hybrid search requires a text query")
            else:
                vector_column_name, embedding_conf = await get_embedding_func(
                    vector_column_name, query_type, query
                )
                vector_query = await make_embedding(embedding_conf, query)

        if query_type == "vector":
            builder = self.query().nearest_to(vector_query)
            if vector_column_name:
                builder = builder.column(vector_column_name)
            return builder
        elif query_type == "fts":
            return self.query().nearest_to_text(query, columns=fts_columns)
        elif query_type == "hybrid":
            builder = self.query().nearest_to(vector_query)
            if vector_column_name:
                builder = builder.column(vector_column_name)
            return builder.nearest_to_text(query, columns=fts_columns)
        else:
            raise ValueError(f"Unknown query type: '{query_type}'")

    def vector_search(
        self,
        query_vector: Union[VEC, Tuple],
    ) -> AsyncVectorQuery:
        """
        Search the table with a given query vector.
        This is a convenience method for preparing a vector query and
        is the same thing as calling `nearest_to` on the builder returned
        by `query`.  See [nearest_to][lancedb.query.AsyncQuery.nearest_to] for more
        details.
        """
        return self.query().nearest_to(query_vector)

    def _sync_query_to_async(
        self, query: Query
    ) -> AsyncHybridQuery | AsyncFTSQuery | AsyncVectorQuery | AsyncQuery:
        async_query = self.query()
        if query.limit is not None:
            async_query = async_query.limit(query.limit)
        if query.offset is not None:
            async_query = async_query.offset(query.offset)
        if query.columns:
            async_query = async_query.select(query.columns)
        if query.filter:
            async_query = async_query.where(query.filter)
        if query.fast_search:
            async_query = async_query.fast_search()
        if query.with_row_id:
            async_query = async_query.with_row_id()

        if query.vector:
            async_query = async_query.nearest_to(query.vector).distance_range(
                query.lower_bound, query.upper_bound
            )
            if query.distance_type is not None:
                async_query = async_query.distance_type(query.distance_type)
            if query.nprobes is not None:
                async_query = async_query.nprobes(query.nprobes)
            if query.refine_factor is not None:
                async_query = async_query.refine_factor(query.refine_factor)
            if query.vector_column:
                async_query = async_query.column(query.vector_column)
            if query.ef:
                async_query = async_query.ef(query.ef)
            if query.bypass_vector_index:
                async_query = async_query.bypass_vector_index()

        if query.postfilter:
            async_query = async_query.postfilter()

        if query.full_text_query:
            async_query = async_query.nearest_to_text(
                query.full_text_query.query, query.full_text_query.columns
            )

        return async_query

    async def _execute_query(
        self,
        query: Query,
        *,
        batch_size: Optional[int] = None,
        timeout: Optional[timedelta] = None,
    ) -> pa.RecordBatchReader:
        # The sync table calls into this method, so we need to map the
        # query to the async version of the query and run that here. This is only
        # used for that code path right now.

        async_query = self._sync_query_to_async(query)

        return await async_query.to_batches(
            max_batch_length=batch_size, timeout=timeout
        )

    async def _explain_plan(self, query: Query, verbose: Optional[bool]) -> str:
        # This method is used by the sync table
        async_query = self._sync_query_to_async(query)
        return await async_query.explain_plan(verbose)

    async def _analyze_plan(self, query: Query) -> str:
        # This method is used by the sync table
        async_query = self._sync_query_to_async(query)
        return await async_query.analyze_plan()

    async def _do_merge(
        self,
        merge: LanceMergeInsertBuilder,
        new_data: DATA,
        on_bad_vectors: OnBadVectorsType,
        fill_value: float,
    ) -> MergeResult:
        schema = await self.schema()
        if on_bad_vectors is None:
            on_bad_vectors = "error"
        if fill_value is None:
            fill_value = 0.0
        data = _sanitize_data(
            new_data,
            schema,
            metadata=schema.metadata,
            on_bad_vectors=on_bad_vectors,
            fill_value=fill_value,
            allow_subschema=True,
        )
        if isinstance(data, pa.Table):
            data = pa.RecordBatchReader.from_batches(data.schema, data.to_batches())
        return await self._inner.execute_merge_insert(
            data,
            dict(
                on=merge._on,
                when_matched_update_all=merge._when_matched_update_all,
                when_matched_update_all_condition=merge._when_matched_update_all_condition,
                when_not_matched_insert_all=merge._when_not_matched_insert_all,
                when_not_matched_by_source_delete=merge._when_not_matched_by_source_delete,
                when_not_matched_by_source_condition=merge._when_not_matched_by_source_condition,
                timeout=merge._timeout,
            ),
        )

    async def delete(self, where: str) -> DeleteResult:
        """Delete rows from the table.

        This can be used to delete a single row, many rows, all rows, or
        sometimes no rows (if your predicate matches nothing).

        Parameters
        ----------
        where: str
            The SQL where clause to use when deleting rows.

            - For example, 'x = 2' or 'x IN (1, 2, 3)'.

            The filter must not be empty; an empty filter raises an error.

        Examples
        --------
        >>> import lancedb
        >>> data = [
        ...    {"x": 1, "vector": [1.0, 2]},
        ...    {"x": 2, "vector": [3.0, 4]},
        ...    {"x": 3, "vector": [5.0, 6]}
        ... ]
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.delete("x = 2")
        DeleteResult(version=2)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  3  [5.0, 6.0]

        If you have a list of values to delete, you can combine them into a
        stringified list and use the `IN` operator:

        >>> to_remove = [1, 5]
        >>> to_remove = ", ".join([str(v) for v in to_remove])
        >>> to_remove
        '1, 5'
        >>> table.delete(f"x IN ({to_remove})")
        DeleteResult(version=3)
        >>> table.to_pandas()
           x      vector
        0  3  [5.0, 6.0]
        """
        return await self._inner.delete(where)

    async def update(
        self,
        updates: Optional[Dict[str, Any]] = None,
        *,
        where: Optional[str] = None,
        updates_sql: Optional[Dict[str, str]] = None,
    ) -> UpdateResult:
        """
        This can be used to update zero to all rows in the table.

        If a filter is provided with `where` then only rows matching the
        filter will be updated.  Otherwise all rows will be updated.

        Parameters
        ----------
        updates: dict, optional
            The updates to apply.  The keys should be the name of the column to
            update.  The values should be the new values to assign.  This is
            required unless updates_sql is supplied.
        where: str, optional
            An SQL filter that controls which rows are updated. For example, 'x = 2'
            or 'x IN (1, 2, 3)'.  Only rows that satisfy this filter will be updated.
        updates_sql: dict, optional
            The updates to apply, expressed as SQL expression strings.  The keys should
            be column names. The values should be SQL expressions.  These can be SQL
            literals (e.g. "7" or "'foo'") or they can be expressions based on the
            previous value of the row (e.g. "x + 1" to increment the x column by 1)

        Returns
        -------
        UpdateResult
            An object containing:
            - rows_updated: The number of rows that were updated
            - version: The new version number of the table after the update

        Examples
        --------
        >>> import asyncio
        >>> import lancedb
        >>> import pandas as pd
        >>> async def demo_update():
        ...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     table = await db.create_table("my_table", data)
        ...     # x is [1, 2], vector is [[1, 2], [3, 4]]
        ...     await table.update({"vector": [10, 10]}, where="x = 2")
        ...     # x is [1, 2], vector is [[1, 2], [10, 10]]
        ...     await table.update(updates_sql={"x": "x + 1"})
        ...     # x is [2, 3], vector is [[1, 2], [10, 10]]
        >>> asyncio.run(demo_update())
        """
        if updates is not None and updates_sql is not None:
            raise ValueError("Only one of updates or updates_sql can be provided")
        if updates is None and updates_sql is None:
            raise ValueError("Either updates or updates_sql must be provided")

        if updates is not None:
            updates_sql = {k: value_to_sql(v) for k, v in updates.items()}

        return await self._inner.update(updates_sql, where)

    async def add_columns(
        self, transforms: dict[str, str] | pa.Field | List[pa.Field] | pa.Schema
    ) -> AddColumnsResult:
        """
        Add new columns with defined values.

        Parameters
        ----------
        transforms: Dict[str, str]
            A map of column name to a SQL expression to use to calculate the
            value of the new column. These expressions will be evaluated for
            each row in the table, and can reference existing columns.
            Alternatively, you can pass a pyarrow field or schema to add
            new columns with NULLs.

        Returns
        -------
        AddColumnsResult
            version: the new version number of the table after adding columns.

        """
        if isinstance(transforms, pa.Field):
            transforms = [transforms]
        if isinstance(transforms, list) and all(
            {isinstance(f, pa.Field) for f in transforms}
        ):
            transforms = pa.schema(transforms)
        if isinstance(transforms, pa.Schema):
            return await self._inner.add_columns_with_schema(transforms)
        else:
            return await self._inner.add_columns(list(transforms.items()))

    async def alter_columns(
        self, *alterations: Iterable[dict[str, Any]]
    ) -> AlterColumnsResult:
        """
        Alter column names, data types, and nullability.

        Parameters
        ----------
        alterations : Iterable[Dict[str, Any]]
            A sequence of dictionaries, each with the following keys:
            - "path": str
                The column path to alter. For a top-level column, this is the name.
                For a nested column, this is the dot-separated path, e.g. "a.b.c".
            - "rename": str, optional
                The new name of the column. If not specified, the column name is
                not changed.
            - "data_type": pyarrow.DataType, optional
                The new data type of the column. Existing values will be cast
                to this type. If not specified, the column data type is not changed.
            - "nullable": bool, optional
                Whether the column should be nullable. If not specified, the column
                nullability is not changed. Only non-nullable columns can be changed
                to nullable. Currently, you cannot change a nullable column to
                non-nullable.

        Returns
        -------
        AlterColumnsResult
            version: the new version number of the table after the alteration.
        """
        return await self._inner.alter_columns(alterations)

    async def drop_columns(self, columns: Iterable[str]):
        """
        Drop columns from the table.

        Parameters
        ----------
        columns : Iterable[str]
            The names of the columns to drop.
        """
        return await self._inner.drop_columns(columns)

    async def version(self) -> int:
        """
        Retrieve the version of the table

        LanceDB supports versioning.  Every operation that modifies the table increases
        the version.  As long as a version hasn't been deleted you can `[Self::checkout]`
        that version to view the data at that point.  In addition, you can
        `[Self::restore]` the version to replace the current table with a previous
        version.
        """
        return await self._inner.version()

    async def list_versions(self):
        """
        List all versions of the table
        """
        versions = await self._inner.list_versions()
        for v in versions:
            ts_nanos = v["timestamp"]
            v["timestamp"] = datetime.fromtimestamp(ts_nanos // 1e9) + timedelta(
                microseconds=(ts_nanos % 1e9) // 1e3
            )

        return versions

    async def checkout(self, version: int | str):
        """
        Checks out a specific version of the Table

        Any read operation on the table will now access the data at the checked out
        version. As a consequence, calling this method will disable any read consistency
        interval that was previously set.

        This is a read-only operation that turns the table into a sort of "view"
        or "detached head".  Other table instances will not be affected.  To make the
        change permanent you can use the `[Self::restore]` method.

        Any operation that modifies the table will fail while the table is in a checked
        out state.

        Parameters
        ----------
        version: int | str,
            The version to check out. A version number (`int`) or a tag
            (`str`) can be provided.

        To return the table to a normal state use `[Self::checkout_latest]`
        """
        try:
            await self._inner.checkout(version)
        except RuntimeError as e:
            if "not found" in str(e):
                raise ValueError(
                    f"Version {version} no longer exists. Was it cleaned up?"
                )
            else:
                raise

    async def checkout_latest(self):
        """
        Ensures the table is pointing at the latest version

        This can be used to manually update a table when the read_consistency_interval
        is None.
        It can also be used to undo a `[Self::checkout]` operation.
        """
        await self._inner.checkout_latest()

    async def restore(self, version: Optional[int | str] = None):
        """
        Restore the table to the currently checked out version

        This operation will fail if checkout has not been called previously

        This operation will overwrite the latest version of the table with a
        previous version.  Any changes made since the checked out version will
        no longer be visible.

        Once the operation concludes the table will no longer be in a checked
        out state and the read_consistency_interval, if any, will apply.
        """
        await self._inner.restore(version)

    @property
    def tags(self) -> AsyncTags:
        """Tag management for the dataset.

        Similar to Git, tags are a way to add metadata to a specific version of the
        dataset.

        .. warning::

            Tagged versions are exempted from the
            :py:meth:`optimize(cleanup_older_than)` process.

            To remove a version that has been tagged, you must first
            :py:meth:`~Tags.delete` the associated tag.

        """
        return AsyncTags(self._inner)

    async def optimize(
        self,
        *,
        cleanup_older_than: Optional[timedelta] = None,
        delete_unverified: bool = False,
        retrain=False,
    ) -> OptimizeStats:
        """
        Optimize the on-disk data and indices for better performance.

        Modeled after ``VACUUM`` in PostgreSQL.

        Optimization covers three operations:

         * Compaction: Merges small files into larger ones
         * Prune: Removes old versions of the dataset
         * Index: Optimizes the indices, adding new data to existing indices

        Parameters
        ----------
        cleanup_older_than: timedelta, optional default 7 days
            All files belonging to versions older than this will be removed.  Set
            to 0 days to remove all versions except the latest.  The latest version
            is never removed.
        delete_unverified: bool, default False
            Files leftover from a failed transaction may appear to be part of an
            in-progress operation (e.g. appending new data) and these files will not
            be deleted unless they are at least 7 days old. If delete_unverified is True
            then these files will be deleted regardless of their age.
        retrain: bool, default False
            If True, retrain the vector indices.  This refines the IVF clustering
            and quantization, which may improve search accuracy.  It is faster than
            re-creating the index from scratch, so it is worth trying first when
            the data distribution has changed significantly.

        Experimental API
        ----------------

        The optimization process is undergoing active development and may change.
        Our goal with these changes is to improve the performance of optimization and
        reduce the complexity.

        That being said, it is essential today to run optimize if you want the best
        performance.  It should be stable and safe to use in production, but it is our
        hope that the API may be simplified (or not even need to be called) in the
        future.

        How often an application should call optimize depends on the frequency of
        data modifications.  If data is frequently added, deleted, or updated then
        optimize should be run frequently.  A good rule of thumb is to run optimize if
        you have added or modified 100,000 or more records or run more than 20 data
        modification operations.
        """
        cleanup_since_ms: Optional[int] = None
        if cleanup_older_than is not None:
            cleanup_since_ms = round(cleanup_older_than.total_seconds() * 1000)
        return await self._inner.optimize(
            cleanup_since_ms=cleanup_since_ms,
            delete_unverified=delete_unverified,
            retrain=retrain,
        )

    async def list_indices(self) -> Iterable[IndexConfig]:
        """
        List all indices that have been created with Self::create_index
        """
        return await self._inner.list_indices()

    async def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
        """
        Retrieve statistics about an index

        Parameters
        ----------
        index_name: str
            The name of the index to retrieve statistics for

        Returns
        -------
        IndexStatistics or None
            The statistics about the index. Returns None if the index does not exist.
        """
        stats = await self._inner.index_stats(index_name)
        if stats is None:
            return None
        else:
            return IndexStatistics(**stats)

    async def uses_v2_manifest_paths(self) -> bool:
        """
        Check if the table is using the new v2 manifest paths.

        Returns
        -------
        bool
            True if the table is using the new v2 manifest paths, False otherwise.
        """
        return await self._inner.uses_v2_manifest_paths()

    async def migrate_manifest_paths_v2(self):
        """
        Migrate the manifest paths to the new format.

        This will update the manifest to use the new v2 format for paths.

        This function is idempotent, and can be run multiple times without
        changing the state of the object store.

        !!! danger

            This should not be run while other concurrent operations are happening.
            And it should also run until completion before resuming other operations.

        You can use
        [AsyncTable.uses_v2_manifest_paths][lancedb.table.AsyncTable.uses_v2_manifest_paths]
        to check if the table is already using the new path style.
        """
        await self._inner.migrate_manifest_paths_v2()

    async def replace_field_metadata(
        self, field_name: str, new_metadata: dict[str, str]
    ):
        """
        Replace the metadata of a field in the schema

        Parameters
        ----------
        field_name: str
            The name of the field to replace the metadata for
        new_metadata: dict
            The new metadata to set
        """
        await self._inner.replace_field_metadata(field_name, new_metadata)
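The checkout and checkout_latest methods in the listing above can be combined into a simple "time-travel" read. The following is a minimal sketch, not part of LanceDB: `read_at_version` is a hypothetical helper, and `table` is assumed to be an AsyncTable (any object exposing the same async methods would work).

```python
from typing import Any, Union


async def read_at_version(table: Any, version: Union[int, str]) -> Any:
    """Read the whole table as of `version`, then return to the latest version.

    Hypothetical helper; `table` is assumed to be an AsyncTable.
    """
    # Pin reads to the requested version (a version number or a tag name).
    await table.checkout(version)
    try:
        return await table.to_arrow()
    finally:
        # Always leave the checked-out state, because modifying operations
        # fail while a table is checked out.
        await table.checkout_latest()
```

It would be called as `snapshot = await read_at_version(table, 3)`. The `finally` clause is a design choice: it returns the table to its normal state even if the read raises.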

name property

name: str

The name of the table.

tags property

tags: AsyncTags

Tag management for the dataset.

Similar to Git, tags are a way to add metadata to a specific version of the dataset.

Warning

Tagged versions are exempted from the optimize(cleanup_older_than) process.

To remove a version that has been tagged, you must first delete the associated tag with Tags.delete.

__init__

__init__(table: Table)

Create a new AsyncTable object.

You should not create AsyncTable objects directly.

Use AsyncConnection.create_table and AsyncConnection.open_table to obtain Table objects.

Source code in lancedb/table.py
def __init__(self, table: LanceDBTable):
    """Create a new AsyncTable object.

    You should not create AsyncTable objects directly.

    Use [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and
    [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain
    Table objects."""
    self._inner = table

is_open

is_open() -> bool

Return True if the table is open.

Source code in lancedb/table.py
def is_open(self) -> bool:
    """Return True if the table is open."""
    return self._inner.is_open()

close

close()

Close the table and free any resources associated with it.

It is safe to call this method multiple times.

Any attempt to use the table after it has been closed will raise an error.

Source code in lancedb/table.py
def close(self):
    """Close the table and free any resources associated with it.

    It is safe to call this method multiple times.

    Any attempt to use the table after it has been closed will raise an error."""
    return self._inner.close()
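Because close is a plain (non-async) method that is safe to call repeatedly, it fits naturally into a context manager. The sketch below is illustrative only: `open_table_scoped` is a hypothetical helper, and `db` is assumed to be an AsyncConnection whose `open_table` coroutine returns an AsyncTable.

```python
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator


@asynccontextmanager
async def open_table_scoped(db: Any, name: str) -> AsyncIterator[Any]:
    """Open a table and close it when the `async with` block exits.

    Hypothetical helper; `db` is assumed to be an AsyncConnection.
    """
    table = await db.open_table(name)
    try:
        yield table
    finally:
        # close() is synchronous, so it is called without `await`.
        table.close()
```

Inside `async with open_table_scoped(db, "my_table") as table:` the table is used as usual; once the block exits, `table.is_open()` should return False and further use of the table raises an error.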

schema async

schema() -> Schema

The Arrow Schema of this Table

Source code in lancedb/table.py
async def schema(self) -> pa.Schema:
    """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
    of this Table

    """
    return await self._inner.schema()

embedding_functions async

embedding_functions() -> Dict[str, EmbeddingFunctionConfig]

Get the embedding functions for the table

Returns:

  • funcs ( Dict[str, EmbeddingFunctionConfig] ) –

    A mapping of the vector column to the embedding function or empty dict if not configured.

Source code in lancedb/table.py
async def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
    """
    Get the embedding functions for the table

    Returns
    -------
    funcs: Dict[str, EmbeddingFunctionConfig]
        A mapping of the vector column to the embedding function
        or empty dict if not configured.
    """
    schema = await self.schema()
    return EmbeddingFunctionRegistry.get_instance().parse_functions(schema.metadata)

count_rows async

count_rows(filter: Optional[str] = None) -> int

Count the number of rows in the table.

Parameters:

  • filter (Optional[str], default: None ) –

    A SQL where clause to filter the rows to count.

Source code in lancedb/table.py
async def count_rows(self, filter: Optional[str] = None) -> int:
    """
    Count the number of rows in the table.

    Parameters
    ----------
    filter: str, optional
        A SQL where clause to filter the rows to count.
    """
    return await self._inner.count_rows(filter)
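As a rough illustration of the filter parameter, the sketch below compares a filtered count with the total count. `match_fraction` is a hypothetical helper, and `table` is assumed to be an AsyncTable.

```python
from typing import Any


async def match_fraction(table: Any, predicate: str) -> float:
    """Return the fraction of rows satisfying `predicate`.

    Hypothetical helper; returns 0.0 for an empty table.
    """
    total = await table.count_rows()  # no filter: count every row
    if total == 0:
        return 0.0
    matching = await table.count_rows(predicate)  # SQL where clause, e.g. "x = 2"
    return matching / total
```

For example, `await match_fraction(table, "x = 2")` issues two count queries and divides the results.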

head async

head(n=5) -> Table

Return the first n rows of the table.

Parameters:

  • n –

    The number of rows to return.

Source code in lancedb/table.py
async def head(self, n=5) -> pa.Table:
    """
    Return the first `n` rows of the table.

    Parameters
    ----------
    n: int, default 5
        The number of rows to return.
    """
    return await self.query().limit(n).to_arrow()

query

query() -> AsyncQuery

Returns an AsyncQuery that can be used to search the table.

Use methods on the returned query to control query behavior. The query can be executed with methods like to_arrow, to_pandas and more.

Source code in lancedb/table.py
def query(self) -> AsyncQuery:
    """
    Returns an [AsyncQuery][lancedb.query.AsyncQuery] that can be used
    to search the table.

    Use methods on the returned query to control query behavior.  The query
    can be executed with methods like [to_arrow][lancedb.query.AsyncQuery.to_arrow],
    [to_pandas][lancedb.query.AsyncQuery.to_pandas] and more.
    """
    return AsyncQuery(self._inner.query())
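A short sketch of the builder pattern described above. `limit` and `to_arrow` appear in this reference; `where` is assumed to be available on AsyncQuery as well. `first_matching` is a hypothetical helper, and `table` is assumed to be an AsyncTable.

```python
from typing import Any


async def first_matching(table: Any, predicate: str, n: int = 10) -> Any:
    """Return up to `n` rows that satisfy `predicate` as a pyarrow Table.

    Hypothetical helper; assumes AsyncQuery exposes `where`.
    """
    # query() itself is synchronous; only the final execution step is awaited.
    return await table.query().where(predicate).limit(n).to_arrow()
```

Note that `query()` is not awaited: it only builds the query object, and nothing is executed until `to_arrow` (or a similar method) is awaited.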

to_pandas async

to_pandas() -> 'pd.DataFrame'

Return the table as a pandas DataFrame.

Returns:

  • DataFrame –

Source code in lancedb/table.py
async def to_pandas(self) -> "pd.DataFrame":
    """Return the table as a pandas DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    return (await self.to_arrow()).to_pandas()

to_arrow async

to_arrow() -> Table

Return the table as a pyarrow Table.

Returns:

  • Table –

Source code in lancedb/table.py
async def to_arrow(self) -> pa.Table:
    """Return the table as a pyarrow Table.

    Returns
    -------
    pa.Table
    """
    return await self.query().to_arrow()

create_index async

create_index(column: str, *, replace: Optional[bool] = None, config: Optional[Union[IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]] = None, wait_timeout: Optional[timedelta] = None)

Create an index to speed up queries

Indices can be created on vector columns or scalar columns. Indices on vector columns will speed up vector searches. Indices on scalar columns will speed up filtering (in both vector and non-vector searches)

Parameters:

  • column (str) –

    The column to index.

  • replace (Optional[bool], default: None ) –

    Whether to replace the existing index

    If this is false and another index with the same name already exists on the same columns, an error will be returned. This is true even if that index is out of date.

    The default is True

  • config (Optional[Union[IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]], default: None ) –

    For advanced configuration you can specify the type of index you would like to create. You can also specify index-specific parameters when creating an index object.

  • wait_timeout (Optional[timedelta], default: None ) –

    The timeout to wait if indexing is asynchronous.

Source code in lancedb/table.py
async def create_index(
    self,
    column: str,
    *,
    replace: Optional[bool] = None,
    config: Optional[
        Union[IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]
    ] = None,
    wait_timeout: Optional[timedelta] = None,
):
    """Create an index to speed up queries

    Indices can be created on vector columns or scalar columns.
    Indices on vector columns will speed up vector searches.
    Indices on scalar columns will speed up filtering (in both
    vector and non-vector searches)

    Parameters
    ----------
    column: str
        The column to index.
    replace: bool, default True
        Whether to replace the existing index

        If this is false and another index with the same name already exists on
        the same columns, an error will be returned.  This is true even if
        that index is out of date.

        The default is True
    config: default None
        For advanced configuration you can specify the type of index you would
        like to create.   You can also specify index-specific parameters when
        creating an index object.
    wait_timeout: timedelta, optional
        The timeout to wait if indexing is asynchronous.
    """
    if config is not None:
        if not isinstance(
            config, (IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS)
        ):
            raise TypeError(
                "config must be an instance of IvfFlat, IvfPq, HnswPq, HnswSq,"
                " BTree, Bitmap, LabelList, or FTS"
            )
    try:
        await self._inner.create_index(
            column, index=config, replace=replace, wait_timeout=wait_timeout
        )
    except ValueError as e:
        if "not support the requested language" in str(e):
            supported_langs = ", ".join(lang_mapping.values())
            help_msg = f"Supported languages: {supported_langs}"
            add_note(e, help_msg)
        raise e
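One possible way to avoid rebuilding an index that already exists is to consult list_indices first. The sketch below is an assumption-laden illustration: `ensure_index` is a hypothetical helper, `table` is assumed to be an AsyncTable, and each entry returned by list_indices is assumed to expose a `columns` attribute.

```python
from typing import Any, Optional


async def ensure_index(table: Any, column: str, config: Optional[Any] = None) -> bool:
    """Create an index on `column` unless one already covers exactly that column.

    Hypothetical helper. Returns True if an index was created, False otherwise.
    """
    existing = await table.list_indices()
    # Assumption: each index description has a `columns` attribute.
    if any(list(idx.columns) == [column] for idx in existing):
        return False
    # replace=False makes a name clash surface as an error instead of
    # silently replacing an index created concurrently.
    await table.create_index(column, config=config, replace=False)
    return True
```

Passing `config=None` leaves the choice of index type to the library, as described for the config parameter above.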

drop_index async

drop_index(name: str) -> None

Drop an index from the table.

Parameters:

  • name (str) –

    The name of the index to drop.

Notes

This does not delete the index from disk, it just removes it from the table. To delete the index, run optimize after dropping the index.

Use list_indices to find the names of the indices.

Source code in lancedb/table.py
async def drop_index(self, name: str) -> None:
    """
    Drop an index from the table.

    Parameters
    ----------
    name: str
        The name of the index to drop.

    Notes
    -----
    This does not delete the index from disk, it just removes it from the table.
    To delete the index, run [optimize][lancedb.table.AsyncTable.optimize]
    after dropping the index.

    Use [list_indices][lancedb.table.AsyncTable.list_indices] to find the names
    of the indices.
    """
    await self._inner.drop_index(name)
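The Notes above suggest a two-step cleanup: drop the index, then run optimize. A minimal sketch of that flow follows. `drop_indices_on` is a hypothetical helper, `table` is assumed to be an AsyncTable, and each entry from list_indices is assumed to expose `name` and `columns` attributes.

```python
from typing import Any, List


async def drop_indices_on(table: Any, column: str) -> List[str]:
    """Drop every index that covers `column` and return the dropped names.

    Hypothetical helper; assumes index descriptions have `name` and `columns`.
    """
    dropped: List[str] = []
    for idx in await table.list_indices():
        if column in idx.columns:
            await table.drop_index(idx.name)
            dropped.append(idx.name)
    if dropped:
        # Dropping only detaches the index from the table; optimize is what
        # allows the files to be removed from disk.
        await table.optimize()
    return dropped
```

With default arguments, optimize may keep files that still belong to recent versions, so the space may not be reclaimed immediately.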

prewarm_index async

prewarm_index(name: str) -> None

Prewarm an index in the table.

Parameters:

  • name (str) –

    The name of the index to prewarm

Notes

This will load the index into memory. This may reduce the cold-start time for future queries. If the index does not fit in the cache then this call may be wasteful.

Source code in lancedb/table.py
async def prewarm_index(self, name: str) -> None:
    """
    Prewarm an index in the table.

    Parameters
    ----------
    name: str
        The name of the index to prewarm

    Notes
    -----
    This will load the index into memory.  This may reduce the cold-start time for
    future queries.  If the index does not fit in the cache then this call may be
    wasteful.
    """
    await self._inner.prewarm_index(name)

wait_for_index async

wait_for_index(index_names: Iterable[str], timeout: timedelta = timedelta(seconds=300)) -> None

Wait for indexing to complete for the given index names. This will poll the table until all the indices are fully indexed, or raise a timeout exception if the timeout is reached.

Parameters:

  • index_names (Iterable[str]) –

    The names of the indices to poll.

  • timeout (timedelta, default: timedelta(seconds=300) ) –

    Timeout to wait for asynchronous indexing. The default is 5 minutes.

Source code in lancedb/table.py
async def wait_for_index(
    self, index_names: Iterable[str], timeout: timedelta = timedelta(seconds=300)
) -> None:
    """
    Wait for indexing to complete for the given index names.
    This will poll the table until all the indices are fully indexed,
    or raise a timeout exception if the timeout is reached.

    Parameters
    ----------
    index_names: Iterable[str]
        The names of the indices to poll
    timeout: timedelta
        Timeout to wait for asynchronous indexing. The default is 5 minutes.
    """
    await self._inner.wait_for_index(index_names, timeout)
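When indexing happens asynchronously, creating an index and waiting for it can be wrapped into one step. The sketch below is illustrative: `create_index_and_wait` is a hypothetical helper, `table` is assumed to be an AsyncTable, and the caller supplies `index_name` because the naming scheme for new indices is not specified in this section.

```python
from datetime import timedelta
from typing import Any


async def create_index_and_wait(
    table: Any,
    column: str,
    index_name: str,
    timeout: timedelta = timedelta(minutes=5),
) -> None:
    """Create an index on `column`, then block until `index_name` is ready.

    Hypothetical helper; `index_name` must match the name the index receives.
    """
    await table.create_index(column)
    # Polls until the index is fully built, or raises once `timeout` elapses.
    await table.wait_for_index([index_name], timeout)
```

The `wait_timeout` parameter of create_index is an alternative when a single call is preferred.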

stats async

stats() -> TableStatistics

Retrieve table and fragment statistics.

Source code in lancedb/table.py
async def stats(self) -> TableStatistics:
    """
    Retrieve table and fragment statistics.
    """
    return await self._inner.stats()

add async

add(data: DATA, *, mode: Optional[Literal['append', 'overwrite']] = 'append', on_bad_vectors: Optional[OnBadVectorsType] = None, fill_value: Optional[float] = None) -> AddResult

Add more data to the Table.

Parameters:

  • data (DATA) –

    The data to insert into the table. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • mode (Optional[Literal['append', 'overwrite']], default: 'append' ) –

    The mode to use when writing the data. Valid values are "append" and "overwrite".

  • on_bad_vectors (Optional[OnBadVectorsType], default: None ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill", "null".

  • fill_value (Optional[float], default: None ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

Source code in lancedb/table.py
async def add(
    self,
    data: DATA,
    *,
    mode: Optional[Literal["append", "overwrite"]] = "append",
    on_bad_vectors: Optional[OnBadVectorsType] = None,
    fill_value: Optional[float] = None,
) -> AddResult:
    """Add more data to the [Table](Table).

    Parameters
    ----------
    data: DATA
        The data to insert into the table. Acceptable types are:

        - list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    mode: str
        The mode to use when writing the data. Valid values are
        "append" and "overwrite".
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contain NaNs.
        One of "error", "drop", "fill", "null".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    """
    schema = await self.schema()
    if on_bad_vectors is None:
        on_bad_vectors = "error"
    if fill_value is None:
        fill_value = 0.0
    data = _sanitize_data(
        data,
        schema,
        metadata=schema.metadata,
        on_bad_vectors=on_bad_vectors,
        fill_value=fill_value,
        allow_subschema=True,
    )
    if isinstance(data, pa.Table):
        data = data.to_reader()

    return await self._inner.add(data, mode or "append")
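Repeated appends are a common reason to call optimize. The sketch below appends batches and applies the rule of thumb from the optimize documentation (100,000 or more modified records, or more than 20 modification operations). `add_batches` and the two threshold constants are hypothetical names, `table` is assumed to be an AsyncTable, and each batch is assumed to be a list of dicts.

```python
from typing import Any, Dict, Iterable, List

ROW_THRESHOLD = 100_000  # rows modified since the last optimize
OP_THRESHOLD = 20        # modification operations since the last optimize


async def add_batches(table: Any, batches: Iterable[List[Dict[str, Any]]]) -> int:
    """Append each batch and optimize when a threshold is crossed.

    Hypothetical helper. Returns how many times optimize was called.
    """
    rows_since = 0
    ops_since = 0
    optimizations = 0
    for batch in batches:
        await table.add(batch, mode="append")
        rows_since += len(batch)
        ops_since += 1
        if rows_since >= ROW_THRESHOLD or ops_since > OP_THRESHOLD:
            await table.optimize()
            rows_since = 0
            ops_since = 0
            optimizations += 1
    return optimizations
```

The thresholds are starting points rather than fixed limits; workloads with very small or very large batches may warrant different values.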

merge_insert

merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january")

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (i.e. they are new data). "Not matched by source" records exist only in the target table (i.e. they are old data).

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

  • on (Union[str, Iterable[str]]) –

    A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.

Examples:

>>> import lancedb
>>> import pyarrow as pa
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation
>>> res = table.merge_insert("a")     \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
Source code in lancedb/table.py
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
    """
    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
    that can be used to create a "merge insert" operation

    This operation can add rows, update rows, and remove rows all in a single
    transaction. It is a very generic tool that can be used to create
    behaviors like "insert if not exists", "update or insert (i.e. upsert)",
    or even replace a portion of existing data with new data (e.g. replace
    all data where month="january")

    The merge insert operation works by combining new data from a
    **source table** with existing data in a **target table** by using a
    join.  There are three categories of records.

    "Matched" records are records that exist in both the source table and
    the target table. "Not matched" records exist only in the source table
    (i.e. they are new data). "Not matched by source" records exist only
    in the target table (i.e. they are old data).

    The builder returned by this method can be used to customize what
    should happen for each category of data.

    Please note that the data may appear to be reordered as part of this
    operation.  This is because updated rows will be deleted from the
    dataset and then reinserted at the end with the new values.

    Parameters
    ----------

    on: Union[str, Iterable[str]]
        A column (or columns) to join on.  This is how records from the
        source table and target table are matched.  Typically this is some
        kind of key or id column.

    Examples
    --------
    >>> import lancedb
    >>> import pyarrow as pa
    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
    >>> # Perform an "upsert" operation
    >>> res = table.merge_insert("a")     \\
    ...      .when_matched_update_all()     \\
    ...      .when_not_matched_insert_all() \\
    ...      .execute(new_data)
    >>> res
    MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0)
    >>> # The order of new rows is non-deterministic since we use
    >>> # a hash-join as part of this operation and so we sort here
    >>> table.to_arrow().sort_by("a").to_pandas()
       a  b
    0  1  b
    1  2  x
    2  3  y
    3  4  z
    """  # noqa: E501
    on = [on] if isinstance(on, str) else list(iter(on))

    return LanceMergeInsertBuilder(self, on)
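
The three record categories described in the docstring can be illustrated with plain sets. This is only a sketch of the classification, not how LanceDB performs the join; `classify_merge_rows` is a hypothetical helper, and the keys are the values of column `a` from the example above.

```python
def classify_merge_rows(target_keys, source_keys):
    """Split join keys into the three merge-insert categories."""
    target, source = set(target_keys), set(source_keys)
    return {
        # Present in both tables: candidates for when_matched_update_all()
        "matched": sorted(target & source),
        # Present only in the source: candidates for when_not_matched_insert_all()
        "not_matched": sorted(source - target),
        # Present only in the target: old data, left untouched by an upsert
        "not_matched_by_source": sorted(target - source),
    }


# Keys of column "a": the existing table holds [2, 1, 3], the new data holds [2, 3, 4].
categories = classify_merge_rows(target_keys=[2, 1, 3], source_keys=[2, 3, 4])
print(categories)
```

With these inputs, two keys are matched and one is new, which lines up with `num_updated_rows=2` and `num_inserted_rows=1` in the docstring example.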

search async

search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple, FullTextQuery]] = None, vector_column_name: Optional[str] = None, query_type: QueryType = 'auto', ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None) -> Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery]

Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and full-text search.

All query options are defined in AsyncQuery.

Parameters:

  • query (Optional[Union[VEC, str, 'PIL.Image.Image', Tuple, FullTextQuery]], default: None ) –

    The targeted vector to search for.

    • default None. Acceptable types are: list, np.ndarray, PIL.Image.Image

    • If None then the select/where/limit clauses are applied to filter the table

  • vector_column_name (Optional[str], default: None ) –

    The name of the vector column to search.

    The vector column needs to be a pyarrow fixed size list type

    • If not specified then the vector column is inferred from the table schema

    • If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.

  • query_type (QueryType, default: 'auto' ) –

    default "auto". Acceptable types are: "vector", "fts", "hybrid", or "auto"

    • If "auto" then the query type is inferred from the query;

      • If query is a list/np.ndarray then the query type is "vector";

      • If query is a PIL.Image.Image then either do vector search, or raise an error if no corresponding embedding function is found.

    • If query is a string, then the query type is "vector" if the table has embedding functions else the query type is "fts"

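Returns:

  • Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery] –

    A query builder object representing the query.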

Source code in lancedb/table.py
async def search(
    self,
    query: Optional[
        Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]
    ] = None,
    vector_column_name: Optional[str] = None,
    query_type: QueryType = "auto",
    ordering_field_name: Optional[str] = None,
    fts_columns: Optional[Union[str, List[str]]] = None,
) -> Union[AsyncHybridQuery, AsyncFTSQuery, AsyncVectorQuery]:
    """Create a search query to find the nearest neighbors
    of the given query vector. We currently support [vector search][search]
    and [full-text search][experimental-full-text-search].

    All query options are defined in [AsyncQuery][lancedb.query.AsyncQuery].

    Parameters
    ----------
    query: list/np.ndarray/str/PIL.Image.Image, default None
        The targeted vector to search for.

        - *default None*.
        Acceptable types are: list, np.ndarray, PIL.Image.Image

        - If None then the select/where/limit clauses are applied to filter
        the table
    vector_column_name: str, optional
        The name of the vector column to search.

        The vector column needs to be a pyarrow fixed size list type

        - If not specified then the vector column is inferred from
        the table schema

        - If the table has multiple vector columns then the *vector_column_name*
        needs to be specified. Otherwise, an error is raised.
    query_type: str
        *default "auto"*.
        Acceptable types are: "vector", "fts", "hybrid", or "auto"

        - If "auto" then the query type is inferred from the query;

            - If `query` is a list/np.ndarray then the query type is
            "vector";

            - If `query` is a PIL.Image.Image then either do vector search,
            or raise an error if no corresponding embedding function is found.

        - If `query` is a string, then the query type is "vector" if the
          table has embedding functions else the query type is "fts"

    Returns
    -------
    LanceQueryBuilder
        A query builder object representing the query.
    """

    def is_embedding(query):
        return isinstance(query, (list, np.ndarray, pa.Array, pa.ChunkedArray))

    async def get_embedding_func(
        vector_column_name: Optional[str],
        query_type: QueryType,
        query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple, FullTextQuery]],
    ) -> Tuple[str, EmbeddingFunctionConfig]:
        if isinstance(query, FullTextQuery):
            query_type = "fts"
        schema = await self.schema()
        vector_column_name = infer_vector_column_name(
            schema=schema,
            query_type=query_type,
            query=query,
            vector_column_name=vector_column_name,
        )
        funcs = EmbeddingFunctionRegistry.get_instance().parse_functions(
            schema.metadata
        )
        func = funcs.get(vector_column_name)
        if func is None:
            error = ValueError(
                f"Column '{vector_column_name}' has no registered "
                "embedding function."
            )
            if len(funcs) > 0:
                add_note(
                    error,
                    "Embedding functions are registered for columns: "
                    f"{list(funcs.keys())}",
                )
            else:
                add_note(
                    error, "No embedding functions are registered for any columns."
                )
            raise error
        return vector_column_name, func

    async def make_embedding(embedding, query):
        if embedding is not None:
            loop = asyncio.get_running_loop()
            # This function is likely to block, since it either calls an expensive
            # function or makes an HTTP request to an embeddings REST API.
            return (
                await loop.run_in_executor(
                    None,
                    embedding.function.compute_query_embeddings_with_retry,
                    query,
                )
            )[0]
        else:
            return None

    if query_type == "auto":
        # Infer the query type.
        if is_embedding(query):
            vector_query = query
            query_type = "vector"
        elif isinstance(query, FullTextQuery):
            query_type = "fts"
        elif isinstance(query, str):
            try:
                (
                    indices,
                    (vector_column_name, embedding_conf),
                ) = await asyncio.gather(
                    self.list_indices(),
                    get_embedding_func(vector_column_name, "auto", query),
                )
            except ValueError as e:
                if "Column" in str(
                    e
                ) and "has no registered embedding function" in str(e):
                    # If the column has no registered embedding function,
                    # then it's an FTS query.
                    query_type = "fts"
                else:
                    raise e
            else:
                if embedding_conf is not None:
                    vector_query = await make_embedding(embedding_conf, query)
                    if any(
                        i.columns[0] == embedding_conf.source_column
                        and i.index_type == "FTS"
                        for i in indices
                    ):
                        query_type = "hybrid"
                    else:
                        query_type = "vector"
                else:
                    query_type = "fts"
        else:
            # it's an image or something else embeddable.
            query_type = "vector"
    elif query_type == "vector":
        if is_embedding(query):
            vector_query = query
        else:
            vector_column_name, embedding_conf = await get_embedding_func(
                vector_column_name, query_type, query
            )
            vector_query = await make_embedding(embedding_conf, query)
    elif query_type == "hybrid":
        if is_embedding(query):
            raise ValueError("Hybrid search requires a text query")
        else:
            vector_column_name, embedding_conf = await get_embedding_func(
                vector_column_name, query_type, query
            )
            vector_query = await make_embedding(embedding_conf, query)

    if query_type == "vector":
        builder = self.query().nearest_to(vector_query)
        if vector_column_name:
            builder = builder.column(vector_column_name)
        return builder
    elif query_type == "fts":
        return self.query().nearest_to_text(query, columns=fts_columns)
    elif query_type == "hybrid":
        builder = self.query().nearest_to(vector_query)
        if vector_column_name:
            builder = builder.column(vector_column_name)
        return builder.nearest_to_text(query, columns=fts_columns)
    else:
        raise ValueError(f"Unknown query type: '{query_type}'")
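
The "auto" branch of the listing above can be summarized as a small decision function. This is a simplified model: `infer_query_type` and its two boolean flags are hypothetical, only plain lists stand in for vector inputs, and `FullTextQuery` objects (which the source maps to "fts") are left out. Note that, per the listing, a string query becomes "hybrid" when the embedding function's source column also has an FTS index.

```python
from typing import Any


def infer_query_type(
    query: Any,
    has_embedding_function: bool = False,
    has_fts_index: bool = False,
) -> str:
    """Simplified model of the query_type="auto" inference."""
    if isinstance(query, list):
        # Already an embedding: search it directly.
        return "vector"
    if isinstance(query, str):
        if not has_embedding_function:
            # Nothing can embed the text, so fall back to full-text search.
            return "fts"
        # The text can be embedded; an FTS index on the source column
        # additionally enables hybrid search.
        return "hybrid" if has_fts_index else "vector"
    # Images and other embeddable inputs are treated as vector queries.
    return "vector"


print(infer_query_type([0.1, 0.2]))
print(infer_query_type("a cat", has_embedding_function=True, has_fts_index=True))
print(infer_query_type("a cat"))
```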
vector_search(query_vector: Union[VEC, Tuple]) -> AsyncVectorQuery

Search the table with a given query vector. This is a convenience method for preparing a vector query and is the same as calling nearest_to on the builder returned by query. See nearest_to for more details.

Source code in lancedb/table.py
def vector_search(
    self,
    query_vector: Union[VEC, Tuple],
) -> AsyncVectorQuery:
    """
    Search the table with a given query vector.
    This is a convenience method for preparing a vector query and
    is the same thing as calling `nearest_to` on the builder returned
    by `query`.  See [nearest_to][lancedb.query.AsyncQuery.nearest_to] for more
    details.
    """
    return self.query().nearest_to(query_vector)

delete async

delete(where: str) -> DeleteResult

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

  • where (str) –

    The SQL where clause to use when deleting rows.

    • For example, 'x = 2' or 'x IN (1, 2, 3)'.

    The filter must not be empty, or it will error.

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1.0, 2]},
...    {"x": 2, "vector": [3.0, 4]},
...    {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
DeleteResult(version=2)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
DeleteResult(version=3)
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
Source code in lancedb/table.py
async def delete(self, where: str) -> DeleteResult:
    """Delete rows from the table.

    This can be used to delete a single row, many rows, all rows, or
    sometimes no rows (if your predicate matches nothing).

    Parameters
    ----------
    where: str
        The SQL where clause to use when deleting rows.

        - For example, 'x = 2' or 'x IN (1, 2, 3)'.

        The filter must not be empty, or it will error.

    Examples
    --------
    >>> import lancedb
    >>> data = [
    ...    {"x": 1, "vector": [1.0, 2]},
    ...    {"x": 2, "vector": [3.0, 4]},
    ...    {"x": 3, "vector": [5.0, 6]}
    ... ]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.delete("x = 2")
    DeleteResult(version=2)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  3  [5.0, 6.0]

    If you have a list of values to delete, you can combine them into a
    stringified list and use the `IN` operator:

    >>> to_remove = [1, 5]
    >>> to_remove = ", ".join([str(v) for v in to_remove])
    >>> to_remove
    '1, 5'
    >>> table.delete(f"x IN ({to_remove})")
    DeleteResult(version=3)
    >>> table.to_pandas()
       x      vector
    0  3  [5.0, 6.0]
    """
    return await self._inner.delete(where)
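
Because the filter must not be empty, it can help to build the `IN` predicate in one place and reject an empty list of values before calling `delete`. The helper below is a hypothetical sketch for integer keys only; it is not part of LanceDB.

```python
from typing import Iterable


def build_in_predicate(column: str, values: Iterable[int]) -> str:
    """Build a predicate such as 'x IN (1, 5)' from integer values."""
    # int() rejects non-numeric input, so only numeric literals reach the SQL string.
    rendered = ", ".join(str(int(v)) for v in values)
    if not rendered:
        # An empty list would produce an invalid, empty filter.
        raise ValueError("refusing to build a filter from an empty list of values")
    return f"{column} IN ({rendered})"


predicate = build_in_predicate("x", [1, 5])
print(predicate)  # x IN (1, 5)
```

The resulting string could then be passed as the `where` argument of `delete`.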

update async

update(updates: Optional[Dict[str, Any]] = None, *, where: Optional[str] = None, updates_sql: Optional[Dict[str, str]] = None) -> UpdateResult

This can be used to update zero to all rows in the table.

If a filter is provided with where then only rows matching the filter will be updated. Otherwise all rows will be updated.

Parameters:

  • updates (Optional[Dict[str, Any]], default: None ) –

    The updates to apply. The keys should be the name of the column to update. The values should be the new values to assign. This is required unless updates_sql is supplied.

  • where (Optional[str], default: None ) –

    An SQL filter that controls which rows are updated. For example, 'x = 2' or 'x IN (1, 2, 3)'. Only rows that satisfy this filter will be updated.

  • updates_sql (Optional[Dict[str, str]], default: None ) –

    The updates to apply, expressed as SQL expression strings. The keys should be column names. The values should be SQL expressions. These can be SQL literals (e.g. "7" or "'foo'") or they can be expressions based on the previous value of the row (e.g. "x + 1" to increment the x column by 1)

Returns:

  • UpdateResult –

    An object containing:

    • rows_updated: The number of rows that were updated
    • version: The new version number of the table after the update

Examples:

>>> import asyncio
>>> import lancedb
>>> import pandas as pd
>>> async def demo_update():
...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.create_table("my_table", data)
...     # x is [1, 2], vector is [[1, 2], [3, 4]]
...     await table.update({"vector": [10, 10]}, where="x = 2")
...     # x is [1, 2], vector is [[1, 2], [10, 10]]
...     await table.update(updates_sql={"x": "x + 1"})
...     # x is [2, 3], vector is [[1, 2], [10, 10]]
>>> asyncio.run(demo_update())
Source code in lancedb/table.py
async def update(
    self,
    updates: Optional[Dict[str, Any]] = None,
    *,
    where: Optional[str] = None,
    updates_sql: Optional[Dict[str, str]] = None,
) -> UpdateResult:
    """
    This can be used to update zero to all rows in the table.

    If a filter is provided with `where` then only rows matching the
    filter will be updated.  Otherwise all rows will be updated.

    Parameters
    ----------
    updates: dict, optional
        The updates to apply.  The keys should be the name of the column to
        update.  The values should be the new values to assign.  This is
        required unless updates_sql is supplied.
    where: str, optional
        An SQL filter that controls which rows are updated. For example, 'x = 2'
        or 'x IN (1, 2, 3)'.  Only rows that satisfy this filter will be updated.
    updates_sql: dict, optional
        The updates to apply, expressed as SQL expression strings.  The keys should
        be column names. The values should be SQL expressions.  These can be SQL
        literals (e.g. "7" or "'foo'") or they can be expressions based on the
        previous value of the row (e.g. "x + 1" to increment the x column by 1)

    Returns
    -------
    UpdateResult
        An object containing:
        - rows_updated: The number of rows that were updated
        - version: The new version number of the table after the update

    Examples
    --------
    >>> import asyncio
    >>> import lancedb
    >>> import pandas as pd
    >>> async def demo_update():
    ...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     table = await db.create_table("my_table", data)
    ...     # x is [1, 2], vector is [[1, 2], [3, 4]]
    ...     await table.update({"vector": [10, 10]}, where="x = 2")
    ...     # x is [1, 2], vector is [[1, 2], [10, 10]]
    ...     await table.update(updates_sql={"x": "x + 1"})
    ...     # x is [2, 3], vector is [[1, 2], [10, 10]]
    >>> asyncio.run(demo_update())
    """
    if updates is not None and updates_sql is not None:
        raise ValueError("Only one of updates or updates_sql can be provided")
    if updates is None and updates_sql is None:
        raise ValueError("Either updates or updates_sql must be provided")

    if updates is not None:
        updates_sql = {k: value_to_sql(v) for k, v in updates.items()}

    return await self._inner.update(updates_sql, where)
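
The argument handling above can be sketched in isolation: exactly one of `updates` or `updates_sql` must be given, and plain values are turned into SQL literal strings. `simple_value_to_sql` is a hypothetical, much reduced stand-in for the `value_to_sql` helper used in the listing (whose real behavior is not shown here), and `resolve_updates` is likewise an illustrative name.

```python
from typing import Any, Dict, Optional


def simple_value_to_sql(value: Any) -> str:
    """Hypothetical, simplified stand-in for LanceDB's value_to_sql."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # Double single quotes so the literal stays well formed.
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def resolve_updates(
    updates: Optional[Dict[str, Any]] = None,
    updates_sql: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Mirror the validation in update(): exactly one argument is allowed."""
    if updates is not None and updates_sql is not None:
        raise ValueError("Only one of updates or updates_sql can be provided")
    if updates is None and updates_sql is None:
        raise ValueError("Either updates or updates_sql must be provided")
    if updates is not None:
        return {k: simple_value_to_sql(v) for k, v in updates.items()}
    return dict(updates_sql)


print(resolve_updates({"x": 7, "name": "it's"}))
print(resolve_updates(updates_sql={"x": "x + 1"}))
```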

add_columns async

add_columns(transforms: dict[str, str] | field | List[field] | Schema) -> AddColumnsResult

Add new columns with defined values.

Parameters:

  • transforms (dict[str, str] | field | List[field] | Schema) –

    A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns. Alternatively, you can pass a pyarrow field or schema to add new columns with NULLs.

Returns:

  • AddColumnsResult –

    version: the new version number of the table after adding columns.

Source code in lancedb/table.py
async def add_columns(
    self, transforms: dict[str, str] | pa.field | List[pa.field] | pa.Schema
) -> AddColumnsResult:
    """
    Add new columns with defined values.

    Parameters
    ----------
    transforms: Dict[str, str]
        A map of column name to a SQL expression to use to calculate the
        value of the new column. These expressions will be evaluated for
        each row in the table, and can reference existing columns.
        Alternatively, you can pass a pyarrow field or schema to add
        new columns with NULLs.

    Returns
    -------
    AddColumnsResult
        version: the new version number of the table after adding columns.

    """
    if isinstance(transforms, pa.Field):
        transforms = [transforms]
    if isinstance(transforms, list) and all(
        {isinstance(f, pa.Field) for f in transforms}
    ):
        transforms = pa.schema(transforms)
    if isinstance(transforms, pa.Schema):
        return await self._inner.add_columns_with_schema(transforms)
    else:
        return await self._inner.add_columns(list(transforms.items()))

alter_columns async

alter_columns(*alterations: Iterable[dict[str, Any]]) -> AlterColumnsResult

Alter column names, data types, and nullability.

Parameters:

  • alterations (Iterable[dict[str, Any]]) –

    A sequence of dictionaries, each with the following keys:

    • "path" (str): The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".

    • "rename" (str, optional): The new name of the column. If not specified, the column name is not changed.

    • "data_type" (pyarrow.DataType, optional): The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.

    • "nullable" (bool, optional): Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.

Returns:

  • AlterColumnsResult –

    version: the new version number of the table after the alteration.

Source code in lancedb/table.py
async def alter_columns(
    self, *alterations: Iterable[dict[str, Any]]
) -> AlterColumnsResult:
    """
    Alter column names and nullability.

    alterations : Iterable[Dict[str, Any]]
        A sequence of dictionaries, each with the following keys:
        - "path": str
            The column path to alter. For a top-level column, this is the name.
            For a nested column, this is the dot-separated path, e.g. "a.b.c".
        - "rename": str, optional
            The new name of the column. If not specified, the column name is
            not changed.
        - "data_type": pyarrow.DataType, optional
            The new data type of the column. Existing values will be cast
            to this type. If not specified, the column data type is not changed.
        - "nullable": bool, optional
            Whether the column should be nullable. If not specified, the column
            nullability is not changed. Only non-nullable columns can be changed
            to nullable. Currently, you cannot change a nullable column to
            non-nullable.

    Returns
    -------
    AlterColumnsResult
        version: the new version number of the table after the alteration.
    """
    return await self._inner.alter_columns(alterations)
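
As an illustration of the dictionary shape described above, the sketch below checks alteration dictionaries against the documented keys before they are used. `check_alteration` and `ALLOWED_KEYS` are hypothetical names introduced here; LanceDB performs its own validation, and this helper only catches misspelled keys early.

```python
from typing import Any, Dict

# Keys documented for each alteration dictionary.
ALLOWED_KEYS = {"path", "rename", "data_type", "nullable"}


def check_alteration(alteration: Dict[str, Any]) -> Dict[str, Any]:
    """Reject alterations with unknown keys or a missing "path"."""
    unknown = set(alteration) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unknown alteration keys: {sorted(unknown)}")
    if not isinstance(alteration.get("path"), str):
        raise ValueError('Each alteration needs a string "path"')
    return alteration


alterations = [
    {"path": "a.b.c", "rename": "d"},  # rename a nested column
    {"path": "x", "nullable": True},   # make a non-nullable column nullable
]
checked = [check_alteration(a) for a in alterations]
print(checked)
# Assumption: with an AsyncTable, each dictionary is passed as one positional
# argument, e.g. `await table.alter_columns(*checked)`.
```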

drop_columns async

drop_columns(columns: Iterable[str])

Drop columns from the table.

Parameters:

  • columns (Iterable[str]) –

    The names of the columns to drop.

Source code in lancedb/table.py
async def drop_columns(self, columns: Iterable[str]):
    """
    Drop columns from the table.

    Parameters
    ----------
    columns : Iterable[str]
        The names of the columns to drop.
    """
    return await self._inner.drop_columns(columns)

version async

version() -> int

Retrieve the version of the table

LanceDB supports versioning. Every operation that modifies the table increases the version. As long as a version hasn't been deleted, you can use checkout to view the data at that point. In addition, you can use restore to replace the current table with a previous version.

Source code in lancedb/table.py
async def version(self) -> int:
    """
    Retrieve the version of the table

    LanceDB supports versioning.  Every operation that modifies the table increases
    the version.  As long as a version hasn't been deleted you can `checkout`
    that version to view the data at that point.  In addition, you can
    `restore` the version to replace the current table with a previous
    version.
    """
    return await self._inner.version()

list_versions async

list_versions()

List all versions of the table

Source code in lancedb/table.py
async def list_versions(self):
    """
    List all versions of the table
    """
    versions = await self._inner.list_versions()
    for v in versions:
        ts_nanos = v["timestamp"]
        v["timestamp"] = datetime.fromtimestamp(ts_nanos // 1e9) + timedelta(
            microseconds=(ts_nanos % 1e9) // 1e3
        )

    return versions
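
The loop above converts each version's nanosecond timestamp into a `datetime` with microsecond precision. A minimal sketch of the same conversion using integer arithmetic follows; `nanos_to_datetime` is a hypothetical helper, and unlike the listing (which produces a naive local-time value) it returns a UTC-aware `datetime` so the result does not depend on the machine's time zone.

```python
from datetime import datetime, timedelta, timezone


def nanos_to_datetime(ts_nanos: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a UTC datetime."""
    seconds, remainder_nanos = divmod(ts_nanos, 1_000_000_000)
    # datetime only stores microseconds, so the last three digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=remainder_nanos // 1_000
    )


print(nanos_to_datetime(1_700_000_000_123_456_789))
```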

checkout async

checkout(version: int | str)

Checks out a specific version of the Table

Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.

This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the restore method.

Any operation that modifies the table will fail while the table is in a checked out state.

Parameters:

  • version (int | str) –

    The version to check out. A version number (int) or a tag (str) can be provided.

To return the table to a normal state use checkout_latest.

Source code in lancedb/table.py
async def checkout(self, version: int | str):
    """
    Checks out a specific version of the Table

    Any read operation on the table will now access the data at the checked out
    version. As a consequence, calling this method will disable any read consistency
    interval that was previously set.

    This is a read-only operation that turns the table into a sort of "view"
    or "detached head".  Other table instances will not be affected.  To make the
    change permanent you can use the `restore` method.

    Any operation that modifies the table will fail while the table is in a checked
    out state.

    Parameters
    ----------
    version: int | str,
        The version to check out. A version number (`int`) or a tag
        (`str`) can be provided.

    To return the table to a normal state use `checkout_latest`.
    """
    try:
        await self._inner.checkout(version)
    except RuntimeError as e:
        if "not found" in str(e):
            raise ValueError(
                f"Version {version} no longer exists. Was it cleaned up?"
            )
        else:
            raise

checkout_latest async

checkout_latest()

Ensures the table is pointing at the latest version

This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a checkout operation.

Source code in lancedb/table.py
async def checkout_latest(self):
    """
    Ensures the table is pointing at the latest version

    This can be used to manually update a table when the read_consistency_interval
    is None.
    It can also be used to undo a `checkout` operation.
    """
    await self._inner.checkout_latest()

restore async

restore(version: Optional[int | str] = None)

Restore the table to the currently checked out version

This operation will fail if checkout has not been called previously

This operation will overwrite the latest version of the table with a previous version. Any changes made since the checked out version will no longer be visible.

Once the operation concludes the table will no longer be in a checked out state and the read_consistency_interval, if any, will apply.

Source code in lancedb/table.py
async def restore(self, version: Optional[int | str] = None):
    """
    Restore the table to the currently checked out version

    This operation will fail if checkout has not been called previously

    This operation will overwrite the latest version of the table with a
    previous version.  Any changes made since the checked out version will
    no longer be visible.

    Once the operation concludes the table will no longer be in a checked
    out state and the read_consistency_interval, if any, will apply.
    """
    await self._inner.restore(version)
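
The interplay of version, checkout, checkout_latest, and restore described above can be modeled with a small in-memory class. This is a toy model for building intuition, not LanceDB's implementation: `ToyVersionedTable` is hypothetical, it keeps full copies of the rows per version, and it assumes that restore records the old data as a new latest version.

```python
class ToyVersionedTable:
    """Toy in-memory model of table versioning (not LanceDB's implementation)."""

    def __init__(self, rows):
        # Versions start at 1; every snapshot is a full copy of the rows.
        self._snapshots = {1: list(rows)}
        self._latest = 1
        self._checked_out = None  # None means "follow the latest version"

    def version(self):
        return self._latest if self._checked_out is None else self._checked_out

    def rows(self):
        return list(self._snapshots[self.version()])

    def add(self, row):
        # Modifying operations fail while a version is checked out.
        if self._checked_out is not None:
            raise RuntimeError("table is checked out; writes are not allowed")
        self._latest += 1
        self._snapshots[self._latest] = self._snapshots[self._latest - 1] + [row]

    def checkout(self, version):
        if version not in self._snapshots:
            raise ValueError(f"Version {version} no longer exists. Was it cleaned up?")
        self._checked_out = version

    def checkout_latest(self):
        self._checked_out = None

    def restore(self):
        if self._checked_out is None:
            raise RuntimeError("restore requires a prior checkout")
        # Assumption: the restored data becomes a new latest version.
        self._latest += 1
        self._snapshots[self._latest] = list(self._snapshots[self._checked_out])
        self._checked_out = None


table = ToyVersionedTable([{"x": 1}])
table.add({"x": 2})        # creates version 2
table.checkout(1)          # read-only view of version 1
view_rows = table.rows()
table.restore()            # version 3 now holds the data of version 1
print(table.version(), table.rows())
```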

optimize async

optimize(*, cleanup_older_than: Optional[timedelta] = None, delete_unverified: bool = False, retrain=False) -> OptimizeStats

Optimize the on-disk data and indices for better performance.

Modeled after VACUUM in PostgreSQL.

Optimization covers three operations:

  • Compaction: Merges small files into larger ones
  • Prune: Removes old versions of the dataset
  • Index: Optimizes the indices, adding new data to existing indices

Parameters:

  • cleanup_older_than (Optional[timedelta], default: None ) –

    All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.

  • delete_unverified (bool, default: False ) –

    Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.

  • retrain (bool, default: False ) –

    If True, retrain the vector indices. This refines the IVF clustering and quantization, which may improve search accuracy. It is faster than re-creating the index from scratch, so it is recommended to try this first when the data distribution has changed significantly.

Experimental API

The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.

That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.

How often an application should call optimize depends on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.

Source code in lancedb/table.py
async def optimize(
    self,
    *,
    cleanup_older_than: Optional[timedelta] = None,
    delete_unverified: bool = False,
    retrain=False,
) -> OptimizeStats:
    """
    Optimize the on-disk data and indices for better performance.

    Modeled after ``VACUUM`` in PostgreSQL.

    Optimization covers three operations:

     * Compaction: Merges small files into larger ones
     * Prune: Removes old versions of the dataset
     * Index: Optimizes the indices, adding new data to existing indices

    Parameters
    ----------
    cleanup_older_than: timedelta, optional default 7 days
        All files belonging to versions older than this will be removed.  Set
        to 0 days to remove all versions except the latest.  The latest version
        is never removed.
    delete_unverified: bool, default False
        Files leftover from a failed transaction may appear to be part of an
        in-progress operation (e.g. appending new data) and these files will not
        be deleted unless they are at least 7 days old. If delete_unverified is True
        then these files will be deleted regardless of their age.
    retrain: bool, default False
        If True, retrain the vector indices. This refines the IVF clustering
        and quantization, which may improve search accuracy. It is faster than
        re-creating the index from scratch, so it is recommended to try this first
        when the data distribution has changed significantly.

    Experimental API
    ----------------

    The optimization process is undergoing active development and may change.
    Our goal with these changes is to improve the performance of optimization and
    reduce the complexity.

    That being said, it is essential today to run optimize if you want the best
    performance.  It should be stable and safe to use in production, but it is our
    hope that the API may be simplified (or not even need to be called) in the
    future.

    The frequency an application should call optimize is based on the frequency of
    data modifications.  If data is frequently added, deleted, or updated then
    optimize should be run frequently.  A good rule of thumb is to run optimize if
    you have added or modified 100,000 or more records or run more than 20 data
    modification operations.
    """
    cleanup_since_ms: Optional[int] = None
    if cleanup_older_than is not None:
        cleanup_since_ms = round(cleanup_older_than.total_seconds() * 1000)
    return await self._inner.optimize(
        cleanup_since_ms=cleanup_since_ms,
        delete_unverified=delete_unverified,
        retrain=retrain,
    )
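
The rule of thumb from the docstring, together with the conversion of `cleanup_older_than` into milliseconds shown in the listing, can be written as two small helpers. Both function names are hypothetical; the thresholds are the ones quoted in the text, and an application would track its own counters.

```python
from datetime import timedelta
from typing import Optional


def should_optimize(rows_changed: int, write_operations: int) -> bool:
    """Rule of thumb: 100,000 or more changed rows, or more than 20 writes."""
    return rows_changed >= 100_000 or write_operations > 20


def cleanup_cutoff_ms(cleanup_older_than: Optional[timedelta]) -> Optional[int]:
    """Convert the timedelta to whole milliseconds, as the listing does."""
    if cleanup_older_than is None:
        return None
    return round(cleanup_older_than.total_seconds() * 1000)


print(should_optimize(rows_changed=150_000, write_operations=3))
print(cleanup_cutoff_ms(timedelta(days=7)))
```

Passing `timedelta(0)` yields a cutoff of 0 ms, which corresponds to removing all versions except the latest.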

list_indices async

list_indices() -> Iterable[IndexConfig]

List all indices that have been created with create_index.

Source code in lancedb/table.py
async def list_indices(self) -> Iterable[IndexConfig]:
    """
    List all indices that have been created with `create_index`.
    """
    return await self._inner.list_indices()

index_stats async

index_stats(index_name: str) -> Optional[IndexStatistics]

Retrieve statistics about an index

Parameters:

  • index_name (str) –

    The name of the index to retrieve statistics for

Returns:

  • IndexStatistics or None –

    The statistics about the index. Returns None if the index does not exist.

Source code in lancedb/table.py
async def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
    """
    Retrieve statistics about an index

    Parameters
    ----------
    index_name: str
        The name of the index to retrieve statistics for

    Returns
    -------
    IndexStatistics or None
        The statistics about the index. Returns None if the index does not exist.
    """
    stats = await self._inner.index_stats(index_name)
    if stats is None:
        return None
    else:
        return IndexStatistics(**stats)

uses_v2_manifest_paths async

uses_v2_manifest_paths() -> bool

Check if the table is using the new v2 manifest paths.

Returns:

  • bool –

    True if the table is using the new v2 manifest paths, False otherwise.

Source code in lancedb/table.py
async def uses_v2_manifest_paths(self) -> bool:
    """
    Check if the table is using the new v2 manifest paths.

    Returns
    -------
    bool
        True if the table is using the new v2 manifest paths, False otherwise.
    """
    return await self._inner.uses_v2_manifest_paths()

migrate_manifest_paths_v2 async

migrate_manifest_paths_v2()

Migrate the manifest paths to the new format.

This will update the manifest to use the new v2 format for paths.

This function is idempotent, and can be run multiple times without changing the state of the object store.

Danger

This should not be run while other concurrent operations are happening. It should also run to completion before other operations resume.

You can use AsyncTable.uses_v2_manifest_paths to check if the table is already using the new path style.

Source code in lancedb/table.py
async def migrate_manifest_paths_v2(self):
    """
    Migrate the manifest paths to the new format.

    This will update the manifest to use the new v2 format for paths.

    This function is idempotent, and can be run multiple times without
    changing the state of the object store.

    !!! danger

        This should not be run while other concurrent operations are happening.
        It should also run to completion before other operations resume.

    You can use
    [AsyncTable.uses_v2_manifest_paths][lancedb.table.AsyncTable.uses_v2_manifest_paths]
    to check if the table is already using the new path style.
    """
    await self._inner.migrate_manifest_paths_v2()
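A minimal sketch of the check-then-migrate pattern suggested above. It uses only the two methods documented in this section; `ensure_v2_manifest_paths` is a hypothetical helper name, and `_FakeTable` is a stand-in so the sketch runs without a database. With a real `AsyncTable`, the caller is still responsible for making sure no concurrent operations are running.

```python
import asyncio


async def ensure_v2_manifest_paths(table) -> bool:
    """Migrate only when needed. Returns True if a migration was performed."""
    if await table.uses_v2_manifest_paths():
        return False
    await table.migrate_manifest_paths_v2()
    return True


class _FakeTable:
    """Stand-in for AsyncTable (illustration only, not a LanceDB class)."""

    def __init__(self) -> None:
        self.v2 = False
        self.migrations = 0

    async def uses_v2_manifest_paths(self) -> bool:
        return self.v2

    async def migrate_manifest_paths_v2(self) -> None:
        self.v2 = True
        self.migrations += 1


fake = _FakeTable()
first = asyncio.run(ensure_v2_manifest_paths(fake))
second = asyncio.run(ensure_v2_manifest_paths(fake))
print(first, second, fake.migrations)  # True False 1
```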

replace_field_metadata async

replace_field_metadata(field_name: str, new_metadata: dict[str, str])

Replace the metadata of a field in the schema

Parameters:

  • field_name (str) –

    The name of the field to replace the metadata for

  • new_metadata (dict[str, str]) –

    The new metadata to set

Source code in lancedb/table.py
async def replace_field_metadata(
    self, field_name: str, new_metadata: dict[str, str]
):
    """
    Replace the metadata of a field in the schema

    Parameters
    ----------
    field_name: str
        The name of the field to replace the metadata for
    new_metadata: dict
        The new metadata to set
    """
    await self._inner.replace_field_metadata(field_name, new_metadata)

Indices (Asynchronous)

Indices can be created on a table to speed up queries. This section lists the indices that LanceDB supports.

lancedb.index.BTree dataclass

Describes a btree index configuration

A btree index is an index on scalar columns. The index stores a copy of the column in sorted order. A header entry is created for each block of rows (currently the block size is fixed at 4096). These header entries are stored in a separate cacheable structure (a btree). To search for data the header is used to determine which blocks need to be read from disk.

For example, a btree index in a table with 1Bi rows requires sizeof(Scalar) * 256Ki bytes of memory and will generally need to read sizeof(Scalar) * 4096 bytes to find the correct row ids.

This index is good for scalar columns with mostly distinct values and does best when the query is highly selective. It works with numeric, temporal, and string columns.

The btree index does not currently have any parameters though parameters such as the block size may be added in the future.

Source code in lancedb/index.py
@dataclass
class BTree:
    """Describes a btree index configuration

    A btree index is an index on scalar columns.  The index stores a copy of the
    column in sorted order.  A header entry is created for each block of rows
    (currently the block size is fixed at 4096).  These header entries are stored
    in a separate cacheable structure (a btree).  To search for data the header is
    used to determine which blocks need to be read from disk.

    For example, a btree index in a table with 1Bi rows requires
    sizeof(Scalar) * 256Ki bytes of memory and will generally need to read
    sizeof(Scalar) * 4096 bytes to find the correct row ids.

    This index is good for scalar columns with mostly distinct values and does best
    when the query is highly selective. It works with numeric, temporal, and string
    columns.

    The btree index does not currently have any parameters though parameters such as
    the block size may be added in the future.
    """

    pass
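The memory estimate in the description above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: `btree_footprint` is a hypothetical helper, and the 8-byte scalar size is an assumption chosen for the example (for instance a 64-bit integer column).

```python
def btree_footprint(num_rows: int, scalar_size: int, block_size: int = 4096):
    """Return (header_bytes_in_memory, bytes_read_per_lookup)."""
    # One header entry per block of rows (ceiling division).
    num_blocks = -(-num_rows // block_size)
    return num_blocks * scalar_size, block_size * scalar_size


# 1Bi rows of 8-byte scalars: 256Ki header entries.
header_bytes, read_bytes = btree_footprint(num_rows=2**30, scalar_size=8)
print(header_bytes // 2**20, "MiB of headers")     # 2 MiB of headers
print(read_bytes // 2**10, "KiB read per lookup")  # 32 KiB read per lookup
```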

lancedb.index.Bitmap dataclass

Describe a Bitmap index configuration.

A Bitmap index stores a bitmap for each distinct value in the column for every row.

This index works best for low-cardinality numeric or string columns, where the number of unique values is small (i.e., fewer than a few thousand). A Bitmap index can accelerate the following filters:

  • <, <=, =, >, >=
  • IN (value1, value2, ...)
  • between (value1, value2)
  • is null

For example, a bitmap index with a table with 1Bi rows, and 128 distinct values, requires 128 / 8 * 1Bi bytes on disk.

Source code in lancedb/index.py
@dataclass
class Bitmap:
    """Describe a Bitmap index configuration.

    A `Bitmap` index stores a bitmap for each distinct value in the column for
    every row.

    This index works best for low-cardinality numeric or string columns,
    where the number of unique values is small (i.e., fewer than a few thousand).
    A `Bitmap` index can accelerate the following filters:

    - `<`, `<=`, `=`, `>`, `>=`
    - `IN (value1, value2, ...)`
    - `between (value1, value2)`
    - `is null`

    For example, a bitmap index with a table with 1Bi rows, and 128 distinct values,
    requires 128 / 8 * 1Bi bytes on disk.
    """

    pass
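The on-disk estimate above follows from one bit per row for each distinct value. The sketch below simply evaluates that formula; `bitmap_index_bytes` is a hypothetical helper, and it takes the estimate at face value without modelling any compression.

```python
def bitmap_index_bytes(num_rows: int, num_distinct: int) -> int:
    """One bit per row for each distinct value, 8 bits per byte."""
    return num_distinct * num_rows // 8


size = bitmap_index_bytes(num_rows=2**30, num_distinct=128)
print(size // 2**30, "GiB")  # 16 GiB
```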

lancedb.index.LabelList dataclass

Describe a LabelList index configuration.

LabelList is a scalar index that can be used on List<T> columns to support queries with array_contains_all and array_contains_any using an underlying bitmap index.

For example, it works with tags, categories, keywords, etc.

Source code in lancedb/index.py
@dataclass
class LabelList:
    """Describe a LabelList index configuration.

    `LabelList` is a scalar index that can be used on `List<T>` columns to
    support queries with `array_contains_all` and `array_contains_any`
    using an underlying bitmap index.

    For example, it works with `tags`, `categories`, `keywords`, etc.
    """

    pass
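To make the two query types concrete, here is a toy model of a label index in plain Python. It is only an illustration of the semantics of `array_contains_any` and `array_contains_all`, using Python sets in place of bitmaps; `ToyLabelIndex` is a hypothetical class and does not reflect how LanceDB stores the index.

```python
from collections import defaultdict


class ToyLabelIndex:
    """Maps each label to the set of row ids whose list contains it."""

    def __init__(self, rows):
        self.num_rows = len(rows)
        self.postings = defaultdict(set)
        for row_id, labels in enumerate(rows):
            for label in labels:
                self.postings[label].add(row_id)

    def contains_any(self, labels):
        """Rows whose list contains at least one of the labels (union)."""
        result = set()
        for label in labels:
            result |= self.postings.get(label, set())
        return sorted(result)

    def contains_all(self, labels):
        """Rows whose list contains every one of the labels (intersection)."""
        sets = [self.postings.get(label, set()) for label in labels]
        if not sets:
            return list(range(self.num_rows))
        return sorted(set.intersection(*sets))


index = ToyLabelIndex([["a", "b"], ["b"], ["c"], []])
print(index.contains_any(["a", "c"]))  # [0, 2]
print(index.contains_all(["a", "b"]))  # [0]
```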

lancedb.index.FTS dataclass

Describe a FTS index configuration.

FTS is a full-text search index that can be used on String columns.

For example, it works with title, description, content, etc.

Attributes:

  • with_position (bool, default True) –

    Whether to store the position of the token in the document. Setting this to False can reduce the size of the index and improve indexing speed, but it will disable support for phrase queries.

  • base_tokenizer (str, default "simple") –

    The base tokenizer to use for tokenization. Options are:

      • "simple": Splits text by whitespace and punctuation.
      • "whitespace": Splits text by whitespace, but not punctuation.
      • "raw": No tokenization. The entire text is treated as a single token.

  • language (str, default "English") –

    The language to use for tokenization.

  • max_token_length (int, default 40) –

    The maximum token length to index. Tokens longer than this length will be ignored.

  • lower_case (bool, default True) –

    Whether to convert the token to lower case. This makes queries case-insensitive.

  • stem (bool, default False) –

    Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".

  • remove_stop_words (bool, default False) –

    Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".

  • ascii_folding (bool, default False) –

    Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".

Source code in lancedb/index.py
@dataclass
class FTS:
    """Describe a FTS index configuration.

    `FTS` is a full-text search index that can be used on `String` columns.

    For example, it works with `title`, `description`, `content`, etc.

    Attributes
    ----------
    with_position : bool, default True
        Whether to store the position of the token in the document. Setting this
        to False can reduce the size of the index and improve indexing speed,
        but it will disable support for phrase queries.
    base_tokenizer : str, default "simple"
        The base tokenizer to use for tokenization. Options are:
        - "simple": Splits text by whitespace and punctuation.
        - "whitespace": Splits text by whitespace, but not punctuation.
        - "raw": No tokenization. The entire text is treated as a single token.
    language : str, default "English"
        The language to use for tokenization.
    max_token_length : int, default 40
        The maximum token length to index. Tokens longer than this length will be
        ignored.
    lower_case : bool, default True
        Whether to convert the token to lower case. This makes queries case-insensitive.
    stem : bool, default False
        Whether to stem the token. Stemming reduces words to their root form.
        For example, in English "running" and "runs" would both be reduced to "run".
    remove_stop_words : bool, default False
        Whether to remove stop words. Stop words are common words that are often
        removed from text before indexing. For example, in English "the" and "and".
    ascii_folding : bool, default False
        Whether to fold ASCII characters. This converts accented characters to
        their ASCII equivalent. For example, "café" would be converted to "cafe".
    """

    with_position: bool = True
    base_tokenizer: Literal["simple", "raw", "whitespace"] = "simple"
    language: str = "English"
    max_token_length: Optional[int] = 40
    lower_case: bool = True
    stem: bool = False
    remove_stop_words: bool = False
    ascii_folding: bool = False
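The effect of several of these options can be approximated with a small standalone tokenizer. This is a rough sketch and not LanceDB's tokenizer: the regular expression used for "simple", the two-word stop list, and the order of the steps are assumptions made for illustration, and stemming is left out.

```python
import re
import unicodedata

STOP_WORDS = {"the", "and"}  # tiny illustrative list, not the real one


def tokenize(
    text,
    base_tokenizer="simple",
    lower_case=True,
    max_token_length=40,
    remove_stop_words=False,
    ascii_folding=False,
):
    if base_tokenizer == "raw":
        tokens = [text]                    # the whole text is one token
    elif base_tokenizer == "whitespace":
        tokens = text.split()              # punctuation stays attached
    else:
        tokens = re.findall(r"\w+", text)  # split on whitespace and punctuation

    result = []
    for token in tokens:
        if max_token_length is not None and len(token) > max_token_length:
            continue                       # overly long tokens are ignored
        if lower_case:
            token = token.lower()
        if ascii_folding:
            # Decompose accented characters, then drop the non-ASCII marks.
            decomposed = unicodedata.normalize("NFKD", token)
            token = decomposed.encode("ascii", "ignore").decode("ascii")
        if remove_stop_words and token in STOP_WORDS:
            continue
        result.append(token)
    return result


print(tokenize("Hello, World!"))                               # ['hello', 'world']
print(tokenize("Hello, World!", base_tokenizer="whitespace"))  # ['hello,', 'world!']
print(tokenize("The Caf\u00e9 and more", remove_stop_words=True, ascii_folding=True))
# ['cafe', 'more']
```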

lancedb.index.IvfPq dataclass

Describes an IVF PQ Index

This index stores a compressed (quantized) copy of every vector. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.

During a query the centroids are compared with the query vector to find the closest partitions. The compressed vectors in these partitions are then searched to find the closest vectors.

The compression scheme is called product quantization. Each vector is divided into subvectors and then each subvector is quantized into a small number of bits. The parameters num_bits and num_sub_vectors control this process, providing a tradeoff between index size (and thus search speed) and index accuracy.

The partitioning process is called IVF and the num_partitions parameter controls how many groups to create.

Note that training an IVF PQ index on a large dataset is a slow operation and currently is also a memory intensive operation.

Attributes:

  • distance_type (str, default "l2") –

    The distance metric used to train the index

    This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization.

    The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike l2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2].

    Note: the cosine distance is undefined when one (or both) of the vectors are all zeros (there is no direction). These vectors are invalid and may never be returned from a vector search.

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their l2 norm is 1), then dot distance is equivalent to the cosine distance.

  • num_partitions (int, default sqrt(num_rows)) –

    The number of IVF partitions to create.

    This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows.

    If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.

  • num_sub_vectors (int, default is vector dimension / 16) –

    Number of sub-vectors of PQ.

    This value controls how much the vector is compressed during the quantization step. The more sub vectors there are the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divided by 8.

    The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions.

    If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.

  • num_bits (int, default 8) –

    Number of bits to encode each sub-vector.

    This value controls how much the sub-vectors are compressed. The more bits, the more accurate the index but the slower the search. The default is 8 bits. Only 4 and 8 are supported.

  • max_iterations (int, default 50) –

    Max iterations to train kmeans.

    When training an IVF PQ index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns.

    The default value is 50.

  • sample_rate (int, default 256) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF PQ index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

    The default value is 256.
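The three `distance_type` options described above can be illustrated with their textbook definitions. This is a plain-Python sketch for intuition only; the function names are hypothetical and the code is not LanceDB's implementation, which may differ in details.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: sensitive to both magnitude and direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: ranges over [0, 2], ignores magnitude."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for all-zero vectors")
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)


# Same direction, different magnitude: cosine distance is 0, l2 is not.
print(cosine_distance([1, 0], [2, 0]), l2_distance([1, 0], [2, 0]))
# Opposite direction: cosine distance reaches its maximum of 2.
print(cosine_distance([1, 0], [-3, 0]))
```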

Source code in lancedb/index.py
@dataclass
class IvfPq:
    """Describes an IVF PQ Index

    This index stores a compressed (quantized) copy of every vector.  These vectors
    are grouped into partitions of similar vectors.  Each partition keeps track of
    a centroid which is the average value of all vectors in the group.

    During a query the centroids are compared with the query vector to find the
    closest partitions.  The compressed vectors in these partitions are then
    searched to find the closest vectors.

    The compression scheme is called product quantization.  Each vector is divided
    into subvectors and then each subvector is quantized into a small number of
    bits.  The parameters `num_bits` and `num_sub_vectors` control this process,
    providing a tradeoff between index size (and thus search speed) and index
    accuracy.

    The partitioning process is called IVF and the `num_partitions` parameter
    controls how many groups to create.

    Note that training an IVF PQ index on a large dataset is a slow operation and
    currently is also a memory intensive operation.

    Attributes
    ----------
    distance_type: str, default "l2"
        The distance metric used to train the index

        This is used when training the index to calculate the IVF partitions
        (vectors are grouped in partitions with similar vectors according to this
        distance type) and to calculate a subvector's code during quantization.

        The distance type used to train an index MUST match the distance type used
        to search the index.  Failure to do so will yield inaccurate results.

        The following distance types are available:

        "l2" - Euclidean distance. This is a very common distance metric that
        accounts for both magnitude and direction when determining the distance
        between vectors. l2 distance has a range of [0, ∞).

        "cosine" - Cosine distance.  Cosine distance is a distance metric
        calculated from the cosine similarity between two vectors. Cosine
        similarity is a measure of similarity between two non-zero vectors of an
        inner product space. It is defined to equal the cosine of the angle
        between them.  Unlike l2, the cosine distance is not affected by the
        magnitude of the vectors.  Cosine distance has a range of [0, 2].

        Note: the cosine distance is undefined when one (or both) of the vectors
        are all zeros (there is no direction).  These vectors are invalid and may
        never be returned from a vector search.

        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
        l2 norm is 1), then dot distance is equivalent to the cosine distance.
    num_partitions: int, default sqrt(num_rows)
        The number of IVF partitions to create.

        This value should generally scale with the number of rows in the dataset.
        By default the number of partitions is the square root of the number of
        rows.

        If this value is too large then the first part of the search (picking the
        right partition) will be slow.  If this value is too small then the second
        part of the search (searching within a partition) will be slow.
    num_sub_vectors: int, default is vector dimension / 16
        Number of sub-vectors of PQ.

        This value controls how much the vector is compressed during the
        quantization step.  The more sub vectors there are the less the vector is
        compressed.  The default is the dimension of the vector divided by 16.  If
        the dimension is not evenly divisible by 16 we use the dimension divided by
        8.

        The above two cases are highly preferred.  Having 8 or 16 values per
        subvector allows us to use efficient SIMD instructions.

        If the dimension is not divisible by 8 then we use 1 subvector.  This is not
        ideal and will likely result in poor performance.
    num_bits: int, default 8
        Number of bits to encode each sub-vector.

        This value controls how much the sub-vectors are compressed.  The more bits,
        the more accurate the index but the slower the search.  The default is 8
        bits.  Only 4 and 8 are supported.
    max_iterations: int, default 50
        Max iteration to train kmeans.

        When training an IVF PQ index we use kmeans to calculate the partitions.
        This parameter controls how many iterations of kmeans to run.

        Increasing this might improve the quality of the index but in most cases
        these extra iterations have diminishing returns.

        The default value is 50.
    sample_rate: int, default 256
        The rate used to calculate the number of training vectors for kmeans.

        When an IVF PQ index is trained, we need to calculate partitions.  These
        are groups of vectors that are similar to each other.  To do this we use an
        algorithm called kmeans.

        Running kmeans on a large dataset can be slow.  To speed this up we run
        kmeans on a random sample of the data.  This parameter controls the size of
        the sample.  The total number of vectors used to train the index is
        `sample_rate * num_partitions`.

        Increasing this value might improve the quality of the index but in most
        cases the default should be sufficient.

        The default value is 256.
    """

    distance_type: Literal["l2", "cosine", "dot"] = "l2"
    num_partitions: Optional[int] = None
    num_sub_vectors: Optional[int] = None
    num_bits: int = 8
    max_iterations: int = 50
    sample_rate: int = 256
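The defaults described in the docstring above translate into simple arithmetic. The helpers below are hypothetical and only restate the documented rules; rounding the square root down with `math.isqrt` is an assumption, since the text does not say how the default number of partitions is rounded.

```python
import math


def default_num_sub_vectors(dim: int) -> int:
    """dim / 16 if divisible, else dim / 8 if divisible, else 1."""
    if dim % 16 == 0:
        return dim // 16
    if dim % 8 == 0:
        return dim // 8
    return 1


def pq_code_bytes(num_sub_vectors: int, num_bits: int = 8) -> int:
    """Bytes needed for one quantized vector (rounded up to whole bytes)."""
    if num_bits not in (4, 8):
        raise ValueError("only 4 and 8 bits are supported")
    return (num_sub_vectors * num_bits + 7) // 8


def default_num_partitions(num_rows: int) -> int:
    """Square root of the row count (rounding down is an assumption)."""
    return max(1, math.isqrt(num_rows))


def training_sample_size(num_partitions: int, sample_rate: int = 256) -> int:
    """Number of vectors used to train kmeans."""
    return sample_rate * num_partitions


sub_vectors = default_num_sub_vectors(1536)
partitions = default_num_partitions(1_000_000)
print(sub_vectors)                       # 96
print(pq_code_bytes(sub_vectors))        # 96
print(partitions)                        # 1000
print(training_sample_size(partitions))  # 256000
```

For a 1536-dimensional vector, the quantized code in this estimate takes 96 bytes per vector with the default settings.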

lancedb.index.HnswPq dataclass

Describe a HNSW-PQ index configuration.

HNSW-PQ stands for Hierarchical Navigable Small World - Product Quantization. It is a variant of the HNSW algorithm that uses product quantization to compress the vectors. To create an HNSW-PQ index, you can specify the following parameters:

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot'], default: 'l2' ) –

    The distance metric used to train the index.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike l2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2].

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their l2 norm is 1), then dot distance is equivalent to the cosine distance.

  • num_partitions (Optional[int], default: None ) –

    The number of IVF partitions to create.

    For HNSW, we recommend a small number of partitions. Setting this to 1 works well for most tables. For very large tables, training just one HNSW graph will require too much memory. Each partition becomes its own HNSW graph, so setting this value higher reduces the peak memory use of training.

  • num_sub_vectors (Optional[int], default: None ) –

    Number of sub-vectors of PQ.

    This value controls how much the vector is compressed during the quantization step. The more sub vectors there are the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divided by 8.

    The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions.

    If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.

  • num_bits (int, default: 8 ) –

    Number of bits to encode each sub-vector.

    This value controls how much the sub-vectors are compressed. The more bits, the more accurate the index but the slower the search. Only 4 and 8 are supported.

  • max_iterations (int, default: 50 ) –

    Max iterations to train kmeans.

    When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases the parameter is unused because kmeans will converge with fewer iterations. The parameter is only used in cases where kmeans does not appear to converge. In those cases it is unlikely that setting this larger will lead to the index converging anyways.

  • sample_rate (int, default: 256 ) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

  • m (int, default: 20 ) –

    The number of neighbors to select for each vector in the HNSW graph.

    This value controls the tradeoff between search speed and accuracy. The higher the value the more accurate the search but the slower it will be.

  • ef_construction (int, default: 300 ) –

    The number of candidates to evaluate during the construction of the HNSW graph.

    This value controls the tradeoff between build speed and accuracy. The higher the value the more accurate the build but the slower it will be. 150 to 300 is the typical range. 100 is a minimum for good quality search results. In most cases, there is no benefit to setting this higher than 500. This value should be set to a value that is not less than ef in the search phase.
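The constraints stated in the parameter descriptions above can be collected into a small sanity check. `check_hnsw_pq_params` is a hypothetical helper, not a LanceDB function, and `ef_search` is an assumed name for the `ef` value used at query time.

```python
def check_hnsw_pq_params(num_bits=8, ef_construction=300, ef_search=None):
    """Return a list of warnings based on the documented guidance."""
    problems = []
    if num_bits not in (4, 8):
        problems.append("num_bits must be 4 or 8")
    if ef_construction < 100:
        problems.append("ef_construction below 100 may give poor search quality")
    if ef_construction > 500:
        problems.append("ef_construction above 500 rarely brings any benefit")
    if ef_search is not None and ef_construction < ef_search:
        problems.append("ef_construction should not be less than ef at search time")
    return problems


print(check_hnsw_pq_params())  # []
for warning in check_hnsw_pq_params(num_bits=6, ef_construction=50, ef_search=80):
    print(warning)
```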

Source code in lancedb/index.py
@dataclass
class HnswPq:
    """Describe a HNSW-PQ index configuration.

    HNSW-PQ stands for Hierarchical Navigable Small World - Product Quantization.
    It is a variant of the HNSW algorithm that uses product quantization to compress
    the vectors. To create an HNSW-PQ index, you can specify the following parameters:

    Parameters
    ----------

    distance_type: str, default "l2"

        The distance metric used to train the index.

        The following distance types are available:

        "l2" - Euclidean distance. This is a very common distance metric that
        accounts for both magnitude and direction when determining the distance
        between vectors. l2 distance has a range of [0, ∞).

        "cosine" - Cosine distance.  Cosine distance is a distance metric
        calculated from the cosine similarity between two vectors. Cosine
        similarity is a measure of similarity between two non-zero vectors of an
        inner product space. It is defined to equal the cosine of the angle
        between them.  Unlike l2, the cosine distance is not affected by the
        magnitude of the vectors.  Cosine distance has a range of [0, 2].

        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
        l2 norm is 1), then dot distance is equivalent to the cosine distance.

    num_partitions: int, default sqrt(num_rows)

        The number of IVF partitions to create.

        For HNSW, we recommend a small number of partitions. Setting this to 1 works
        well for most tables. For very large tables, training just one HNSW graph
        will require too much memory. Each partition becomes its own HNSW graph, so
        setting this value higher reduces the peak memory use of training.

    num_sub_vectors: int, default is vector dimension / 16

        Number of sub-vectors of PQ.

        This value controls how much the vector is compressed during the
        quantization step. The more sub vectors there are the less the vector is
        compressed.  The default is the dimension of the vector divided by 16.
        If the dimension is not evenly divisible by 16 we use the dimension
        divided by 8.

        The above two cases are highly preferred.  Having 8 or 16 values per
        subvector allows us to use efficient SIMD instructions.

        If the dimension is not divisible by 8 then we use 1 subvector.  This is not
        ideal and will likely result in poor performance.

    num_bits: int, default 8
        Number of bits to encode each sub-vector.

        This value controls how much the sub-vectors are compressed.  The more bits,
        the more accurate the index but the slower the search.  Only 4 and 8 are
        supported.

    max_iterations, default 50

        Max iterations to train kmeans.

        When training an IVF index we use kmeans to calculate the partitions.  This
        parameter controls how many iterations of kmeans to run.

        Increasing this might improve the quality of the index but in most cases the
        parameter is unused because kmeans will converge with fewer iterations.  The
        parameter is only used in cases where kmeans does not appear to converge.  In
        those cases it is unlikely that setting this larger will lead to the index
        converging anyways.

    sample_rate, default 256

        The rate used to calculate the number of training vectors for kmeans.

        When an IVF index is trained, we need to calculate partitions.  These are
        groups of vectors that are similar to each other.  To do this we use an
        algorithm called kmeans.

        Running kmeans on a large dataset can be slow.  To speed this up we
        run kmeans on a random sample of the data.  This parameter controls the
        size of the sample.  The total number of vectors used to train the index
        is `sample_rate * num_partitions`.

        Increasing this value might improve the quality of the index but in
        most cases the default should be sufficient.

    m, default 20

        The number of neighbors to select for each vector in the HNSW graph.

        This value controls the tradeoff between search speed and accuracy.
        The higher the value the more accurate the search but the slower it will be.

    ef_construction, default 300

        The number of candidates to evaluate during the construction of the HNSW graph.

        This value controls the tradeoff between build speed and accuracy.
        The higher the value the more accurate the build but the slower it will be.
        150 to 300 is the typical range. 100 is a minimum for good quality search
        results. In most cases, there is no benefit to setting this higher than 500.
        This value should be set to a value that is not less than `ef` in the
        search phase.
    """

    distance_type: Literal["l2", "cosine", "dot"] = "l2"
    num_partitions: Optional[int] = None
    num_sub_vectors: Optional[int] = None
    num_bits: int = 8
    max_iterations: int = 50
    sample_rate: int = 256
    m: int = 20
    ef_construction: int = 300
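
The default rule for num_sub_vectors described in the docstring above can be sketched in a few lines. This is a minimal illustration only: `default_num_sub_vectors` and `pq_code_bytes` are hypothetical helpers, not part of lancedb, and the size estimate assumes each sub-vector is stored as a single `num_bits`-wide code.

```python
def default_num_sub_vectors(dim: int) -> int:
    """Default sub-vector count as described above (hypothetical helper)."""
    if dim % 16 == 0:
        return dim // 16  # 16 values per sub-vector
    if dim % 8 == 0:
        return dim // 8   # 8 values per sub-vector
    return 1              # fallback: a single sub-vector


def pq_code_bytes(num_sub_vectors: int, num_bits: int = 8) -> float:
    """Size of one PQ code, assuming one num_bits-wide code per sub-vector."""
    return num_sub_vectors * num_bits / 8


for dim in (1536, 776, 100):
    n = default_num_sub_vectors(dim)
    print(dim, n, pq_code_bytes(n))
```

Under these assumptions, more sub-vectors or more bits mean a larger code per vector, which is the compression trade-off the parameter descriptions refer to.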

lancedb.index.HnswSq dataclass

Describe an HNSW-SQ index configuration.

HNSW-SQ stands for Hierarchical Navigable Small World - Scalar Quantization. It is a variant of the HNSW algorithm that uses scalar quantization to compress the vectors.

Parameters:

  • distance_type (Literal['l2', 'cosine', 'dot'], default: 'l2' ) –

    The distance metric used to train the index.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike l2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2].

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their l2 norm is 1), then dot distance is equivalent to the cosine distance.

  • num_partitions (Optional[int], default: None ) –

    The number of IVF partitions to create.

    For HNSW, we recommend a small number of partitions. Setting this to 1 works well for most tables. For very large tables, training just one HNSW graph will require too much memory. Each partition becomes its own HNSW graph, so setting this value higher reduces the peak memory use of training.

  • max_iterations (int, default: 50 ) –

    Max iterations to train kmeans.

    When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases the parameter is unused because kmeans will converge with fewer iterations. The parameter is only used in cases where kmeans does not appear to converge. In those cases it is unlikely that setting this larger will lead to the index converging anyways.

  • sample_rate (int, default: 256 ) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

  • m (int, default: 20 ) –

    The number of neighbors to select for each vector in the HNSW graph.

    This value controls the tradeoff between search speed and accuracy. The higher the value the more accurate the search but the slower it will be.

  • ef_construction (int, default: 300 ) –

    The number of candidates to evaluate during the construction of the HNSW graph.

    This value controls the tradeoff between build speed and accuracy. The higher the value the more accurate the build but the slower it will be. 150 to 300 is the typical range. 100 is a minimum for good quality search results. In most cases, there is no benefit to setting this higher than 500. This value should be set to a value that is not less than ef in the search phase.

Source code in lancedb/index.py
@dataclass
class HnswSq:
    """Describe an HNSW-SQ index configuration.

    HNSW-SQ stands for Hierarchical Navigable Small World - Scalar Quantization.
    It is a variant of the HNSW algorithm that uses scalar quantization to compress
    the vectors.

    Parameters
    ----------

    distance_type: str, default "l2"

        The distance metric used to train the index.

        The following distance types are available:

        "l2" - Euclidean distance. This is a very common distance metric that
        accounts for both magnitude and direction when determining the distance
        between vectors. l2 distance has a range of [0, ∞).

        "cosine" - Cosine distance.  Cosine distance is a distance metric
        calculated from the cosine similarity between two vectors. Cosine
        similarity is a measure of similarity between two non-zero vectors of an
        inner product space. It is defined to equal the cosine of the angle
        between them.  Unlike l2, the cosine distance is not affected by the
        magnitude of the vectors.  Cosine distance has a range of [0, 2].

        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
        l2 norm is 1), then dot distance is equivalent to the cosine distance.

    num_partitions, default sqrt(num_rows)

        The number of IVF partitions to create.

        For HNSW, we recommend a small number of partitions. Setting this to 1 works
        well for most tables. For very large tables, training just one HNSW graph
        will require too much memory. Each partition becomes its own HNSW graph, so
        setting this value higher reduces the peak memory use of training.

    max_iterations, default 50

        Max iterations to train kmeans.

        When training an IVF index we use kmeans to calculate the partitions.
        This parameter controls how many iterations of kmeans to run.

        Increasing this might improve the quality of the index but in most cases
        the parameter is unused because kmeans will converge with fewer iterations.
        The parameter is only used in cases where kmeans does not appear to converge.
        In those cases it is unlikely that setting this larger will lead to
        the index converging anyways.

    sample_rate, default 256

        The rate used to calculate the number of training vectors for kmeans.

        When an IVF index is trained, we need to calculate partitions.  These
        are groups of vectors that are similar to each other.  To do this
        we use an algorithm called kmeans.

        Running kmeans on a large dataset can be slow.  To speed this up we
        run kmeans on a random sample of the data.  This parameter controls the
        size of the sample.  The total number of vectors used to train the index
        is `sample_rate * num_partitions`.

        Increasing this value might improve the quality of the index but in
        most cases the default should be sufficient.

    m, default 20

        The number of neighbors to select for each vector in the HNSW graph.

        This value controls the tradeoff between search speed and accuracy.
        The higher the value the more accurate the search but the slower it will be.

    ef_construction, default 300

        The number of candidates to evaluate during the construction of the HNSW graph.

        This value controls the tradeoff between build speed and accuracy.
        The higher the value the more accurate the build but the slower it will be.
        150 to 300 is the typical range. 100 is a minimum for good quality search
        results. In most cases, there is no benefit to setting this higher than 500.
        This value should be set to a value that is not less than `ef` in the search
        phase.

    """

    distance_type: Literal["l2", "cosine", "dot"] = "l2"
    num_partitions: Optional[int] = None
    max_iterations: int = 50
    sample_rate: int = 256
    m: int = 20
    ef_construction: int = 300
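
To make "scalar quantization" concrete, here is a generic sketch of the idea: each float is mapped to a small integer code and later approximated from that code. The 8-bit code width and the single global `lo`/`hi` range are assumptions made for illustration; the text above does not say how lancedb chooses them, and `sq_encode` / `sq_decode` are hypothetical names.

```python
from typing import List


def sq_encode(vec: List[float], lo: float, hi: float) -> List[int]:
    """Map each float in [lo, hi] to an integer code in [0, 255]."""
    scale = (hi - lo) / 255
    if scale == 0:
        return [0 for _ in vec]
    # Values outside [lo, hi] are clamped to the nearest valid code.
    return [min(255, max(0, round((x - lo) / scale))) for x in vec]


def sq_decode(codes: List[int], lo: float, hi: float) -> List[float]:
    """Approximate the original floats from their codes."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]


vec = [-1.0, -0.2, 0.5, 1.0]
codes = sq_encode(vec, lo=-1.0, hi=1.0)
approx = sq_decode(codes, lo=-1.0, hi=1.0)
max_error = max(abs(a - b) for a, b in zip(vec, approx))
print(codes, max_error)
```

In this sketch each value shrinks from a float to one byte, at the cost of a reconstruction error of at most half a quantization step.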

lancedb.index.IvfFlat dataclass

Describes an IVF Flat Index

This index stores raw vectors. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.

Attributes:

  • distance_type (str, default "l2") –

    The distance metric used to train the index

    This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization.

    The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results.

    The following distance types are available:

    "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. l2 distance has a range of [0, ∞).

    "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike l2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2].

    Note: the cosine distance is undefined when one (or both) of the vectors are all zeros (there is no direction). These vectors are invalid and may never be returned from a vector search.

    "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their l2 norm is 1), then dot distance is equivalent to the cosine distance.

    "hamming" - Hamming distance. Hamming distance is a distance metric calculated as the number of positions at which the corresponding bits are different. Hamming distance has a range of [0, vector dimension].

  • num_partitions (int, default sqrt(num_rows)) –

    The number of IVF partitions to create.

    This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows.

    If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.

  • max_iterations (int, default 50) –

    Max iterations to train kmeans.

    When training an IVF index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run.

    Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns.

    The default value is 50.

  • sample_rate (int, default 256) –

    The rate used to calculate the number of training vectors for kmeans.

    When an IVF index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans.

    Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is sample_rate * num_partitions.

    Increasing this value might improve the quality of the index but in most cases the default should be sufficient.

    The default value is 256.
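
The four distance types listed under distance_type can be sketched with their textbook definitions. This is an illustration, not lancedb's implementation: the exact values the library reports (for example, whether l2 is squared, or how the dot product is turned into a distance) may differ, and all function names here are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; range [0, inf)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; range [0, 2]; undefined for zero vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for all-zero vectors")
    return 1.0 - dot(a, b) / norm


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


print(l2_distance([0, 0], [3, 4]))
print(cosine_distance([1, 0], [2, 0]))   # same direction, different magnitude
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
print(hamming_distance(bytes([0b1010]), bytes([0b0110])))
```

The second call shows why cosine distance ignores magnitude: scaling one vector leaves the angle, and therefore the distance, unchanged.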

Source code in lancedb/index.py
@dataclass
class IvfFlat:
    """Describes an IVF Flat Index

    This index stores raw vectors.
    These vectors are grouped into partitions of similar vectors.
    Each partition keeps track of a centroid which is
    the average value of all vectors in the group.

    Attributes
    ----------
    distance_type: str, default "l2"
        The distance metric used to train the index

        This is used when training the index to calculate the IVF partitions
        (vectors are grouped in partitions with similar vectors according to this
        distance type) and to calculate a subvector's code during quantization.

        The distance type used to train an index MUST match the distance type used
        to search the index.  Failure to do so will yield inaccurate results.

        The following distance types are available:

        "l2" - Euclidean distance. This is a very common distance metric that
        accounts for both magnitude and direction when determining the distance
        between vectors. l2 distance has a range of [0, ∞).

        "cosine" - Cosine distance.  Cosine distance is a distance metric
        calculated from the cosine similarity between two vectors. Cosine
        similarity is a measure of similarity between two non-zero vectors of an
        inner product space. It is defined to equal the cosine of the angle
        between them.  Unlike l2, the cosine distance is not affected by the
        magnitude of the vectors.  Cosine distance has a range of [0, 2].

        Note: the cosine distance is undefined when one (or both) of the vectors
        are all zeros (there is no direction).  These vectors are invalid and may
        never be returned from a vector search.

        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
        l2 norm is 1), then dot distance is equivalent to the cosine distance.

        "hamming" - Hamming distance. Hamming distance is a distance metric
        calculated as the number of positions at which the corresponding bits are
        different. Hamming distance has a range of [0, vector dimension].

    num_partitions: int, default sqrt(num_rows)
        The number of IVF partitions to create.

        This value should generally scale with the number of rows in the dataset.
        By default the number of partitions is the square root of the number of
        rows.

        If this value is too large then the first part of the search (picking the
        right partition) will be slow.  If this value is too small then the second
        part of the search (searching within a partition) will be slow.

    max_iterations: int, default 50
        Max iterations to train kmeans.

        When training an IVF index we use kmeans to calculate the partitions.
        This parameter controls how many iterations of kmeans to run.

        Increasing this might improve the quality of the index but in most cases
        these extra iterations have diminishing returns.

        The default value is 50.
    sample_rate: int, default 256
        The rate used to calculate the number of training vectors for kmeans.

        When an IVF index is trained, we need to calculate partitions.  These
        are groups of vectors that are similar to each other.  To do this we use an
        algorithm called kmeans.

        Running kmeans on a large dataset can be slow.  To speed this up we run
        kmeans on a random sample of the data.  This parameter controls the size of
        the sample.  The total number of vectors used to train the index is
        `sample_rate * num_partitions`.

        Increasing this value might improve the quality of the index but in most
        cases the default should be sufficient.

        The default value is 256.
    """

    distance_type: Literal["l2", "cosine", "dot", "hamming"] = "l2"
    num_partitions: Optional[int] = None
    max_iterations: int = 50
    sample_rate: int = 256
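
The arithmetic behind the num_partitions default and the kmeans training sample can be worked through as follows. This is a sketch under stated assumptions: both helpers are hypothetical, and rounding the square root down (with a floor of 1) is a choice made here, since the docstring only says "the square root of the number of rows".

```python
import math


def default_num_partitions(num_rows: int) -> int:
    """Square root of the row count, rounded down (rounding is an assumption)."""
    return max(1, math.isqrt(num_rows))


def num_training_vectors(num_partitions: int, sample_rate: int = 256) -> int:
    """Sample size used to train kmeans: sample_rate * num_partitions."""
    return sample_rate * num_partitions


rows = 1_000_000
partitions = default_num_partitions(rows)
print(partitions, num_training_vectors(partitions))
```

For a table with one million rows this gives 1,000 partitions and a training sample of 256,000 vectors at the default sample_rate.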

Querying (Asynchronous)

Queries allow you to return data from your database. Basic queries can be created with the AsyncTable.query method to return the entire (typically filtered) table. Vector searches return the rows nearest to a query vector and can be created with the AsyncTable.vector_search method.

lancedb.query.AsyncQuery

Bases: AsyncQueryBase

Source code in lancedb/query.py
class AsyncQuery(AsyncQueryBase):
    def __init__(self, inner: LanceQuery):
        """
        Construct an AsyncQuery

        This method is not intended to be called directly.  Instead, use the
        [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
        """
        super().__init__(inner)
        self._inner = inner

    @classmethod
    def _query_vec_to_array(self, vec: Union[VEC, Tuple]):
        if isinstance(vec, list):
            return pa.array(vec)
        if isinstance(vec, np.ndarray):
            return pa.array(vec)
        if isinstance(vec, pa.Array):
            return vec
        if isinstance(vec, pa.ChunkedArray):
            return vec.combine_chunks()
        if isinstance(vec, tuple):
            return pa.array(vec)
        # We've checked everything we formally support in our typings
        # but, as a fallback, let pyarrow try and convert it anyway.
        # This can allow for some more exotic things like iterables
        return pa.array(vec)

    def nearest_to(
        self,
        query_vector: Union[VEC, Tuple, List[VEC]],
    ) -> AsyncVectorQuery:
        """
        Find the nearest vectors to the given query vector.

        This converts the query from a plain query to a vector query.

        This method will attempt to convert the input to the query vector
        expected by the embedding model.  If the input cannot be converted
        then an error will be thrown.

        By default, there is no embedding model, and the input should be
        something that can be converted to a pyarrow array of floats.  This
        includes lists, numpy arrays, and tuples.

        If there is only one vector column (a column whose data type is a
        fixed size list of floats) then the column does not need to be specified.
        If there is more than one vector column you must use
        [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify
        which column you would like to compare with.

        If no index has been created on the vector column then a vector query
        will perform a distance comparison between the query vector and every
        vector in the database and then sort the results.  This is sometimes
        called a "flat search".

        For small databases, with tens of thousands of vectors or less, this can
        be reasonably fast.  In larger databases you should create a vector index
        on the column.  If there is a vector index then an "approximate" nearest
        neighbor search (frequently called an ANN search) will be performed.  This
        search is much faster, but the results will be approximate.

        The query can be further parameterized using the returned builder.  There
        are various ANN search parameters that will let you fine tune your recall
        accuracy vs search latency.

        Vector searches always have a [limit][].  If `limit` has not been called then
        a default `limit` of 10 will be used.

        Typically, a single vector is passed in as the query. However, you can also
        pass in multiple vectors. If the vector column has a multivector type, the
        vectors are treated as a single query. Otherwise, they are treated as
        multiple queries, which can be useful if you want to find the nearest
        vectors to several query vectors at once.
        This is not expected to be faster than making multiple queries concurrently;
        it is just a convenience method. If multiple vectors are passed in then
        an additional column `query_index` will be added to the results. This column
        will contain the index of the query vector that the result is nearest to.
        """
        if query_vector is None:
            raise ValueError("query_vector can not be None")

        if (
            isinstance(query_vector, (list, np.ndarray, pa.Array))
            and len(query_vector) > 0
            and isinstance(query_vector[0], (list, np.ndarray, pa.Array))
        ):
            # multiple have been passed
            query_vectors = [AsyncQuery._query_vec_to_array(v) for v in query_vector]
            new_self = self._inner.nearest_to(query_vectors[0])
            for v in query_vectors[1:]:
                new_self.add_query_vector(v)
            return AsyncVectorQuery(new_self)
        else:
            return AsyncVectorQuery(
                self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
            )

    def nearest_to_text(
        self, query: str | FullTextQuery, columns: Union[str, List[str], None] = None
    ) -> AsyncFTSQuery:
        """
        Find the documents that are most relevant to the given text query.

        This method will perform a full text search on the table and return
        the most relevant documents.  The relevance is determined by BM25.

        The columns to search must have a native FTS index
        (Tantivy-based indices do not work with this method).

        By default, all indexed columns are searched. Currently, only one
        column can be searched at a time.

        Parameters
        ----------
        query: str or FullTextQuery
            The text query to search for.
        columns: str or list of str, default None
            The columns to search in. If None, all indexed columns are searched.
            For now only one column can be searched at a time.
        """
        if isinstance(columns, str):
            columns = [columns]
        if columns is None:
            columns = []

        if isinstance(query, str):
            return AsyncFTSQuery(
                self._inner.nearest_to_text({"query": query, "columns": columns})
            )
        # FullTextQuery object
        return AsyncFTSQuery(self._inner.nearest_to_text({"query": query.to_dict()}))
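
The "flat search" described in the nearest_to docstring (compare the query with every vector, then sort) can be sketched in plain Python. This is an illustration of the idea only, not lancedb's implementation: `flat_search`, the `row_id` and `distance` keys, and the use of Euclidean distance are assumptions. The default `limit` of 10 and the `query_index` field follow the docstring.

```python
import heapq
import math
from typing import Dict, List, Sequence


def flat_search(
    rows: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
    limit: int = 10,
) -> List[Dict[str, float]]:
    """Brute-force nearest neighbours for one or more query vectors."""
    results = []
    for query_index, query in enumerate(queries):
        # Score every stored vector against this query.
        scored = ((math.dist(query, vec), row_id) for row_id, vec in enumerate(rows))
        # Keep only the `limit` closest rows, nearest first.
        for distance, row_id in heapq.nsmallest(limit, scored):
            results.append(
                {"query_index": query_index, "row_id": row_id, "distance": distance}
            )
    return results


rows = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
hits = flat_search(rows, queries=[[0.9, 0.0], [5.0, 4.0]], limit=1)
print(hits)
```

The cost grows with the number of stored vectors times the number of queries, which is why an index is recommended for larger tables.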

where

where(predicate: str) -> Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code in lancedb/query.py
def where(self, predicate: str) -> Self:
    """
    Only return rows matching the given predicate

    The predicate should be supplied as an SQL query string.

    Examples
    --------

    >>> predicate = "x > 10"
    >>> predicate = "y > 0 AND y < 100"
    >>> predicate = "x > 5 OR y = 'test'"

    Filtering performance can often be improved by creating a scalar index
    on the filter column(s).
    """
    self._inner.where(predicate)
    return self

select

select(columns: Union[List[str], dict[str, str]]) -> Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDB stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.

Source code in lancedb/query.py
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
    """
    Return only the specified columns.

    By default a query will return all columns from the table.  However, this can
    have a very significant impact on latency.  LanceDB stores data in a columnar
    fashion.  This means we can finely tune our I/O to select exactly the columns
    we need.

    As a best practice you should always limit queries to the columns that you need.
    If you pass in a list of column names then only those columns will be
    returned.

    You can also use this method to create new "dynamic" columns based on your
    existing columns. For example, you may not care about "a" or "b" but instead
    simply want "a + b".  This is often seen in the SELECT clause of an SQL query
    (e.g. `SELECT a+b FROM my_table`).

    To create dynamic columns you can pass in a dict[str, str].  A column will be
    returned for each entry in the map.  The key provides the name of the column.
    The value is an SQL string used to specify how the column is calculated.

    For example, an SQL query might state `SELECT a + b AS combined, c`.  The
    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.

    Columns will always be returned in the order given, even if that order is
    different than the order used when adding the data.
    """
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        self._inner.select_columns(columns)
    elif isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        self._inner.select(list(columns.items()))
    else:
        raise TypeError("columns must be a list of column names or a dict")
    return self
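
The two accepted shapes of the columns argument can be illustrated with a small stand-alone helper. `normalize_columns` is hypothetical and not part of lancedb; it expresses both forms as (output name, SQL expression) pairs, with a plain column name treated as an expression that selects itself, and it rejects anything else with the same TypeError condition as the listing above.

```python
from typing import Dict, List, Tuple, Union


def normalize_columns(
    columns: Union[List[str], Dict[str, str]],
) -> List[Tuple[str, str]]:
    """Return (output name, SQL expression) pairs in the order given."""
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        # A plain column name selects that column unchanged.
        return [(c, c) for c in columns]
    if isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        # Dict entries are "dynamic" columns: name -> SQL expression.
        return list(columns.items())
    raise TypeError("columns must be a list of column names or a dict")


# Equivalent of: SELECT a + b AS combined, c
print(normalize_columns({"combined": "a + b", "c": "c"}))
print(normalize_columns(["a", "b"]))
```

Because dicts preserve insertion order in Python, the pairs come out in the order given, matching the ordering guarantee described above.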

limit

limit(limit: int) -> Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code in lancedb/query.py
def limit(self, limit: int) -> Self:
    """
    Set the maximum number of results to return.

    By default, a plain search has no limit.  If this method is not
    called then every valid row from the table will be returned.
    """
    self._inner.limit(limit)
    return self

offset

offset(offset: int) -> Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Source code in lancedb/query.py
def offset(self, offset: int) -> Self:
    """
    Set the offset for the results.

    Parameters
    ----------
    offset: int
        The offset to start fetching results from.
    """
    self._inner.offset(offset)
    return self
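
Combining limit and offset is a common way to page through results. The sketch below shows the arithmetic on an in-memory list; `page_window` and `apply_window` are hypothetical helpers, pages are assumed to be zero-based, and the slice stands in for what a query with the same offset and limit would be expected to return.

```python
from typing import List, Tuple


def page_window(page: int, page_size: int) -> Tuple[int, int]:
    """Return (offset, limit) for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return page * page_size, page_size


def apply_window(rows: List[int], offset: int, limit: int) -> List[int]:
    """Skip `offset` rows, then return at most `limit` rows."""
    return rows[offset : offset + limit]


rows = list(range(25))
offset, limit = page_window(page=2, page_size=10)
print(offset, limit, apply_window(rows, offset, limit))
```

The last page may hold fewer than `limit` rows, as it does here, so callers should not rely on a full page to detect the end of the results.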

fast_search

fast_search() -> Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code in lancedb/query.py
def fast_search(self) -> Self:
    """
    Skip searching un-indexed data.

    This can make queries faster, but will miss any data that has not been
    indexed.

    !!! tip
        You can add new data into an existing index by calling
        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].
    """
    self._inner.fast_search()
    return self

with_row_id

with_row_id() -> Self

Include the _rowid column in the results.

Source code in lancedb/query.py
def with_row_id(self) -> Self:
    """
    Include the _rowid column in the results.
    """
    self._inner.with_row_id()
    return self

postfilter

postfilter() -> Self

If this is called then filtering will happen after the search instead of before.

By default, filtering is performed before the search (prefiltering). This is how filtering is typically understood to work, but the prefilter step adds some latency. Creating a scalar index on the filter column(s) can often reduce this latency. However, sometimes a filter is too complex, or scalar indices cannot be applied to the column. In these cases, postfiltering can be used instead of prefiltering to improve latency.

Postfiltering applies the filter to the results of the search, so the filter only runs on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Postfiltering happens during the "refine stage" (described in more detail in AsyncVectorQuery.refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by postfiltering.

Source code in lancedb/query.py
def postfilter(self) -> Self:
    """
    If this is called then filtering will happen after the search instead of
    before.
    By default filtering will be performed before the search.  This is how
    filtering is typically understood to work.  This prefilter step does add some
    additional latency.  Creating a scalar index on the filter column(s) can
    often improve this latency.  However, sometimes a filter is too complex or
    scalar indices cannot be applied to the column.  In these cases postfiltering
    can be used instead of prefiltering to improve latency.
    Post filtering applies the filter to the results of the search.  This
    means we only run the filter on a much smaller set of data.  However, it can
    cause the query to return fewer than `limit` results (or even no results) if
    none of the nearest results match the filter.
    Post filtering happens during the "refine stage" (described in more detail in
    [AsyncVectorQuery.refine_factor][lancedb.query.AsyncVectorQuery.refine_factor]).
    This means that setting a higher refine factor can often help restore some of
    the results lost by post filtering.
    """
    self._inner.postfilter()
    return self
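
The difference between prefiltering and postfiltering can be illustrated with a toy model. This is a sketch in plain Python, not LanceDB's implementation: rows are hypothetical `(id, distance)` pairs and both helper functions are made up for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, float]  # (row id, distance to the query vector)


def prefilter_search(
    rows: List[Row], predicate: Callable[[Row], bool], limit: int
) -> List[Row]:
    """Filter first, then take the `limit` nearest of the matching rows."""
    candidates = [row for row in rows if predicate(row)]
    return sorted(candidates, key=lambda row: row[1])[:limit]


def postfilter_search(
    rows: List[Row], predicate: Callable[[Row], bool], limit: int
) -> List[Row]:
    """Take the `limit` nearest rows first, then drop those failing the filter."""
    nearest = sorted(rows, key=lambda row: row[1])[:limit]
    return [row for row in nearest if predicate(row)]


def is_odd(row: Row) -> bool:
    return row[0] % 2 == 1


rows = [(i, float(i)) for i in range(10)]  # row i sits at distance i
pre = prefilter_search(rows, is_odd, limit=4)
post = postfilter_search(rows, is_odd, limit=4)
print([row[0] for row in pre], [row[0] for row in post])
```

Prefiltering returns four matching rows, while postfiltering only keeps the matches among the four nearest rows and therefore returns fewer than `limit` results.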

to_batches async

to_batches(*, max_batch_length: Optional[int] = None, timeout: Optional[timedelta] = None) -> AsyncRecordBatchReader

Execute the query and return the results as an Apache Arrow RecordBatchReader.

Parameters:

  • max_batch_length (Optional[int], default: None ) –

    The maximum number of selected records in a single RecordBatch object. If not specified, a default batch length is used. It is possible for batches to be smaller than the provided length if the underlying data is stored in smaller chunks.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_batches(
    self,
    *,
    max_batch_length: Optional[int] = None,
    timeout: Optional[timedelta] = None,
) -> AsyncRecordBatchReader:
    """
    Execute the query and return the results as an Apache Arrow RecordBatchReader.

    Parameters
    ----------

    max_batch_length: Optional[int]
        The maximum number of selected records in a single RecordBatch object.
        If not specified, a default batch length is used.
        It is possible for batches to be smaller than the provided length if the
        underlying data is stored in smaller chunks.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return AsyncRecordBatchReader(
        await self._inner.execute(max_batch_length, timeout)
    )
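
The note that batches can be smaller than max_batch_length can be modelled as follows. This is a hedged sketch using only the standard library: `stream_batches` and `batch_lengths` are hypothetical, and the assumption is that a batch never spans two underlying storage chunks.

```python
import asyncio
from typing import AsyncIterator, List, Sequence


async def stream_batches(
    chunks: Sequence[Sequence[int]], max_batch_length: int
) -> AsyncIterator[List[int]]:
    """Yield batches of at most `max_batch_length` rows, one chunk at a time.

    Assumption: a batch never spans two stored chunks, so the tail of each
    chunk can produce a batch shorter than `max_batch_length`.
    """
    if max_batch_length <= 0:
        raise ValueError("max_batch_length must be positive")
    for chunk in chunks:
        for start in range(0, len(chunk), max_batch_length):
            yield list(chunk[start : start + max_batch_length])


async def batch_lengths(
    chunks: Sequence[Sequence[int]], max_batch_length: int
) -> List[int]:
    """Consume the stream one batch at a time, as `async for` would."""
    lengths = []
    async for batch in stream_batches(chunks, max_batch_length):
        lengths.append(len(batch))
    return lengths


stored_chunks = [[1, 2, 3, 4, 5], [6, 7]]
lengths = asyncio.run(batch_lengths(stored_chunks, max_batch_length=3))
print(lengths)
```

Only one batch is held in memory at a time here, which is the reason to prefer batch iteration over collecting all results when the result set is large.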

to_arrow async

to_arrow(timeout: Optional[timedelta] = None) -> Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches instead.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_arrow(self, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and collect the results into an Apache Arrow Table.

    This method will collect all results into memory before returning.  If
    you expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches]

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    batch_iter = await self.to_batches(timeout=timeout)
    return pa.Table.from_batches(
        await batch_iter.read_all(), schema=batch_iter.schema
    )

to_list async

to_list(timeout: Optional[timedelta] = None) -> List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_list(self, timeout: Optional[timedelta] = None) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (await self.to_arrow(timeout=timeout)).to_pylist()

to_pandas async

to_pandas(flatten: Optional[Union[int, bool]] = None, timeout: Optional[timedelta] = None) -> 'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int, bool]], default: None ) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_pandas(
    self,
    flatten: Optional[Union[int, bool]] = None,
    timeout: Optional[timedelta] = None,
) -> "pd.DataFrame":
    """
    Execute the query and collect the results into a pandas DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    pandas separately.

    Examples
    --------

    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = batch.to_pandas()
    >>> asyncio.run(doctest_example())

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (
        flatten_columns(await self.to_arrow(timeout=timeout), flatten)
    ).to_pandas()
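
The three modes of the flatten parameter can be sketched on a nested dictionary. This is an illustration only: `flatten_record` is a hypothetical helper, nested dictionaries stand in for struct columns, and joining names with "." is an assumption about the naming scheme.

```python
from typing import Any, Dict, Optional, Union


def flatten_record(
    record: Dict[str, Any], flatten: Optional[Union[int, bool]] = None
) -> Dict[str, Any]:
    """Flatten nested dicts: True = fully, int = up to that depth, None = not."""
    if flatten is None or flatten is False:
        return dict(record)
    # True means "no depth limit"; an integer caps the number of levels.
    depth = None if flatten is True else int(flatten)
    if depth is not None and depth <= 0:
        raise ValueError("flatten depth must be positive")
    out = dict(record)
    level = 0
    while depth is None or level < depth:
        if not any(isinstance(value, dict) for value in out.values()):
            break  # nothing nested is left
        flattened: Dict[str, Any] = {}
        for key, value in out.items():
            if isinstance(value, dict):
                # Promote each child one level up, prefixed by the parent name.
                for child_key, child_value in value.items():
                    flattened[f"{key}.{child_key}"] = child_value
            else:
                flattened[key] = value
        out = flattened
        level += 1
    return out


record = {"id": 1, "meta": {"lang": "en", "geo": {"lat": 1.5, "lon": 2.5}}}
one_level = flatten_record(record, 1)
fully_flat = flatten_record(record, True)
print(sorted(one_level), sorted(fully_flat))
```

With a depth of 1 the inner `geo` structure stays nested, while `True` keeps flattening until no nested values remain.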

to_polars async

to_polars(timeout: Optional[timedelta] = None) -> 'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())

Source code in lancedb/query.py
async def to_polars(
    self,
    timeout: Optional[timedelta] = None,
) -> "pl.DataFrame":
    """
    Execute the query and collect the results into a Polars DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    polars separately.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.

    Examples
    --------

    >>> import asyncio
    >>> import polars as pl
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = pl.from_arrow(batch)
    >>> asyncio.run(doctest_example())
    """
    import polars as pl

    return pl.from_arrow(await self.to_arrow(timeout=timeout))

explain_plan async

explain_plan(verbose: Optional[bool] = False)

Return the execution plan for this query.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to(query).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default: False ) –

    Use a verbose output format.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def explain_plan(self, verbose: Optional[bool] = False):
    """Return the execution plan for this query.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
    ...     query = [100, 100]
    ...     plan = await table.query().nearest_to(query).explain_plan(True)
    ...     print(plan)
    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
      GlobalLimitExec: skip=0, fetch=10
        FilterExec: _distance@2 IS NOT NULL
          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
            KNNVectorDistance: metric=l2
              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    return await self._inner.explain_plan(verbose)

analyze_plan async

analyze_plan()

Execute the query and return the execution plan annotated with runtime metrics.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def analyze_plan(self):
    """Execute the query and return the execution plan annotated with runtime metrics.

    Returns
    -------
    plan : str
    """
    return await self._inner.analyze_plan()

__init__

__init__(inner: Query)

Construct an AsyncQuery

This method is not intended to be called directly. Instead, use the AsyncTable.query method to create a query.

Source code in lancedb/query.py
def __init__(self, inner: LanceQuery):
    """
    Construct an AsyncQuery

    This method is not intended to be called directly.  Instead, use the
    [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
    """
    super().__init__(inner)
    self._inner = inner

nearest_to

nearest_to(query_vector: Union[VEC, Tuple, List[VEC]]) -> AsyncVectorQuery

Find the nearest vectors to the given query vector.

This converts the query from a plain query to a vector query.

This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.

By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.

If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.

If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search".

For small databases, with tens of thousands of vectors or less, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.

The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.

Vector searches always have a limit. If limit has not been called then a default limit of 10 will be used.

Typically, a single vector is passed in as the query. However, you can also pass in multiple vectors. If the vector column has a multivector type, the vectors are treated as a single query. Otherwise, they are treated as multiple queries, which can be useful if you want to find the nearest vectors to several query vectors at once. This is not expected to be faster than making multiple queries concurrently; it is just a convenience method. If multiple vectors are passed in, an additional column query_index is added to the results. This column contains the index of the query vector that the result is nearest to.

Source code in lancedb/query.py
def nearest_to(
    self,
    query_vector: Union[VEC, Tuple, List[VEC]],
) -> AsyncVectorQuery:
    """
    Find the nearest vectors to the given query vector.

    This converts the query from a plain query to a vector query.

    This method will attempt to convert the input to the query vector
    expected by the embedding model.  If the input cannot be converted
    then an error will be thrown.

    By default, there is no embedding model, and the input should be
    something that can be converted to a pyarrow array of floats.  This
    includes lists, numpy arrays, and tuples.

    If there is only one vector column (a column whose data type is a
    fixed size list of floats) then the column does not need to be specified.
    If there is more than one vector column you must use
    [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify
    which column you would like to compare with.

    If no index has been created on the vector column then a vector query
    will perform a distance comparison between the query vector and every
    vector in the database and then sort the results.  This is sometimes
    called a "flat search"

    For small databases, with tens of thousands of vectors or less, this can
    be reasonably fast.  In larger databases you should create a vector index
    on the column.  If there is a vector index then an "approximate" nearest
    neighbor search (frequently called an ANN search) will be performed.  This
    search is much faster, but the results will be approximate.

    The query can be further parameterized using the returned builder.  There
    are various ANN search parameters that will let you fine tune your recall
    accuracy vs search latency.

    Vector searches always have a [limit][].  If `limit` has not been called then
    a default `limit` of 10 will be used.

    Typically, a single vector is passed in as the query. However, you can also
    pass in multiple vectors. When multiple vectors are passed in, if the vector
    column has a multivector type, then the vectors will be treated as a single
    query. Otherwise the vectors will be treated as multiple queries, which can be
    useful if you want to find the nearest vectors to multiple query vectors.
    This is not expected to be faster than making multiple queries concurrently;
    it is just a convenience method. If multiple vectors are passed in then
    an additional column `query_index` will be added to the results. This column
    will contain the index of the query vector that the result is nearest to.
    """
    if query_vector is None:
        raise ValueError("query_vector can not be None")

    if (
        isinstance(query_vector, (list, np.ndarray, pa.Array))
        and len(query_vector) > 0
        and isinstance(query_vector[0], (list, np.ndarray, pa.Array))
    ):
        # multiple have been passed
        query_vectors = [AsyncQuery._query_vec_to_array(v) for v in query_vector]
        new_self = self._inner.nearest_to(query_vectors[0])
        for v in query_vectors[1:]:
            new_self.add_query_vector(v)
        return AsyncVectorQuery(new_self)
    else:
        return AsyncVectorQuery(
            self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
        )
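
The "flat search" and the query_index column described above can be sketched without LanceDB. Everything here is an assumption made for illustration: `flat_search` is a hypothetical helper, squared Euclidean distance stands in for the configured metric, and the multiple-query check only loosely mirrors the one in nearest_to.

```python
from typing import Any, Dict, List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed stand-in for the real metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(
    vectors: List[Sequence[float]], query: Sequence[Any], limit: int = 10
) -> List[Dict[str, Any]]:
    """Compare each query against every stored vector and keep the nearest."""
    # A sequence of sequences is treated as several independent queries.
    multiple = len(query) > 0 and isinstance(query[0], (list, tuple))
    queries = list(query) if multiple else [query]
    results: List[Dict[str, Any]] = []
    for query_index, q in enumerate(queries):
        scored = sorted((squared_l2(v, q), row) for row, v in enumerate(vectors))
        for distance, row in scored[:limit]:
            hit: Dict[str, Any] = {"row": row, "_distance": distance}
            if multiple:
                # Record which query vector produced this hit.
                hit["query_index"] = query_index
            results.append(hit)
    return results


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
single = flat_search(vectors, [5.0, 4.0], limit=1)
multi = flat_search(vectors, [[0.0, 0.0], [5.0, 4.0]], limit=2)
print(single, multi)
```

The cost of this approach grows linearly with the number of stored vectors, which is why an index is recommended for larger tables.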

nearest_to_text

nearest_to_text(query: str | FullTextQuery, columns: Union[str, List[str], None] = None) -> AsyncFTSQuery

Find the documents that are most relevant to the given text query.

This method will perform a full text search on the table and return the most relevant documents. The relevance is determined by BM25.

The columns to search must have a native FTS index (Tantivy-based indices do not work with this method).

By default, all indexed columns are searched. Currently, only one column can be searched at a time.

Parameters:

  • query (str | FullTextQuery) –

    The text query to search for.

  • columns (Union[str, List[str], None], default: None ) –

    The columns to search in. If None, all indexed columns are searched. For now only one column can be searched at a time.

Source code in lancedb/query.py
def nearest_to_text(
    self, query: str | FullTextQuery, columns: Union[str, List[str], None] = None
) -> AsyncFTSQuery:
    """
    Find the documents that are most relevant to the given text query.

    This method will perform a full text search on the table and return
    the most relevant documents.  The relevance is determined by BM25.

    The columns to search must have a native FTS index
    (Tantivy-based indices do not work with this method).

    By default, all indexed columns are searched.
    Currently, only one column can be searched at a time.

    Parameters
    ----------
    query: str
        The text query to search for.
    columns: str or list of str, default None
        The columns to search in. If None, all indexed columns are searched.
        For now only one column can be searched at a time.
    """
    if isinstance(columns, str):
        columns = [columns]
    if columns is None:
        columns = []

    if isinstance(query, str):
        return AsyncFTSQuery(
            self._inner.nearest_to_text({"query": query, "columns": columns})
        )
    # FullTextQuery object
    return AsyncFTSQuery(self._inner.nearest_to_text({"query": query.to_dict()}))
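
For readers unfamiliar with BM25, a minimal scorer is sketched below. It is a textbook formulation, not LanceDB's implementation: whitespace tokenization, the parameters `k1=1.2` and `b=0.75`, and the function name `bm25_scores` are all assumptions.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    docs: List[str], query: str, k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Long documents are penalised through the length normalisation.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the quick brown fox",
    "the lazy dog sleeps all day long",
    "quick quick fox",
]
scores = bm25_scores(docs, "quick fox")
print(scores)
```

The third document ranks highest because it repeats a query term and is shorter than average, while the second document contains no query term and scores zero.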

lancedb.query.AsyncVectorQuery

Bases: AsyncQueryBase, AsyncVectorQueryBase

Source code in lancedb/query.py
class AsyncVectorQuery(AsyncQueryBase, AsyncVectorQueryBase):
    def __init__(self, inner: LanceVectorQuery):
        """
        Construct an AsyncVectorQuery

        This method is not intended to be called directly.  Instead, create
        a query first with [AsyncTable.query][lancedb.table.AsyncTable.query] and then
        use [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to] to convert to
        a vector query.  Or you can use
        [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]
        """
        super().__init__(inner)
        self._inner = inner
        self._reranker = None
        self._query_string = None

    def rerank(
        self, reranker: Reranker = RRFReranker(), query_string: Optional[str] = None
    ) -> AsyncHybridQuery:
        if reranker and not isinstance(reranker, Reranker):
            raise ValueError("reranker must be an instance of Reranker class.")

        self._reranker = reranker

        if not self._query_string and not query_string:
            raise ValueError("query_string must be provided to rerank the results.")

        self._query_string = query_string

        return self

    def nearest_to_text(
        self, query: str | FullTextQuery, columns: Union[str, List[str], None] = None
    ) -> AsyncHybridQuery:
        """
        Find the documents that are most relevant to the given text query,
        in addition to vector search.

        This converts the vector query into a hybrid query.

        This search will perform a full text search on the table and return
        the most relevant documents, combined with the vector query results.
        The text relevance is determined by BM25.

        The columns to search must have a native FTS index
        (Tantivy-based indices do not work with this method).

        By default, all indexed columns are searched.
        Currently, only one column can be searched at a time.

        Parameters
        ----------
        query: str
            The text query to search for.
        columns: str or list of str, default None
            The columns to search in. If None, all indexed columns are searched.
            For now only one column can be searched at a time.
        """
        if isinstance(columns, str):
            columns = [columns]
        if columns is None:
            columns = []

        if isinstance(query, str):
            return AsyncHybridQuery(
                self._inner.nearest_to_text({"query": query, "columns": columns})
            )
        # FullTextQuery object
        return AsyncHybridQuery(self._inner.nearest_to_text({"query": query.to_dict()}))

    async def to_batches(
        self,
        *,
        max_batch_length: Optional[int] = None,
        timeout: Optional[timedelta] = None,
    ) -> AsyncRecordBatchReader:
        reader = await super().to_batches(timeout=timeout)
        results = pa.Table.from_batches(await reader.read_all(), reader.schema)
        if self._reranker:
            results = self._reranker.rerank_vector(self._query_string, results)
        return AsyncRecordBatchReader(results, max_batch_length=max_batch_length)

column

column(column: str) -> Self

Set the vector column to query

This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.

This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.

Source code in lancedb/query.py
def column(self, column: str) -> Self:
    """
    Set the vector column to query

    This controls which column is compared to the query vector supplied in
    the call to [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to].

    This parameter must be specified if the table has more than one column
    whose data type is a fixed-size-list of floats.
    """
    self._inner.column(column)
    return self

nprobes

nprobes(nprobes: int) -> Self

Set the number of partitions to search (probe)

This argument is only used when the vector column has an IVF-based index. If there is no index then this value is ignored.

The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.

The partitions whose centroids are closest to the query vector will be exhaustively searched to find matches. This parameter controls how many partitions should be searched.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.

Source code in lancedb/query.py
def nprobes(self, nprobes: int) -> Self:
    """
    Set the number of partitions to search (probe)

    This argument is only used when the vector column has an IVF-based index.
    If there is no index then this value is ignored.

    The IVF stage of IVF PQ divides the input into partitions (clusters) of
    related values.

    The partitions whose centroids are closest to the query vector will be
    exhaustively searched to find matches.  This parameter controls how many
    partitions should be searched.

    Increasing this value will increase the recall of your query but will
    also increase the latency of your query.  The default value is 20.  This
    default is good for many cases but the best value to use will depend on
    your data and the recall that you need to achieve.

    For best results we recommend tuning this parameter with a benchmark against
    your actual data to find the smallest possible value that will still give
    you the desired recall.
    """
    self._inner.nprobes(nprobes)
    return self
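
The recall versus latency trade-off of nprobes can be seen in a toy IVF model. This is a sketch under stated assumptions, not the index implementation: the partitions and centroids are hand-picked, `ivf_search` is hypothetical, and squared Euclidean distance is used for simplicity.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Partition = Tuple[Vector, List[Tuple[int, Vector]]]  # (centroid, [(row id, vector)])


def squared_l2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_search(
    partitions: List[Partition], query: Vector, nprobes: int, limit: int
) -> List[int]:
    """Search only the `nprobes` partitions whose centroids are nearest."""
    ranked = sorted(partitions, key=lambda part: squared_l2(part[0], query))
    # Every vector in a probed partition is compared against the query.
    candidates = [item for _, members in ranked[:nprobes] for item in members]
    candidates.sort(key=lambda item: squared_l2(item[1], query))
    return [row_id for row_id, _ in candidates[:limit]]


partitions: List[Partition] = [
    ([0.0, 0.0], [(0, [0.0, 0.0]), (1, [-1.0, 0.0])]),
    ([4.0, 0.0], [(2, [2.1, 0.0]), (3, [5.0, 0.0])]),
]
query = [1.9, 0.0]
one_probe = ivf_search(partitions, query, nprobes=1, limit=1)
two_probes = ivf_search(partitions, query, nprobes=2, limit=1)
print(one_probe, two_probes)
```

The true nearest neighbour (row 2) sits just across the partition boundary, so probing a single partition misses it. Probing both partitions finds it at the cost of comparing twice as many vectors.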

distance_range

distance_range(lower_bound: Optional[float] = None, upper_bound: Optional[float] = None) -> Self

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.

Parameters:

  • lower_bound (Optional[float], default: None ) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default: None ) –

    The upper bound of the distance range.

Returns:

  • AsyncVectorQuery –

    The AsyncVectorQuery object.

Source code in lancedb/query.py
def distance_range(
    self, lower_bound: Optional[float] = None, upper_bound: Optional[float] = None
) -> Self:
    """Set the distance range to use.

    Only rows with distances within range [lower_bound, upper_bound)
    will be returned.

    Parameters
    ----------
    lower_bound: Optional[float]
        The lower bound of the distance range.
    upper_bound: Optional[float]
        The upper bound of the distance range.

    Returns
    -------
    AsyncVectorQuery
        The AsyncVectorQuery object.
    """
    self._inner.distance_range(lower_bound, upper_bound)
    return self
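
The half-open range [lower_bound, upper_bound) can be expressed as a small predicate. This is an illustrative sketch only: `in_distance_range` is a hypothetical helper, and treating a bound of None as "unbounded on that side" is an assumption based on the optional parameters.

```python
from typing import Optional


def in_distance_range(
    distance: float,
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> bool:
    """Return True if lower_bound <= distance < upper_bound."""
    if lower_bound is not None and distance < lower_bound:
        return False
    # The upper bound is exclusive, so a distance equal to it is rejected.
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True


distances = [0.0, 0.5, 1.0, 1.5, 2.0]
kept = [d for d in distances if in_distance_range(d, 0.5, 1.5)]
print(kept)
```

A distance equal to the lower bound is kept, while a distance equal to the upper bound is dropped.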

ef

ef(ef: int) -> Self

Set the number of candidates to consider during search

This argument is only used when the vector column has an HNSW index. If there is no index then this value is ignored.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 1.5 * limit. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

Source code in lancedb/query.py
def ef(self, ef: int) -> Self:
    """
    Set the number of candidates to consider during search

    This argument is only used when the vector column has an HNSW index.
    If there is no index then this value is ignored.

    Increasing this value will increase the recall of your query but will also
    increase the latency of your query.  The default value is 1.5 * limit.  This
    default is good for many cases but the best value to use will depend on your
    data and the recall that you need to achieve.
    """
    self._inner.ef(ef)
    return self

refine_factor

refine_factor(refine_factor: int) -> Self

A multiplier to control how many additional rows are taken during the refine step

This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.

An IVF PQ index stores compressed (quantized) values. The query vector is compared against these values and, since they are compressed, the comparison is inaccurate.

This parameter can be used to refine the results. It can both improve recall and correct the ordering of the nearest results.

To refine results, LanceDB will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDB then fetches the full, uncompressed values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.

Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1, LanceDB still needs to fetch the full, uncompressed values so that it can potentially reorder the results.

Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different than the true distance between the query vector and the actual uncompressed vector.

Source code in lancedb/query.py
def refine_factor(self, refine_factor: int) -> Self:
    """
    A multiplier to control how many additional rows are taken during the refine
    step

    This argument is only used when the vector column has an IVF PQ index.
    If there is no index then this value is ignored.

    An IVF PQ index stores compressed (quantized) values.  The query vector is
    compared against these values and, since they are compressed, the comparison is
    inaccurate.

    This parameter can be used to refine the results.  It can both improve
    recall and correct the ordering of the nearest results.

    To refine results LanceDb will first perform an ANN search to find the nearest
    `limit` * `refine_factor` results.  In other words, if `refine_factor` is 3 and
    `limit` is the default (10) then the first 30 results will be selected.  LanceDb
    then fetches the full, uncompressed, values for these 30 results.  The results
    are then reordered by the true distance and only the nearest 10 are kept.

    Note: there is a difference between calling this method with a value of 1 and
    never calling this method at all.  Calling this method with any value will have
    an impact on your search latency.  When you call this method with a
    `refine_factor` of 1 then LanceDb still needs to fetch the full, uncompressed,
    values so that it can potentially reorder the results.

    Note: if this method is NOT called then the distances returned in the _distance
    column will be approximate distances based on the comparison of the quantized
    query vector and the quantized result vectors.  This can be considerably
    different than the true distance between the query vector and the actual
    uncompressed vector.
    """
    self._inner.refine_factor(refine_factor)
    return self

distance_type

distance_type(distance_type: str) -> Self

Set the distance metric to use

When performing a vector search we try to find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See IvfPqOptions.distanceType for more details on the different distance metrics available.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

By default "l2" is used.
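
To see why the choice of metric matters, the sketch below ranks the same two rows under two metrics. The formulas are the textbook definitions of Euclidean and cosine distance, not LanceDB's implementation, and the values LanceDB reports in `_distance` may be scaled differently. The metric name `"cosine"` is an assumption (only `"l2"` is named in this section), and `nearest`, `rows`, and `query` are hypothetical names.

```python
import math


def l2_distance(a, b):
    """Textbook Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Textbook cosine distance: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Metric names are assumed for this illustration.
METRICS = {"l2": l2_distance, "cosine": cosine_distance}


def nearest(rows, query, distance_type="l2"):
    """Return the name of the row closest to `query` under the given metric."""
    metric = METRICS[distance_type]
    return min(rows, key=lambda name: metric(rows[name], query))


rows = {"A": [1.0, 0.0], "B": [3.0, 3.0]}
query = [1.0, 1.0]

nearest_l2 = nearest(rows, query, "l2")          # A is closer in space
nearest_cosine = nearest(rows, query, "cosine")  # B points the same way
```

Because the two metrics disagree about which row is nearest, an index trained with one metric cannot be expected to return sensible results when queried with another.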

Source code in lancedb/query.py
def distance_type(self, distance_type: str) -> Self:
    """
    Set the distance metric to use

    When performing a vector search we try to find the "nearest" vectors according
    to some kind of distance metric.  This parameter controls which distance metric
    to use.  See IvfPqOptions.distanceType for more details on the different
    distance metrics available.

    Note: if there is a vector index then the distance type used MUST match the
    distance type used to train the vector index.  If this is not done then the
    results will be invalid.

    By default "l2" is used.
    """
    self._inner.distance_type(distance_type)
    return self

bypass_vector_index

bypass_vector_index() -> Self

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.
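
As a minimal sketch of the recall calculation mentioned above: once you have the row ids from a flat search (ground truth) and from an indexed search, recall is the fraction of ground-truth ids that the indexed search also returned. The helper name `recall_at_k` and the ids are hypothetical.

```python
def recall_at_k(ground_truth_ids, ann_ids):
    """Fraction of ground-truth results that the ANN search also returned."""
    truth = set(ground_truth_ids)
    if not truth:
        raise ValueError("ground truth must not be empty")
    return len(truth & set(ann_ids)) / len(truth)


# Hypothetical ids: one list from a query that bypassed the index,
# one from the same query run against the index.
flat_ids = [7, 3, 9, 1]
ann_ids = [7, 9, 4, 2]

recall = recall_at_k(flat_ids, ann_ids)
```

Repeating this for several values of nprobes shows where additional probes stop improving recall.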

Source code in lancedb/query.py
def bypass_vector_index(self) -> Self:
    """
    If this is called then any vector index is skipped

    An exhaustive (flat) search will be performed.  The query vector will
    be compared to every vector in the table.  At high scales this can be
    expensive.  However, this is often still useful.  For example, skipping
    the vector index can give you ground truth results which you can use to
    calculate your recall to select an appropriate value for nprobes.
    """
    self._inner.bypass_vector_index()
    return self

where

where(predicate: str) -> Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).
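
When a predicate is assembled from program values, it helps to build the string carefully. The sketch below assumes the SQL dialect follows the standard convention of doubling single quotes inside string literals; `sql_string_literal` and `all_of` are hypothetical helpers, not part of LanceDB.

```python
def sql_string_literal(value: str) -> str:
    """Quote a Python string as an SQL string literal (standard quote doubling)."""
    return "'" + value.replace("'", "''") + "'"


def all_of(*predicates: str) -> str:
    """Combine predicates with AND, parenthesising each one."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    return " AND ".join(f"({p})" for p in predicates)


predicate = all_of("x > 10", "y = " + sql_string_literal("it's a test"))
```

The resulting string can then be passed to `where` like any hand-written predicate.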

Source code in lancedb/query.py
def where(self, predicate: str) -> Self:
    """
    Only return rows matching the given predicate

    The predicate should be supplied as an SQL query string.

    Examples
    --------

    >>> predicate = "x > 10"
    >>> predicate = "y > 0 AND y < 100"
    >>> predicate = "x > 5 OR y = 'test'"

    Filtering performance can often be improved by creating a scalar index
    on the filter column(s).
    """
    self._inner.where(predicate)
    return self

select

select(columns: Union[List[str], dict[str, str]]) -> Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.
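
The correspondence between the two input forms and an SQL SELECT clause can be sketched as follows. `equivalent_select_clause` is a hypothetical helper written only to illustrate the mapping; it mirrors the type checks in the source below but does not call LanceDB.

```python
def equivalent_select_clause(columns):
    """Render the SQL SELECT clause that a `select` argument corresponds to."""
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        # A list of names selects existing columns as-is.
        return "SELECT " + ", ".join(columns)
    if isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        # A dict maps output name -> SQL expression.
        parts = [
            expr if expr == name else f"{expr} AS {name}"
            for name, expr in columns.items()
        ]
        return "SELECT " + ", ".join(parts)
    raise TypeError("columns must be a list of column names or a dict")


from_list = equivalent_select_clause(["a", "b"])
from_dict = equivalent_select_clause({"combined": "a + b", "c": "c"})
```

Note that a bare string such as `"a"` is rejected, which matches the `TypeError` raised by `select` itself.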

Source code in lancedb/query.py
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
    """
    Return only the specified columns.

    By default a query will return all columns from the table.  However, this can
    have a very significant impact on latency.  LanceDb stores data in a columnar
    fashion.  This
    means we can finely tune our I/O to select exactly the columns we need.

    As a best practice you should always limit queries to the columns that you need.
    If you pass in a list of column names then only those columns will be
    returned.

    You can also use this method to create new "dynamic" columns based on your
    existing columns. For example, you may not care about "a" or "b" but instead
    simply want "a + b".  This is often seen in the SELECT clause of an SQL query
    (e.g. `SELECT a+b FROM my_table`).

    To create dynamic columns you can pass in a dict[str, str].  A column will be
    returned for each entry in the map.  The key provides the name of the column.
    The value is an SQL string used to specify how the column is calculated.

    For example, an SQL query might state `SELECT a + b AS combined, c`.  The
    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.

    Columns will always be returned in the order given, even if that order is
    different than the order used when adding the data.
    """
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        self._inner.select_columns(columns)
    elif isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        self._inner.select(list(columns.items()))
    else:
        raise TypeError("columns must be a list of column names or a dict")
    return self

limit

limit(limit: int) -> Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code in lancedb/query.py
def limit(self, limit: int) -> Self:
    """
    Set the maximum number of results to return.

    By default, a plain search has no limit.  If this method is not
    called then every valid row from the table will be returned.
    """
    self._inner.limit(limit)
    return self

offset

offset(offset: int) -> Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.
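
Together, limit and offset are commonly used for pagination. The sketch below assumes the usual semantics, where offset rows are skipped before limit is applied; `page_window` and `apply_window` are hypothetical helpers that operate on a plain list rather than a table.

```python
def page_window(page: int, page_size: int):
    """Translate a zero-based page number into an (offset, limit) pair."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return page * page_size, page_size


def apply_window(rows, offset, limit):
    """Emulate offset/limit on an in-memory list of rows."""
    return rows[offset : offset + limit]


rows = list(range(25))
offset, limit = page_window(page=2, page_size=10)
last_page = apply_window(rows, offset, limit)
```

On a query builder the same window would be expressed by chaining `limit(limit)` and `offset(offset)`. The final page may contain fewer than `limit` rows, as it does here.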

Source code in lancedb/query.py
def offset(self, offset: int) -> Self:
    """
    Set the offset for the results.

    Parameters
    ----------
    offset: int
        The offset to start fetching results from.
    """
    self._inner.offset(offset)
    return self

fast_search

fast_search() -> Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code in lancedb/query.py
def fast_search(self) -> Self:
    """
    Skip searching un-indexed data.

    This can make queries faster, but will miss any data that has not been
    indexed.

    !!! tip
        You can add new data into an existing index by calling
        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].
    """
    self._inner.fast_search()
    return self

with_row_id

with_row_id() -> Self

Include the _rowid column in the results.

Source code in lancedb/query.py
def with_row_id(self) -> Self:
    """
    Include the _rowid column in the results.
    """
    self._inner.with_row_id()
    return self

postfilter

postfilter() -> Self

If this is called then filtering will happen after the search instead of before.

By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.
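
The difference between the two filtering orders can be shown on a toy result set. This is a conceptual sketch, not LanceDB's implementation: the rows carry precomputed distances, and `overfetch` is a hypothetical parameter standing in for the effect of a higher refine factor.

```python
rows = [
    {"id": 0, "distance": 0.1, "color": "red"},
    {"id": 1, "distance": 0.2, "color": "red"},
    {"id": 2, "distance": 0.3, "color": "blue"},
    {"id": 3, "distance": 0.4, "color": "red"},
    {"id": 4, "distance": 0.5, "color": "blue"},
    {"id": 5, "distance": 0.6, "color": "blue"},
]


def is_blue(row):
    return row["color"] == "blue"


def prefilter_search(rows, predicate, limit):
    """Filter first, then keep the nearest `limit` matching rows."""
    matching = [r for r in rows if predicate(r)]
    nearest = sorted(matching, key=lambda r: r["distance"])[:limit]
    return [r["id"] for r in nearest]


def postfilter_search(rows, predicate, limit, overfetch=1):
    """Take the nearest limit * overfetch rows, then filter them."""
    nearest = sorted(rows, key=lambda r: r["distance"])[: limit * overfetch]
    return [r["id"] for r in nearest if predicate(r)][:limit]


pre = prefilter_search(rows, is_blue, limit=3)
post = postfilter_search(rows, is_blue, limit=3)
post_wide = postfilter_search(rows, is_blue, limit=3, overfetch=2)
```

Here prefiltering returns three blue rows, plain postfiltering returns only one because two of the three nearest rows are red, and widening the candidate set restores the missing results.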

Source code in lancedb/query.py
def postfilter(self) -> Self:
    """
    If this is called then filtering will happen after the search instead of
    before.
    By default filtering will be performed before the search.  This is how
    filtering is typically understood to work.  This prefilter step does add some
    additional latency.  Creating a scalar index on the filter column(s) can
    often improve this latency.  However, sometimes a filter is too complex or
    scalar indices cannot be applied to the column.  In these cases postfiltering
    can be used instead of prefiltering to improve latency.
    Post filtering applies the filter to the results of the search.  This
    means we only run the filter on a much smaller set of data.  However, it can
    cause the query to return fewer than `limit` results (or even no results) if
    none of the nearest results match the filter.
    Post filtering happens during the "refine stage" (described in more detail in
    refine_factor).  This means that setting a higher refine
    factor can often help restore some of the results lost by post filtering.
    """
    self._inner.postfilter()
    return self

to_arrow async

to_arrow(timeout: Optional[timedelta] = None) -> Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_arrow(self, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and collect the results into an Apache Arrow Table.

    This method will collect all results into memory before returning.  If
    you expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches].

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    batch_iter = await self.to_batches(timeout=timeout)
    return pa.Table.from_batches(
        await batch_iter.read_all(), schema=batch_iter.schema
    )

to_list async

to_list(timeout: Optional[timedelta] = None) -> List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_list(self, timeout: Optional[timedelta] = None) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (await self.to_arrow(timeout=timeout)).to_pylist()

to_pandas async

to_pandas(flatten: Optional[Union[int, bool]] = None, timeout: Optional[timedelta] = None) -> 'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int, bool]], default: None ) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
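
The three modes of the flatten parameter can be illustrated on a single nested row. This sketch works on plain dictionaries rather than Arrow tables; the dotted `parent.child` naming and the rejection of non-positive depths are assumptions of the illustration, and `flatten_once` and `flatten_row` are hypothetical helpers.

```python
def flatten_once(row):
    """Expand one level of nested dicts into dotted column names."""
    out = {}
    for key, value in row.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                out[f"{key}.{sub_key}"] = sub_value
        else:
            out[key] = value
    return out


def flatten_row(row, flatten=None):
    """Emulate flatten=None / True / int on a single row."""
    row = dict(row)
    if flatten is None or flatten is False:
        return row  # leave nested columns untouched
    if flatten is True:
        # Keep flattening until no nested values remain.
        while any(isinstance(v, dict) for v in row.values()):
            row = flatten_once(row)
        return row
    if flatten <= 0:
        raise ValueError("flatten depth must be a positive integer")
    for _ in range(flatten):
        row = flatten_once(row)
    return row


row = {"id": 1, "meta": {"lang": "en", "geo": {"lat": 1.5, "lon": 2.5}}}

unflattened = flatten_row(row)
one_level = flatten_row(row, flatten=1)
fully_flat = flatten_row(row, flatten=True)
```

With a depth of 1 the `meta.geo` column is still nested, while `flatten=True` expands it into `meta.geo.lat` and `meta.geo.lon`.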

Source code in lancedb/query.py
async def to_pandas(
    self,
    flatten: Optional[Union[int, bool]] = None,
    timeout: Optional[timedelta] = None,
) -> "pd.DataFrame":
    """
    Execute the query and collect the results into a pandas DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    pandas separately.

    Examples
    --------

    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = batch.to_pandas()
    >>> asyncio.run(doctest_example())

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (
        flatten_columns(await self.to_arrow(timeout=timeout), flatten)
    ).to_pandas()

to_polars async

to_polars(timeout: Optional[timedelta] = None) -> 'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())

Source code in lancedb/query.py
async def to_polars(
    self,
    timeout: Optional[timedelta] = None,
) -> "pl.DataFrame":
    """
    Execute the query and collect the results into a Polars DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    polars separately.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.

    Examples
    --------

    >>> import asyncio
    >>> import polars as pl
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = pl.from_arrow(batch)
    >>> asyncio.run(doctest_example())
    """
    import polars as pl

    return pl.from_arrow(await self.to_arrow(timeout=timeout))

explain_plan async

explain_plan(verbose: Optional[bool] = False)

Return the execution plan for this query.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default: False ) –

    Use a verbose output format.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def explain_plan(self, verbose: Optional[bool] = False):
    """Return the execution plan for this query.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
    ...     query = [100, 100]
    ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
    ...     print(plan)
    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
      GlobalLimitExec: skip=0, fetch=10
        FilterExec: _distance@2 IS NOT NULL
          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
            KNNVectorDistance: metric=l2
              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    return await self._inner.explain_plan(verbose)

analyze_plan async

analyze_plan()

Execute the query and display with runtime metrics.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def analyze_plan(self):
    """Execute the query and display with runtime metrics.

    Returns
    -------
    plan : str
    """
    return await self._inner.analyze_plan()

__init__

__init__(inner: VectorQuery)

Construct an AsyncVectorQuery

This method is not intended to be called directly. Instead, create a query first with AsyncTable.query and then use AsyncQuery.nearest_to to convert to a vector query. Or you can use AsyncTable.vector_search.

Source code in lancedb/query.py
def __init__(self, inner: LanceVectorQuery):
    """
    Construct an AsyncVectorQuery

    This method is not intended to be called directly.  Instead, create
    a query first with [AsyncTable.query][lancedb.table.AsyncTable.query] and then
    use [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to] to convert to
    a vector query.  Or you can use
    [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search].
    """
    super().__init__(inner)
    self._inner = inner
    self._reranker = None
    self._query_string = None

nearest_to_text

nearest_to_text(query: str | FullTextQuery, columns: Union[str, List[str], None] = None) -> AsyncHybridQuery

Find the documents that are most relevant to the given text query, in addition to vector search.

This converts the vector query into a hybrid query.

This search will perform a full text search on the table and return the most relevant documents, combined with the vector query results. The text relevance is determined by BM25.

The columns to search must have a native FTS index (Tantivy-based indexes do not work with this method).

By default, all indexed columns are searched. For now, only one column can be searched at a time.

Parameters:

  • query (str | FullTextQuery) –

    The text query to search for.

  • columns (Union[str, List[str], None], default: None ) –

    The columns to search in. If None, all indexed columns are searched. For now only one column can be searched at a time.
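
One common way to combine a vector ranking with a text ranking is reciprocal rank fusion (RRF). The sketch below is a generic RRF implementation, not LanceDB's code. `RRFReranker` does appear as a default reranker elsewhere in this reference, but whether hybrid queries use it by default, and with which constant, is an assumption. The constant `k=60` is the conventional choice, and the document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse several ranked id lists; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]  # hypothetical ids ranked by vector distance
text_hits = ["c", "a", "d"]    # hypothetical ids ranked by BM25 relevance

fused = reciprocal_rank_fusion(vector_hits, text_hits)
```

Documents that appear in both rankings ("a" and "c") rise above those found by only one of the searches.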

Source code in lancedb/query.py
def nearest_to_text(
    self, query: str | FullTextQuery, columns: Union[str, List[str], None] = None
) -> AsyncHybridQuery:
    """
    Find the documents that are most relevant to the given text query,
    in addition to vector search.

    This converts the vector query into a hybrid query.

    This search will perform a full text search on the table and return
    the most relevant documents, combined with the vector query results.
    The text relevance is determined by BM25.

    The columns to search must have a native FTS index
    (Tantivy-based indexes do not work with this method).

    By default, all indexed columns are searched.
    For now, only one column can be searched at a time.

    Parameters
    ----------
    query: str | FullTextQuery
        The text query to search for.
    columns: str or list of str, default None
        The columns to search in. If None, all indexed columns are searched.
        For now only one column can be searched at a time.
    """
    if isinstance(columns, str):
        columns = [columns]
    if columns is None:
        columns = []

    if isinstance(query, str):
        return AsyncHybridQuery(
            self._inner.nearest_to_text({"query": query, "columns": columns})
        )
    # FullTextQuery object
    return AsyncHybridQuery(self._inner.nearest_to_text({"query": query.to_dict()}))

lancedb.query.AsyncFTSQuery

Bases: AsyncQueryBase

A query for full text search for LanceDB.

Source code in lancedb/query.py
class AsyncFTSQuery(AsyncQueryBase):
    """A query for full text search for LanceDB."""

    def __init__(self, inner: LanceFTSQuery):
        super().__init__(inner)
        self._inner = inner
        self._reranker = None

    def get_query(self) -> str:
        return self._inner.get_query()

    def rerank(
        self,
        reranker: Reranker = RRFReranker(),
    ) -> AsyncFTSQuery:
        if reranker and not isinstance(reranker, Reranker):
            raise ValueError("reranker must be an instance of Reranker class.")

        self._reranker = reranker

        return self

    def nearest_to(
        self,
        query_vector: Union[VEC, Tuple, List[VEC]],
    ) -> AsyncHybridQuery:
        """
        In addition to doing text search on the LanceDB Table, also
        find the nearest vectors to the given query vector.

        This converts the query from a FTS Query to a Hybrid query. Results
        from the vector search will be combined with results from the FTS query.

        This method will attempt to convert the input to the query vector
        expected by the embedding model.  If the input cannot be converted
        then an error will be thrown.

        By default, there is no embedding model, and the input should be
        something that can be converted to a pyarrow array of floats.  This
        includes lists, numpy arrays, and tuples.

        If there is only one vector column (a column whose data type is a
        fixed size list of floats) then the column does not need to be specified.
        If there is more than one vector column you must use
        [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify
        which column you would like to compare with.

        If no index has been created on the vector column then a vector query
        will perform a distance comparison between the query vector and every
        vector in the database and then sort the results.  This is sometimes
        called a "flat search".

        For small databases, with tens of thousands of vectors or less, this can
        be reasonably fast.  In larger databases you should create a vector index
        on the column.  If there is a vector index then an "approximate" nearest
        neighbor search (frequently called an ANN search) will be performed.  This
        search is much faster, but the results will be approximate.

        The query can be further parameterized using the returned builder.  There
        are various ANN search parameters that will let you fine tune your recall
        accuracy vs search latency.

        Hybrid searches always have a [limit][].  If `limit` has not been called then
        a default `limit` of 10 will be used.

        Typically, a single vector is passed in as the query. However, you can also
        pass in multiple vectors.  This can be useful if you want to find the nearest
        vectors to multiple query vectors. This is not expected to be faster than
        making multiple queries concurrently; it is just a convenience method.
        If multiple vectors are passed in then an additional column `query_index`
        will be added to the results.  This column will contain the index of the
        query vector that the result is nearest to.
        """
        if query_vector is None:
            raise ValueError("query_vector can not be None")

        if (
            isinstance(query_vector, list)
            and len(query_vector) > 0
            and not isinstance(query_vector[0], (float, int))
        ):
            # multiple have been passed
            query_vectors = [AsyncQuery._query_vec_to_array(v) for v in query_vector]
            new_self = self._inner.nearest_to(query_vectors[0])
            for v in query_vectors[1:]:
                new_self.add_query_vector(v)
            return AsyncHybridQuery(new_self)
        else:
            return AsyncHybridQuery(
                self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
            )

    async def to_batches(
        self,
        *,
        max_batch_length: Optional[int] = None,
        timeout: Optional[timedelta] = None,
    ) -> AsyncRecordBatchReader:
        reader = await super().to_batches(timeout=timeout)
        results = pa.Table.from_batches(await reader.read_all(), reader.schema)
        if self._reranker:
            results = self._reranker.rerank_fts(self.get_query(), results)
        return AsyncRecordBatchReader(results, max_batch_length=max_batch_length)
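
The check that `nearest_to` uses above to tell a single query vector from several can be isolated into a small function. `is_multi_vector` is a hypothetical helper that reproduces only the branching condition from the source; it does not convert or validate the vectors.

```python
def is_multi_vector(query_vector) -> bool:
    """Return True when the input is treated as several query vectors."""
    if query_vector is None:
        raise ValueError("query_vector can not be None")
    # A non-empty list whose first element is not a plain number is
    # interpreted as a list of vectors.
    return (
        isinstance(query_vector, list)
        and len(query_vector) > 0
        and not isinstance(query_vector[0], (float, int))
    )


single = is_multi_vector([0.1, 0.2])
single_tuple = is_multi_vector((0.1, 0.2))
multiple = is_multi_vector([[0.1, 0.2], [0.3, 0.4]])
```

Because the outer container must be a list, a tuple of vectors is not detected as multiple queries by this condition; wrap the vectors in a list when passing several.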

where

where(predicate: str) -> Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code in lancedb/query.py
def where(self, predicate: str) -> Self:
    """
    Only return rows matching the given predicate

    The predicate should be supplied as an SQL query string.

    Examples
    --------

    >>> predicate = "x > 10"
    >>> predicate = "y > 0 AND y < 100"
    >>> predicate = "x > 5 OR y = 'test'"

    Filtering performance can often be improved by creating a scalar index
    on the filter column(s).
    """
    self._inner.where(predicate)
    return self

select

select(columns: Union[List[str], dict[str, str]]) -> Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.
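The mapping from a list or dict to a SELECT clause can be sketched as follows. The helper `build_select_clause` is hypothetical, and sqlite3 is only a stand-in used to show the resulting column names and values. It is not how LanceDB evaluates the expressions.

```python
import sqlite3
from typing import Dict, List, Union


def build_select_clause(columns: Union[List[str], Dict[str, str]]) -> str:
    """Turn a list of names or a {name: expression} dict into a SELECT clause."""
    if isinstance(columns, list):
        return ", ".join(columns)
    if isinstance(columns, dict):
        # Each entry becomes "<expression> AS <name>", in the order given.
        return ", ".join(f"{expr} AS {name}" for name, expr in columns.items())
    raise TypeError("columns must be a list of column names or a dict")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c TEXT)")
conn.execute("INSERT INTO my_table VALUES (1, 2, 'x')")

clause = build_select_clause({"combined": "a + b", "c": "c"})
cursor = conn.execute(f"SELECT {clause} FROM my_table")
select_column_names = [d[0] for d in cursor.description]  # ['combined', 'c']
select_rows = cursor.fetchall()                           # [(3, 'x')]
```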

Source code in lancedb/query.py
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
    """
    Return only the specified columns.

    By default a query will return all columns from the table.  However, this can
    have a very significant impact on latency.  LanceDb stores data in a columnar
    fashion.  This
    means we can finely tune our I/O to select exactly the columns we need.

    As a best practice you should always limit queries to the columns that you need.
    If you pass in a list of column names then only those columns will be
    returned.

    You can also use this method to create new "dynamic" columns based on your
    existing columns. For example, you may not care about "a" or "b" but instead
    simply want "a + b".  This is often seen in the SELECT clause of an SQL query
    (e.g. `SELECT a+b FROM my_table`).

    To create dynamic columns you can pass in a dict[str, str].  A column will be
    returned for each entry in the map.  The key provides the name of the column.
    The value is an SQL string used to specify how the column is calculated.

    For example, an SQL query might state `SELECT a + b AS combined, c`.  The
    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.

    Columns will always be returned in the order given, even if that order is
    different than the order used when adding the data.
    """
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        self._inner.select_columns(columns)
    elif isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        self._inner.select(list(columns.items()))
    else:
        raise TypeError("columns must be a list of column names or a dict")
    return self

limit

limit(limit: int) -> Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code in lancedb/query.py
def limit(self, limit: int) -> Self:
    """
    Set the maximum number of results to return.

    By default, a plain search has no limit.  If this method is not
    called then every valid row from the table will be returned.
    """
    self._inner.limit(limit)
    return self

offset

offset(offset: int) -> Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.
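Combined with limit, an offset is commonly used for pagination. The sketch below models the two settings as a slice over an already ordered result list. The helper `apply_offset_and_limit` is hypothetical and only illustrates the intended semantics, not the library's implementation.

```python
from typing import List, Optional, Sequence


def apply_offset_and_limit(
    rows: Sequence[int], offset: int = 0, limit: Optional[int] = None
) -> List[int]:
    """Skip `offset` rows, then keep at most `limit` rows (all rows if None)."""
    end = None if limit is None else offset + limit
    return list(rows[offset:end])


rows = list(range(10))
page_size = 4
# Page i starts at offset i * page_size.
pages = [
    apply_offset_and_limit(rows, offset=i * page_size, limit=page_size)
    for i in range(3)
]
# pages == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```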

Source code in lancedb/query.py
def offset(self, offset: int) -> Self:
    """
    Set the offset for the results.

    Parameters
    ----------
    offset: int
        The offset to start fetching results from.
    """
    self._inner.offset(offset)
    return self

fast_search

fast_search() -> Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.
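The trade-off can be pictured with a toy table whose index lags behind writes. Everything in this sketch (the `TinyTable` class, its term index, and its `optimize` method) is a hypothetical stand-in, not LanceDB code. It only shows why skipping un-indexed rows misses recent data until the index is updated.

```python
from typing import Dict, List, Set


class TinyTable:
    """Toy table: rows[:indexed_upto] are covered by the term index."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.index: Dict[str, Set[int]] = {}
        self.indexed_upto = 0

    def add(self, text: str) -> None:
        # New rows are stored but not indexed yet.
        self.rows.append(text)

    def optimize(self) -> None:
        # Fold the un-indexed tail into the index.
        for row_id in range(self.indexed_upto, len(self.rows)):
            for term in self.rows[row_id].split():
                self.index.setdefault(term, set()).add(row_id)
        self.indexed_upto = len(self.rows)

    def search(self, term: str, fast_search: bool = False) -> List[int]:
        hits = set(self.index.get(term, set()))
        if not fast_search:
            # Default behaviour: also scan rows the index does not cover.
            for row_id in range(self.indexed_upto, len(self.rows)):
                if term in self.rows[row_id].split():
                    hits.add(row_id)
        return sorted(hits)


table = TinyTable()
table.add("hello world")
table.optimize()
table.add("hello again")  # not indexed yet

full_hits = table.search("hello")                    # [0, 1]
fast_hits = table.search("hello", fast_search=True)  # [0]
table.optimize()
fast_hits_after_optimize = table.search("hello", fast_search=True)  # [0, 1]
```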

Source code in lancedb/query.py
def fast_search(self) -> Self:
    """
    Skip searching un-indexed data.

    This can make queries faster, but will miss any data that has not been
    indexed.

    !!! tip
        You can add new data into an existing index by calling
        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].
    """
    self._inner.fast_search()
    return self

with_row_id

with_row_id() -> Self

Include the _rowid column in the results.

Source code in lancedb/query.py
def with_row_id(self) -> Self:
    """
    Include the _rowid column in the results.
    """
    self._inner.with_row_id()
    return self

postfilter

postfilter() -> Self

If this is called then filtering will happen after the search instead of before.

By default filtering is performed before the search, which is how filtering is typically understood to work. This prefilter step adds some latency. Creating a scalar index on the filter column(s) can often reduce it. However, sometimes a filter is too complex, or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Postfiltering applies the filter to the results of the search, so the filter only runs on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Postfiltering happens during the "refine stage" (described in more detail in refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by postfiltering.
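The difference between the two orderings can be seen in a small sketch. The rows, the precomputed `dist` values, and the `keep` flag standing in for a filter predicate are all made up for illustration.

```python
from typing import Dict, List

rows: List[Dict] = [
    {"id": 0, "dist": 0.1, "keep": False},
    {"id": 1, "dist": 0.2, "keep": True},
    {"id": 2, "dist": 0.3, "keep": False},
    {"id": 3, "dist": 0.4, "keep": True},
    {"id": 4, "dist": 0.5, "keep": True},
]


def nearest(candidates: List[Dict], limit: int) -> List[Dict]:
    """Return the `limit` candidates with the smallest distance."""
    return sorted(candidates, key=lambda r: r["dist"])[:limit]


limit = 3

# Prefilter: filter first, then search among the survivors.
prefiltered = nearest([r for r in rows if r["keep"]], limit)

# Postfilter: search first, then filter the (much smaller) result set.
postfiltered = [r for r in nearest(rows, limit) if r["keep"]]

prefiltered_ids = [r["id"] for r in prefiltered]    # [1, 3, 4]
postfiltered_ids = [r["id"] for r in postfiltered]  # [1]
```

In this toy data the postfiltered query returns one row instead of three, because two of the three nearest rows fail the filter.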

Source code in lancedb/query.py
def postfilter(self) -> Self:
    """
    If this is called then filtering will happen after the search instead of
    before.
    By default filtering will be performed before the search.  This is how
    filtering is typically understood to work.  This prefilter step does add some
    additional latency.  Creating a scalar index on the filter column(s) can
    often improve this latency.  However, sometimes a filter is too complex or
    scalar indices cannot be applied to the column.  In these cases postfiltering
    can be used instead of prefiltering to improve latency.
    Post filtering applies the filter to the results of the search.  This
    means we only run the filter on a much smaller set of data.  However, it can
    cause the query to return fewer than `limit` results (or even no results) if
    none of the nearest results match the filter.
    Post filtering happens during the "refine stage" (described in more detail in
    the `refine_factor` method).  This means that setting a higher refine
    factor can often help restore some of the results lost by post filtering.
    """
    self._inner.postfilter()
    return self

to_arrow async

to_arrow(timeout: Optional[timedelta] = None) -> Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches instead.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
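The memory difference between collecting everything and streaming batches can be sketched with a plain async generator. `fake_batches` is a hypothetical stand-in for a batch reader and involves no LanceDB calls.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_batches(total_rows: int, batch_size: int) -> AsyncIterator[List[int]]:
    """Hypothetical batch source: yields rows in fixed-size chunks."""
    for start in range(0, total_rows, batch_size):
        yield list(range(start, min(start + batch_size, total_rows)))


async def collect_everything(total_rows: int, batch_size: int) -> List[int]:
    # to_arrow-style: every batch is held in memory before returning.
    rows: List[int] = []
    async for batch in fake_batches(total_rows, batch_size):
        rows.extend(batch)
    return rows


async def streaming_sum(total_rows: int, batch_size: int) -> Tuple[int, int]:
    # to_batches-style: only one batch is processed at a time.
    total, largest_batch = 0, 0
    async for batch in fake_batches(total_rows, batch_size):
        total += sum(batch)
        largest_batch = max(largest_batch, len(batch))
    return total, largest_batch


all_rows = asyncio.run(collect_everything(10, 4))         # 10 rows in memory
total, largest_batch = asyncio.run(streaming_sum(10, 4))  # (45, 4)
```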

Source code in lancedb/query.py
async def to_arrow(self, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and collect the results into an Apache Arrow Table.

    This method will collect all results into memory before returning.  If
    you expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches]

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    batch_iter = await self.to_batches(timeout=timeout)
    return pa.Table.from_batches(
        await batch_iter.read_all(), schema=batch_iter.schema
    )

to_list async

to_list(timeout: Optional[timedelta] = None) -> List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_list(self, timeout: Optional[timedelta] = None) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (await self.to_arrow(timeout=timeout)).to_pylist()

to_pandas async

to_pandas(flatten: Optional[Union[int, bool]] = None, timeout: Optional[timedelta] = None) -> 'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int, bool]], default: None ) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.
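The effect of the flatten argument can be sketched on a plain nested dictionary. The helper `flatten_row` is hypothetical, and the dotted `parent.child` naming is an assumption chosen to mirror how Arrow names flattened struct fields. The real method operates on Arrow tables, not dicts.

```python
from typing import Any, Dict, Optional, Union


def flatten_row(
    row: Dict[str, Any], flatten: Optional[Union[int, bool]] = None
) -> Dict[str, Any]:
    """Flatten nested dicts: True means fully, an int means that many levels."""
    if flatten is None or flatten is False:
        return dict(row)
    # Check `is True` first because bool is a subclass of int.
    depth = None if flatten is True else int(flatten)
    current, level = dict(row), 0
    while depth is None or level < depth:
        if not any(isinstance(v, dict) for v in current.values()):
            break  # nothing left to flatten
        flattened: Dict[str, Any] = {}
        for key, value in current.items():
            if isinstance(value, dict):
                for child_key, child_value in value.items():
                    flattened[f"{key}.{child_key}"] = child_value
            else:
                flattened[key] = value
        current, level = flattened, level + 1
    return current


row = {"id": 1, "meta": {"a": 1, "inner": {"b": 2}}}
unflattened = flatten_row(row)
one_level = flatten_row(row, flatten=1)
fully_flat = flatten_row(row, flatten=True)
```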

Source code in lancedb/query.py
async def to_pandas(
    self,
    flatten: Optional[Union[int, bool]] = None,
    timeout: Optional[timedelta] = None,
) -> "pd.DataFrame":
    """
    Execute the query and collect the results into a pandas DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    pandas separately.

    Examples
    --------

    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = batch.to_pandas()
    >>> asyncio.run(doctest_example())

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (
        flatten_columns(await self.to_arrow(timeout=timeout), flatten)
    ).to_pandas()

to_polars async

to_polars(timeout: Optional[timedelta] = None) -> 'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())

Source code in lancedb/query.py
async def to_polars(
    self,
    timeout: Optional[timedelta] = None,
) -> "pl.DataFrame":
    """
    Execute the query and collect the results into a Polars DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    polars separately.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.

    Examples
    --------

    >>> import asyncio
    >>> import polars as pl
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = pl.from_arrow(batch)
    >>> asyncio.run(doctest_example())
    """
    import polars as pl

    return pl.from_arrow(await self.to_arrow(timeout=timeout))

explain_plan async

explain_plan(verbose: Optional[bool] = False)

Return the execution plan for this query.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  GlobalLimitExec: skip=0, fetch=10
    FilterExec: _distance@2 IS NOT NULL
      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
        KNNVectorDistance: metric=l2
          LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

  • verbose (bool, default: False ) –

    Use a verbose output format.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def explain_plan(self, verbose: Optional[bool] = False):
    """Return the execution plan for this query.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
    ...     query = [100, 100]
    ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
    ...     print(plan)
    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
      GlobalLimitExec: skip=0, fetch=10
        FilterExec: _distance@2 IS NOT NULL
          SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
            KNNVectorDistance: metric=l2
              LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    return await self._inner.explain_plan(verbose)

analyze_plan async

analyze_plan()

Execute the query and return the execution plan with runtime metrics.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def analyze_plan(self):
    """Execute the query and display with runtime metrics.

    Returns
    -------
    plan : str
    """
    return await self._inner.analyze_plan()

nearest_to

nearest_to(query_vector: Union[VEC, Tuple, List[VEC]]) -> AsyncHybridQuery

In addition to doing text search on the LanceDB table, also find the nearest vectors to the given query vector.

This converts the query from an FTS query to a hybrid query. Results from the vector search will be combined with results from the FTS query.

This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.

By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.

If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.

If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search".

For small databases, with tens of thousands of vectors or fewer, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.

The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.

Hybrid searches always have a limit. If limit has not been called then a default limit of 10 will be used.

Typically, a single vector is passed in as the query. However, you can also pass in multiple vectors. This can be useful if you want to find the nearest vectors to multiple query vectors. This is not expected to be faster than making multiple queries concurrently; it is just a convenience method. If multiple vectors are passed in then an additional column query_index will be added to the results. This column will contain the index of the query vector that the result is nearest to.
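A flat search over several query vectors can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the helper `flat_search` is hypothetical, it uses plain Euclidean distance from the standard library, and the distance values LanceDB reports for its metrics may be scaled differently.

```python
import math
from typing import Dict, List, Sequence


def flat_search(
    vectors: Sequence[Sequence[float]],
    query_vectors: Sequence[Sequence[float]],
    limit: int = 10,
) -> List[Dict]:
    """Compare every query vector with every stored vector, keep the nearest."""
    results: List[Dict] = []
    for query_index, query in enumerate(query_vectors):
        # Exhaustive comparison followed by a sort: the "flat search".
        scored = sorted(
            (math.dist(query, vector), row_id)
            for row_id, vector in enumerate(vectors)
        )
        for distance, row_id in scored[:limit]:
            results.append(
                {"row_id": row_id, "_distance": distance, "query_index": query_index}
            )
    return results


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
queries = [[0.0, 0.0], [5.0, 4.0]]
hits = flat_search(vectors, queries, limit=2)
hit_pairs = [(h["row_id"], h["query_index"]) for h in hits]
# hit_pairs == [(0, 0), (1, 0), (2, 1), (1, 1)]
```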

Source code in lancedb/query.py
def nearest_to(
    self,
    query_vector: Union[VEC, Tuple, List[VEC]],
) -> AsyncHybridQuery:
    """
    In addition to doing text search on the LanceDB Table, also
    find the nearest vectors to the given query vector.

    This converts the query from a FTS Query to a Hybrid query. Results
    from the vector search will be combined with results from the FTS query.

    This method will attempt to convert the input to the query vector
    expected by the embedding model.  If the input cannot be converted
    then an error will be thrown.

    By default, there is no embedding model, and the input should be
    something that can be converted to a pyarrow array of floats.  This
    includes lists, numpy arrays, and tuples.

    If there is only one vector column (a column whose data type is a
    fixed size list of floats) then the column does not need to be specified.
    If there is more than one vector column you must use
    [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify
    which column you would like to compare with.

    If no index has been created on the vector column then a vector query
    will perform a distance comparison between the query vector and every
    vector in the database and then sort the results.  This is sometimes
    called a "flat search"

    For small databases, with tens of thousands of vectors or less, this can
    be reasonably fast.  In larger databases you should create a vector index
    on the column.  If there is a vector index then an "approximate" nearest
    neighbor search (frequently called an ANN search) will be performed.  This
    search is much faster, but the results will be approximate.

    The query can be further parameterized using the returned builder.  There
    are various ANN search parameters that will let you fine tune your recall
    accuracy vs search latency.

    Hybrid searches always have a [limit][].  If `limit` has not been called then
    a default `limit` of 10 will be used.

    Typically, a single vector is passed in as the query. However, you can also
    pass in multiple vectors.  This can be useful if you want to find the nearest
    vectors to multiple query vectors. This is not expected to be faster than
    making multiple queries concurrently; it is just a convenience method.
    If multiple vectors are passed in then an additional column `query_index`
    will be added to the results.  This column will contain the index of the
    query vector that the result is nearest to.
    """
    if query_vector is None:
        raise ValueError("query_vector can not be None")

    if (
        isinstance(query_vector, list)
        and len(query_vector) > 0
        and not isinstance(query_vector[0], (float, int))
    ):
        # multiple have been passed
        query_vectors = [AsyncQuery._query_vec_to_array(v) for v in query_vector]
        new_self = self._inner.nearest_to(query_vectors[0])
        for v in query_vectors[1:]:
            new_self.add_query_vector(v)
        return AsyncHybridQuery(new_self)
    else:
        return AsyncHybridQuery(
            self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
        )

lancedb.query.AsyncHybridQuery

Bases: AsyncQueryBase, AsyncVectorQueryBase

A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the RRFReranker, which uses reciprocal rank fusion score for reranking.

To make the vector and FTS results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.
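Reciprocal rank fusion itself is simple to sketch. The version below is the textbook formulation, in which each result list contributes `1 / (k + rank)` to a row's score. The function name and the constant `k = 60` are assumptions for illustration, and the details of LanceDB's RRFReranker may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse several ranked lists of row ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            # Rows near the top of any list receive a larger contribution.
            scores[row_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by row id for determinism.
    return sorted(scores, key=lambda row_id: (-scores[row_id], row_id))


vector_ranking = ["a", "b", "c"]  # best vector match first
fts_ranking = ["b", "d", "a"]     # best text match first
fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
# fused == ['b', 'a', 'd', 'c']
```

Rows that appear in both lists ("a" and "b") end up ahead of rows that appear in only one, which is the property that makes rank fusion useful for hybrid search.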

Source code in lancedb/query.py
class AsyncHybridQuery(AsyncQueryBase, AsyncVectorQueryBase):
    """
    A query builder that performs hybrid vector and full text search.
    Results are combined and reranked based on the specified reranker.
    By default, the results are reranked using the RRFReranker, which
    uses reciprocal rank fusion score for reranking.

    To make the vector and fts results comparable, the scores are normalized.
    Instead of normalizing scores, the `normalize` parameter can be set to "rank"
    in the `rerank` method to convert the scores to ranks and then normalize them.
    """

    def __init__(self, inner: LanceHybridQuery):
        super().__init__(inner)
        self._inner = inner
        self._norm = "score"
        self._reranker = RRFReranker()

    def rerank(
        self, reranker: Reranker = RRFReranker(), normalize: str = "score"
    ) -> AsyncHybridQuery:
        """
        Rerank the hybrid search results using the specified reranker. The reranker
        must be an instance of Reranker class.

        Parameters
        ----------
        reranker: Reranker, default RRFReranker()
            The reranker to use. Must be an instance of Reranker class.
        normalize: str, default "score"
            The method to normalize the scores. Can be "rank" or "score". If "rank",
            the scores are converted to ranks and then normalized. If "score", the
            scores are normalized directly.
        Returns
        -------
        AsyncHybridQuery
            The AsyncHybridQuery object.
        """
        if normalize not in ["rank", "score"]:
            raise ValueError("normalize must be 'rank' or 'score'.")
        if reranker and not isinstance(reranker, Reranker):
            raise ValueError("reranker must be an instance of Reranker class.")

        self._norm = normalize
        self._reranker = reranker

        return self

    async def to_batches(
        self,
        *,
        max_batch_length: Optional[int] = None,
        timeout: Optional[timedelta] = None,
    ) -> AsyncRecordBatchReader:
        fts_query = AsyncFTSQuery(self._inner.to_fts_query())
        vec_query = AsyncVectorQuery(self._inner.to_vector_query())

        # save the row ID choice that was made on the query builder and force it
        # to actually fetch the row ids because we need this for reranking
        with_row_ids = self._inner.get_with_row_id()
        fts_query.with_row_id()
        vec_query.with_row_id()

        fts_results, vector_results = await asyncio.gather(
            fts_query.to_arrow(timeout=timeout),
            vec_query.to_arrow(timeout=timeout),
        )

        result = LanceHybridQueryBuilder._combine_hybrid_results(
            fts_results=fts_results,
            vector_results=vector_results,
            norm=self._norm,
            fts_query=fts_query.get_query(),
            reranker=self._reranker,
            limit=self._inner.get_limit(),
            with_row_ids=with_row_ids,
        )

        return AsyncRecordBatchReader(result, max_batch_length=max_batch_length)

    async def explain_plan(self, verbose: Optional[bool] = False):
        """Return the execution plan for this query.

        The output includes both the vector and FTS search plans.

        Examples
        --------
        >>> import asyncio
        >>> from lancedb import connect_async
        >>> from lancedb.index import FTS
        >>> async def doctest_example():
        ...     conn = await connect_async("./.lancedb")
        ...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])
        ...     await table.create_index("text", config=FTS(with_position=False))
        ...     query = [100, 100]
        ...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)
        ...     print(plan)
        >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        Vector Search Plan:
        ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
            Take: columns="vector, _rowid, _distance, (text)"
                CoalesceBatchesExec: target_batch_size=1024
                GlobalLimitExec: skip=0, fetch=10
                    FilterExec: _distance@2 IS NOT NULL
                    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
                        KNNVectorDistance: metric=l2
                        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
        FTS Search Plan:
        LanceScan: uri=..., projection=[vector, text], row_id=false, row_addr=false, ordered=true

        Parameters
        ----------
        verbose : bool, default False
            Use a verbose output format.

        Returns
        -------
        plan : str
        """  # noqa: E501

        results = ["Vector Search Plan:"]
        results.append(await self._inner.to_vector_query().explain_plan(verbose))
        results.append("FTS Search Plan:")
        results.append(await self._inner.to_fts_query().explain_plan(verbose))

        return "\n".join(results)

    async def analyze_plan(self):
        """
        Execute the query and return the physical execution plan with runtime metrics.

        This runs both the vector and FTS (full-text search) queries and returns
        detailed metrics for each step of execution, such as rows processed,
        elapsed time, I/O stats, and more. It's useful for debugging and
        performance analysis.

        Returns
        -------
        plan : str
        """
        results = ["Vector Search Query:"]
        results.append(await self._inner.to_vector_query().analyze_plan())
        results.append("FTS Search Query:")
        results.append(await self._inner.to_fts_query().analyze_plan())

        return "\n".join(results)

column

column(column: str) -> Self

Set the vector column to query

This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.

This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.

Source code in lancedb/query.py
def column(self, column: str) -> Self:
    """
    Set the vector column to query

    This controls which column is compared to the query vector supplied in
    the call to [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to].

    This parameter must be specified if the table has more than one column
    whose data type is a fixed-size-list of floats.
    """
    self._inner.column(column)
    return self

nprobes

nprobes(nprobes: int) -> Self

Set the number of partitions to search (probe)

This argument is only used when the vector column has an IVF-based index. If there is no index then this value is ignored.

The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.

The partitions whose centroids are closest to the query vector will be exhaustively searched to find matches. This parameter controls how many partitions should be searched.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.
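The recall and latency trade-off can be illustrated with a toy IVF structure. This is a minimal sketch with hand-picked centroids and hypothetical helper names. A real IVF index trains its centroids and typically stores compressed vectors.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def build_partitions(
    vectors: Sequence[Vector], centroids: Sequence[Vector]
) -> Dict[int, List[int]]:
    """Assign each row to the partition of its nearest centroid."""
    partitions: Dict[int, List[int]] = {i: [] for i in range(len(centroids))}
    for row_id, vector in enumerate(vectors):
        nearest = min(
            range(len(centroids)), key=lambda i: math.dist(vector, centroids[i])
        )
        partitions[nearest].append(row_id)
    return partitions


def ivf_search(query, vectors, centroids, partitions, nprobes, limit):
    """Exhaustively search only the `nprobes` partitions closest to the query."""
    probe_order = sorted(
        range(len(centroids)), key=lambda i: math.dist(query, centroids[i])
    )
    candidates = [rid for p in probe_order[:nprobes] for rid in partitions[p]]
    return sorted(candidates, key=lambda rid: math.dist(query, vectors[rid]))[:limit]


centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [4.0, 0.0], [6.0, 0.0], [9.0, 0.0]]
partitions = build_partitions(vectors, centroids)  # {0: [0, 1], 1: [2, 3]}

query = [4.9, 0.0]
one_probe = ivf_search(query, vectors, centroids, partitions, nprobes=1, limit=2)
two_probes = ivf_search(query, vectors, centroids, partitions, nprobes=2, limit=2)
# one_probe == [1, 0] misses row 2; two_probes == [1, 2] is the exact answer.
```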

Source code in lancedb/query.py
def nprobes(self, nprobes: int) -> Self:
    """
    Set the number of partitions to search (probe)

    This argument is only used when the vector column has an IVF-based index.
    If there is no index then this value is ignored.

    The IVF stage of IVF PQ divides the input into partitions (clusters) of
    related values.

    The partitions whose centroids are closest to the query vector will be
    exhaustively searched to find matches.  This parameter controls how many
    partitions should be searched.

    Increasing this value will increase the recall of your query but will
    also increase the latency of your query.  The default value is 20.  This
    default is good for many cases but the best value to use will depend on
    your data and the recall that you need to achieve.

    For best results we recommend tuning this parameter with a benchmark against
    your actual data to find the smallest possible value that will still give
    you the desired recall.
    """
    self._inner.nprobes(nprobes)
    return self

distance_range

distance_range(lower_bound: Optional[float] = None, upper_bound: Optional[float] = None) -> Self

Set the distance range to use.

Only rows with distances within range [lower_bound, upper_bound) will be returned.
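The half-open interval means the lower bound is inclusive and the upper bound is exclusive. A small sketch of that check, using a hypothetical helper in which a bound of None is treated as unbounded:

```python
from typing import Optional


def in_distance_range(
    distance: float,
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> bool:
    """True if lower_bound <= distance < upper_bound (None means no bound)."""
    if lower_bound is not None and distance < lower_bound:
        return False
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True


distances = [0.0, 0.5, 1.0, 1.5]
kept = [d for d in distances if in_distance_range(d, 0.5, 1.5)]
# kept == [0.5, 1.0]: 0.5 is included, 1.5 is excluded
```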

Parameters:

  • lower_bound (Optional[float], default: None ) –

    The lower bound of the distance range.

  • upper_bound (Optional[float], default: None ) –

    The upper bound of the distance range.

Returns:

  • AsyncVectorQuery –

    The AsyncVectorQuery object.

Source code in lancedb/query.py
def distance_range(
    self, lower_bound: Optional[float] = None, upper_bound: Optional[float] = None
) -> Self:
    """Set the distance range to use.

    Only rows with distances within range [lower_bound, upper_bound)
    will be returned.

    Parameters
    ----------
    lower_bound: Optional[float]
        The lower bound of the distance range.
    upper_bound: Optional[float]
        The upper bound of the distance range.

    Returns
    -------
    AsyncVectorQuery
        The AsyncVectorQuery object.
    """
    self._inner.distance_range(lower_bound, upper_bound)
    return self
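
The half-open range `[lower_bound, upper_bound)` can be illustrated with a small stand-alone check. This is a sketch only: `in_distance_range` is a hypothetical helper, and treating a `None` bound as "unbounded on that side" is an assumption based on the defaults shown above.

```python
def in_distance_range(distance, lower_bound=None, upper_bound=None):
    """Return True when lower_bound <= distance < upper_bound.

    Assumption: a bound of None means that side of the range is open-ended.
    """
    if lower_bound is not None and distance < lower_bound:
        return False
    if upper_bound is not None and distance >= upper_bound:
        return False
    return True


distances = [0.0, 0.5, 1.0, 1.5, 2.0]
kept = [d for d in distances if in_distance_range(d, lower_bound=0.5, upper_bound=1.5)]
print(kept)
```

The lower bound is inclusive and the upper bound is exclusive, so `0.5` is kept and `1.5` is dropped.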

ef

ef(ef: int) -> Self

Set the number of candidates to consider during search

This argument is only used when the vector column has an HNSW index. If there is no index then this value is ignored.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 1.5 * limit. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

Source code in lancedb/query.py
def ef(self, ef: int) -> Self:
    """
    Set the number of candidates to consider during search

    This argument is only used when the vector column has an HNSW index.
    If there is no index then this value is ignored.

    Increasing this value will increase the recall of your query but will also
    increase the latency of your query.  The default value is 1.5 * limit.  This
    default is good for many cases but the best value to use will depend on your
    data and the recall that you need to achieve.
    """
    self._inner.ef(ef)
    return self

refine_factor

refine_factor(refine_factor: int) -> Self

A multiplier to control how many additional rows are taken during the refine step

This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.

An IVF PQ index stores compressed (quantized) values. The query vector is compared against these values and, since they are compressed, the comparison is inaccurate.

This parameter can be used to refine the results. It can both improve recall and correct the ordering of the nearest results.

To refine results LanceDB will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDB then fetches the full, uncompressed, values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.

Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1 then LanceDB still needs to fetch the full, uncompressed, values so that it can potentially reorder the results.

Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different from the true distance between the query vector and the actual uncompressed vector.

Source code in lancedb/query.py
def refine_factor(self, refine_factor: int) -> Self:
    """
    A multiplier to control how many additional rows are taken during the refine
    step

    This argument is only used when the vector column has an IVF PQ index.
    If there is no index then this value is ignored.

    An IVF PQ index stores compressed (quantized) values.  The query vector is
    compared against these values and, since they are compressed, the comparison is
    inaccurate.

    This parameter can be used to refine the results.  It can both improve
    recall and correct the ordering of the nearest results.

    To refine results LanceDB will first perform an ANN search to find the nearest
    `limit` * `refine_factor` results.  In other words, if `refine_factor` is 3 and
    `limit` is the default (10) then the first 30 results will be selected.  LanceDB
    then fetches the full, uncompressed, values for these 30 results.  The results
    are then reordered by the true distance and only the nearest 10 are kept.

    Note: there is a difference between calling this method with a value of 1 and
    never calling this method at all.  Calling this method with any value will have
    an impact on your search latency.  When you call this method with a
    `refine_factor` of 1 then LanceDB still needs to fetch the full, uncompressed,
    values so that it can potentially reorder the results.

    Note: if this method is NOT called then the distances returned in the _distance
    column will be approximate distances based on the comparison of the quantized
    query vector and the quantized result vectors.  This can be considerably
    different than the true distance between the query vector and the actual
    uncompressed vector.
    """
    self._inner.refine_factor(refine_factor)
    return self
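
The refine step described above can be sketched in pure Python. This is an illustration under stated assumptions, not LanceDB's code: `quantize` snaps coordinates to a coarse grid as a crude stand-in for product quantization, and `search` is a hypothetical helper. It selects `limit * refine_factor` candidates by approximate distance, then reorders them by true distance.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(vec, step=0.25):
    """Lossy compression stand-in: snap each coordinate to a coarse grid."""
    return tuple(round(x / step) * step for x in vec)


def search(query, vectors, limit, refine_factor=None):
    """Return (row index, distance) pairs for the nearest `limit` rows."""
    codes = [quantize(v) for v in vectors]
    quantized_query = quantize(query)

    def approx(i):
        return l2(quantized_query, codes[i])

    def exact(i):
        return l2(query, vectors[i])

    if refine_factor is None:
        # No refine step: ranking and reported distances are approximate.
        ids = sorted(range(len(vectors)), key=approx)[:limit]
        return [(i, approx(i)) for i in ids]

    # Refine step: over-fetch by approximate distance, rerank by true distance.
    candidates = sorted(range(len(vectors)), key=approx)[: limit * refine_factor]
    ids = sorted(candidates, key=exact)[:limit]
    return [(i, exact(i)) for i in ids]


rng = random.Random(0)
vectors = [(rng.random(), rng.random()) for _ in range(200)]
query = (0.5, 0.5)
limit = 10

ground_truth = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:limit]
unrefined = search(query, vectors, limit)
refined = search(query, vectors, limit, refine_factor=20)
```

With `refine_factor=20` every one of the 200 rows becomes a candidate, so the refined result matches the exhaustive ranking. Smaller factors trade some of that accuracy for fewer full-vector fetches.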

distance_type

distance_type(distance_type: str) -> Self

Set the distance metric to use

When performing a vector search we try to find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See IvfPqOptions.distanceType for more details on the different distance metrics available.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

By default "l2" is used.

Source code in lancedb/query.py
def distance_type(self, distance_type: str) -> Self:
    """
    Set the distance metric to use

    When performing a vector search we try to find the "nearest" vectors according
    to some kind of distance metric.  This parameter controls which distance metric
    to use.  See IvfPqOptions.distanceType for more details on the different
    distance metrics available.

    Note: if there is a vector index then the distance type used MUST match the
    distance type used to train the vector index.  If this is not done then the
    results will be invalid.

    By default "l2" is used.
    """
    self._inner.distance_type(distance_type)
    return self
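
To see why the metric must match the one used to train the index, the sketch below compares two textbook metrics on the same rows. The functions are standard definitions written for illustration. The values an engine reports may be scaled differently (for example squared), and that is not something this section confirms.

```python
import math


def l2_distance(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = (1.0, 0.0)
rows = {
    "far_same_direction": (3.0, 0.0),
    "close_other_direction": (1.0, 0.5),
}

nearest_l2 = min(rows, key=lambda name: l2_distance(query, rows[name]))
nearest_cosine = min(rows, key=lambda name: cosine_distance(query, rows[name]))
print(nearest_l2, nearest_cosine)
```

The two metrics disagree about which row is nearest, so an index built around one of them cannot be searched meaningfully with the other.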

bypass_vector_index

bypass_vector_index() -> Self

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Source code in lancedb/query.py
def bypass_vector_index(self) -> Self:
    """
    If this is called then any vector index is skipped

    An exhaustive (flat) search will be performed.  The query vector will
    be compared to every vector in the table.  At high scales this can be
    expensive.  However, this is often still useful.  For example, skipping
    the vector index can give you ground truth results which you can use to
    calculate your recall to select an appropriate value for nprobes.
    """
    self._inner.bypass_vector_index()
    return self

where

where(predicate: str) -> Self

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code in lancedb/query.py
def where(self, predicate: str) -> Self:
    """
    Only return rows matching the given predicate

    The predicate should be supplied as an SQL query string.

    Examples
    --------

    >>> predicate = "x > 10"
    >>> predicate = "y > 0 AND y < 100"
    >>> predicate = "x > 5 OR y = 'test'"

    Filtering performance can often be improved by creating a scalar index
    on the filter column(s).
    """
    self._inner.where(predicate)
    return self
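
The effect of predicate strings like these can be tried out with Python's built-in `sqlite3` module. This is only an illustration of SQL filtering semantics: SQLite's dialect is not LanceDB's, and the `items` table, its `label` column, and the `matching_rows` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (x INTEGER, y INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(3, 50, "test"), (8, 150, "other"), (12, -5, "other"), (20, 99, "test")],
)


def matching_rows(predicate):
    """Return the rows that satisfy the predicate, ordered by x.

    The predicate is spliced into the statement, so only pass trusted strings.
    """
    cursor = conn.execute(
        f"SELECT x, y, label FROM items WHERE {predicate} ORDER BY x"
    )
    return cursor.fetchall()


gt_ten = matching_rows("x > 10")
in_band = matching_rows("y > 0 AND y < 100")
either = matching_rows("x > 5 OR label = 'test'")
print(gt_ten, in_band, either, sep="\n")
```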

select

select(columns: Union[List[str], dict[str, str]]) -> Self

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDB stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.

Source code in lancedb/query.py
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
    """
    Return only the specified columns.

    By default a query will return all columns from the table.  However, this can
    have a very significant impact on latency.  LanceDB stores data in a columnar
    fashion.  This means we can finely tune our I/O to select exactly the columns
    we need.

    As a best practice you should always limit queries to the columns that you need.
    If you pass in a list of column names then only those columns will be
    returned.

    You can also use this method to create new "dynamic" columns based on your
    existing columns. For example, you may not care about "a" or "b" but instead
    simply want "a + b".  This is often seen in the SELECT clause of an SQL query
    (e.g. `SELECT a+b FROM my_table`).

    To create dynamic columns you can pass in a dict[str, str].  A column will be
    returned for each entry in the map.  The key provides the name of the column.
    The value is an SQL string used to specify how the column is calculated.

    For example, an SQL query might state `SELECT a + b AS combined, c`.  The
    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.

    Columns will always be returned in the order given, even if that order is
    different than the order used when adding the data.
    """
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        self._inner.select_columns(columns)
    elif isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        self._inner.select(list(columns.items()))
    else:
        raise TypeError("columns must be a list of column names or a dict")
    return self
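
The correspondence between the dict form and an SQL `SELECT` clause can be checked with a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in SQL engine. `to_select_clause` is a hypothetical helper that mirrors the type checks in the source above, and it is not part of LanceDB.

```python
import sqlite3
from typing import Dict, List, Union


def to_select_clause(columns: Union[List[str], Dict[str, str]]) -> str:
    """Build the SELECT clause equivalent to a `select` argument."""
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        return ", ".join(columns)
    if isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        # Key is the output column name, value is the SQL expression.
        return ", ".join(f"{expr} AS {name}" for name, expr in columns.items())
    raise TypeError("columns must be a list of column names or a dict")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c TEXT)")
conn.execute("INSERT INTO my_table VALUES (1, 2, 'x')")

clause = to_select_clause({"combined": "a + b", "c": "c"})
cursor = conn.execute(f"SELECT {clause} FROM my_table")
names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(clause, names, rows)
```

Because Python dicts preserve insertion order, the output columns appear in the order the entries were given.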

limit

limit(limit: int) -> Self

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code in lancedb/query.py
def limit(self, limit: int) -> Self:
    """
    Set the maximum number of results to return.

    By default, a plain search has no limit.  If this method is not
    called then every valid row from the table will be returned.
    """
    self._inner.limit(limit)
    return self

offset

offset(offset: int) -> Self

Set the offset for the results.

Parameters:

  • offset (int) –

    The offset to start fetching results from.

Source code in lancedb/query.py
def offset(self, offset: int) -> Self:
    """
    Set the offset for the results.

    Parameters
    ----------
    offset: int
        The offset to start fetching results from.
    """
    self._inner.offset(offset)
    return self
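
Used together, `limit` and `offset` give simple pagination. The sketch below models that on a plain Python list. `page` is a hypothetical helper, and the slicing semantics shown are an assumption about how the two settings combine.

```python
def page(rows, limit=None, offset=0):
    """Skip `offset` rows, then return at most `limit` rows (all if None)."""
    end = None if limit is None else offset + limit
    return rows[offset:end]


rows = list(range(25))
first_page = page(rows, limit=10, offset=0)
third_page = page(rows, limit=10, offset=20)
everything = page(rows)
print(first_page, third_page, len(everything))
```

The last page can be shorter than `limit`, which is a convenient signal that there are no more rows to fetch.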

fast_search

fast_search() -> Self

Skip searching un-indexed data.

This can make queries faster, but will miss any data that has not been indexed.

Tip

You can add new data into an existing index by calling AsyncTable.optimize.

Source code in lancedb/query.py
def fast_search(self) -> Self:
    """
    Skip searching un-indexed data.

    This can make queries faster, but will miss any data that has not been
    indexed.

    !!! tip
        You can add new data into an existing index by calling
        [AsyncTable.optimize][lancedb.table.AsyncTable.optimize].
    """
    self._inner.fast_search()
    return self
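
The trade-off can be pictured with a toy model in which rows added after the index was built sit in a separate, un-indexed set. This is a conceptual sketch on 1-D values, not LanceDB's implementation, and `search` is a hypothetical helper.

```python
def search(query, indexed, unindexed, limit, fast_search=False):
    """Nearest-neighbour search over indexed rows plus, optionally, new rows."""
    candidates = list(indexed)
    if not fast_search:
        # A normal search also scans rows the index does not cover yet.
        candidates += list(unindexed)
    return sorted(candidates, key=lambda value: abs(value - query))[:limit]


indexed_rows = [1.0, 4.0, 9.0]
new_rows = [5.1]  # appended after the index was built
query = 5.0

full = search(query, indexed_rows, new_rows, limit=1)
fast = search(query, indexed_rows, new_rows, limit=1, fast_search=True)

# Once the new rows are folded into the index, the fast path sees them too.
after_optimize = search(query, indexed_rows + new_rows, [], limit=1, fast_search=True)
print(full, fast, after_optimize)
```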

with_row_id

with_row_id() -> Self

Include the _rowid column in the results.

Source code in lancedb/query.py
def with_row_id(self) -> Self:
    """
    Include the _rowid column in the results.
    """
    self._inner.with_row_id()
    return self

postfilter

postfilter() -> Self

If this is called then filtering will happen after the search instead of before.

By default filtering will be performed before the search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (described in more detail under refine_factor). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.

Source code in lancedb/query.py
def postfilter(self) -> Self:
    """
    If this is called then filtering will happen after the search instead of
    before.

    By default filtering will be performed before the search.  This is how
    filtering is typically understood to work.  This prefilter step does add some
    additional latency.  Creating a scalar index on the filter column(s) can
    often improve this latency.  However, sometimes a filter is too complex or
    scalar indices cannot be applied to the column.  In these cases postfiltering
    can be used instead of prefiltering to improve latency.

    Post filtering applies the filter to the results of the search.  This
    means we only run the filter on a much smaller set of data.  However, it can
    cause the query to return fewer than `limit` results (or even no results) if
    none of the nearest results match the filter.

    Post filtering happens during the "refine stage" (described in more detail
    under `refine_factor`).  This means that setting a higher refine factor can
    often help restore some of the results lost by post filtering.
    """
    self._inner.postfilter()
    return self
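
The difference between the two orders of operation is easy to see on a toy dataset. The sketch below is illustrative only: the helper names are hypothetical, the rows are 1-D values, and the `overfetch` argument is a stand-in for the way a larger refine factor widens the candidate set.

```python
def nearest(rows, query, limit):
    """Return the `limit` rows whose value is closest to the query."""
    return sorted(rows, key=lambda row: abs(row["value"] - query))[:limit]


def prefilter_search(rows, query, limit, predicate):
    """Filter first, then search only the matching rows."""
    return nearest([row for row in rows if predicate(row)], query, limit)


def postfilter_search(rows, query, limit, predicate, overfetch=1):
    """Search first, then filter the (possibly over-fetched) results."""
    found = nearest(rows, query, limit * overfetch)
    return [row for row in found if predicate(row)][:limit]


def is_even(row):
    return row["even"]


rows = [{"value": float(v), "even": v % 2 == 0} for v in range(20)]

pre = prefilter_search(rows, 7.2, 4, is_even)
post = postfilter_search(rows, 7.2, 4, is_even)
restored = postfilter_search(rows, 7.2, 4, is_even, overfetch=3)

print([row["value"] for row in pre])
print([row["value"] for row in post])
print([row["value"] for row in restored])
```

Here the postfiltered query returns only two of the four requested rows because half of the nearest neighbours fail the filter. Over-fetching recovers the missing rows.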

to_arrow async

to_arrow(timeout: Optional[timedelta] = None) -> Table

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_arrow(self, timeout: Optional[timedelta] = None) -> pa.Table:
    """
    Execute the query and collect the results into an Apache Arrow Table.

    This method will collect all results into memory before returning.  If
    you expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches].

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    batch_iter = await self.to_batches(timeout=timeout)
    return pa.Table.from_batches(
        await batch_iter.read_all(), schema=batch_iter.schema
    )

to_list async

to_list(timeout: Optional[timedelta] = None) -> List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_list(self, timeout: Optional[timedelta] = None) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (await self.to_arrow(timeout=timeout)).to_pylist()
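
The row-oriented shape returned here can be pictured as a transpose of columnar data. The sketch below shows that conversion in plain Python. `columns_to_rows` is a hypothetical helper, and the column values are made up for the example.

```python
def columns_to_rows(columns):
    """Turn {column name: list of values} into a list of per-row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = {"a": [1, 3], "b": [2, 4], "_distance": [0.1, 0.7]}
rows = columns_to_rows(columns)
print(rows)
```

Each dictionary repeats the column names, so for large results the columnar forms (Arrow, pandas, Polars) are usually more compact.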

to_pandas async

to_pandas(flatten: Optional[Union[int, bool]] = None, timeout: Optional[timedelta] = None) -> 'pd.DataFrame'

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Parameters:

  • flatten (Optional[Union[int, bool]], default: None ) –

    If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Source code in lancedb/query.py
async def to_pandas(
    self,
    flatten: Optional[Union[int, bool]] = None,
    timeout: Optional[timedelta] = None,
) -> "pd.DataFrame":
    """
    Execute the query and collect the results into a pandas DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    pandas separately.

    Examples
    --------

    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = batch.to_pandas()
    >>> asyncio.run(doctest_example())

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.
    """
    return (
        flatten_columns(await self.to_arrow(timeout=timeout), flatten)
    ).to_pandas()

to_polars async

to_polars(timeout: Optional[timedelta] = None) -> 'pl.DataFrame'

Execute the query and collect the results into a Polars DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to polars separately.

Parameters:

  • timeout (Optional[timedelta], default: None ) –

    The maximum time to wait for the query to complete. If not specified, no timeout is applied. If the query does not complete within the specified time, an error will be raised.

Examples:

>>> import asyncio
>>> import polars as pl
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = pl.from_arrow(batch)
>>> asyncio.run(doctest_example())

Source code in lancedb/query.py
async def to_polars(
    self,
    timeout: Optional[timedelta] = None,
) -> "pl.DataFrame":
    """
    Execute the query and collect the results into a Polars DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    polars separately.

    Parameters
    ----------
    timeout: Optional[timedelta]
        The maximum time to wait for the query to complete.
        If not specified, no timeout is applied. If the query does not
        complete within the specified time, an error will be raised.

    Examples
    --------

    >>> import asyncio
    >>> import polars as pl
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = pl.from_arrow(batch)
    >>> asyncio.run(doctest_example())
    """
    import polars as pl

    return pl.from_arrow(await self.to_arrow(timeout=timeout))

rerank

rerank(reranker: Reranker = RRFReranker(), normalize: str = 'score') -> AsyncHybridQuery

Rerank the hybrid search results using the specified reranker. The reranker must be an instance of Reranker class.

Parameters:

  • reranker (Reranker, default: RRFReranker() ) –

    The reranker to use. Must be an instance of Reranker class.

  • normalize (str, default: 'score' ) –

    The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.

Returns:

  • AsyncHybridQuery –

    The AsyncHybridQuery object.

Source code in lancedb/query.py
def rerank(
    self, reranker: Reranker = RRFReranker(), normalize: str = "score"
) -> AsyncHybridQuery:
    """
    Rerank the hybrid search results using the specified reranker. The reranker
    must be an instance of Reranker class.

    Parameters
    ----------
    reranker: Reranker, default RRFReranker()
        The reranker to use. Must be an instance of Reranker class.
    normalize: str, default "score"
        The method to normalize the scores. Can be "rank" or "score". If "rank",
        the scores are converted to ranks and then normalized. If "score", the
        scores are normalized directly.
    Returns
    -------
    AsyncHybridQuery
        The AsyncHybridQuery object.
    """
    if normalize not in ["rank", "score"]:
        raise ValueError("normalize must be 'rank' or 'score'.")
    if reranker and not isinstance(reranker, Reranker):
        raise ValueError("reranker must be an instance of Reranker class.")

    self._norm = normalize
    self._reranker = reranker

    return self
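
The default reranker's name refers to reciprocal rank fusion (RRF). As a rough sketch of that technique (not the library's code): each result list contributes `1 / (k + rank)` to a row's score, and rows are ordered by the summed score. The constant `k=60` is the value commonly quoted for RRF and is an assumption here, as are the function name and the document ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda row_id: scores[row_id], reverse=True)


vector_ranking = ["doc_a", "doc_b", "doc_c"]  # best match first
fts_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
print(fused)
```

Rows that appear near the top of both lists rise to the front, while rows found by only one search still take part in the final ordering.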

explain_plan async

explain_plan(verbose: Optional[bool] = False)

Return the execution plan for this query.

The output includes both the vector and FTS search plans.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> from lancedb.index import FTS
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])
...     await table.create_index("text", config=FTS(with_position=False))
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
Vector Search Plan:
ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
    Take: columns="vector, _rowid, _distance, (text)"
        CoalesceBatchesExec: target_batch_size=1024
        GlobalLimitExec: skip=0, fetch=10
            FilterExec: _distance@2 IS NOT NULL
            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
                KNNVectorDistance: metric=l2
                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
FTS Search Plan:
LanceScan: uri=..., projection=[vector, text], row_id=false, row_addr=false, ordered=true

Parameters:

  • verbose (bool, default: False ) –

    Use a verbose output format.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def explain_plan(self, verbose: Optional[bool] = False):
    """Return the execution plan for this query.

    The output includes both the vector and FTS search plans.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> from lancedb.index import FTS
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", [{"vector": [99, 99], "text": "hello world"}])
    ...     await table.create_index("text", config=FTS(with_position=False))
    ...     query = [100, 100]
    ...     plan = await table.query().nearest_to([1, 2]).nearest_to_text("hello").explain_plan(True)
    ...     print(plan)
    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    Vector Search Plan:
    ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
        Take: columns="vector, _rowid, _distance, (text)"
            CoalesceBatchesExec: target_batch_size=1024
            GlobalLimitExec: skip=0, fetch=10
                FilterExec: _distance@2 IS NOT NULL
                SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
                    KNNVectorDistance: metric=l2
                    LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
    FTS Search Plan:
    LanceScan: uri=..., projection=[vector, text], row_id=false, row_addr=false, ordered=true

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501

    results = ["Vector Search Plan:"]
    results.append(await self._inner.to_vector_query().explain_plan(verbose))
    results.append("FTS Search Plan:")
    results.append(await self._inner.to_fts_query().explain_plan(verbose))

    return "\n".join(results)

analyze_plan async

analyze_plan()

Execute the query and return the physical execution plan with runtime metrics.

This runs both the vector and FTS (full-text search) queries and returns detailed metrics for each step of execution, such as rows processed, elapsed time, I/O stats, and more. It's useful for debugging and performance analysis.

Returns:

  • plan ( str ) –

Source code in lancedb/query.py
async def analyze_plan(self):
    """
    Execute the query and return the physical execution plan with runtime metrics.

    This runs both the vector and FTS (full-text search) queries and returns
    detailed metrics for each step of execution, such as rows processed,
    elapsed time, I/O stats, and more. It's useful for debugging and
    performance analysis.

    Returns
    -------
    plan : str
    """
    results = ["Vector Search Query:"]
    results.append(await self._inner.to_vector_query().analyze_plan())
    results.append("FTS Search Query:")
    results.append(await self._inner.to_fts_query().analyze_plan())

    return "\n".join(results)