Python API Reference

This section contains the API reference for the OSS Python API.

Installation

pip install lancedb

The following methods describe the synchronous API client. There is also an asynchronous API client.

Connections (Synchronous)

`lancedb.connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, **kwargs: Any) -> DBConnection`

Connect to a LanceDB database.

Parameters:

Name	Type	Description	Default
`uri`	`URI`	The uri of the database.	required
`api_key`	`Optional[str]`	If presented, connect to LanceDB cloud. Otherwise, connect to a database on file system or cloud storage. Can be set via environment variable `LANCEDB_API_KEY`.	`None`
`region`	`str`	The region to use for LanceDB Cloud.	`'us-east-1'`
`host_override`	`Optional[str]`	The override url for LanceDB Cloud.	`None`
`read_consistency_interval`	`Optional[timedelta]`	(For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.	`None`
`request_thread_pool`	`Optional[Union[int, ThreadPoolExecutor]]`	The thread pool to use for making batch requests to the LanceDB Cloud API. If an integer, then a ThreadPoolExecutor will be created with that number of threads. If None, then a ThreadPoolExecutor will be created with the default number of threads. If a ThreadPoolExecutor, then that executor will be used for making requests. This is for LanceDB Cloud only and is only used when making batch requests (i.e., passing in multiple queries to the search method at once).	`None`

Examples:

For a local directory, provide a path for the database:

>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")

For object storage, use a URI prefix:

>>> db = lancedb.connect("s3://my-bucket/lancedb")

Connect to LanceDB cloud:

>>> db = lancedb.connect("db://my_database", api_key="ldb_...")

Returns:

Name	Type	Description
`conn`	`DBConnection`	A connection to a LanceDB database.

Source code in lancedb/__init__.py

def connect(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
    **kwargs: Any,
) -> DBConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The uri of the database.
    api_key: str, optional
        If presented, connect to LanceDB cloud.
        Otherwise, connect to a database on file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override url for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    request_thread_pool: int or ThreadPoolExecutor, optional
        The thread pool to use for making batch requests to the LanceDB Cloud API.
        If an integer, then a ThreadPoolExecutor will be created with that
        number of threads. If None, then a ThreadPoolExecutor will be created
        with the default number of threads. If a ThreadPoolExecutor, then that
        executor will be used for making requests. This is for LanceDB Cloud
        only and is only used when making batch requests (i.e., passing in
        multiple queries to the search method at once).

    Examples
    --------

    For a local directory, provide a path for the database:

    >>> import lancedb
    >>> db = lancedb.connect("~/.lancedb")

    For object storage, use a URI prefix:

    >>> db = lancedb.connect("s3://my-bucket/lancedb")

    Connect to LanceDB cloud:

    >>> db = lancedb.connect("db://my_database", api_key="ldb_...")

    Returns
    -------
    conn : DBConnection
        A connection to a LanceDB database.
    """
    if isinstance(uri, str) and uri.startswith("db://"):
        if api_key is None:
            api_key = os.environ.get("LANCEDB_API_KEY")
        if api_key is None:
            raise ValueError(f"api_key is required to connected LanceDB cloud: {uri}")
        if isinstance(request_thread_pool, int):
            request_thread_pool = ThreadPoolExecutor(request_thread_pool)
        return RemoteDBConnection(
            uri,
            api_key,
            region,
            host_override,
            request_thread_pool=request_thread_pool,
            **kwargs,
        )

    if kwargs:
        raise ValueError(f"Unknown keyword arguments: {kwargs}")
    return LanceDBConnection(uri, read_consistency_interval=read_consistency_interval)

`lancedb.db.DBConnection`

Bases: EnforceOverrides

An active LanceDB connection interface.

Source code in lancedb/db.py

class DBConnection(EnforceOverrides):
    """An active LanceDB connection interface."""

    @abstractmethod
    def table_names(
        self, page_token: Optional[str] = None, limit: int = 10
    ) -> Iterable[str]:
        """List all tables in this database, in sorted order

        Parameters
        ----------
        page_token: str, optional
            The token to use for pagination. If not present, start from the beginning.
            Typically, this token is last table name from the previous page.
            Only supported by LanceDb Cloud.
        limit: int, default 10
            The size of the page to return.
            Only supported by LanceDb Cloud.

        Returns
        -------
        Iterable of str
        """
        pass

    @abstractmethod
    def create_table(
        self,
        name: str,
        data: Optional[DATA] = None,
        schema: Optional[Union[pa.Schema, LanceModel]] = None,
        mode: str = "create",
        exist_ok: bool = False,
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
        embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
    ) -> Table:
        """Create a [Table][lancedb.table.Table] in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: The data to initialize the table, *optional*
            User must provide at least one of `data` or `schema`.
            Acceptable types are:

            - dict or list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        schema: The schema of the table, *optional*
            Acceptable types are:

            - pyarrow.Schema

            - [LanceModel][lancedb.pydantic.LanceModel]
        mode: str; default "create"
            The mode to use when creating the table.
            Can be either "create" or "overwrite".
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".
        exist_ok: bool, default False
            If a table by the same name already exists, then raise an exception
            if exist_ok=False. If exist_ok=True, then open the existing table;
            it will not add the provided data but will validate against any
            schema that's specified.
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        Returns
        -------
        LanceTable
            A reference to the newly created table.

        !!! note

            The vector index won't be created by default.
            To create the index, call the `create_index` method on the table.

        Examples
        --------

        Can create with list of tuples or dictionaries:

        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        >>> db.create_table("my_table", data)
        LanceTable(connection=..., name="my_table")
        >>> db["my_table"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> db.create_table("table2", data)
        LanceTable(connection=..., name="table2")
        >>> db["table2"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> db.create_table("table3", data, schema = custom_schema)
        LanceTable(connection=..., name="table3")
        >>> db["table3"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]


        It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


        >>> import pyarrow as pa
        >>> def make_batches():
        ...     for i in range(5):
        ...         yield pa.RecordBatch.from_arrays(
        ...             [
        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
        ...                     pa.list_(pa.float32(), 2)),
        ...                 pa.array(["foo", "bar"]),
        ...                 pa.array([10.0, 20.0]),
        ...             ],
        ...             ["vector", "item", "price"],
        ...         )
        >>> schema=pa.schema([
        ...     pa.field("vector", pa.list_(pa.float32(), 2)),
        ...     pa.field("item", pa.utf8()),
        ...     pa.field("price", pa.float32()),
        ... ])
        >>> db.create_table("table4", make_batches(), schema=schema)
        LanceTable(connection=..., name="table4")

        """
        raise NotImplementedError

    def __getitem__(self, name: str) -> LanceTable:
        return self.open_table(name)

    def open_table(self, name: str, *, index_cache_size: Optional[int] = None) -> Table:
        """Open a Lance Table in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        index_cache_size: int, default 256
            Set the size of the index cache, specified as a number of entries

            The exact meaning of an "entry" will depend on the type of index:
            * IVF - there is one entry for each IVF partition
            * BTREE - there is one entry for the entire index

            This cache applies to the entire opened table, across all indices.
            Setting this value higher will increase performance on larger datasets
            at the expense of more RAM

        Returns
        -------
        A LanceTable object representing the table.
        """
        raise NotImplementedError

    def drop_table(self, name: str):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        """
        raise NotImplementedError

    def rename_table(self, cur_name: str, new_name: str):
        """Rename a table in the database.

        Parameters
        ----------
        cur_name: str
            The current name of the table.
        new_name: str
            The new name of the table.
        """
        raise NotImplementedError

    def drop_database(self):
        """
        Drop database
        This is the same thing as dropping all the tables
        """
        raise NotImplementedError

`table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str]` `abstractmethod`

List all tables in this database, in sorted order

Parameters:

Name	Type	Description	Default
`page_token`	`Optional[str]`	The token to use for pagination. If not present, start from the beginning. Typically, this token is last table name from the previous page. Only supported by LanceDb Cloud.	`None`
`limit`	`int`	The size of the page to return. Only supported by LanceDb Cloud.	`10`

Returns:

Type	Description
`Iterable of str`

Source code in lancedb/db.py

@abstractmethod
def table_names(
    self, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    page_token: str, optional
        The token to use for pagination. If not present, start from the beginning.
        Typically, this token is last table name from the previous page.
        Only supported by LanceDb Cloud.
    limit: int, default 10
        The size of the page to return.
        Only supported by LanceDb Cloud.

    Returns
    -------
    Iterable of str
    """
    pass

`create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[pa.Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None) -> Table` `abstractmethod`

Create a Table in the database.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table.	required
`data`	`Optional[DATA]`	User must provide at least one of `data` or `schema`. Acceptable types are: dict or list-of-dict pandas.DataFrame pyarrow.Table or pyarrow.RecordBatch	`None`
`schema`	`Optional[Union[Schema, LanceModel]]`	Acceptable types are: pyarrow.Schema LanceModel	`None`
`mode`	`str`	The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".	`'create'`
`exist_ok`	`bool`	If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.	`False`
`on_bad_vectors`	`str`	What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".	`'error'`
`fill_value`	`float`	The value to use when filling vectors. Only used if on_bad_vectors="fill".	`0.0`

Returns:

Type	Description
`LanceTable`	A reference to the newly created table.
`!!! note`	The vector index won't be created by default. To create the index, call the `create_index` method on the table.

Examples:

Can create with list of tuples or dictionaries:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
>>> db.create_table("my_table", data)
LanceTable(connection=..., name="my_table")
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(connection=..., name="table2")
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(connection=..., name="table3")
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create an table from [Iterable[pa.RecordBatch]]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema=pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(connection=..., name="table4")

Source code in lancedb/db.py

@abstractmethod
def create_table(
    self,
    name: str,
    data: Optional[DATA] = None,
    schema: Optional[Union[pa.Schema, LanceModel]] = None,
    mode: str = "create",
    exist_ok: bool = False,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
    embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
) -> Table:
    """Create a [Table][lancedb.table.Table] in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    data: The data to initialize the table, *optional*
        User must provide at least one of `data` or `schema`.
        Acceptable types are:

        - dict or list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    schema: The schema of the table, *optional*
        Acceptable types are:

        - pyarrow.Schema

        - [LanceModel][lancedb.pydantic.LanceModel]
    mode: str; default "create"
        The mode to use when creating the table.
        Can be either "create" or "overwrite".
        By default, if the table already exists, an exception is raised.
        If you want to overwrite the table, use mode="overwrite".
    exist_ok: bool, default False
        If a table by the same name already exists, then raise an exception
        if exist_ok=False. If exist_ok=True, then open the existing table;
        it will not add the provided data but will validate against any
        schema that's specified.
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    Returns
    -------
    LanceTable
        A reference to the newly created table.

    !!! note

        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

    Examples
    --------

    Can create with list of tuples or dictionaries:

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
    >>> db.create_table("my_table", data)
    LanceTable(connection=..., name="my_table")
    >>> db["my_table"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    You can also pass a pandas DataFrame:

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
    ...    "lat": [45.5, 40.1],
    ...    "long": [-122.7, -74.1]
    ... })
    >>> db.create_table("table2", data)
    LanceTable(connection=..., name="table2")
    >>> db["table2"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    Data is converted to Arrow before being written to disk. For maximum
    control over how data is saved, either provide the PyArrow schema to
    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

    >>> custom_schema = pa.schema([
    ...   pa.field("vector", pa.list_(pa.float32(), 2)),
    ...   pa.field("lat", pa.float32()),
    ...   pa.field("long", pa.float32())
    ... ])
    >>> db.create_table("table3", data, schema = custom_schema)
    LanceTable(connection=..., name="table3")
    >>> db["table3"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: float
    long: float
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]


    It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


    >>> import pyarrow as pa
    >>> def make_batches():
    ...     for i in range(5):
    ...         yield pa.RecordBatch.from_arrays(
    ...             [
    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
    ...                     pa.list_(pa.float32(), 2)),
    ...                 pa.array(["foo", "bar"]),
    ...                 pa.array([10.0, 20.0]),
    ...             ],
    ...             ["vector", "item", "price"],
    ...         )
    >>> schema=pa.schema([
    ...     pa.field("vector", pa.list_(pa.float32(), 2)),
    ...     pa.field("item", pa.utf8()),
    ...     pa.field("price", pa.float32()),
    ... ])
    >>> db.create_table("table4", make_batches(), schema=schema)
    LanceTable(connection=..., name="table4")

    """
    raise NotImplementedError

`open_table(name: str, *, index_cache_size: Optional[int] = None) -> Table`

Open a Lance Table in the database.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table.	required
`index_cache_size`	`Optional[int]`	Set the size of the index cache, specified as a number of entries The exact meaning of an "entry" will depend on the type of index: * IVF - there is one entry for each IVF partition * BTREE - there is one entry for the entire index This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM	`None`

Returns:

Type	Description
`A LanceTable object representing the table.`

Source code in lancedb/db.py

def open_table(self, name: str, *, index_cache_size: Optional[int] = None) -> Table:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM

    Returns
    -------
    A LanceTable object representing the table.
    """
    raise NotImplementedError

`drop_table(name: str)`

Drop a table from the database.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table.	required

Source code in lancedb/db.py

def drop_table(self, name: str):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    """
    raise NotImplementedError

`rename_table(cur_name: str, new_name: str)`

Rename a table in the database.

Parameters:

Name	Type	Description	Default
`cur_name`	`str`	The current name of the table.	required
`new_name`	`str`	The new name of the table.	required

Source code in lancedb/db.py

def rename_table(self, cur_name: str, new_name: str):
    """Rename a table in the database.

    Parameters
    ----------
    cur_name: str
        The current name of the table.
    new_name: str
        The new name of the table.
    """
    raise NotImplementedError

`drop_database()`

Drop database This is the same thing as dropping all the tables

Source code in lancedb/db.py

def drop_database(self):
    """
    Drop database
    This is the same thing as dropping all the tables
    """
    raise NotImplementedError

Tables (Synchronous)

`lancedb.table.Table`

Bases: ABC

A Table is a collection of Records in a LanceDB Database.

Examples:

Create using DBConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with Table.add().

>>> table.add([{"vector": [0.5, 1.3], "b": 4}])

Can query the table with Table.search.

>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See Table.create_index.

Source code in lancedb/table.py

class Table(ABC):
    """
    A Table is a collection of Records in a LanceDB Database.

    Examples
    --------

    Create using [DBConnection.create_table][lancedb.DBConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
    >>> table.head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    Can append new data with [Table.add()][lancedb.table.Table.add].

    >>> table.add([{"vector": [0.5, 1.3], "b": 4}])

    Can query the table with [Table.search][lancedb.table.Table.search].

    >>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
       b      vector  _distance
    0  4  [0.5, 1.3]       0.82
    1  2  [1.1, 1.2]       1.13

    Search queries are much faster when an index is created. See
    [Table.create_index][lancedb.table.Table.create_index].
    """

    @property
    @abstractmethod
    def schema(self) -> pa.Schema:
        """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
        of this Table

        """
        raise NotImplementedError

    @abstractmethod
    def count_rows(self, filter: Optional[str] = None) -> int:
        """
        Count the number of rows in the table.

        Parameters
        ----------
        filter: str, optional
            A SQL where clause to filter the rows to count.
        """
        raise NotImplementedError

    def to_pandas(self) -> "pd.DataFrame":
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return self.to_arrow().to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table
        """
        raise NotImplementedError

    def create_index(
        self,
        metric="L2",
        num_partitions=256,
        num_sub_vectors=96,
        vector_column_name: str = VECTOR_COLUMN_NAME,
        replace: bool = True,
        accelerator: Optional[str] = None,
        index_cache_size: Optional[int] = None,
    ):
        """Create an index on the table.

        Parameters
        ----------
        metric: str, default "L2"
            The distance metric to use when creating the index.
            Valid values are "L2", "cosine", or "dot".
            L2 is euclidean distance.
        num_partitions: int, default 256
            The number of IVF partitions to use when creating the index.
            Default is 256.
        num_sub_vectors: int, default 96
            The number of PQ sub-vectors to use when creating the index.
            Default is 96.
        vector_column_name: str, default "vector"
            The vector column name to create the index.
        replace: bool, default True
            - If True, replace the existing index if it exists.

            - If False, raise an error if duplicate index exists.
        accelerator: str, default None
            If set, use the given accelerator to create the index.
            Only support "cuda" for now.
        index_cache_size : int, optional
            The size of the index cache in number of entries. Default value is 256.
        """
        raise NotImplementedError

    @abstractmethod
    def create_scalar_index(
        self,
        column: str,
        *,
        replace: bool = True,
    ):
        """Create a scalar index on a column.

        Scalar indices, like vector indices, can be used to speed up scans.  A scalar
        index can speed up scans that contain filter expressions on the indexed column.
        For example, the following scan will be faster if the column ``my_col`` has
        a scalar index:


            import lancedb

            db = lancedb.connect("/data/lance")
            img_table = db.open_table("images")
            my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()

        Scalar indices can also speed up scans containing a vector search and a
        prefilter:

            import lancedb

            db = lancedb.connect("/data/lance")
            img_table = db.open_table("images")
            img_table.search([1, 2, 3, 4], vector_column_name="vector")
                .where("my_col != 7", prefilter=True)
                .to_pandas()

        Scalar indices can only speed up scans for basic filters using
        equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
        membership (e.g. `my_col IN (0, 1, 2)`)

        Scalar indices can be used if the filter contains multiple indexed columns and
        the filter criteria are AND'd or OR'd together
        (e.g. ``my_col < 0 AND other_col> 100``)

        Scalar indices may be used if the filter contains non-indexed columns but,
        depending on the structure of the filter, they may not be usable.  For example,
        if the column ``not_indexed`` does not have a scalar index then the filter
        ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
        ``my_col``.

        **Experimental API**

        Parameters
        ----------
        column : str
            The column to be indexed.  Must be a boolean, integer, float,
            or string column.
        replace : bool, default True
            Replace the existing index if it exists.

        Examples
        --------


            import lance

            dataset = lance.dataset("./images.lance")
            dataset.create_scalar_index("category")
        """
        raise NotImplementedError

    @abstractmethod
    def add(
        self,
        data: DATA,
        mode: str = "append",
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
    ):
        """Add more data to the [Table](Table).

        Parameters
        ----------
        data: DATA
            The data to insert into the table. Acceptable types are:

            - dict or list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        """
        raise NotImplementedError

    def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
        """
        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
        that can be used to create a "merge insert" operation

        This operation can add rows, update rows, and remove rows all in a single
        transaction. It is a very generic tool that can be used to create
        behaviors like "insert if not exists", "update or insert (i.e. upsert)",
        or even replace a portion of existing data with new data (e.g. replace
        all data where month="january")

        The merge insert operation works by combining new data from a
        **source table** with existing data in a **target table** by using a
        join.  There are three categories of records.

        "Matched" records are records that exist in both the source table and
        the target table. "Not matched" records exist only in the source table
        (e.g. these are new data) "Not matched by source" records exist only
        in the target table (this is old data)

        The builder returned by this method can be used to customize what
        should happen for each category of data.

        Please note that the data may appear to be reordered as part of this
        operation.  This is because updated rows will be deleted from the
        dataset and then reinserted at the end with the new values.

        Parameters
        ----------

        on: Union[str, Iterable[str]]
            A column (or columns) to join on.  This is how records from the
            source table and target table are matched.  Typically this is some
            kind of key or id column.

        Examples
        --------
        >>> import lancedb
        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
        >>> # Perform a "upsert" operation
        >>> table.merge_insert("a")             \\
        ...      .when_matched_update_all()     \\
        ...      .when_not_matched_insert_all() \\
        ...      .execute(new_data)
        >>> # The order of new rows is non-deterministic since we use
        >>> # a hash-join as part of this operation and so we sort here
        >>> table.to_arrow().sort_by("a").to_pandas()
           a  b
        0  1  b
        1  2  x
        2  3  y
        3  4  z
        """
        on = [on] if isinstance(on, str) else list(on.iter())

        return LanceMergeInsertBuilder(self, on)

    @abstractmethod
    def search(
        self,
        query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
        vector_column_name: Optional[str] = None,
        query_type: str = "auto",
    ) -> LanceQueryBuilder:
        """Create a search query to find the nearest neighbors
        of the given query vector. We currently support [vector search][search]
        and [full-text search][experimental-full-text-search].

        All query options are defined in [Query][lancedb.query.Query].

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [
        ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
        ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
        ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
        ... ]
        >>> table = db.create_table("my_table", data)
        >>> query = [0.4, 1.4, 2.4]
        >>> (table.search(query)
        ...     .where("original_width > 1000", prefilter=True)
        ...     .select(["caption", "original_width", "vector"])
        ...     .limit(2)
        ...     .to_pandas())
          caption  original_width           vector  _distance
        0     foo            2000  [0.5, 3.4, 1.3]   5.220000
        1    test            3000  [0.3, 6.2, 2.6]  23.089996

        Parameters
        ----------
        query: list/np.ndarray/str/PIL.Image.Image, default None
            The targetted vector to search for.

            - *default None*.
            Acceptable types are: list, np.ndarray, PIL.Image.Image

            - If None then the select/where/limit clauses are applied to filter
            the table
        vector_column_name: str, optional
            The name of the vector column to search.

            The vector column needs to be a pyarrow fixed size list type

            - If not specified then the vector column is inferred from
            the table schema

            - If the table has multiple vector columns then the *vector_column_name*
            needs to be specified. Otherwise, an error is raised.
        query_type: str
            *default "auto"*.
            Acceptable types are: "vector", "fts", "hybrid", or "auto"

            - If "auto" then the query type is inferred from the query;

                - If `query` is a list/np.ndarray then the query type is
                "vector";

                - If `query` is a PIL.Image.Image then either do vector search,
                or raise an error if no corresponding embedding function is found.

            - If `query` is a string, then the query type is "vector" if the
            table has embedding functions else the query type is "fts"

        Returns
        -------
        LanceQueryBuilder
            A query builder object representing the query.
            Once executed, the query returns

            - selected columns

            - the vector

            - and also the "_distance" column which is the distance between the query
            vector and the returned vector.
        """
        raise NotImplementedError

    @abstractmethod
    def _execute_query(
        self, query: Query, batch_size: Optional[int] = None
    ) -> pa.RecordBatchReader:
        pass

    @abstractmethod
    def _do_merge(
        self,
        merge: LanceMergeInsertBuilder,
        new_data: DATA,
        on_bad_vectors: str,
        fill_value: float,
    ):
        pass

    @abstractmethod
    def delete(self, where: str):
        """Delete rows from the table.

        This can be used to delete a single row, many rows, all rows, or
        sometimes no rows (if your predicate matches nothing).

        Parameters
        ----------
        where: str
            The SQL where clause to use when deleting rows.

            - For example, 'x = 2' or 'x IN (1, 2, 3)'.

            The filter must not be empty, or it will error.

        Examples
        --------
        >>> import lancedb
        >>> data = [
        ...    {"x": 1, "vector": [1, 2]},
        ...    {"x": 2, "vector": [3, 4]},
        ...    {"x": 3, "vector": [5, 6]}
        ... ]
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.delete("x = 2")
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  3  [5.0, 6.0]

        If you have a list of values to delete, you can combine them into a
        stringified list and use the `IN` operator:

        >>> to_remove = [1, 5]
        >>> to_remove = ", ".join([str(v) for v in to_remove])
        >>> to_remove
        '1, 5'
        >>> table.delete(f"x IN ({to_remove})")
        >>> table.to_pandas()
           x      vector
        0  3  [5.0, 6.0]
        """
        raise NotImplementedError

    @abstractmethod
    def update(
        self,
        where: Optional[str] = None,
        values: Optional[dict] = None,
        *,
        values_sql: Optional[Dict[str, str]] = None,
    ):
        """
        This can be used to update zero to all rows depending on how many
        rows match the where clause. If no where clause is provided, then
        all rows will be updated.

        Either `values` or `values_sql` must be provided. You cannot provide
        both.

        Parameters
        ----------
        where: str, optional
            The SQL where clause to use when updating rows. For example, 'x = 2'
            or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
        values: dict, optional
            The values to update. The keys are the column names and the values
            are the values to set.
        values_sql: dict, optional
            The values to update, expressed as SQL expression strings. These can
            reference existing columns. For example, {"x": "x + 1"} will increment
            the x column by 1.

        Examples
        --------
        >>> import lancedb
        >>> import pandas as pd
        >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.update(where="x = 2", values={"vector": [10, 10]})
        >>> table.to_pandas()
           x        vector
        0  1    [1.0, 2.0]
        1  3    [5.0, 6.0]
        2  2  [10.0, 10.0]
        >>> table.update(values_sql={"x": "x + 1"})
        >>> table.to_pandas()
           x        vector
        0  2    [1.0, 2.0]
        1  4    [5.0, 6.0]
        2  3  [10.0, 10.0]
        """
        raise NotImplementedError

    @abstractmethod
    def cleanup_old_versions(
        self,
        older_than: Optional[timedelta] = None,
        *,
        delete_unverified: bool = False,
    ) -> CleanupStats:
        """
        Clean up old versions of the table, freeing disk space.

        Note: This function is not available in LanceDb Cloud (since LanceDb
        Cloud manages cleanup for you automatically)

        Parameters
        ----------
        older_than: timedelta, default None
            The minimum age of the version to delete. If None, then this defaults
            to two weeks.
        delete_unverified: bool, default False
            Because they may be part of an in-progress transaction, files newer
            than 7 days old are not deleted by default. If you are sure that
            there are no in-progress transactions, then you can set this to True
            to delete all files older than `older_than`.

        Returns
        -------
        CleanupStats
            The stats of the cleanup operation, including how many bytes were
            freed.
        """

    @abstractmethod
    def compact_files(self, *args, **kwargs):
        """
        Run the compaction process on the table.

        Note: This function is not available in LanceDb Cloud (since LanceDb
        Cloud manages compaction for you automatically)

        This can be run after making several small appends to optimize the table
        for faster reads.

        Arguments are passed onto :meth:`lance.dataset.DatasetOptimizer.compact_files`.
        For most cases, the default should be fine.
        """

    @abstractmethod
    def add_columns(self, transforms: Dict[str, str]):
        """
        Add new columns with defined values.

        This is not yet available in LanceDB Cloud.

        Parameters
        ----------
        transforms: Dict[str, str]
            A map of column name to a SQL expression to use to calculate the
            value of the new column. These expressions will be evaluated for
            each row in the table, and can reference existing columns.
        """

    @abstractmethod
    def alter_columns(self, alterations: Iterable[Dict[str, str]]):
        """
        Alter column names and nullability.

        This is not yet available in LanceDB Cloud.

        alterations : Iterable[Dict[str, Any]]
            A sequence of dictionaries, each with the following keys:
            - "path": str
                The column path to alter. For a top-level column, this is the name.
                For a nested column, this is the dot-separated path, e.g. "a.b.c".
            - "name": str, optional
                The new name of the column. If not specified, the column name is
                not changed.
            - "nullable": bool, optional
                Whether the column should be nullable. If not specified, the column
                nullability is not changed. Only non-nullable columns can be changed
                to nullable. Currently, you cannot change a nullable column to
                non-nullable.
        """

    @abstractmethod
    def drop_columns(self, columns: Iterable[str]):
        """
        Drop columns from the table.

        This is not yet available in LanceDB Cloud.

        Parameters
        ----------
        columns : Iterable[str]
            The names of the columns to drop.
        """

`schema: pa.Schema` `abstractmethod` `property`

The Arrow Schema of this Table

`count_rows(filter: Optional[str] = None) -> int` `abstractmethod`

Count the number of rows in the table.

Parameters:

Name	Type	Description	Default
`filter`	`Optional[str]`	A SQL where clause to filter the rows to count.	`None`

Source code in lancedb/table.py

@abstractmethod
def count_rows(self, filter: Optional[str] = None) -> int:
    """
    Count the number of rows in the table.

    Parameters
    ----------
    filter: str, optional
        A SQL where clause to filter the rows to count.
    """
    raise NotImplementedError

`to_pandas() -> 'pd.DataFrame'`

Return the table as a pandas DataFrame.

Returns:

Type	Description
`DataFrame`

Source code in lancedb/table.py

def to_pandas(self) -> "pd.DataFrame":
    """Return the table as a pandas DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    return self.to_arrow().to_pandas()

`to_arrow() -> pa.Table` `abstractmethod`

Return the table as a pyarrow Table.

Returns:

Type	Description
`Table`

Source code in lancedb/table.py

@abstractmethod
def to_arrow(self) -> pa.Table:
    """Return the table as a pyarrow Table.

    Returns
    -------
    pa.Table
    """
    raise NotImplementedError

`create_index(metric='L2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None)`

Create an index on the table.

Parameters:

Name	Type	Description	Default
`metric`		The distance metric to use when creating the index. Valid values are "L2", "cosine", or "dot". L2 is euclidean distance.	`'L2'`
`num_partitions`		The number of IVF partitions to use when creating the index. Default is 256.	`256`
`num_sub_vectors`		The number of PQ sub-vectors to use when creating the index. Default is 96.	`96`
`vector_column_name`	`str`	The vector column name to create the index.	`VECTOR_COLUMN_NAME`
`replace`	`bool`	If True, replace the existing index if it exists. If False, raise an error if duplicate index exists.	`True`
`accelerator`	`Optional[str]`	If set, use the given accelerator to create the index. Only support "cuda" for now.	`None`
`index_cache_size`	`int`	The size of the index cache in number of entries. Default value is 256.	`None`

Source code in lancedb/table.py

def create_index(
    self,
    metric="L2",
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name: str = VECTOR_COLUMN_NAME,
    replace: bool = True,
    accelerator: Optional[str] = None,
    index_cache_size: Optional[int] = None,
):
    """Create an index on the table.

    Parameters
    ----------
    metric: str, default "L2"
        The distance metric to use when creating the index.
        Valid values are "L2", "cosine", or "dot".
        L2 is euclidean distance.
    num_partitions: int, default 256
        The number of IVF partitions to use when creating the index.
        Default is 256.
    num_sub_vectors: int, default 96
        The number of PQ sub-vectors to use when creating the index.
        Default is 96.
    vector_column_name: str, default "vector"
        The vector column name to create the index.
    replace: bool, default True
        - If True, replace the existing index if it exists.

        - If False, raise an error if duplicate index exists.
    accelerator: str, default None
        If set, use the given accelerator to create the index.
        Only support "cuda" for now.
    index_cache_size : int, optional
        The size of the index cache in number of entries. Default value is 256.
    """
    raise NotImplementedError

`create_scalar_index(column: str, *, replace: bool = True)` `abstractmethod`

Create a scalar index on a column.

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

import lancedb

db = lancedb.connect("/data/lance")
img_table = db.open_table("images")
my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()

Scalar indices can also speed up scans containing a vector search and a prefilter:

import lancedb

db = lancedb.connect("/data/lance")
img_table = db.open_table("images")
img_table.search([1, 2, 3, 4], vector_column_name="vector")
    .where("my_col != 7", prefilter=True)
    .to_pandas()

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2))

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col> 100)

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.

Experimental API

Parameters:

Name	Type	Description	Default
`column`	`str`	The column to be indexed. Must be a boolean, integer, float, or string column.	required
`replace`	`bool`	Replace the existing index if it exists.	`True`

Examples:

import lance

dataset = lance.dataset("./images.lance")
dataset.create_scalar_index("category")

Source code in lancedb/table.py

@abstractmethod
def create_scalar_index(
    self,
    column: str,
    *,
    replace: bool = True,
):
    """Create a scalar index on a column.

    Scalar indices, like vector indices, can be used to speed up scans.  A scalar
    index can speed up scans that contain filter expressions on the indexed column.
    For example, the following scan will be faster if the column ``my_col`` has
    a scalar index:


        import lancedb

        db = lancedb.connect("/data/lance")
        img_table = db.open_table("images")
        my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()

    Scalar indices can also speed up scans containing a vector search and a
    prefilter:

        import lancedb

        db = lancedb.connect("/data/lance")
        img_table = db.open_table("images")
        img_table.search([1, 2, 3, 4], vector_column_name="vector")
            .where("my_col != 7", prefilter=True)
            .to_pandas()

    Scalar indices can only speed up scans for basic filters using
    equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
    membership (e.g. `my_col IN (0, 1, 2)`)

    Scalar indices can be used if the filter contains multiple indexed columns and
    the filter criteria are AND'd or OR'd together
    (e.g. ``my_col < 0 AND other_col> 100``)

    Scalar indices may be used if the filter contains non-indexed columns but,
    depending on the structure of the filter, they may not be usable.  For example,
    if the column ``not_indexed`` does not have a scalar index then the filter
    ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
    ``my_col``.

    **Experimental API**

    Parameters
    ----------
    column : str
        The column to be indexed.  Must be a boolean, integer, float,
        or string column.
    replace : bool, default True
        Replace the existing index if it exists.

    Examples
    --------


        import lance

        dataset = lance.dataset("./images.lance")
        dataset.create_scalar_index("category")
    """
    raise NotImplementedError

`add(data: DATA, mode: str = 'append', on_bad_vectors: str = 'error', fill_value: float = 0.0)` `abstractmethod`

Add more data to the Table.

Parameters:

Name	Type	Description	Default
`data`	`DATA`	The data to insert into the table. Acceptable types are: dict or list-of-dict pandas.DataFrame pyarrow.Table or pyarrow.RecordBatch	required
`mode`	`str`	The mode to use when writing the data. Valid values are "append" and "overwrite".	`'append'`
`on_bad_vectors`	`str`	What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".	`'error'`
`fill_value`	`float`	The value to use when filling vectors. Only used if on_bad_vectors="fill".	`0.0`

Source code in lancedb/table.py

@abstractmethod
def add(
    self,
    data: DATA,
    mode: str = "append",
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
):
    """Add more data to the [Table](Table).

    Parameters
    ----------
    data: DATA
        The data to insert into the table. Acceptable types are:

        - dict or list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    mode: str
        The mode to use when writing the data. Valid values are
        "append" and "overwrite".
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    """
    raise NotImplementedError

`merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder`

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january")

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (e.g. these are new data) "Not matched by source" records exist only in the target table (this is old data)

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

Name	Type	Description	Default
`on`	`Union[str, Iterable[str]]`	A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.	required

Examples:

>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform a "upsert" operation
>>> table.merge_insert("a")             \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z

Source code in lancedb/table.py

def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
    """
    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
    that can be used to create a "merge insert" operation

    This operation can add rows, update rows, and remove rows all in a single
    transaction. It is a very generic tool that can be used to create
    behaviors like "insert if not exists", "update or insert (i.e. upsert)",
    or even replace a portion of existing data with new data (e.g. replace
    all data where month="january")

    The merge insert operation works by combining new data from a
    **source table** with existing data in a **target table** by using a
    join.  There are three categories of records.

    "Matched" records are records that exist in both the source table and
    the target table. "Not matched" records exist only in the source table
    (e.g. these are new data) "Not matched by source" records exist only
    in the target table (this is old data)

    The builder returned by this method can be used to customize what
    should happen for each category of data.

    Please note that the data may appear to be reordered as part of this
    operation.  This is because updated rows will be deleted from the
    dataset and then reinserted at the end with the new values.

    Parameters
    ----------

    on: Union[str, Iterable[str]]
        A column (or columns) to join on.  This is how records from the
        source table and target table are matched.  Typically this is some
        kind of key or id column.

    Examples
    --------
    >>> import lancedb
    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
    >>> # Perform a "upsert" operation
    >>> table.merge_insert("a")             \\
    ...      .when_matched_update_all()     \\
    ...      .when_not_matched_insert_all() \\
    ...      .execute(new_data)
    >>> # The order of new rows is non-deterministic since we use
    >>> # a hash-join as part of this operation and so we sort here
    >>> table.to_arrow().sort_by("a").to_pandas()
       a  b
    0  1  b
    1  2  x
    2  3  y
    3  4  z
    """
    on = [on] if isinstance(on, str) else list(on.iter())

    return LanceMergeInsertBuilder(self, on)

`search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]] = None, vector_column_name: Optional[str] = None, query_type: str = 'auto') -> LanceQueryBuilder` `abstractmethod`

Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and [full-text search][experimental-full-text-search].

All query options are defined in Query.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
...     .where("original_width > 1000", prefilter=True)
...     .select(["caption", "original_width", "vector"])
...     .limit(2)
...     .to_pandas())
  caption  original_width           vector  _distance
0     foo            2000  [0.5, 3.4, 1.3]   5.220000
1    test            3000  [0.3, 6.2, 2.6]  23.089996

Parameters:

Name	Type	Description	Default
`query`	`Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]]`	The targetted vector to search for. default None. Acceptable types are: list, np.ndarray, PIL.Image.Image If None then the select/where/limit clauses are applied to filter the table	`None`
`vector_column_name`	`Optional[str]`	The name of the vector column to search. The vector column needs to be a pyarrow fixed size list type If not specified then the vector column is inferred from the table schema If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.	`None`
`query_type`	`str`	default "auto". Acceptable types are: "vector", "fts", "hybrid", or "auto" If "auto" then the query type is inferred from the query; If `query` is a list/np.ndarray then the query type is "vector"; If `query` is a PIL.Image.Image then either do vector search, or raise an error if no corresponding embedding function is found. If `query` is a string, then the query type is "vector" if the table has embedding functions else the query type is "fts"	`'auto'`

Returns:

Type	Description
`LanceQueryBuilder`	A query builder object representing the query. Once executed, the query returns selected columns the vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/table.py

@abstractmethod
def search(
    self,
    query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
    vector_column_name: Optional[str] = None,
    query_type: str = "auto",
) -> LanceQueryBuilder:
    """Create a search query to find the nearest neighbors
    of the given query vector. We currently support [vector search][search]
    and [full-text search][experimental-full-text-search].

    All query options are defined in [Query][lancedb.query.Query].

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [
    ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
    ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
    ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
    ... ]
    >>> table = db.create_table("my_table", data)
    >>> query = [0.4, 1.4, 2.4]
    >>> (table.search(query)
    ...     .where("original_width > 1000", prefilter=True)
    ...     .select(["caption", "original_width", "vector"])
    ...     .limit(2)
    ...     .to_pandas())
      caption  original_width           vector  _distance
    0     foo            2000  [0.5, 3.4, 1.3]   5.220000
    1    test            3000  [0.3, 6.2, 2.6]  23.089996

    Parameters
    ----------
    query: list/np.ndarray/str/PIL.Image.Image, default None
        The targetted vector to search for.

        - *default None*.
        Acceptable types are: list, np.ndarray, PIL.Image.Image

        - If None then the select/where/limit clauses are applied to filter
        the table
    vector_column_name: str, optional
        The name of the vector column to search.

        The vector column needs to be a pyarrow fixed size list type

        - If not specified then the vector column is inferred from
        the table schema

        - If the table has multiple vector columns then the *vector_column_name*
        needs to be specified. Otherwise, an error is raised.
    query_type: str
        *default "auto"*.
        Acceptable types are: "vector", "fts", "hybrid", or "auto"

        - If "auto" then the query type is inferred from the query;

            - If `query` is a list/np.ndarray then the query type is
            "vector";

            - If `query` is a PIL.Image.Image then either do vector search,
            or raise an error if no corresponding embedding function is found.

        - If `query` is a string, then the query type is "vector" if the
        table has embedding functions else the query type is "fts"

    Returns
    -------
    LanceQueryBuilder
        A query builder object representing the query.
        Once executed, the query returns

        - selected columns

        - the vector

        - and also the "_distance" column which is the distance between the query
        vector and the returned vector.
    """
    raise NotImplementedError

`delete(where: str)` `abstractmethod`

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

Name	Type	Description	Default
`where`	`str`	The SQL where clause to use when deleting rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.	required

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1, 2]},
...    {"x": 2, "vector": [3, 4]},
...    {"x": 3, "vector": [5, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]

Source code in lancedb/table.py

@abstractmethod
def delete(self, where: str):
    """Delete rows from the table.

    This can be used to delete a single row, many rows, all rows, or
    sometimes no rows (if your predicate matches nothing).

    Parameters
    ----------
    where: str
        The SQL where clause to use when deleting rows.

        - For example, 'x = 2' or 'x IN (1, 2, 3)'.

        The filter must not be empty, or it will error.

    Examples
    --------
    >>> import lancedb
    >>> data = [
    ...    {"x": 1, "vector": [1, 2]},
    ...    {"x": 2, "vector": [3, 4]},
    ...    {"x": 3, "vector": [5, 6]}
    ... ]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.delete("x = 2")
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  3  [5.0, 6.0]

    If you have a list of values to delete, you can combine them into a
    stringified list and use the `IN` operator:

    >>> to_remove = [1, 5]
    >>> to_remove = ", ".join([str(v) for v in to_remove])
    >>> to_remove
    '1, 5'
    >>> table.delete(f"x IN ({to_remove})")
    >>> table.to_pandas()
       x      vector
    0  3  [5.0, 6.0]
    """
    raise NotImplementedError

`update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None)` `abstractmethod`

This can be used to update zero to all rows depending on how many rows match the where clause. If no where clause is provided, then all rows will be updated.

Either values or values_sql must be provided. You cannot provide both.

Parameters:

Name	Type	Description	Default
`where`	`Optional[str]`	The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.	`None`
`values`	`Optional[dict]`	The values to update. The keys are the column names and the values are the values to set.	`None`
`values_sql`	`Optional[Dict[str, str]]`	The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.	`None`

Examples:

>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10, 10]})
>>> table.to_pandas()
   x        vector
0  1    [1.0, 2.0]
1  3    [5.0, 6.0]
2  2  [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
>>> table.to_pandas()
   x        vector
0  2    [1.0, 2.0]
1  4    [5.0, 6.0]
2  3  [10.0, 10.0]

Source code in lancedb/table.py

@abstractmethod
def update(
    self,
    where: Optional[str] = None,
    values: Optional[dict] = None,
    *,
    values_sql: Optional[Dict[str, str]] = None,
):
    """
    This can be used to update zero to all rows depending on how many
    rows match the where clause. If no where clause is provided, then
    all rows will be updated.

    Either `values` or `values_sql` must be provided. You cannot provide
    both.

    Parameters
    ----------
    where: str, optional
        The SQL where clause to use when updating rows. For example, 'x = 2'
        or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
    values: dict, optional
        The values to update. The keys are the column names and the values
        are the values to set.
    values_sql: dict, optional
        The values to update, expressed as SQL expression strings. These can
        reference existing columns. For example, {"x": "x + 1"} will increment
        the x column by 1.

    Examples
    --------
    >>> import lancedb
    >>> import pandas as pd
    >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.update(where="x = 2", values={"vector": [10, 10]})
    >>> table.to_pandas()
       x        vector
    0  1    [1.0, 2.0]
    1  3    [5.0, 6.0]
    2  2  [10.0, 10.0]
    >>> table.update(values_sql={"x": "x + 1"})
    >>> table.to_pandas()
       x        vector
    0  2    [1.0, 2.0]
    1  4    [5.0, 6.0]
    2  3  [10.0, 10.0]
    """
    raise NotImplementedError

`cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> CleanupStats` `abstractmethod`

Clean up old versions of the table, freeing disk space.

Note: This function is not available in LanceDb Cloud (since LanceDb Cloud manages cleanup for you automatically)

Parameters:

Name	Type	Description	Default
`older_than`	`Optional[timedelta]`	The minimum age of the version to delete. If None, then this defaults to two weeks.	`None`
`delete_unverified`	`bool`	Because they may be part of an in-progress transaction, files newer than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, then you can set this to True to delete all files older than `older_than`.	`False`

Returns:

Type	Description
`CleanupStats`	The stats of the cleanup operation, including how many bytes were freed.

Source code in lancedb/table.py

@abstractmethod
def cleanup_old_versions(
    self,
    older_than: Optional[timedelta] = None,
    *,
    delete_unverified: bool = False,
) -> CleanupStats:
    """
    Clean up old versions of the table, freeing disk space.

    Note: This function is not available in LanceDb Cloud (since LanceDb
    Cloud manages cleanup for you automatically)

    Parameters
    ----------
    older_than: timedelta, default None
        The minimum age of the version to delete. If None, then this defaults
        to two weeks.
    delete_unverified: bool, default False
        Because they may be part of an in-progress transaction, files newer
        than 7 days old are not deleted by default. If you are sure that
        there are no in-progress transactions, then you can set this to True
        to delete all files older than `older_than`.

    Returns
    -------
    CleanupStats
        The stats of the cleanup operation, including how many bytes were
        freed.
    """

`compact_files(*args, **kwargs)` `abstractmethod`

Run the compaction process on the table.

Note: This function is not available in LanceDb Cloud (since LanceDb Cloud manages compaction for you automatically)

This can be run after making several small appends to optimize the table for faster reads.

Arguments are passed onto :meth:lance.dataset.DatasetOptimizer.compact_files. For most cases, the default should be fine.

Source code in lancedb/table.py

@abstractmethod
def compact_files(self, *args, **kwargs):
    """
    Run the compaction process on the table.

    Note: This function is not available in LanceDb Cloud (since LanceDb
    Cloud manages compaction for you automatically)

    This can be run after making several small appends to optimize the table
    for faster reads.

    Arguments are passed onto :meth:`lance.dataset.DatasetOptimizer.compact_files`.
    For most cases, the default should be fine.
    """

`add_columns(transforms: Dict[str, str])` `abstractmethod`

Add new columns with defined values.

This is not yet available in LanceDB Cloud.

Parameters:

Name	Type	Description	Default
`transforms`	`Dict[str, str]`	A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns.	required

Source code in lancedb/table.py

@abstractmethod
def add_columns(self, transforms: Dict[str, str]):
    """
    Add new columns with defined values.

    This is not yet available in LanceDB Cloud.

    Parameters
    ----------
    transforms: Dict[str, str]
        A map of column name to a SQL expression to use to calculate the
        value of the new column. These expressions will be evaluated for
        each row in the table, and can reference existing columns.
    """

`alter_columns(alterations: Iterable[Dict[str, str]])` `abstractmethod`

Alter column names and nullability.

This is not yet available in LanceDB Cloud.

alterations : Iterable[Dict[str, Any]] A sequence of dictionaries, each with the following keys: - "path": str The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c". - "name": str, optional The new name of the column. If not specified, the column name is not changed. - "nullable": bool, optional Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.

Source code in lancedb/table.py

@abstractmethod
def alter_columns(self, alterations: Iterable[Dict[str, str]]):
    """
    Alter column names and nullability.

    This is not yet available in LanceDB Cloud.

    alterations : Iterable[Dict[str, Any]]
        A sequence of dictionaries, each with the following keys:
        - "path": str
            The column path to alter. For a top-level column, this is the name.
            For a nested column, this is the dot-separated path, e.g. "a.b.c".
        - "name": str, optional
            The new name of the column. If not specified, the column name is
            not changed.
        - "nullable": bool, optional
            Whether the column should be nullable. If not specified, the column
            nullability is not changed. Only non-nullable columns can be changed
            to nullable. Currently, you cannot change a nullable column to
            non-nullable.
    """

`drop_columns(columns: Iterable[str])` `abstractmethod`

Drop columns from the table.

This is not yet available in LanceDB Cloud.

Parameters:

Name	Type	Description	Default
`columns`	`Iterable[str]`	The names of the columns to drop.	required

Source code in lancedb/table.py

@abstractmethod
def drop_columns(self, columns: Iterable[str]):
    """
    Drop columns from the table.

    This is not yet available in LanceDB Cloud.

    Parameters
    ----------
    columns : Iterable[str]
        The names of the columns to drop.
    """

Querying (Synchronous)

`lancedb.query.Query`

Bases: BaseModel

The LanceDB Query

Attributes:

Name	Type	Description
`vector`	`List[float]`	the vector to search for
`filter`	`Optional[str]`	sql filter to refine the query with, optional
`prefilter`	`bool`	if True then apply the filter before vector search
`k`	`int`	top k results to return
`metric`	`str`	the distance metric between a pair of vectors, can support L2 (default), Cosine and Dot. metric definitions
`columns`	`Optional[List[str]]`	which columns to return in the results
`nprobes`	`int`	The number of probes used - optional A higher number makes search more accurate but also slower. See discussion in Querying an ANN Index for tuning advice.
`refine_factor`	`Optional[int]`	Refine the results by reading extra elements and re-ranking them in memory. A higher number makes search more accurate but also slower. See discussion in Querying an ANN Index for tuning advice.

Source code in lancedb/query.py

class Query(pydantic.BaseModel):
    """The LanceDB Query

    Attributes
    ----------
    vector : List[float]
        the vector to search for
    filter : Optional[str]
        sql filter to refine the query with, optional
    prefilter : bool
        if True then apply the filter before vector search
    k : int
        top k results to return
    metric : str
        the distance metric between a pair of vectors,

        can support L2 (default), Cosine and Dot.
        [metric definitions][search]
    columns : Optional[List[str]]
        which columns to return in the results
    nprobes : int
        The number of probes used - optional

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.
    refine_factor : Optional[int]
        Refine the results by reading extra elements and re-ranking them in memory.

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.
    """

    vector_column: Optional[str] = None

    # vector to search for
    vector: Union[List[float], List[List[float]]]

    # sql filter to refine the query with
    filter: Optional[str] = None

    # if True then apply the filter before vector search
    prefilter: bool = False

    # top k results to return
    k: int

    # # metrics
    metric: str = "L2"

    # which columns to return in the results
    columns: Optional[Union[List[str], Dict[str, str]]] = None

    # optional query parameters for tuning the results,
    # e.g. `{"nprobes": "10", "refine_factor": "10"}`
    nprobes: int = 10

    # Refine factor.
    refine_factor: Optional[int] = None

    with_row_id: bool = False

`lancedb.query.LanceQueryBuilder`

Bases: ABC

An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.

Source code in lancedb/query.py

class LanceQueryBuilder(ABC):
    """An abstract query builder. Subclasses are defined for vector search,
    full text search, hybrid, and plain SQL filtering.
    """

    @classmethod
    def create(
        cls,
        table: "Table",
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
        query_type: str,
        vector_column_name: str,
        ordering_field_name: str = None,
    ) -> LanceQueryBuilder:
        """
        Create a query builder based on the given query and query type.

        Parameters
        ----------
        table: Table
            The table to query.
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
            The query to use. If None, an empty query builder is returned
            which performs simple SQL filtering.
        query_type: str
            The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
            If "auto", the query type is inferred based on the query.
        vector_column_name: str
            The name of the vector column to use for vector search.
        """
        if query is None:
            return LanceEmptyQueryBuilder(table)

        if query_type == "hybrid":
            # hybrid fts and vector query
            return LanceHybridQueryBuilder(table, query, vector_column_name)

        # remember the string query for reranking purpose
        str_query = query if isinstance(query, str) else None

        # convert "auto" query_type to "vector", "fts"
        # or "hybrid" and convert the query to vector if needed
        query, query_type = cls._resolve_query(
            table, query, query_type, vector_column_name
        )

        if query_type == "hybrid":
            return LanceHybridQueryBuilder(table, query, vector_column_name)

        if isinstance(query, str):
            # fts
            return LanceFtsQueryBuilder(
                table, query, ordering_field_name=ordering_field_name
            )

        if isinstance(query, list):
            query = np.array(query, dtype=np.float32)
        elif isinstance(query, np.ndarray):
            query = query.astype(np.float32)
        else:
            raise TypeError(f"Unsupported query type: {type(query)}")

        return LanceVectorQueryBuilder(table, query, vector_column_name, str_query)

    @classmethod
    def _resolve_query(cls, table, query, query_type, vector_column_name):
        # If query_type is fts, then query must be a string.
        # otherwise raise TypeError
        if query_type == "fts":
            if not isinstance(query, str):
                raise TypeError(f"'fts' queries must be a string: {type(query)}")
            return query, query_type
        elif query_type == "vector":
            query = cls._query_to_vector(table, query, vector_column_name)
            return query, query_type
        elif query_type == "auto":
            if isinstance(query, (list, np.ndarray)):
                return query, "vector"
            if isinstance(query, tuple):
                return query, "hybrid"
            else:
                conf = table.embedding_functions.get(vector_column_name)
                if conf is not None:
                    query = conf.function.compute_query_embeddings_with_retry(query)[0]
                    return query, "vector"
                else:
                    return query, "fts"
        else:
            raise ValueError(
                f"Invalid query_type, must be 'vector', 'fts', or 'auto': {query_type}"
            )

    @classmethod
    def _query_to_vector(cls, table, query, vector_column_name):
        if isinstance(query, (list, np.ndarray)):
            return query
        conf = table.embedding_functions.get(vector_column_name)
        if conf is not None:
            return conf.function.compute_query_embeddings_with_retry(query)[0]
        else:
            msg = f"No embedding function for {vector_column_name}"
            raise ValueError(msg)

    def __init__(self, table: "Table"):
        self._table = table
        self._limit = 10
        self._columns = None
        self._where = None
        self._with_row_id = False

    @deprecation.deprecated(
        deprecated_in="0.3.1",
        removed_in="0.4.0",
        current_version=__version__,
        details="Use to_pandas() instead",
    )
    def to_df(self) -> "pd.DataFrame":
        """
        *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        return self.to_pandas()

    def to_pandas(self, flatten: Optional[Union[int, bool]] = None) -> "pd.DataFrame":
        """
        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.

        Parameters
        ----------
        flatten: Optional[Union[int, bool]]
            If flatten is True, flatten all nested columns.
            If flatten is an integer, flatten the nested columns up to the
            specified depth.
            If unspecified, do not flatten the nested columns.
        """
        tbl = self.to_arrow()
        if flatten is True:
            while True:
                tbl = tbl.flatten()
                # loop through all columns to check if there is any struct column
                if any(pa.types.is_struct(col.type) for col in tbl.schema):
                    continue
                else:
                    break
        elif isinstance(flatten, int):
            if flatten <= 0:
                raise ValueError(
                    "Please specify a positive integer for flatten or the boolean "
                    "value `True`"
                )
            while flatten > 0:
                tbl = tbl.flatten()
                flatten -= 1
        return tbl.to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.
        """
        raise NotImplementedError

    def to_list(self) -> List[dict]:
        """
        Execute the query and return the results as a list of dictionaries.

        Each list entry is a dictionary with the selected column names as keys,
        or all table columns if `select` is not called. The vector and the "_distance"
        fields are returned whether or not they're explicitly selected.
        """
        return self.to_arrow().to_pylist()

    def to_pydantic(self, model: Type[LanceModel]) -> List[LanceModel]:
        """Return the table as a list of pydantic models.

        Parameters
        ----------
        model: Type[LanceModel]
            The pydantic model to use.

        Returns
        -------
        List[LanceModel]
        """
        return [
            model(**{k: v for k, v in row.items() if k in model.field_names()})
            for row in self.to_arrow().to_pylist()
        ]

    def to_polars(self) -> "pl.DataFrame":
        """
        Execute the query and return the results as a Polars DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        import polars as pl

        return pl.from_arrow(self.to_arrow())

    def limit(self, limit: Union[int, None]) -> LanceQueryBuilder:
        """Set the maximum number of results to return.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.
            By default the query is limited to the first 10.
            Call this method and pass 0, a negative value,
            or None to remove the limit.
            *WARNING* if you have a large dataset, removing
            the limit can potentially result in reading a
            large amount of data into memory and cause
            out of memory issues.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if limit is None or limit <= 0:
            self._limit = None
        else:
            self._limit = limit
        return self

    def select(self, columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder:
        """Set the columns to return.

        Parameters
        ----------
        columns: list of str, or dict of str to str default None
            List of column names to be fetched.
            Or a dictionary of column names to SQL expressions.
            All columns are fetched if None or unspecified.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if isinstance(columns, list) or isinstance(columns, dict):
            self._columns = columns
        else:
            raise ValueError("columns must be a list or a dictionary")
        return self

    def where(self, where: str, prefilter: bool = False) -> LanceQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default False
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.
            This feature is **EXPERIMENTAL** and may be removed and modified
            without warning in the future.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        self._prefilter = prefilter
        return self

    def with_row_id(self, with_row_id: bool) -> LanceQueryBuilder:
        """Set whether to return row ids.

        Parameters
        ----------
        with_row_id: bool
            If True, return _rowid column in the results.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._with_row_id = with_row_id
        return self

    def explain_plan(self, verbose: Optional[bool] = False) -> str:
        """Return the execution plan for this query.

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", [{"vector": [99, 99]}])
        >>> query = [100, 100]
        >>> plan = table.search(query).explain_plan(True)
        >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
          FilterExec: _distance@2 IS NOT NULL
            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
              KNNVectorDistance: metric=l2
                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

        Parameters
        ----------
        verbose : bool, default False
            Use a verbose output format.

        Returns
        -------
        plan : str
        """  # noqa: E501
        ds = self._table.to_lance()
        return ds.scanner(
            nearest={
                "column": self._vector_column,
                "q": self._query,
            },
        ).explain_plan(verbose)

`create(table: 'Table', query: Optional[Union[np.ndarray, str, 'PIL.Image.Image', Tuple]], query_type: str, vector_column_name: str, ordering_field_name: str = None) -> LanceQueryBuilder` `classmethod`

Create a query builder based on the given query and query type.

Parameters:

Name	Type	Description	Default
`table`	`'Table'`	The table to query.	required
`query`	`Optional[Union[ndarray, str, 'PIL.Image.Image', Tuple]]`	The query to use. If None, an empty query builder is returned which performs simple SQL filtering.	required
`query_type`	`str`	The type of query to perform. One of "vector", "fts", "hybrid", or "auto". If "auto", the query type is inferred based on the query.	required
`vector_column_name`	`str`	The name of the vector column to use for vector search.	required

Source code in lancedb/query.py

@classmethod
def create(
    cls,
    table: "Table",
    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
    query_type: str,
    vector_column_name: str,
    ordering_field_name: str = None,
) -> LanceQueryBuilder:
    """
    Create a query builder based on the given query and query type.

    Parameters
    ----------
    table: Table
        The table to query.
    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
        The query to use. If None, an empty query builder is returned
        which performs simple SQL filtering.
    query_type: str
        The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
        If "auto", the query type is inferred based on the query.
    vector_column_name: str
        The name of the vector column to use for vector search.
    """
    if query is None:
        return LanceEmptyQueryBuilder(table)

    if query_type == "hybrid":
        # hybrid fts and vector query
        return LanceHybridQueryBuilder(table, query, vector_column_name)

    # remember the string query for reranking purpose
    str_query = query if isinstance(query, str) else None

    # convert "auto" query_type to "vector", "fts"
    # or "hybrid" and convert the query to vector if needed
    query, query_type = cls._resolve_query(
        table, query, query_type, vector_column_name
    )

    if query_type == "hybrid":
        return LanceHybridQueryBuilder(table, query, vector_column_name)

    if isinstance(query, str):
        # fts
        return LanceFtsQueryBuilder(
            table, query, ordering_field_name=ordering_field_name
        )

    if isinstance(query, list):
        query = np.array(query, dtype=np.float32)
    elif isinstance(query, np.ndarray):
        query = query.astype(np.float32)
    else:
        raise TypeError(f"Unsupported query type: {type(query)}")

    return LanceVectorQueryBuilder(table, query, vector_column_name, str_query)

`to_df() -> 'pd.DataFrame'`

Deprecated alias for to_pandas(). Please use to_pandas() instead.

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/query.py

@deprecation.deprecated(
    deprecated_in="0.3.1",
    removed_in="0.4.0",
    current_version=__version__,
    details="Use to_pandas() instead",
)
def to_df(self) -> "pd.DataFrame":
    """
    *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

    Execute the query and return the results as a pandas DataFrame.
    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vector.
    """
    return self.to_pandas()

`to_pandas(flatten: Optional[Union[int, bool]] = None) -> 'pd.DataFrame'`

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Parameters:

Name	Type	Description	Default
`flatten`	`Optional[Union[int, bool]]`	If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.	`None`

Source code in lancedb/query.py

def to_pandas(self, flatten: Optional[Union[int, bool]] = None) -> "pd.DataFrame":
    """
    Execute the query and return the results as a pandas DataFrame.
    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vector.

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    """
    tbl = self.to_arrow()
    if flatten is True:
        while True:
            tbl = tbl.flatten()
            # loop through all columns to check if there is any struct column
            if any(pa.types.is_struct(col.type) for col in tbl.schema):
                continue
            else:
                break
    elif isinstance(flatten, int):
        if flatten <= 0:
            raise ValueError(
                "Please specify a positive integer for flatten or the boolean "
                "value `True`"
            )
        while flatten > 0:
            tbl = tbl.flatten()
            flatten -= 1
    return tbl.to_pandas()

`to_arrow() -> pa.Table` `abstractmethod`

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vectors.

Source code in lancedb/query.py

@abstractmethod
def to_arrow(self) -> pa.Table:
    """
    Execute the query and return the results as an
    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vectors.
    """
    raise NotImplementedError

`to_list() -> List[dict]`

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Source code in lancedb/query.py

def to_list(self) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.
    """
    return self.to_arrow().to_pylist()

`to_pydantic(model: Type[LanceModel]) -> List[LanceModel]`

Return the table as a list of pydantic models.

Parameters:

Name	Type	Description	Default
`model`	`Type[LanceModel]`	The pydantic model to use.	required

Returns:

Type	Description
`List[LanceModel]`

Source code in lancedb/query.py

def to_pydantic(self, model: Type[LanceModel]) -> List[LanceModel]:
    """Return the table as a list of pydantic models.

    Parameters
    ----------
    model: Type[LanceModel]
        The pydantic model to use.

    Returns
    -------
    List[LanceModel]
    """
    return [
        model(**{k: v for k, v in row.items() if k in model.field_names()})
        for row in self.to_arrow().to_pylist()
    ]

`to_polars() -> 'pl.DataFrame'`

Execute the query and return the results as a Polars DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/query.py

def to_polars(self) -> "pl.DataFrame":
    """
    Execute the query and return the results as a Polars DataFrame.
    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vector.
    """
    import polars as pl

    return pl.from_arrow(self.to_arrow())

`limit(limit: Union[int, None]) -> LanceQueryBuilder`

Set the maximum number of results to return.

Parameters:

Name	Type	Description	Default
`limit`	`Union[int, None]`	The maximum number of results to return. By default the query is limited to the first 10. Call this method and pass 0, a negative value, or None to remove the limit. WARNING if you have a large dataset, removing the limit can potentially result in reading a large amount of data into memory and cause out of memory issues.	required

Returns:

Type	Description
`LanceQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def limit(self, limit: Union[int, None]) -> LanceQueryBuilder:
    """Set the maximum number of results to return.

    Parameters
    ----------
    limit: int
        The maximum number of results to return.
        By default the query is limited to the first 10.
        Call this method and pass 0, a negative value,
        or None to remove the limit.
        *WARNING* if you have a large dataset, removing
        the limit can potentially result in reading a
        large amount of data into memory and cause
        out of memory issues.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if limit is None or limit <= 0:
        self._limit = None
    else:
        self._limit = limit
    return self

`select(columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder`

Set the columns to return.

Parameters:

Name	Type	Description	Default
`columns`	`Union[list[str], dict[str, str]]`	List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.	required

Returns:

Type	Description
`LanceQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def select(self, columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder:
    """Set the columns to return.

    Parameters
    ----------
    columns: list of str, or dict of str to str default None
        List of column names to be fetched.
        Or a dictionary of column names to SQL expressions.
        All columns are fetched if None or unspecified.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if isinstance(columns, list) or isinstance(columns, dict):
        self._columns = columns
    else:
        raise ValueError("columns must be a list or a dictionary")
    return self

`where(where: str, prefilter: bool = False) -> LanceQueryBuilder`

Set the where clause.

Parameters:

Name	Type	Description	Default
`where`	`str`	The where clause which is a valid SQL where clause. See `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_ for valid SQL expressions.	required
`prefilter`	`bool`	If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed and modified without warning in the future.	`False`

Returns:

Type	Description
`LanceQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def where(self, where: str, prefilter: bool = False) -> LanceQueryBuilder:
    """Set the where clause.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.
    prefilter: bool, default False
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.
        This feature is **EXPERIMENTAL** and may be removed and modified
        without warning in the future.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._where = where
    self._prefilter = prefilter
    return self

`with_row_id(with_row_id: bool) -> LanceQueryBuilder`

Set whether to return row ids.

Parameters:

Name	Type	Description	Default
`with_row_id`	`bool`	If True, return _rowid column in the results.	required

Returns:

Type	Description
`LanceQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def with_row_id(self, with_row_id: bool) -> LanceQueryBuilder:
    """Set whether to return row ids.

    Parameters
    ----------
    with_row_id: bool
        If True, return _rowid column in the results.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._with_row_id = with_row_id
    return self

`explain_plan(verbose: Optional[bool] = False) -> str`

Return the execution plan for this query.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).explain_plan(True)
>>> print(plan)
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  FilterExec: _distance@2 IS NOT NULL
    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
      KNNVectorDistance: metric=l2
        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

Name	Type	Description	Default
`verbose`	`bool`	Use a verbose output format.	`False`

Returns:

Name	Type	Description
`plan`	`str`

Source code in lancedb/query.py

def explain_plan(self, verbose: Optional[bool] = False) -> str:
    """Return the execution plan for this query.

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", [{"vector": [99, 99]}])
    >>> query = [100, 100]
    >>> plan = table.search(query).explain_plan(True)
    >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
      FilterExec: _distance@2 IS NOT NULL
        SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
          KNNVectorDistance: metric=l2
            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    ds = self._table.to_lance()
    return ds.scanner(
        nearest={
            "column": self._vector_column,
            "q": self._query,
        },
    ).explain_plan(verbose)

`lancedb.query.LanceVectorQueryBuilder`

Bases: LanceQueryBuilder

Examples:

>>> import lancedb
>>> data = [{"vector": [1.1, 1.2], "b": 2},
...         {"vector": [0.5, 1.3], "b": 4},
...         {"vector": [0.4, 0.4], "b": 6},
...         {"vector": [0.4, 0.4], "b": 10}]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=data)
>>> (table.search([0.4, 0.4])
...       .metric("cosine")
...       .where("b < 10")
...       .select(["b", "vector"])
...       .limit(2)
...       .to_pandas())
   b      vector  _distance
0  6  [0.4, 0.4]        0.0

Source code in lancedb/query.py

class LanceVectorQueryBuilder(LanceQueryBuilder):
    """
    Examples
    --------
    >>> import lancedb
    >>> data = [{"vector": [1.1, 1.2], "b": 2},
    ...         {"vector": [0.5, 1.3], "b": 4},
    ...         {"vector": [0.4, 0.4], "b": 6},
    ...         {"vector": [0.4, 0.4], "b": 10}]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=data)
    >>> (table.search([0.4, 0.4])
    ...       .metric("cosine")
    ...       .where("b < 10")
    ...       .select(["b", "vector"])
    ...       .limit(2)
    ...       .to_pandas())
       b      vector  _distance
    0  6  [0.4, 0.4]        0.0
    """

    def __init__(
        self,
        table: "Table",
        query: Union[np.ndarray, list, "PIL.Image.Image"],
        vector_column: str,
        str_query: Optional[str] = None,
    ):
        super().__init__(table)
        self._query = query
        self._metric = "L2"
        self._nprobes = 20
        self._refine_factor = None
        self._vector_column = vector_column
        self._prefilter = False
        self._reranker = None
        self._str_query = str_query

    def metric(self, metric: Literal["L2", "cosine"]) -> LanceVectorQueryBuilder:
        """Set the distance metric to use.

        Parameters
        ----------
        metric: "L2" or "cosine"
            The distance metric to use. By default "L2" is used.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._metric = metric
        return self

    def nprobes(self, nprobes: int) -> LanceVectorQueryBuilder:
        """Set the number of probes to use.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        nprobes: int
            The number of probes to use.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._nprobes = nprobes
        return self

    def refine_factor(self, refine_factor: int) -> LanceVectorQueryBuilder:
        """Set the refine factor to use, increasing the number of vectors sampled.

        As an example, a refine factor of 2 will sample 2x as many vectors as
        requested, re-ranks them, and returns the top half most relevant results.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        refine_factor: int
            The refine factor to use.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._refine_factor = refine_factor
        return self

    def to_arrow(self) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.
        """
        return self.to_batches().read_all()

    def to_batches(self, /, batch_size: Optional[int] = None) -> pa.RecordBatchReader:
        """
        Execute the query and return the result as a RecordBatchReader object.

        Parameters
        ----------
        batch_size: int
            The maximum number of selected records in a RecordBatch object.

        Returns
        -------
        pa.RecordBatchReader
        """
        vector = self._query if isinstance(self._query, list) else self._query.tolist()
        if isinstance(vector[0], np.ndarray):
            vector = [v.tolist() for v in vector]
        query = Query(
            vector=vector,
            filter=self._where,
            prefilter=self._prefilter,
            k=self._limit,
            metric=self._metric,
            columns=self._columns,
            nprobes=self._nprobes,
            refine_factor=self._refine_factor,
            vector_column=self._vector_column,
            with_row_id=self._with_row_id,
        )
        result_set = self._table._execute_query(query, batch_size)
        if self._reranker is not None:
            rs_table = result_set.read_all()
            result_set = self._reranker.rerank_vector(self._str_query, rs_table)
            # convert result_set back to RecordBatchReader
            result_set = pa.RecordBatchReader.from_batches(
                result_set.schema, result_set.to_batches()
            )

        return result_set

    def where(self, where: str, prefilter: bool = False) -> LanceVectorQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default False
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.
            This feature is **EXPERIMENTAL** and may be removed and modified
            without warning in the future.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        self._prefilter = prefilter
        return self

    def rerank(
        self, reranker: Reranker, query_string: Optional[str] = None
    ) -> LanceVectorQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        query_string: Optional[str]
            The query to use for reranking. This needs to be specified explicitly here
            as the query used for vector search may already be vectorized and the
            reranker requires a string query.
            This is only required if the query used for vector search is not a string.
            Note: This doesn't yet support the case where the query is multimodal or a
            list of vectors.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._reranker = reranker
        if self._str_query is None and query_string is None:
            raise ValueError(
                """
                The query used for vector search is not a string.
                In this case, the reranker query needs to be specified explicitly.
                """
            )
        if query_string is not None and not isinstance(query_string, str):
            raise ValueError("Reranking currently only supports string queries")
        self._str_query = query_string if query_string is not None else self._str_query
        return self

`metric(metric: Literal['L2', 'cosine']) -> LanceVectorQueryBuilder`

Set the distance metric to use.

Parameters:

Name	Type	Description	Default
`metric`	`Literal['L2', 'cosine']`	The distance metric to use. By default "L2" is used.	required

Returns:

Type	Description
`LanceVectorQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def metric(self, metric: Literal["L2", "cosine"]) -> LanceVectorQueryBuilder:
    """Set the distance metric to use.

    Parameters
    ----------
    metric: "L2" or "cosine"
        The distance metric to use. By default "L2" is used.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._metric = metric
    return self

`nprobes(nprobes: int) -> LanceVectorQueryBuilder`

Set the number of probes to use.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

Name	Type	Description	Default
`nprobes`	`int`	The number of probes to use.	required

Returns:

Type	Description
`LanceVectorQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def nprobes(self, nprobes: int) -> LanceVectorQueryBuilder:
    """Set the number of probes to use.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    See discussion in [Querying an ANN Index][querying-an-ann-index] for
    tuning advice.

    Parameters
    ----------
    nprobes: int
        The number of probes to use.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._nprobes = nprobes
    return self

`refine_factor(refine_factor: int) -> LanceVectorQueryBuilder`

Set the refine factor to use, increasing the number of vectors sampled.

As an example, a refine factor of 2 will sample 2x as many vectors as requested, re-ranks them, and returns the top half most relevant results.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

Name	Type	Description	Default
`refine_factor`	`int`	The refine factor to use.	required

Returns:

Type	Description
`LanceVectorQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def refine_factor(self, refine_factor: int) -> LanceVectorQueryBuilder:
    """Set the refine factor to use, increasing the number of vectors sampled.

    As an example, a refine factor of 2 will sample 2x as many vectors as
    requested, re-ranks them, and returns the top half most relevant results.

    See discussion in [Querying an ANN Index][querying-an-ann-index] for
    tuning advice.

    Parameters
    ----------
    refine_factor: int
        The refine factor to use.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._refine_factor = refine_factor
    return self

`to_arrow() -> pa.Table`

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vectors.

Source code in lancedb/query.py

def to_arrow(self) -> pa.Table:
    """
    Execute the query and return the results as an
    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vectors.
    """
    return self.to_batches().read_all()

`to_batches(batch_size: Optional[int] = None) -> pa.RecordBatchReader`

Execute the query and return the result as a RecordBatchReader object.

Parameters:

Name	Type	Description	Default
`batch_size`	`Optional[int]`	The maximum number of selected records in a RecordBatch object.	`None`

Returns:

Type	Description
`RecordBatchReader`

Source code in lancedb/query.py

def to_batches(self, /, batch_size: Optional[int] = None) -> pa.RecordBatchReader:
    """
    Execute the query and return the result as a RecordBatchReader object.

    Parameters
    ----------
    batch_size: int
        The maximum number of selected records in a RecordBatch object.

    Returns
    -------
    pa.RecordBatchReader
    """
    vector = self._query if isinstance(self._query, list) else self._query.tolist()
    if isinstance(vector[0], np.ndarray):
        vector = [v.tolist() for v in vector]
    query = Query(
        vector=vector,
        filter=self._where,
        prefilter=self._prefilter,
        k=self._limit,
        metric=self._metric,
        columns=self._columns,
        nprobes=self._nprobes,
        refine_factor=self._refine_factor,
        vector_column=self._vector_column,
        with_row_id=self._with_row_id,
    )
    result_set = self._table._execute_query(query, batch_size)
    if self._reranker is not None:
        rs_table = result_set.read_all()
        result_set = self._reranker.rerank_vector(self._str_query, rs_table)
        # convert result_set back to RecordBatchReader
        result_set = pa.RecordBatchReader.from_batches(
            result_set.schema, result_set.to_batches()
        )

    return result_set

`where(where: str, prefilter: bool = False) -> LanceVectorQueryBuilder`

Set the where clause.

Parameters:

Name	Type	Description	Default
`where`	`str`	The where clause which is a valid SQL where clause. See `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_ for valid SQL expressions.	required
`prefilter`	`bool`	If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed and modified without warning in the future.	`False`

Returns:

Type	Description
`LanceQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def where(self, where: str, prefilter: bool = False) -> LanceVectorQueryBuilder:
    """Set the where clause.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.
    prefilter: bool, default False
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.
        This feature is **EXPERIMENTAL** and may be removed and modified
        without warning in the future.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._where = where
    self._prefilter = prefilter
    return self

`rerank(reranker: Reranker, query_string: Optional[str] = None) -> LanceVectorQueryBuilder`

Rerank the results using the specified reranker.

Parameters:

Name	Type	Description	Default
`reranker`	`Reranker`	The reranker to use.	required
`query_string`	`Optional[str]`	The query to use for reranking. This needs to be specified explicitly here as the query used for vector search may already be vectorized and the reranker requires a string query. This is only required if the query used for vector search is not a string. Note: This doesn't yet support the case where the query is multimodal or a list of vectors.	`None`

Returns:

Type	Description
`LanceVectorQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def rerank(
    self, reranker: Reranker, query_string: Optional[str] = None
) -> LanceVectorQueryBuilder:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    query_string: Optional[str]
        The query to use for reranking. This needs to be specified explicitly here
        as the query used for vector search may already be vectorized and the
        reranker requires a string query.
        This is only required if the query used for vector search is not a string.
        Note: This doesn't yet support the case where the query is multimodal or a
        list of vectors.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._reranker = reranker
    if self._str_query is None and query_string is None:
        raise ValueError(
            """
            The query used for vector search is not a string.
            In this case, the reranker query needs to be specified explicitly.
            """
        )
    if query_string is not None and not isinstance(query_string, str):
        raise ValueError("Reranking currently only supports string queries")
    self._str_query = query_string if query_string is not None else self._str_query
    return self

`lancedb.query.LanceFtsQueryBuilder`

Bases: LanceQueryBuilder

A builder for full text search for LanceDB.

Source code in lancedb/query.py

class LanceFtsQueryBuilder(LanceQueryBuilder):
    """A builder for full text search for LanceDB."""

    def __init__(self, table: "Table", query: str, ordering_field_name: str = None):
        super().__init__(table)
        self._query = query
        self._phrase_query = False
        self.ordering_field_name = ordering_field_name
        self._reranker = None

    def phrase_query(self, phrase_query: bool = True) -> LanceFtsQueryBuilder:
        """Set whether to use phrase query.

        Parameters
        ----------
        phrase_query: bool, default True
            If True, then the query will be wrapped in quotes and
            double quotes replaced by single quotes.

        Returns
        -------
        LanceFtsQueryBuilder
            The LanceFtsQueryBuilder object.
        """
        self._phrase_query = phrase_query
        return self

    def to_arrow(self) -> pa.Table:
        try:
            import tantivy
        except ImportError:
            raise ImportError(
                "Please install tantivy-py `pip install tantivy` to use the full text search feature."  # noqa: E501
            )

        from .fts import search_index

        # get the index path
        index_path = self._table._get_fts_index_path()

        # Check that we are on local filesystem
        fs, _path = fs_from_uri(index_path)
        if not isinstance(fs, pa_fs.LocalFileSystem):
            raise NotImplementedError(
                "Full-text search is only supported on the local filesystem"
            )

        # check if the index exist
        if not Path(index_path).exists():
            raise FileNotFoundError(
                "Fts index does not exist. "
                "Please first call table.create_fts_index(['<field_names>']) to "
                "create the fts index."
            )
        # open the index
        index = tantivy.Index.open(index_path)
        # get the scores and doc ids
        query = self._query
        if self._phrase_query:
            query = query.replace('"', "'")
            query = f'"{query}"'
        row_ids, scores = search_index(
            index, query, self._limit, ordering_field=self.ordering_field_name
        )
        if len(row_ids) == 0:
            empty_schema = pa.schema([pa.field("score", pa.float32())])
            return pa.Table.from_pylist([], schema=empty_schema)
        scores = pa.array(scores)
        output_tbl = self._table.to_lance().take(row_ids, columns=self._columns)
        output_tbl = output_tbl.append_column("score", scores)
        # this needs to match vector search results which are uint64
        row_ids = pa.array(row_ids, type=pa.uint64())

        if self._where is not None:
            tmp_name = "__lancedb__duckdb__indexer__"
            output_tbl = output_tbl.append_column(
                tmp_name, pa.array(range(len(output_tbl)))
            )
            try:
                # TODO would be great to have Substrait generate pyarrow compute
                # expressions or conversely have pyarrow support SQL expressions
                # using Substrait
                import duckdb

                indexer = duckdb.sql(
                    f"SELECT {tmp_name} FROM output_tbl WHERE {self._where}"
                ).to_arrow_table()[tmp_name]
                output_tbl = output_tbl.take(indexer).drop([tmp_name])
                row_ids = row_ids.take(indexer)

            except ImportError:
                import tempfile

                import lance

                # TODO Use "memory://" instead once that's supported
                with tempfile.TemporaryDirectory() as tmp:
                    ds = lance.write_dataset(output_tbl, tmp)
                    output_tbl = ds.to_table(filter=self._where)
                    indexer = output_tbl[tmp_name]
                    row_ids = row_ids.take(indexer)
                    output_tbl = output_tbl.drop([tmp_name])

        if self._with_row_id:
            output_tbl = output_tbl.append_column("_rowid", row_ids)

        if self._reranker is not None:
            output_tbl = self._reranker.rerank_fts(self._query, output_tbl)
        return output_tbl

    def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        Returns
        -------
        LanceFtsQueryBuilder
            The LanceQueryBuilder object.
        """
        self._reranker = reranker
        return self

`phrase_query(phrase_query: bool = True) -> LanceFtsQueryBuilder`

Set whether to use phrase query.

Parameters:

Name	Type	Description	Default
`phrase_query`	`bool`	If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.	`True`

Returns:

Type	Description
`LanceFtsQueryBuilder`	The LanceFtsQueryBuilder object.

Source code in lancedb/query.py

def phrase_query(self, phrase_query: bool = True) -> LanceFtsQueryBuilder:
    """Set whether to use phrase query.

    Parameters
    ----------
    phrase_query: bool, default True
        If True, then the query will be wrapped in quotes and
        double quotes replaced by single quotes.

    Returns
    -------
    LanceFtsQueryBuilder
        The LanceFtsQueryBuilder object.
    """
    self._phrase_query = phrase_query
    return self

`rerank(reranker: Reranker) -> LanceFtsQueryBuilder`

Rerank the results using the specified reranker.

Parameters:

Name	Type	Description	Default
`reranker`	`Reranker`	The reranker to use.	required

Returns:

Type	Description
`LanceFtsQueryBuilder`	The LanceQueryBuilder object.

Source code in lancedb/query.py

def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    Returns
    -------
    LanceFtsQueryBuilder
        The LanceQueryBuilder object.
    """
    self._reranker = reranker
    return self

`lancedb.query.LanceHybridQueryBuilder`

Bases: LanceQueryBuilder

A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the LinearCombinationReranker.

To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.

Source code in lancedb/query.py

class LanceHybridQueryBuilder(LanceQueryBuilder):
    """
    A query builder that performs hybrid vector and full text search.
    Results are combined and reranked based on the specified reranker.
    By default, the results are reranked using the LinearCombinationReranker.

    To make the vector and fts results comparable, the scores are normalized.
    Instead of normalizing scores, the `normalize` parameter can be set to "rank"
    in the `rerank` method to convert the scores to ranks and then normalize them.
    """

    def __init__(self, table: "Table", query: str, vector_column: str):
        super().__init__(table)
        self._validate_fts_index()
        vector_query, fts_query = self._validate_query(query)
        self._fts_query = LanceFtsQueryBuilder(table, fts_query)
        vector_query = self._query_to_vector(table, vector_query, vector_column)
        self._vector_query = LanceVectorQueryBuilder(table, vector_query, vector_column)
        self._norm = "score"
        self._reranker = LinearCombinationReranker(weight=0.7, fill=1.0)

    def _validate_fts_index(self):
        if self._table._get_fts_index_path() is None:
            raise ValueError(
                "Please create a full-text search index " "to perform hybrid search."
            )

    def _validate_query(self, query):
        # Temp hack to support vectorized queries for hybrid search
        if isinstance(query, str):
            return query, query
        elif isinstance(query, tuple):
            if len(query) != 2:
                raise ValueError(
                    "The query must be a tuple of (vector_query, fts_query)."
                )
            if not isinstance(query[0], (list, np.ndarray, pa.Array, pa.ChunkedArray)):
                raise ValueError(f"The vector query must be one of {VEC}.")
            if not isinstance(query[1], str):
                raise ValueError("The fts query must be a string.")
            return query[0], query[1]
        else:
            raise ValueError(
                "The query must be either a string or a tuple of (vector, string)."
            )

    def to_arrow(self) -> pa.Table:
        with ThreadPoolExecutor() as executor:
            fts_future = executor.submit(self._fts_query.with_row_id(True).to_arrow)
            vector_future = executor.submit(
                self._vector_query.with_row_id(True).to_arrow
            )
            fts_results = fts_future.result()
            vector_results = vector_future.result()

        # convert to ranks first if needed
        if self._norm == "rank":
            vector_results = self._rank(vector_results, "_distance")
            fts_results = self._rank(fts_results, "score")
        # normalize the scores to be between 0 and 1, 0 being most relevant
        vector_results = self._normalize_scores(vector_results, "_distance")

        # In fts higher scores represent relevance. Not inverting them here as
        # rerankers might need to preserve this score to support `return_score="all"`
        fts_results = self._normalize_scores(fts_results, "score")

        results = self._reranker.rerank_hybrid(
            self._fts_query._query, vector_results, fts_results
        )

        if not isinstance(results, pa.Table):  # Enforce type
            raise TypeError(
                f"rerank_hybrid must return a pyarrow.Table, got {type(results)}"
            )

        # apply limit after reranking
        results = results.slice(length=self._limit)

        if not self._with_row_id:
            results = results.drop(["_rowid"])
        return results

    def _rank(self, results: pa.Table, column: str, ascending: bool = True):
        if len(results) == 0:
            return results
        # Get the _score column from results
        scores = results.column(column).to_numpy()
        sort_indices = np.argsort(scores)
        if not ascending:
            sort_indices = sort_indices[::-1]
        ranks = np.empty_like(sort_indices)
        ranks[sort_indices] = np.arange(len(scores)) + 1
        # replace the _score column with the ranks
        _score_idx = results.column_names.index(column)
        results = results.set_column(
            _score_idx, column, pa.array(ranks, type=pa.float32())
        )
        return results

    def _normalize_scores(self, results: pa.Table, column: str, invert=False):
        if len(results) == 0:
            return results
        # Get the _score column from results
        scores = results.column(column).to_numpy()
        # normalize the scores by subtracting the min and dividing by the max
        max, min = np.max(scores), np.min(scores)
        if np.isclose(max, min):
            rng = max
        else:
            rng = max - min
        scores = (scores - min) / rng
        if invert:
            scores = 1 - scores
        # replace the _score column with the ranks
        _score_idx = results.column_names.index(column)
        results = results.set_column(
            _score_idx, column, pa.array(scores, type=pa.float32())
        )
        return results

    def rerank(
        self,
        normalize="score",
        reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0),
    ) -> LanceHybridQueryBuilder:
        """
        Rerank the hybrid search results using the specified reranker. The reranker
        must be an instance of Reranker class.

        Parameters
        ----------
        normalize: str, default "score"
            The method to normalize the scores. Can be "rank" or "score". If "rank",
            the scores are converted to ranks and then normalized. If "score", the
            scores are normalized directly.
        reranker: Reranker, default LinearCombinationReranker(weight=0.7, fill=1.0)
            The reranker to use. Must be an instance of Reranker class.
        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        if normalize not in ["rank", "score"]:
            raise ValueError("normalize must be 'rank' or 'score'.")
        if reranker and not isinstance(reranker, Reranker):
            raise ValueError("reranker must be an instance of Reranker class.")

        self._norm = normalize
        self._reranker = reranker

        return self

    def limit(self, limit: int) -> LanceHybridQueryBuilder:
        """
        Set the maximum number of results to return for both vector and fts search
        components.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.limit(limit)
        self._fts_query.limit(limit)
        self._limit = limit

        return self

    def select(self, columns: list) -> LanceHybridQueryBuilder:
        """
        Set the columns to return for both vector and fts search.

        Parameters
        ----------
        columns: list
            The columns to return.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.select(columns)
        self._fts_query.select(columns)
        return self

    def where(self, where: str, prefilter: bool = False) -> LanceHybridQueryBuilder:
        """
        Set the where clause for both vector and fts search.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.

        prefilter: bool, default False
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """

        self._vector_query.where(where, prefilter=prefilter)
        self._fts_query.where(where)
        return self

    def metric(self, metric: Literal["L2", "cosine"]) -> LanceHybridQueryBuilder:
        """
        Set the distance metric to use for vector search.

        Parameters
        ----------
        metric: "L2" or "cosine"
            The distance metric to use. By default "L2" is used.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.metric(metric)
        return self

    def nprobes(self, nprobes: int) -> LanceHybridQueryBuilder:
        """
        Set the number of probes to use for vector search.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        Parameters
        ----------
        nprobes: int
            The number of probes to use.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.nprobes(nprobes)
        return self

    def refine_factor(self, refine_factor: int) -> LanceHybridQueryBuilder:
        """
        Refine the vector search results by reading extra elements and
        re-ranking them in memory.

        Parameters
        ----------
        refine_factor: int
            The refine factor to use.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.refine_factor(refine_factor)
        return self

`rerank(normalize='score', reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0)) -> LanceHybridQueryBuilder`

Rerank the hybrid search results using the specified reranker. The reranker must be an instance of Reranker class.

Parameters:

Name	Type	Description	Default
`normalize`		The method to normalize the scores. Can be "rank" or "score". If "rank", the scores are converted to ranks and then normalized. If "score", the scores are normalized directly.	`'score'`
`reranker`	`Reranker`	The reranker to use. Must be an instance of Reranker class.	`LinearCombinationReranker(weight=0.7, fill=1.0)`

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def rerank(
    self,
    normalize="score",
    reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0),
) -> LanceHybridQueryBuilder:
    """
    Rerank the hybrid search results using the specified reranker. The reranker
    must be an instance of Reranker class.

    Parameters
    ----------
    normalize: str, default "score"
        The method to normalize the scores. Can be "rank" or "score". If "rank",
        the scores are converted to ranks and then normalized. If "score", the
        scores are normalized directly.
    reranker: Reranker, default LinearCombinationReranker(weight=0.7, fill=1.0)
        The reranker to use. Must be an instance of Reranker class.
    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    if normalize not in ["rank", "score"]:
        raise ValueError("normalize must be 'rank' or 'score'.")
    if reranker and not isinstance(reranker, Reranker):
        raise ValueError("reranker must be an instance of Reranker class.")

    self._norm = normalize
    self._reranker = reranker

    return self

`limit(limit: int) -> LanceHybridQueryBuilder`

Set the maximum number of results to return for both vector and fts search components.

Parameters:

Name	Type	Description	Default
`limit`	`int`	The maximum number of results to return.	required

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def limit(self, limit: int) -> LanceHybridQueryBuilder:
    """
    Set the maximum number of results to return for both vector and fts search
    components.

    Parameters
    ----------
    limit: int
        The maximum number of results to return.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._vector_query.limit(limit)
    self._fts_query.limit(limit)
    self._limit = limit

    return self

`select(columns: list) -> LanceHybridQueryBuilder`

Set the columns to return for both vector and fts search.

Parameters:

Name	Type	Description	Default
`columns`	`list`	The columns to return.	required

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def select(self, columns: list) -> LanceHybridQueryBuilder:
    """
    Set the columns to return for both vector and fts search.

    Parameters
    ----------
    columns: list
        The columns to return.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._vector_query.select(columns)
    self._fts_query.select(columns)
    return self

`where(where: str, prefilter: bool = False) -> LanceHybridQueryBuilder`

Set the where clause for both vector and fts search.

Parameters:

Name	Type	Description	Default
`where`	`str`	The where clause which is a valid SQL where clause. See `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_ for valid SQL expressions.	required
`prefilter`	`bool`	If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search.	`False`

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def where(self, where: str, prefilter: bool = False) -> LanceHybridQueryBuilder:
    """
    Set the where clause for both vector and fts search.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.

    prefilter: bool, default False
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """

    self._vector_query.where(where, prefilter=prefilter)
    self._fts_query.where(where)
    return self

`metric(metric: Literal['L2', 'cosine']) -> LanceHybridQueryBuilder`

Set the distance metric to use for vector search.

Parameters:

Name	Type	Description	Default
`metric`	`Literal['L2', 'cosine']`	The distance metric to use. By default "L2" is used.	required

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def metric(self, metric: Literal["L2", "cosine"]) -> LanceHybridQueryBuilder:
    """
    Set the distance metric to use for vector search.

    Parameters
    ----------
    metric: "L2" or "cosine"
        The distance metric to use. By default "L2" is used.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._vector_query.metric(metric)
    return self

`nprobes(nprobes: int) -> LanceHybridQueryBuilder`

Set the number of probes to use for vector search.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

Parameters:

Name	Type	Description	Default
`nprobes`	`int`	The number of probes to use.	required

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def nprobes(self, nprobes: int) -> LanceHybridQueryBuilder:
    """
    Set the number of probes to use for vector search.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    Parameters
    ----------
    nprobes: int
        The number of probes to use.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._vector_query.nprobes(nprobes)
    return self

`refine_factor(refine_factor: int) -> LanceHybridQueryBuilder`

Refine the vector search results by reading extra elements and re-ranking them in memory.

Parameters:

Name	Type	Description	Default
`refine_factor`	`int`	The refine factor to use.	required

Returns:

Type	Description
`LanceHybridQueryBuilder`	The LanceHybridQueryBuilder object.

Source code in lancedb/query.py

def refine_factor(self, refine_factor: int) -> LanceHybridQueryBuilder:
    """
    Refine the vector search results by reading extra elements and
    re-ranking them in memory.

    Parameters
    ----------
    refine_factor: int
        The refine factor to use.

    Returns
    -------
    LanceHybridQueryBuilder
        The LanceHybridQueryBuilder object.
    """
    self._vector_query.refine_factor(refine_factor)
    return self

Embeddings

`lancedb.embeddings.registry.EmbeddingFunctionRegistry`

This is a singleton class used to register embedding functions and fetch them by name. It also handles serializing and deserializing. You can implement your own embedding function by subclassing EmbeddingFunction or TextEmbeddingFunction and registering it with the registry.

NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array, pa.ChunkedArray, np.ndarray]

Examples:

>>> registry = EmbeddingFunctionRegistry.get_instance()
>>> @registry.register("my-embedding-function")
... class MyEmbeddingFunction(EmbeddingFunction):
...     def ndims(self) -> int:
...         return 128
...
...     def compute_query_embeddings(self, query: str, *args, **kwargs):
...         return self.compute_source_embeddings(query, *args, **kwargs)
...
...     def compute_source_embeddings(self, texts, *args, **kwargs):
...         return [np.random.rand(self.ndims()) for _ in range(len(texts))]
...
>>> registry.get("my-embedding-function")
<class 'lancedb.embeddings.registry.MyEmbeddingFunction'>

Source code in lancedb/embeddings/registry.py

class EmbeddingFunctionRegistry:
    """
    This is a singleton class used to register embedding functions
    and fetch them by name. It also handles serializing and deserializing.
    You can implement your own embedding function by subclassing EmbeddingFunction
    or TextEmbeddingFunction and registering it with the registry.

    NOTE: Here TEXT is a type alias for Union[str, List[str], pa.Array,
          pa.ChunkedArray, np.ndarray]

    Examples
    --------
    >>> registry = EmbeddingFunctionRegistry.get_instance()
    >>> @registry.register("my-embedding-function")
    ... class MyEmbeddingFunction(EmbeddingFunction):
    ...     def ndims(self) -> int:
    ...         return 128
    ...
    ...     def compute_query_embeddings(self, query: str, *args, **kwargs):
    ...         return self.compute_source_embeddings(query, *args, **kwargs)
    ...
    ...     def compute_source_embeddings(self, texts, *args, **kwargs):
    ...         return [np.random.rand(self.ndims()) for _ in range(len(texts))]
    ...
    >>> registry.get("my-embedding-function")
    <class 'lancedb.embeddings.registry.MyEmbeddingFunction'>
    """

    @classmethod
    def get_instance(cls):
        return __REGISTRY__

    def __init__(self):
        self._functions = {}

    def register(self, alias: str = None):
        """
        This creates a decorator that can be used to register
        an EmbeddingFunction.

        Parameters
        ----------
        alias : Optional[str]
            a human friendly name for the embedding function. If not
            provided, the class name will be used.
        """

        # This is a decorator for a class that inherits from BaseModel
        # It adds the class to the registry
        def decorator(cls):
            if not issubclass(cls, EmbeddingFunction):
                raise TypeError("Must be a subclass of EmbeddingFunction")
            if cls.__name__ in self._functions:
                raise KeyError(f"{cls.__name__} was already registered")
            key = alias or cls.__name__
            self._functions[key] = cls
            cls.__embedding_function_registry_alias__ = alias
            return cls

        return decorator

    def reset(self):
        """
        Reset the registry to its initial state
        """
        self._functions = {}

    def get(self, name: str):
        """
        Fetch an embedding function class by name

        Parameters
        ----------
        name : str
            The name of the embedding function to fetch
            Either the alias or the class name if no alias was provided
            during registration
        """
        return self._functions[name]

    def parse_functions(
        self, metadata: Optional[Dict[bytes, bytes]]
    ) -> Dict[str, "EmbeddingFunctionConfig"]:
        """
        Parse the metadata from an arrow table and
        return a mapping of the vector column to the
        embedding function and source column

        Parameters
        ----------
        metadata : Optional[Dict[bytes, bytes]]
            The metadata from an arrow table. Note that
            the keys and values are bytes (pyarrow api)

        Returns
        -------
        functions : dict
            A mapping of vector column name to embedding function.
            An empty dict is returned if input is None or does not
            contain b"embedding_functions".
        """
        if metadata is None or b"embedding_functions" not in metadata:
            return {}
        serialized = metadata[b"embedding_functions"]
        raw_list = json.loads(serialized.decode("utf-8"))
        return {
            obj["vector_column"]: EmbeddingFunctionConfig(
                vector_column=obj["vector_column"],
                source_column=obj["source_column"],
                function=self.get(obj["name"])(**obj["model"]),
            )
            for obj in raw_list
        }

    def function_to_metadata(self, conf: "EmbeddingFunctionConfig"):
        """
        Convert the given embedding function and source / vector column configs
        into a config dictionary that can be serialized into arrow metadata
        """
        func = conf.function
        name = getattr(
            func, "__embedding_function_registry_alias__", func.__class__.__name__
        )
        json_data = func.safe_model_dump()
        return {
            "name": name,
            "model": json_data,
            "source_column": conf.source_column,
            "vector_column": conf.vector_column,
        }

    def get_table_metadata(self, func_list):
        """
        Convert a list of embedding functions and source / vector configs
        into a config dictionary that can be serialized into arrow metadata
        """
        if func_list is None or len(func_list) == 0:
            return None
        json_data = [self.function_to_metadata(func) for func in func_list]
        # Note that metadata dictionary values must be bytes
        # so we need to json dump then utf8 encode
        metadata = json.dumps(json_data, indent=2).encode("utf-8")
        return {"embedding_functions": metadata}

`register(alias: str = None)`

This creates a decorator that can be used to register an EmbeddingFunction.

Parameters:

Name	Type	Description	Default
`alias`	`Optional[str]`	a human friendly name for the embedding function. If not provided, the class name will be used.	`None`

Source code in lancedb/embeddings/registry.py

def register(self, alias: str = None):
    """
    This creates a decorator that can be used to register
    an EmbeddingFunction.

    Parameters
    ----------
    alias : Optional[str]
        a human friendly name for the embedding function. If not
        provided, the class name will be used.
    """

    # This is a decorator for a class that inherits from BaseModel
    # It adds the class to the registry
    def decorator(cls):
        if not issubclass(cls, EmbeddingFunction):
            raise TypeError("Must be a subclass of EmbeddingFunction")
        if cls.__name__ in self._functions:
            raise KeyError(f"{cls.__name__} was already registered")
        key = alias or cls.__name__
        self._functions[key] = cls
        cls.__embedding_function_registry_alias__ = alias
        return cls

    return decorator

`reset()`

Reset the registry to its initial state

Source code in lancedb/embeddings/registry.py

def reset(self):
    """
    Reset the registry to its initial state
    """
    self._functions = {}

`get(name: str)`

Fetch an embedding function class by name

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the embedding function to fetch Either the alias or the class name if no alias was provided during registration	required

Source code in lancedb/embeddings/registry.py

def get(self, name: str):
    """
    Fetch an embedding function class by name

    Parameters
    ----------
    name : str
        The name of the embedding function to fetch
        Either the alias or the class name if no alias was provided
        during registration
    """
    return self._functions[name]

`parse_functions(metadata: Optional[Dict[bytes, bytes]]) -> Dict[str, EmbeddingFunctionConfig]`

Parse the metadata from an arrow table and return a mapping of the vector column to the embedding function and source column

Parameters:

Name	Type	Description	Default
`metadata`	`Optional[Dict[bytes, bytes]]`	The metadata from an arrow table. Note that the keys and values are bytes (pyarrow api)	required

Returns:

Name	Type	Description
`functions`	`dict`	A mapping of vector column name to embedding function. An empty dict is returned if input is None or does not contain b"embedding_functions".

Source code in lancedb/embeddings/registry.py

def parse_functions(
    self, metadata: Optional[Dict[bytes, bytes]]
) -> Dict[str, "EmbeddingFunctionConfig"]:
    """
    Parse the metadata from an arrow table and
    return a mapping of the vector column to the
    embedding function and source column

    Parameters
    ----------
    metadata : Optional[Dict[bytes, bytes]]
        The metadata from an arrow table. Note that
        the keys and values are bytes (pyarrow api)

    Returns
    -------
    functions : dict
        A mapping of vector column name to embedding function.
        An empty dict is returned if input is None or does not
        contain b"embedding_functions".
    """
    if metadata is None or b"embedding_functions" not in metadata:
        return {}
    serialized = metadata[b"embedding_functions"]
    raw_list = json.loads(serialized.decode("utf-8"))
    return {
        obj["vector_column"]: EmbeddingFunctionConfig(
            vector_column=obj["vector_column"],
            source_column=obj["source_column"],
            function=self.get(obj["name"])(**obj["model"]),
        )
        for obj in raw_list
    }

`function_to_metadata(conf: EmbeddingFunctionConfig)`

Convert the given embedding function and source / vector column configs into a config dictionary that can be serialized into arrow metadata

Source code in lancedb/embeddings/registry.py

def function_to_metadata(self, conf: "EmbeddingFunctionConfig"):
    """
    Convert the given embedding function and source / vector column configs
    into a config dictionary that can be serialized into arrow metadata
    """
    func = conf.function
    name = getattr(
        func, "__embedding_function_registry_alias__", func.__class__.__name__
    )
    json_data = func.safe_model_dump()
    return {
        "name": name,
        "model": json_data,
        "source_column": conf.source_column,
        "vector_column": conf.vector_column,
    }

`get_table_metadata(func_list)`

Convert a list of embedding functions and source / vector configs into a config dictionary that can be serialized into arrow metadata

Source code in lancedb/embeddings/registry.py

def get_table_metadata(self, func_list):
    """
    Convert a list of embedding functions and source / vector configs
    into a config dictionary that can be serialized into arrow metadata
    """
    if func_list is None or len(func_list) == 0:
        return None
    json_data = [self.function_to_metadata(func) for func in func_list]
    # Note that metadata dictionary values must be bytes
    # so we need to json dump then utf8 encode
    metadata = json.dumps(json_data, indent=2).encode("utf-8")
    return {"embedding_functions": metadata}

`lancedb.embeddings.base.EmbeddingFunction`

Bases: BaseModel, ABC

An ABC for embedding functions.

All concrete embedding functions must implement the following: 1. compute_query_embeddings() which takes a query and returns a list of embeddings 2. get_source_embeddings() which returns a list of embeddings for the source column For text data, the two will be the same. For multi-modal data, the source column might be images and the vector column might be text. 3. ndims method which returns the number of dimensions of the vector column

Source code in lancedb/embeddings/base.py

class EmbeddingFunction(BaseModel, ABC):
    """
    An ABC for embedding functions.

    All concrete embedding functions must implement the following:
    1. compute_query_embeddings() which takes a query and returns a list of embeddings
    2. get_source_embeddings() which returns a list of embeddings for the source column
    For text data, the two will be the same. For multi-modal data, the source column
    might be images and the vector column might be text.
    3. ndims method which returns the number of dimensions of the vector column
    """

    __slots__ = ("__weakref__",)  # pydantic 1.x compatibility
    max_retries: int = (
        7  # Setitng 0 disables retires. Maybe this should not be enabled by default,
    )
    _ndims: int = PrivateAttr()

    @classmethod
    def create(cls, **kwargs):
        """
        Create an instance of the embedding function
        """
        return cls(**kwargs)

    @abstractmethod
    def compute_query_embeddings(self, *args, **kwargs) -> List[np.array]:
        """
        Compute the embeddings for a given user query
        """
        pass

    @abstractmethod
    def compute_source_embeddings(self, *args, **kwargs) -> List[np.array]:
        """
        Compute the embeddings for the source column in the database
        """
        pass

    def compute_query_embeddings_with_retry(self, *args, **kwargs) -> List[np.array]:
        """
        Compute the embeddings for a given user query with retries
        """
        return retry_with_exponential_backoff(
            self.compute_query_embeddings, max_retries=self.max_retries
        )(
            *args,
            **kwargs,
        )

    def compute_source_embeddings_with_retry(self, *args, **kwargs) -> List[np.array]:
        """
        Compute the embeddings for the source column in the database with retries
        """
        return retry_with_exponential_backoff(
            self.compute_source_embeddings, max_retries=self.max_retries
        )(*args, **kwargs)

    def sanitize_input(self, texts: TEXT) -> Union[List[str], np.ndarray]:
        """
        Sanitize the input to the embedding function.
        """
        if isinstance(texts, str):
            texts = [texts]
        elif isinstance(texts, pa.Array):
            texts = texts.to_pylist()
        elif isinstance(texts, pa.ChunkedArray):
            texts = texts.combine_chunks().to_pylist()
        return texts

    def safe_model_dump(self):
        from ..pydantic import PYDANTIC_VERSION

        if PYDANTIC_VERSION.major < 2:
            return dict(self)
        return self.model_dump()

    @abstractmethod
    def ndims(self):
        """
        Return the dimensions of the vector column
        """
        pass

    def SourceField(self, **kwargs):
        """
        Creates a pydantic Field that can automatically annotate
        the source column for this embedding function
        """
        return Field(json_schema_extra={"source_column_for": self}, **kwargs)

    def VectorField(self, **kwargs):
        """
        Creates a pydantic Field that can automatically annotate
        the target vector column for this embedding function
        """
        return Field(json_schema_extra={"vector_column_for": self}, **kwargs)

    def __eq__(self, __value: object) -> bool:
        if not hasattr(__value, "__dict__"):
            return False
        return vars(self) == vars(__value)

    def __hash__(self) -> int:
        return hash(frozenset(vars(self).items()))

`create(**kwargs)` `classmethod`

Create an instance of the embedding function

Source code in lancedb/embeddings/base.py

@classmethod
def create(cls, **kwargs):
    """
    Create an instance of the embedding function
    """
    return cls(**kwargs)

`compute_query_embeddings(*args, **kwargs) -> List[np.array]` `abstractmethod`

Compute the embeddings for a given user query

Source code in lancedb/embeddings/base.py

@abstractmethod
def compute_query_embeddings(self, *args, **kwargs) -> List[np.array]:
    """
    Compute the embeddings for a given user query
    """
    pass

`compute_source_embeddings(*args, **kwargs) -> List[np.array]` `abstractmethod`

Compute the embeddings for the source column in the database

Source code in lancedb/embeddings/base.py

@abstractmethod
def compute_source_embeddings(self, *args, **kwargs) -> List[np.array]:
    """
    Compute the embeddings for the source column in the database
    """
    pass

`compute_query_embeddings_with_retry(*args, **kwargs) -> List[np.array]`

Compute the embeddings for a given user query with retries

Source code in lancedb/embeddings/base.py

def compute_query_embeddings_with_retry(self, *args, **kwargs) -> List[np.array]:
    """
    Compute the embeddings for a given user query with retries
    """
    return retry_with_exponential_backoff(
        self.compute_query_embeddings, max_retries=self.max_retries
    )(
        *args,
        **kwargs,
    )

`compute_source_embeddings_with_retry(*args, **kwargs) -> List[np.array]`

Compute the embeddings for the source column in the database with retries

Source code in lancedb/embeddings/base.py

def compute_source_embeddings_with_retry(self, *args, **kwargs) -> List[np.array]:
    """
    Compute the embeddings for the source column in the database with retries
    """
    return retry_with_exponential_backoff(
        self.compute_source_embeddings, max_retries=self.max_retries
    )(*args, **kwargs)

`sanitize_input(texts: TEXT) -> Union[List[str], np.ndarray]`

Sanitize the input to the embedding function.

Source code in lancedb/embeddings/base.py

def sanitize_input(self, texts: TEXT) -> Union[List[str], np.ndarray]:
    """
    Sanitize the input to the embedding function.
    """
    if isinstance(texts, str):
        texts = [texts]
    elif isinstance(texts, pa.Array):
        texts = texts.to_pylist()
    elif isinstance(texts, pa.ChunkedArray):
        texts = texts.combine_chunks().to_pylist()
    return texts

`ndims()` `abstractmethod`

Return the dimensions of the vector column

Source code in lancedb/embeddings/base.py

@abstractmethod
def ndims(self):
    """
    Return the dimensions of the vector column
    """
    pass

`SourceField(**kwargs)`

Creates a pydantic Field that can automatically annotate the source column for this embedding function

Source code in lancedb/embeddings/base.py

def SourceField(self, **kwargs):
    """
    Creates a pydantic Field that can automatically annotate
    the source column for this embedding function
    """
    return Field(json_schema_extra={"source_column_for": self}, **kwargs)

`VectorField(**kwargs)`

Creates a pydantic Field that can automatically annotate the target vector column for this embedding function

Source code in lancedb/embeddings/base.py

def VectorField(self, **kwargs):
    """
    Creates a pydantic Field that can automatically annotate
    the target vector column for this embedding function
    """
    return Field(json_schema_extra={"vector_column_for": self}, **kwargs)

`lancedb.embeddings.base.TextEmbeddingFunction`

Bases: EmbeddingFunction

A callable ABC for embedding functions that take text as input

Source code in lancedb/embeddings/base.py

class TextEmbeddingFunction(EmbeddingFunction):
    """
    A callable ABC for embedding functions that take text as input
    """

    def compute_query_embeddings(self, query: str, *args, **kwargs) -> List[np.array]:
        return self.compute_source_embeddings(query, *args, **kwargs)

    def compute_source_embeddings(self, texts: TEXT, *args, **kwargs) -> List[np.array]:
        texts = self.sanitize_input(texts)
        return self.generate_embeddings(texts)

    @abstractmethod
    def generate_embeddings(
        self, texts: Union[List[str], np.ndarray], *args, **kwargs
    ) -> List[np.array]:
        """
        Generate the embeddings for the given texts
        """
        pass

`generate_embeddings(texts: Union[List[str], np.ndarray], *args, **kwargs) -> List[np.array]` `abstractmethod`

Generate the embeddings for the given texts

Source code in lancedb/embeddings/base.py

@abstractmethod
def generate_embeddings(
    self, texts: Union[List[str], np.ndarray], *args, **kwargs
) -> List[np.array]:
    """
    Generate the embeddings for the given texts
    """
    pass

`lancedb.embeddings.sentence_transformers.SentenceTransformerEmbeddings`

Bases: TextEmbeddingFunction

An embedding function that uses the sentence-transformers library

https://huggingface.co/sentence-transformers

Source code in lancedb/embeddings/sentence_transformers.py

@register("sentence-transformers")
class SentenceTransformerEmbeddings(TextEmbeddingFunction):
    """
    An embedding function that uses the sentence-transformers library

    https://huggingface.co/sentence-transformers
    """

    name: str = "all-MiniLM-L6-v2"
    device: str = "cpu"
    normalize: bool = True
    trust_remote_code: bool = False

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._ndims = None

    @property
    def embedding_model(self):
        """
        Get the sentence-transformers embedding model specified by the
        name, device, and trust_remote_code. This is cached so that the
        model is only loaded once per process.
        """
        return self.get_embedding_model()

    def ndims(self):
        if self._ndims is None:
            self._ndims = len(self.generate_embeddings("foo")[0])
        return self._ndims

    def generate_embeddings(
        self, texts: Union[List[str], np.ndarray]
    ) -> List[np.array]:
        """
        Get the embeddings for the given texts

        Parameters
        ----------
        texts: list[str] or np.ndarray (of str)
            The texts to embed
        """
        return self.embedding_model.encode(
            list(texts),
            convert_to_numpy=True,
            normalize_embeddings=self.normalize,
        ).tolist()

    @weak_lru(maxsize=1)
    def get_embedding_model(self):
        """
        Get the sentence-transformers embedding model specified by the
        name, device, and trust_remote_code. This is cached so that the
        model is only loaded once per process.

        TODO: use lru_cache instead with a reasonable/configurable maxsize
        """
        sentence_transformers = attempt_import_or_raise(
            "sentence_transformers", "sentence-transformers"
        )
        return sentence_transformers.SentenceTransformer(
            self.name, device=self.device, trust_remote_code=self.trust_remote_code
        )

`embedding_model` `property`

Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.

`generate_embeddings(texts: Union[List[str], np.ndarray]) -> List[np.array]`

Get the embeddings for the given texts

Parameters:

Name	Type	Description	Default
`texts`	`Union[List[str], ndarray]`	The texts to embed	required

Source code in lancedb/embeddings/sentence_transformers.py

def generate_embeddings(
    self, texts: Union[List[str], np.ndarray]
) -> List[np.array]:
    """
    Get the embeddings for the given texts

    Parameters
    ----------
    texts: list[str] or np.ndarray (of str)
        The texts to embed
    """
    return self.embedding_model.encode(
        list(texts),
        convert_to_numpy=True,
        normalize_embeddings=self.normalize,
    ).tolist()

`get_embedding_model()`

Get the sentence-transformers embedding model specified by the name, device, and trust_remote_code. This is cached so that the model is only loaded once per process.

TODO: use lru_cache instead with a reasonable/configurable maxsize

Source code in lancedb/embeddings/sentence_transformers.py

@weak_lru(maxsize=1)
def get_embedding_model(self):
    """
    Get the sentence-transformers embedding model specified by the
    name, device, and trust_remote_code. This is cached so that the
    model is only loaded once per process.

    TODO: use lru_cache instead with a reasonable/configurable maxsize
    """
    sentence_transformers = attempt_import_or_raise(
        "sentence_transformers", "sentence-transformers"
    )
    return sentence_transformers.SentenceTransformer(
        self.name, device=self.device, trust_remote_code=self.trust_remote_code
    )

`lancedb.embeddings.openai.OpenAIEmbeddings`

Bases: TextEmbeddingFunction

An embedding function that uses the OpenAI API

https://platform.openai.com/docs/guides/embeddings

This can also be used for open source models that are compatible with the OpenAI API.

Notes

If you're running an Ollama server locally, you can just override the base_url parameter and provide the Ollama embedding model you want to use (https://ollama.com/library):

from lancedb.embeddings import get_registry
openai = get_registry().get("openai")
embedding_function = openai.create(
    name="<ollama-embedding-model-name>",
    base_url="http://localhost:11434",
    )

Source code in lancedb/embeddings/openai.py

@register("openai")
class OpenAIEmbeddings(TextEmbeddingFunction):
    """
    An embedding function that uses the OpenAI API

    https://platform.openai.com/docs/guides/embeddings

    This can also be used for open source models that
    are compatible with the OpenAI API.

    Notes
    -----
    If you're running an Ollama server locally,
    you can just override the `base_url` parameter
    and provide the Ollama embedding model you want
    to use (https://ollama.com/library):

    ```python
    from lancedb.embeddings import get_registry
    openai = get_registry().get("openai")
    embedding_function = openai.create(
        name="<ollama-embedding-model-name>",
        base_url="http://localhost:11434",
        )
    ```

    """

    name: str = "text-embedding-ada-002"
    dim: Optional[int] = None
    base_url: Optional[str] = None
    default_headers: Optional[dict] = None
    organization: Optional[str] = None
    api_key: Optional[str] = None

    def ndims(self):
        return self._ndims

    @staticmethod
    def model_names():
        return [
            "text-embedding-ada-002",
            "text-embedding-3-large",
            "text-embedding-3-small",
        ]

    @cached_property
    def _ndims(self):
        if self.name == "text-embedding-ada-002":
            return 1536
        elif self.name == "text-embedding-3-large":
            return self.dim or 3072
        elif self.name == "text-embedding-3-small":
            return self.dim or 1536
        else:
            raise ValueError(f"Unknown model name {self.name}")

    def generate_embeddings(
        self, texts: Union[List[str], "np.ndarray"]
    ) -> List["np.array"]:
        """
        Get the embeddings for the given texts

        Parameters
        ----------
        texts: list[str] or np.ndarray (of str)
            The texts to embed
        """
        # TODO retry, rate limit, token limit
        if self.name == "text-embedding-ada-002":
            rs = self._openai_client.embeddings.create(input=texts, model=self.name)
        else:
            kwargs = {
                "input": texts,
                "model": self.name,
            }
            if self.dim:
                kwargs["dimensions"] = self.dim
            rs = self._openai_client.embeddings.create(**kwargs)
        return [v.embedding for v in rs.data]

    @cached_property
    def _openai_client(self):
        openai = attempt_import_or_raise("openai")
        kwargs = {}
        if self.base_url:
            kwargs["base_url"] = self.base_url
        if self.default_headers:
            kwargs["default_headers"] = self.default_headers
        if self.organization:
            kwargs["organization"] = self.organization
        if self.api_key:
            kwargs["api_key"] = self.api_key
        return openai.OpenAI(**kwargs)

`generate_embeddings(texts: Union[List[str], np.ndarray]) -> List[np.array]`

Get the embeddings for the given texts

Parameters:

Name	Type	Description	Default
`texts`	`Union[List[str], ndarray]`	The texts to embed	required

Source code in lancedb/embeddings/openai.py

def generate_embeddings(
    self, texts: Union[List[str], "np.ndarray"]
) -> List["np.array"]:
    """
    Get the embeddings for the given texts

    Parameters
    ----------
    texts: list[str] or np.ndarray (of str)
        The texts to embed
    """
    # TODO retry, rate limit, token limit
    if self.name == "text-embedding-ada-002":
        rs = self._openai_client.embeddings.create(input=texts, model=self.name)
    else:
        kwargs = {
            "input": texts,
            "model": self.name,
        }
        if self.dim:
            kwargs["dimensions"] = self.dim
        rs = self._openai_client.embeddings.create(**kwargs)
    return [v.embedding for v in rs.data]

`lancedb.embeddings.open_clip.OpenClipEmbeddings`

Bases: EmbeddingFunction

An embedding function that uses the OpenClip API For multi-modal text-to-image search

https://github.com/mlfoundations/open_clip

Source code in lancedb/embeddings/open_clip.py

@register("open-clip")
class OpenClipEmbeddings(EmbeddingFunction):
    """
    An embedding function that uses the OpenClip API
    For multi-modal text-to-image search

    https://github.com/mlfoundations/open_clip
    """

    name: str = "ViT-B-32"
    pretrained: str = "laion2b_s34b_b79k"
    device: str = "cpu"
    batch_size: int = 64
    normalize: bool = True
    _model = PrivateAttr()
    _preprocess = PrivateAttr()
    _tokenizer = PrivateAttr()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        open_clip = attempt_import_or_raise("open_clip", "open-clip")
        model, _, preprocess = open_clip.create_model_and_transforms(
            self.name, pretrained=self.pretrained
        )
        model.to(self.device)
        self._model, self._preprocess = model, preprocess
        self._tokenizer = open_clip.get_tokenizer(self.name)
        self._ndims = None

    def ndims(self):
        if self._ndims is None:
            self._ndims = self.generate_text_embeddings("foo").shape[0]
        return self._ndims

    def compute_query_embeddings(
        self, query: Union[str, "PIL.Image.Image"], *args, **kwargs
    ) -> List[np.ndarray]:
        """
        Compute the embeddings for a given user query

        Parameters
        ----------
        query : Union[str, PIL.Image.Image]
            The query to embed. A query can be either text or an image.
        """
        if isinstance(query, str):
            return [self.generate_text_embeddings(query)]
        else:
            PIL = attempt_import_or_raise("PIL", "pillow")
            if isinstance(query, PIL.Image.Image):
                return [self.generate_image_embedding(query)]
            else:
                raise TypeError("OpenClip supports str or PIL Image as query")

    def generate_text_embeddings(self, text: str) -> np.ndarray:
        torch = attempt_import_or_raise("torch")
        text = self.sanitize_input(text)
        text = self._tokenizer(text)
        text.to(self.device)
        with torch.no_grad():
            text_features = self._model.encode_text(text.to(self.device))
            if self.normalize:
                text_features /= text_features.norm(dim=-1, keepdim=True)
            return text_features.cpu().numpy().squeeze()

    def sanitize_input(self, images: IMAGES) -> Union[List[bytes], np.ndarray]:
        """
        Sanitize the input to the embedding function.
        """
        if isinstance(images, (str, bytes)):
            images = [images]
        elif isinstance(images, pa.Array):
            images = images.to_pylist()
        elif isinstance(images, pa.ChunkedArray):
            images = images.combine_chunks().to_pylist()
        return images

    def compute_source_embeddings(
        self, images: IMAGES, *args, **kwargs
    ) -> List[np.array]:
        """
        Get the embeddings for the given images
        """
        images = self.sanitize_input(images)
        embeddings = []
        for i in range(0, len(images), self.batch_size):
            j = min(i + self.batch_size, len(images))
            batch = images[i:j]
            embeddings.extend(self._parallel_get(batch))
        return embeddings

    def _parallel_get(self, images: Union[List[str], List[bytes]]) -> List[np.ndarray]:
        """
        Issue concurrent requests to retrieve the image data
        """
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(self.generate_image_embedding, image)
                for image in images
            ]
            return [future.result() for future in tqdm(futures)]

    def generate_image_embedding(
        self, image: Union[str, bytes, "PIL.Image.Image"]
    ) -> np.ndarray:
        """
        Generate the embedding for a single image

        Parameters
        ----------
        image : Union[str, bytes, PIL.Image.Image]
            The image to embed. If the image is a str, it is treated as a uri.
            If the image is bytes, it is treated as the raw image bytes.
        """
        torch = attempt_import_or_raise("torch")
        # TODO handle retry and errors for https
        image = self._to_pil(image)
        image = self._preprocess(image).unsqueeze(0)
        with torch.no_grad():
            return self._encode_and_normalize_image(image)

    def _to_pil(self, image: Union[str, bytes]):
        PIL = attempt_import_or_raise("PIL", "pillow")
        if isinstance(image, bytes):
            return PIL.Image.open(io.BytesIO(image))
        if isinstance(image, PIL.Image.Image):
            return image
        elif isinstance(image, str):
            parsed = urlparse.urlparse(image)
            # TODO handle drive letter on windows.
            if parsed.scheme == "file":
                return PIL.Image.open(parsed.path)
            elif parsed.scheme == "":
                return PIL.Image.open(image if os.name == "nt" else parsed.path)
            elif parsed.scheme.startswith("http"):
                return PIL.Image.open(io.BytesIO(url_retrieve(image)))
            else:
                raise NotImplementedError("Only local and http(s) urls are supported")

    def _encode_and_normalize_image(self, image_tensor: "torch.Tensor"):
        """
        encode a single image tensor and optionally normalize the output
        """
        image_features = self._model.encode_image(image_tensor.to(self.device))
        if self.normalize:
            image_features /= image_features.norm(dim=-1, keepdim=True)
        return image_features.cpu().numpy().squeeze()

`compute_query_embeddings(query: Union[str, PIL.Image.Image], *args, **kwargs) -> List[np.ndarray]`

Compute the embeddings for a given user query

Parameters:

Name	Type	Description	Default
`query`	`Union[str, Image]`	The query to embed. A query can be either text or an image.	required

Source code in lancedb/embeddings/open_clip.py

def compute_query_embeddings(
    self, query: Union[str, "PIL.Image.Image"], *args, **kwargs
) -> List[np.ndarray]:
    """
    Compute the embeddings for a given user query

    Parameters
    ----------
    query : Union[str, PIL.Image.Image]
        The query to embed. A query can be either text or an image.
    """
    if isinstance(query, str):
        return [self.generate_text_embeddings(query)]
    else:
        PIL = attempt_import_or_raise("PIL", "pillow")
        if isinstance(query, PIL.Image.Image):
            return [self.generate_image_embedding(query)]
        else:
            raise TypeError("OpenClip supports str or PIL Image as query")

`sanitize_input(images: IMAGES) -> Union[List[bytes], np.ndarray]`

Sanitize the input to the embedding function.

Source code in lancedb/embeddings/open_clip.py

def sanitize_input(self, images: IMAGES) -> Union[List[bytes], np.ndarray]:
    """
    Sanitize the input to the embedding function.
    """
    if isinstance(images, (str, bytes)):
        images = [images]
    elif isinstance(images, pa.Array):
        images = images.to_pylist()
    elif isinstance(images, pa.ChunkedArray):
        images = images.combine_chunks().to_pylist()
    return images

`compute_source_embeddings(images: IMAGES, *args, **kwargs) -> List[np.array]`

Get the embeddings for the given images

Source code in lancedb/embeddings/open_clip.py

def compute_source_embeddings(
    self, images: IMAGES, *args, **kwargs
) -> List[np.array]:
    """
    Get the embeddings for the given images
    """
    images = self.sanitize_input(images)
    embeddings = []
    for i in range(0, len(images), self.batch_size):
        j = min(i + self.batch_size, len(images))
        batch = images[i:j]
        embeddings.extend(self._parallel_get(batch))
    return embeddings

`generate_image_embedding(image: Union[str, bytes, PIL.Image.Image]) -> np.ndarray`

Generate the embedding for a single image

Parameters:

Name	Type	Description	Default
`image`	`Union[str, bytes, Image]`	The image to embed. If the image is a str, it is treated as a uri. If the image is bytes, it is treated as the raw image bytes.	required

Source code in lancedb/embeddings/open_clip.py

def generate_image_embedding(
    self, image: Union[str, bytes, "PIL.Image.Image"]
) -> np.ndarray:
    """
    Generate the embedding for a single image

    Parameters
    ----------
    image : Union[str, bytes, PIL.Image.Image]
        The image to embed. If the image is a str, it is treated as a uri.
        If the image is bytes, it is treated as the raw image bytes.
    """
    torch = attempt_import_or_raise("torch")
    # TODO handle retry and errors for https
    image = self._to_pil(image)
    image = self._preprocess(image).unsqueeze(0)
    with torch.no_grad():
        return self._encode_and_normalize_image(image)

`lancedb.embeddings.utils.with_embeddings(func: Callable, data: DATA, column: str = 'text', wrap_api: bool = True, show_progress: bool = False, batch_size: int = 1000) -> pa.Table`

Add a vector column to a table using the given embedding function.

The new columns will be called "vector".

Parameters:

Name	Type	Description	Default
`func`	`Callable`	A function that takes a list of strings and returns a list of vectors.	required
`data`	`Table or DataFrame`	The data to add an embedding column to.	required
`column`	`str`	The name of the column to use as input to the embedding function.	`"text"`
`wrap_api`	`bool`	Whether to wrap the embedding function in a retry and rate limiter.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`False`
`batch_size`	`int`	The number of row values to pass to each call of the embedding function.	`1000`

Returns:

Type	Description
`Table`	The input table with a new column called "vector" containing the embeddings.

Source code in lancedb/embeddings/utils.py

@deprecated
def with_embeddings(
    func: Callable,
    data: DATA,
    column: str = "text",
    wrap_api: bool = True,
    show_progress: bool = False,
    batch_size: int = 1000,
) -> pa.Table:
    """Add a vector column to a table using the given embedding function.

    The new columns will be called "vector".

    Parameters
    ----------
    func : Callable
        A function that takes a list of strings and returns a list of vectors.
    data : pa.Table or pd.DataFrame
        The data to add an embedding column to.
    column : str, default "text"
        The name of the column to use as input to the embedding function.
    wrap_api : bool, default True
        Whether to wrap the embedding function in a retry and rate limiter.
    show_progress : bool, default False
        Whether to show a progress bar.
    batch_size : int, default 1000
        The number of row values to pass to each call of the embedding function.

    Returns
    -------
    pa.Table
        The input table with a new column called "vector" containing the embeddings.
    """
    func = FunctionWrapper(func)
    if wrap_api:
        func = func.retry().rate_limit()
    func = func.batch_size(batch_size)
    if show_progress:
        func = func.show_progress()
    if pd is not None and isinstance(data, pd.DataFrame):
        data = pa.Table.from_pandas(data, preserve_index=False)
    embeddings = func(data[column].to_numpy())
    table = vec_to_table(np.array(embeddings))
    return data.append_column("vector", table["vector"])

Context

`lancedb.context.contextualize(raw_df: 'pd.DataFrame') -> Contextualizer`

Create a Contextualizer object for the given DataFrame.

Used to create context windows. Context windows are rolling subsets of text data.

The input text column should already be separated into rows that will be the unit of the window. So to create a context window over tokens, start with a DataFrame with one token per row. To create a context window over sentences, start with a DataFrame with one sentence per row.

Examples:

>>> from lancedb.context import contextualize
>>> import pandas as pd
>>> data = pd.DataFrame({
...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',
...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],
...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
... })

window determines how many rows to include in each window. In our case this how many tokens, but depending on the input data, it could be sentences, paragraphs, messages, etc.

>>> contextualize(data).window(3).stride(1).text_col('token').to_pandas()
                token  document_id
0     The quick brown            1
1     quick brown fox            1
2    brown fox jumped            1
3     fox jumped over            1
4     jumped over the            1
5       over the lazy            1
6        the lazy dog            1
7          lazy dog I            1
8          dog I love            1
9   I love sandwiches            2
10    love sandwiches            2
>>> (contextualize(data).window(7).stride(1).min_window_size(7)
...   .text_col('token').to_pandas())
                                  token  document_id
0   The quick brown fox jumped over the            1
1  quick brown fox jumped over the lazy            1
2    brown fox jumped over the lazy dog            1
3        fox jumped over the lazy dog I            1
4       jumped over the lazy dog I love            1
5   over the lazy dog I love sandwiches            1

stride determines how many rows to skip between each window start. This can be used to reduce the total number of windows generated.

>>> contextualize(data).window(4).stride(2).text_col('token').to_pandas()
                    token  document_id
0     The quick brown fox            1
2   brown fox jumped over            1
4    jumped over the lazy            1
6          the lazy dog I            1
8   dog I love sandwiches            1
10        love sandwiches            2

groupby determines how to group the rows. For example, we would like to have context windows that don't cross document boundaries. In this case, we can pass document_id as the group by.

>>> (contextualize(data)
...     .window(4).stride(2).text_col('token').groupby('document_id')
...     .to_pandas())
                   token  document_id
0    The quick brown fox            1
2  brown fox jumped over            1
4   jumped over the lazy            1
6           the lazy dog            1
9      I love sandwiches            2

min_window_size determines the minimum size of the context windows that are generated.This can be used to trim the last few context windows which have size less than min_window_size. By default context windows of size 1 are skipped.

>>> (contextualize(data)
...     .window(6).stride(3).text_col('token').groupby('document_id')
...     .to_pandas())
                             token  document_id
0  The quick brown fox jumped over            1
3     fox jumped over the lazy dog            1
6                     the lazy dog            1
9                I love sandwiches            2

>>> (contextualize(data)
...     .window(6).stride(3).min_window_size(4).text_col('token')
...     .groupby('document_id')
...     .to_pandas())
                             token  document_id
0  The quick brown fox jumped over            1
3     fox jumped over the lazy dog            1

Source code in lancedb/context.py

def contextualize(raw_df: "pd.DataFrame") -> Contextualizer:
    """Create a Contextualizer object for the given DataFrame.

    Used to create context windows. Context windows are rolling subsets of text
    data.

    The input text column should already be separated into rows that will be the
    unit of the window. So to create a context window over tokens, start with
    a DataFrame with one token per row. To create a context window over sentences,
    start with a DataFrame with one sentence per row.

    Examples
    --------
    >>> from lancedb.context import contextualize
    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    'token': ['The', 'quick', 'brown', 'fox', 'jumped', 'over',
    ...              'the', 'lazy', 'dog', 'I', 'love', 'sandwiches'],
    ...    'document_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
    ... })

    ``window`` determines how many rows to include in each window. In our case
    this how many tokens, but depending on the input data, it could be sentences,
    paragraphs, messages, etc.

    >>> contextualize(data).window(3).stride(1).text_col('token').to_pandas()
                    token  document_id
    0     The quick brown            1
    1     quick brown fox            1
    2    brown fox jumped            1
    3     fox jumped over            1
    4     jumped over the            1
    5       over the lazy            1
    6        the lazy dog            1
    7          lazy dog I            1
    8          dog I love            1
    9   I love sandwiches            2
    10    love sandwiches            2
    >>> (contextualize(data).window(7).stride(1).min_window_size(7)
    ...   .text_col('token').to_pandas())
                                      token  document_id
    0   The quick brown fox jumped over the            1
    1  quick brown fox jumped over the lazy            1
    2    brown fox jumped over the lazy dog            1
    3        fox jumped over the lazy dog I            1
    4       jumped over the lazy dog I love            1
    5   over the lazy dog I love sandwiches            1

    ``stride`` determines how many rows to skip between each window start. This can
    be used to reduce the total number of windows generated.

    >>> contextualize(data).window(4).stride(2).text_col('token').to_pandas()
                        token  document_id
    0     The quick brown fox            1
    2   brown fox jumped over            1
    4    jumped over the lazy            1
    6          the lazy dog I            1
    8   dog I love sandwiches            1
    10        love sandwiches            2

    ``groupby`` determines how to group the rows. For example, we would like to have
    context windows that don't cross document boundaries. In this case, we can
    pass ``document_id`` as the group by.

    >>> (contextualize(data)
    ...     .window(4).stride(2).text_col('token').groupby('document_id')
    ...     .to_pandas())
                       token  document_id
    0    The quick brown fox            1
    2  brown fox jumped over            1
    4   jumped over the lazy            1
    6           the lazy dog            1
    9      I love sandwiches            2

    ``min_window_size`` determines the minimum size of the context windows
    that are generated.This can be used to trim the last few context windows
    which have size less than ``min_window_size``.
    By default context windows of size 1 are skipped.

    >>> (contextualize(data)
    ...     .window(6).stride(3).text_col('token').groupby('document_id')
    ...     .to_pandas())
                                 token  document_id
    0  The quick brown fox jumped over            1
    3     fox jumped over the lazy dog            1
    6                     the lazy dog            1
    9                I love sandwiches            2

    >>> (contextualize(data)
    ...     .window(6).stride(3).min_window_size(4).text_col('token')
    ...     .groupby('document_id')
    ...     .to_pandas())
                                 token  document_id
    0  The quick brown fox jumped over            1
    3     fox jumped over the lazy dog            1

    """
    return Contextualizer(raw_df)

`lancedb.context.Contextualizer`

Create context windows from a DataFrame. See lancedb.context.contextualize.

Source code in lancedb/context.py

class Contextualizer:
    """Create context windows from a DataFrame.
    See [lancedb.context.contextualize][].
    """

    def __init__(self, raw_df):
        self._text_col = None
        self._groupby = None
        self._stride = None
        self._window = None
        self._min_window_size = 2
        self._raw_df = raw_df

    def window(self, window: int) -> Contextualizer:
        """Set the window size. i.e., how many rows to include in each window.

        Parameters
        ----------
        window: int
            The window size.
        """
        self._window = window
        return self

    def stride(self, stride: int) -> Contextualizer:
        """Set the stride. i.e., how many rows to skip between each window.

        Parameters
        ----------
        stride: int
            The stride.
        """
        self._stride = stride
        return self

    def groupby(self, groupby: str) -> Contextualizer:
        """Set the groupby column. i.e., how to group the rows.
        Windows don't cross groups

        Parameters
        ----------
        groupby: str
            The groupby column.
        """
        self._groupby = groupby
        return self

    def text_col(self, text_col: str) -> Contextualizer:
        """Set the text column used to make the context window.

        Parameters
        ----------
        text_col: str
            The text column.
        """
        self._text_col = text_col
        return self

    def min_window_size(self, min_window_size: int) -> Contextualizer:
        """Set the (optional) min_window_size size for the context window.

        Parameters
        ----------
        min_window_size: int
            The min_window_size.
        """
        self._min_window_size = min_window_size
        return self

    @deprecation.deprecated(
        deprecated_in="0.3.1",
        removed_in="0.4.0",
        current_version=__version__,
        details="Use to_pandas() instead",
    )
    def to_df(self) -> "pd.DataFrame":
        return self.to_pandas()

    def to_pandas(self) -> "pd.DataFrame":
        """Create the context windows and return a DataFrame."""
        if pd is None:
            raise ImportError(
                "pandas is required to create context windows using lancedb"
            )

        if self._text_col not in self._raw_df.columns.tolist():
            raise MissingColumnError(self._text_col)

        if self._window is None or self._window < 1:
            raise MissingValueError(
                "The value of window is None or less than 1. Specify the "
                "window size (number of rows to include in each window)"
            )

        if self._stride is None or self._stride < 1:
            raise MissingValueError(
                "The value of stride is None or less than 1. Specify the "
                "stride (number of rows to skip between each window)"
            )

        def process_group(grp):
            # For each group, create the text rolling window
            # with values of size >= min_window_size
            text = grp[self._text_col].values
            contexts = grp.iloc[:: self._stride, :].copy()
            windows = [
                " ".join(text[start_i : min(start_i + self._window, len(grp))])
                for start_i in range(0, len(grp), self._stride)
                if start_i + self._window <= len(grp)
                or len(grp) - start_i >= self._min_window_size
            ]
            # if last few rows dropped
            if len(windows) < len(contexts):
                contexts = contexts.iloc[: len(windows)]
            contexts[self._text_col] = windows
            return contexts

        if self._groupby is None:
            return process_group(self._raw_df)
        # concat result from all groups
        return pd.concat(
            [process_group(grp) for _, grp in self._raw_df.groupby(self._groupby)]
        )

`window(window: int) -> Contextualizer`

Set the window size. i.e., how many rows to include in each window.

Parameters:

Name	Type	Description	Default
`window`	`int`	The window size.	required

Source code in lancedb/context.py

def window(self, window: int) -> Contextualizer:
    """Set the window size. i.e., how many rows to include in each window.

    Parameters
    ----------
    window: int
        The window size.
    """
    self._window = window
    return self

`stride(stride: int) -> Contextualizer`

Set the stride. i.e., how many rows to skip between each window.

Parameters:

Name	Type	Description	Default
`stride`	`int`	The stride.	required

Source code in lancedb/context.py

def stride(self, stride: int) -> Contextualizer:
    """Set the stride. i.e., how many rows to skip between each window.

    Parameters
    ----------
    stride: int
        The stride.
    """
    self._stride = stride
    return self

`groupby(groupby: str) -> Contextualizer`

Set the groupby column. i.e., how to group the rows. Windows don't cross groups

Parameters:

Name	Type	Description	Default
`groupby`	`str`	The groupby column.	required

Source code in lancedb/context.py

def groupby(self, groupby: str) -> Contextualizer:
    """Set the groupby column. i.e., how to group the rows.
    Windows don't cross groups

    Parameters
    ----------
    groupby: str
        The groupby column.
    """
    self._groupby = groupby
    return self

`text_col(text_col: str) -> Contextualizer`

Set the text column used to make the context window.

Parameters:

Name	Type	Description	Default
`text_col`	`str`	The text column.	required

Source code in lancedb/context.py

def text_col(self, text_col: str) -> Contextualizer:
    """Set the text column used to make the context window.

    Parameters
    ----------
    text_col: str
        The text column.
    """
    self._text_col = text_col
    return self

`min_window_size(min_window_size: int) -> Contextualizer`

Set the (optional) min_window_size size for the context window.

Parameters:

Name	Type	Description	Default
`min_window_size`	`int`	The min_window_size.	required

Source code in lancedb/context.py

def min_window_size(self, min_window_size: int) -> Contextualizer:
    """Set the (optional) min_window_size size for the context window.

    Parameters
    ----------
    min_window_size: int
        The min_window_size.
    """
    self._min_window_size = min_window_size
    return self

`to_pandas() -> 'pd.DataFrame'`

Create the context windows and return a DataFrame.

Source code in lancedb/context.py

def to_pandas(self) -> "pd.DataFrame":
    """Create the context windows and return a DataFrame."""
    if pd is None:
        raise ImportError(
            "pandas is required to create context windows using lancedb"
        )

    if self._text_col not in self._raw_df.columns.tolist():
        raise MissingColumnError(self._text_col)

    if self._window is None or self._window < 1:
        raise MissingValueError(
            "The value of window is None or less than 1. Specify the "
            "window size (number of rows to include in each window)"
        )

    if self._stride is None or self._stride < 1:
        raise MissingValueError(
            "The value of stride is None or less than 1. Specify the "
            "stride (number of rows to skip between each window)"
        )

    def process_group(grp):
        # For each group, create the text rolling window
        # with values of size >= min_window_size
        text = grp[self._text_col].values
        contexts = grp.iloc[:: self._stride, :].copy()
        windows = [
            " ".join(text[start_i : min(start_i + self._window, len(grp))])
            for start_i in range(0, len(grp), self._stride)
            if start_i + self._window <= len(grp)
            or len(grp) - start_i >= self._min_window_size
        ]
        # if last few rows dropped
        if len(windows) < len(contexts):
            contexts = contexts.iloc[: len(windows)]
        contexts[self._text_col] = windows
        return contexts

    if self._groupby is None:
        return process_group(self._raw_df)
    # concat result from all groups
    return pd.concat(
        [process_group(grp) for _, grp in self._raw_df.groupby(self._groupby)]
    )

Full text search

`lancedb.fts.create_index(index_path: str, text_fields: List[str], ordering_fields: List[str] = None, tokenizer_name: str = 'default') -> tantivy.Index`

Create a new Index (not populated)

Parameters:

Name	Type	Description	Default
`index_path`	`str`	Path to the index directory	required
`text_fields`	`List[str]`	List of text fields to index	required
`ordering_fields`	`List[str]`	List of unsigned type fields to order by at search time	`None`
`tokenizer_name`	`str`	The tokenizer to use	`"default"`

Returns:

Name	Type	Description
`index`	`Index`	The index object (not yet populated)

Source code in lancedb/fts.py

def create_index(
    index_path: str,
    text_fields: List[str],
    ordering_fields: List[str] = None,
    tokenizer_name: str = "default",
) -> tantivy.Index:
    """
    Create a new Index (not populated)

    Parameters
    ----------
    index_path : str
        Path to the index directory
    text_fields : List[str]
        List of text fields to index
    ordering_fields: List[str]
        List of unsigned type fields to order by at search time
    tokenizer_name : str, default "default"
        The tokenizer to use

    Returns
    -------
    index : tantivy.Index
        The index object (not yet populated)
    """
    if ordering_fields is None:
        ordering_fields = []
    # Declaring our schema.
    schema_builder = tantivy.SchemaBuilder()
    # special field that we'll populate with row_id
    schema_builder.add_integer_field("doc_id", stored=True)
    # data fields
    for name in text_fields:
        schema_builder.add_text_field(name, stored=True, tokenizer_name=tokenizer_name)
    if ordering_fields:
        for name in ordering_fields:
            schema_builder.add_unsigned_field(name, fast=True)
    schema = schema_builder.build()
    os.makedirs(index_path, exist_ok=True)
    index = tantivy.Index(schema, path=index_path)
    return index

`lancedb.fts.populate_index(index: tantivy.Index, table: LanceTable, fields: List[str], writer_heap_size: int = 1024 * 1024 * 1024, ordering_fields: List[str] = None) -> int`

Populate an index with data from a LanceTable

Parameters:

Name	Type	Description	Default
`index`	`Index`	The index object	required
`table`	`LanceTable`	The table to index	required
`fields`	`List[str]`	List of fields to index	required
`writer_heap_size`	`int`	The writer heap size in bytes, defaults to 1GB	`1024 * 1024 * 1024`

Returns:

Type	Description
`int`	The number of rows indexed

Source code in lancedb/fts.py

def populate_index(
    index: tantivy.Index,
    table: LanceTable,
    fields: List[str],
    writer_heap_size: int = 1024 * 1024 * 1024,
    ordering_fields: List[str] = None,
) -> int:
    """
    Populate an index with data from a LanceTable

    Parameters
    ----------
    index : tantivy.Index
        The index object
    table : LanceTable
        The table to index
    fields : List[str]
        List of fields to index
    writer_heap_size : int
        The writer heap size in bytes, defaults to 1GB

    Returns
    -------
    int
        The number of rows indexed
    """
    if ordering_fields is None:
        ordering_fields = []
    # first check the fields exist and are string or large string type
    nested = []

    for name in fields:
        try:
            f = table.schema.field(name)  # raises KeyError if not found
        except KeyError:
            f = resolve_path(table.schema, name)
            nested.append(name)

        if not pa.types.is_string(f.type) and not pa.types.is_large_string(f.type):
            raise TypeError(f"Field {name} is not a string type")

    # create a tantivy writer
    writer = index.writer(heap_size=writer_heap_size)
    # write data into index
    dataset = table.to_lance()
    row_id = 0

    max_nested_level = 0
    if len(nested) > 0:
        max_nested_level = max([len(name.split(".")) for name in nested])

    for b in dataset.to_batches(columns=fields + ordering_fields):
        if max_nested_level > 0:
            b = pa.Table.from_batches([b])
            for _ in range(max_nested_level - 1):
                b = b.flatten()
        for i in range(b.num_rows):
            doc = tantivy.Document()
            for name in fields:
                value = b[name][i].as_py()
                if value is not None:
                    doc.add_text(name, value)
            for name in ordering_fields:
                value = b[name][i].as_py()
                if value is not None:
                    doc.add_unsigned(name, value)
            if not doc.is_empty:
                doc.add_integer("doc_id", row_id)
                writer.add_document(doc)
            row_id += 1
    # commit changes
    writer.commit()
    return row_id

`lancedb.fts.search_index(index: tantivy.Index, query: str, limit: int = 10, ordering_field=None) -> Tuple[Tuple[int], Tuple[float]]`

Search an index for a query

Parameters:

Name	Type	Description	Default
`index`	`Index`	The index object	required
`query`	`str`	The query string	required
`limit`	`int`	The maximum number of results to return	`10`

Returns:

Name	Type	Description
`ids_and_score`	`list[tuple[int], tuple[float]]`	A tuple of two tuples, the first containing the document ids and the second containing the scores

Source code in lancedb/fts.py

def search_index(
    index: tantivy.Index, query: str, limit: int = 10, ordering_field=None
) -> Tuple[Tuple[int], Tuple[float]]:
    """
    Search an index for a query

    Parameters
    ----------
    index : tantivy.Index
        The index object
    query : str
        The query string
    limit : int
        The maximum number of results to return

    Returns
    -------
    ids_and_score: list[tuple[int], tuple[float]]
        A tuple of two tuples, the first containing the document ids
        and the second containing the scores
    """
    searcher = index.searcher()
    query = index.parse_query(query)
    # get top results
    if ordering_field:
        results = searcher.search(query, limit, order_by_field=ordering_field)
    else:
        results = searcher.search(query, limit)
    if results.count == 0:
        return tuple(), tuple()
    return tuple(
        zip(
            *[
                (searcher.doc(doc_address)["doc_id"][0], score)
                for score, doc_address in results.hits
            ]
        )
    )

Utilities

`lancedb.schema.vector(dimension: int, value_type: pa.DataType = pa.float32()) -> pa.DataType`

A help function to create a vector type.

Parameters:

Name	Type	Description	Default
`dimension`	`int`		required
`value_type`	`DataType`	The type of the value in the vector.	`float32()`

Returns:

Type	Description
`A PyArrow DataType for vectors.`

Examples:

>>> import pyarrow as pa
>>> import lancedb
>>> schema = pa.schema([
...     pa.field("id", pa.int64()),
...     pa.field("vector", lancedb.vector(756)),
... ])

Source code in lancedb/schema.py

def vector(dimension: int, value_type: pa.DataType = pa.float32()) -> pa.DataType:
    """A help function to create a vector type.

    Parameters
    ----------
    dimension: The dimension of the vector.
    value_type: pa.DataType, optional
        The type of the value in the vector.

    Returns
    -------
    A PyArrow DataType for vectors.

    Examples
    --------

    >>> import pyarrow as pa
    >>> import lancedb
    >>> schema = pa.schema([
    ...     pa.field("id", pa.int64()),
    ...     pa.field("vector", lancedb.vector(756)),
    ... ])
    """
    return pa.list_(value_type, dimension)

`lancedb.merge.LanceMergeInsertBuilder`

Bases: object

Builder for a LanceDB merge insert operation

See merge_insert for more context

Source code in lancedb/merge.py

class LanceMergeInsertBuilder(object):
    """Builder for a LanceDB merge insert operation

    See [`merge_insert`][lancedb.table.Table.merge_insert] for
    more context
    """

    def __init__(self, table: "Table", on: List[str]):  # noqa: F821
        # Do not put a docstring here.  This method should be hidden
        # from API docs.  Users should use merge_insert to create
        # this object.
        self._table = table
        self._on = on
        self._when_matched_update_all = False
        self._when_matched_update_all_condition = None
        self._when_not_matched_insert_all = False
        self._when_not_matched_by_source_delete = False
        self._when_not_matched_by_source_condition = None

    def when_matched_update_all(
        self, *, where: Optional[str] = None
    ) -> LanceMergeInsertBuilder:
        """
        Rows that exist in both the source table (new data) and
        the target table (old data) will be updated, replacing
        the old row with the corresponding matching row.

        If there are multiple matches then the behavior is undefined.
        Currently this causes multiple copies of the row to be created
        but that behavior is subject to change.
        """
        self._when_matched_update_all = True
        self._when_matched_update_all_condition = where
        return self

    def when_not_matched_insert_all(self) -> LanceMergeInsertBuilder:
        """
        Rows that exist only in the source table (new data) should
        be inserted into the target table.
        """
        self._when_not_matched_insert_all = True
        return self

    def when_not_matched_by_source_delete(
        self, condition: Optional[str] = None
    ) -> LanceMergeInsertBuilder:
        """
        Rows that exist only in the target table (old data) will be
        deleted.  An optional condition can be provided to limit what
        data is deleted.

        Parameters
        ----------
        condition: Optional[str], default None
            If None then all such rows will be deleted.  Otherwise the
            condition will be used as an SQL filter to limit what rows
            are deleted.
        """
        self._when_not_matched_by_source_delete = True
        if condition is not None:
            self._when_not_matched_by_source_condition = condition
        return self

    def execute(
        self,
        new_data: DATA,
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
    ):
        """
        Executes the merge insert operation

        Nothing is returned but the [`Table`][lancedb.table.Table] is updated

        Parameters
        ----------
        new_data: DATA
            New records which will be matched against the existing records
            to potentially insert or update into the table.  This parameter
            can be anything you use for [`add`][lancedb.table.Table.add]
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
        """
        self._table._do_merge(self, new_data, on_bad_vectors, fill_value)

`when_matched_update_all(*, where: Optional[str] = None) -> LanceMergeInsertBuilder`

Rows that exist in both the source table (new data) and the target table (old data) will be updated, replacing the old row with the corresponding matching row.

If there are multiple matches then the behavior is undefined. Currently this causes multiple copies of the row to be created but that behavior is subject to change.

Source code in lancedb/merge.py

def when_matched_update_all(
    self, *, where: Optional[str] = None
) -> LanceMergeInsertBuilder:
    """
    Rows that exist in both the source table (new data) and
    the target table (old data) will be updated, replacing
    the old row with the corresponding matching row.

    If there are multiple matches then the behavior is undefined.
    Currently this causes multiple copies of the row to be created
    but that behavior is subject to change.
    """
    self._when_matched_update_all = True
    self._when_matched_update_all_condition = where
    return self

`when_not_matched_insert_all() -> LanceMergeInsertBuilder`

Rows that exist only in the source table (new data) should be inserted into the target table.

Source code in lancedb/merge.py

def when_not_matched_insert_all(self) -> LanceMergeInsertBuilder:
    """
    Rows that exist only in the source table (new data) should
    be inserted into the target table.
    """
    self._when_not_matched_insert_all = True
    return self

`when_not_matched_by_source_delete(condition: Optional[str] = None) -> LanceMergeInsertBuilder`

Rows that exist only in the target table (old data) will be deleted. An optional condition can be provided to limit what data is deleted.

Parameters:

Name	Type	Description	Default
`condition`	`Optional[str]`	If None then all such rows will be deleted. Otherwise the condition will be used as an SQL filter to limit what rows are deleted.	`None`

Source code in lancedb/merge.py

def when_not_matched_by_source_delete(
    self, condition: Optional[str] = None
) -> LanceMergeInsertBuilder:
    """
    Rows that exist only in the target table (old data) will be
    deleted.  An optional condition can be provided to limit what
    data is deleted.

    Parameters
    ----------
    condition: Optional[str], default None
        If None then all such rows will be deleted.  Otherwise the
        condition will be used as an SQL filter to limit what rows
        are deleted.
    """
    self._when_not_matched_by_source_delete = True
    if condition is not None:
        self._when_not_matched_by_source_condition = condition
    return self

`execute(new_data: DATA, on_bad_vectors: str = 'error', fill_value: float = 0.0)`

Executes the merge insert operation

Nothing is returned but the Table is updated

Parameters:

Name	Type	Description	Default
`new_data`	`DATA`	New records which will be matched against the existing records to potentially insert or update into the table. This parameter can be anything you use for `add`	required
`on_bad_vectors`	`str`	What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".	`'error'`
`fill_value`	`float`	The value to use when filling vectors. Only used if on_bad_vectors="fill".	`0.0`

Source code in lancedb/merge.py

def execute(
    self,
    new_data: DATA,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
):
    """
    Executes the merge insert operation

    Nothing is returned but the [`Table`][lancedb.table.Table] is updated

    Parameters
    ----------
    new_data: DATA
        New records which will be matched against the existing records
        to potentially insert or update into the table.  This parameter
        can be anything you use for [`add`][lancedb.table.Table.add]
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".
    """
    self._table._do_merge(self, new_data, on_bad_vectors, fill_value)

Integrations

Pydantic

`lancedb.pydantic.pydantic_to_schema(model: Type[pydantic.BaseModel]) -> pa.Schema`

Convert a Pydantic model to a PyArrow Schema.

Parameters:

Name	Type	Description	Default
`model`	`Type[BaseModel]`	The Pydantic BaseModel to convert to Arrow Schema.	required

Returns:

Type	Description
`Schema`

Examples:

>>> from typing import List, Optional
>>> import pydantic
>>> from lancedb.pydantic import pydantic_to_schema
>>> class FooModel(pydantic.BaseModel):
...     id: int
...     s: str
...     vec: List[float]
...     li: List[int]
...
>>> schema = pydantic_to_schema(FooModel)
>>> assert schema == pa.schema([
...     pa.field("id", pa.int64(), False),
...     pa.field("s", pa.utf8(), False),
...     pa.field("vec", pa.list_(pa.float64()), False),
...     pa.field("li", pa.list_(pa.int64()), False),
... ])

Source code in lancedb/pydantic.py

def pydantic_to_schema(model: Type[pydantic.BaseModel]) -> pa.Schema:
    """Convert a Pydantic model to a PyArrow Schema.

    Parameters
    ----------
    model : Type[pydantic.BaseModel]
        The Pydantic BaseModel to convert to Arrow Schema.

    Returns
    -------
    pyarrow.Schema

    Examples
    --------

    >>> from typing import List, Optional
    >>> import pydantic
    >>> from lancedb.pydantic import pydantic_to_schema
    >>> class FooModel(pydantic.BaseModel):
    ...     id: int
    ...     s: str
    ...     vec: List[float]
    ...     li: List[int]
    ...
    >>> schema = pydantic_to_schema(FooModel)
    >>> assert schema == pa.schema([
    ...     pa.field("id", pa.int64(), False),
    ...     pa.field("s", pa.utf8(), False),
    ...     pa.field("vec", pa.list_(pa.float64()), False),
    ...     pa.field("li", pa.list_(pa.int64()), False),
    ... ])
    """
    fields = _pydantic_model_to_fields(model)
    return pa.schema(fields)

`lancedb.pydantic.vector(dim: int, value_type: pa.DataType = pa.float32())`

Source code in lancedb/pydantic.py

def vector(dim: int, value_type: pa.DataType = pa.float32()):
    # TODO: remove in future release
    from warnings import warn

    warn(
        "lancedb.pydantic.vector() is deprecated, use lancedb.pydantic.Vector instead."
        "This function will be removed in future release",
        DeprecationWarning,
    )
    return Vector(dim, value_type)

`lancedb.pydantic.LanceModel`

Bases: BaseModel

A Pydantic Model base class that can be converted to a LanceDB Table.

Examples:

>>> import lancedb
>>> from lancedb.pydantic import LanceModel, Vector
>>>
>>> class TestModel(LanceModel):
...     name: str
...     vector: Vector(2)
...
>>> db = lancedb.connect("./example")
>>> table = db.create_table("test", schema=TestModel.to_arrow_schema())
>>> table.add([
...     TestModel(name="test", vector=[1.0, 2.0])
... ])
>>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)
[TestModel(name='test', vector=FixedSizeList(dim=2))]

Source code in lancedb/pydantic.py

class LanceModel(pydantic.BaseModel):
    """
    A Pydantic Model base class that can be converted to a LanceDB Table.

    Examples
    --------
    >>> import lancedb
    >>> from lancedb.pydantic import LanceModel, Vector
    >>>
    >>> class TestModel(LanceModel):
    ...     name: str
    ...     vector: Vector(2)
    ...
    >>> db = lancedb.connect("./example")
    >>> table = db.create_table("test", schema=TestModel.to_arrow_schema())
    >>> table.add([
    ...     TestModel(name="test", vector=[1.0, 2.0])
    ... ])
    >>> table.search([0., 0.]).limit(1).to_pydantic(TestModel)
    [TestModel(name='test', vector=FixedSizeList(dim=2))]
    """

    @classmethod
    def to_arrow_schema(cls):
        """
        Get the Arrow Schema for this model.
        """
        schema = pydantic_to_schema(cls)
        functions = cls.parse_embedding_functions()
        if len(functions) > 0:
            # Prevent circular import
            from .embeddings import EmbeddingFunctionRegistry

            metadata = EmbeddingFunctionRegistry.get_instance().get_table_metadata(
                functions
            )
            schema = schema.with_metadata(metadata)
        return schema

    @classmethod
    def field_names(cls) -> List[str]:
        """
        Get the field names of this model.
        """
        return list(cls.safe_get_fields().keys())

    @classmethod
    def safe_get_fields(cls):
        if PYDANTIC_VERSION.major < 2:
            return cls.__fields__
        return cls.model_fields

    @classmethod
    def parse_embedding_functions(cls) -> List["EmbeddingFunctionConfig"]:
        """
        Parse the embedding functions from this model.
        """
        from .embeddings import EmbeddingFunctionConfig

        vec_and_function = []
        for name, field_info in cls.safe_get_fields().items():
            func = get_extras(field_info, "vector_column_for")
            if func is not None:
                vec_and_function.append([name, func])

        configs = []
        for vec, func in vec_and_function:
            for source, field_info in cls.safe_get_fields().items():
                src_func = get_extras(field_info, "source_column_for")
                if src_func is func:
                    # note we can't use == here since the function is a pydantic
                    # model so two instances of the same function are ==, so if you
                    # have multiple vector columns from multiple sources, both will
                    # be mapped to the same source column
                    # GH594
                    configs.append(
                        EmbeddingFunctionConfig(
                            source_column=source, vector_column=vec, function=func
                        )
                    )
        return configs

`to_arrow_schema()` `classmethod`

Get the Arrow Schema for this model.

Source code in lancedb/pydantic.py

@classmethod
def to_arrow_schema(cls):
    """
    Get the Arrow Schema for this model.
    """
    schema = pydantic_to_schema(cls)
    functions = cls.parse_embedding_functions()
    if len(functions) > 0:
        # Prevent circular import
        from .embeddings import EmbeddingFunctionRegistry

        metadata = EmbeddingFunctionRegistry.get_instance().get_table_metadata(
            functions
        )
        schema = schema.with_metadata(metadata)
    return schema

`field_names() -> List[str]` `classmethod`

Get the field names of this model.

Source code in lancedb/pydantic.py

@classmethod
def field_names(cls) -> List[str]:
    """
    Get the field names of this model.
    """
    return list(cls.safe_get_fields().keys())

`parse_embedding_functions() -> List['EmbeddingFunctionConfig']` `classmethod`

Parse the embedding functions from this model.

Source code in lancedb/pydantic.py

@classmethod
def parse_embedding_functions(cls) -> List["EmbeddingFunctionConfig"]:
    """
    Parse the embedding functions from this model.
    """
    from .embeddings import EmbeddingFunctionConfig

    vec_and_function = []
    for name, field_info in cls.safe_get_fields().items():
        func = get_extras(field_info, "vector_column_for")
        if func is not None:
            vec_and_function.append([name, func])

    configs = []
    for vec, func in vec_and_function:
        for source, field_info in cls.safe_get_fields().items():
            src_func = get_extras(field_info, "source_column_for")
            if src_func is func:
                # note we can't use == here since the function is a pydantic
                # model so two instances of the same function are ==, so if you
                # have multiple vector columns from multiple sources, both will
                # be mapped to the same source column
                # GH594
                configs.append(
                    EmbeddingFunctionConfig(
                        source_column=source, vector_column=vec, function=func
                    )
                )
    return configs

Reranking

`lancedb.rerankers.linear_combination.LinearCombinationReranker`

Bases: Reranker

Reranks the results using a linear combination of the scores from the vector and FTS search. For missing scores, fill with fill value.

Parameters:

Name	Type	Description	Default
`weight`	`float`	The weight to give to the vector score. Must be between 0 and 1.	`0.7`
`fill`	`float`	The score to give to results that are only in one of the two result sets. This is treated as penalty, so a higher value means a lower score. TODO: We should just hardcode this-- its pretty confusing as we invert scores to calculate final score	`1.0`
`return_score`	`str`	opntions are "relevance" or "all" The type of score to return. If "relevance", will return only the relevance score. If "all", will return all scores from the vector and FTS search along with the relevance score.	`"relevance"`

Source code in lancedb/rerankers/linear_combination.py

class LinearCombinationReranker(Reranker):
    """
    Reranks the results using a linear combination of the scores from the
    vector and FTS search. For missing scores, fill with `fill` value.
    Parameters
    ----------
    weight : float, default 0.7
        The weight to give to the vector score. Must be between 0 and 1.
    fill : float, default 1.0
        The score to give to results that are only in one of the two result sets.
        This is treated as penalty, so a higher value means a lower score.
        TODO: We should just hardcode this--
        its pretty confusing as we invert scores to calculate final score
    return_score : str, default "relevance"
        opntions are "relevance" or "all"
        The type of score to return. If "relevance", will return only the relevance
        score. If "all", will return all scores from the vector and FTS search along
        with the relevance score.
    """

    def __init__(
        self, weight: float = 0.7, fill: float = 1.0, return_score="relevance"
    ):
        if weight < 0 or weight > 1:
            raise ValueError("weight must be between 0 and 1.")
        super().__init__(return_score)
        self.weight = weight
        self.fill = fill

    def rerank_hybrid(
        self,
        query: str,  # noqa: F821
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results, self.fill)

        return combined_results

    def merge_results(
        self, vector_results: pa.Table, fts_results: pa.Table, fill: float
    ):
        # If both are empty then just return an empty table
        if len(vector_results) == 0 and len(fts_results) == 0:
            return vector_results
        # If one is empty then return the other
        if len(vector_results) == 0:
            return fts_results
        if len(fts_results) == 0:
            return vector_results

        # sort both input tables on _rowid
        combined_list = []
        vector_list = vector_results.sort_by("_rowid").to_pylist()
        fts_list = fts_results.sort_by("_rowid").to_pylist()
        i, j = 0, 0
        while i < len(vector_list):
            if j >= len(fts_list):
                for vi in vector_list[i:]:
                    vi["_relevance_score"] = self._combine_score(vi["_distance"], fill)
                    combined_list.append(vi)
                break

            vi = vector_list[i]
            fj = fts_list[j]
            # invert the fts score from relevance to distance
            inverted_fts_score = self._invert_score(fj["score"])
            if vi["_rowid"] == fj["_rowid"]:
                vi["_relevance_score"] = self._combine_score(
                    vi["_distance"], inverted_fts_score
                )
                vi["score"] = fj["score"]  # keep the original score
                combined_list.append(vi)
                i += 1
                j += 1
            elif vector_list[i]["_rowid"] < fts_list[j]["_rowid"]:
                vi["_relevance_score"] = self._combine_score(vi["_distance"], fill)
                combined_list.append(vi)
                i += 1
            else:
                fj["_relevance_score"] = self._combine_score(inverted_fts_score, fill)
                combined_list.append(fj)
                j += 1
        if j < len(fts_list) - 1:
            for fj in fts_list[j:]:
                fj["_relevance_score"] = self._combine_score(inverted_fts_score, fill)
                combined_list.append(fj)

        relevance_score_schema = pa.schema(
            [
                pa.field("_relevance_score", pa.float32()),
            ]
        )
        combined_schema = pa.unify_schemas(
            [vector_results.schema, fts_results.schema, relevance_score_schema]
        )
        tbl = pa.Table.from_pylist(combined_list, schema=combined_schema).sort_by(
            [("_relevance_score", "descending")]
        )
        if self.score == "relevance":
            tbl = tbl.drop_columns(["score", "_distance"])
        return tbl

    def _combine_score(self, score1, score2):
        # these scores represent distance
        return 1 - (self.weight * score1 + (1 - self.weight) * score2)

    def _invert_score(self, score: float):
        # Invert the score between relevance and distance
        return 1 - score

`lancedb.rerankers.cohere.CohereReranker`

Bases: Reranker

Reranks the results using the Cohere Rerank API. https://docs.cohere.com/docs/rerank-guide

Parameters:

Name	Type	Description	Default
`model_name`	`str`	The name of the cross encoder model to use. Available cohere models are: - rerank-english-v2.0 - rerank-multilingual-v2.0	`"rerank-english-v2.0"`
`column`	`str`	The name of the column to use as input to the cross encoder model.	`"text"`
`top_n`	`str`	The number of results to return. If None, will return all results.	`None`

Source code in lancedb/rerankers/cohere.py

class CohereReranker(Reranker):
    """
    Reranks the results using the Cohere Rerank API.
    https://docs.cohere.com/docs/rerank-guide

    Parameters
    ----------
    model_name : str, default "rerank-english-v2.0"
        The name of the cross encoder model to use. Available cohere models are:
        - rerank-english-v2.0
        - rerank-multilingual-v2.0
    column : str, default "text"
        The name of the column to use as input to the cross encoder model.
    top_n : str, default None
        The number of results to return. If None, will return all results.
    """

    def __init__(
        self,
        model_name: str = "rerank-english-v2.0",
        column: str = "text",
        top_n: Union[int, None] = None,
        return_score="relevance",
        api_key: Union[str, None] = None,
    ):
        super().__init__(return_score)
        self.model_name = model_name
        self.column = column
        self.top_n = top_n
        self.api_key = api_key

    @cached_property
    def _client(self):
        cohere = attempt_import_or_raise("cohere")
        # ensure version is at least 0.5.0
        if hasattr(cohere, "__version__") and Version(cohere.__version__) < Version(
            "0.5.0"
        ):
            raise ValueError(
                f"cohere version must be at least 0.5.0, found {cohere.__version__}"
            )
        if os.environ.get("COHERE_API_KEY") is None and self.api_key is None:
            raise ValueError(
                "COHERE_API_KEY not set. Either set it in your environment or \
                pass it as `api_key` argument to the CohereReranker."
            )
        return cohere.Client(os.environ.get("COHERE_API_KEY") or self.api_key)

    def _rerank(self, result_set: pa.Table, query: str):
        docs = result_set[self.column].to_pylist()
        response = self._client.rerank(
            query=query,
            documents=docs,
            top_n=self.top_n,
            model=self.model_name,
        )
        results = (
            response.results
        )  # returns list (text, idx, relevance) attributes sorted descending by score
        indices, scores = list(
            zip(*[(result.index, result.relevance_score) for result in results])
        )  # tuples
        result_set = result_set.take(list(indices))
        # add the scores
        result_set = result_set.append_column(
            "_relevance_score", pa.array(scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        if self.score == "relevance":
            combined_results = combined_results.drop_columns(["score", "_distance"])
        elif self.score == "all":
            raise NotImplementedError(
                "return_score='all' not implemented for cohere reranker"
            )
        return combined_results

    def rerank_vector(
        self,
        query: str,
        vector_results: pa.Table,
    ):
        result_set = self._rerank(vector_results, query)
        if self.score == "relevance":
            result_set = result_set.drop_columns(["_distance"])

        return result_set

    def rerank_fts(
        self,
        query: str,
        fts_results: pa.Table,
    ):
        result_set = self._rerank(fts_results, query)
        if self.score == "relevance":
            result_set = result_set.drop_columns(["score"])

        return result_set

`lancedb.rerankers.colbert.ColbertReranker`

Bases: Reranker

Reranks the results using the ColBERT model.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	The name of the cross encoder model to use.	`"colbert-ir/colbertv2.0"`
`column`	`str`	The name of the column to use as input to the cross encoder model.	`"text"`
`return_score`	`str`	options are "relevance" or "all". Only "relevance" is supported for now.	`"relevance"`

Source code in lancedb/rerankers/colbert.py

class ColbertReranker(Reranker):
    """
    Reranks the results using the ColBERT model.

    Parameters
    ----------
    model_name : str, default "colbert-ir/colbertv2.0"
        The name of the cross encoder model to use.
    column : str, default "text"
        The name of the column to use as input to the cross encoder model.
    return_score : str, default "relevance"
        options are "relevance" or "all". Only "relevance" is supported for now.
    """

    def __init__(
        self,
        model_name: str = "colbert-ir/colbertv2.0",
        column: str = "text",
        return_score="relevance",
    ):
        super().__init__(return_score)
        self.model_name = model_name
        self.column = column
        self.torch = attempt_import_or_raise(
            "torch"
        )  # import here for faster ops later

    def _rerank(self, result_set: pa.Table, query: str):
        docs = result_set[self.column].to_pylist()

        tokenizer, model = self._model

        # Encode the query
        query_encoding = tokenizer(query, return_tensors="pt")
        query_embedding = model(**query_encoding).last_hidden_state.mean(dim=1)
        scores = []
        # Get score for each document
        for document in docs:
            document_encoding = tokenizer(
                document, return_tensors="pt", truncation=True, max_length=512
            )
            document_embedding = model(**document_encoding).last_hidden_state
            # Calculate MaxSim score
            score = self.maxsim(query_embedding.unsqueeze(0), document_embedding)
            scores.append(score.item())

        # replace the self.column column with the docs
        result_set = result_set.drop(self.column)
        result_set = result_set.append_column(
            self.column, pa.array(docs, type=pa.string())
        )
        # add the scores
        result_set = result_set.append_column(
            "_relevance_score", pa.array(scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        if self.score == "relevance":
            combined_results = combined_results.drop_columns(["score", "_distance"])
        elif self.score == "all":
            raise NotImplementedError(
                "OpenAI Reranker does not support score='all' yet"
            )

        combined_results = combined_results.sort_by(
            [("_relevance_score", "descending")]
        )

        return combined_results

    def rerank_vector(
        self,
        query: str,
        vector_results: pa.Table,
    ):
        result_set = self._rerank(vector_results, query)
        if self.score == "relevance":
            result_set = result_set.drop_columns(["_distance"])

        result_set = result_set.sort_by([("_relevance_score", "descending")])

        return result_set

    def rerank_fts(
        self,
        query: str,
        fts_results: pa.Table,
    ):
        result_set = self._rerank(fts_results, query)
        if self.score == "relevance":
            result_set = result_set.drop_columns(["score"])

        result_set = result_set.sort_by([("_relevance_score", "descending")])

        return result_set

    @cached_property
    def _model(self):
        transformers = attempt_import_or_raise("transformers")
        tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_name)
        model = transformers.AutoModel.from_pretrained(self.model_name)

        return tokenizer, model

    def maxsim(self, query_embedding, document_embedding):
        # Expand dimensions for broadcasting
        # Query: [batch, length, size] -> [batch, query, 1, size]
        # Document: [batch, length, size] -> [batch, 1, length, size]
        expanded_query = query_embedding.unsqueeze(2)
        expanded_doc = document_embedding.unsqueeze(1)

        # Compute cosine similarity across the embedding dimension
        sim_matrix = self.torch.nn.functional.cosine_similarity(
            expanded_query, expanded_doc, dim=-1
        )

        # Take the maximum similarity for each query token (across all document tokens)
        # sim_matrix shape: [batch_size, query_length, doc_length]
        max_sim_scores, _ = self.torch.max(sim_matrix, dim=2)

        # Average these maximum scores across all query tokens
        avg_max_sim = self.torch.mean(max_sim_scores, dim=1)
        return avg_max_sim

`lancedb.rerankers.cross_encoder.CrossEncoderReranker`

Bases: Reranker

Reranks the results using a cross encoder model. The cross encoder model is used to score the query and each result. The results are then sorted by the score.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	The name of the cross encoder model to use. See the sentence transformers documentation for a list of available models.	`"cross-encoder/ms-marco-TinyBERT-L-6"`
`column`	`str`	The name of the column to use as input to the cross encoder model.	`"text"`
`device`	`str`	The device to use for the cross encoder model. If None, will use "cuda" if available, otherwise "cpu".	`None`

Source code in lancedb/rerankers/cross_encoder.py

class CrossEncoderReranker(Reranker):
    """
    Reranks the results using a cross encoder model. The cross encoder model is
    used to score the query and each result. The results are then sorted by the score.

    Parameters
    ----------
    model_name : str, default "cross-encoder/ms-marco-TinyBERT-L-6"
        The name of the cross encoder model to use. See the sentence transformers
        documentation for a list of available models.
    column : str, default "text"
        The name of the column to use as input to the cross encoder model.
    device : str, default None
        The device to use for the cross encoder model. If None, will use "cuda"
        if available, otherwise "cpu".
    """

    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-TinyBERT-L-6",
        column: str = "text",
        device: Union[str, None] = None,
        return_score="relevance",
    ):
        super().__init__(return_score)
        torch = attempt_import_or_raise("torch")
        self.model_name = model_name
        self.column = column
        self.device = device
        if self.device is None:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"

    @cached_property
    def model(self):
        sbert = attempt_import_or_raise("sentence_transformers")
        cross_encoder = sbert.CrossEncoder(self.model_name)

        return cross_encoder

    def _rerank(self, result_set: pa.Table, query: str):
        passages = result_set[self.column].to_pylist()
        cross_inp = [[query, passage] for passage in passages]
        cross_scores = self.model.predict(cross_inp)
        result_set = result_set.append_column(
            "_relevance_score", pa.array(cross_scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        # sort the results by _score
        if self.score == "relevance":
            combined_results = combined_results.drop_columns(["score", "_distance"])
        elif self.score == "all":
            raise NotImplementedError(
                "return_score='all' not implemented for CrossEncoderReranker"
            )
        combined_results = combined_results.sort_by(
            [("_relevance_score", "descending")]
        )

        return combined_results

    def rerank_vector(
        self,
        query: str,
        vector_results: pa.Table,
    ):
        vector_results = self._rerank(vector_results, query)
        if self.score == "relevance":
            vector_results = vector_results.drop_columns(["_distance"])

        vector_results = vector_results.sort_by([("_relevance_score", "descending")])
        return vector_results

    def rerank_fts(
        self,
        query: str,
        fts_results: pa.Table,
    ):
        fts_results = self._rerank(fts_results, query)
        if self.score == "relevance":
            fts_results = fts_results.drop_columns(["score"])

        fts_results = fts_results.sort_by([("_relevance_score", "descending")])
        return fts_results

`lancedb.rerankers.openai.OpenaiReranker`

Bases: Reranker

Reranks the results using the OpenAI API. WARNING: This is a prompt based reranker that uses chat model that is not a dedicated reranker API. This should be treated as experimental.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	The name of the cross encoder model to use.	`"gpt-4-turbo-preview"`
`column`	`str`	The name of the column to use as input to the cross encoder model.	`"text"`
`return_score`	`str`	options are "relevance" or "all". Only "relevance" is supported for now.	`"relevance"`
`api_key`	`str`	The API key to use. If None, will use the OPENAI_API_KEY environment variable.	`None`

Source code in lancedb/rerankers/openai.py

class OpenaiReranker(Reranker):
    """
    Reranks the results using the OpenAI API.
    WARNING: This is a prompt based reranker that uses chat model that is
    not a dedicated reranker API. This should be treated as experimental.

    Parameters
    ----------
    model_name : str, default "gpt-4-turbo-preview"
        The name of the cross encoder model to use.
    column : str, default "text"
        The name of the column to use as input to the cross encoder model.
    return_score : str, default "relevance"
        options are "relevance" or "all". Only "relevance" is supported for now.
    api_key : str, default None
        The API key to use. If None, will use the OPENAI_API_KEY environment variable.
    """

    def __init__(
        self,
        model_name: str = "gpt-4-turbo-preview",
        column: str = "text",
        return_score="relevance",
        api_key: Optional[str] = None,
    ):
        super().__init__(return_score)
        self.model_name = model_name
        self.column = column
        self.api_key = api_key

    def _rerank(self, result_set: pa.Table, query: str):
        docs = result_set[self.column].to_pylist()
        response = self._client.chat.completions.create(
            model=self.model_name,
            response_format={"type": "json_object"},
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert relevance ranker. Given a list of\
                        documents and a query, your job is to determine the relevance\
                        each document is for answering the query. Your output is JSON,\
                        which is a list of documents. Each document has two fields,\
                        content and relevance_score.  relevance_score is from 0.0 to\
                        1.0 indicating the relevance of the text to the given query.\
                        Make sure to include all documents in the response.",
                },
                {"role": "user", "content": f"Query: {query} Docs: {docs}"},
            ],
        )
        results = json.loads(response.choices[0].message.content)["documents"]
        docs, scores = list(
            zip(*[(result["content"], result["relevance_score"]) for result in results])
        )  # tuples
        # replace the self.column column with the docs
        result_set = result_set.drop(self.column)
        result_set = result_set.append_column(
            self.column, pa.array(docs, type=pa.string())
        )
        # add the scores
        result_set = result_set.append_column(
            "_relevance_score", pa.array(scores, type=pa.float32())
        )

        return result_set

    def rerank_hybrid(
        self,
        query: str,
        vector_results: pa.Table,
        fts_results: pa.Table,
    ):
        combined_results = self.merge_results(vector_results, fts_results)
        combined_results = self._rerank(combined_results, query)
        if self.score == "relevance":
            combined_results = combined_results.drop_columns(["score", "_distance"])
        elif self.score == "all":
            raise NotImplementedError(
                "OpenAI Reranker does not support score='all' yet"
            )

        combined_results = combined_results.sort_by(
            [("_relevance_score", "descending")]
        )

        return combined_results

    def rerank_vector(self, query: str, vector_results: pa.Table):
        vector_results = self._rerank(vector_results, query)
        if self.score == "relevance":
            vector_results = vector_results.drop_columns(["_distance"])

        vector_results = vector_results.sort_by([("_relevance_score", "descending")])

        return vector_results

    def rerank_fts(self, query: str, fts_results: pa.Table):
        fts_results = self._rerank(fts_results, query)
        if self.score == "relevance":
            fts_results = fts_results.drop_columns(["score"])

        fts_results = fts_results.sort_by([("_relevance_score", "descending")])

        return fts_results

    @cached_property
    def _client(self):
        openai = attempt_import_or_raise(
            "openai"
        )  # TODO: force version or handle versions < 1.0
        if os.environ.get("OPENAI_API_KEY") is None and self.api_key is None:
            raise ValueError(
                "OPENAI_API_KEY not set. Either set it in your environment or \
                pass it as `api_key` argument to the CohereReranker."
            )
        return openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY") or self.api_key)

Connections (Asynchronous)

Connections represent a connection to a LanceDb database and can be used to create, list, or open tables.

`lancedb.connect_async(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, storage_options: Optional[Dict[str, str]] = None) -> AsyncConnection` `async`

Connect to a LanceDB database.

Parameters:

Name	Type	Description	Default
`uri`	`URI`	The uri of the database.	required
`api_key`	`Optional[str]`	If present, connect to LanceDB cloud. Otherwise, connect to a database on file system or cloud storage. Can be set via environment variable `LANCEDB_API_KEY`.	`None`
`region`	`str`	The region to use for LanceDB Cloud.	`'us-east-1'`
`host_override`	`Optional[str]`	The override url for LanceDB Cloud.	`None`
`read_consistency_interval`	`Optional[timedelta]`	(For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.	`None`
`storage_options`	`Optional[Dict[str, str]]`	Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/	`None`

Examples:

>>> import lancedb
>>> async def doctest_example():
...     # For a local directory, provide a path to the database
...     db = await lancedb.connect_async("~/.lancedb")
...     # For object storage, use a URI prefix
...     db = await lancedb.connect_async("s3://my-bucket/lancedb")

Returns:

Name	Type	Description
`conn`	`AsyncConnection`	A connection to a LanceDB database.

Source code in lancedb/__init__.py

async def connect_async(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
    storage_options: Optional[Dict[str, str]] = None,
) -> AsyncConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The uri of the database.
    api_key: str, optional
        If present, connect to LanceDB cloud.
        Otherwise, connect to a database on file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override url for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    storage_options: dict, optional
        Additional options for the storage backend. See available options at
        https://lancedb.github.io/lancedb/guides/storage/

    Examples
    --------

    >>> import lancedb
    >>> async def doctest_example():
    ...     # For a local directory, provide a path to the database
    ...     db = await lancedb.connect_async("~/.lancedb")
    ...     # For object storage, use a URI prefix
    ...     db = await lancedb.connect_async("s3://my-bucket/lancedb")

    Returns
    -------
    conn : AsyncConnection
        A connection to a LanceDB database.
    """
    if read_consistency_interval is not None:
        read_consistency_interval_secs = read_consistency_interval.total_seconds()
    else:
        read_consistency_interval_secs = None

    return AsyncConnection(
        await lancedb_connect(
            sanitize_uri(uri),
            api_key,
            region,
            host_override,
            read_consistency_interval_secs,
            storage_options,
        )
    )

`lancedb.db.AsyncConnection`

Bases: object

An active LanceDB connection

To obtain a connection you can use the connect_async function.

This could be a native connection (using lance) or a remote connection (e.g. for connecting to LanceDb Cloud)

Local connections do not currently hold any open resources but they may do so in the future (for example, for shared cache or connections to catalog services) Remote connections represent an open connection to the remote server. The close method can be used to release any underlying resources eagerly. The connection can also be used as a context manager.

Connections can be shared on multiple threads and are expected to be long lived. Connections can also be used as a context manager, however, in many cases a single connection can be used for the lifetime of the application and so this is often not needed. Closing a connection is optional. If it is not closed then it will be automatically closed when the connection object is deleted.

Examples:

>>> import lancedb
>>> async def doctest_example():
...   with await lancedb.connect_async("/tmp/my_dataset") as conn:
...     # do something with the connection
...     pass
...   # conn is closed here

Source code in lancedb/db.py

class AsyncConnection(object):
    """An active LanceDB connection

    To obtain a connection you can use the [connect_async][lancedb.connect_async]
    function.

    This could be a native connection (using lance) or a remote connection (e.g. for
    connecting to LanceDb Cloud)

    Local connections do not currently hold any open resources but they may do so in the
    future (for example, for shared cache or connections to catalog services) Remote
    connections represent an open connection to the remote server.  The
    [close][lancedb.db.AsyncConnection.close] method can be used to release any
    underlying resources eagerly.  The connection can also be used as a context manager.

    Connections can be shared on multiple threads and are expected to be long lived.
    Connections can also be used as a context manager, however, in many cases a single
    connection can be used for the lifetime of the application and so this is often
    not needed.  Closing a connection is optional.  If it is not closed then it will
    be automatically closed when the connection object is deleted.

    Examples
    --------

    >>> import lancedb
    >>> async def doctest_example():
    ...   with await lancedb.connect_async("/tmp/my_dataset") as conn:
    ...     # do something with the connection
    ...     pass
    ...   # conn is closed here
    """

    def __init__(self, connection: LanceDbConnection):
        self._inner = connection

    def __repr__(self):
        return self._inner.__repr__()

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()

    def is_open(self):
        """Return True if the connection is open."""
        return self._inner.is_open()

    def close(self):
        """Close the connection, releasing any underlying resources.

        It is safe to call this method multiple times.

        Any attempt to use the connection after it is closed will result in an error."""
        self._inner.close()

    async def table_names(
        self, *, start_after: Optional[str] = None, limit: Optional[int] = None
    ) -> Iterable[str]:
        """List all tables in this database, in sorted order

        Parameters
        ----------
        start_after: str, optional
            If present, only return names that come lexicographically after the supplied
            value.

            This can be combined with limit to implement pagination by setting this to
            the last table name from the previous page.
        limit: int, default 10
            The number of results to return.

        Returns
        -------
        Iterable of str
        """
        return await self._inner.table_names(start_after=start_after, limit=limit)

    async def create_table(
        self,
        name: str,
        data: Optional[DATA] = None,
        schema: Optional[Union[pa.Schema, LanceModel]] = None,
        mode: Optional[Literal["create", "overwrite"]] = None,
        exist_ok: Optional[bool] = None,
        on_bad_vectors: Optional[str] = None,
        fill_value: Optional[float] = None,
        storage_options: Optional[Dict[str, str]] = None,
        *,
        use_legacy_format: Optional[bool] = None,
    ) -> AsyncTable:
        """Create an [AsyncTable][lancedb.table.AsyncTable] in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: The data to initialize the table, *optional*
            User must provide at least one of `data` or `schema`.
            Acceptable types are:

            - dict or list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        schema: The schema of the table, *optional*
            Acceptable types are:

            - pyarrow.Schema

            - [LanceModel][lancedb.pydantic.LanceModel]
        mode: Literal["create", "overwrite"]; default "create"
            The mode to use when creating the table.
            Can be either "create" or "overwrite".
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".
        exist_ok: bool, default False
            If a table by the same name already exists, then raise an exception
            if exist_ok=False. If exist_ok=True, then open the existing table;
            it will not add the provided data but will validate against any
            schema that's specified.
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            https://lancedb.github.io/lancedb/guides/storage/
        use_legacy_format: bool, optional, default True
            If True, use the legacy format for the table. If False, use the new format.
            The default is True while the new format is in beta.


        Returns
        -------
        AsyncTable
            A reference to the newly created table.

        !!! note

            The vector index won't be created by default.
            To create the index, call the `create_index` method on the table.

        Examples
        --------

        Can create with list of tuples or dictionaries:

        >>> import lancedb
        >>> async def doctest_example():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        ...     my_table = await db.create_table("my_table", data)
        ...     print(await my_table.query().limit(5).to_arrow())
        >>> import asyncio
        >>> asyncio.run(doctest_example())
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> async def pandas_example():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     my_table = await db.create_table("table2", data)
        ...     print(await my_table.query().limit(5).to_arrow())
        >>> asyncio.run(pandas_example())
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> async def with_schema():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     my_table = await db.create_table("table3", data, schema = custom_schema)
        ...     print(await my_table.query().limit(5).to_arrow())
        >>> asyncio.run(with_schema())
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]


        It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


        >>> import pyarrow as pa
        >>> def make_batches():
        ...     for i in range(5):
        ...         yield pa.RecordBatch.from_arrays(
        ...             [
        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
        ...                     pa.list_(pa.float32(), 2)),
        ...                 pa.array(["foo", "bar"]),
        ...                 pa.array([10.0, 20.0]),
        ...             ],
        ...             ["vector", "item", "price"],
        ...         )
        >>> schema=pa.schema([
        ...     pa.field("vector", pa.list_(pa.float32(), 2)),
        ...     pa.field("item", pa.utf8()),
        ...     pa.field("price", pa.float32()),
        ... ])
        >>> async def iterable_example():
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     await db.create_table("table4", make_batches(), schema=schema)
        >>> asyncio.run(iterable_example())
        """
        if inspect.isclass(schema) and issubclass(schema, LanceModel):
            # convert LanceModel to pyarrow schema
            # note that it's possible this contains
            # embedding function metadata already
            schema = schema.to_arrow_schema()

        metadata = None

        # Defining defaults here and not in function prototype.  In the future
        # these defaults will move into rust so better to keep them as None.
        if on_bad_vectors is None:
            on_bad_vectors = "error"

        if fill_value is None:
            fill_value = 0.0

        if data is not None:
            data, schema = _sanitize_data(
                data,
                schema,
                metadata=metadata,
                on_bad_vectors=on_bad_vectors,
                fill_value=fill_value,
            )

        if schema is None:
            if data is None:
                raise ValueError("Either data or schema must be provided")
            elif hasattr(data, "schema"):
                schema = data.schema
            elif isinstance(data, Iterable):
                if metadata:
                    raise TypeError(
                        (
                            "Persistent embedding functions not yet "
                            "supported for generator data input"
                        )
                    )

        if metadata:
            schema = schema.with_metadata(metadata)
        validate_schema(schema)

        if exist_ok is None:
            exist_ok = False
        if mode is None:
            mode = "create"
        if mode == "create" and exist_ok:
            mode = "exist_ok"

        if data is None:
            new_table = await self._inner.create_empty_table(
                name,
                mode,
                schema,
                storage_options=storage_options,
                use_legacy_format=use_legacy_format,
            )
        else:
            data = data_to_reader(data, schema)
            new_table = await self._inner.create_table(
                name,
                mode,
                data,
                storage_options=storage_options,
                use_legacy_format=use_legacy_format,
            )

        return AsyncTable(new_table)

    async def open_table(
        self,
        name: str,
        storage_options: Optional[Dict[str, str]] = None,
        index_cache_size: Optional[int] = None,
    ) -> AsyncTable:
        """Open a Lance Table in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            https://lancedb.github.io/lancedb/guides/storage/
        index_cache_size: int, default 256
            Set the size of the index cache, specified as a number of entries

            The exact meaning of an "entry" will depend on the type of index:
            * IVF - there is one entry for each IVF partition
            * BTREE - there is one entry for the entire index

            This cache applies to the entire opened table, across all indices.
            Setting this value higher will increase performance on larger datasets
            at the expense of more RAM

        Returns
        -------
        A LanceTable object representing the table.
        """
        table = await self._inner.open_table(name, storage_options, index_cache_size)
        return AsyncTable(table)

    async def drop_table(self, name: str):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        """
        await self._inner.drop_table(name)

    async def drop_database(self):
        """
        Drop database
        This is the same thing as dropping all the tables
        """
        await self._inner.drop_db()

`is_open()`

Return True if the connection is open.

Source code in lancedb/db.py

def is_open(self):
    """Return True if the connection is open."""
    return self._inner.is_open()

`close()`

Close the connection, releasing any underlying resources.

It is safe to call this method multiple times.

Any attempt to use the connection after it is closed will result in an error.

Source code in lancedb/db.py

def close(self):
    """Close the connection, releasing any underlying resources.

    It is safe to call this method multiple times.

    Any attempt to use the connection after it is closed will result in an error."""
    self._inner.close()

`table_names(*, start_after: Optional[str] = None, limit: Optional[int] = None) -> Iterable[str]` `async`

List all tables in this database, in sorted order

Parameters:

Name	Type	Description	Default
`start_after`	`Optional[str]`	If present, only return names that come lexicographically after the supplied value. This can be combined with limit to implement pagination by setting this to the last table name from the previous page.	`None`
`limit`	`Optional[int]`	The number of results to return.	`None`

Returns:

Type	Description
`Iterable of str`

Source code in lancedb/db.py

async def table_names(
    self, *, start_after: Optional[str] = None, limit: Optional[int] = None
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    start_after: str, optional
        If present, only return names that come lexicographically after the supplied
        value.

        This can be combined with limit to implement pagination by setting this to
        the last table name from the previous page.
    limit: int, default 10
        The number of results to return.

    Returns
    -------
    Iterable of str
    """
    return await self._inner.table_names(start_after=start_after, limit=limit)

`create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[pa.Schema, LanceModel]] = None, mode: Optional[Literal['create', 'overwrite']] = None, exist_ok: Optional[bool] = None, on_bad_vectors: Optional[str] = None, fill_value: Optional[float] = None, storage_options: Optional[Dict[str, str]] = None, *, use_legacy_format: Optional[bool] = None) -> AsyncTable` `async`

Create an AsyncTable in the database.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table.	required
`data`	`Optional[DATA]`	User must provide at least one of `data` or `schema`. Acceptable types are: dict or list-of-dict pandas.DataFrame pyarrow.Table or pyarrow.RecordBatch	`None`
`schema`	`Optional[Union[Schema, LanceModel]]`	Acceptable types are: pyarrow.Schema LanceModel	`None`
`mode`	`Optional[Literal['create', 'overwrite']]`	The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".	`None`
`exist_ok`	`Optional[bool]`	If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.	`None`
`on_bad_vectors`	`Optional[str]`	What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".	`None`
`fill_value`	`Optional[float]`	The value to use when filling vectors. Only used if on_bad_vectors="fill".	`None`
`storage_options`	`Optional[Dict[str, str]]`	Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/	`None`
`use_legacy_format`	`Optional[bool]`	If True, use the legacy format for the table. If False, use the new format. The default is True while the new format is in beta.	`None`

Returns:

Type	Description
`AsyncTable`	A reference to the newly created table.
`!!! note`	The vector index won't be created by default. To create the index, call the `create_index` method on the table.

Examples:

Can create with list of tuples or dictionaries:

>>> import lancedb
>>> async def doctest_example():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
...     my_table = await db.create_table("my_table", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(doctest_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> async def pandas_example():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table2", data)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(pandas_example())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> async def with_schema():
...     db = await lancedb.connect_async("./.lancedb")
...     my_table = await db.create_table("table3", data, schema = custom_schema)
...     print(await my_table.query().limit(5).to_arrow())
>>> asyncio.run(with_schema())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create an table from [Iterable[pa.RecordBatch]]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema=pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> async def iterable_example():
...     db = await lancedb.connect_async("./.lancedb")
...     await db.create_table("table4", make_batches(), schema=schema)
>>> asyncio.run(iterable_example())

Source code in lancedb/db.py

async def create_table(
    self,
    name: str,
    data: Optional[DATA] = None,
    schema: Optional[Union[pa.Schema, LanceModel]] = None,
    mode: Optional[Literal["create", "overwrite"]] = None,
    exist_ok: Optional[bool] = None,
    on_bad_vectors: Optional[str] = None,
    fill_value: Optional[float] = None,
    storage_options: Optional[Dict[str, str]] = None,
    *,
    use_legacy_format: Optional[bool] = None,
) -> AsyncTable:
    """Create an [AsyncTable][lancedb.table.AsyncTable] in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    data: The data to initialize the table, *optional*
        User must provide at least one of `data` or `schema`.
        Acceptable types are:

        - dict or list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    schema: The schema of the table, *optional*
        Acceptable types are:

        - pyarrow.Schema

        - [LanceModel][lancedb.pydantic.LanceModel]
    mode: Literal["create", "overwrite"]; default "create"
        The mode to use when creating the table.
        Can be either "create" or "overwrite".
        By default, if the table already exists, an exception is raised.
        If you want to overwrite the table, use mode="overwrite".
    exist_ok: bool, default False
        If a table by the same name already exists, then raise an exception
        if exist_ok=False. If exist_ok=True, then open the existing table;
        it will not add the provided data but will validate against any
        schema that's specified.
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float
        The value to use when filling vectors. Only used if on_bad_vectors="fill".
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        https://lancedb.github.io/lancedb/guides/storage/
    use_legacy_format: bool, optional, default True
        If True, use the legacy format for the table. If False, use the new format.
        The default is True while the new format is in beta.


    Returns
    -------
    AsyncTable
        A reference to the newly created table.

    !!! note

        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

    Examples
    --------

    Can create with list of tuples or dictionaries:

    >>> import lancedb
    >>> async def doctest_example():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    ...             {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
    ...     my_table = await db.create_table("my_table", data)
    ...     print(await my_table.query().limit(5).to_arrow())
    >>> import asyncio
    >>> asyncio.run(doctest_example())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    You can also pass a pandas DataFrame:

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
    ...    "lat": [45.5, 40.1],
    ...    "long": [-122.7, -74.1]
    ... })
    >>> async def pandas_example():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     my_table = await db.create_table("table2", data)
    ...     print(await my_table.query().limit(5).to_arrow())
    >>> asyncio.run(pandas_example())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    Data is converted to Arrow before being written to disk. For maximum
    control over how data is saved, either provide the PyArrow schema to
    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

    >>> custom_schema = pa.schema([
    ...   pa.field("vector", pa.list_(pa.float32(), 2)),
    ...   pa.field("lat", pa.float32()),
    ...   pa.field("long", pa.float32())
    ... ])
    >>> async def with_schema():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     my_table = await db.create_table("table3", data, schema = custom_schema)
    ...     print(await my_table.query().limit(5).to_arrow())
    >>> asyncio.run(with_schema())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: float
    long: float
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]


    It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


    >>> import pyarrow as pa
    >>> def make_batches():
    ...     for i in range(5):
    ...         yield pa.RecordBatch.from_arrays(
    ...             [
    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
    ...                     pa.list_(pa.float32(), 2)),
    ...                 pa.array(["foo", "bar"]),
    ...                 pa.array([10.0, 20.0]),
    ...             ],
    ...             ["vector", "item", "price"],
    ...         )
    >>> schema=pa.schema([
    ...     pa.field("vector", pa.list_(pa.float32(), 2)),
    ...     pa.field("item", pa.utf8()),
    ...     pa.field("price", pa.float32()),
    ... ])
    >>> async def iterable_example():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     await db.create_table("table4", make_batches(), schema=schema)
    >>> asyncio.run(iterable_example())
    """
    if inspect.isclass(schema) and issubclass(schema, LanceModel):
        # convert LanceModel to pyarrow schema
        # note that it's possible this contains
        # embedding function metadata already
        schema = schema.to_arrow_schema()

    metadata = None

    # Defining defaults here and not in function prototype.  In the future
    # these defaults will move into rust so better to keep them as None.
    if on_bad_vectors is None:
        on_bad_vectors = "error"

    if fill_value is None:
        fill_value = 0.0

    if data is not None:
        data, schema = _sanitize_data(
            data,
            schema,
            metadata=metadata,
            on_bad_vectors=on_bad_vectors,
            fill_value=fill_value,
        )

    if schema is None:
        if data is None:
            raise ValueError("Either data or schema must be provided")
        elif hasattr(data, "schema"):
            schema = data.schema
        elif isinstance(data, Iterable):
            if metadata:
                raise TypeError(
                    (
                        "Persistent embedding functions not yet "
                        "supported for generator data input"
                    )
                )

    if metadata:
        schema = schema.with_metadata(metadata)
    validate_schema(schema)

    if exist_ok is None:
        exist_ok = False
    if mode is None:
        mode = "create"
    if mode == "create" and exist_ok:
        mode = "exist_ok"

    if data is None:
        new_table = await self._inner.create_empty_table(
            name,
            mode,
            schema,
            storage_options=storage_options,
            use_legacy_format=use_legacy_format,
        )
    else:
        data = data_to_reader(data, schema)
        new_table = await self._inner.create_table(
            name,
            mode,
            data,
            storage_options=storage_options,
            use_legacy_format=use_legacy_format,
        )

    return AsyncTable(new_table)

`open_table(name: str, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> AsyncTable` `async`

Open a Lance Table in the database.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table.	required
`storage_options`	`Optional[Dict[str, str]]`	Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/	`None`
`index_cache_size`	`Optional[int]`	Set the size of the index cache, specified as a number of entries The exact meaning of an "entry" will depend on the type of index: * IVF - there is one entry for each IVF partition * BTREE - there is one entry for the entire index This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM	`None`

Returns:

Type	Description
`A LanceTable object representing the table.`

Source code in lancedb/db.py

async def open_table(
    self,
    name: str,
    storage_options: Optional[Dict[str, str]] = None,
    index_cache_size: Optional[int] = None,
) -> AsyncTable:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        https://lancedb.github.io/lancedb/guides/storage/
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM

    Returns
    -------
    A LanceTable object representing the table.
    """
    table = await self._inner.open_table(name, storage_options, index_cache_size)
    return AsyncTable(table)

`drop_table(name: str)` `async`

Drop a table from the database.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table.	required

Source code in lancedb/db.py

async def drop_table(self, name: str):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    """
    await self._inner.drop_table(name)

`drop_database()` `async`

Drop database This is the same thing as dropping all the tables

Source code in lancedb/db.py

async def drop_database(self):
    """
    Drop database
    This is the same thing as dropping all the tables
    """
    await self._inner.drop_db()

Tables (Asynchronous)

Table hold your actual data as a collection of records / rows.

`lancedb.table.AsyncTable`

An AsyncTable is a collection of Records in a LanceDB Database.

An AsyncTable can be obtained from the AsyncConnection.create_table and AsyncConnection.open_table methods.

An AsyncTable object is expected to be long lived and reused for multiple operations. AsyncTable objects will cache a certain amount of index data in memory. This cache will be freed when the Table is garbage collected. To eagerly free the cache you can call the close method. Once the AsyncTable is closed, it cannot be used for any further operations.

An AsyncTable can also be used as a context manager, and will automatically close when the context is exited. Closing a table is optional. If you do not close the table, it will be closed when the AsyncTable object is garbage collected.

Examples:

Create using AsyncConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> async def create_a_table():
...     db = await lancedb.connect_async("./.lancedb")
...     data = [{"vector": [1.1, 1.2], "b": 2}]
...     table = await db.create_table("my_table", data=data)
...     print(await table.query().limit(5).to_arrow())
>>> import asyncio
>>> asyncio.run(create_a_table())
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with AsyncTable.add().

>>> async def add_to_table():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     await table.add([{"vector": [0.5, 1.3], "b": 4}])
>>> asyncio.run(add_to_table())

Can query the table with AsyncTable.vector_search.

>>> async def search_table_for_vector():
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.open_table("my_table")
...     results = (
...       await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
...     )
...     print(results)
>>> asyncio.run(search_table_for_vector())
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See AsyncTable.create_index.

Source code in lancedb/table.py

class AsyncTable:
    """
    An AsyncTable is a collection of Records in a LanceDB Database.

    An AsyncTable can be obtained from the
    [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and
    [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] methods.

    An AsyncTable object is expected to be long lived and reused for multiple
    operations. AsyncTable objects will cache a certain amount of index data in memory.
    This cache will be freed when the Table is garbage collected.  To eagerly free the
    cache you can call the [close][lancedb.AsyncTable.close] method.  Once the
    AsyncTable is closed, it cannot be used for any further operations.

    An AsyncTable can also be used as a context manager, and will automatically close
    when the context is exited.  Closing a table is optional.  If you do not close the
    table, it will be closed when the AsyncTable object is garbage collected.

    Examples
    --------

    Create using [AsyncConnection.create_table][lancedb.AsyncConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> async def create_a_table():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     data = [{"vector": [1.1, 1.2], "b": 2}]
    ...     table = await db.create_table("my_table", data=data)
    ...     print(await table.query().limit(5).to_arrow())
    >>> import asyncio
    >>> asyncio.run(create_a_table())
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    Can append new data with [AsyncTable.add()][lancedb.table.AsyncTable.add].

    >>> async def add_to_table():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     table = await db.open_table("my_table")
    ...     await table.add([{"vector": [0.5, 1.3], "b": 4}])
    >>> asyncio.run(add_to_table())

    Can query the table with
    [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search].

    >>> async def search_table_for_vector():
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     table = await db.open_table("my_table")
    ...     results = (
    ...       await table.vector_search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
    ...     )
    ...     print(results)
    >>> asyncio.run(search_table_for_vector())
       b      vector  _distance
    0  4  [0.5, 1.3]       0.82
    1  2  [1.1, 1.2]       1.13

    Search queries are much faster when an index is created. See
    [AsyncTable.create_index][lancedb.table.AsyncTable.create_index].
    """

    def __init__(self, table: LanceDBTable):
        """Create a new AsyncTable object.

        You should not create AsyncTable objects directly.

        Use [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and
        [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain
        Table objects."""
        self._inner = table

    def __repr__(self):
        return self._inner.__repr__()

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()

    def is_open(self) -> bool:
        """Return True if the table is closed."""
        return self._inner.is_open()

    def close(self):
        """Close the table and free any resources associated with it.

        It is safe to call this method multiple times.

        Any attempt to use the table after it has been closed will raise an error."""
        return self._inner.close()

    @property
    def name(self) -> str:
        """The name of the table."""
        return self._inner.name()

    async def schema(self) -> pa.Schema:
        """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
        of this Table

        """
        return await self._inner.schema()

    async def count_rows(self, filter: Optional[str] = None) -> int:
        """
        Count the number of rows in the table.

        Parameters
        ----------
        filter: str, optional
            A SQL where clause to filter the rows to count.
        """
        return await self._inner.count_rows(filter)

    def query(self) -> AsyncQuery:
        """
        Returns an [AsyncQuery][lancedb.query.AsyncQuery] that can be used
        to search the table.

        Use methods on the returned query to control query behavior.  The query
        can be executed with methods like [to_arrow][lancedb.query.AsyncQuery.to_arrow],
        [to_pandas][lancedb.query.AsyncQuery.to_pandas] and more.
        """
        return AsyncQuery(self._inner.query())

    async def to_pandas(self) -> "pd.DataFrame":
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return (await self.to_arrow()).to_pandas()

    async def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table
        """
        return await self.query().to_arrow()

    async def create_index(
        self,
        column: str,
        *,
        replace: Optional[bool] = None,
        config: Optional[Union[IvfPq, BTree]] = None,
    ):
        """Create an index to speed up queries

        Indices can be created on vector columns or scalar columns.
        Indices on vector columns will speed up vector searches.
        Indices on scalar columns will speed up filtering (in both
        vector and non-vector searches)

        Parameters
        ----------
        column: str
            The column to index.
        replace: bool, default True
            Whether to replace the existing index

            If this is false, and another index already exists on the same columns
            and the same name, then an error will be returned.  This is true even if
            that index is out of date.

            The default is True
        config: Union[IvfPq, BTree], default None
            For advanced configuration you can specify the type of index you would
            like to create.   You can also specify index-specific parameters when
            creating an index object.
        """
        index = None
        if config is not None:
            index = config._inner
        await self._inner.create_index(column, index=index, replace=replace)

    async def add(
        self,
        data: DATA,
        *,
        mode: Optional[Literal["append", "overwrite"]] = "append",
        on_bad_vectors: Optional[str] = None,
        fill_value: Optional[float] = None,
    ):
        """Add more data to the [Table](Table).

        Parameters
        ----------
        data: DATA
            The data to insert into the table. Acceptable types are:

            - dict or list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        """
        schema = await self.schema()
        if on_bad_vectors is None:
            on_bad_vectors = "error"
        if fill_value is None:
            fill_value = 0.0
        data, _ = _sanitize_data(
            data,
            schema,
            metadata=schema.metadata,
            on_bad_vectors=on_bad_vectors,
            fill_value=fill_value,
        )
        if isinstance(data, pa.Table):
            data = pa.RecordBatchReader.from_batches(data.schema, data.to_batches())
        await self._inner.add(data, mode)

    def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
        """
        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
        that can be used to create a "merge insert" operation

        This operation can add rows, update rows, and remove rows all in a single
        transaction. It is a very generic tool that can be used to create
        behaviors like "insert if not exists", "update or insert (i.e. upsert)",
        or even replace a portion of existing data with new data (e.g. replace
        all data where month="january")

        The merge insert operation works by combining new data from a
        **source table** with existing data in a **target table** by using a
        join.  There are three categories of records.

        "Matched" records are records that exist in both the source table and
        the target table. "Not matched" records exist only in the source table
        (e.g. these are new data) "Not matched by source" records exist only
        in the target table (this is old data)

        The builder returned by this method can be used to customize what
        should happen for each category of data.

        Please note that the data may appear to be reordered as part of this
        operation.  This is because updated rows will be deleted from the
        dataset and then reinserted at the end with the new values.

        Parameters
        ----------

        on: Union[str, Iterable[str]]
            A column (or columns) to join on.  This is how records from the
            source table and target table are matched.  Typically this is some
            kind of key or id column.

        Examples
        --------
        >>> import lancedb
        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
        >>> # Perform a "upsert" operation
        >>> table.merge_insert("a")             \\
        ...      .when_matched_update_all()     \\
        ...      .when_not_matched_insert_all() \\
        ...      .execute(new_data)
        >>> # The order of new rows is non-deterministic since we use
        >>> # a hash-join as part of this operation and so we sort here
        >>> table.to_arrow().sort_by("a").to_pandas()
           a  b
        0  1  b
        1  2  x
        2  3  y
        3  4  z
        """
        on = [on] if isinstance(on, str) else list(on.iter())

        return LanceMergeInsertBuilder(self, on)

    def vector_search(
        self,
        query_vector: Optional[Union[VEC, Tuple]] = None,
    ) -> AsyncVectorQuery:
        """
        Search the table with a given query vector.
        This is a convenience method for preparing a vector query and
        is the same thing as calling `nearestTo` on the builder returned
        by `query`.  Seer [nearest_to][lancedb.query.AsyncQuery.nearest_to] for more
        details.
        """
        return self.query().nearest_to(query_vector)

    async def _execute_query(
        self, query: Query, batch_size: Optional[int] = None
    ) -> pa.RecordBatchReader:
        pass

    async def _do_merge(
        self,
        merge: LanceMergeInsertBuilder,
        new_data: DATA,
        on_bad_vectors: str,
        fill_value: float,
    ):
        pass

    async def delete(self, where: str):
        """Delete rows from the table.

        This can be used to delete a single row, many rows, all rows, or
        sometimes no rows (if your predicate matches nothing).

        Parameters
        ----------
        where: str
            The SQL where clause to use when deleting rows.

            - For example, 'x = 2' or 'x IN (1, 2, 3)'.

            The filter must not be empty, or it will error.

        Examples
        --------
        >>> import lancedb
        >>> data = [
        ...    {"x": 1, "vector": [1, 2]},
        ...    {"x": 2, "vector": [3, 4]},
        ...    {"x": 3, "vector": [5, 6]}
        ... ]
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.delete("x = 2")
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  3  [5.0, 6.0]

        If you have a list of values to delete, you can combine them into a
        stringified list and use the `IN` operator:

        >>> to_remove = [1, 5]
        >>> to_remove = ", ".join([str(v) for v in to_remove])
        >>> to_remove
        '1, 5'
        >>> table.delete(f"x IN ({to_remove})")
        >>> table.to_pandas()
           x      vector
        0  3  [5.0, 6.0]
        """
        return await self._inner.delete(where)

    async def update(
        self,
        updates: Optional[Dict[str, Any]] = None,
        *,
        where: Optional[str] = None,
        updates_sql: Optional[Dict[str, str]] = None,
    ):
        """
        This can be used to update zero to all rows in the table.

        If a filter is provided with `where` then only rows matching the
        filter will be updated.  Otherwise all rows will be updated.

        Parameters
        ----------
        updates: dict, optional
            The updates to apply.  The keys should be the name of the column to
            update.  The values should be the new values to assign.  This is
            required unless updates_sql is supplied.
        where: str, optional
            An SQL filter that controls which rows are updated. For example, 'x = 2'
            or 'x IN (1, 2, 3)'.  Only rows that satisfy this filter will be udpated.
        updates_sql: dict, optional
            The updates to apply, expressed as SQL expression strings.  The keys should
            be column names. The values should be SQL expressions.  These can be SQL
            literals (e.g. "7" or "'foo'") or they can be expressions based on the
            previous value of the row (e.g. "x + 1" to increment the x column by 1)

        Examples
        --------
        >>> import asyncio
        >>> import lancedb
        >>> import pandas as pd
        >>> async def demo_update():
        ...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
        ...     db = await lancedb.connect_async("./.lancedb")
        ...     table = await db.create_table("my_table", data)
        ...     # x is [1, 2], vector is [[1, 2], [3, 4]]
        ...     await table.update({"vector": [10, 10]}, where="x = 2")
        ...     # x is [1, 2], vector is [[1, 2], [10, 10]]
        ...     await table.update(updates_sql={"x": "x + 1"})
        ...     # x is [2, 3], vector is [[1, 2], [10, 10]]
        >>> asyncio.run(demo_update())
        """
        if updates is not None and updates_sql is not None:
            raise ValueError("Only one of updates or updates_sql can be provided")
        if updates is None and updates_sql is None:
            raise ValueError("Either updates or updates_sql must be provided")

        if updates is not None:
            updates_sql = {k: value_to_sql(v) for k, v in updates.items()}

        return await self._inner.update(updates_sql, where)

    async def version(self) -> int:
        """
        Retrieve the version of the table

        LanceDb supports versioning.  Every operation that modifies the table increases
        version.  As long as a version hasn't been deleted you can `[Self::checkout]`
        that version to view the data at that point.  In addition, you can
        `[Self::restore]` the version to replace the current table with a previous
        version.
        """
        return await self._inner.version()

    async def checkout(self, version):
        """
        Checks out a specific version of the Table

        Any read operation on the table will now access the data at the checked out
        version. As a consequence, calling this method will disable any read consistency
        interval that was previously set.

        This is a read-only operation that turns the table into a sort of "view"
        or "detached head".  Other table instances will not be affected.  To make the
        change permanent you can use the `[Self::restore]` method.

        Any operation that modifies the table will fail while the table is in a checked
        out state.

        To return the table to a normal state use `[Self::checkout_latest]`
        """
        await self._inner.checkout(version)

    async def checkout_latest(self):
        """
        Ensures the table is pointing at the latest version

        This can be used to manually update a table when the read_consistency_interval
        is None
        It can also be used to undo a `[Self::checkout]` operation
        """
        await self._inner.checkout_latest()

    async def restore(self):
        """
        Restore the table to the currently checked out version

        This operation will fail if checkout has not been called previously

        This operation will overwrite the latest version of the table with a
        previous version.  Any changes made since the checked out version will
        no longer be visible.

        Once the operation concludes the table will no longer be in a checked
        out state and the read_consistency_interval, if any, will apply.
        """
        await self._inner.restore()

    async def optimize(
        self, *, cleanup_older_than: Optional[timedelta] = None
    ) -> OptimizeStats:
        """
        Optimize the on-disk data and indices for better performance.

        Modeled after ``VACUUM`` in PostgreSQL.

        Optimization covers three operations:

         * Compaction: Merges small files into larger ones
         * Prune: Removes old versions of the dataset
         * Index: Optimizes the indices, adding new data to existing indices

        Parameters
        ----------
        cleanup_older_than: timedelta, optional default 7 days
            All files belonging to versions older than this will be removed.  Set
            to 0 days to remove all versions except the latest.  The latest version
            is never removed.

        Experimental API
        ----------------

        The optimization process is undergoing active development and may change.
        Our goal with these changes is to improve the performance of optimization and
        reduce the complexity.

        That being said, it is essential today to run optimize if you want the best
        performance.  It should be stable and safe to use in production, but it our
        hope that the API may be simplified (or not even need to be called) in the
        future.

        The frequency an application shoudl call optimize is based on the frequency of
        data modifications.  If data is frequently added, deleted, or updated then
        optimize should be run frequently.  A good rule of thumb is to run optimize if
        you have added or modified 100,000 or more records or run more than 20 data
        modification operations.
        """
        if cleanup_older_than is not None:
            cleanup_older_than = round(cleanup_older_than.total_seconds() * 1000)
        return await self._inner.optimize(cleanup_older_than)

    async def list_indices(self) -> IndexConfig:
        """
        List all indices that have been created with Self::create_index
        """
        return await self._inner.list_indices()

`name: str` `property`

The name of the table.

`init(table: LanceDBTable)`

Create a new AsyncTable object.

You should not create AsyncTable objects directly.

Use AsyncConnection.create_table and AsyncConnection.open_table to obtain Table objects.

Source code in lancedb/table.py

def __init__(self, table: LanceDBTable):
    """Create a new AsyncTable object.

    You should not create AsyncTable objects directly.

    Use [AsyncConnection.create_table][lancedb.AsyncConnection.create_table] and
    [AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain
    Table objects."""
    self._inner = table

`is_open() -> bool`

Return True if the table is closed.

Source code in lancedb/table.py

def is_open(self) -> bool:
    """Return True if the table is closed."""
    return self._inner.is_open()

`close()`

Close the table and free any resources associated with it.

It is safe to call this method multiple times.

Any attempt to use the table after it has been closed will raise an error.

Source code in lancedb/table.py

def close(self):
    """Close the table and free any resources associated with it.

    It is safe to call this method multiple times.

    Any attempt to use the table after it has been closed will raise an error."""
    return self._inner.close()

`schema() -> pa.Schema` `async`

The Arrow Schema of this Table

Source code in lancedb/table.py

async def schema(self) -> pa.Schema:
    """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
    of this Table

    """
    return await self._inner.schema()

`count_rows(filter: Optional[str] = None) -> int` `async`

Count the number of rows in the table.

Parameters:

Name	Type	Description	Default
`filter`	`Optional[str]`	A SQL where clause to filter the rows to count.	`None`

Source code in lancedb/table.py

async def count_rows(self, filter: Optional[str] = None) -> int:
    """
    Count the number of rows in the table.

    Parameters
    ----------
    filter: str, optional
        A SQL where clause to filter the rows to count.
    """
    return await self._inner.count_rows(filter)

`query() -> AsyncQuery`

Returns an AsyncQuery that can be used to search the table.

Use methods on the returned query to control query behavior. The query can be executed with methods like to_arrow, to_pandas and more.

Source code in lancedb/table.py

def query(self) -> AsyncQuery:
    """
    Returns an [AsyncQuery][lancedb.query.AsyncQuery] that can be used
    to search the table.

    Use methods on the returned query to control query behavior.  The query
    can be executed with methods like [to_arrow][lancedb.query.AsyncQuery.to_arrow],
    [to_pandas][lancedb.query.AsyncQuery.to_pandas] and more.
    """
    return AsyncQuery(self._inner.query())

`to_pandas() -> 'pd.DataFrame'` `async`

Return the table as a pandas DataFrame.

Returns:

Type	Description
`DataFrame`

Source code in lancedb/table.py

async def to_pandas(self) -> "pd.DataFrame":
    """Return the table as a pandas DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    return (await self.to_arrow()).to_pandas()

`to_arrow() -> pa.Table` `async`

Return the table as a pyarrow Table.

Returns:

Type	Description
`Table`

Source code in lancedb/table.py

async def to_arrow(self) -> pa.Table:
    """Return the table as a pyarrow Table.

    Returns
    -------
    pa.Table
    """
    return await self.query().to_arrow()

`create_index(column: str, *, replace: Optional[bool] = None, config: Optional[Union[IvfPq, BTree]] = None)` `async`

Create an index to speed up queries

Indices can be created on vector columns or scalar columns. Indices on vector columns will speed up vector searches. Indices on scalar columns will speed up filtering (in both vector and non-vector searches)

Parameters:

Name	Type	Description	Default
`column`	`str`	The column to index.	required
`replace`	`Optional[bool]`	Whether to replace the existing index If this is false, and another index already exists on the same columns and the same name, then an error will be returned. This is true even if that index is out of date. The default is True	`None`
`config`	`Optional[Union[IvfPq, BTree]]`	For advanced configuration you can specify the type of index you would like to create. You can also specify index-specific parameters when creating an index object.	`None`

Source code in lancedb/table.py

async def create_index(
    self,
    column: str,
    *,
    replace: Optional[bool] = None,
    config: Optional[Union[IvfPq, BTree]] = None,
):
    """Create an index to speed up queries

    Indices can be created on vector columns or scalar columns.
    Indices on vector columns will speed up vector searches.
    Indices on scalar columns will speed up filtering (in both
    vector and non-vector searches)

    Parameters
    ----------
    column: str
        The column to index.
    replace: bool, default True
        Whether to replace the existing index

        If this is false, and another index already exists on the same columns
        and the same name, then an error will be returned.  This is true even if
        that index is out of date.

        The default is True
    config: Union[IvfPq, BTree], default None
        For advanced configuration you can specify the type of index you would
        like to create.   You can also specify index-specific parameters when
        creating an index object.
    """
    index = None
    if config is not None:
        index = config._inner
    await self._inner.create_index(column, index=index, replace=replace)

`add(data: DATA, *, mode: Optional[Literal['append', 'overwrite']] = 'append', on_bad_vectors: Optional[str] = None, fill_value: Optional[float] = None)` `async`

Add more data to the Table.

Parameters:

Name	Type	Description	Default
`data`	`DATA`	The data to insert into the table. Acceptable types are: dict or list-of-dict pandas.DataFrame pyarrow.Table or pyarrow.RecordBatch	required
`mode`	`Optional[Literal['append', 'overwrite']]`	The mode to use when writing the data. Valid values are "append" and "overwrite".	`'append'`
`on_bad_vectors`	`Optional[str]`	What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".	`None`
`fill_value`	`Optional[float]`	The value to use when filling vectors. Only used if on_bad_vectors="fill".	`None`

Source code in lancedb/table.py

async def add(
    self,
    data: DATA,
    *,
    mode: Optional[Literal["append", "overwrite"]] = "append",
    on_bad_vectors: Optional[str] = None,
    fill_value: Optional[float] = None,
):
    """Add more data to the [Table](Table).

    Parameters
    ----------
    data: DATA
        The data to insert into the table. Acceptable types are:

        - dict or list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    mode: str
        The mode to use when writing the data. Valid values are
        "append" and "overwrite".
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    """
    schema = await self.schema()
    if on_bad_vectors is None:
        on_bad_vectors = "error"
    if fill_value is None:
        fill_value = 0.0
    data, _ = _sanitize_data(
        data,
        schema,
        metadata=schema.metadata,
        on_bad_vectors=on_bad_vectors,
        fill_value=fill_value,
    )
    if isinstance(data, pa.Table):
        data = pa.RecordBatchReader.from_batches(data.schema, data.to_batches())
    await self._inner.add(data, mode)

`merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder`

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january")

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (e.g. these are new data) "Not matched by source" records exist only in the target table (this is old data)

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

Name	Type	Description	Default
`on`	`Union[str, Iterable[str]]`	A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.	required

Examples:

>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform a "upsert" operation
>>> table.merge_insert("a")             \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z

Source code in lancedb/table.py

def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
    """
    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
    that can be used to create a "merge insert" operation

    This operation can add rows, update rows, and remove rows all in a single
    transaction. It is a very generic tool that can be used to create
    behaviors like "insert if not exists", "update or insert (i.e. upsert)",
    or even replace a portion of existing data with new data (e.g. replace
    all data where month="january")

    The merge insert operation works by combining new data from a
    **source table** with existing data in a **target table** by using a
    join.  There are three categories of records.

    "Matched" records are records that exist in both the source table and
    the target table. "Not matched" records exist only in the source table
    (e.g. these are new data) "Not matched by source" records exist only
    in the target table (this is old data)

    The builder returned by this method can be used to customize what
    should happen for each category of data.

    Please note that the data may appear to be reordered as part of this
    operation.  This is because updated rows will be deleted from the
    dataset and then reinserted at the end with the new values.

    Parameters
    ----------

    on: Union[str, Iterable[str]]
        A column (or columns) to join on.  This is how records from the
        source table and target table are matched.  Typically this is some
        kind of key or id column.

    Examples
    --------
    >>> import lancedb
    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
    >>> # Perform a "upsert" operation
    >>> table.merge_insert("a")             \\
    ...      .when_matched_update_all()     \\
    ...      .when_not_matched_insert_all() \\
    ...      .execute(new_data)
    >>> # The order of new rows is non-deterministic since we use
    >>> # a hash-join as part of this operation and so we sort here
    >>> table.to_arrow().sort_by("a").to_pandas()
       a  b
    0  1  b
    1  2  x
    2  3  y
    3  4  z
    """
    on = [on] if isinstance(on, str) else list(on.iter())

    return LanceMergeInsertBuilder(self, on)

`vector_search(query_vector: Optional[Union[VEC, Tuple]] = None) -> AsyncVectorQuery`

Search the table with a given query vector. This is a convenience method for preparing a vector query and is the same thing as calling nearestTo on the builder returned by query. Seer nearest_to for more details.

Source code in lancedb/table.py

def vector_search(
    self,
    query_vector: Optional[Union[VEC, Tuple]] = None,
) -> AsyncVectorQuery:
    """
    Search the table with a given query vector.
    This is a convenience method for preparing a vector query and
    is the same thing as calling `nearestTo` on the builder returned
    by `query`.  Seer [nearest_to][lancedb.query.AsyncQuery.nearest_to] for more
    details.
    """
    return self.query().nearest_to(query_vector)

`delete(where: str)` `async`

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

Name	Type	Description	Default
`where`	`str`	The SQL where clause to use when deleting rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.	required

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1, 2]},
...    {"x": 2, "vector": [3, 4]},
...    {"x": 3, "vector": [5, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]

Source code in lancedb/table.py

async def delete(self, where: str):
    """Delete rows from the table.

    This can be used to delete a single row, many rows, all rows, or
    sometimes no rows (if your predicate matches nothing).

    Parameters
    ----------
    where: str
        The SQL where clause to use when deleting rows.

        - For example, 'x = 2' or 'x IN (1, 2, 3)'.

        The filter must not be empty, or it will error.

    Examples
    --------
    >>> import lancedb
    >>> data = [
    ...    {"x": 1, "vector": [1, 2]},
    ...    {"x": 2, "vector": [3, 4]},
    ...    {"x": 3, "vector": [5, 6]}
    ... ]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.delete("x = 2")
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  3  [5.0, 6.0]

    If you have a list of values to delete, you can combine them into a
    stringified list and use the `IN` operator:

    >>> to_remove = [1, 5]
    >>> to_remove = ", ".join([str(v) for v in to_remove])
    >>> to_remove
    '1, 5'
    >>> table.delete(f"x IN ({to_remove})")
    >>> table.to_pandas()
       x      vector
    0  3  [5.0, 6.0]
    """
    return await self._inner.delete(where)

`update(updates: Optional[Dict[str, Any]] = None, *, where: Optional[str] = None, updates_sql: Optional[Dict[str, str]] = None)` `async`

This can be used to update zero to all rows in the table.

If a filter is provided with where then only rows matching the filter will be updated. Otherwise all rows will be updated.

Parameters:

Name	Type	Description	Default
`updates`	`Optional[Dict[str, Any]]`	The updates to apply. The keys should be the name of the column to update. The values should be the new values to assign. This is required unless updates_sql is supplied.	`None`
`where`	`Optional[str]`	An SQL filter that controls which rows are updated. For example, 'x = 2' or 'x IN (1, 2, 3)'. Only rows that satisfy this filter will be udpated.	`None`
`updates_sql`	`Optional[Dict[str, str]]`	The updates to apply, expressed as SQL expression strings. The keys should be column names. The values should be SQL expressions. These can be SQL literals (e.g. "7" or "'foo'") or they can be expressions based on the previous value of the row (e.g. "x + 1" to increment the x column by 1)	`None`

Examples:

>>> import asyncio
>>> import lancedb
>>> import pandas as pd
>>> async def demo_update():
...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
...     db = await lancedb.connect_async("./.lancedb")
...     table = await db.create_table("my_table", data)
...     # x is [1, 2], vector is [[1, 2], [3, 4]]
...     await table.update({"vector": [10, 10]}, where="x = 2")
...     # x is [1, 2], vector is [[1, 2], [10, 10]]
...     await table.update(updates_sql={"x": "x + 1"})
...     # x is [2, 3], vector is [[1, 2], [10, 10]]
>>> asyncio.run(demo_update())

Source code in lancedb/table.py

async def update(
    self,
    updates: Optional[Dict[str, Any]] = None,
    *,
    where: Optional[str] = None,
    updates_sql: Optional[Dict[str, str]] = None,
):
    """
    This can be used to update zero to all rows in the table.

    If a filter is provided with `where` then only rows matching the
    filter will be updated.  Otherwise all rows will be updated.

    Parameters
    ----------
    updates: dict, optional
        The updates to apply.  The keys should be the name of the column to
        update.  The values should be the new values to assign.  This is
        required unless updates_sql is supplied.
    where: str, optional
        An SQL filter that controls which rows are updated. For example, 'x = 2'
        or 'x IN (1, 2, 3)'.  Only rows that satisfy this filter will be udpated.
    updates_sql: dict, optional
        The updates to apply, expressed as SQL expression strings.  The keys should
        be column names. The values should be SQL expressions.  These can be SQL
        literals (e.g. "7" or "'foo'") or they can be expressions based on the
        previous value of the row (e.g. "x + 1" to increment the x column by 1)

    Examples
    --------
    >>> import asyncio
    >>> import lancedb
    >>> import pandas as pd
    >>> async def demo_update():
    ...     data = pd.DataFrame({"x": [1, 2], "vector": [[1, 2], [3, 4]]})
    ...     db = await lancedb.connect_async("./.lancedb")
    ...     table = await db.create_table("my_table", data)
    ...     # x is [1, 2], vector is [[1, 2], [3, 4]]
    ...     await table.update({"vector": [10, 10]}, where="x = 2")
    ...     # x is [1, 2], vector is [[1, 2], [10, 10]]
    ...     await table.update(updates_sql={"x": "x + 1"})
    ...     # x is [2, 3], vector is [[1, 2], [10, 10]]
    >>> asyncio.run(demo_update())
    """
    if updates is not None and updates_sql is not None:
        raise ValueError("Only one of updates or updates_sql can be provided")
    if updates is None and updates_sql is None:
        raise ValueError("Either updates or updates_sql must be provided")

    if updates is not None:
        updates_sql = {k: value_to_sql(v) for k, v in updates.items()}

    return await self._inner.update(updates_sql, where)

`version() -> int` `async`

Retrieve the version of the table

LanceDb supports versioning. Every operation that modifies the table increases version. As long as a version hasn't been deleted you can [Self::checkout] that version to view the data at that point. In addition, you can [Self::restore] the version to replace the current table with a previous version.

Source code in lancedb/table.py

async def version(self) -> int:
    """
    Retrieve the version of the table

    LanceDb supports versioning.  Every operation that modifies the table increases
    version.  As long as a version hasn't been deleted you can `[Self::checkout]`
    that version to view the data at that point.  In addition, you can
    `[Self::restore]` the version to replace the current table with a previous
    version.
    """
    return await self._inner.version()

`checkout(version)` `async`

Checks out a specific version of the Table

Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.

This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the [Self::restore] method.

Any operation that modifies the table will fail while the table is in a checked out state.

To return the table to a normal state use [Self::checkout_latest]

Source code in lancedb/table.py

async def checkout(self, version):
    """
    Checks out a specific version of the Table

    Any read operation on the table will now access the data at the checked out
    version. As a consequence, calling this method will disable any read consistency
    interval that was previously set.

    This is a read-only operation that turns the table into a sort of "view"
    or "detached head".  Other table instances will not be affected.  To make the
    change permanent you can use the `[Self::restore]` method.

    Any operation that modifies the table will fail while the table is in a checked
    out state.

    To return the table to a normal state use `[Self::checkout_latest]`
    """
    await self._inner.checkout(version)

`checkout_latest()` `async`

Ensures the table is pointing at the latest version

This can be used to manually update a table when the read_consistency_interval is None It can also be used to undo a [Self::checkout] operation

Source code in lancedb/table.py

async def checkout_latest(self):
    """
    Ensures the table is pointing at the latest version

    This can be used to manually update a table when the read_consistency_interval
    is None
    It can also be used to undo a `[Self::checkout]` operation
    """
    await self._inner.checkout_latest()

`restore()` `async`

Restore the table to the currently checked out version

This operation will fail if checkout has not been called previously

This operation will overwrite the latest version of the table with a previous version. Any changes made since the checked out version will no longer be visible.

Once the operation concludes the table will no longer be in a checked out state and the read_consistency_interval, if any, will apply.

Source code in lancedb/table.py

async def restore(self):
    """
    Restore the table to the currently checked out version

    This operation will fail if checkout has not been called previously

    This operation will overwrite the latest version of the table with a
    previous version.  Any changes made since the checked out version will
    no longer be visible.

    Once the operation concludes the table will no longer be in a checked
    out state and the read_consistency_interval, if any, will apply.
    """
    await self._inner.restore()

`optimize(*, cleanup_older_than: Optional[timedelta] = None) -> OptimizeStats` `async`

Optimize the on-disk data and indices for better performance.

Modeled after VACUUM in PostgreSQL.

Optimization covers three operations:

Compaction: Merges small files into larger ones
Prune: Removes old versions of the dataset
Index: Optimizes the indices, adding new data to existing indices

Parameters:

Name	Type	Description	Default
`cleanup_older_than`	`Optional[timedelta]`	All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.	`None`

Experimental API

The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.

That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it our hope that the API may be simplified (or not even need to be called) in the future.

The frequency an application shoudl call optimize is based on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.

Source code in lancedb/table.py

async def optimize(
    self, *, cleanup_older_than: Optional[timedelta] = None
) -> OptimizeStats:
    """
    Optimize the on-disk data and indices for better performance.

    Modeled after ``VACUUM`` in PostgreSQL.

    Optimization covers three operations:

     * Compaction: Merges small files into larger ones
     * Prune: Removes old versions of the dataset
     * Index: Optimizes the indices, adding new data to existing indices

    Parameters
    ----------
    cleanup_older_than: timedelta, optional default 7 days
        All files belonging to versions older than this will be removed.  Set
        to 0 days to remove all versions except the latest.  The latest version
        is never removed.

    Experimental API
    ----------------

    The optimization process is undergoing active development and may change.
    Our goal with these changes is to improve the performance of optimization and
    reduce the complexity.

    That being said, it is essential today to run optimize if you want the best
    performance.  It should be stable and safe to use in production, but it our
    hope that the API may be simplified (or not even need to be called) in the
    future.

    The frequency an application shoudl call optimize is based on the frequency of
    data modifications.  If data is frequently added, deleted, or updated then
    optimize should be run frequently.  A good rule of thumb is to run optimize if
    you have added or modified 100,000 or more records or run more than 20 data
    modification operations.
    """
    if cleanup_older_than is not None:
        cleanup_older_than = round(cleanup_older_than.total_seconds() * 1000)
    return await self._inner.optimize(cleanup_older_than)

`list_indices() -> IndexConfig` `async`

List all indices that have been created with Self::create_index

Source code in lancedb/table.py

async def list_indices(self) -> IndexConfig:
    """
    List all indices that have been created with Self::create_index
    """
    return await self._inner.list_indices()

Indices (Asynchronous)

Indices can be created on a table to speed up queries. This section lists the indices that LanceDb supports.

`lancedb.index.BTree`

Bases: object

Describes a btree index configuration

A btree index is an index on scalar columns. The index stores a copy of the column in sorted order. A header entry is created for each block of rows (currently the block size is fixed at 4096). These header entries are stored in a separate cacheable structure (a btree). To search for data the header is used to determine which blocks need to be read from disk.

For example, a btree index in a table with 1Bi rows requires sizeof(Scalar) * 256Ki bytes of memory and will generally need to read sizeof(Scalar) * 4096 bytes to find the correct row ids.

This index is good for scalar columns with mostly distinct values and does best when the query is highly selective.

The btree index does not currently have any parameters though parameters such as the block size may be added in the future.

Source code in lancedb/index.py

class BTree(object):
    """Describes a btree index configuration

    A btree index is an index on scalar columns.  The index stores a copy of the
    column in sorted order.  A header entry is created for each block of rows
    (currently the block size is fixed at 4096).  These header entries are stored
    in a separate cacheable structure (a btree).  To search for data the header is
    used to determine which blocks need to be read from disk.

    For example, a btree index in a table with 1Bi rows requires
    sizeof(Scalar) * 256Ki bytes of memory and will generally need to read
    sizeof(Scalar) * 4096 bytes to find the correct row ids.

    This index is good for scalar columns with mostly distinct values and does best
    when the query is highly selective.

    The btree index does not currently have any parameters though parameters such as
    the block size may be added in the future.
    """

    def __init__(self):
        self._inner = LanceDbIndex.btree()

`lancedb.index.IvfPq`

Bases: object

Describes an IVF PQ Index

This index stores a compressed (quantized) copy of every vector. These vectors are grouped into partitions of similar vectors. Each partition keeps track of a centroid which is the average value of all vectors in the group.

During a query the centroids are compared with the query vector to find the closest partitions. The compressed vectors in these partitions are then searched to find the closest vectors.

The compression scheme is called product quantization. Each vector is divide into subvectors and then each subvector is quantized into a small number of bits. the parameters num_bits and num_subvectors control this process, providing a tradeoff between index size (and thus search speed) and index accuracy.

The partitioning process is called IVF and the num_partitions parameter controls how many groups to create.

Note that training an IVF PQ index on a large dataset is a slow operation and currently is also a memory intensive operation.

Source code in lancedb/index.py

class IvfPq(object):
    """Describes an IVF PQ Index

    This index stores a compressed (quantized) copy of every vector.  These vectors
    are grouped into partitions of similar vectors.  Each partition keeps track of
    a centroid which is the average value of all vectors in the group.

    During a query the centroids are compared with the query vector to find the
    closest partitions.  The compressed vectors in these partitions are then
    searched to find the closest vectors.

    The compression scheme is called product quantization.  Each vector is divide
    into subvectors and then each subvector is quantized into a small number of
    bits.  the parameters `num_bits` and `num_subvectors` control this process,
    providing a tradeoff between index size (and thus search speed) and index
    accuracy.

    The partitioning process is called IVF and the `num_partitions` parameter
    controls how many groups to create.

    Note that training an IVF PQ index on a large dataset is a slow operation and
    currently is also a memory intensive operation.
    """

    def __init__(
        self,
        *,
        distance_type: Optional[str] = None,
        num_partitions: Optional[int] = None,
        num_sub_vectors: Optional[int] = None,
        max_iterations: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ):
        """
        Create an IVF PQ index config

        Parameters
        ----------
        distance_type: str, default "L2"
            The distance metric used to train the index

            This is used when training the index to calculate the IVF partitions
            (vectors are grouped in partitions with similar vectors according to this
            distance type) and to calculate a subvector's code during quantization.

            The distance type used to train an index MUST match the distance type used
            to search the index.  Failure to do so will yield inaccurate results.

            The following distance types are available:

            "l2" - Euclidean distance. This is a very common distance metric that
            accounts for both magnitude and direction when determining the distance
            between vectors. L2 distance has a range of [0, ∞).

            "cosine" - Cosine distance.  Cosine distance is a distance metric
            calculated from the cosine similarity between two vectors. Cosine
            similarity is a measure of similarity between two non-zero vectors of an
            inner product space. It is defined to equal the cosine of the angle
            between them.  Unlike L2, the cosine distance is not affected by the
            magnitude of the vectors.  Cosine distance has a range of [0, 2].

            Note: the cosine distance is undefined when one (or both) of the vectors
            are all zeros (there is no direction).  These vectors are invalid and may
            never be returned from a vector search.

            "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
            distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
            L2 norm is 1), then dot distance is equivalent to the cosine distance.
        num_partitions: int, default sqrt(num_rows)
            The number of IVF partitions to create.

            This value should generally scale with the number of rows in the dataset.
            By default the number of partitions is the square root of the number of
            rows.

            If this value is too large then the first part of the search (picking the
            right partition) will be slow.  If this value is too small then the second
            part of the search (searching within a partition) will be slow.
        num_sub_vectors: int, default is vector dimension / 16
            Number of sub-vectors of PQ.

            This value controls how much the vector is compressed during the
            quantization step.  The more sub vectors there are the less the vector is
            compressed.  The default is the dimension of the vector divided by 16.  If
            the dimension is not evenly divisible by 16 we use the dimension divded by
            8.

            The above two cases are highly preferred.  Having 8 or 16 values per
            subvector allows us to use efficient SIMD instructions.

            If the dimension is not visible by 8 then we use 1 subvector.  This is not
            ideal and will likely result in poor performance.
        max_iterations: int, default 50
            Max iteration to train kmeans.

            When training an IVF PQ index we use kmeans to calculate the partitions.
            This parameter controls how many iterations of kmeans to run.

            Increasing this might improve the quality of the index but in most cases
            these extra iterations have diminishing returns.

            The default value is 50.
        sample_rate: int, default 256
            The rate used to calculate the number of training vectors for kmeans.

            When an IVF PQ index is trained, we need to calculate partitions.  These
            are groups of vectors that are similar to each other.  To do this we use an
            algorithm called kmeans.

            Running kmeans on a large dataset can be slow.  To speed this up we run
            kmeans on a random sample of the data.  This parameter controls the size of
            the sample.  The total number of vectors used to train the index is
            `sample_rate * num_partitions`.

            Increasing this value might improve the quality of the index but in most
            cases the default should be sufficient.

            The default value is 256.
        """
        self._inner = LanceDbIndex.ivf_pq(
            distance_type=distance_type,
            num_partitions=num_partitions,
            num_sub_vectors=num_sub_vectors,
            max_iterations=max_iterations,
            sample_rate=sample_rate,
        )

`init(*, distance_type: Optional[str] = None, num_partitions: Optional[int] = None, num_sub_vectors: Optional[int] = None, max_iterations: Optional[int] = None, sample_rate: Optional[int] = None)`

Create an IVF PQ index config

Parameters:

Name	Type	Description	Default
`distance_type`	`Optional[str]`	The distance metric used to train the index This is used when training the index to calculate the IVF partitions (vectors are grouped in partitions with similar vectors according to this distance type) and to calculate a subvector's code during quantization. The distance type used to train an index MUST match the distance type used to search the index. Failure to do so will yield inaccurate results. The following distance types are available: "l2" - Euclidean distance. This is a very common distance metric that accounts for both magnitude and direction when determining the distance between vectors. L2 distance has a range of [0, ∞). "cosine" - Cosine distance. Cosine distance is a distance metric calculated from the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them. Unlike L2, the cosine distance is not affected by the magnitude of the vectors. Cosine distance has a range of [0, 2]. Note: the cosine distance is undefined when one (or both) of the vectors are all zeros (there is no direction). These vectors are invalid and may never be returned from a vector search. "dot" - Dot product. Dot distance is the dot product of two vectors. Dot distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their L2 norm is 1), then dot distance is equivalent to the cosine distance.	`None`
`num_partitions`	`Optional[int]`	The number of IVF partitions to create. This value should generally scale with the number of rows in the dataset. By default the number of partitions is the square root of the number of rows. If this value is too large then the first part of the search (picking the right partition) will be slow. If this value is too small then the second part of the search (searching within a partition) will be slow.	`None`
`num_sub_vectors`	`Optional[int]`	Number of sub-vectors of PQ. This value controls how much the vector is compressed during the quantization step. The more sub vectors there are the less the vector is compressed. The default is the dimension of the vector divided by 16. If the dimension is not evenly divisible by 16 we use the dimension divded by 8. The above two cases are highly preferred. Having 8 or 16 values per subvector allows us to use efficient SIMD instructions. If the dimension is not visible by 8 then we use 1 subvector. This is not ideal and will likely result in poor performance.	`None`
`max_iterations`	`Optional[int]`	Max iteration to train kmeans. When training an IVF PQ index we use kmeans to calculate the partitions. This parameter controls how many iterations of kmeans to run. Increasing this might improve the quality of the index but in most cases these extra iterations have diminishing returns. The default value is 50.	`None`
`sample_rate`	`Optional[int]`	The rate used to calculate the number of training vectors for kmeans. When an IVF PQ index is trained, we need to calculate partitions. These are groups of vectors that are similar to each other. To do this we use an algorithm called kmeans. Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a random sample of the data. This parameter controls the size of the sample. The total number of vectors used to train the index is `sample_rate * num_partitions`. Increasing this value might improve the quality of the index but in most cases the default should be sufficient. The default value is 256.	`None`

Source code in lancedb/index.py

def __init__(
    self,
    *,
    distance_type: Optional[str] = None,
    num_partitions: Optional[int] = None,
    num_sub_vectors: Optional[int] = None,
    max_iterations: Optional[int] = None,
    sample_rate: Optional[int] = None,
):
    """
    Create an IVF PQ index config

    Parameters
    ----------
    distance_type: str, default "L2"
        The distance metric used to train the index

        This is used when training the index to calculate the IVF partitions
        (vectors are grouped in partitions with similar vectors according to this
        distance type) and to calculate a subvector's code during quantization.

        The distance type used to train an index MUST match the distance type used
        to search the index.  Failure to do so will yield inaccurate results.

        The following distance types are available:

        "l2" - Euclidean distance. This is a very common distance metric that
        accounts for both magnitude and direction when determining the distance
        between vectors. L2 distance has a range of [0, ∞).

        "cosine" - Cosine distance.  Cosine distance is a distance metric
        calculated from the cosine similarity between two vectors. Cosine
        similarity is a measure of similarity between two non-zero vectors of an
        inner product space. It is defined to equal the cosine of the angle
        between them.  Unlike L2, the cosine distance is not affected by the
        magnitude of the vectors.  Cosine distance has a range of [0, 2].

        Note: the cosine distance is undefined when one (or both) of the vectors
        are all zeros (there is no direction).  These vectors are invalid and may
        never be returned from a vector search.

        "dot" - Dot product. Dot distance is the dot product of two vectors. Dot
        distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
        L2 norm is 1), then dot distance is equivalent to the cosine distance.
    num_partitions: int, default sqrt(num_rows)
        The number of IVF partitions to create.

        This value should generally scale with the number of rows in the dataset.
        By default the number of partitions is the square root of the number of
        rows.

        If this value is too large then the first part of the search (picking the
        right partition) will be slow.  If this value is too small then the second
        part of the search (searching within a partition) will be slow.
    num_sub_vectors: int, default is vector dimension / 16
        Number of sub-vectors of PQ.

        This value controls how much the vector is compressed during the
        quantization step.  The more sub vectors there are the less the vector is
        compressed.  The default is the dimension of the vector divided by 16.  If
        the dimension is not evenly divisible by 16 we use the dimension divded by
        8.

        The above two cases are highly preferred.  Having 8 or 16 values per
        subvector allows us to use efficient SIMD instructions.

        If the dimension is not visible by 8 then we use 1 subvector.  This is not
        ideal and will likely result in poor performance.
    max_iterations: int, default 50
        Max iteration to train kmeans.

        When training an IVF PQ index we use kmeans to calculate the partitions.
        This parameter controls how many iterations of kmeans to run.

        Increasing this might improve the quality of the index but in most cases
        these extra iterations have diminishing returns.

        The default value is 50.
    sample_rate: int, default 256
        The rate used to calculate the number of training vectors for kmeans.

        When an IVF PQ index is trained, we need to calculate partitions.  These
        are groups of vectors that are similar to each other.  To do this we use an
        algorithm called kmeans.

        Running kmeans on a large dataset can be slow.  To speed this up we run
        kmeans on a random sample of the data.  This parameter controls the size of
        the sample.  The total number of vectors used to train the index is
        `sample_rate * num_partitions`.

        Increasing this value might improve the quality of the index but in most
        cases the default should be sufficient.

        The default value is 256.
    """
    self._inner = LanceDbIndex.ivf_pq(
        distance_type=distance_type,
        num_partitions=num_partitions,
        num_sub_vectors=num_sub_vectors,
        max_iterations=max_iterations,
        sample_rate=sample_rate,
    )

Querying (Asynchronous)

Queries allow you to return data from your database. Basic queries can be created with the AsyncTable.query method to return the entire (typically filtered) table. Vector searches return the rows nearest to a query vector and can be created with the AsyncTable.vector_search method.

`lancedb.query.AsyncQueryBase`

Bases: object

Source code in lancedb/query.py

class AsyncQueryBase(object):
    def __init__(self, inner: Union[LanceQuery | LanceVectorQuery]):
        """
        Construct an AsyncQueryBase

        This method is not intended to be called directly.  Instead, use the
        [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
        """
        self._inner = inner

    def where(self, predicate: str) -> AsyncQuery:
        """
        Only return rows matching the given predicate

        The predicate should be supplied as an SQL query string.

        Examples
        --------

        >>> predicate = "x > 10"
        >>> predicate = "y > 0 AND y < 100"
        >>> predicate = "x > 5 OR y = 'test'"

        Filtering performance can often be improved by creating a scalar index
        on the filter column(s).
        """
        self._inner.where(predicate)
        return self

    def select(self, columns: Union[List[str], dict[str, str]]) -> AsyncQuery:
        """
        Return only the specified columns.

        By default a query will return all columns from the table.  However, this can
        have a very significant impact on latency.  LanceDb stores data in a columnar
        fashion.  This
        means we can finely tune our I/O to select exactly the columns we need.

        As a best practice you should always limit queries to the columns that you need.
        If you pass in a list of column names then only those columns will be
        returned.

        You can also use this method to create new "dynamic" columns based on your
        existing columns. For example, you may not care about "a" or "b" but instead
        simply want "a + b".  This is often seen in the SELECT clause of an SQL query
        (e.g. `SELECT a+b FROM my_table`).

        To create dynamic columns you can pass in a dict[str, str].  A column will be
        returned for each entry in the map.  The key provides the name of the column.
        The value is an SQL string used to specify how the column is calculated.

        For example, an SQL query might state `SELECT a + b AS combined, c`.  The
        equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.

        Columns will always be returned in the order given, even if that order is
        different than the order used when adding the data.
        """
        if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
            self._inner.select_columns(columns)
        elif isinstance(columns, dict) and all(
            isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
        ):
            self._inner.select(list(columns.items()))
        else:
            raise TypeError("columns must be a list of column names or a dict")
        return self

    def limit(self, limit: int) -> AsyncQuery:
        """
        Set the maximum number of results to return.

        By default, a plain search has no limit.  If this method is not
        called then every valid row from the table will be returned.
        """
        self._inner.limit(limit)
        return self

    async def to_batches(
        self, *, max_batch_length: Optional[int] = None
    ) -> AsyncRecordBatchReader:
        """
        Execute the query and return the results as an Apache Arrow RecordBatchReader.

        Parameters
        ----------

        max_batch_length: Optional[int]
            The maximum number of selected records in a single RecordBatch object.
            If not specified, a default batch length is used.
            It is possible for batches to be smaller than the provided length if the
            underlying data is stored in smaller chunks.
        """
        return AsyncRecordBatchReader(await self._inner.execute(max_batch_length))

    async def to_arrow(self) -> pa.Table:
        """
        Execute the query and collect the results into an Apache Arrow Table.

        This method will collect all results into memory before returning.  If
        you expect a large number of results, you may want to use
        [to_batches][lancedb.query.AsyncQueryBase.to_batches]
        """
        batch_iter = await self.to_batches()
        return pa.Table.from_batches(
            await batch_iter.read_all(), schema=batch_iter.schema
        )

    async def to_pandas(self) -> "pd.DataFrame":
        """
        Execute the query and collect the results into a pandas DataFrame.

        This method will collect all results into memory before returning.  If you
        expect a large number of results, you may want to use
        [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
        pandas separately.

        Examples
        --------

        >>> import asyncio
        >>> from lancedb import connect_async
        >>> async def doctest_example():
        ...     conn = await connect_async("./.lancedb")
        ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
        ...     async for batch in await table.query().to_batches():
        ...         batch_df = batch.to_pandas()
        >>> asyncio.run(doctest_example())
        """
        return (await self.to_arrow()).to_pandas()

    async def explain_plan(self, verbose: Optional[bool] = False):
        """Return the execution plan for this query.

        Examples
        --------
        >>> import asyncio
        >>> from lancedb import connect_async
        >>> async def doctest_example():
        ...     conn = await connect_async("./.lancedb")
        ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
        ...     query = [100, 100]
        ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
        ...     print(plan)
        >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
          FilterExec: _distance@2 IS NOT NULL
            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
              KNNVectorDistance: metric=l2
                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

        Parameters
        ----------
        verbose : bool, default False
            Use a verbose output format.

        Returns
        -------
        plan : str
        """  # noqa: E501
        return await self._inner.explain_plan(verbose)

`init(inner: Union[LanceQuery | LanceVectorQuery])`

Construct an AsyncQueryBase

This method is not intended to be called directly. Instead, use the AsyncTable.query method to create a query.

Source code in lancedb/query.py

def __init__(self, inner: Union[LanceQuery | LanceVectorQuery]):
    """
    Construct an AsyncQueryBase

    This method is not intended to be called directly.  Instead, use the
    [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
    """
    self._inner = inner

`where(predicate: str) -> AsyncQuery`

Only return rows matching the given predicate

The predicate should be supplied as an SQL query string.

Examples:

>>> predicate = "x > 10"
>>> predicate = "y > 0 AND y < 100"
>>> predicate = "x > 5 OR y = 'test'"

Filtering performance can often be improved by creating a scalar index on the filter column(s).

Source code in lancedb/query.py

def where(self, predicate: str) -> AsyncQuery:
    """
    Only return rows matching the given predicate

    The predicate should be supplied as an SQL query string.

    Examples
    --------

    >>> predicate = "x > 10"
    >>> predicate = "y > 0 AND y < 100"
    >>> predicate = "x > 5 OR y = 'test'"

    Filtering performance can often be improved by creating a scalar index
    on the filter column(s).
    """
    self._inner.where(predicate)
    return self

`select(columns: Union[List[str], dict[str, str]]) -> AsyncQuery`

Return only the specified columns.

By default a query will return all columns from the table. However, this can have a very significant impact on latency. LanceDb stores data in a columnar fashion. This means we can finely tune our I/O to select exactly the columns we need.

As a best practice you should always limit queries to the columns that you need. If you pass in a list of column names then only those columns will be returned.

You can also use this method to create new "dynamic" columns based on your existing columns. For example, you may not care about "a" or "b" but instead simply want "a + b". This is often seen in the SELECT clause of an SQL query (e.g. SELECT a+b FROM my_table).

To create dynamic columns you can pass in a dict[str, str]. A column will be returned for each entry in the map. The key provides the name of the column. The value is an SQL string used to specify how the column is calculated.

For example, an SQL query might state SELECT a + b AS combined, c. The equivalent input to this method would be {"combined": "a + b", "c": "c"}.

Columns will always be returned in the order given, even if that order is different than the order used when adding the data.

Source code in lancedb/query.py

def select(self, columns: Union[List[str], dict[str, str]]) -> AsyncQuery:
    """
    Return only the specified columns.

    By default a query will return all columns from the table.  However, this can
    have a very significant impact on latency.  LanceDb stores data in a columnar
    fashion.  This
    means we can finely tune our I/O to select exactly the columns we need.

    As a best practice you should always limit queries to the columns that you need.
    If you pass in a list of column names then only those columns will be
    returned.

    You can also use this method to create new "dynamic" columns based on your
    existing columns. For example, you may not care about "a" or "b" but instead
    simply want "a + b".  This is often seen in the SELECT clause of an SQL query
    (e.g. `SELECT a+b FROM my_table`).

    To create dynamic columns you can pass in a dict[str, str].  A column will be
    returned for each entry in the map.  The key provides the name of the column.
    The value is an SQL string used to specify how the column is calculated.

    For example, an SQL query might state `SELECT a + b AS combined, c`.  The
    equivalent input to this method would be `{"combined": "a + b", "c": "c"}`.

    Columns will always be returned in the order given, even if that order is
    different than the order used when adding the data.
    """
    if isinstance(columns, list) and all(isinstance(c, str) for c in columns):
        self._inner.select_columns(columns)
    elif isinstance(columns, dict) and all(
        isinstance(k, str) and isinstance(v, str) for k, v in columns.items()
    ):
        self._inner.select(list(columns.items()))
    else:
        raise TypeError("columns must be a list of column names or a dict")
    return self

`limit(limit: int) -> AsyncQuery`

Set the maximum number of results to return.

By default, a plain search has no limit. If this method is not called then every valid row from the table will be returned.

Source code in lancedb/query.py

def limit(self, limit: int) -> AsyncQuery:
    """
    Set the maximum number of results to return.

    By default, a plain search has no limit.  If this method is not
    called then every valid row from the table will be returned.
    """
    self._inner.limit(limit)
    return self

`to_batches(*, max_batch_length: Optional[int] = None) -> AsyncRecordBatchReader` `async`

Execute the query and return the results as an Apache Arrow RecordBatchReader.

Parameters:

Name	Type	Description	Default
`max_batch_length`	`Optional[int]`	The maximum number of selected records in a single RecordBatch object. If not specified, a default batch length is used. It is possible for batches to be smaller than the provided length if the underlying data is stored in smaller chunks.	`None`

Source code in lancedb/query.py

async def to_batches(
    self, *, max_batch_length: Optional[int] = None
) -> AsyncRecordBatchReader:
    """
    Execute the query and return the results as an Apache Arrow RecordBatchReader.

    Parameters
    ----------

    max_batch_length: Optional[int]
        The maximum number of selected records in a single RecordBatch object.
        If not specified, a default batch length is used.
        It is possible for batches to be smaller than the provided length if the
        underlying data is stored in smaller chunks.
    """
    return AsyncRecordBatchReader(await self._inner.execute(max_batch_length))

`to_arrow() -> pa.Table` `async`

Execute the query and collect the results into an Apache Arrow Table.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches

Source code in lancedb/query.py

async def to_arrow(self) -> pa.Table:
    """
    Execute the query and collect the results into an Apache Arrow Table.

    This method will collect all results into memory before returning.  If
    you expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches]
    """
    batch_iter = await self.to_batches()
    return pa.Table.from_batches(
        await batch_iter.read_all(), schema=batch_iter.schema
    )

`to_pandas() -> 'pd.DataFrame'` `async`

Execute the query and collect the results into a pandas DataFrame.

This method will collect all results into memory before returning. If you expect a large number of results, you may want to use to_batches and convert each batch to pandas separately.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
...     async for batch in await table.query().to_batches():
...         batch_df = batch.to_pandas()
>>> asyncio.run(doctest_example())

Source code in lancedb/query.py

async def to_pandas(self) -> "pd.DataFrame":
    """
    Execute the query and collect the results into a pandas DataFrame.

    This method will collect all results into memory before returning.  If you
    expect a large number of results, you may want to use
    [to_batches][lancedb.query.AsyncQueryBase.to_batches] and convert each batch to
    pandas separately.

    Examples
    --------

    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", data=[{"a": 1, "b": 2}])
    ...     async for batch in await table.query().to_batches():
    ...         batch_df = batch.to_pandas()
    >>> asyncio.run(doctest_example())
    """
    return (await self.to_arrow()).to_pandas()

`explain_plan(verbose: Optional[bool] = False)` `async`

Return the execution plan for this query.

Examples:

>>> import asyncio
>>> from lancedb import connect_async
>>> async def doctest_example():
...     conn = await connect_async("./.lancedb")
...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
...     query = [100, 100]
...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
...     print(plan)
>>> asyncio.run(doctest_example())
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
  FilterExec: _distance@2 IS NOT NULL
    SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
      KNNVectorDistance: metric=l2
        LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

Parameters:

Name	Type	Description	Default
`verbose`	`bool`	Use a verbose output format.	`False`

Returns:

Name	Type	Description
`plan`	`str`

Source code in lancedb/query.py

async def explain_plan(self, verbose: Optional[bool] = False):
    """Return the execution plan for this query.

    Examples
    --------
    >>> import asyncio
    >>> from lancedb import connect_async
    >>> async def doctest_example():
    ...     conn = await connect_async("./.lancedb")
    ...     table = await conn.create_table("my_table", [{"vector": [99, 99]}])
    ...     query = [100, 100]
    ...     plan = await table.query().nearest_to([1, 2]).explain_plan(True)
    ...     print(plan)
    >>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
      FilterExec: _distance@2 IS NOT NULL
        SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
          KNNVectorDistance: metric=l2
            LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

    Parameters
    ----------
    verbose : bool, default False
        Use a verbose output format.

    Returns
    -------
    plan : str
    """  # noqa: E501
    return await self._inner.explain_plan(verbose)

`lancedb.query.AsyncQuery`

Bases: AsyncQueryBase

Source code in lancedb/query.py

class AsyncQuery(AsyncQueryBase):
    def __init__(self, inner: LanceQuery):
        """
        Construct an AsyncQuery

        This method is not intended to be called directly.  Instead, use the
        [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
        """
        super().__init__(inner)
        self._inner = inner

    @classmethod
    def _query_vec_to_array(self, vec: Union[VEC, Tuple]):
        if isinstance(vec, list):
            return pa.array(vec)
        if isinstance(vec, np.ndarray):
            return pa.array(vec)
        if isinstance(vec, pa.Array):
            return vec
        if isinstance(vec, pa.ChunkedArray):
            return vec.combine_chunks()
        if isinstance(vec, tuple):
            return pa.array(vec)
        # We've checked everything we formally support in our typings
        # but, as a fallback, let pyarrow try and convert it anyway.
        # This can allow for some more exotic things like iterables
        return pa.array(vec)

    def nearest_to(
        self, query_vector: Optional[Union[VEC, Tuple]] = None
    ) -> AsyncVectorQuery:
        """
        Find the nearest vectors to the given query vector.

        This converts the query from a plain query to a vector query.

        This method will attempt to convert the input to the query vector
        expected by the embedding model.  If the input cannot be converted
        then an error will be thrown.

        By default, there is no embedding model, and the input should be
        something that can be converted to a pyarrow array of floats.  This
        includes lists, numpy arrays, and tuples.

        If there is only one vector column (a column whose data type is a
        fixed size list of floats) then the column does not need to be specified.
        If there is more than one vector column you must use
        [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify
        which column you would like to compare with.

        If no index has been created on the vector column then a vector query
        will perform a distance comparison between the query vector and every
        vector in the database and then sort the results.  This is sometimes
        called a "flat search"

        For small databases, with tens of thousands of vectors or less, this can
        be reasonably fast.  In larger databases you should create a vector index
        on the column.  If there is a vector index then an "approximate" nearest
        neighbor search (frequently called an ANN search) will be performed.  This
        search is much faster, but the results will be approximate.

        The query can be further parameterized using the returned builder.  There
        are various ANN search parameters that will let you fine tune your recall
        accuracy vs search latency.

        Vector searches always have a [limit][].  If `limit` has not been called then
        a default `limit` of 10 will be used.
        """
        return AsyncVectorQuery(
            self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
        )

`init(inner: LanceQuery)`

Construct an AsyncQuery

This method is not intended to be called directly. Instead, use the AsyncTable.query method to create a query.

Source code in lancedb/query.py

def __init__(self, inner: LanceQuery):
    """
    Construct an AsyncQuery

    This method is not intended to be called directly.  Instead, use the
    [AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
    """
    super().__init__(inner)
    self._inner = inner

`nearest_to(query_vector: Optional[Union[VEC, Tuple]] = None) -> AsyncVectorQuery`

Find the nearest vectors to the given query vector.

This converts the query from a plain query to a vector query.

This method will attempt to convert the input to the query vector expected by the embedding model. If the input cannot be converted then an error will be thrown.

By default, there is no embedding model, and the input should be something that can be converted to a pyarrow array of floats. This includes lists, numpy arrays, and tuples.

If there is only one vector column (a column whose data type is a fixed size list of floats) then the column does not need to be specified. If there is more than one vector column you must use AsyncVectorQuery.column to specify which column you would like to compare with.

If no index has been created on the vector column then a vector query will perform a distance comparison between the query vector and every vector in the database and then sort the results. This is sometimes called a "flat search"

For small databases, with tens of thousands of vectors or less, this can be reasonably fast. In larger databases you should create a vector index on the column. If there is a vector index then an "approximate" nearest neighbor search (frequently called an ANN search) will be performed. This search is much faster, but the results will be approximate.

The query can be further parameterized using the returned builder. There are various ANN search parameters that will let you fine tune your recall accuracy vs search latency.

Vector searches always have a limit. If limit has not been called then a default limit of 10 will be used.

Source code in lancedb/query.py

def nearest_to(
    self, query_vector: Optional[Union[VEC, Tuple]] = None
) -> AsyncVectorQuery:
    """
    Find the nearest vectors to the given query vector.

    This converts the query from a plain query to a vector query.

    This method will attempt to convert the input to the query vector
    expected by the embedding model.  If the input cannot be converted
    then an error will be thrown.

    By default, there is no embedding model, and the input should be
    something that can be converted to a pyarrow array of floats.  This
    includes lists, numpy arrays, and tuples.

    If there is only one vector column (a column whose data type is a
    fixed size list of floats) then the column does not need to be specified.
    If there is more than one vector column you must use
    [AsyncVectorQuery.column][lancedb.query.AsyncVectorQuery.column] to specify
    which column you would like to compare with.

    If no index has been created on the vector column then a vector query
    will perform a distance comparison between the query vector and every
    vector in the database and then sort the results.  This is sometimes
    called a "flat search"

    For small databases, with tens of thousands of vectors or less, this can
    be reasonably fast.  In larger databases you should create a vector index
    on the column.  If there is a vector index then an "approximate" nearest
    neighbor search (frequently called an ANN search) will be performed.  This
    search is much faster, but the results will be approximate.

    The query can be further parameterized using the returned builder.  There
    are various ANN search parameters that will let you fine tune your recall
    accuracy vs search latency.

    Vector searches always have a [limit][].  If `limit` has not been called then
    a default `limit` of 10 will be used.
    """
    return AsyncVectorQuery(
        self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
    )

`lancedb.query.AsyncVectorQuery`

Bases: AsyncQueryBase

Source code in lancedb/query.py

class AsyncVectorQuery(AsyncQueryBase):
    def __init__(self, inner: LanceVectorQuery):
        """
        Construct an AsyncVectorQuery

        This method is not intended to be called directly.  Instead, create
        a query first with [AsyncTable.query][lancedb.table.AsyncTable.query] and then
        use [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to]] to convert to
        a vector query.  Or you can use
        [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]
        """
        super().__init__(inner)
        self._inner = inner

    def column(self, column: str) -> AsyncVectorQuery:
        """
        Set the vector column to query

        This controls which column is compared to the query vector supplied in
        the call to [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to].

        This parameter must be specified if the table has more than one column
        whose data type is a fixed-size-list of floats.
        """
        self._inner.column(column)
        return self

    def nprobes(self, nprobes: int) -> AsyncVectorQuery:
        """
        Set the number of partitions to search (probe)

        This argument is only used when the vector column has an IVF PQ index.
        If there is no index then this value is ignored.

        The IVF stage of IVF PQ divides the input into partitions (clusters) of
        related values.

        The partition whose centroids are closest to the query vector will be
        exhaustiely searched to find matches.  This parameter controls how many
        partitions should be searched.

        Increasing this value will increase the recall of your query but will
        also increase the latency of your query.  The default value is 20.  This
        default is good for many cases but the best value to use will depend on
        your data and the recall that you need to achieve.

        For best results we recommend tuning this parameter with a benchmark against
        your actual data to find the smallest possible value that will still give
        you the desired recall.
        """
        self._inner.nprobes(nprobes)
        return self

    def refine_factor(self, refine_factor: int) -> AsyncVectorQuery:
        """
        A multiplier to control how many additional rows are taken during the refine
        step

        This argument is only used when the vector column has an IVF PQ index.
        If there is no index then this value is ignored.

        An IVF PQ index stores compressed (quantized) values.  They query vector is
        compared against these values and, since they are compressed, the comparison is
        inaccurate.

        This parameter can be used to refine the results.  It can improve both improve
        recall and correct the ordering of the nearest results.

        To refine results LanceDb will first perform an ANN search to find the nearest
        `limit` * `refine_factor` results.  In other words, if `refine_factor` is 3 and
        `limit` is the default (10) then the first 30 results will be selected.  LanceDb
        then fetches the full, uncompressed, values for these 30 results.  The results
        are then reordered by the true distance and only the nearest 10 are kept.

        Note: there is a difference between calling this method with a value of 1 and
        never calling this method at all.  Calling this method with any value will have
        an impact on your search latency.  When you call this method with a
        `refine_factor` of 1 then LanceDb still needs to fetch the full, uncompressed,
        values so that it can potentially reorder the results.

        Note: if this method is NOT called then the distances returned in the _distance
        column will be approximate distances based on the comparison of the quantized
        query vector and the quantized result vectors.  This can be considerably
        different than the true distance between the query vector and the actual
        uncompressed vector.
        """
        self._inner.refine_factor(refine_factor)
        return self

    def distance_type(self, distance_type: str) -> AsyncVectorQuery:
        """
        Set the distance metric to use

        When performing a vector search we try and find the "nearest" vectors according
        to some kind of distance metric.  This parameter controls which distance metric
        to use.  See @see {@link IvfPqOptions.distanceType} for more details on the
        different distance metrics available.

        Note: if there is a vector index then the distance type used MUST match the
        distance type used to train the vector index.  If this is not done then the
        results will be invalid.

        By default "l2" is used.
        """
        self._inner.distance_type(distance_type)
        return self

    def postfilter(self) -> AsyncVectorQuery:
        """
        If this is called then filtering will happen after the vector search instead of
        before.

        By default filtering will be performed before the vector search.  This is how
        filtering is typically understood to work.  This prefilter step does add some
        additional latency.  Creating a scalar index on the filter column(s) can
        often improve this latency.  However, sometimes a filter is too complex or
        scalar indices cannot be applied to the column.  In these cases postfiltering
        can be used instead of prefiltering to improve latency.

        Post filtering applies the filter to the results of the vector search.  This
        means we only run the filter on a much smaller set of data.  However, it can
        cause the query to return fewer than `limit` results (or even no results) if
        none of the nearest results match the filter.

        Post filtering happens during the "refine stage" (described in more detail in
        @see {@link VectorQuery#refineFactor}).  This means that setting a higher refine
        factor can often help restore some of the results lost by post filtering.
        """
        self._inner.postfilter()
        return self

    def bypass_vector_index(self) -> AsyncVectorQuery:
        """
        If this is called then any vector index is skipped

        An exhaustive (flat) search will be performed.  The query vector will
        be compared to every vector in the table.  At high scales this can be
        expensive.  However, this is often still useful.  For example, skipping
        the vector index can give you ground truth results which you can use to
        calculate your recall to select an appropriate value for nprobes.
        """
        self._inner.bypass_vector_index()
        return self

`init(inner: LanceVectorQuery)`

Construct an AsyncVectorQuery

This method is not intended to be called directly. Instead, create a query first with AsyncTable.query and then use AsyncQuery.nearest_to] to convert to a vector query. Or you can use AsyncTable.vector_search

Source code in lancedb/query.py

def __init__(self, inner: LanceVectorQuery):
    """
    Construct an AsyncVectorQuery

    This method is not intended to be called directly.  Instead, create
    a query first with [AsyncTable.query][lancedb.table.AsyncTable.query] and then
    use [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to]] to convert to
    a vector query.  Or you can use
    [AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]
    """
    super().__init__(inner)
    self._inner = inner

`column(column: str) -> AsyncVectorQuery`

Set the vector column to query

This controls which column is compared to the query vector supplied in the call to AsyncQuery.nearest_to.

This parameter must be specified if the table has more than one column whose data type is a fixed-size-list of floats.

Source code in lancedb/query.py

def column(self, column: str) -> AsyncVectorQuery:
    """
    Set the vector column to query

    This controls which column is compared to the query vector supplied in
    the call to [AsyncQuery.nearest_to][lancedb.query.AsyncQuery.nearest_to].

    This parameter must be specified if the table has more than one column
    whose data type is a fixed-size-list of floats.
    """
    self._inner.column(column)
    return self

`nprobes(nprobes: int) -> AsyncVectorQuery`

Set the number of partitions to search (probe)

This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.

The IVF stage of IVF PQ divides the input into partitions (clusters) of related values.

The partition whose centroids are closest to the query vector will be exhaustiely searched to find matches. This parameter controls how many partitions should be searched.

Increasing this value will increase the recall of your query but will also increase the latency of your query. The default value is 20. This default is good for many cases but the best value to use will depend on your data and the recall that you need to achieve.

For best results we recommend tuning this parameter with a benchmark against your actual data to find the smallest possible value that will still give you the desired recall.

Source code in lancedb/query.py

def nprobes(self, nprobes: int) -> AsyncVectorQuery:
    """
    Set the number of partitions to search (probe)

    This argument is only used when the vector column has an IVF PQ index.
    If there is no index then this value is ignored.

    The IVF stage of IVF PQ divides the input into partitions (clusters) of
    related values.

    The partition whose centroids are closest to the query vector will be
    exhaustiely searched to find matches.  This parameter controls how many
    partitions should be searched.

    Increasing this value will increase the recall of your query but will
    also increase the latency of your query.  The default value is 20.  This
    default is good for many cases but the best value to use will depend on
    your data and the recall that you need to achieve.

    For best results we recommend tuning this parameter with a benchmark against
    your actual data to find the smallest possible value that will still give
    you the desired recall.
    """
    self._inner.nprobes(nprobes)
    return self

`refine_factor(refine_factor: int) -> AsyncVectorQuery`

A multiplier to control how many additional rows are taken during the refine step

This argument is only used when the vector column has an IVF PQ index. If there is no index then this value is ignored.

An IVF PQ index stores compressed (quantized) values. They query vector is compared against these values and, since they are compressed, the comparison is inaccurate.

This parameter can be used to refine the results. It can improve both improve recall and correct the ordering of the nearest results.

To refine results LanceDb will first perform an ANN search to find the nearest limit * refine_factor results. In other words, if refine_factor is 3 and limit is the default (10) then the first 30 results will be selected. LanceDb then fetches the full, uncompressed, values for these 30 results. The results are then reordered by the true distance and only the nearest 10 are kept.

Note: there is a difference between calling this method with a value of 1 and never calling this method at all. Calling this method with any value will have an impact on your search latency. When you call this method with a refine_factor of 1 then LanceDb still needs to fetch the full, uncompressed, values so that it can potentially reorder the results.

Note: if this method is NOT called then the distances returned in the _distance column will be approximate distances based on the comparison of the quantized query vector and the quantized result vectors. This can be considerably different than the true distance between the query vector and the actual uncompressed vector.

Source code in lancedb/query.py

def refine_factor(self, refine_factor: int) -> AsyncVectorQuery:
    """
    A multiplier to control how many additional rows are taken during the refine
    step

    This argument is only used when the vector column has an IVF PQ index.
    If there is no index then this value is ignored.

    An IVF PQ index stores compressed (quantized) values.  They query vector is
    compared against these values and, since they are compressed, the comparison is
    inaccurate.

    This parameter can be used to refine the results.  It can improve both improve
    recall and correct the ordering of the nearest results.

    To refine results LanceDb will first perform an ANN search to find the nearest
    `limit` * `refine_factor` results.  In other words, if `refine_factor` is 3 and
    `limit` is the default (10) then the first 30 results will be selected.  LanceDb
    then fetches the full, uncompressed, values for these 30 results.  The results
    are then reordered by the true distance and only the nearest 10 are kept.

    Note: there is a difference between calling this method with a value of 1 and
    never calling this method at all.  Calling this method with any value will have
    an impact on your search latency.  When you call this method with a
    `refine_factor` of 1 then LanceDb still needs to fetch the full, uncompressed,
    values so that it can potentially reorder the results.

    Note: if this method is NOT called then the distances returned in the _distance
    column will be approximate distances based on the comparison of the quantized
    query vector and the quantized result vectors.  This can be considerably
    different than the true distance between the query vector and the actual
    uncompressed vector.
    """
    self._inner.refine_factor(refine_factor)
    return self

`distance_type(distance_type: str) -> AsyncVectorQuery`

Set the distance metric to use

When performing a vector search we try and find the "nearest" vectors according to some kind of distance metric. This parameter controls which distance metric to use. See @see {@link IvfPqOptions.distanceType} for more details on the different distance metrics available.

Note: if there is a vector index then the distance type used MUST match the distance type used to train the vector index. If this is not done then the results will be invalid.

By default "l2" is used.

Source code in lancedb/query.py

def distance_type(self, distance_type: str) -> AsyncVectorQuery:
    """
    Set the distance metric to use

    When performing a vector search we try and find the "nearest" vectors according
    to some kind of distance metric.  This parameter controls which distance metric
    to use.  See @see {@link IvfPqOptions.distanceType} for more details on the
    different distance metrics available.

    Note: if there is a vector index then the distance type used MUST match the
    distance type used to train the vector index.  If this is not done then the
    results will be invalid.

    By default "l2" is used.
    """
    self._inner.distance_type(distance_type)
    return self

`postfilter() -> AsyncVectorQuery`

If this is called then filtering will happen after the vector search instead of before.

By default filtering will be performed before the vector search. This is how filtering is typically understood to work. This prefilter step does add some additional latency. Creating a scalar index on the filter column(s) can often improve this latency. However, sometimes a filter is too complex or scalar indices cannot be applied to the column. In these cases postfiltering can be used instead of prefiltering to improve latency.

Post filtering applies the filter to the results of the vector search. This means we only run the filter on a much smaller set of data. However, it can cause the query to return fewer than limit results (or even no results) if none of the nearest results match the filter.

Post filtering happens during the "refine stage" (described in more detail in @see {@link VectorQuery#refineFactor}). This means that setting a higher refine factor can often help restore some of the results lost by post filtering.

Source code in lancedb/query.py

def postfilter(self) -> AsyncVectorQuery:
    """
    If this is called then filtering will happen after the vector search instead of
    before.

    By default filtering will be performed before the vector search.  This is how
    filtering is typically understood to work.  This prefilter step does add some
    additional latency.  Creating a scalar index on the filter column(s) can
    often improve this latency.  However, sometimes a filter is too complex or
    scalar indices cannot be applied to the column.  In these cases postfiltering
    can be used instead of prefiltering to improve latency.

    Post filtering applies the filter to the results of the vector search.  This
    means we only run the filter on a much smaller set of data.  However, it can
    cause the query to return fewer than `limit` results (or even no results) if
    none of the nearest results match the filter.

    Post filtering happens during the "refine stage" (described in more detail in
    @see {@link VectorQuery#refineFactor}).  This means that setting a higher refine
    factor can often help restore some of the results lost by post filtering.
    """
    self._inner.postfilter()
    return self

`bypass_vector_index() -> AsyncVectorQuery`

If this is called then any vector index is skipped

An exhaustive (flat) search will be performed. The query vector will be compared to every vector in the table. At high scales this can be expensive. However, this is often still useful. For example, skipping the vector index can give you ground truth results which you can use to calculate your recall to select an appropriate value for nprobes.

Source code in lancedb/query.py

def bypass_vector_index(self) -> AsyncVectorQuery:
    """
    If this is called then any vector index is skipped

    An exhaustive (flat) search will be performed.  The query vector will
    be compared to every vector in the table.  At high scales this can be
    expensive.  However, this is often still useful.  For example, skipping
    the vector index can give you ground truth results which you can use to
    calculate your recall to select an appropriate value for nprobes.
    """
    self._inner.bypass_vector_index()
    return self

Python API Reference

Installation

Connections (Synchronous)

lancedb.connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, **kwargs: Any) -> DBConnection

lancedb.db.DBConnection

table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str] abstractmethod

open_table(name: str, *, index_cache_size: Optional[int] = None) -> Table

drop_table(name: str)

rename_table(cur_name: str, new_name: str)

drop_database()

Tables (Synchronous)

lancedb.table.Table

schema: pa.Schema abstractmethod property

count_rows(filter: Optional[str] = None) -> int abstractmethod

to_pandas() -> 'pd.DataFrame'

to_arrow() -> pa.Table abstractmethod

create_index(metric='L2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None)

create_scalar_index(column: str, *, replace: bool = True) abstractmethod

add(data: DATA, mode: str = 'append', on_bad_vectors: str = 'error', fill_value: float = 0.0) abstractmethod

merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder

search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]] = None, vector_column_name: Optional[str] = None, query_type: str = 'auto') -> LanceQueryBuilder abstractmethod

delete(where: str) abstractmethod

update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None) abstractmethod

cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> CleanupStats abstractmethod

compact_files(*args, **kwargs) abstractmethod

add_columns(transforms: Dict[str, str]) abstractmethod

alter_columns(alterations: Iterable[Dict[str, str]]) abstractmethod

drop_columns(columns: Iterable[str]) abstractmethod

Querying (Synchronous)

lancedb.query.Query

lancedb.query.LanceQueryBuilder

create(table: 'Table', query: Optional[Union[np.ndarray, str, 'PIL.Image.Image', Tuple]], query_type: str, vector_column_name: str, ordering_field_name: str = None) -> LanceQueryBuilder classmethod

to_df() -> 'pd.DataFrame'

to_pandas(flatten: Optional[Union[int, bool]] = None) -> 'pd.DataFrame'

to_arrow() -> pa.Table abstractmethod

to_list() -> List[dict]

to_pydantic(model: Type[LanceModel]) -> List[LanceModel]

to_polars() -> 'pl.DataFrame'

limit(limit: Union[int, None]) -> LanceQueryBuilder

select(columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder

where(where: str, prefilter: bool = False) -> LanceQueryBuilder

with_row_id(with_row_id: bool) -> LanceQueryBuilder

explain_plan(verbose: Optional[bool] = False) -> str

lancedb.query.LanceVectorQueryBuilder

metric(metric: Literal['L2', 'cosine']) -> LanceVectorQueryBuilder

nprobes(nprobes: int) -> LanceVectorQueryBuilder

refine_factor(refine_factor: int) -> LanceVectorQueryBuilder

to_arrow() -> pa.Table

to_batches(batch_size: Optional[int] = None) -> pa.RecordBatchReader

where(where: str, prefilter: bool = False) -> LanceVectorQueryBuilder

rerank(reranker: Reranker, query_string: Optional[str] = None) -> LanceVectorQueryBuilder

lancedb.query.LanceFtsQueryBuilder

phrase_query(phrase_query: bool = True) -> LanceFtsQueryBuilder

rerank(reranker: Reranker) -> LanceFtsQueryBuilder

lancedb.query.LanceHybridQueryBuilder

rerank(normalize='score', reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0)) -> LanceHybridQueryBuilder

limit(limit: int) -> LanceHybridQueryBuilder

select(columns: list) -> LanceHybridQueryBuilder

where(where: str, prefilter: bool = False) -> LanceHybridQueryBuilder

metric(metric: Literal['L2', 'cosine']) -> LanceHybridQueryBuilder

nprobes(nprobes: int) -> LanceHybridQueryBuilder

refine_factor(refine_factor: int) -> LanceHybridQueryBuilder

Embeddings

lancedb.embeddings.registry.EmbeddingFunctionRegistry

register(alias: str = None)

reset()

get(name: str)

parse_functions(metadata: Optional[Dict[bytes, bytes]]) -> Dict[str, EmbeddingFunctionConfig]

function_to_metadata(conf: EmbeddingFunctionConfig)

get_table_metadata(func_list)

lancedb.embeddings.base.EmbeddingFunction

create(**kwargs) classmethod

compute_query_embeddings(*args, **kwargs) -> List[np.array] abstractmethod

compute_source_embeddings(*args, **kwargs) -> List[np.array] abstractmethod

compute_query_embeddings_with_retry(*args, **kwargs) -> List[np.array]

compute_source_embeddings_with_retry(*args, **kwargs) -> List[np.array]

sanitize_input(texts: TEXT) -> Union[List[str], np.ndarray]

ndims() abstractmethod

SourceField(**kwargs)

VectorField(**kwargs)

`lancedb.connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, **kwargs: Any) -> DBConnection`

`lancedb.db.DBConnection`

`table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str]` `abstractmethod`

`open_table(name: str, *, index_cache_size: Optional[int] = None) -> Table`

`drop_table(name: str)`

`rename_table(cur_name: str, new_name: str)`

`drop_database()`

`lancedb.table.Table`

`schema: pa.Schema` `abstractmethod` `property`

`count_rows(filter: Optional[str] = None) -> int` `abstractmethod`

`to_pandas() -> 'pd.DataFrame'`

`to_arrow() -> pa.Table` `abstractmethod`

`create_index(metric='L2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None)`

`create_scalar_index(column: str, *, replace: bool = True)` `abstractmethod`

`add(data: DATA, mode: str = 'append', on_bad_vectors: str = 'error', fill_value: float = 0.0)` `abstractmethod`

`merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder`

`search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]] = None, vector_column_name: Optional[str] = None, query_type: str = 'auto') -> LanceQueryBuilder` `abstractmethod`

`delete(where: str)` `abstractmethod`

`update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None)` `abstractmethod`

`cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> CleanupStats` `abstractmethod`

`compact_files(*args, **kwargs)` `abstractmethod`

`add_columns(transforms: Dict[str, str])` `abstractmethod`

`alter_columns(alterations: Iterable[Dict[str, str]])` `abstractmethod`

`drop_columns(columns: Iterable[str])` `abstractmethod`

`lancedb.query.Query`

`lancedb.query.LanceQueryBuilder`

`create(table: 'Table', query: Optional[Union[np.ndarray, str, 'PIL.Image.Image', Tuple]], query_type: str, vector_column_name: str, ordering_field_name: str = None) -> LanceQueryBuilder` `classmethod`

`to_df() -> 'pd.DataFrame'`

`to_pandas(flatten: Optional[Union[int, bool]] = None) -> 'pd.DataFrame'`

`to_arrow() -> pa.Table` `abstractmethod`

`to_list() -> List[dict]`

`to_pydantic(model: Type[LanceModel]) -> List[LanceModel]`

`to_polars() -> 'pl.DataFrame'`

`limit(limit: Union[int, None]) -> LanceQueryBuilder`

`select(columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder`

`where(where: str, prefilter: bool = False) -> LanceQueryBuilder`

`with_row_id(with_row_id: bool) -> LanceQueryBuilder`

`explain_plan(verbose: Optional[bool] = False) -> str`

`lancedb.query.LanceVectorQueryBuilder`

`metric(metric: Literal['L2', 'cosine']) -> LanceVectorQueryBuilder`

`nprobes(nprobes: int) -> LanceVectorQueryBuilder`

`refine_factor(refine_factor: int) -> LanceVectorQueryBuilder`

`to_arrow() -> pa.Table`

`to_batches(batch_size: Optional[int] = None) -> pa.RecordBatchReader`

`where(where: str, prefilter: bool = False) -> LanceVectorQueryBuilder`

`rerank(reranker: Reranker, query_string: Optional[str] = None) -> LanceVectorQueryBuilder`

`lancedb.query.LanceFtsQueryBuilder`

`phrase_query(phrase_query: bool = True) -> LanceFtsQueryBuilder`

`rerank(reranker: Reranker) -> LanceFtsQueryBuilder`

`lancedb.query.LanceHybridQueryBuilder`

`rerank(normalize='score', reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0)) -> LanceHybridQueryBuilder`

`limit(limit: int) -> LanceHybridQueryBuilder`

`select(columns: list) -> LanceHybridQueryBuilder`

`where(where: str, prefilter: bool = False) -> LanceHybridQueryBuilder`

`metric(metric: Literal['L2', 'cosine']) -> LanceHybridQueryBuilder`

`nprobes(nprobes: int) -> LanceHybridQueryBuilder`

`refine_factor(refine_factor: int) -> LanceHybridQueryBuilder`

`lancedb.embeddings.registry.EmbeddingFunctionRegistry`

`register(alias: str = None)`

`reset()`

`get(name: str)`

`parse_functions(metadata: Optional[Dict[bytes, bytes]]) -> Dict[str, EmbeddingFunctionConfig]`

`function_to_metadata(conf: EmbeddingFunctionConfig)`

`get_table_metadata(func_list)`

`lancedb.embeddings.base.EmbeddingFunction`

`create(**kwargs)` `classmethod`

`compute_query_embeddings(*args, **kwargs) -> List[np.array]` `abstractmethod`

`compute_source_embeddings(*args, **kwargs) -> List[np.array]` `abstractmethod`

`compute_query_embeddings_with_retry(*args, **kwargs) -> List[np.array]`

`compute_source_embeddings_with_retry(*args, **kwargs) -> List[np.array]`

`sanitize_input(texts: TEXT) -> Union[List[str], np.ndarray]`

`ndims()` `abstractmethod`

`SourceField(**kwargs)`

`VectorField(**kwargs)`

`lancedb.embeddings.base.TextEmbeddingFunction`

`generate_embeddings(texts: Union[List[str], np.ndarray], *args, **kwargs) -> List[np.array]` `abstractmethod`

`lancedb.embeddings.sentence_transformers.SentenceTransformerEmbeddings`

`embedding_model` `property`

`generate_embeddings(texts: Union[List[str], np.ndarray]) -> List[np.array]`

`get_embedding_model()`