Python API Reference

This section contains the API reference for the Python library. There is a synchronous and an asynchronous API client.

The general flow of using the API is:

  1. Use lancedb.connect or lancedb.connect_async to connect to a database.
  2. Use the returned lancedb.DBConnection or lancedb.AsyncConnection to create or open tables.
  3. Use the returned lancedb.table.Table or lancedb.AsyncTable to query or modify tables.
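
A minimal end-to-end sketch of this flow (the table name and data here are illustrative):

>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")
>>> table = db.create_table("flow_example", data=[{"vector": [0.1, 0.2]}])  # doctest: +SKIP
>>> df = table.search([0.1, 0.2]).limit(1).to_pandas()  # doctest: +SKIP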

Installation

pip install lancedb

The following methods describe the synchronous API client. There is also an asynchronous API client.

Connections (Synchronous)

lancedb.connect

connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, client_config: Union[ClientConfig, Dict[str, Any], None] = None, storage_options: Optional[Dict[str, str]] = None, **kwargs: Any) -> DBConnection

Connect to a LanceDB database.

Parameters:

  • uri (URI) –

    The URI of the database.

  • api_key (Optional[str], default: None ) –

    If present, connect to LanceDB Cloud. Otherwise, connect to a database on the file system or cloud storage. Can be set via the environment variable LANCEDB_API_KEY.

  • region (str, default: 'us-east-1' ) –

    The region to use for LanceDB Cloud.

  • host_override (Optional[str], default: None ) –

    The override URL for LanceDB Cloud.

  • read_consistency_interval (Optional[timedelta], default: None ) –

    (For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.

  • client_config (Union[ClientConfig, Dict[str, Any], None], default: None ) –

    Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/

Examples:

For a local directory, provide a path for the database:

>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")

For object storage, use a URI prefix:

>>> db = lancedb.connect("s3://my-bucket/lancedb",
...                      storage_options={"aws_access_key_id": "***"})

Connect to LanceDB cloud:

>>> db = lancedb.connect("db://my_database", api_key="ldb_...",
...                      client_config={"retry_config": {"retries": 5}})
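
To configure read consistency for LanceDB OSS (a sketch; a zero interval makes every read check for updates from other processes):

>>> from datetime import timedelta
>>> db = lancedb.connect("~/.lancedb",
...                      read_consistency_interval=timedelta(seconds=0))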

Returns:

  • conn ( DBConnection ) –

    A connection to a LanceDB database.

Source code in lancedb/__init__.py
def connect(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
    client_config: Union[ClientConfig, Dict[str, Any], None] = None,
    storage_options: Optional[Dict[str, str]] = None,
    **kwargs: Any,
) -> DBConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The URI of the database.
    api_key: str, optional
        If present, connect to LanceDB Cloud.
        Otherwise, connect to a database on the file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override URL for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    client_config: ClientConfig or dict, optional
        Configuration options for the LanceDB Cloud HTTP client. If a dict, then
        the keys are the attributes of the ClientConfig class. If None, then the
        default configuration is used.
    storage_options: dict, optional
        Additional options for the storage backend. See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Examples
    --------

    For a local directory, provide a path for the database:

    >>> import lancedb
    >>> db = lancedb.connect("~/.lancedb")

    For object storage, use a URI prefix:

    >>> db = lancedb.connect("s3://my-bucket/lancedb",
    ...                      storage_options={"aws_access_key_id": "***"})

    Connect to LanceDB cloud:

    >>> db = lancedb.connect("db://my_database", api_key="ldb_...",
    ...                      client_config={"retry_config": {"retries": 5}})

    Returns
    -------
    conn : DBConnection
        A connection to a LanceDB database.
    """
    from .remote.db import RemoteDBConnection

    if isinstance(uri, str) and uri.startswith("db://"):
        if api_key is None:
            api_key = os.environ.get("LANCEDB_API_KEY")
        if api_key is None:
            raise ValueError(f"api_key is required to connect to LanceDB cloud: {uri}")
        if isinstance(request_thread_pool, int):
            request_thread_pool = ThreadPoolExecutor(request_thread_pool)
        return RemoteDBConnection(
            uri,
            api_key,
            region,
            host_override,
            # TODO: remove this (deprecation warning downstream)
            request_thread_pool=request_thread_pool,
            client_config=client_config,
            storage_options=storage_options,
            **kwargs,
        )

    if kwargs:
        raise ValueError(f"Unknown keyword arguments: {kwargs}")
    return LanceDBConnection(
        uri,
        read_consistency_interval=read_consistency_interval,
        storage_options=storage_options,
    )

lancedb.db.DBConnection

Bases: EnforceOverrides

An active LanceDB connection interface.

Source code in lancedb/db.py
class DBConnection(EnforceOverrides):
    """An active LanceDB connection interface."""

    @abstractmethod
    def table_names(
        self, page_token: Optional[str] = None, limit: int = 10
    ) -> Iterable[str]:
        """List all tables in this database, in sorted order

        Parameters
        ----------
        page_token: str, optional
            The token to use for pagination. If not present, start from the beginning.
            Typically, this token is the last table name from the previous page.
            Only supported by LanceDB Cloud.
        limit: int, default 10
            The size of the page to return.
            Only supported by LanceDB Cloud.

        Returns
        -------
        Iterable of str
        """
        pass

    @abstractmethod
    def create_table(
        self,
        name: str,
        data: Optional[DATA] = None,
        schema: Optional[Union[pa.Schema, LanceModel]] = None,
        mode: str = "create",
        exist_ok: bool = False,
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
        embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
        *,
        storage_options: Optional[Dict[str, str]] = None,
        data_storage_version: Optional[str] = None,
        enable_v2_manifest_paths: Optional[bool] = None,
    ) -> Table:
        """Create a [Table][lancedb.table.Table] in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: The data to initialize the table, *optional*
            User must provide at least one of `data` or `schema`.
            Acceptable types are:

            - list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        schema: The schema of the table, *optional*
            Acceptable types are:

            - pyarrow.Schema

            - [LanceModel][lancedb.pydantic.LanceModel]
        mode: str; default "create"
            The mode to use when creating the table.
            Can be either "create" or "overwrite".
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".
        exist_ok: bool, default False
            If a table by the same name already exists, then raise an exception
            if exist_ok=False. If exist_ok=True, then open the existing table;
            it will not add the provided data but will validate against any
            schema that's specified.
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contain NaNs.
            One of "error", "drop", "fill".
        fill_value: float
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            <https://lancedb.github.io/lancedb/guides/storage/>
        data_storage_version: str, optional, default "stable"
            The version of the data storage format to use. Newer versions are more
            efficient but require newer versions of lance to read.  The default is
            "stable" which will use the legacy v2 version.  See the user guide
            for more details.
        enable_v2_manifest_paths: bool, optional, default False
            Use the new V2 manifest paths. These paths provide more efficient
            opening of datasets with many versions on object stores.  WARNING:
            turning this on will make the dataset unreadable for older versions
            of LanceDB (prior to 0.13.0). To migrate an existing dataset, instead
            use the
            [Table.migrate_manifest_paths_v2][lancedb.table.Table.migrate_v2_manifest_paths]
            method.

        Returns
        -------
        LanceTable
            A reference to the newly created table.

        !!! note

            The vector index won't be created by default.
            To create the index, call the `create_index` method on the table.

        Examples
        --------

        Can create with a list of dictionaries:

        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        >>> db.create_table("my_table", data)
        LanceTable(name='my_table', version=1, ...)
        >>> db["my_table"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> db.create_table("table2", data)
        LanceTable(name='table2', version=1, ...)
        >>> db["table2"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

        >>> import pyarrow as pa
        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> db.create_table("table3", data, schema = custom_schema)
        LanceTable(name='table3', version=1, ...)
        >>> db["table3"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]


        It is also possible to create a table from an `Iterable[pa.RecordBatch]`:


        >>> import pyarrow as pa
        >>> def make_batches():
        ...     for i in range(5):
        ...         yield pa.RecordBatch.from_arrays(
        ...             [
        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
        ...                     pa.list_(pa.float32(), 2)),
        ...                 pa.array(["foo", "bar"]),
        ...                 pa.array([10.0, 20.0]),
        ...             ],
        ...             ["vector", "item", "price"],
        ...         )
        >>> schema=pa.schema([
        ...     pa.field("vector", pa.list_(pa.float32(), 2)),
        ...     pa.field("item", pa.utf8()),
        ...     pa.field("price", pa.float32()),
        ... ])
        >>> db.create_table("table4", make_batches(), schema=schema)
        LanceTable(name='table4', version=1, ...)

        """
        raise NotImplementedError

    def __getitem__(self, name: str) -> LanceTable:
        return self.open_table(name)

    def open_table(
        self,
        name: str,
        *,
        storage_options: Optional[Dict[str, str]] = None,
        index_cache_size: Optional[int] = None,
    ) -> Table:
        """Open a Lance Table in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        index_cache_size: int, default 256
            Set the size of the index cache, specified as a number of entries

            The exact meaning of an "entry" will depend on the type of index:
            * IVF - there is one entry for each IVF partition
            * BTREE - there is one entry for the entire index

            This cache applies to the entire opened table, across all indices.
            Setting this value higher will increase performance on larger datasets
            at the expense of more RAM
        storage_options: dict, optional
            Additional options for the storage backend. Options already set on the
            connection will be inherited by the table, but can be overridden here.
            See available options at
            <https://lancedb.github.io/lancedb/guides/storage/>

        Returns
        -------
        A LanceTable object representing the table.
        """
        raise NotImplementedError

    def drop_table(self, name: str):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        """
        raise NotImplementedError

    def rename_table(self, cur_name: str, new_name: str):
        """Rename a table in the database.

        Parameters
        ----------
        cur_name: str
            The current name of the table.
        new_name: str
            The new name of the table.
        """
        raise NotImplementedError

    def drop_database(self):
        """
        Drop the database.
        This is the same as dropping all of the tables.
        """
        raise NotImplementedError

    @property
    def uri(self) -> str:
        return self._uri

table_names abstractmethod

table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str]

List all tables in this database, in sorted order.

Parameters:

  • page_token (Optional[str], default: None ) –

    The token to use for pagination. If not present, start from the beginning. Typically, this token is the last table name from the previous page. Only supported by LanceDB Cloud.

  • limit (int, default: 10 ) –

    The size of the page to return. Only supported by LanceDB Cloud.

Returns:

  • Iterable of str
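
Examples:

A hedged sketch; the table names shown are illustrative, and page_token/limit are only honored by LanceDB Cloud:

>>> db.table_names()  # doctest: +SKIP
['table1', 'table2']
>>> db.table_names(page_token="table2")  # doctest: +SKIP
['table3']
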
Source code in lancedb/db.py
@abstractmethod
def table_names(
    self, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    page_token: str, optional
        The token to use for pagination. If not present, start from the beginning.
        Typically, this token is the last table name from the previous page.
        Only supported by LanceDB Cloud.
    limit: int, default 10
        The size of the page to return.
        Only supported by LanceDB Cloud.

    Returns
    -------
    Iterable of str
    """
    pass

create_table abstractmethod

create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None, *, storage_options: Optional[Dict[str, str]] = None, data_storage_version: Optional[str] = None, enable_v2_manifest_paths: Optional[bool] = None) -> Table

Create a Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • data (Optional[DATA], default: None ) –

    User must provide at least one of data or schema. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • schema (Optional[Union[Schema, LanceModel]], default: None ) –

    The schema of the table. Acceptable types are:

    • pyarrow.Schema

    • LanceModel

  • mode (str, default: 'create' ) –

    The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".

  • exist_ok (bool, default: False ) –

    If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.

  • on_bad_vectors (str, default: 'error' ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default: 0.0 ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

  • data_storage_version (Optional[str], default: None ) –

    The version of the data storage format to use. Newer versions are more efficient but require newer versions of lance to read. The default is "stable" which will use the legacy v2 version. See the user guide for more details.

  • enable_v2_manifest_paths (Optional[bool], default: None ) –

    Use the new V2 manifest paths. These paths provide more efficient opening of datasets with many versions on object stores. WARNING: turning this on will make the dataset unreadable for older versions of LanceDB (prior to 0.13.0). To migrate an existing dataset, instead use the Table.migrate_manifest_paths_v2 method.

Returns:

  • LanceTable

    A reference to the newly created table.

Note:

    The vector index won't be created by default. To create the index, call the create_index method on the table.

Examples:

Can create with a list of dictionaries:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
>>> db.create_table("my_table", data)
LanceTable(name='my_table', version=1, ...)
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(name='table2', version=1, ...)
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> import pyarrow as pa
>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(name='table3', version=1, ...)
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create a table from an Iterable[pa.RecordBatch]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema=pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(name='table4', version=1, ...)
Source code in lancedb/db.py
@abstractmethod
def create_table(
    self,
    name: str,
    data: Optional[DATA] = None,
    schema: Optional[Union[pa.Schema, LanceModel]] = None,
    mode: str = "create",
    exist_ok: bool = False,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
    embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
    *,
    storage_options: Optional[Dict[str, str]] = None,
    data_storage_version: Optional[str] = None,
    enable_v2_manifest_paths: Optional[bool] = None,
) -> Table:
    """Create a [Table][lancedb.table.Table] in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    data: The data to initialize the table, *optional*
        User must provide at least one of `data` or `schema`.
        Acceptable types are:

        - list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    schema: The schema of the table, *optional*
        Acceptable types are:

        - pyarrow.Schema

        - [LanceModel][lancedb.pydantic.LanceModel]
    mode: str; default "create"
        The mode to use when creating the table.
        Can be either "create" or "overwrite".
        By default, if the table already exists, an exception is raised.
        If you want to overwrite the table, use mode="overwrite".
    exist_ok: bool, default False
        If a table by the same name already exists, then raise an exception
        if exist_ok=False. If exist_ok=True, then open the existing table;
        it will not add the provided data but will validate against any
        schema that's specified.
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contain NaNs.
        One of "error", "drop", "fill".
    fill_value: float
        The value to use when filling vectors. Only used if on_bad_vectors="fill".
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>
    data_storage_version: str, optional, default "stable"
        The version of the data storage format to use. Newer versions are more
        efficient but require newer versions of lance to read.  The default is
        "stable" which will use the legacy v2 version.  See the user guide
        for more details.
    enable_v2_manifest_paths: bool, optional, default False
        Use the new V2 manifest paths. These paths provide more efficient
        opening of datasets with many versions on object stores.  WARNING:
        turning this on will make the dataset unreadable for older versions
        of LanceDB (prior to 0.13.0). To migrate an existing dataset, instead
        use the
        [Table.migrate_manifest_paths_v2][lancedb.table.Table.migrate_v2_manifest_paths]
        method.

    Returns
    -------
    LanceTable
        A reference to the newly created table.

    !!! note

        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

    Examples
    --------

    Can create with a list of dictionaries:

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
    >>> db.create_table("my_table", data)
    LanceTable(name='my_table', version=1, ...)
    >>> db["my_table"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    You can also pass a pandas DataFrame:

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
    ...    "lat": [45.5, 40.1],
    ...    "long": [-122.7, -74.1]
    ... })
    >>> db.create_table("table2", data)
    LanceTable(name='table2', version=1, ...)
    >>> db["table2"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    Data is converted to Arrow before being written to disk. For maximum
    control over how data is saved, either provide the PyArrow schema to
    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

    >>> import pyarrow as pa
    >>> custom_schema = pa.schema([
    ...   pa.field("vector", pa.list_(pa.float32(), 2)),
    ...   pa.field("lat", pa.float32()),
    ...   pa.field("long", pa.float32())
    ... ])
    >>> db.create_table("table3", data, schema = custom_schema)
    LanceTable(name='table3', version=1, ...)
    >>> db["table3"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: float
    long: float
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]


    It is also possible to create a table from an `Iterable[pa.RecordBatch]`:


    >>> import pyarrow as pa
    >>> def make_batches():
    ...     for i in range(5):
    ...         yield pa.RecordBatch.from_arrays(
    ...             [
    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
    ...                     pa.list_(pa.float32(), 2)),
    ...                 pa.array(["foo", "bar"]),
    ...                 pa.array([10.0, 20.0]),
    ...             ],
    ...             ["vector", "item", "price"],
    ...         )
    >>> schema=pa.schema([
    ...     pa.field("vector", pa.list_(pa.float32(), 2)),
    ...     pa.field("item", pa.utf8()),
    ...     pa.field("price", pa.float32()),
    ... ])
    >>> db.create_table("table4", make_batches(), schema=schema)
    LanceTable(name='table4', version=1, ...)

    """
    raise NotImplementedError

open_table

open_table(name: str, *, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> Table

Open a Lance Table in the database.

Parameters:

  • name (str) –

    The name of the table.

  • index_cache_size (Optional[int], default: None ) –

    Set the size of the index cache, specified as a number of entries.

    The exact meaning of an "entry" will depend on the type of index:

    • IVF - there is one entry for each IVF partition

    • BTREE - there is one entry for the entire index

    This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM.

  • storage_options (Optional[Dict[str, str]], default: None ) –

    Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/

Returns:

  • A LanceTable object representing the table.
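
Examples:

A hedged sketch opening an existing table (the name and cache size are illustrative):

>>> table = db.open_table("my_table", index_cache_size=512)  # doctest: +SKIP
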
Source code in lancedb/db.py
def open_table(
    self,
    name: str,
    *,
    storage_options: Optional[Dict[str, str]] = None,
    index_cache_size: Optional[int] = None,
) -> Table:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    index_cache_size: int, default 256
        Set the size of the index cache, specified as a number of entries

        The exact meaning of an "entry" will depend on the type of index:
        * IVF - there is one entry for each IVF partition
        * BTREE - there is one entry for the entire index

        This cache applies to the entire opened table, across all indices.
        Setting this value higher will increase performance on larger datasets
        at the expense of more RAM
    storage_options: dict, optional
        Additional options for the storage backend. Options already set on the
        connection will be inherited by the table, but can be overridden here.
        See available options at
        <https://lancedb.github.io/lancedb/guides/storage/>

    Returns
    -------
    A LanceTable object representing the table.
    """
    raise NotImplementedError

drop_table

drop_table(name: str)

Drop a table from the database.

Parameters:

  • name (str) –

    The name of the table.
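
Examples:

A short sketch (the table name is illustrative):

>>> db.drop_table("my_table")  # doctest: +SKIP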

Source code in lancedb/db.py
def drop_table(self, name: str):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    """
    raise NotImplementedError

rename_table

rename_table(cur_name: str, new_name: str)

Rename a table in the database.

Parameters:

  • cur_name (str) –

    The current name of the table.

  • new_name (str) –

    The new name of the table.
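
Examples:

A short sketch (the table names are illustrative):

>>> db.rename_table("my_table", "my_table_v2")  # doctest: +SKIP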

Source code in lancedb/db.py
def rename_table(self, cur_name: str, new_name: str):
    """Rename a table in the database.

    Parameters
    ----------
    cur_name: str
        The current name of the table.
    new_name: str
        The new name of the table.
    """
    raise NotImplementedError

drop_database

drop_database()

Drop the database. This is the same as dropping all of the tables.

Source code in lancedb/db.py
def drop_database(self):
    """
    Drop the database.
    This is the same as dropping all of the tables.
    """
    raise NotImplementedError

Tables (Synchronous)

lancedb.table.Table

Bases: ABC

A Table is a collection of Records in a LanceDB Database.

Examples:

Create using DBConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with Table.add().

>>> table.add([{"vector": [0.5, 1.3], "b": 4}])

Can query the table with Table.search.

>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See Table.create_index.

Source code in lancedb/table.py
class Table(ABC):
    """
    A Table is a collection of Records in a LanceDB Database.

    Examples
    --------

    Create using [DBConnection.create_table][lancedb.DBConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
    >>> table.head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    Can append new data with [Table.add()][lancedb.table.Table.add].

    >>> table.add([{"vector": [0.5, 1.3], "b": 4}])

    Can query the table with [Table.search][lancedb.table.Table.search].

    >>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
       b      vector  _distance
    0  4  [0.5, 1.3]       0.82
    1  2  [1.1, 1.2]       1.13

    Search queries are much faster when an index is created. See
    [Table.create_index][lancedb.table.Table.create_index].
    """

    @property
    @abstractmethod
    def name(self) -> str:
        """The name of this Table"""
        raise NotImplementedError

    @property
    @abstractmethod
    def version(self) -> int:
        """The version of this Table"""
        raise NotImplementedError

    @property
    @abstractmethod
    def schema(self) -> pa.Schema:
        """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
        of this Table

        """
        raise NotImplementedError

    @property
    @abstractmethod
    def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
        """
        Get a mapping from vector column name to its configured embedding function.
        """

    @abstractmethod
    def count_rows(self, filter: Optional[str] = None) -> int:
        """
        Count the number of rows in the table.

        Parameters
        ----------
        filter: str, optional
            A SQL where clause to filter the rows to count.
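
        Examples
        --------
        A hedged sketch counting the rows that match a SQL filter (the
        column name is illustrative):

        >>> table.count_rows("b > 2")  # doctest: +SKIP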
        """
        raise NotImplementedError

    def to_pandas(self) -> "pandas.DataFrame":
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return self.to_arrow().to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table
        """
        raise NotImplementedError

    def create_index(
        self,
        metric="L2",
        num_partitions=256,
        num_sub_vectors=96,
        vector_column_name: str = VECTOR_COLUMN_NAME,
        replace: bool = True,
        accelerator: Optional[str] = None,
        index_cache_size: Optional[int] = None,
        *,
        index_type: Literal[
            "IVF_FLAT", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
        ] = "IVF_PQ",
        num_bits: int = 8,
        max_iterations: int = 50,
        sample_rate: int = 256,
        m: int = 20,
        ef_construction: int = 300,
    ):
        """Create an index on the table.

        Parameters
        ----------
        metric: str, default "L2"
            The distance metric to use when creating the index.
            Valid values are "L2", "cosine", "dot", or "hamming".
            L2 is euclidean distance.
            Hamming is available only for binary vectors.
        num_partitions: int, default 256
            The number of IVF partitions to use when creating the index.
            Default is 256.
        num_sub_vectors: int, default 96
            The number of PQ sub-vectors to use when creating the index.
            Default is 96.
        vector_column_name: str, default "vector"
            The vector column name to create the index.
        replace: bool, default True
            - If True, replace the existing index if it exists.

            - If False, raise an error if duplicate index exists.
        accelerator: str, default None
            If set, use the given accelerator to create the index.
            Only support "cuda" for now.
        index_cache_size : int, optional
            The size of the index cache in number of entries. Default value is 256.
        num_bits: int
            The number of bits to encode sub-vectors. Only used with the IVF_PQ index.
            Only 4 and 8 are supported.
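
        Examples
        --------
        A hedged sketch (assumes an existing table with a small "vector"
        column; the parameter values are illustrative, not tuned):

        >>> table.create_index(metric="cosine",  # doctest: +SKIP
        ...                    num_partitions=16,
        ...                    num_sub_vectors=2)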
        """
        raise NotImplementedError

    @abstractmethod
    def create_scalar_index(
        self,
        column: str,
        *,
        replace: bool = True,
        index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"] = "BTREE",
    ):
        """Create a scalar index on a column.

        Parameters
        ----------
        column : str
            The column to be indexed.  Must be a boolean, integer, float,
            or string column.
        replace : bool, default True
            Replace the existing index if it exists.
        index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"], default "BTREE"
            The type of index to create.

        Examples
        --------

        Scalar indices, like vector indices, can be used to speed up scans.  A scalar
        index can speed up scans that contain filter expressions on the indexed column.
        For example, the following scan will be faster if the column ``my_col`` has
        a scalar index:

        >>> import lancedb # doctest: +SKIP
        >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
        >>> img_table = db.open_table("images") # doctest: +SKIP
        >>> my_df = img_table.search().where("my_col = 7", # doctest: +SKIP
        ...                                  prefilter=True).to_pandas()

        Scalar indices can also speed up scans containing a vector search and a
        prefilter:

        >>> import lancedb # doctest: +SKIP
        >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
        >>> img_table = db.open_table("images") # doctest: +SKIP
        >>> img_table.search([1, 2, 3, 4], vector_column_name="vector") # doctest: +SKIP
        ...     .where("my_col != 7", prefilter=True)
        ...     .to_pandas()

        Scalar indices can only speed up scans for basic filters using
        equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
        membership (e.g. ``my_col IN (0, 1, 2)``).

        Scalar indices can be used if the filter contains multiple indexed columns and
        the filter criteria are AND'd or OR'd together
        (e.g. ``my_col < 0 AND other_col > 100``).

        Scalar indices may be used if the filter contains non-indexed columns but,
        depending on the structure of the filter, they may not be usable.  For example,
        if the column ``not_indexed`` does not have a scalar index then the filter
        ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
        ``my_col``.
        """
        raise NotImplementedError

    def create_fts_index(
        self,
        field_names: Union[str, List[str]],
        *,
        ordering_field_names: Optional[Union[str, List[str]]] = None,
        replace: bool = False,
        writer_heap_size: Optional[int] = 1024 * 1024 * 1024,
        use_tantivy: bool = True,
        tokenizer_name: Optional[str] = None,
        with_position: bool = True,
        # tokenizer configs:
        base_tokenizer: Literal["simple", "raw", "whitespace"] = "simple",
        language: str = "English",
        max_token_length: Optional[int] = 40,
        lower_case: bool = True,
        stem: bool = False,
        remove_stop_words: bool = False,
        ascii_folding: bool = False,
    ):
        """Create a full-text search index on the table.

        Warning: this API is highly experimental and likely to change
        in the future.

        Parameters
        ----------
        field_names: str or list of str
            The name(s) of the field to index.
            For now, this can only be a single str when use_tantivy=True.
        replace: bool, default False
            If True, replace the existing index if it exists. Note that this is
            not yet an atomic operation; the index will be temporarily
            unavailable while the new index is being created.
        writer_heap_size: int, default 1GB
            Only available with use_tantivy=True.
        ordering_field_names:
            A list of unsigned type fields to index to optionally order
            results on at search time.
            Only available with use_tantivy=True.
        tokenizer_name: str, default "default"
            The tokenizer to use for the index. Can be "raw", "default" or the 2 letter
            language code followed by "_stem". So for english it would be "en_stem".
            For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
        use_tantivy: bool, default True
            If True, use the legacy full-text search implementation based on tantivy.
            If False, use the new full-text search implementation based on lance-index.
        with_position: bool, default True
            Only available with use_tantivy=False.
            If False, do not store the positions of the terms in the text.
            This can reduce the size of the index and improve indexing speed.
            But it will raise an exception for phrase queries.
        base_tokenizer : str, default "simple"
            The base tokenizer to use for tokenization. Options are:
            - "simple": Splits text by whitespace and punctuation.
            - "whitespace": Split text by whitespace, but not punctuation.
            - "raw": No tokenization. The entire text is treated as a single token.
        language : str, default "English"
            The language to use for tokenization.
        max_token_length : int, default 40
            The maximum token length to index. Tokens longer than this length will be
            ignored.
        lower_case : bool, default True
            Whether to convert the token to lower case. This makes queries
            case-insensitive.
        stem : bool, default False
            Whether to stem the token. Stemming reduces words to their root form.
            For example, in English "running" and "runs" would both be reduced to "run".
        remove_stop_words : bool, default False
            Whether to remove stop words. Stop words are common words that are often
            removed from text before indexing. For example, in English "the" and "and".
        ascii_folding : bool, default False
            Whether to fold ASCII characters. This converts accented characters to
            their ASCII equivalent. For example, "café" would be converted to "cafe".
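
        Examples
        --------
        A hedged sketch indexing a hypothetical "text" column with the
        lance-based implementation:

        >>> table.create_fts_index("text", use_tantivy=False)  # doctest: +SKIP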
        """
        raise NotImplementedError

    @abstractmethod
    def add(
        self,
        data: DATA,
        mode: str = "append",
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
    ):
        """Add more data to the [Table](Table).

        Parameters
        ----------
        data: DATA
            The data to insert into the table. Acceptable types are:

            - list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contain NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.0
            The value to use when filling vectors. Only used if on_bad_vectors="fill".
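
        Examples
        --------
        A short sketch appending one row (assumes the schema from the
        class-level example above):

        >>> table.add([{"vector": [2.0, 2.1], "b": 6}])  # doctest: +SKIP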

        """
        raise NotImplementedError

    def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
        """
        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
        that can be used to create a "merge insert" operation

        This operation can add rows, update rows, and remove rows all in a single
        transaction. It is a very generic tool that can be used to create
        behaviors like "insert if not exists", "update or insert (i.e. upsert)",
        or even replace a portion of existing data with new data (e.g. replace
        all data where month="january")

        The merge insert operation works by combining new data from a
        **source table** with existing data in a **target table** by using a
        join.  There are three categories of records.

        "Matched" records are records that exist in both the source table and
        the target table. "Not matched" records exist only in the source table
        (e.g. these are new data). "Not matched by source" records exist only
        in the target table (this is old data).

        The builder returned by this method can be used to customize what
        should happen for each category of data.

        Please note that the data may appear to be reordered as part of this
        operation.  This is because updated rows will be deleted from the
        dataset and then reinserted at the end with the new values.

        Parameters
        ----------

        on: Union[str, Iterable[str]]
            A column (or columns) to join on.  This is how records from the
            source table and target table are matched.  Typically this is some
            kind of key or id column.

        Examples
        --------
        >>> import lancedb
        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
        >>> # Perform an "upsert" operation
        >>> table.merge_insert("a")             \\
        ...      .when_matched_update_all()     \\
        ...      .when_not_matched_insert_all() \\
        ...      .execute(new_data)
        >>> # The order of new rows is non-deterministic since we use
        >>> # a hash-join as part of this operation and so we sort here
        >>> table.to_arrow().sort_by("a").to_pandas()
           a  b
        0  1  b
        1  2  x
        2  3  y
        3  4  z
        """
        on = [on] if isinstance(on, str) else list(on)

        return LanceMergeInsertBuilder(self, on)

    @abstractmethod
    def search(
        self,
        query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
        vector_column_name: Optional[str] = None,
        query_type: QueryType = "auto",
        ordering_field_name: Optional[str] = None,
        fts_columns: Optional[Union[str, List[str]]] = None,
    ) -> LanceQueryBuilder:
        """Create a search query to find the nearest neighbors
        of the given query vector. We currently support [vector search][search]
        and [full-text search][experimental-full-text-search].

        All query options are defined in [Query][lancedb.query.Query].

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [
        ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
        ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
        ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
        ... ]
        >>> table = db.create_table("my_table", data)
        >>> query = [0.4, 1.4, 2.4]
        >>> (table.search(query)
        ...     .where("original_width > 1000", prefilter=True)
        ...     .select(["caption", "original_width", "vector"])
        ...     .limit(2)
        ...     .to_pandas())
          caption  original_width           vector  _distance
        0     foo            2000  [0.5, 3.4, 1.3]   5.220000
        1    test            3000  [0.3, 6.2, 2.6]  23.089996

        Parameters
        ----------
        query: list/np.ndarray/str/PIL.Image.Image, default None
            The targeted vector to search for.

            - *default None*.
            Acceptable types are: list, np.ndarray, PIL.Image.Image

            - If None then the select/where/limit clauses are applied to filter
            the table
        vector_column_name: str, optional
            The name of the vector column to search.

            The vector column needs to be a pyarrow fixed size list type

            - If not specified then the vector column is inferred from
            the table schema

            - If the table has multiple vector columns then the *vector_column_name*
            needs to be specified. Otherwise, an error is raised.
        query_type: str
            *default "auto"*.
            Acceptable types are: "vector", "fts", "hybrid", or "auto"

            - If "auto" then the query type is inferred from the query;

                - If `query` is a list/np.ndarray then the query type is
                "vector";

                - If `query` is a PIL.Image.Image then either do vector search,
                or raise an error if no corresponding embedding function is found.

            - If `query` is a string, then the query type is "vector" if the
            table has embedding functions; otherwise the query type is "fts"

        Returns
        -------
        LanceQueryBuilder
            A query builder object representing the query.
            Once executed, the query returns

            - selected columns

            - the vector

            - and also the "_distance" column which is the distance between the query
            vector and the returned vector.
        """
        raise NotImplementedError

    @abstractmethod
    def _execute_query(
        self, query: Query, batch_size: Optional[int] = None
    ) -> pa.RecordBatchReader: ...

    @abstractmethod
    def _do_merge(
        self,
        merge: LanceMergeInsertBuilder,
        new_data: DATA,
        on_bad_vectors: str,
        fill_value: float,
    ): ...

    @abstractmethod
    def delete(self, where: str):
        """Delete rows from the table.

        This can be used to delete a single row, many rows, all rows, or
        sometimes no rows (if your predicate matches nothing).

        Parameters
        ----------
        where: str
            The SQL where clause to use when deleting rows.

            - For example, 'x = 2' or 'x IN (1, 2, 3)'.

            The filter must not be empty, or it will error.

        Examples
        --------
        >>> import lancedb
        >>> data = [
        ...    {"x": 1, "vector": [1.0, 2]},
        ...    {"x": 2, "vector": [3.0, 4]},
        ...    {"x": 3, "vector": [5.0, 6]}
        ... ]
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.delete("x = 2")
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  3  [5.0, 6.0]

        If you have a list of values to delete, you can combine them into a
        stringified list and use the `IN` operator:

        >>> to_remove = [1, 5]
        >>> to_remove = ", ".join([str(v) for v in to_remove])
        >>> to_remove
        '1, 5'
        >>> table.delete(f"x IN ({to_remove})")
        >>> table.to_pandas()
           x      vector
        0  3  [5.0, 6.0]
        """
        raise NotImplementedError

    @abstractmethod
    def update(
        self,
        where: Optional[str] = None,
        values: Optional[dict] = None,
        *,
        values_sql: Optional[Dict[str, str]] = None,
    ):
        """
        This can be used to update zero to all rows depending on how many
        rows match the where clause. If no where clause is provided, then
        all rows will be updated.

        Either `values` or `values_sql` must be provided. You cannot provide
        both.

        Parameters
        ----------
        where: str, optional
            The SQL where clause to use when updating rows. For example, 'x = 2'
            or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
        values: dict, optional
            The values to update. The keys are the column names and the values
            are the values to set.
        values_sql: dict, optional
            The values to update, expressed as SQL expression strings. These can
            reference existing columns. For example, {"x": "x + 1"} will increment
            the x column by 1.

        Examples
        --------
        >>> import lancedb
        >>> import pandas as pd
        >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.update(where="x = 2", values={"vector": [10.0, 10]})
        >>> table.to_pandas()
           x        vector
        0  1    [1.0, 2.0]
        1  3    [5.0, 6.0]
        2  2  [10.0, 10.0]
        >>> table.update(values_sql={"x": "x + 1"})
        >>> table.to_pandas()
           x        vector
        0  2    [1.0, 2.0]
        1  4    [5.0, 6.0]
        2  3  [10.0, 10.0]
        """
        raise NotImplementedError

    @abstractmethod
    def cleanup_old_versions(
        self,
        older_than: Optional[timedelta] = None,
        *,
        delete_unverified: bool = False,
    ) -> CleanupStats:
        """
        Clean up old versions of the table, freeing disk space.

        Parameters
        ----------
        older_than: timedelta, default None
            The minimum age of the version to delete. If None, then this defaults
            to two weeks.
        delete_unverified: bool, default False
            Because they may be part of an in-progress transaction, files newer
            than 7 days old are not deleted by default. If you are sure that
            there are no in-progress transactions, then you can set this to True
            to delete all files older than `older_than`.

        Returns
        -------
        CleanupStats
            The stats of the cleanup operation, including how many bytes were
            freed.

        See Also
        --------
        [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
            optimization operation that includes cleanup as well as other operations.

        Notes
        -----
        This function is not available in LanceDB Cloud (since LanceDB
        Cloud manages cleanup for you automatically)
        """

    @abstractmethod
    def compact_files(self, *args, **kwargs):
        """
        Run the compaction process on the table.
        This can be run after making several small appends to optimize the table
        for faster reads.

        Arguments are passed onto Lance's
        [compact_files][lance.dataset.DatasetOptimizer.compact_files].
        For most cases, the default should be fine.

        See Also
        --------
        [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
            optimization operation that includes cleanup as well as other operations.

        Notes
        -----
        This function is not available in LanceDB Cloud (since LanceDB
        Cloud manages compaction for you automatically)
        """

    @abstractmethod
    def optimize(
        self,
        *,
        cleanup_older_than: Optional[timedelta] = None,
        delete_unverified: bool = False,
    ):
        """
        Optimize the on-disk data and indices for better performance.

        Modeled after ``VACUUM`` in PostgreSQL.

        Optimization covers three operations:

         * Compaction: Merges small files into larger ones
         * Prune: Removes old versions of the dataset
         * Index: Optimizes the indices, adding new data to existing indices

        Parameters
        ----------
        cleanup_older_than: timedelta, optional, default 7 days
            All files belonging to versions older than this will be removed.  Set
            to 0 days to remove all versions except the latest.  The latest version
            is never removed.
        delete_unverified: bool, default False
            Files leftover from a failed transaction may appear to be part of an
            in-progress operation (e.g. appending new data) and these files will not
            be deleted unless they are at least 7 days old. If delete_unverified is True
            then these files will be deleted regardless of their age.

        Experimental API
        ----------------

        The optimization process is undergoing active development and may change.
        Our goal with these changes is to improve the performance of optimization and
        reduce the complexity.

        That being said, it is essential today to run optimize if you want the best
        performance.  It should be stable and safe to use in production, but it is
        our hope that the API may be simplified (or not even need to be called) in
        the future.

        The frequency with which an application should call optimize depends on the
        frequency of data modifications.  If data is frequently added, deleted, or
        updated then optimize should be run frequently.  A good rule of thumb is to
        run optimize if you have added or modified 100,000 or more records or run
        more than 20 data modification operations.
        """

    @abstractmethod
    def list_indices(self) -> Iterable[IndexConfig]:
        """
        List all indices that have been created with
        [Table.create_index][lancedb.table.Table.create_index]
        """

    @abstractmethod
    def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
        """
        Retrieve statistics about an index

        Parameters
        ----------
        index_name: str
            The name of the index to retrieve statistics for

        Returns
        -------
        IndexStatistics or None
            The statistics about the index. Returns None if the index does not exist.
        """

    @abstractmethod
    def add_columns(self, transforms: Dict[str, str]):
        """
        Add new columns with defined values.

        Parameters
        ----------
        transforms: Dict[str, str]
            A map of column name to a SQL expression to use to calculate the
            value of the new column. These expressions will be evaluated for
            each row in the table, and can reference existing columns.
        """

    @abstractmethod
    def alter_columns(self, *alterations: Iterable[Dict[str, str]]):
        """
        Alter column names and nullability.

        Parameters
        ----------
        alterations : Iterable[Dict[str, Any]]
            A sequence of dictionaries, each with the following keys:
            - "path": str
                The column path to alter. For a top-level column, this is the name.
                For a nested column, this is the dot-separated path, e.g. "a.b.c".
            - "rename": str, optional
                The new name of the column. If not specified, the column name is
                not changed.
            - "data_type": pyarrow.DataType, optional
               The new data type of the column. Existing values will be cast
               to this type. If not specified, the column data type is not changed.
            - "nullable": bool, optional
                Whether the column should be nullable. If not specified, the column
                nullability is not changed. Only non-nullable columns can be changed
                to nullable. Currently, you cannot change a nullable column to
                non-nullable.
        """

    @abstractmethod
    def drop_columns(self, columns: Iterable[str]):
        """
        Drop columns from the table.

        Parameters
        ----------
        columns : Iterable[str]
            The names of the columns to drop.
        """

    @abstractmethod
    def checkout(self, version: int):
        """
        Checks out a specific version of the Table

        Any read operation on the table will now access the data at the checked out
        version. As a consequence, calling this method will disable any read consistency
        interval that was previously set.

        This is a read-only operation that turns the table into a sort of "view"
        or "detached head".  Other table instances will not be affected.  To make the
        change permanent you can use the
        [Table.restore][lancedb.table.Table.restore] method.

        Any operation that modifies the table will fail while the table is in a checked
        out state.

        To return the table to a normal state use
        [Table.checkout_latest][lancedb.table.Table.checkout_latest]
        """

    @abstractmethod
    def checkout_latest(self):
        """
        Ensures the table is pointing at the latest version

        This can be used to manually update a table when the read_consistency_interval
        is None.
        It can also be used to undo a
        [Table.checkout][lancedb.table.Table.checkout] operation.
        """

    @abstractmethod
    def list_versions(self) -> List[Dict[str, Any]]:
        """List all versions of the table"""

    @cached_property
    def _dataset_uri(self) -> str:
        return _table_uri(self._conn.uri, self.name)

    def _get_fts_index_path(self) -> Tuple[str, pa_fs.FileSystem, bool]:
        from .remote.table import RemoteTable

        if isinstance(self, RemoteTable) or get_uri_scheme(self._dataset_uri) != "file":
            return ("", None, False)
        path = join_uri(self._dataset_uri, "_indices", "fts")
        fs, path = fs_from_uri(path)
        index_exists = fs.get_file_info(path).type != pa_fs.FileType.NotFound
        return (path, fs, index_exists)

    @abstractmethod
    def uses_v2_manifest_paths(self) -> bool:
        """
        Check if the table is using the new v2 manifest paths.

        Returns
        -------
        bool
            True if the table is using the new v2 manifest paths, False otherwise.
        """

    @abstractmethod
    def migrate_v2_manifest_paths(self):
        """
        Migrate the manifest paths to the new format.

        This will update the manifest to use the new v2 format for paths.

        This function is idempotent, and can be run multiple times without
        changing the state of the object store.

        !!! danger

            This should not be run while other concurrent operations are happening.
            It should also be run to completion before resuming other operations.

        You can use
        [Table.uses_v2_manifest_paths][lancedb.table.Table.uses_v2_manifest_paths]
        to check if the table is already using the new path style.
        """

name abstractmethod property

name: str

The name of this Table

version abstractmethod property

version: int

The version of this Table

schema abstractmethod property

schema: Schema

The Arrow Schema of this Table

embedding_functions abstractmethod property

embedding_functions: Dict[str, EmbeddingFunctionConfig]

Get a mapping from vector column name to its configured embedding function.

count_rows abstractmethod

count_rows(filter: Optional[str] = None) -> int

Count the number of rows in the table.

Parameters:

  • filter (Optional[str], default: None ) –

    A SQL where clause to filter the rows to count.

Source code in lancedb/table.py
@abstractmethod
def count_rows(self, filter: Optional[str] = None) -> int:
    """
    Count the number of rows in the table.

    Parameters
    ----------
    filter: str, optional
        A SQL where clause to filter the rows to count.
    """
    raise NotImplementedError
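
For illustration, a minimal usage sketch (the table name "counts_demo" is made up for this example); the filter is an ordinary SQL where clause:

import lancedb

db = lancedb.connect("./.lancedb")
table = db.create_table("counts_demo", [{"x": 1}, {"x": 2}, {"x": 3}])

total = table.count_rows()            # counts every row -> 3
matching = table.count_rows("x > 1")  # counts only rows matching the filter -> 2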

to_pandas

to_pandas() -> 'pandas.DataFrame'

Return the table as a pandas DataFrame.

Returns:

  • DataFrame
Source code in lancedb/table.py
def to_pandas(self) -> "pandas.DataFrame":
    """Return the table as a pandas DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    return self.to_arrow().to_pandas()

to_arrow abstractmethod

to_arrow() -> Table

Return the table as a pyarrow Table.

Returns:

  • Table

    The table as a pyarrow Table.
Source code in lancedb/table.py
@abstractmethod
def to_arrow(self) -> pa.Table:
    """Return the table as a pyarrow Table.

    Returns
    -------
    pa.Table
    """
    raise NotImplementedError

create_index

create_index(metric='L2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None, *, index_type: Literal['IVF_FLAT', 'IVF_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_PQ'] = 'IVF_PQ', num_bits: int = 8, max_iterations: int = 50, sample_rate: int = 256, m: int = 20, ef_construction: int = 300)

Create an index on the table.

Parameters:

  • metric

    The distance metric to use when creating the index. Valid values are "L2", "cosine", "dot", or "hamming". L2 is Euclidean distance. Hamming is available only for binary vectors.

  • num_partitions

    The number of IVF partitions to use when creating the index. Default is 256.

  • num_sub_vectors

    The number of PQ sub-vectors to use when creating the index. Default is 96.

  • vector_column_name (str, default: VECTOR_COLUMN_NAME ) –

    The vector column name to create the index.

  • replace (bool, default: True ) –
    • If True, replace the existing index if it exists.

    • If False, raise an error if duplicate index exists.

  • accelerator (Optional[str], default: None ) –

    If set, use the given accelerator to create the index. Only "cuda" is supported for now.

  • index_cache_size (int, default: None ) –

    The size of the index cache in number of entries. Default value is 256.

  • num_bits (int, default: 8 ) –

    The number of bits to encode sub-vectors. Only used with the IVF_PQ index. Only 4 and 8 are supported.

Source code in lancedb/table.py
def create_index(
    self,
    metric="L2",
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name: str = VECTOR_COLUMN_NAME,
    replace: bool = True,
    accelerator: Optional[str] = None,
    index_cache_size: Optional[int] = None,
    *,
    index_type: Literal[
        "IVF_FLAT", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
    ] = "IVF_PQ",
    num_bits: int = 8,
    max_iterations: int = 50,
    sample_rate: int = 256,
    m: int = 20,
    ef_construction: int = 300,
):
    """Create an index on the table.

    Parameters
    ----------
    metric: str, default "L2"
        The distance metric to use when creating the index.
        Valid values are "L2", "cosine", "dot", or "hamming".
        L2 is Euclidean distance.
        Hamming is available only for binary vectors.
    num_partitions: int, default 256
        The number of IVF partitions to use when creating the index.
        Default is 256.
    num_sub_vectors: int, default 96
        The number of PQ sub-vectors to use when creating the index.
        Default is 96.
    vector_column_name: str, default "vector"
        The vector column name to create the index.
    replace: bool, default True
        - If True, replace the existing index if it exists.

        - If False, raise an error if duplicate index exists.
    accelerator: str, default None
        If set, use the given accelerator to create the index.
        Only "cuda" is supported for now.
    index_cache_size : int, optional
        The size of the index cache in number of entries. Default value is 256.
    num_bits: int
        The number of bits to encode sub-vectors. Only used with the IVF_PQ index.
        Only 4 and 8 are supported.
    """
    raise NotImplementedError
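
As a usage sketch (the table name and parameter choices here are illustrative, not prescriptive), building an IVF_PQ index on the default "vector" column might look like this:

import lancedb

db = lancedb.connect("./.lancedb")
table = db.open_table("my_table")  # assumes an existing table with a "vector" column

# Build an IVF_PQ index using cosine distance; the partition and
# sub-vector counts below are the documented defaults, spelled out
# for clarity.
table.create_index(
    metric="cosine",
    num_partitions=256,
    num_sub_vectors=96,
    index_type="IVF_PQ",
    num_bits=8,
)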

create_scalar_index abstractmethod

create_scalar_index(column: str, *, replace: bool = True, index_type: Literal['BTREE', 'BITMAP', 'LABEL_LIST'] = 'BTREE')

Create a scalar index on a column.

Parameters:

  • column (str) –

    The column to be indexed. Must be a boolean, integer, float, or string column.

  • replace (bool, default: True ) –

    Replace the existing index if it exists.

  • index_type (Literal['BTREE', 'BITMAP', 'LABEL_LIST'], default: 'BTREE' ) –

    The type of index to create.

Examples:

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> my_df = img_table.search().where("my_col = 7",
...                                  prefilter=True).to_pandas()

Scalar indices can also speed up scans containing a vector search and a prefilter:

>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> img_table.search([1, 2, 3, 4], vector_column_name="vector")
...     .where("my_col != 7", prefilter=True)
...     .to_pandas()

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2))

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col > 100)

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.

Source code in lancedb/table.py
@abstractmethod
def create_scalar_index(
    self,
    column: str,
    *,
    replace: bool = True,
    index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"] = "BTREE",
):
    """Create a scalar index on a column.

    Parameters
    ----------
    column : str
        The column to be indexed.  Must be a boolean, integer, float,
        or string column.
    replace : bool, default True
        Replace the existing index if it exists.
    index_type: Literal["BTREE", "BITMAP", "LABEL_LIST"], default "BTREE"
        The type of index to create.

    Examples
    --------

    Scalar indices, like vector indices, can be used to speed up scans.  A scalar
    index can speed up scans that contain filter expressions on the indexed column.
    For example, the following scan will be faster if the column ``my_col`` has
    a scalar index:

    >>> import lancedb # doctest: +SKIP
    >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
    >>> img_table = db.open_table("images") # doctest: +SKIP
    >>> my_df = img_table.search().where("my_col = 7", # doctest: +SKIP
    ...                                  prefilter=True).to_pandas()

    Scalar indices can also speed up scans containing a vector search and a
    prefilter:

    >>> import lancedb # doctest: +SKIP
    >>> db = lancedb.connect("/data/lance") # doctest: +SKIP
    >>> img_table = db.open_table("images") # doctest: +SKIP
    >>> img_table.search([1, 2, 3, 4], vector_column_name="vector") # doctest: +SKIP
    ...     .where("my_col != 7", prefilter=True)
    ...     .to_pandas()

    Scalar indices can only speed up scans for basic filters using
    equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
    membership (e.g. `my_col IN (0, 1, 2)`)

    Scalar indices can be used if the filter contains multiple indexed columns and
    the filter criteria are AND'd or OR'd together
    (e.g. ``my_col < 0 AND other_col > 100``)

    Scalar indices may be used if the filter contains non-indexed columns but,
    depending on the structure of the filter, they may not be usable.  For example,
    if the column ``not_indexed`` does not have a scalar index then the filter
    ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
    ``my_col``.
    """
    raise NotImplementedError

create_fts_index

create_fts_index(field_names: Union[str, List[str]], *, ordering_field_names: Optional[Union[str, List[str]]] = None, replace: bool = False, writer_heap_size: Optional[int] = 1024 * 1024 * 1024, use_tantivy: bool = True, tokenizer_name: Optional[str] = None, with_position: bool = True, base_tokenizer: Literal['simple', 'raw', 'whitespace'] = 'simple', language: str = 'English', max_token_length: Optional[int] = 40, lower_case: bool = True, stem: bool = False, remove_stop_words: bool = False, ascii_folding: bool = False)

Create a full-text search index on the table.

Warning - this API is highly experimental and likely to change in the future.

Parameters:

  • field_names (Union[str, List[str]]) –

    The name(s) of the field to index. Can only be a single str if use_tantivy=True for now.

  • replace (bool, default: False ) –

    If True, replace the existing index if it exists. Note that this is not yet an atomic operation; the index will be temporarily unavailable while the new index is being created.

  • writer_heap_size (Optional[int], default: 1024 * 1024 * 1024 ) –

    Only available with use_tantivy=True

  • ordering_field_names (Optional[Union[str, List[str]]], default: None ) –

    A list of unsigned type fields to index to optionally order results on at search time. Only available with use_tantivy=True.

  • tokenizer_name (Optional[str], default: None ) –

    The tokenizer to use for the index. Can be "raw", "default" or the 2 letter language code followed by "_stem". So for English it would be "en_stem". For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html

  • use_tantivy (bool, default: True ) –

    If True, use the legacy full-text search implementation based on tantivy. If False, use the new full-text search implementation based on lance-index.

  • with_position (bool, default: True ) –

    Only available with use_tantivy=False. If False, do not store the positions of the terms in the text. This can reduce the size of the index and improve indexing speed. But it will raise an exception for phrase queries.

  • base_tokenizer (str, default: "simple" ) –

    The base tokenizer to use for tokenization. Options are:

    • "simple": Splits text by whitespace and punctuation.

    • "whitespace": Splits text by whitespace, but not punctuation.

    • "raw": No tokenization. The entire text is treated as a single token.

  • language (str, default: "English" ) –

    The language to use for tokenization.

  • max_token_length (int, default: 40 ) –

    The maximum token length to index. Tokens longer than this length will be ignored.

  • lower_case (bool, default: True ) –

    Whether to convert the token to lower case. This makes queries case-insensitive.

  • stem (bool, default: False ) –

    Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".

  • remove_stop_words (bool, default: False ) –

    Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".

  • ascii_folding (bool, default: False ) –

    Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".

Source code in lancedb/table.py
def create_fts_index(
    self,
    field_names: Union[str, List[str]],
    *,
    ordering_field_names: Optional[Union[str, List[str]]] = None,
    replace: bool = False,
    writer_heap_size: Optional[int] = 1024 * 1024 * 1024,
    use_tantivy: bool = True,
    tokenizer_name: Optional[str] = None,
    with_position: bool = True,
    # tokenizer configs:
    base_tokenizer: Literal["simple", "raw", "whitespace"] = "simple",
    language: str = "English",
    max_token_length: Optional[int] = 40,
    lower_case: bool = True,
    stem: bool = False,
    remove_stop_words: bool = False,
    ascii_folding: bool = False,
):
    """Create a full-text search index on the table.

    Warning - this API is highly experimental and likely to change
    in the future.

    Parameters
    ----------
    field_names: str or list of str
        The name(s) of the field to index.
        Can only be a single str if use_tantivy=True for now.
    replace: bool, default False
        If True, replace the existing index if it exists. Note that this is
        not yet an atomic operation; the index will be temporarily
        unavailable while the new index is being created.
    writer_heap_size: int, default 1GB
        Only available with use_tantivy=True
    ordering_field_names:
        A list of unsigned type fields to index to optionally order
        results on at search time.
        Only available with use_tantivy=True.
    tokenizer_name: str, default "default"
        The tokenizer to use for the index. Can be "raw", "default" or the 2 letter
        language code followed by "_stem". So for English it would be "en_stem".
        For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
    use_tantivy: bool, default True
        If True, use the legacy full-text search implementation based on tantivy.
        If False, use the new full-text search implementation based on lance-index.
    with_position: bool, default True
        Only available with use_tantivy=False
        If False, do not store the positions of the terms in the text.
        This can reduce the size of the index and improve indexing speed.
        But it will raise an exception for phrase queries.
    base_tokenizer : str, default "simple"
        The base tokenizer to use for tokenization. Options are:
        - "simple": Splits text by whitespace and punctuation.
        - "whitespace": Split text by whitespace, but not punctuation.
        - "raw": No tokenization. The entire text is treated as a single token.
    language : str, default "English"
        The language to use for tokenization.
    max_token_length : int, default 40
        The maximum token length to index. Tokens longer than this length will be
        ignored.
    lower_case : bool, default True
        Whether to convert the token to lower case. This makes queries
        case-insensitive.
    stem : bool, default False
        Whether to stem the token. Stemming reduces words to their root form.
        For example, in English "running" and "runs" would both be reduced to "run".
    remove_stop_words : bool, default False
        Whether to remove stop words. Stop words are common words that are often
        removed from text before indexing. For example, in English "the" and "and".
    ascii_folding : bool, default False
        Whether to fold ASCII characters. This converts accented characters to
        their ASCII equivalent. For example, "café" would be converted to "cafe".
    """
    raise NotImplementedError
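
As a hedged sketch of the newer lance-index path (the table and the text column "caption" are assumptions for this example):

import lancedb

db = lancedb.connect("./.lancedb")
table = db.open_table("my_table")  # assumes a table with a text column "caption"

# Index "caption" with the lance-index implementation, lower-casing
# tokens, stemming them, and dropping English stop words.
table.create_fts_index(
    "caption",
    use_tantivy=False,
    lower_case=True,
    stem=True,
    remove_stop_words=True,
)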

add abstractmethod

add(data: DATA, mode: str = 'append', on_bad_vectors: str = 'error', fill_value: float = 0.0)

Add more data to the Table.

Parameters:

  • data (DATA) –

    The data to insert into the table. Acceptable types are:

    • list-of-dict

    • pandas.DataFrame

    • pyarrow.Table or pyarrow.RecordBatch

  • mode (str, default: 'append' ) –

    The mode to use when writing the data. Valid values are "append" and "overwrite".

  • on_bad_vectors (str, default: 'error' ) –

    What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".

  • fill_value (float, default: 0.0 ) –

    The value to use when filling vectors. Only used if on_bad_vectors="fill".

Source code in lancedb/table.py
@abstractmethod
def add(
    self,
    data: DATA,
    mode: str = "append",
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
):
    """Add more data to the [Table](Table).

    Parameters
    ----------
    data: DATA
        The data to insert into the table. Acceptable types are:

        - list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    mode: str
        The mode to use when writing the data. Valid values are
        "append" and "overwrite".
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contain NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    """
    raise NotImplementedError
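
A minimal sketch of appending rows (the table name "add_demo" is made up). The second new row has a bad vector; with on_bad_vectors="fill" it is filled using fill_value instead of raising an error:

import lancedb

db = lancedb.connect("./.lancedb")
table = db.create_table("add_demo", [{"x": 1, "vector": [1.0, 2.0]}])

table.add(
    [{"x": 2, "vector": [3.0, 4.0]}, {"x": 3, "vector": [5.0]}],  # second vector is ragged
    mode="append",
    on_bad_vectors="fill",
    fill_value=0.0,
)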

merge_insert

merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january")

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (e.g. these are new data) "Not matched by source" records exist only in the target table (this is old data)

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

  • on (Union[str, Iterable[str]]) –

    A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.

Examples:

>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation
>>> table.merge_insert("a")             \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
Source code in lancedb/table.py
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
    """
    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
    that can be used to create a "merge insert" operation

    This operation can add rows, update rows, and remove rows all in a single
    transaction. It is a very generic tool that can be used to create
    behaviors like "insert if not exists", "update or insert (i.e. upsert)",
    or even replace a portion of existing data with new data (e.g. replace
    all data where month="january")

    The merge insert operation works by combining new data from a
    **source table** with existing data in a **target table** by using a
    join.  There are three categories of records.

    "Matched" records are records that exist in both the source table and
    the target table. "Not matched" records exist only in the source table
    (e.g. these are new data). "Not matched by source" records exist only
    in the target table (this is old data).

    The builder returned by this method can be used to customize what
    should happen for each category of data.

    Please note that the data may appear to be reordered as part of this
    operation.  This is because updated rows will be deleted from the
    dataset and then reinserted at the end with the new values.

    Parameters
    ----------

    on: Union[str, Iterable[str]]
        A column (or columns) to join on.  This is how records from the
        source table and target table are matched.  Typically this is some
        kind of key or id column.

    Examples
    --------
    >>> import lancedb
    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
    >>> # Perform an "upsert" operation
    >>> table.merge_insert("a")             \\
    ...      .when_matched_update_all()     \\
    ...      .when_not_matched_insert_all() \\
    ...      .execute(new_data)
    >>> # The order of new rows is non-deterministic since we use
    >>> # a hash-join as part of this operation and so we sort here
    >>> table.to_arrow().sort_by("a").to_pandas()
       a  b
    0  1  b
    1  2  x
    2  3  y
    3  4  z
    """
    # Normalize `on` to a list of column names
    on = [on] if isinstance(on, str) else list(on)

    return LanceMergeInsertBuilder(self, on)

search abstractmethod

search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]] = None, vector_column_name: Optional[str] = None, query_type: QueryType = 'auto', ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None) -> LanceQueryBuilder

Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and full-text search.

All query options are defined in Query.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
...     .where("original_width > 1000", prefilter=True)
...     .select(["caption", "original_width", "vector"])
...     .limit(2)
...     .to_pandas())
  caption  original_width           vector  _distance
0     foo            2000  [0.5, 3.4, 1.3]   5.220000
1    test            3000  [0.3, 6.2, 2.6]  23.089996

Parameters:

  • query (Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]], default: None ) –

    The targeted vector to search for.

    • default None. Acceptable types are: list, np.ndarray, PIL.Image.Image

    • If None then the select/where/limit clauses are applied to filter the table

  • vector_column_name (Optional[str], default: None ) –

    The name of the vector column to search.

    The vector column needs to be a pyarrow fixed size list type

    • If not specified then the vector column is inferred from the table schema

    • If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.

  • query_type (QueryType, default: 'auto' ) –

    default "auto". Acceptable types are: "vector", "fts", "hybrid", or "auto"

    • If "auto" then the query type is inferred from the query;

      • If query is a list/np.ndarray then the query type is "vector";

      • If query is a PIL.Image.Image then a vector search is performed, or an error is raised if no corresponding embedding function is found.

    • If query is a string, then the query type is "vector" if the table has embedding functions; otherwise the query type is "fts".

Returns:

  • LanceQueryBuilder

    A query builder object representing the query. Once executed, the query returns

    • selected columns

    • the vector

    • and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/table.py
@abstractmethod
def search(
    self,
    query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
    vector_column_name: Optional[str] = None,
    query_type: QueryType = "auto",
    ordering_field_name: Optional[str] = None,
    fts_columns: Optional[Union[str, List[str]]] = None,
) -> LanceQueryBuilder:
    """Create a search query to find the nearest neighbors
    of the given query vector. We currently support [vector search][search]
    and [full-text search][experimental-full-text-search].

    All query options are defined in [Query][lancedb.query.Query].

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [
    ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
    ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
    ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
    ... ]
    >>> table = db.create_table("my_table", data)
    >>> query = [0.4, 1.4, 2.4]
    >>> (table.search(query)
    ...     .where("original_width > 1000", prefilter=True)
    ...     .select(["caption", "original_width", "vector"])
    ...     .limit(2)
    ...     .to_pandas())
      caption  original_width           vector  _distance
    0     foo            2000  [0.5, 3.4, 1.3]   5.220000
    1    test            3000  [0.3, 6.2, 2.6]  23.089996

    Parameters
    ----------
    query: list/np.ndarray/str/PIL.Image.Image, default None
        The targeted vector to search for.

        - *default None*.
        Acceptable types are: list, np.ndarray, PIL.Image.Image

        - If None then the select/where/limit clauses are applied to filter
        the table
    vector_column_name: str, optional
        The name of the vector column to search.

        The vector column needs to be a pyarrow fixed size list type

        - If not specified then the vector column is inferred from
        the table schema

        - If the table has multiple vector columns then the *vector_column_name*
        needs to be specified. Otherwise, an error is raised.
    query_type: str
        *default "auto"*.
        Acceptable types are: "vector", "fts", "hybrid", or "auto"

        - If "auto" then the query type is inferred from the query;

            - If `query` is a list/np.ndarray then the query type is
            "vector";

            - If `query` is a PIL.Image.Image then a vector search is
            performed, or an error is raised if no corresponding embedding
            function is found.

        - If `query` is a string, then the query type is "vector" if the
        table has embedding functions; otherwise the query type is "fts"

    Returns
    -------
    LanceQueryBuilder
        A query builder object representing the query.
        Once executed, the query returns

        - selected columns

        - the vector

        - and also the "_distance" column which is the distance between the query
        vector and the returned vector.
    """
    raise NotImplementedError

delete abstractmethod

delete(where: str)

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

  • where (str) –

    The SQL where clause to use when deleting rows.

    • For example, 'x = 2' or 'x IN (1, 2, 3)'.

    The filter must not be empty, or it will error.

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1.0, 2]},
...    {"x": 2, "vector": [3.0, 4]},
...    {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
Source code in lancedb/table.py
@abstractmethod
def delete(self, where: str):
    """Delete rows from the table.

    This can be used to delete a single row, many rows, all rows, or
    sometimes no rows (if your predicate matches nothing).

    Parameters
    ----------
    where: str
        The SQL where clause to use when deleting rows.

        - For example, 'x = 2' or 'x IN (1, 2, 3)'.

        The filter must not be empty, or it will error.

    Examples
    --------
    >>> import lancedb
    >>> data = [
    ...    {"x": 1, "vector": [1.0, 2]},
    ...    {"x": 2, "vector": [3.0, 4]},
    ...    {"x": 3, "vector": [5.0, 6]}
    ... ]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.delete("x = 2")
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  3  [5.0, 6.0]

    If you have a list of values to delete, you can combine them into a
    stringified list and use the `IN` operator:

    >>> to_remove = [1, 5]
    >>> to_remove = ", ".join([str(v) for v in to_remove])
    >>> to_remove
    '1, 5'
    >>> table.delete(f"x IN ({to_remove})")
    >>> table.to_pandas()
       x      vector
    0  3  [5.0, 6.0]
    """
    raise NotImplementedError

update abstractmethod

update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None)

This can be used to update zero to all rows depending on how many rows match the where clause. If no where clause is provided, then all rows will be updated.

Either values or values_sql must be provided. You cannot provide both.

Parameters:

  • where (Optional[str], default: None ) –

    The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.

  • values (Optional[dict], default: None ) –

    The values to update. The keys are the column names and the values are the values to set.

  • values_sql (Optional[Dict[str, str]], default: None ) –

    The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.

Examples:

>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10.0, 10]})
>>> table.to_pandas()
   x        vector
0  1    [1.0, 2.0]
1  3    [5.0, 6.0]
2  2  [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
>>> table.to_pandas()
   x        vector
0  2    [1.0, 2.0]
1  4    [5.0, 6.0]
2  3  [10.0, 10.0]
Source code in lancedb/table.py
@abstractmethod
def update(
    self,
    where: Optional[str] = None,
    values: Optional[dict] = None,
    *,
    values_sql: Optional[Dict[str, str]] = None,
):
    """
    This can be used to update zero to all rows depending on how many
    rows match the where clause. If no where clause is provided, then
    all rows will be updated.

    Either `values` or `values_sql` must be provided. You cannot provide
    both.

    Parameters
    ----------
    where: str, optional
        The SQL where clause to use when updating rows. For example, 'x = 2'
        or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
    values: dict, optional
        The values to update. The keys are the column names and the values
        are the values to set.
    values_sql: dict, optional
        The values to update, expressed as SQL expression strings. These can
        reference existing columns. For example, {"x": "x + 1"} will increment
        the x column by 1.

    Examples
    --------
    >>> import lancedb
    >>> import pandas as pd
    >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.update(where="x = 2", values={"vector": [10.0, 10]})
    >>> table.to_pandas()
       x        vector
    0  1    [1.0, 2.0]
    1  3    [5.0, 6.0]
    2  2  [10.0, 10.0]
    >>> table.update(values_sql={"x": "x + 1"})
    >>> table.to_pandas()
       x        vector
    0  2    [1.0, 2.0]
    1  4    [5.0, 6.0]
    2  3  [10.0, 10.0]
    """
    raise NotImplementedError

cleanup_old_versions abstractmethod

cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> CleanupStats

Clean up old versions of the table, freeing disk space.

Parameters:

  • older_than (Optional[timedelta], default: None ) –

    The minimum age of the version to delete. If None, then this defaults to two weeks.

  • delete_unverified (bool, default: False ) –

    Because they may be part of an in-progress transaction, files newer than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, then you can set this to True to delete all files older than older_than.

Returns:

  • CleanupStats

    The stats of the cleanup operation, including how many bytes were freed.

See Also

Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.

Notes

This function is not available in LanceDB Cloud (since LanceDB Cloud manages cleanup for you automatically)

Source code in lancedb/table.py
@abstractmethod
def cleanup_old_versions(
    self,
    older_than: Optional[timedelta] = None,
    *,
    delete_unverified: bool = False,
) -> CleanupStats:
    """
    Clean up old versions of the table, freeing disk space.

    Parameters
    ----------
    older_than: timedelta, default None
        The minimum age of the version to delete. If None, then this defaults
        to two weeks.
    delete_unverified: bool, default False
        Because they may be part of an in-progress transaction, files newer
        than 7 days old are not deleted by default. If you are sure that
        there are no in-progress transactions, then you can set this to True
        to delete all files older than `older_than`.

    Returns
    -------
    CleanupStats
        The stats of the cleanup operation, including how many bytes were
        freed.

    See Also
    --------
    [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
        optimization operation that includes cleanup as well as other operations.

    Notes
    -----
    This function is not available in LanceDB Cloud (since LanceDB
    Cloud manages cleanup for you automatically)
    """

compact_files abstractmethod

compact_files(*args, **kwargs)

Run the compaction process on the table. This can be run after making several small appends to optimize the table for faster reads.

Arguments are passed onto Lance's compact_files. For most cases, the default should be fine.

See Also

Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.

Notes

This function is not available in LanceDB Cloud (since LanceDB Cloud manages compaction for you automatically)

Source code in lancedb/table.py
@abstractmethod
def compact_files(self, *args, **kwargs):
    """
    Run the compaction process on the table.
    This can be run after making several small appends to optimize the table
    for faster reads.

    Arguments are passed onto Lance's
    [compact_files][lance.dataset.DatasetOptimizer.compact_files].
    For most cases, the default should be fine.

    See Also
    --------
    [Table.optimize][lancedb.table.Table.optimize]: A more comprehensive
        optimization operation that includes cleanup as well as other operations.

    Notes
    -----
    This function is not available in LanceDB Cloud (since LanceDB
    Cloud manages compaction for you automatically)
    """

optimize abstractmethod

optimize(*, cleanup_older_than: Optional[timedelta] = None, delete_unverified: bool = False)

Optimize the on-disk data and indices for better performance.

Modeled after VACUUM in PostgreSQL.

Optimization covers three operations:

  • Compaction: Merges small files into larger ones
  • Prune: Removes old versions of the dataset
  • Index: Optimizes the indices, adding new data to existing indices

Parameters:

  • cleanup_older_than (Optional[timedelta], default: None ) –

    All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.

  • delete_unverified (bool, default: False ) –

    Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.

Experimental API

The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.

That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.

The frequency with which an application should call optimize depends on the frequency of data modifications. If data is frequently added, deleted, or updated then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.

Source code in lancedb/table.py
@abstractmethod
def optimize(
    self,
    *,
    cleanup_older_than: Optional[timedelta] = None,
    delete_unverified: bool = False,
):
    """
    Optimize the on-disk data and indices for better performance.

    Modeled after ``VACUUM`` in PostgreSQL.

    Optimization covers three operations:

     * Compaction: Merges small files into larger ones
     * Prune: Removes old versions of the dataset
     * Index: Optimizes the indices, adding new data to existing indices

    Parameters
    ----------
    cleanup_older_than: timedelta, optional, default 7 days
        All files belonging to versions older than this will be removed.  Set
        to 0 days to remove all versions except the latest.  The latest version
        is never removed.
    delete_unverified: bool, default False
        Files leftover from a failed transaction may appear to be part of an
        in-progress operation (e.g. appending new data) and these files will not
        be deleted unless they are at least 7 days old. If delete_unverified is True
        then these files will be deleted regardless of their age.

    Experimental API
    ----------------

    The optimization process is undergoing active development and may change.
    Our goal with these changes is to improve the performance of optimization and
    reduce the complexity.

    That being said, it is essential today to run optimize if you want the best
    performance.  It should be stable and safe to use in production, but it is
    our hope that the API may be simplified (or not even need to be called) in
    the future.

    The frequency with which an application should call optimize depends on the
    frequency of data modifications.  If data is frequently added, deleted, or
    updated then optimize should be run frequently.  A good rule of thumb is to
    run optimize if you have added or modified 100,000 or more records or run
    more than 20 data modification operations.
    """

list_indices abstractmethod

list_indices() -> Iterable[IndexConfig]

List all indices that have been created with Table.create_index

Source code in lancedb/table.py
@abstractmethod
def list_indices(self) -> Iterable[IndexConfig]:
    """
    List all indices that have been created with
    [Table.create_index][lancedb.table.Table.create_index]
    """

index_stats abstractmethod

index_stats(index_name: str) -> Optional[IndexStatistics]

Retrieve statistics about an index

Parameters:

  • index_name (str) –

    The name of the index to retrieve statistics for

Returns:

  • IndexStatistics or None

    The statistics about the index. Returns None if the index does not exist.

Source code in lancedb/table.py
@abstractmethod
def index_stats(self, index_name: str) -> Optional[IndexStatistics]:
    """
    Retrieve statistics about an index

    Parameters
    ----------
    index_name: str
        The name of the index to retrieve statistics for

    Returns
    -------
    IndexStatistics or None
        The statistics about the index. Returns None if the index does not exist.
    """

add_columns abstractmethod

add_columns(transforms: Dict[str, str])

Add new columns with defined values.

Parameters:

  • transforms (Dict[str, str]) –

    A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns.

Source code in lancedb/table.py
@abstractmethod
def add_columns(self, transforms: Dict[str, str]):
    """
    Add new columns with defined values.

    Parameters
    ----------
    transforms: Dict[str, str]
        A map of column name to a SQL expression to use to calculate the
        value of the new column. These expressions will be evaluated for
        each row in the table, and can reference existing columns.
    """

alter_columns abstractmethod

alter_columns(*alterations: Iterable[Dict[str, str]])

Alter column names and nullability.

Parameters:

  • alterations (Iterable[Dict[str, Any]], default: () ) –

    A sequence of dictionaries, each with the following keys:

    • "path": str – The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".

    • "rename": str, optional – The new name of the column. If not specified, the column name is not changed.

    • "data_type": pyarrow.DataType, optional – The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.

    • "nullable": bool, optional – Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.

Source code in lancedb/table.py
@abstractmethod
def alter_columns(self, *alterations: Iterable[Dict[str, str]]):
    """
    Alter column names and nullability.

    Parameters
    ----------
    alterations : Iterable[Dict[str, Any]]
        A sequence of dictionaries, each with the following keys:
        - "path": str
            The column path to alter. For a top-level column, this is the name.
            For a nested column, this is the dot-separated path, e.g. "a.b.c".
        - "rename": str, optional
            The new name of the column. If not specified, the column name is
            not changed.
        - "data_type": pyarrow.DataType, optional
           The new data type of the column. Existing values will be cast
           to this type. If not specified, the column data type is not changed.
        - "nullable": bool, optional
            Whether the column should be nullable. If not specified, the column
            nullability is not changed. Only non-nullable columns can be changed
            to nullable. Currently, you cannot change a nullable column to
            non-nullable.
    """

drop_columns abstractmethod

drop_columns(columns: Iterable[str])

Drop columns from the table.

Parameters:

  • columns (Iterable[str]) –

    The names of the columns to drop.

Source code in lancedb/table.py
@abstractmethod
def drop_columns(self, columns: Iterable[str]):
    """
    Drop columns from the table.

    Parameters
    ----------
    columns : Iterable[str]
        The names of the columns to drop.
    """

checkout abstractmethod

checkout(version: int)

Checks out a specific version of the Table

Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.

This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the Table.restore method.

Any operation that modifies the table will fail while the table is in a checked out state.

To return the table to a normal state, use Table.checkout_latest.

Source code in lancedb/table.py
@abstractmethod
def checkout(self, version: int):
    """
    Checks out a specific version of the Table

    Any read operation on the table will now access the data at the checked out
    version. As a consequence, calling this method will disable any read consistency
    interval that was previously set.

    This is a read-only operation that turns the table into a sort of "view"
    or "detached head".  Other table instances will not be affected.  To make the
    change permanent you can use the `Table.restore` method.

    Any operation that modifies the table will fail while the table is in a checked
    out state.

    To return the table to a normal state use `Table.checkout_latest`.
    """

checkout_latest abstractmethod

checkout_latest()

Ensures the table is pointing at the latest version

This can be used to manually update a table when the read_consistency_interval is None. It can also be used to undo a Table.checkout operation.

Source code in lancedb/table.py
@abstractmethod
def checkout_latest(self):
    """
    Ensures the table is pointing at the latest version

    This can be used to manually update a table when the read_consistency_interval
    is None
    It can also be used to undo a `Table.checkout` operation
    """

list_versions abstractmethod

list_versions() -> List[Dict[str, Any]]

List all versions of the table

Source code in lancedb/table.py
@abstractmethod
def list_versions(self) -> List[Dict[str, Any]]:
    """List all versions of the table"""

uses_v2_manifest_paths abstractmethod

uses_v2_manifest_paths() -> bool

Check if the table is using the new v2 manifest paths.

Returns:

  • bool

    True if the table is using the new v2 manifest paths, False otherwise.

Source code in lancedb/table.py
@abstractmethod
def uses_v2_manifest_paths(self) -> bool:
    """
    Check if the table is using the new v2 manifest paths.

    Returns
    -------
    bool
        True if the table is using the new v2 manifest paths, False otherwise.
    """

migrate_v2_manifest_paths abstractmethod

migrate_v2_manifest_paths()

Migrate the manifest paths to the new format.

This will update the manifest to use the new v2 format for paths.

This function is idempotent, and can be run multiple times without changing the state of the object store.

Danger

This should not be run while other concurrent operations are in progress, and it should run to completion before other operations resume.

You can use Table.uses_v2_manifest_paths to check if the table is already using the new path style.

Source code in lancedb/table.py
@abstractmethod
def migrate_v2_manifest_paths(self):
    """
    Migrate the manifest paths to the new format.

    This will update the manifest to use the new v2 format for paths.

    This function is idempotent, and can be run multiple times without
    changing the state of the object store.

    !!! danger

        This should not be run while other concurrent operations are happening.
        And it should also run until completion before resuming other operations.

    You can use
    [Table.uses_v2_manifest_paths][lancedb.table.Table.uses_v2_manifest_paths]
    to check if the table is already using the new path style.
    """

Querying (Synchronous)

lancedb.query.Query

Bases: BaseModel

The LanceDB Query

Attributes:

  • vector (List[float]) –

    the vector to search for

  • filter (Optional[str]) –

    sql filter to refine the query with, optional

  • prefilter (bool) –

    if True then apply the filter before vector search

  • k (int) –

    top k results to return

  • metric (str) –

    the distance metric between a pair of vectors; supported values are L2 (default), Cosine, and Dot. See the metric definitions for details.

  • columns (Optional[List[str]]) –

    which columns to return in the results

  • nprobes (int) –

    The number of probes used (optional)

    • A higher number makes search more accurate but also slower.

    • See discussion in Querying an ANN Index for tuning advice.

  • refine_factor (Optional[int]) –

    Refine the results by reading extra elements and re-ranking them in memory.

    • A higher number makes search more accurate but also slower.

    • See discussion in Querying an ANN Index for tuning advice.

  • offset (int) –

    The offset to start fetching results from

  • fast_search (bool) –

    Skip a flat search of unindexed data. This will improve search performance but search results will not include unindexed data.

    • Defaults to False.

Source code in lancedb/query.py
class Query(pydantic.BaseModel):
    """The LanceDB Query

    Attributes
    ----------
    vector : List[float]
        the vector to search for
    filter : Optional[str]
        sql filter to refine the query with, optional
    prefilter : bool
        if True then apply the filter before vector search
    k : int
        top k results to return
    metric : str
        the distance metric between a pair of vectors,

        can support L2 (default), Cosine and Dot.
        [metric definitions][search]
    columns : Optional[List[str]]
        which columns to return in the results
    nprobes : int
        The number of probes used - optional

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.
    refine_factor : Optional[int]
        Refine the results by reading extra elements and re-ranking them in memory.

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.
    offset: int
        The offset to start fetching results from
    fast_search: bool
        Skip a flat search of unindexed data. This will improve
        search performance but search results will not include unindexed data.

        - *default False*.
    """

    vector_column: Optional[str] = None

    # vector to search for
    vector: Union[List[float], List[List[float]]]

    # sql filter to refine the query with
    filter: Optional[str] = None

    # if True then apply the filter before vector search
    prefilter: bool = False

    # full text search query
    full_text_query: Optional[Union[str, dict]] = None

    # top k results to return
    k: int

    # distance metric
    metric: str = "L2"

    # which columns to return in the results
    columns: Optional[Union[List[str], Dict[str, str]]] = None

    # optional query parameters for tuning the results,
    # e.g. `{"nprobes": "10", "refine_factor": "10"}`
    nprobes: int = 10

    lower_bound: Optional[float] = None
    upper_bound: Optional[float] = None

    # Refine factor.
    refine_factor: Optional[int] = None

    with_row_id: bool = False

    offset: int = 0

    fast_search: bool = False

    ef: Optional[int] = None

    # Default is True. Set to False to force a brute-force search.
    use_index: bool = True
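
Query objects are normally constructed for you by the query builders below, but as a pydantic model a Query can also be built directly. A minimal sketch (the field values are illustrative; "cosine" assumes lowercase metric names are accepted):

>>> from lancedb.query import Query
>>> q = Query(vector=[0.1, 0.2, 0.3], k=5, metric="cosine",
...           filter="price > 10", prefilter=True)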

lancedb.query.LanceQueryBuilder

Bases: ABC

An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.
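
In practice a builder is obtained from a table's search method and calls are chained. A hedged sketch (assumes a table tbl with a vector column plus name and price columns):

>>> df = (
...     tbl.search([0.1, 0.2])
...     .where("price > 10")
...     .select(["name", "price"])
...     .limit(5)
...     .to_pandas()
... )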

Source code in lancedb/query.py
class LanceQueryBuilder(ABC):
    """An abstract query builder. Subclasses are defined for vector search,
    full text search, hybrid, and plain SQL filtering.
    """

    @classmethod
    def create(
        cls,
        table: "Table",
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
        query_type: str,
        vector_column_name: str,
        ordering_field_name: Optional[str] = None,
        fts_columns: Union[str, List[str]] = [],
        fast_search: bool = False,
    ) -> LanceQueryBuilder:
        """
        Create a query builder based on the given query and query type.

        Parameters
        ----------
        table: Table
            The table to query.
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
            The query to use. If None, an empty query builder is returned
            which performs simple SQL filtering.
        query_type: str
            The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
            If "auto", the query type is inferred based on the query.
        vector_column_name: str
            The name of the vector column to use for vector search.
        fast_search: bool
            Skip flat search of unindexed data.
        """
        # Check hybrid search first as it supports empty query pattern
        if query_type == "hybrid":
            # hybrid fts and vector query
            return LanceHybridQueryBuilder(
                table, query, vector_column_name, fts_columns=fts_columns
            )

        if query is None:
            return LanceEmptyQueryBuilder(table)

        # remember the string query for reranking purpose
        str_query = query if isinstance(query, str) else None

        # convert "auto" query_type to "vector", "fts"
        # or "hybrid" and convert the query to vector if needed
        query, query_type = cls._resolve_query(
            table, query, query_type, vector_column_name
        )

        if query_type == "hybrid":
            return LanceHybridQueryBuilder(
                table, query, vector_column_name, fts_columns=fts_columns
            )

        if isinstance(query, str):
            # fts
            return LanceFtsQueryBuilder(
                table,
                query,
                ordering_field_name=ordering_field_name,
                fts_columns=fts_columns,
            )

        if isinstance(query, list):
            query = np.array(query, dtype=np.float32)
        elif isinstance(query, np.ndarray):
            query = query.astype(np.float32)
        else:
            raise TypeError(f"Unsupported query type: {type(query)}")

        return LanceVectorQueryBuilder(
            table, query, vector_column_name, str_query, fast_search
        )

    @classmethod
    def _resolve_query(cls, table, query, query_type, vector_column_name):
        # If query_type is fts, then query must be a string.
        # otherwise raise TypeError
        if query_type == "fts":
            if not isinstance(query, str):
                raise TypeError(f"'fts' queries must be a string: {type(query)}")
            return query, query_type
        elif query_type == "vector":
            query = cls._query_to_vector(table, query, vector_column_name)
            return query, query_type
        elif query_type == "auto":
            if isinstance(query, (list, np.ndarray)):
                return query, "vector"
            else:
                conf = table.embedding_functions.get(vector_column_name)
                if conf is not None:
                    query = conf.function.compute_query_embeddings_with_retry(query)[0]
                    return query, "vector"
                else:
                    return query, "fts"
        else:
            raise ValueError(
                f"Invalid query_type, must be 'vector', 'fts', or 'auto': {query_type}"
            )

    @classmethod
    def _query_to_vector(cls, table, query, vector_column_name):
        if isinstance(query, (list, np.ndarray)):
            return query
        conf = table.embedding_functions.get(vector_column_name)
        if conf is not None:
            return conf.function.compute_query_embeddings_with_retry(query)[0]
        else:
            msg = f"No embedding function for {vector_column_name}"
            raise ValueError(msg)

    def __init__(self, table: "Table"):
        self._table = table
        self._limit = 10
        self._offset = 0
        self._columns = None
        self._where = None
        self._prefilter = True
        self._with_row_id = False
        self._vector = None
        self._text = None
        self._ef = None
        self._use_index = True

    @deprecation.deprecated(
        deprecated_in="0.3.1",
        removed_in="0.4.0",
        current_version=__version__,
        details="Use to_pandas() instead",
    )
    def to_df(self) -> "pd.DataFrame":
        """
        *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        return self.to_pandas()

    def to_pandas(self, flatten: Optional[Union[int, bool]] = None) -> "pd.DataFrame":
        """
        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.

        Parameters
        ----------
        flatten: Optional[Union[int, bool]]
            If flatten is True, flatten all nested columns.
            If flatten is an integer, flatten the nested columns up to the
            specified depth.
            If unspecified, do not flatten the nested columns.
        """
        tbl = flatten_columns(self.to_arrow(), flatten)
        return tbl.to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.
        """
        raise NotImplementedError

    @abstractmethod
    def to_batches(self, /, batch_size: Optional[int] = None) -> pa.RecordBatchReader:
        """
        Execute the query and return the results as a pyarrow
        [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html)
        """
        raise NotImplementedError

    def to_list(self) -> List[dict]:
        """
        Execute the query and return the results as a list of dictionaries.

        Each list entry is a dictionary with the selected column names as keys,
        or all table columns if `select` is not called. The vector and the "_distance"
        fields are returned whether or not they're explicitly selected.
        """
        return self.to_arrow().to_pylist()

    def to_pydantic(self, model: Type[LanceModel]) -> List[LanceModel]:
        """Return the table as a list of pydantic models.

        Parameters
        ----------
        model: Type[LanceModel]
            The pydantic model to use.

        Returns
        -------
        List[LanceModel]
        """
        return [
            model(**{k: v for k, v in row.items() if k in model.field_names()})
            for row in self.to_arrow().to_pylist()
        ]

    def to_polars(self) -> "pl.DataFrame":
        """
        Execute the query and return the results as a Polars DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        import polars as pl

        return pl.from_arrow(self.to_arrow())

    def limit(self, limit: Union[int, None]) -> LanceQueryBuilder:
        """Set the maximum number of results to return.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.
            The default query limit is 10 results.
            For ANN/KNN queries, you must specify a limit.
            Entering 0, a negative number, or None will reset
            the limit to the default value of 10.
            *WARNING* if you have a large dataset, setting
            the limit to a large number, e.g. the table size,
            can potentially result in reading a
            large amount of data into memory and cause
            out of memory issues.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if limit is None or limit <= 0:
            if isinstance(self, LanceVectorQueryBuilder):
                raise ValueError("Limit is required for ANN/KNN queries")
            else:
                self._limit = None
        else:
            self._limit = limit
        return self

    def offset(self, offset: int) -> LanceQueryBuilder:
        """Set the offset for the results.

        Parameters
        ----------
        offset: int
            The offset to start fetching results from.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if offset is None or offset <= 0:
            self._offset = 0
        else:
            self._offset = offset
        return self

    def select(self, columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder:
        """Set the columns to return.

        Parameters
        ----------
        columns: list of str, or dict of str to str, default None
            List of column names to be fetched.
            Or a dictionary of column names to SQL expressions.
            All columns are fetched if None or unspecified.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if isinstance(columns, list) or isinstance(columns, dict):
            self._columns = columns
        else:
            raise ValueError("columns must be a list or a dictionary")
        return self

    def where(self, where: str, prefilter: bool = True) -> LanceQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default True
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.
            This feature is **EXPERIMENTAL** and may be removed and modified
            without warning in the future.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        self._prefilter = prefilter
        return self

    def with_row_id(self, with_row_id: bool) -> LanceQueryBuilder:
        """Set whether to return row ids.

        Parameters
        ----------
        with_row_id: bool
            If True, return _rowid column in the results.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._with_row_id = with_row_id
        return self

    def explain_plan(self, verbose: Optional[bool] = False) -> str:
        """Return the execution plan for this query.

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
        >>> query = [100, 100]
        >>> plan = table.search(query).explain_plan(True)
        >>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
        ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
        GlobalLimitExec: skip=0, fetch=10
          FilterExec: _distance@2 IS NOT NULL
            SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
              KNNVectorDistance: metric=l2
                LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false

        Parameters
        ----------
        verbose : bool, default False
            Use a verbose output format.

        Returns
        -------
        plan : str
        """  # noqa: E501
        ds = self._table.to_lance()
        return ds.scanner(
            nearest={
                "column": self._vector_column,
                "q": self._query,
                "k": self._limit,
                "metric": self._metric,
                "nprobes": self._nprobes,
                "refine_factor": self._refine_factor,
                "use_index": self._use_index,
            },
            prefilter=self._prefilter,
            filter=self._str_query,
            limit=self._limit,
            with_row_id=self._with_row_id,
            offset=self._offset,
        ).explain_plan(verbose)

    def vector(self, vector: Union[np.ndarray, list]) -> LanceQueryBuilder:
        """Set the vector to search for.

        Parameters
        ----------
        vector: np.ndarray or list
            The vector to search for.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        raise NotImplementedError

    def text(self, text: str) -> LanceQueryBuilder:
        """Set the text to search for.

        Parameters
        ----------
        text: str
            The text to search for.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        raise NotImplementedError

    @abstractmethod
    def rerank(self, reranker: Reranker) -> LanceQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------