Skip to content

Python API Reference

This section contains the API reference for the OSS Python API.

Installation

pip install lancedb

The following methods describe the synchronous API client. There is also an asynchronous API client.

Connections (Synchronous)

lancedb.connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, **kwargs) -> DBConnection

Connect to a LanceDB database.

Parameters:

Name Type Description Default
uri URI

The uri of the database.

required
api_key Optional[str]

If presented, connect to LanceDB cloud. Otherwise, connect to a database on file system or cloud storage. Can be set via environment variable LANCEDB_API_KEY.

None
region str

The region to use for LanceDB Cloud.

'us-east-1'
host_override Optional[str]

The override url for LanceDB Cloud.

None
read_consistency_interval Optional[timedelta]

(For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.

None
request_thread_pool Optional[Union[int, ThreadPoolExecutor]]

The thread pool to use for making batch requests to the LanceDB Cloud API. If an integer, then a ThreadPoolExecutor will be created with that number of threads. If None, then a ThreadPoolExecutor will be created with the default number of threads. If a ThreadPoolExecutor, then that executor will be used for making requests. This is for LanceDB Cloud only and is only used when making batch requests (i.e., passing in multiple queries to the search method at once).

None

Examples:

For a local directory, provide a path for the database:

>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")

For object storage, use a URI prefix:

>>> db = lancedb.connect("s3://my-bucket/lancedb")

Connect to LanceDB cloud:

>>> db = lancedb.connect("db://my_database", api_key="ldb_...")

Returns:

Name Type Description
conn DBConnection

A connection to a LanceDB database.

Source code in lancedb/__init__.py
def connect(
    uri: URI,
    *,
    api_key: Optional[str] = None,
    region: str = "us-east-1",
    host_override: Optional[str] = None,
    read_consistency_interval: Optional[timedelta] = None,
    request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
    **kwargs,
) -> DBConnection:
    """Connect to a LanceDB database.

    Parameters
    ----------
    uri: str or Path
        The uri of the database.
    api_key: str, optional
        If presented, connect to LanceDB cloud.
        Otherwise, connect to a database on file system or cloud storage.
        Can be set via environment variable `LANCEDB_API_KEY`.
    region: str, default "us-east-1"
        The region to use for LanceDB Cloud.
    host_override: str, optional
        The override url for LanceDB Cloud.
    read_consistency_interval: timedelta, default None
        (For LanceDB OSS only)
        The interval at which to check for updates to the table from other
        processes. If None, then consistency is not checked. For performance
        reasons, this is the default. For strong consistency, set this to
        zero seconds. Then every read will check for updates from other
        processes. As a compromise, you can set this to a non-zero timedelta
        for eventual consistency. If more than that interval has passed since
        the last check, then the table will be checked for updates. Note: this
        consistency only applies to read operations. Write operations are
        always consistent.
    request_thread_pool: int or ThreadPoolExecutor, optional
        The thread pool to use for making batch requests to the LanceDB Cloud API.
        If an integer, then a ThreadPoolExecutor will be created with that
        number of threads. If None, then a ThreadPoolExecutor will be created
        with the default number of threads. If a ThreadPoolExecutor, then that
        executor will be used for making requests. This is for LanceDB Cloud
        only and is only used when making batch requests (i.e., passing in
        multiple queries to the search method at once).

    Examples
    --------

    For a local directory, provide a path for the database:

    >>> import lancedb
    >>> db = lancedb.connect("~/.lancedb")

    For object storage, use a URI prefix:

    >>> db = lancedb.connect("s3://my-bucket/lancedb")

    Connect to LanceDB cloud:

    >>> db = lancedb.connect("db://my_database", api_key="ldb_...")

    Returns
    -------
    conn : DBConnection
        A connection to a LanceDB database.
    """
    if isinstance(uri, str) and uri.startswith("db://"):
        if api_key is None:
            api_key = os.environ.get("LANCEDB_API_KEY")
        if api_key is None:
            raise ValueError(f"api_key is required to connected LanceDB cloud: {uri}")
        if isinstance(request_thread_pool, int):
            request_thread_pool = ThreadPoolExecutor(request_thread_pool)
        return RemoteDBConnection(
            uri,
            api_key,
            region,
            host_override,
            request_thread_pool=request_thread_pool,
            **kwargs,
        )
    return LanceDBConnection(uri, read_consistency_interval=read_consistency_interval)

lancedb.db.DBConnection

Bases: EnforceOverrides

An active LanceDB connection interface.

Source code in lancedb/db.py
class DBConnection(EnforceOverrides):
    """An active LanceDB connection interface."""

    @abstractmethod
    def table_names(
        self, page_token: Optional[str] = None, limit: int = 10
    ) -> Iterable[str]:
        """List all tables in this database, in sorted order

        Parameters
        ----------
        page_token: str, optional
            The token to use for pagination. If not present, start from the beginning.
            Typically, this token is last table name from the previous page.
            Only supported by LanceDb Cloud.
        limit: int, default 10
            The size of the page to return.
            Only supported by LanceDb Cloud.

        Returns
        -------
        Iterable of str
        """
        pass

    @abstractmethod
    def create_table(
        self,
        name: str,
        data: Optional[DATA] = None,
        schema: Optional[Union[pa.Schema, LanceModel]] = None,
        mode: str = "create",
        exist_ok: bool = False,
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
        embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
    ) -> Table:
        """Create a [Table][lancedb.table.Table] in the database.

        Parameters
        ----------
        name: str
            The name of the table.
        data: The data to initialize the table, *optional*
            User must provide at least one of `data` or `schema`.
            Acceptable types are:

            - dict or list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        schema: The schema of the table, *optional*
            Acceptable types are:

            - pyarrow.Schema

            - [LanceModel][lancedb.pydantic.LanceModel]
        mode: str; default "create"
            The mode to use when creating the table.
            Can be either "create" or "overwrite".
            By default, if the table already exists, an exception is raised.
            If you want to overwrite the table, use mode="overwrite".
        exist_ok: bool, default False
            If a table by the same name already exists, then raise an exception
            if exist_ok=False. If exist_ok=True, then open the existing table;
            it will not add the provided data but will validate against any
            schema that's specified.
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        Returns
        -------
        LanceTable
            A reference to the newly created table.

        !!! note

            The vector index won't be created by default.
            To create the index, call the `create_index` method on the table.

        Examples
        --------

        Can create with list of tuples or dictionaries:

        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
        >>> db.create_table("my_table", data)
        LanceTable(connection=..., name="my_table")
        >>> db["my_table"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        You can also pass a pandas DataFrame:

        >>> import pandas as pd
        >>> data = pd.DataFrame({
        ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
        ...    "lat": [45.5, 40.1],
        ...    "long": [-122.7, -74.1]
        ... })
        >>> db.create_table("table2", data)
        LanceTable(connection=..., name="table2")
        >>> db["table2"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: double
        long: double
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]

        Data is converted to Arrow before being written to disk. For maximum
        control over how data is saved, either provide the PyArrow schema to
        convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

        >>> custom_schema = pa.schema([
        ...   pa.field("vector", pa.list_(pa.float32(), 2)),
        ...   pa.field("lat", pa.float32()),
        ...   pa.field("long", pa.float32())
        ... ])
        >>> db.create_table("table3", data, schema = custom_schema)
        LanceTable(connection=..., name="table3")
        >>> db["table3"].head()
        pyarrow.Table
        vector: fixed_size_list<item: float>[2]
          child 0, item: float
        lat: float
        long: float
        ----
        vector: [[[1.1,1.2],[0.2,1.8]]]
        lat: [[45.5,40.1]]
        long: [[-122.7,-74.1]]


        It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


        >>> import pyarrow as pa
        >>> def make_batches():
        ...     for i in range(5):
        ...         yield pa.RecordBatch.from_arrays(
        ...             [
        ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
        ...                     pa.list_(pa.float32(), 2)),
        ...                 pa.array(["foo", "bar"]),
        ...                 pa.array([10.0, 20.0]),
        ...             ],
        ...             ["vector", "item", "price"],
        ...         )
        >>> schema=pa.schema([
        ...     pa.field("vector", pa.list_(pa.float32(), 2)),
        ...     pa.field("item", pa.utf8()),
        ...     pa.field("price", pa.float32()),
        ... ])
        >>> db.create_table("table4", make_batches(), schema=schema)
        LanceTable(connection=..., name="table4")

        """
        raise NotImplementedError

    def __getitem__(self, name: str) -> LanceTable:
        return self.open_table(name)

    def open_table(self, name: str) -> Table:
        """Open a Lance Table in the database.

        Parameters
        ----------
        name: str
            The name of the table.

        Returns
        -------
        A LanceTable object representing the table.
        """
        raise NotImplementedError

    def drop_table(self, name: str):
        """Drop a table from the database.

        Parameters
        ----------
        name: str
            The name of the table.
        """
        raise NotImplementedError

    def drop_database(self):
        """
        Drop database
        This is the same thing as dropping all the tables
        """
        raise NotImplementedError

table_names(page_token: Optional[str] = None, limit: int = 10) -> Iterable[str] abstractmethod

List all tables in this database, in sorted order

Parameters:

Name Type Description Default
page_token Optional[str]

The token to use for pagination. If not present, start from the beginning. Typically, this token is last table name from the previous page. Only supported by LanceDb Cloud.

None
limit int

The size of the page to return. Only supported by LanceDb Cloud.

10

Returns:

Type Description
Iterable of str
Source code in lancedb/db.py
@abstractmethod
def table_names(
    self, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
    """List all tables in this database, in sorted order

    Parameters
    ----------
    page_token: str, optional
        The token to use for pagination. If not present, start from the beginning.
        Typically, this token is last table name from the previous page.
        Only supported by LanceDb Cloud.
    limit: int, default 10
        The size of the page to return.
        Only supported by LanceDb Cloud.

    Returns
    -------
    Iterable of str
    """
    pass

create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[pa.Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None) -> Table abstractmethod

Create a Table in the database.

Parameters:

Name Type Description Default
name str

The name of the table.

required
data Optional[DATA]

User must provide at least one of data or schema. Acceptable types are:

  • dict or list-of-dict

  • pandas.DataFrame

  • pyarrow.Table or pyarrow.RecordBatch

None
schema Optional[Union[Schema, LanceModel]]

Acceptable types are:

None
mode str

The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".

'create'
exist_ok bool

If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.

False
on_bad_vectors str

What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".

'error'
fill_value float

The value to use when filling vectors. Only used if on_bad_vectors="fill".

0.0

Returns:

Type Description
LanceTable

A reference to the newly created table.

!!! note

The vector index won't be created by default. To create the index, call the create_index method on the table.

Examples:

Can create with list of tuples or dictionaries:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
>>> db.create_table("my_table", data)
LanceTable(connection=..., name="my_table")
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

You can also pass a pandas DataFrame:

>>> import pandas as pd
>>> data = pd.DataFrame({
...    "vector": [[1.1, 1.2], [0.2, 1.8]],
...    "lat": [45.5, 40.1],
...    "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(connection=..., name="table2")
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.

>>> custom_schema = pa.schema([
...   pa.field("vector", pa.list_(pa.float32(), 2)),
...   pa.field("lat", pa.float32()),
...   pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(connection=..., name="table3")
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

It is also possible to create an table from [Iterable[pa.RecordBatch]]:

>>> import pyarrow as pa
>>> def make_batches():
...     for i in range(5):
...         yield pa.RecordBatch.from_arrays(
...             [
...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
...                     pa.list_(pa.float32(), 2)),
...                 pa.array(["foo", "bar"]),
...                 pa.array([10.0, 20.0]),
...             ],
...             ["vector", "item", "price"],
...         )
>>> schema=pa.schema([
...     pa.field("vector", pa.list_(pa.float32(), 2)),
...     pa.field("item", pa.utf8()),
...     pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(connection=..., name="table4")
Source code in lancedb/db.py
@abstractmethod
def create_table(
    self,
    name: str,
    data: Optional[DATA] = None,
    schema: Optional[Union[pa.Schema, LanceModel]] = None,
    mode: str = "create",
    exist_ok: bool = False,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
    embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
) -> Table:
    """Create a [Table][lancedb.table.Table] in the database.

    Parameters
    ----------
    name: str
        The name of the table.
    data: The data to initialize the table, *optional*
        User must provide at least one of `data` or `schema`.
        Acceptable types are:

        - dict or list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    schema: The schema of the table, *optional*
        Acceptable types are:

        - pyarrow.Schema

        - [LanceModel][lancedb.pydantic.LanceModel]
    mode: str; default "create"
        The mode to use when creating the table.
        Can be either "create" or "overwrite".
        By default, if the table already exists, an exception is raised.
        If you want to overwrite the table, use mode="overwrite".
    exist_ok: bool, default False
        If a table by the same name already exists, then raise an exception
        if exist_ok=False. If exist_ok=True, then open the existing table;
        it will not add the provided data but will validate against any
        schema that's specified.
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    Returns
    -------
    LanceTable
        A reference to the newly created table.

    !!! note

        The vector index won't be created by default.
        To create the index, call the `create_index` method on the table.

    Examples
    --------

    Can create with list of tuples or dictionaries:

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
    ...         {"vector": [0.2, 1.8], "lat": 40.1, "long":  -74.1}]
    >>> db.create_table("my_table", data)
    LanceTable(connection=..., name="my_table")
    >>> db["my_table"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    You can also pass a pandas DataFrame:

    >>> import pandas as pd
    >>> data = pd.DataFrame({
    ...    "vector": [[1.1, 1.2], [0.2, 1.8]],
    ...    "lat": [45.5, 40.1],
    ...    "long": [-122.7, -74.1]
    ... })
    >>> db.create_table("table2", data)
    LanceTable(connection=..., name="table2")
    >>> db["table2"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: double
    long: double
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]

    Data is converted to Arrow before being written to disk. For maximum
    control over how data is saved, either provide the PyArrow schema to
    convert to or else provide a [PyArrow Table](pyarrow.Table) directly.

    >>> custom_schema = pa.schema([
    ...   pa.field("vector", pa.list_(pa.float32(), 2)),
    ...   pa.field("lat", pa.float32()),
    ...   pa.field("long", pa.float32())
    ... ])
    >>> db.create_table("table3", data, schema = custom_schema)
    LanceTable(connection=..., name="table3")
    >>> db["table3"].head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    lat: float
    long: float
    ----
    vector: [[[1.1,1.2],[0.2,1.8]]]
    lat: [[45.5,40.1]]
    long: [[-122.7,-74.1]]


    It is also possible to create an table from `[Iterable[pa.RecordBatch]]`:


    >>> import pyarrow as pa
    >>> def make_batches():
    ...     for i in range(5):
    ...         yield pa.RecordBatch.from_arrays(
    ...             [
    ...                 pa.array([[3.1, 4.1], [5.9, 26.5]],
    ...                     pa.list_(pa.float32(), 2)),
    ...                 pa.array(["foo", "bar"]),
    ...                 pa.array([10.0, 20.0]),
    ...             ],
    ...             ["vector", "item", "price"],
    ...         )
    >>> schema=pa.schema([
    ...     pa.field("vector", pa.list_(pa.float32(), 2)),
    ...     pa.field("item", pa.utf8()),
    ...     pa.field("price", pa.float32()),
    ... ])
    >>> db.create_table("table4", make_batches(), schema=schema)
    LanceTable(connection=..., name="table4")

    """
    raise NotImplementedError

open_table(name: str) -> Table

Open a Lance Table in the database.

Parameters:

Name Type Description Default
name str

The name of the table.

required

Returns:

Type Description
A LanceTable object representing the table.
Source code in lancedb/db.py
def open_table(self, name: str) -> Table:
    """Open a Lance Table in the database.

    Parameters
    ----------
    name: str
        The name of the table.

    Returns
    -------
    A LanceTable object representing the table.
    """
    raise NotImplementedError

drop_table(name: str)

Drop a table from the database.

Parameters:

Name Type Description Default
name str

The name of the table.

required
Source code in lancedb/db.py
def drop_table(self, name: str):
    """Drop a table from the database.

    Parameters
    ----------
    name: str
        The name of the table.
    """
    raise NotImplementedError

drop_database()

Drop database This is the same thing as dropping all the tables

Source code in lancedb/db.py
def drop_database(self):
    """
    Drop database
    This is the same thing as dropping all the tables
    """
    raise NotImplementedError

Tables (Synchronous)

lancedb.table.Table

Bases: ABC

A Table is a collection of Records in a LanceDB Database.

Examples:

Create using DBConnection.create_table (more examples in that method's documentation).

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]

Can append new data with Table.add().

>>> table.add([{"vector": [0.5, 1.3], "b": 4}])

Can query the table with Table.search.

>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
   b      vector  _distance
0  4  [0.5, 1.3]       0.82
1  2  [1.1, 1.2]       1.13

Search queries are much faster when an index is created. See Table.create_index.

Source code in lancedb/table.py
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
class Table(ABC):
    """
    A Table is a collection of Records in a LanceDB Database.

    Examples
    --------

    Create using [DBConnection.create_table][lancedb.DBConnection.create_table]
    (more examples in that method's documentation).

    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
    >>> table.head()
    pyarrow.Table
    vector: fixed_size_list<item: float>[2]
      child 0, item: float
    b: int64
    ----
    vector: [[[1.1,1.2]]]
    b: [[2]]

    Can append new data with [Table.add()][lancedb.table.Table.add].

    >>> table.add([{"vector": [0.5, 1.3], "b": 4}])

    Can query the table with [Table.search][lancedb.table.Table.search].

    >>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
       b      vector  _distance
    0  4  [0.5, 1.3]       0.82
    1  2  [1.1, 1.2]       1.13

    Search queries are much faster when an index is created. See
    [Table.create_index][lancedb.table.Table.create_index].
    """

    @property
    @abstractmethod
    def schema(self) -> pa.Schema:
        """The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
        of this Table

        """
        raise NotImplementedError

    @abstractmethod
    def count_rows(self, filter: Optional[str] = None) -> int:
        """
        Count the number of rows in the table.

        Parameters
        ----------
        filter: str, optional
            A SQL where clause to filter the rows to count.
        """
        raise NotImplementedError

    def to_pandas(self) -> "pd.DataFrame":
        """Return the table as a pandas DataFrame.

        Returns
        -------
        pd.DataFrame
        """
        return self.to_arrow().to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """Return the table as a pyarrow Table.

        Returns
        -------
        pa.Table
        """
        raise NotImplementedError

    def create_index(
        self,
        metric="L2",
        num_partitions=256,
        num_sub_vectors=96,
        vector_column_name: str = VECTOR_COLUMN_NAME,
        replace: bool = True,
        accelerator: Optional[str] = None,
        index_cache_size: Optional[int] = None,
    ):
        """Create an index on the table.

        Parameters
        ----------
        metric: str, default "L2"
            The distance metric to use when creating the index.
            Valid values are "L2", "cosine", or "dot".
            L2 is euclidean distance.
        num_partitions: int, default 256
            The number of IVF partitions to use when creating the index.
            Default is 256.
        num_sub_vectors: int, default 96
            The number of PQ sub-vectors to use when creating the index.
            Default is 96.
        vector_column_name: str, default "vector"
            The vector column name to create the index.
        replace: bool, default True
            - If True, replace the existing index if it exists.

            - If False, raise an error if duplicate index exists.
        accelerator: str, default None
            If set, use the given accelerator to create the index.
            Only support "cuda" for now.
        index_cache_size : int, optional
            The size of the index cache in number of entries. Default value is 256.
        """
        raise NotImplementedError

    @abstractmethod
    def create_scalar_index(
        self,
        column: str,
        *,
        replace: bool = True,
    ):
        """Create a scalar index on a column.

        Scalar indices, like vector indices, can be used to speed up scans.  A scalar
        index can speed up scans that contain filter expressions on the indexed column.
        For example, the following scan will be faster if the column ``my_col`` has
        a scalar index:

        .. code-block:: python

            import lancedb

            db = lancedb.connect("/data/lance")
            img_table = db.open_table("images")
            my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()

        Scalar indices can also speed up scans containing a vector search and a
        prefilter:

        .. code-block::python

            import lancedb

            db = lancedb.connect("/data/lance")
            img_table = db.open_table("images")
            img_table.search([1, 2, 3, 4], vector_column_name="vector")
                .where("my_col != 7", prefilter=True)
                .to_pandas()

        Scalar indices can only speed up scans for basic filters using
        equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
        membership (e.g. `my_col IN (0, 1, 2)`)

        Scalar indices can be used if the filter contains multiple indexed columns and
        the filter criteria are AND'd or OR'd together
        (e.g. ``my_col < 0 AND other_col> 100``)

        Scalar indices may be used if the filter contains non-indexed columns but,
        depending on the structure of the filter, they may not be usable.  For example,
        if the column ``not_indexed`` does not have a scalar index then the filter
        ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
        ``my_col``.

        **Experimental API**

        Parameters
        ----------
        column : str
            The column to be indexed.  Must be a boolean, integer, float,
            or string column.
        replace : bool, default True
            Replace the existing index if it exists.

        Examples
        --------

        .. code-block:: python

            import lance

            dataset = lance.dataset("./images.lance")
            dataset.create_scalar_index("category")
        """
        raise NotImplementedError

    @abstractmethod
    def add(
        self,
        data: DATA,
        mode: str = "append",
        on_bad_vectors: str = "error",
        fill_value: float = 0.0,
    ):
        """Add more data to the [Table](Table).

        Parameters
        ----------
        data: DATA
            The data to insert into the table. Acceptable types are:

            - dict or list-of-dict

            - pandas.DataFrame

            - pyarrow.Table or pyarrow.RecordBatch
        mode: str
            The mode to use when writing the data. Valid values are
            "append" and "overwrite".
        on_bad_vectors: str, default "error"
            What to do if any of the vectors are not the same size or contains NaNs.
            One of "error", "drop", "fill".
        fill_value: float, default 0.
            The value to use when filling vectors. Only used if on_bad_vectors="fill".

        """
        raise NotImplementedError

    def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
        """
        Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
        that can be used to create a "merge insert" operation

        This operation can add rows, update rows, and remove rows all in a single
        transaction. It is a very generic tool that can be used to create
        behaviors like "insert if not exists", "update or insert (i.e. upsert)",
        or even replace a portion of existing data with new data (e.g. replace
        all data where month="january")

        The merge insert operation works by combining new data from a
        **source table** with existing data in a **target table** by using a
        join.  There are three categories of records.

        "Matched" records are records that exist in both the source table and
        the target table. "Not matched" records exist only in the source table
        (e.g. these are new data) "Not matched by source" records exist only
        in the target table (this is old data)

        The builder returned by this method can be used to customize what
        should happen for each category of data.

        Please note that the data may appear to be reordered as part of this
        operation.  This is because updated rows will be deleted from the
        dataset and then reinserted at the end with the new values.

        Parameters
        ----------

        on: Union[str, Iterable[str]]
            A column (or columns) to join on.  This is how records from the
            source table and target table are matched.  Typically this is some
            kind of key or id column.

        Examples
        --------
        >>> import lancedb
        >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
        >>> # Perform a "upsert" operation
        >>> table.merge_insert("a")             \\
        ...      .when_matched_update_all()     \\
        ...      .when_not_matched_insert_all() \\
        ...      .execute(new_data)
        >>> # The order of new rows is non-deterministic since we use
        >>> # a hash-join as part of this operation and so we sort here
        >>> table.to_arrow().sort_by("a").to_pandas()
           a  b
        0  1  b
        1  2  x
        2  3  y
        3  4  z
        """
        on = [on] if isinstance(on, str) else list(on.iter())

        return LanceMergeInsertBuilder(self, on)

    @abstractmethod
    def search(
        self,
        query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
        vector_column_name: Optional[str] = None,
        query_type: str = "auto",
    ) -> LanceQueryBuilder:
        """Create a search query to find the nearest neighbors
        of the given query vector. We currently support [vector search][search]
        and [full-text search][experimental-full-text-search].

        All query options are defined in [Query][lancedb.query.Query].

        Examples
        --------
        >>> import lancedb
        >>> db = lancedb.connect("./.lancedb")
        >>> data = [
        ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
        ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
        ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
        ... ]
        >>> table = db.create_table("my_table", data)
        >>> query = [0.4, 1.4, 2.4]
        >>> (table.search(query)
        ...     .where("original_width > 1000", prefilter=True)
        ...     .select(["caption", "original_width", "vector"])
        ...     .limit(2)
        ...     .to_pandas())
          caption  original_width           vector  _distance
        0     foo            2000  [0.5, 3.4, 1.3]   5.220000
        1    test            3000  [0.3, 6.2, 2.6]  23.089996

        Parameters
        ----------
        query: list/np.ndarray/str/PIL.Image.Image, default None
            The targetted vector to search for.

            - *default None*.
            Acceptable types are: list, np.ndarray, PIL.Image.Image

            - If None then the select/where/limit clauses are applied to filter
            the table
        vector_column_name: str, optional
            The name of the vector column to search.

            The vector column needs to be a pyarrow fixed size list type

            - If not specified then the vector column is inferred from
            the table schema

            - If the table has multiple vector columns then the *vector_column_name*
            needs to be specified. Otherwise, an error is raised.
        query_type: str
            *default "auto"*.
            Acceptable types are: "vector", "fts", "hybrid", or "auto"

            - If "auto" then the query type is inferred from the query;

                - If `query` is a list/np.ndarray then the query type is
                "vector";

                - If `query` is a PIL.Image.Image then either do vector search,
                or raise an error if no corresponding embedding function is found.

            - If `query` is a string, then the query type is "vector" if the
            table has embedding functions else the query type is "fts"

        Returns
        -------
        LanceQueryBuilder
            A query builder object representing the query.
            Once executed, the query returns

            - selected columns

            - the vector

            - and also the "_distance" column which is the distance between the query
            vector and the returned vector.
        """
        raise NotImplementedError

    @abstractmethod
    def _execute_query(
        self, query: Query, batch_size: Optional[int] = None
    ) -> pa.RecordBatchReader:
        pass

    @abstractmethod
    def _do_merge(
        self,
        merge: LanceMergeInsertBuilder,
        new_data: DATA,
        on_bad_vectors: str,
        fill_value: float,
    ):
        pass

    @abstractmethod
    def delete(self, where: str):
        """Delete rows from the table.

        This can be used to delete a single row, many rows, all rows, or
        sometimes no rows (if your predicate matches nothing).

        Parameters
        ----------
        where: str
            The SQL where clause to use when deleting rows.

            - For example, 'x = 2' or 'x IN (1, 2, 3)'.

            The filter must not be empty, or it will error.

        Examples
        --------
        >>> import lancedb
        >>> data = [
        ...    {"x": 1, "vector": [1, 2]},
        ...    {"x": 2, "vector": [3, 4]},
        ...    {"x": 3, "vector": [5, 6]}
        ... ]
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.delete("x = 2")
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  3  [5.0, 6.0]

        If you have a list of values to delete, you can combine them into a
        stringified list and use the `IN` operator:

        >>> to_remove = [1, 5]
        >>> to_remove = ", ".join([str(v) for v in to_remove])
        >>> to_remove
        '1, 5'
        >>> table.delete(f"x IN ({to_remove})")
        >>> table.to_pandas()
           x      vector
        0  3  [5.0, 6.0]
        """
        raise NotImplementedError

    @abstractmethod
    def update(
        self,
        where: Optional[str] = None,
        values: Optional[dict] = None,
        *,
        values_sql: Optional[Dict[str, str]] = None,
    ):
        """
        This can be used to update zero to all rows depending on how many
        rows match the where clause. If no where clause is provided, then
        all rows will be updated.

        Either `values` or `values_sql` must be provided. You cannot provide
        both.

        Parameters
        ----------
        where: str, optional
            The SQL where clause to use when updating rows. For example, 'x = 2'
            or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
        values: dict, optional
            The values to update. The keys are the column names and the values
            are the values to set.
        values_sql: dict, optional
            The values to update, expressed as SQL expression strings. These can
            reference existing columns. For example, {"x": "x + 1"} will increment
            the x column by 1.

        Examples
        --------
        >>> import lancedb
        >>> import pandas as pd
        >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
        >>> db = lancedb.connect("./.lancedb")
        >>> table = db.create_table("my_table", data)
        >>> table.to_pandas()
           x      vector
        0  1  [1.0, 2.0]
        1  2  [3.0, 4.0]
        2  3  [5.0, 6.0]
        >>> table.update(where="x = 2", values={"vector": [10, 10]})
        >>> table.to_pandas()
           x        vector
        0  1    [1.0, 2.0]
        1  3    [5.0, 6.0]
        2  2  [10.0, 10.0]
        >>> table.update(values_sql={"x": "x + 1"})
        >>> table.to_pandas()
           x        vector
        0  2    [1.0, 2.0]
        1  4    [5.0, 6.0]
        2  3  [10.0, 10.0]
        """
        raise NotImplementedError

    @abstractmethod
    def cleanup_old_versions(
        self,
        older_than: Optional[timedelta] = None,
        *,
        delete_unverified: bool = False,
    ) -> CleanupStats:
        """
        Clean up old versions of the table, freeing disk space.

        Note: This function is not available in LanceDb Cloud (since LanceDb
        Cloud manages cleanup for you automatically)

        Parameters
        ----------
        older_than: timedelta, default None
            The minimum age of the version to delete. If None, then this defaults
            to two weeks.
        delete_unverified: bool, default False
            Because they may be part of an in-progress transaction, files newer
            than 7 days old are not deleted by default. If you are sure that
            there are no in-progress transactions, then you can set this to True
            to delete all files older than `older_than`.

        Returns
        -------
        CleanupStats
            The stats of the cleanup operation, including how many bytes were
            freed.
        """

    @abstractmethod
    def compact_files(self, *args, **kwargs):
        """
        Run the compaction process on the table.

        Note: This function is not available in LanceDb Cloud (since LanceDb
        Cloud manages compaction for you automatically)

        This can be run after making several small appends to optimize the table
        for faster reads.

        Arguments are passed onto :meth:`lance.dataset.DatasetOptimizer.compact_files`.
        For most cases, the default should be fine.
        """

    @abstractmethod
    def add_columns(self, transforms: Dict[str, str]):
        """
        Add new columns with defined values.

        This is not yet available in LanceDB Cloud.

        Parameters
        ----------
        transforms: Dict[str, str]
            A map of column name to a SQL expression to use to calculate the
            value of the new column. These expressions will be evaluated for
            each row in the table, and can reference existing columns.
        """

    @abstractmethod
    def alter_columns(self, alterations: Iterable[Dict[str, str]]):
        """
        Alter column names and nullability.

        This is not yet available in LanceDB Cloud.

        alterations : Iterable[Dict[str, Any]]
            A sequence of dictionaries, each with the following keys:
            - "path": str
                The column path to alter. For a top-level column, this is the name.
                For a nested column, this is the dot-separated path, e.g. "a.b.c".
            - "name": str, optional
                The new name of the column. If not specified, the column name is
                not changed.
            - "nullable": bool, optional
                Whether the column should be nullable. If not specified, the column
                nullability is not changed. Only non-nullable columns can be changed
                to nullable. Currently, you cannot change a nullable column to
                non-nullable.
        """

    @abstractmethod
    def drop_columns(self, columns: Iterable[str]):
        """
        Drop columns from the table.

        This is not yet available in LanceDB Cloud.

        Parameters
        ----------
        columns : Iterable[str]
            The names of the columns to drop.
        """

schema: pa.Schema abstractmethod property

The Arrow Schema of this Table

count_rows(filter: Optional[str] = None) -> int abstractmethod

Count the number of rows in the table.

Parameters:

Name Type Description Default
filter Optional[str]

A SQL where clause to filter the rows to count.

None
Source code in lancedb/table.py
@abstractmethod
def count_rows(self, filter: Optional[str] = None) -> int:
    """
    Count the number of rows in the table.

    Parameters
    ----------
    filter: str, optional
        A SQL where clause to filter the rows to count.
    """
    raise NotImplementedError

to_pandas() -> 'pd.DataFrame'

Return the table as a pandas DataFrame.

Returns:

Type Description
DataFrame
Source code in lancedb/table.py
def to_pandas(self) -> "pd.DataFrame":
    """Return the table as a pandas DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    return self.to_arrow().to_pandas()

to_arrow() -> pa.Table abstractmethod

Return the table as a pyarrow Table.

Returns:

Type Description
Table
Source code in lancedb/table.py
@abstractmethod
def to_arrow(self) -> pa.Table:
    """Return the table as a pyarrow Table.

    Returns
    -------
    pa.Table
    """
    raise NotImplementedError

create_index(metric='L2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None)

Create an index on the table.

Parameters:

Name Type Description Default
metric

The distance metric to use when creating the index. Valid values are "L2", "cosine", or "dot". L2 is euclidean distance.

'L2'
num_partitions

The number of IVF partitions to use when creating the index. Default is 256.

256
num_sub_vectors

The number of PQ sub-vectors to use when creating the index. Default is 96.

96
vector_column_name str

The vector column name to create the index.

VECTOR_COLUMN_NAME
replace bool
  • If True, replace the existing index if it exists.

  • If False, raise an error if duplicate index exists.

True
accelerator Optional[str]

If set, use the given accelerator to create the index. Only support "cuda" for now.

None
index_cache_size int

The size of the index cache in number of entries. Default value is 256.

None
Source code in lancedb/table.py
def create_index(
    self,
    metric="L2",
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name: str = VECTOR_COLUMN_NAME,
    replace: bool = True,
    accelerator: Optional[str] = None,
    index_cache_size: Optional[int] = None,
):
    """Create an index on the table.

    Parameters
    ----------
    metric: str, default "L2"
        The distance metric to use when creating the index.
        Valid values are "L2", "cosine", or "dot".
        L2 is euclidean distance.
    num_partitions: int, default 256
        The number of IVF partitions to use when creating the index.
        Default is 256.
    num_sub_vectors: int, default 96
        The number of PQ sub-vectors to use when creating the index.
        Default is 96.
    vector_column_name: str, default "vector"
        The vector column name to create the index.
    replace: bool, default True
        - If True, replace the existing index if it exists.

        - If False, raise an error if duplicate index exists.
    accelerator: str, default None
        If set, use the given accelerator to create the index.
        Only support "cuda" for now.
    index_cache_size : int, optional
        The size of the index cache in number of entries. Default value is 256.
    """
    raise NotImplementedError

create_scalar_index(column: str, *, replace: bool = True) abstractmethod

Create a scalar index on a column.

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

.. code-block:: python

import lancedb

db = lancedb.connect("/data/lance")
img_table = db.open_table("images")
my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()

Scalar indices can also speed up scans containing a vector search and a prefilter:

.. code-block::python

import lancedb

db = lancedb.connect("/data/lance")
img_table = db.open_table("images")
img_table.search([1, 2, 3, 4], vector_column_name="vector")
    .where("my_col != 7", prefilter=True)
    .to_pandas()

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2))

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col> 100)

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.

Experimental API

Parameters:

Name Type Description Default
column str

The column to be indexed. Must be a boolean, integer, float, or string column.

required
replace bool

Replace the existing index if it exists.

True

Examples:

.. code-block:: python

import lance

dataset = lance.dataset("./images.lance")
dataset.create_scalar_index("category")
Source code in lancedb/table.py
@abstractmethod
def create_scalar_index(
    self,
    column: str,
    *,
    replace: bool = True,
):
    """Create a scalar index on a column.

    Scalar indices, like vector indices, can be used to speed up scans.  A scalar
    index can speed up scans that contain filter expressions on the indexed column.
    For example, the following scan will be faster if the column ``my_col`` has
    a scalar index:

    .. code-block:: python

        import lancedb

        db = lancedb.connect("/data/lance")
        img_table = db.open_table("images")
        my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()

    Scalar indices can also speed up scans containing a vector search and a
    prefilter:

    .. code-block::python

        import lancedb

        db = lancedb.connect("/data/lance")
        img_table = db.open_table("images")
        img_table.search([1, 2, 3, 4], vector_column_name="vector")
            .where("my_col != 7", prefilter=True)
            .to_pandas()

    Scalar indices can only speed up scans for basic filters using
    equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
    membership (e.g. `my_col IN (0, 1, 2)`)

    Scalar indices can be used if the filter contains multiple indexed columns and
    the filter criteria are AND'd or OR'd together
    (e.g. ``my_col < 0 AND other_col> 100``)

    Scalar indices may be used if the filter contains non-indexed columns but,
    depending on the structure of the filter, they may not be usable.  For example,
    if the column ``not_indexed`` does not have a scalar index then the filter
    ``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
    ``my_col``.

    **Experimental API**

    Parameters
    ----------
    column : str
        The column to be indexed.  Must be a boolean, integer, float,
        or string column.
    replace : bool, default True
        Replace the existing index if it exists.

    Examples
    --------

    .. code-block:: python

        import lance

        dataset = lance.dataset("./images.lance")
        dataset.create_scalar_index("category")
    """
    raise NotImplementedError

add(data: DATA, mode: str = 'append', on_bad_vectors: str = 'error', fill_value: float = 0.0) abstractmethod

Add more data to the Table.

Parameters:

Name Type Description Default
data DATA

The data to insert into the table. Acceptable types are:

  • dict or list-of-dict

  • pandas.DataFrame

  • pyarrow.Table or pyarrow.RecordBatch

required
mode str

The mode to use when writing the data. Valid values are "append" and "overwrite".

'append'
on_bad_vectors str

What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill".

'error'
fill_value float

The value to use when filling vectors. Only used if on_bad_vectors="fill".

0.0
Source code in lancedb/table.py
@abstractmethod
def add(
    self,
    data: DATA,
    mode: str = "append",
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
):
    """Add more data to the [Table](Table).

    Parameters
    ----------
    data: DATA
        The data to insert into the table. Acceptable types are:

        - dict or list-of-dict

        - pandas.DataFrame

        - pyarrow.Table or pyarrow.RecordBatch
    mode: str
        The mode to use when writing the data. Valid values are
        "append" and "overwrite".
    on_bad_vectors: str, default "error"
        What to do if any of the vectors are not the same size or contains NaNs.
        One of "error", "drop", "fill".
    fill_value: float, default 0.
        The value to use when filling vectors. Only used if on_bad_vectors="fill".

    """
    raise NotImplementedError

merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder

Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation

This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january")

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

"Matched" records are records that exist in both the source table and the target table. "Not matched" records exist only in the source table (e.g. these are new data) "Not matched by source" records exist only in the target table (this is old data)

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.

Parameters:

Name Type Description Default
on Union[str, Iterable[str]]

A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.

required

Examples:

>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform a "upsert" operation
>>> table.merge_insert("a")             \
...      .when_matched_update_all()     \
...      .when_not_matched_insert_all() \
...      .execute(new_data)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z
Source code in lancedb/table.py
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
    """
    Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
    that can be used to create a "merge insert" operation

    This operation can add rows, update rows, and remove rows all in a single
    transaction. It is a very generic tool that can be used to create
    behaviors like "insert if not exists", "update or insert (i.e. upsert)",
    or even replace a portion of existing data with new data (e.g. replace
    all data where month="january")

    The merge insert operation works by combining new data from a
    **source table** with existing data in a **target table** by using a
    join.  There are three categories of records.

    "Matched" records are records that exist in both the source table and
    the target table. "Not matched" records exist only in the source table
    (e.g. these are new data) "Not matched by source" records exist only
    in the target table (this is old data)

    The builder returned by this method can be used to customize what
    should happen for each category of data.

    Please note that the data may appear to be reordered as part of this
    operation.  This is because updated rows will be deleted from the
    dataset and then reinserted at the end with the new values.

    Parameters
    ----------

    on: Union[str, Iterable[str]]
        A column (or columns) to join on.  This is how records from the
        source table and target table are matched.  Typically this is some
        kind of key or id column.

    Examples
    --------
    >>> import lancedb
    >>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
    >>> # Perform a "upsert" operation
    >>> table.merge_insert("a")             \\
    ...      .when_matched_update_all()     \\
    ...      .when_not_matched_insert_all() \\
    ...      .execute(new_data)
    >>> # The order of new rows is non-deterministic since we use
    >>> # a hash-join as part of this operation and so we sort here
    >>> table.to_arrow().sort_by("a").to_pandas()
       a  b
    0  1  b
    1  2  x
    2  3  y
    3  4  z
    """
    on = [on] if isinstance(on, str) else list(on.iter())

    return LanceMergeInsertBuilder(self, on)

search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]] = None, vector_column_name: Optional[str] = None, query_type: str = 'auto') -> LanceQueryBuilder abstractmethod

Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and [full-text search][experimental-full-text-search].

All query options are defined in Query.

Examples:

>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
...     .where("original_width > 1000", prefilter=True)
...     .select(["caption", "original_width", "vector"])
...     .limit(2)
...     .to_pandas())
  caption  original_width           vector  _distance
0     foo            2000  [0.5, 3.4, 1.3]   5.220000
1    test            3000  [0.3, 6.2, 2.6]  23.089996

Parameters:

Name Type Description Default
query Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]]

The targetted vector to search for.

  • default None. Acceptable types are: list, np.ndarray, PIL.Image.Image

  • If None then the select/where/limit clauses are applied to filter the table

None
vector_column_name Optional[str]

The name of the vector column to search.

The vector column needs to be a pyarrow fixed size list type

  • If not specified then the vector column is inferred from the table schema

  • If the table has multiple vector columns then the vector_column_name needs to be specified. Otherwise, an error is raised.

None
query_type str

default "auto". Acceptable types are: "vector", "fts", "hybrid", or "auto"

  • If "auto" then the query type is inferred from the query;

    • If query is a list/np.ndarray then the query type is "vector";

    • If query is a PIL.Image.Image then either do vector search, or raise an error if no corresponding embedding function is found.

  • If query is a string, then the query type is "vector" if the table has embedding functions else the query type is "fts"

'auto'

Returns:

Type Description
LanceQueryBuilder

A query builder object representing the query. Once executed, the query returns

  • selected columns

  • the vector

  • and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/table.py
@abstractmethod
def search(
    self,
    query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
    vector_column_name: Optional[str] = None,
    query_type: str = "auto",
) -> LanceQueryBuilder:
    """Create a search query to find the nearest neighbors
    of the given query vector. We currently support [vector search][search]
    and [full-text search][experimental-full-text-search].

    All query options are defined in [Query][lancedb.query.Query].

    Examples
    --------
    >>> import lancedb
    >>> db = lancedb.connect("./.lancedb")
    >>> data = [
    ...    {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
    ...    {"original_width": 2000, "caption": "foo",  "vector": [0.5, 3.4, 1.3]},
    ...    {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
    ... ]
    >>> table = db.create_table("my_table", data)
    >>> query = [0.4, 1.4, 2.4]
    >>> (table.search(query)
    ...     .where("original_width > 1000", prefilter=True)
    ...     .select(["caption", "original_width", "vector"])
    ...     .limit(2)
    ...     .to_pandas())
      caption  original_width           vector  _distance
    0     foo            2000  [0.5, 3.4, 1.3]   5.220000
    1    test            3000  [0.3, 6.2, 2.6]  23.089996

    Parameters
    ----------
    query: list/np.ndarray/str/PIL.Image.Image, default None
        The targetted vector to search for.

        - *default None*.
        Acceptable types are: list, np.ndarray, PIL.Image.Image

        - If None then the select/where/limit clauses are applied to filter
        the table
    vector_column_name: str, optional
        The name of the vector column to search.

        The vector column needs to be a pyarrow fixed size list type

        - If not specified then the vector column is inferred from
        the table schema

        - If the table has multiple vector columns then the *vector_column_name*
        needs to be specified. Otherwise, an error is raised.
    query_type: str
        *default "auto"*.
        Acceptable types are: "vector", "fts", "hybrid", or "auto"

        - If "auto" then the query type is inferred from the query;

            - If `query` is a list/np.ndarray then the query type is
            "vector";

            - If `query` is a PIL.Image.Image then either do vector search,
            or raise an error if no corresponding embedding function is found.

        - If `query` is a string, then the query type is "vector" if the
        table has embedding functions else the query type is "fts"

    Returns
    -------
    LanceQueryBuilder
        A query builder object representing the query.
        Once executed, the query returns

        - selected columns

        - the vector

        - and also the "_distance" column which is the distance between the query
        vector and the returned vector.
    """
    raise NotImplementedError

delete(where: str) abstractmethod

Delete rows from the table.

This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).

Parameters:

Name Type Description Default
where str

The SQL where clause to use when deleting rows.

  • For example, 'x = 2' or 'x IN (1, 2, 3)'.

The filter must not be empty, or it will error.

required

Examples:

>>> import lancedb
>>> data = [
...    {"x": 1, "vector": [1, 2]},
...    {"x": 2, "vector": [3, 4]},
...    {"x": 3, "vector": [5, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.delete("x = 2")
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  3  [5.0, 6.0]

If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:

>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
>>> table.to_pandas()
   x      vector
0  3  [5.0, 6.0]
Source code in lancedb/table.py
@abstractmethod
def delete(self, where: str):
    """Delete rows from the table.

    This can be used to delete a single row, many rows, all rows, or
    sometimes no rows (if your predicate matches nothing).

    Parameters
    ----------
    where: str
        The SQL where clause to use when deleting rows.

        - For example, 'x = 2' or 'x IN (1, 2, 3)'.

        The filter must not be empty, or it will error.

    Examples
    --------
    >>> import lancedb
    >>> data = [
    ...    {"x": 1, "vector": [1, 2]},
    ...    {"x": 2, "vector": [3, 4]},
    ...    {"x": 3, "vector": [5, 6]}
    ... ]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.delete("x = 2")
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  3  [5.0, 6.0]

    If you have a list of values to delete, you can combine them into a
    stringified list and use the `IN` operator:

    >>> to_remove = [1, 5]
    >>> to_remove = ", ".join([str(v) for v in to_remove])
    >>> to_remove
    '1, 5'
    >>> table.delete(f"x IN ({to_remove})")
    >>> table.to_pandas()
       x      vector
    0  3  [5.0, 6.0]
    """
    raise NotImplementedError

update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None) abstractmethod

This can be used to update zero to all rows depending on how many rows match the where clause. If no where clause is provided, then all rows will be updated.

Either values or values_sql must be provided. You cannot provide both.

Parameters:

Name Type Description Default
where Optional[str]

The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.

None
values Optional[dict]

The values to update. The keys are the column names and the values are the values to set.

None
values_sql Optional[Dict[str, str]]

The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.

None

Examples:

>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
   x      vector
0  1  [1.0, 2.0]
1  2  [3.0, 4.0]
2  3  [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10, 10]})
>>> table.to_pandas()
   x        vector
0  1    [1.0, 2.0]
1  3    [5.0, 6.0]
2  2  [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
>>> table.to_pandas()
   x        vector
0  2    [1.0, 2.0]
1  4    [5.0, 6.0]
2  3  [10.0, 10.0]
Source code in lancedb/table.py
@abstractmethod
def update(
    self,
    where: Optional[str] = None,
    values: Optional[dict] = None,
    *,
    values_sql: Optional[Dict[str, str]] = None,
):
    """
    This can be used to update zero to all rows depending on how many
    rows match the where clause. If no where clause is provided, then
    all rows will be updated.

    Either `values` or `values_sql` must be provided. You cannot provide
    both.

    Parameters
    ----------
    where: str, optional
        The SQL where clause to use when updating rows. For example, 'x = 2'
        or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
    values: dict, optional
        The values to update. The keys are the column names and the values
        are the values to set.
    values_sql: dict, optional
        The values to update, expressed as SQL expression strings. These can
        reference existing columns. For example, {"x": "x + 1"} will increment
        the x column by 1.

    Examples
    --------
    >>> import lancedb
    >>> import pandas as pd
    >>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data)
    >>> table.to_pandas()
       x      vector
    0  1  [1.0, 2.0]
    1  2  [3.0, 4.0]
    2  3  [5.0, 6.0]
    >>> table.update(where="x = 2", values={"vector": [10, 10]})
    >>> table.to_pandas()
       x        vector
    0  1    [1.0, 2.0]
    1  3    [5.0, 6.0]
    2  2  [10.0, 10.0]
    >>> table.update(values_sql={"x": "x + 1"})
    >>> table.to_pandas()
       x        vector
    0  2    [1.0, 2.0]
    1  4    [5.0, 6.0]
    2  3  [10.0, 10.0]
    """
    raise NotImplementedError

cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> CleanupStats abstractmethod

Clean up old versions of the table, freeing disk space.

Note: This function is not available in LanceDb Cloud (since LanceDb Cloud manages cleanup for you automatically)

Parameters:

Name Type Description Default
older_than Optional[timedelta]

The minimum age of the version to delete. If None, then this defaults to two weeks.

None
delete_unverified bool

Because they may be part of an in-progress transaction, files newer than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, then you can set this to True to delete all files older than older_than.

False

Returns:

Type Description
CleanupStats

The stats of the cleanup operation, including how many bytes were freed.

Source code in lancedb/table.py
@abstractmethod
def cleanup_old_versions(
    self,
    older_than: Optional[timedelta] = None,
    *,
    delete_unverified: bool = False,
) -> CleanupStats:
    """
    Clean up old versions of the table, freeing disk space.

    Note: This function is not available in LanceDb Cloud (since LanceDb
    Cloud manages cleanup for you automatically)

    Parameters
    ----------
    older_than: timedelta, default None
        The minimum age of the version to delete. If None, then this defaults
        to two weeks.
    delete_unverified: bool, default False
        Because they may be part of an in-progress transaction, files newer
        than 7 days old are not deleted by default. If you are sure that
        there are no in-progress transactions, then you can set this to True
        to delete all files older than `older_than`.

    Returns
    -------
    CleanupStats
        The stats of the cleanup operation, including how many bytes were
        freed.
    """

compact_files(*args, **kwargs) abstractmethod

Run the compaction process on the table.

Note: This function is not available in LanceDb Cloud (since LanceDb Cloud manages compaction for you automatically)

This can be run after making several small appends to optimize the table for faster reads.

Arguments are passed onto :meth:lance.dataset.DatasetOptimizer.compact_files. For most cases, the default should be fine.

Source code in lancedb/table.py
@abstractmethod
def compact_files(self, *args, **kwargs):
    """
    Run the compaction process on the table.

    Note: This function is not available in LanceDb Cloud (since LanceDb
    Cloud manages compaction for you automatically)

    This can be run after making several small appends to optimize the table
    for faster reads.

    Arguments are passed onto :meth:`lance.dataset.DatasetOptimizer.compact_files`.
    For most cases, the default should be fine.
    """

add_columns(transforms: Dict[str, str]) abstractmethod

Add new columns with defined values.

This is not yet available in LanceDB Cloud.

Parameters:

Name Type Description Default
transforms Dict[str, str]

A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns.

required
Source code in lancedb/table.py
@abstractmethod
def add_columns(self, transforms: Dict[str, str]):
    """
    Add new columns with defined values.

    This is not yet available in LanceDB Cloud.

    Parameters
    ----------
    transforms: Dict[str, str]
        A map of column name to a SQL expression to use to calculate the
        value of the new column. These expressions will be evaluated for
        each row in the table, and can reference existing columns.
    """

alter_columns(alterations: Iterable[Dict[str, str]]) abstractmethod

Alter column names and nullability.

This is not yet available in LanceDB Cloud.

alterations : Iterable[Dict[str, Any]] A sequence of dictionaries, each with the following keys: - "path": str The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c". - "name": str, optional The new name of the column. If not specified, the column name is not changed. - "nullable": bool, optional Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.

Source code in lancedb/table.py
@abstractmethod
def alter_columns(self, alterations: Iterable[Dict[str, str]]):
    """
    Alter column names and nullability.

    This is not yet available in LanceDB Cloud.

    alterations : Iterable[Dict[str, Any]]
        A sequence of dictionaries, each with the following keys:
        - "path": str
            The column path to alter. For a top-level column, this is the name.
            For a nested column, this is the dot-separated path, e.g. "a.b.c".
        - "name": str, optional
            The new name of the column. If not specified, the column name is
            not changed.
        - "nullable": bool, optional
            Whether the column should be nullable. If not specified, the column
            nullability is not changed. Only non-nullable columns can be changed
            to nullable. Currently, you cannot change a nullable column to
            non-nullable.
    """

drop_columns(columns: Iterable[str]) abstractmethod

Drop columns from the table.

This is not yet available in LanceDB Cloud.

Parameters:

Name Type Description Default
columns Iterable[str]

The names of the columns to drop.

required
Source code in lancedb/table.py
@abstractmethod
def drop_columns(self, columns: Iterable[str]):
    """
    Drop columns from the table.

    This is not yet available in LanceDB Cloud.

    Parameters
    ----------
    columns : Iterable[str]
        The names of the columns to drop.
    """

Querying (Synchronous)

lancedb.query.Query

Bases: BaseModel

The LanceDB Query

Attributes:

Name Type Description
vector List[float]

the vector to search for

filter Optional[str]

sql filter to refine the query with, optional

prefilter bool

if True then apply the filter before vector search

k int

top k results to return

metric str

the distance metric between a pair of vectors,

can support L2 (default), Cosine and Dot. metric definitions

columns Optional[List[str]]

which columns to return in the results

nprobes int

The number of probes used - optional

  • A higher number makes search more accurate but also slower.

  • See discussion in Querying an ANN Index for tuning advice.

refine_factor Optional[int]

Refine the results by reading extra elements and re-ranking them in memory.

  • A higher number makes search more accurate but also slower.

  • See discussion in Querying an ANN Index for tuning advice.

Source code in lancedb/query.py
class Query(pydantic.BaseModel):
    """The LanceDB Query

    Attributes
    ----------
    vector : List[float]
        the vector to search for
    filter : Optional[str]
        sql filter to refine the query with, optional
    prefilter : bool
        if True then apply the filter before vector search
    k : int
        top k results to return
    metric : str
        the distance metric between a pair of vectors,

        can support L2 (default), Cosine and Dot.
        [metric definitions][search]
    columns : Optional[List[str]]
        which columns to return in the results
    nprobes : int
        The number of probes used - optional

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.
    refine_factor : Optional[int]
        Refine the results by reading extra elements and re-ranking them in memory.

        - A higher number makes search more accurate but also slower.

        - See discussion in [Querying an ANN Index][querying-an-ann-index] for
          tuning advice.
    """

    vector_column: Optional[str] = None

    # vector to search for
    vector: Union[List[float], List[List[float]]]

    # sql filter to refine the query with
    filter: Optional[str] = None

    # if True then apply the filter before vector search
    prefilter: bool = False

    # top k results to return
    k: int

    # # metrics
    metric: str = "L2"

    # which columns to return in the results
    columns: Optional[Union[List[str], Dict[str, str]]] = None

    # optional query parameters for tuning the results,
    # e.g. `{"nprobes": "10", "refine_factor": "10"}`
    nprobes: int = 10

    # Refine factor.
    refine_factor: Optional[int] = None

    with_row_id: bool = False

lancedb.query.LanceQueryBuilder

Bases: ABC

An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.

Source code in lancedb/query.py
class LanceQueryBuilder(ABC):
    """An abstract query builder. Subclasses are defined for vector search,
    full text search, hybrid, and plain SQL filtering.
    """

    @classmethod
    def create(
        cls,
        table: "Table",
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
        query_type: str,
        vector_column_name: str,
        ordering_field_name: str = None,
    ) -> LanceQueryBuilder:
        """
        Create a query builder based on the given query and query type.

        Parameters
        ----------
        table: Table
            The table to query.
        query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
            The query to use. If None, an empty query builder is returned
            which performs simple SQL filtering.
        query_type: str
            The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
            If "auto", the query type is inferred based on the query.
        vector_column_name: str
            The name of the vector column to use for vector search.
        """
        if query is None:
            return LanceEmptyQueryBuilder(table)

        if query_type == "hybrid":
            # hybrid fts and vector query
            return LanceHybridQueryBuilder(table, query, vector_column_name)

        # remember the string query for reranking purpose
        str_query = query if isinstance(query, str) else None

        # convert "auto" query_type to "vector", "fts"
        # or "hybrid" and convert the query to vector if needed
        query, query_type = cls._resolve_query(
            table, query, query_type, vector_column_name
        )

        if query_type == "hybrid":
            return LanceHybridQueryBuilder(table, query, vector_column_name)

        if isinstance(query, str):
            # fts
            return LanceFtsQueryBuilder(
                table, query, ordering_field_name=ordering_field_name
            )

        if isinstance(query, list):
            query = np.array(query, dtype=np.float32)
        elif isinstance(query, np.ndarray):
            query = query.astype(np.float32)
        else:
            raise TypeError(f"Unsupported query type: {type(query)}")

        return LanceVectorQueryBuilder(table, query, vector_column_name, str_query)

    @classmethod
    def _resolve_query(cls, table, query, query_type, vector_column_name):
        # If query_type is fts, then query must be a string.
        # otherwise raise TypeError
        if query_type == "fts":
            if not isinstance(query, str):
                raise TypeError(f"'fts' queries must be a string: {type(query)}")
            return query, query_type
        elif query_type == "vector":
            query = cls._query_to_vector(table, query, vector_column_name)
            return query, query_type
        elif query_type == "auto":
            if isinstance(query, (list, np.ndarray)):
                return query, "vector"
            if isinstance(query, tuple):
                return query, "hybrid"
            else:
                conf = table.embedding_functions.get(vector_column_name)
                if conf is not None:
                    query = conf.function.compute_query_embeddings_with_retry(query)[0]
                    return query, "vector"
                else:
                    return query, "fts"
        else:
            raise ValueError(
                f"Invalid query_type, must be 'vector', 'fts', or 'auto': {query_type}"
            )

    @classmethod
    def _query_to_vector(cls, table, query, vector_column_name):
        if isinstance(query, (list, np.ndarray)):
            return query
        conf = table.embedding_functions.get(vector_column_name)
        if conf is not None:
            return conf.function.compute_query_embeddings_with_retry(query)[0]
        else:
            msg = f"No embedding function for {vector_column_name}"
            raise ValueError(msg)

    def __init__(self, table: "Table"):
        self._table = table
        self._limit = 10
        self._columns = None
        self._where = None
        self._with_row_id = False

    @deprecation.deprecated(
        deprecated_in="0.3.1",
        removed_in="0.4.0",
        current_version=__version__,
        details="Use to_pandas() instead",
    )
    def to_df(self) -> "pd.DataFrame":
        """
        *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        return self.to_pandas()

    def to_pandas(self, flatten: Optional[Union[int, bool]] = None) -> "pd.DataFrame":
        """
        Execute the query and return the results as a pandas DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.

        Parameters
        ----------
        flatten: Optional[Union[int, bool]]
            If flatten is True, flatten all nested columns.
            If flatten is an integer, flatten the nested columns up to the
            specified depth.
            If unspecified, do not flatten the nested columns.
        """
        tbl = self.to_arrow()
        if flatten is True:
            while True:
                tbl = tbl.flatten()
                # loop through all columns to check if there is any struct column
                if any(pa.types.is_struct(col.type) for col in tbl.schema):
                    continue
                else:
                    break
        elif isinstance(flatten, int):
            if flatten <= 0:
                raise ValueError(
                    "Please specify a positive integer for flatten or the boolean "
                    "value `True`"
                )
            while flatten > 0:
                tbl = tbl.flatten()
                flatten -= 1
        return tbl.to_pandas()

    @abstractmethod
    def to_arrow(self) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.
        """
        raise NotImplementedError

    def to_list(self) -> List[dict]:
        """
        Execute the query and return the results as a list of dictionaries.

        Each list entry is a dictionary with the selected column names as keys,
        or all table columns if `select` is not called. The vector and the "_distance"
        fields are returned whether or not they're explicitly selected.
        """
        return self.to_arrow().to_pylist()

    def to_pydantic(self, model: Type[LanceModel]) -> List[LanceModel]:
        """Return the table as a list of pydantic models.

        Parameters
        ----------
        model: Type[LanceModel]
            The pydantic model to use.

        Returns
        -------
        List[LanceModel]
        """
        return [
            model(**{k: v for k, v in row.items() if k in model.field_names()})
            for row in self.to_arrow().to_pylist()
        ]

    def to_polars(self) -> "pl.DataFrame":
        """
        Execute the query and return the results as a Polars DataFrame.
        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vector.
        """
        import polars as pl

        return pl.from_arrow(self.to_arrow())

    def limit(self, limit: Union[int, None]) -> LanceQueryBuilder:
        """Set the maximum number of results to return.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.
            By default the query is limited to the first 10.
            Call this method and pass 0, a negative value,
            or None to remove the limit.
            *WARNING* if you have a large dataset, removing
            the limit can potentially result in reading a
            large amount of data into memory and cause
            out of memory issues.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if limit is None or limit <= 0:
            self._limit = None
        else:
            self._limit = limit
        return self

    def select(self, columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder:
        """Set the columns to return.

        Parameters
        ----------
        columns: list of str, or dict of str to str default None
            List of column names to be fetched.
            Or a dictionary of column names to SQL expressions.
            All columns are fetched if None or unspecified.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        if isinstance(columns, list) or isinstance(columns, dict):
            self._columns = columns
        else:
            raise ValueError("columns must be a list or a dictionary")
        return self

    def where(self, where: str, prefilter: bool = False) -> LanceQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default False
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.
            This feature is **EXPERIMENTAL** and may be removed and modified
            without warning in the future.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        self._prefilter = prefilter
        return self

    def with_row_id(self, with_row_id: bool) -> LanceQueryBuilder:
        """Set whether to return row ids.

        Parameters
        ----------
        with_row_id: bool
            If True, return _rowid column in the results.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._with_row_id = with_row_id
        return self

create(table: 'Table', query: Optional[Union[np.ndarray, str, 'PIL.Image.Image', Tuple]], query_type: str, vector_column_name: str, ordering_field_name: str = None) -> LanceQueryBuilder classmethod

Create a query builder based on the given query and query type.

Parameters:

Name Type Description Default
table 'Table'

The table to query.

required
query Optional[Union[ndarray, str, 'PIL.Image.Image', Tuple]]

The query to use. If None, an empty query builder is returned which performs simple SQL filtering.

required
query_type str

The type of query to perform. One of "vector", "fts", "hybrid", or "auto". If "auto", the query type is inferred based on the query.

required
vector_column_name str

The name of the vector column to use for vector search.

required
Source code in lancedb/query.py
@classmethod
def create(
    cls,
    table: "Table",
    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
    query_type: str,
    vector_column_name: str,
    ordering_field_name: str = None,
) -> LanceQueryBuilder:
    """
    Create a query builder based on the given query and query type.

    Parameters
    ----------
    table: Table
        The table to query.
    query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]]
        The query to use. If None, an empty query builder is returned
        which performs simple SQL filtering.
    query_type: str
        The type of query to perform. One of "vector", "fts", "hybrid", or "auto".
        If "auto", the query type is inferred based on the query.
    vector_column_name: str
        The name of the vector column to use for vector search.
    """
    if query is None:
        return LanceEmptyQueryBuilder(table)

    if query_type == "hybrid":
        # hybrid fts and vector query
        return LanceHybridQueryBuilder(table, query, vector_column_name)

    # remember the string query for reranking purpose
    str_query = query if isinstance(query, str) else None

    # convert "auto" query_type to "vector", "fts"
    # or "hybrid" and convert the query to vector if needed
    query, query_type = cls._resolve_query(
        table, query, query_type, vector_column_name
    )

    if query_type == "hybrid":
        return LanceHybridQueryBuilder(table, query, vector_column_name)

    if isinstance(query, str):
        # fts
        return LanceFtsQueryBuilder(
            table, query, ordering_field_name=ordering_field_name
        )

    if isinstance(query, list):
        query = np.array(query, dtype=np.float32)
    elif isinstance(query, np.ndarray):
        query = query.astype(np.float32)
    else:
        raise TypeError(f"Unsupported query type: {type(query)}")

    return LanceVectorQueryBuilder(table, query, vector_column_name, str_query)

to_df() -> 'pd.DataFrame'

Deprecated alias for to_pandas(). Please use to_pandas() instead.

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/query.py
@deprecation.deprecated(
    deprecated_in="0.3.1",
    removed_in="0.4.0",
    current_version=__version__,
    details="Use to_pandas() instead",
)
def to_df(self) -> "pd.DataFrame":
    """
    *Deprecated alias for `to_pandas()`. Please use `to_pandas()` instead.*

    Execute the query and return the results as a pandas DataFrame.
    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vector.
    """
    return self.to_pandas()

to_pandas(flatten: Optional[Union[int, bool]] = None) -> 'pd.DataFrame'

Execute the query and return the results as a pandas DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Parameters:

Name Type Description Default
flatten Optional[Union[int, bool]]

If flatten is True, flatten all nested columns. If flatten is an integer, flatten the nested columns up to the specified depth. If unspecified, do not flatten the nested columns.

None
Source code in lancedb/query.py
def to_pandas(self, flatten: Optional[Union[int, bool]] = None) -> "pd.DataFrame":
    """
    Execute the query and return the results as a pandas DataFrame.
    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vector.

    Parameters
    ----------
    flatten: Optional[Union[int, bool]]
        If flatten is True, flatten all nested columns.
        If flatten is an integer, flatten the nested columns up to the
        specified depth.
        If unspecified, do not flatten the nested columns.
    """
    tbl = self.to_arrow()
    if flatten is True:
        while True:
            tbl = tbl.flatten()
            # loop through all columns to check if there is any struct column
            if any(pa.types.is_struct(col.type) for col in tbl.schema):
                continue
            else:
                break
    elif isinstance(flatten, int):
        if flatten <= 0:
            raise ValueError(
                "Please specify a positive integer for flatten or the boolean "
                "value `True`"
            )
        while flatten > 0:
            tbl = tbl.flatten()
            flatten -= 1
    return tbl.to_pandas()

to_arrow() -> pa.Table abstractmethod

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vectors.

Source code in lancedb/query.py
@abstractmethod
def to_arrow(self) -> pa.Table:
    """
    Execute the query and return the results as an
    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vectors.
    """
    raise NotImplementedError

to_list() -> List[dict]

Execute the query and return the results as a list of dictionaries.

Each list entry is a dictionary with the selected column names as keys, or all table columns if select is not called. The vector and the "_distance" fields are returned whether or not they're explicitly selected.

Source code in lancedb/query.py
def to_list(self) -> List[dict]:
    """
    Execute the query and return the results as a list of dictionaries.

    Each list entry is a dictionary with the selected column names as keys,
    or all table columns if `select` is not called. The vector and the "_distance"
    fields are returned whether or not they're explicitly selected.
    """
    return self.to_arrow().to_pylist()

to_pydantic(model: Type[LanceModel]) -> List[LanceModel]

Return the table as a list of pydantic models.

Parameters:

Name Type Description Default
model Type[LanceModel]

The pydantic model to use.

required

Returns:

Type Description
List[LanceModel]
Source code in lancedb/query.py
def to_pydantic(self, model: Type[LanceModel]) -> List[LanceModel]:
    """Return the table as a list of pydantic models.

    Parameters
    ----------
    model: Type[LanceModel]
        The pydantic model to use.

    Returns
    -------
    List[LanceModel]
    """
    return [
        model(**{k: v for k, v in row.items() if k in model.field_names()})
        for row in self.to_arrow().to_pylist()
    ]

to_polars() -> 'pl.DataFrame'

Execute the query and return the results as a Polars DataFrame. In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vector.

Source code in lancedb/query.py
def to_polars(self) -> "pl.DataFrame":
    """
    Execute the query and return the results as a Polars DataFrame.
    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vector.
    """
    import polars as pl

    return pl.from_arrow(self.to_arrow())

limit(limit: Union[int, None]) -> LanceQueryBuilder

Set the maximum number of results to return.

Parameters:

Name Type Description Default
limit Union[int, None]

The maximum number of results to return. By default the query is limited to the first 10. Call this method and pass 0, a negative value, or None to remove the limit. WARNING if you have a large dataset, removing the limit can potentially result in reading a large amount of data into memory and cause out of memory issues.

required

Returns:

Type Description
LanceQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def limit(self, limit: Union[int, None]) -> LanceQueryBuilder:
    """Set the maximum number of results to return.

    Parameters
    ----------
    limit: int
        The maximum number of results to return.
        By default the query is limited to the first 10.
        Call this method and pass 0, a negative value,
        or None to remove the limit.
        *WARNING* if you have a large dataset, removing
        the limit can potentially result in reading a
        large amount of data into memory and cause
        out of memory issues.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if limit is None or limit <= 0:
        self._limit = None
    else:
        self._limit = limit
    return self

select(columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder

Set the columns to return.

Parameters:

Name Type Description Default
columns Union[list[str], dict[str, str]]

List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

required

Returns:

Type Description
LanceQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def select(self, columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder:
    """Set the columns to return.

    Parameters
    ----------
    columns: list of str, or dict of str to str default None
        List of column names to be fetched.
        Or a dictionary of column names to SQL expressions.
        All columns are fetched if None or unspecified.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    if isinstance(columns, list) or isinstance(columns, dict):
        self._columns = columns
    else:
        raise ValueError("columns must be a list or a dictionary")
    return self

where(where: str, prefilter: bool = False) -> LanceQueryBuilder

Set the where clause.

Parameters:

Name Type Description Default
where str

The where clause which is a valid SQL where clause. See Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>_ for valid SQL expressions.

required
prefilter bool

If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed and modified without warning in the future.

False

Returns:

Type Description
LanceQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def where(self, where: str, prefilter: bool = False) -> LanceQueryBuilder:
    """Set the where clause.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.
    prefilter: bool, default False
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.
        This feature is **EXPERIMENTAL** and may be removed and modified
        without warning in the future.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._where = where
    self._prefilter = prefilter
    return self

with_row_id(with_row_id: bool) -> LanceQueryBuilder

Set whether to return row ids.

Parameters:

Name Type Description Default
with_row_id bool

If True, return _rowid column in the results.

required

Returns:

Type Description
LanceQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def with_row_id(self, with_row_id: bool) -> LanceQueryBuilder:
    """Set whether to return row ids.

    Parameters
    ----------
    with_row_id: bool
        If True, return _rowid column in the results.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._with_row_id = with_row_id
    return self

lancedb.query.LanceVectorQueryBuilder

Bases: LanceQueryBuilder

Examples:

>>> import lancedb
>>> data = [{"vector": [1.1, 1.2], "b": 2},
...         {"vector": [0.5, 1.3], "b": 4},
...         {"vector": [0.4, 0.4], "b": 6},
...         {"vector": [0.4, 0.4], "b": 10}]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=data)
>>> (table.search([0.4, 0.4])
...       .metric("cosine")
...       .where("b < 10")
...       .select(["b", "vector"])
...       .limit(2)
...       .to_pandas())
   b      vector  _distance
0  6  [0.4, 0.4]        0.0
Source code in lancedb/query.py
class LanceVectorQueryBuilder(LanceQueryBuilder):
    """
    Examples
    --------
    >>> import lancedb
    >>> data = [{"vector": [1.1, 1.2], "b": 2},
    ...         {"vector": [0.5, 1.3], "b": 4},
    ...         {"vector": [0.4, 0.4], "b": 6},
    ...         {"vector": [0.4, 0.4], "b": 10}]
    >>> db = lancedb.connect("./.lancedb")
    >>> table = db.create_table("my_table", data=data)
    >>> (table.search([0.4, 0.4])
    ...       .metric("cosine")
    ...       .where("b < 10")
    ...       .select(["b", "vector"])
    ...       .limit(2)
    ...       .to_pandas())
       b      vector  _distance
    0  6  [0.4, 0.4]        0.0
    """

    def __init__(
        self,
        table: "Table",
        query: Union[np.ndarray, list, "PIL.Image.Image"],
        vector_column: str,
        str_query: Optional[str] = None,
    ):
        super().__init__(table)
        self._query = query
        self._metric = "L2"
        self._nprobes = 20
        self._refine_factor = None
        self._vector_column = vector_column
        self._prefilter = False
        self._reranker = None
        self._str_query = str_query

    def metric(self, metric: Literal["L2", "cosine"]) -> LanceVectorQueryBuilder:
        """Set the distance metric to use.

        Parameters
        ----------
        metric: "L2" or "cosine"
            The distance metric to use. By default "L2" is used.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._metric = metric
        return self

    def nprobes(self, nprobes: int) -> LanceVectorQueryBuilder:
        """Set the number of probes to use.

        Higher values will yield better recall (more likely to find vectors if
        they exist) at the expense of latency.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        nprobes: int
            The number of probes to use.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._nprobes = nprobes
        return self

    def refine_factor(self, refine_factor: int) -> LanceVectorQueryBuilder:
        """Set the refine factor to use, increasing the number of vectors sampled.

        As an example, a refine factor of 2 will sample 2x as many vectors as
        requested, re-ranks them, and returns the top half most relevant results.

        See discussion in [Querying an ANN Index][querying-an-ann-index] for
        tuning advice.

        Parameters
        ----------
        refine_factor: int
            The refine factor to use.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._refine_factor = refine_factor
        return self

    def to_arrow(self) -> pa.Table:
        """
        Execute the query and return the results as an
        [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

        In addition to the selected columns, LanceDB also returns a vector
        and also the "_distance" column which is the distance between the query
        vector and the returned vectors.
        """
        return self.to_batches().read_all()

    def to_batches(self, /, batch_size: Optional[int] = None) -> pa.RecordBatchReader:
        """
        Execute the query and return the result as a RecordBatchReader object.

        Parameters
        ----------
        batch_size: int
            The maximum number of selected records in a RecordBatch object.

        Returns
        -------
        pa.RecordBatchReader
        """
        vector = self._query if isinstance(self._query, list) else self._query.tolist()
        if isinstance(vector[0], np.ndarray):
            vector = [v.tolist() for v in vector]
        query = Query(
            vector=vector,
            filter=self._where,
            prefilter=self._prefilter,
            k=self._limit,
            metric=self._metric,
            columns=self._columns,
            nprobes=self._nprobes,
            refine_factor=self._refine_factor,
            vector_column=self._vector_column,
            with_row_id=self._with_row_id,
        )
        result_set = self._table._execute_query(query, batch_size)
        if self._reranker is not None:
            rs_table = result_set.read_all()
            result_set = self._reranker.rerank_vector(self._str_query, rs_table)
            # convert result_set back to RecordBatchReader
            result_set = pa.RecordBatchReader.from_batches(
                result_set.schema, result_set.to_batches()
            )

        return result_set

    def where(self, where: str, prefilter: bool = False) -> LanceVectorQueryBuilder:
        """Set the where clause.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.
        prefilter: bool, default False
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.
            This feature is **EXPERIMENTAL** and may be removed and modified
            without warning in the future.

        Returns
        -------
        LanceQueryBuilder
            The LanceQueryBuilder object.
        """
        self._where = where
        self._prefilter = prefilter
        return self

    def rerank(
        self, reranker: Reranker, query_string: Optional[str] = None
    ) -> LanceVectorQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        query_string: Optional[str]
            The query to use for reranking. This needs to be specified explicitly here
            as the query used for vector search may already be vectorized and the
            reranker requires a string query.
            This is only required if the query used for vector search is not a string.
            Note: This doesn't yet support the case where the query is multimodal or a
            list of vectors.

        Returns
        -------
        LanceVectorQueryBuilder
            The LanceQueryBuilder object.
        """
        self._reranker = reranker
        if self._str_query is None and query_string is None:
            raise ValueError(
                """
                The query used for vector search is not a string.
                In this case, the reranker query needs to be specified explicitly.
                """
            )
        if query_string is not None and not isinstance(query_string, str):
            raise ValueError("Reranking currently only supports string queries")
        self._str_query = query_string if query_string is not None else self._str_query
        return self

metric(metric: Literal['L2', 'cosine']) -> LanceVectorQueryBuilder

Set the distance metric to use.

Parameters:

Name Type Description Default
metric Literal['L2', 'cosine']

The distance metric to use. By default "L2" is used.

required

Returns:

Type Description
LanceVectorQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def metric(self, metric: Literal["L2", "cosine"]) -> LanceVectorQueryBuilder:
    """Set the distance metric to use.

    Parameters
    ----------
    metric: "L2" or "cosine"
        The distance metric to use. By default "L2" is used.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._metric = metric
    return self

nprobes(nprobes: int) -> LanceVectorQueryBuilder

Set the number of probes to use.

Higher values will yield better recall (more likely to find vectors if they exist) at the expense of latency.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

Name Type Description Default
nprobes int

The number of probes to use.

required

Returns:

Type Description
LanceVectorQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def nprobes(self, nprobes: int) -> LanceVectorQueryBuilder:
    """Set the number of probes to use.

    Higher values will yield better recall (more likely to find vectors if
    they exist) at the expense of latency.

    See discussion in [Querying an ANN Index][querying-an-ann-index] for
    tuning advice.

    Parameters
    ----------
    nprobes: int
        The number of probes to use.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._nprobes = nprobes
    return self

refine_factor(refine_factor: int) -> LanceVectorQueryBuilder

Set the refine factor to use, increasing the number of vectors sampled.

As an example, a refine factor of 2 will sample 2x as many vectors as requested, re-ranks them, and returns the top half most relevant results.

See discussion in Querying an ANN Index for tuning advice.

Parameters:

Name Type Description Default
refine_factor int

The refine factor to use.

required

Returns:

Type Description
LanceVectorQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def refine_factor(self, refine_factor: int) -> LanceVectorQueryBuilder:
    """Set the refine factor to use, increasing the number of vectors sampled.

    As an example, a refine factor of 2 will sample 2x as many vectors as
    requested, re-ranks them, and returns the top half most relevant results.

    See discussion in [Querying an ANN Index][querying-an-ann-index] for
    tuning advice.

    Parameters
    ----------
    refine_factor: int
        The refine factor to use.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._refine_factor = refine_factor
    return self

to_arrow() -> pa.Table

Execute the query and return the results as an Apache Arrow Table.

In addition to the selected columns, LanceDB also returns a vector and also the "_distance" column which is the distance between the query vector and the returned vectors.

Source code in lancedb/query.py
def to_arrow(self) -> pa.Table:
    """
    Execute the query and return the results as an
    [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).

    In addition to the selected columns, LanceDB also returns a vector
    and also the "_distance" column which is the distance between the query
    vector and the returned vectors.
    """
    return self.to_batches().read_all()

to_batches(batch_size: Optional[int] = None) -> pa.RecordBatchReader

Execute the query and return the result as a RecordBatchReader object.

Parameters:

Name Type Description Default
batch_size Optional[int]

The maximum number of selected records in a RecordBatch object.

None

Returns:

Type Description
RecordBatchReader
Source code in lancedb/query.py
def to_batches(self, /, batch_size: Optional[int] = None) -> pa.RecordBatchReader:
    """
    Execute the query and return the result as a RecordBatchReader object.

    Parameters
    ----------
    batch_size: int
        The maximum number of selected records in a RecordBatch object.

    Returns
    -------
    pa.RecordBatchReader
    """
    vector = self._query if isinstance(self._query, list) else self._query.tolist()
    if isinstance(vector[0], np.ndarray):
        vector = [v.tolist() for v in vector]
    query = Query(
        vector=vector,
        filter=self._where,
        prefilter=self._prefilter,
        k=self._limit,
        metric=self._metric,
        columns=self._columns,
        nprobes=self._nprobes,
        refine_factor=self._refine_factor,
        vector_column=self._vector_column,
        with_row_id=self._with_row_id,
    )
    result_set = self._table._execute_query(query, batch_size)
    if self._reranker is not None:
        rs_table = result_set.read_all()
        result_set = self._reranker.rerank_vector(self._str_query, rs_table)
        # convert result_set back to RecordBatchReader
        result_set = pa.RecordBatchReader.from_batches(
            result_set.schema, result_set.to_batches()
        )

    return result_set

where(where: str, prefilter: bool = False) -> LanceVectorQueryBuilder

Set the where clause.

Parameters:

Name Type Description Default
where str

The where clause which is a valid SQL where clause. See Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>_ for valid SQL expressions.

required
prefilter bool

If True, apply the filter before vector search, otherwise the filter is applied on the result of vector search. This feature is EXPERIMENTAL and may be removed and modified without warning in the future.

False

Returns:

Type Description
LanceQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def where(self, where: str, prefilter: bool = False) -> LanceVectorQueryBuilder:
    """Set the where clause.

    Parameters
    ----------
    where: str
        The where clause which is a valid SQL where clause. See
        `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
        for valid SQL expressions.
    prefilter: bool, default False
        If True, apply the filter before vector search, otherwise the
        filter is applied on the result of vector search.
        This feature is **EXPERIMENTAL** and may be removed and modified
        without warning in the future.

    Returns
    -------
    LanceQueryBuilder
        The LanceQueryBuilder object.
    """
    self._where = where
    self._prefilter = prefilter
    return self

rerank(reranker: Reranker, query_string: Optional[str] = None) -> LanceVectorQueryBuilder

Rerank the results using the specified reranker.

Parameters:

Name Type Description Default
reranker Reranker

The reranker to use.

required
query_string Optional[str]

The query to use for reranking. This needs to be specified explicitly here as the query used for vector search may already be vectorized and the reranker requires a string query. This is only required if the query used for vector search is not a string. Note: This doesn't yet support the case where the query is multimodal or a list of vectors.

None

Returns:

Type Description
LanceVectorQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def rerank(
    self, reranker: Reranker, query_string: Optional[str] = None
) -> LanceVectorQueryBuilder:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    query_string: Optional[str]
        The query to use for reranking. This needs to be specified explicitly here
        as the query used for vector search may already be vectorized and the
        reranker requires a string query.
        This is only required if the query used for vector search is not a string.
        Note: This doesn't yet support the case where the query is multimodal or a
        list of vectors.

    Returns
    -------
    LanceVectorQueryBuilder
        The LanceQueryBuilder object.
    """
    self._reranker = reranker
    if self._str_query is None and query_string is None:
        raise ValueError(
            """
            The query used for vector search is not a string.
            In this case, the reranker query needs to be specified explicitly.
            """
        )
    if query_string is not None and not isinstance(query_string, str):
        raise ValueError("Reranking currently only supports string queries")
    self._str_query = query_string if query_string is not None else self._str_query
    return self

lancedb.query.LanceFtsQueryBuilder

Bases: LanceQueryBuilder

A builder for full text search for LanceDB.

Source code in lancedb/query.py
class LanceFtsQueryBuilder(LanceQueryBuilder):
    """A builder for full text search for LanceDB."""

    def __init__(self, table: "Table", query: str, ordering_field_name: str = None):
        super().__init__(table)
        self._query = query
        self._phrase_query = False
        self.ordering_field_name = ordering_field_name
        self._reranker = None

    def phrase_query(self, phrase_query: bool = True) -> LanceFtsQueryBuilder:
        """Set whether to use phrase query.

        Parameters
        ----------
        phrase_query: bool, default True
            If True, then the query will be wrapped in quotes and
            double quotes replaced by single quotes.

        Returns
        -------
        LanceFtsQueryBuilder
            The LanceFtsQueryBuilder object.
        """
        self._phrase_query = phrase_query
        return self

    def to_arrow(self) -> pa.Table:
        try:
            import tantivy
        except ImportError:
            raise ImportError(
                "Please install tantivy-py `pip install tantivy` to use the full text search feature."  # noqa: E501
            )

        from .fts import search_index

        # get the index path
        index_path = self._table._get_fts_index_path()
        # check if the index exist
        if not Path(index_path).exists():
            raise FileNotFoundError(
                "Fts index does not exist. "
                "Please first call table.create_fts_index(['<field_names>']) to "
                "create the fts index."
            )
        # open the index
        index = tantivy.Index.open(index_path)
        # get the scores and doc ids
        query = self._query
        if self._phrase_query:
            query = query.replace('"', "'")
            query = f'"{query}"'
        row_ids, scores = search_index(
            index, query, self._limit, ordering_field=self.ordering_field_name
        )
        if len(row_ids) == 0:
            empty_schema = pa.schema([pa.field("score", pa.float32())])
            return pa.Table.from_pylist([], schema=empty_schema)
        scores = pa.array(scores)
        output_tbl = self._table.to_lance().take(row_ids, columns=self._columns)
        output_tbl = output_tbl.append_column("score", scores)
        # this needs to match vector search results which are uint64
        row_ids = pa.array(row_ids, type=pa.uint64())

        if self._where is not None:
            tmp_name = "__lancedb__duckdb__indexer__"
            output_tbl = output_tbl.append_column(
                tmp_name, pa.array(range(len(output_tbl)))
            )
            try:
                # TODO would be great to have Substrait generate pyarrow compute
                # expressions or conversely have pyarrow support SQL expressions
                # using Substrait
                import duckdb

                indexer = duckdb.sql(
                    f"SELECT {tmp_name} FROM output_tbl WHERE {self._where}"
                ).to_arrow_table()[tmp_name]
                output_tbl = output_tbl.take(indexer).drop([tmp_name])
                row_ids = row_ids.take(indexer)

            except ImportError:
                import tempfile

                import lance

                # TODO Use "memory://" instead once that's supported
                with tempfile.TemporaryDirectory() as tmp:
                    ds = lance.write_dataset(output_tbl, tmp)
                    output_tbl = ds.to_table(filter=self._where)
                    indexer = output_tbl[tmp_name]
                    row_ids = row_ids.take(indexer)
                    output_tbl = output_tbl.drop([tmp_name])

        if self._with_row_id:
            output_tbl = output_tbl.append_column("_rowid", row_ids)

        if self._reranker is not None:
            output_tbl = self._reranker.rerank_fts(self._query, output_tbl)
        return output_tbl

    def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
        """Rerank the results using the specified reranker.

        Parameters
        ----------
        reranker: Reranker
            The reranker to use.

        Returns
        -------
        LanceFtsQueryBuilder
            The LanceQueryBuilder object.
        """
        self._reranker = reranker
        return self

phrase_query(phrase_query: bool = True) -> LanceFtsQueryBuilder

Set whether to use phrase query.

Parameters:

Name Type Description Default
phrase_query bool

If True, then the query will be wrapped in quotes and double quotes replaced by single quotes.

True

Returns:

Type Description
LanceFtsQueryBuilder

The LanceFtsQueryBuilder object.

Source code in lancedb/query.py
def phrase_query(self, phrase_query: bool = True) -> LanceFtsQueryBuilder:
    """Set whether to use phrase query.

    Parameters
    ----------
    phrase_query: bool, default True
        If True, then the query will be wrapped in quotes and
        double quotes replaced by single quotes.

    Returns
    -------
    LanceFtsQueryBuilder
        The LanceFtsQueryBuilder object.
    """
    self._phrase_query = phrase_query
    return self

rerank(reranker: Reranker) -> LanceFtsQueryBuilder

Rerank the results using the specified reranker.

Parameters:

Name Type Description Default
reranker Reranker

The reranker to use.

required

Returns:

Type Description
LanceFtsQueryBuilder

The LanceQueryBuilder object.

Source code in lancedb/query.py
def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
    """Rerank the results using the specified reranker.

    Parameters
    ----------
    reranker: Reranker
        The reranker to use.

    Returns
    -------
    LanceFtsQueryBuilder
        The LanceQueryBuilder object.
    """
    self._reranker = reranker
    return self

lancedb.query.LanceHybridQueryBuilder

Bases: LanceQueryBuilder

A query builder that performs hybrid vector and full text search. Results are combined and reranked based on the specified reranker. By default, the results are reranked using the LinearCombinationReranker.

To make the vector and fts results comparable, the scores are normalized. Instead of normalizing scores, the normalize parameter can be set to "rank" in the rerank method to convert the scores to ranks and then normalize them.

Source code in lancedb/query.py
class LanceHybridQueryBuilder(LanceQueryBuilder):
    """
    A query builder that performs hybrid vector and full text search.
    Results are combined and reranked based on the specified reranker.
    By default, the results are reranked using the LinearCombinationReranker.

    To make the vector and fts results comparable, the scores are normalized.
    Instead of normalizing scores, the `normalize` parameter can be set to "rank"
    in the `rerank` method to convert the scores to ranks and then normalize them.
    """

    def __init__(self, table: "Table", query: str, vector_column: str):
        super().__init__(table)
        self._validate_fts_index()
        vector_query, fts_query = self._validate_query(query)
        self._fts_query = LanceFtsQueryBuilder(table, fts_query)
        vector_query = self._query_to_vector(table, vector_query, vector_column)
        self._vector_query = LanceVectorQueryBuilder(table, vector_query, vector_column)
        self._norm = "score"
        self._reranker = LinearCombinationReranker(weight=0.7, fill=1.0)

    def _validate_fts_index(self):
        if self._table._get_fts_index_path() is None:
            raise ValueError(
                "Please create a full-text search index " "to perform hybrid search."
            )

    def _validate_query(self, query):
        # Temp hack to support vectorized queries for hybrid search
        if isinstance(query, str):
            return query, query
        elif isinstance(query, tuple):
            if len(query) != 2:
                raise ValueError(
                    "The query must be a tuple of (vector_query, fts_query)."
                )
            if not isinstance(query[0], (list, np.ndarray, pa.Array, pa.ChunkedArray)):
                raise ValueError(f"The vector query must be one of {VEC}.")
            if not isinstance(query[1], str):
                raise ValueError("The fts query must be a string.")
            return query[0], query[1]
        else:
            raise ValueError(
                "The query must be either a string or a tuple of (vector, string)."
            )

    def to_arrow(self) -> pa.Table:
        with ThreadPoolExecutor() as executor:
            fts_future = executor.submit(self._fts_query.with_row_id(True).to_arrow)
            vector_future = executor.submit(
                self._vector_query.with_row_id(True).to_arrow
            )
            fts_results = fts_future.result()
            vector_results = vector_future.result()

        # convert to ranks first if needed
        if self._norm == "rank":
            vector_results = self._rank(vector_results, "_distance")
            fts_results = self._rank(fts_results, "score")
        # normalize the scores to be between 0 and 1, 0 being most relevant
        vector_results = self._normalize_scores(vector_results, "_distance")

        # In fts higher scores represent relevance. Not inverting them here as
        # rerankers might need to preserve this score to support `return_score="all"`
        fts_results = self._normalize_scores(fts_results, "score")

        results = self._reranker.rerank_hybrid(
            self._fts_query._query, vector_results, fts_results
        )

        if not isinstance(results, pa.Table):  # Enforce type
            raise TypeError(
                f"rerank_hybrid must return a pyarrow.Table, got {type(results)}"
            )

        # apply limit after reranking
        results = results.slice(length=self._limit)

        if not self._with_row_id:
            results = results.drop(["_rowid"])
        return results

    def _rank(self, results: pa.Table, column: str, ascending: bool = True):
        if len(results) == 0:
            return results
        # Get the _score column from results
        scores = results.column(column).to_numpy()
        sort_indices = np.argsort(scores)
        if not ascending:
            sort_indices = sort_indices[::-1]
        ranks = np.empty_like(sort_indices)
        ranks[sort_indices] = np.arange(len(scores)) + 1
        # replace the _score column with the ranks
        _score_idx = results.column_names.index(column)
        results = results.set_column(
            _score_idx, column, pa.array(ranks, type=pa.float32())
        )
        return results

    def _normalize_scores(self, results: pa.Table, column: str, invert=False):
        if len(results) == 0:
            return results
        # Get the _score column from results
        scores = results.column(column).to_numpy()
        # normalize the scores by subtracting the min and dividing by the max
        max, min = np.max(scores), np.min(scores)
        if np.isclose(max, min):
            rng = max
        else:
            rng = max - min
        scores = (scores - min) / rng
        if invert:
            scores = 1 - scores
        # replace the _score column with the ranks
        _score_idx = results.column_names.index(column)
        results = results.set_column(
            _score_idx, column, pa.array(scores, type=pa.float32())
        )
        return results

    def rerank(
        self,
        normalize="score",
        reranker: Reranker = LinearCombinationReranker(weight=0.7, fill=1.0),
    ) -> LanceHybridQueryBuilder:
        """
        Rerank the hybrid search results using the specified reranker. The reranker
        must be an instance of Reranker class.

        Parameters
        ----------
        normalize: str, default "score"
            The method to normalize the scores. Can be "rank" or "score". If "rank",
            the scores are converted to ranks and then normalized. If "score", the
            scores are normalized directly.
        reranker: Reranker, default LinearCombinationReranker(weight=0.7, fill=1.0)
            The reranker to use. Must be an instance of Reranker class.
        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        if normalize not in ["rank", "score"]:
            raise ValueError("normalize must be 'rank' or 'score'.")
        if reranker and not isinstance(reranker, Reranker):
            raise ValueError("reranker must be an instance of Reranker class.")

        self._norm = normalize
        self._reranker = reranker

        return self

    def limit(self, limit: int) -> LanceHybridQueryBuilder:
        """
        Set the maximum number of results to return for both vector and fts search
        components.

        Parameters
        ----------
        limit: int
            The maximum number of results to return.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.limit(limit)
        self._fts_query.limit(limit)
        self._limit = limit

        return self

    def select(self, columns: list) -> LanceHybridQueryBuilder:
        """
        Set the columns to return for both vector and fts search.

        Parameters
        ----------
        columns: list
            The columns to return.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """
        self._vector_query.select(columns)
        self._fts_query.select(columns)
        return self

    def where(self, where: str, prefilter: bool = False) -> LanceHybridQueryBuilder:
        """
        Set the where clause for both vector and fts search.

        Parameters
        ----------
        where: str
            The where clause which is a valid SQL where clause. See
            `Lance filter pushdown <https://lancedb.github.io/lance/read_and_write.html#filter-push-down>`_
            for valid SQL expressions.

        prefilter: bool, default False
            If True, apply the filter before vector search, otherwise the
            filter is applied on the result of vector search.

        Returns
        -------
        LanceHybridQueryBuilder
            The LanceHybridQueryBuilder object.
        """

        self._vector_query.where(where, prefilter=prefilter)
        self._fts_query.where(where)
        return