Python API Reference
This section contains the reference for the LanceDB Python API, which provides both a synchronous and an asynchronous client.
The general flow of using the API is:
- Use lancedb.connect or lancedb.connect_async to connect to a database.
- Use the returned lancedb.DBConnection or lancedb.AsyncConnection to create or open tables.
- Use the returned lancedb.table.Table or lancedb.AsyncTable to query or modify tables.
Installation
pip install lancedb
The following methods describe the synchronous API client. There is also an asynchronous API client.
Connections (Synchronous)
lancedb.connect
connect(uri: URI, *, api_key: Optional[str] = None, region: str = 'us-east-1', host_override: Optional[str] = None, read_consistency_interval: Optional[timedelta] = None, request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None, client_config: Union[ClientConfig, Dict[str, Any], None] = None, storage_options: Optional[Dict[str, str]] = None, **kwargs: Any) -> DBConnection
Connect to a LanceDB database.
Parameters:
-
uri
(URI
) –The uri of the database.
-
api_key
(Optional[str]
, default:None
) –If provided, connect to LanceDB Cloud. Otherwise, connect to a database on the file system or in cloud storage. Can be set via the environment variable LANCEDB_API_KEY.
-
region
(str
, default:'us-east-1'
) –The region to use for LanceDB Cloud.
-
host_override
(Optional[str]
, default:None
) –The override url for LanceDB Cloud.
-
read_consistency_interval
(Optional[timedelta]
, default:None
) –(For LanceDB OSS only) The interval at which to check for updates to the table from other processes. If None, then consistency is not checked. For performance reasons, this is the default. For strong consistency, set this to zero seconds. Then every read will check for updates from other processes. As a compromise, you can set this to a non-zero timedelta for eventual consistency. If more than that interval has passed since the last check, then the table will be checked for updates. Note: this consistency only applies to read operations. Write operations are always consistent.
-
client_config
(Union[ClientConfig, Dict[str, Any], None]
, default:None
) –Configuration options for the LanceDB Cloud HTTP client. If a dict, then the keys are the attributes of the ClientConfig class. If None, then the default configuration is used.
-
storage_options
(Optional[Dict[str, str]]
, default:None
) –Additional options for the storage backend. See available options at https://lancedb.github.io/lancedb/guides/storage/
Examples:
For a local directory, provide a path for the database:
>>> import lancedb
>>> db = lancedb.connect("~/.lancedb")
For object storage, use a URI prefix:
>>> db = lancedb.connect("s3://my-bucket/lancedb",
... storage_options={"aws_access_key_id": "***"})
Connect to LanceDB Cloud:
>>> db = lancedb.connect("db://my_database", api_key="ldb_...",
... client_config={"retry_config": {"retries": 5}})
Returns:
-
conn
(DBConnection
) –A connection to a LanceDB database.
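The read_consistency_interval rule described above can be modeled in a few lines. This is only an illustrative sketch, not LanceDB's implementation: the helper needs_refresh and the MODES table are hypothetical names, the 5-second value is an arbitrary example, and treating an elapsed time exactly equal to the interval as stale is an assumption.

```python
from datetime import timedelta
from typing import Optional


def needs_refresh(elapsed: timedelta, interval: Optional[timedelta]) -> bool:
    """Hypothetical model: should a read check for updates from other processes?"""
    if interval is None:
        # Default: consistency is never checked (fastest reads).
        return False
    # timedelta(0) makes every read check (strong consistency);
    # a positive interval checks only once it has elapsed (eventual consistency).
    return elapsed >= interval


# Example values for the read_consistency_interval argument of connect().
MODES = {
    "none": None,
    "strong": timedelta(0),
    "eventual": timedelta(seconds=5),  # assumed example value
}
```

Passing MODES["strong"] as read_consistency_interval to lancedb.connect would correspond to the "zero seconds" setting described in the parameter list.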
Source code in lancedb/__init__.py
lancedb.db.DBConnection
Bases: EnforceOverrides
An active LanceDB connection interface.
Source code in lancedb/db.py
table_names
abstractmethod
List all tables in this database, in sorted order
Parameters:
-
page_token
(Optional[str]
, default:None
) –The token to use for pagination. If not present, start from the beginning. Typically, this token is the last table name from the previous page. Only supported by LanceDB Cloud.
-
limit
(int
, default:10
) –The size of the page to return. Only supported by LanceDB Cloud.
Returns:
-
Iterable of str
–The table names, in sorted order.
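The page_token and limit parameters suggest a simple pagination loop. The following is a minimal sketch, assuming (as the parameter description says is typical) that the token for the next page is the last table name of the previous page. The helper name iter_all_table_names is hypothetical, and db can be any object exposing table_names(page_token=..., limit=...).

```python
from typing import Iterator, Optional


def iter_all_table_names(db, page_size: int = 10) -> Iterator[str]:
    """Yield every table name by walking table_names() one page at a time."""
    page_token: Optional[str] = None
    while True:
        page = list(db.table_names(page_token=page_token, limit=page_size))
        # Stop on an empty page, or if the backend ignored the token and
        # handed back the same page again (guards against an endless loop).
        if not page or page[-1] == page_token:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means there is nothing left
        page_token = page[-1]  # assumed token convention: last name seen
```

Since pagination is only supported by LanceDB Cloud, the guard against a repeated page is there for backends that may ignore the token.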
Source code in lancedb/db.py
create_table
abstractmethod
create_table(name: str, data: Optional[DATA] = None, schema: Optional[Union[Schema, LanceModel]] = None, mode: str = 'create', exist_ok: bool = False, on_bad_vectors: str = 'error', fill_value: float = 0.0, embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None, *, storage_options: Optional[Dict[str, str]] = None, data_storage_version: Optional[str] = None, enable_v2_manifest_paths: Optional[bool] = None) -> Table
Create a Table in the database.
Parameters:
-
name
(str
) –The name of the table.
-
data
(Optional[DATA]
, default:None
) –At least one of data or schema must be provided. Acceptable types are:
- list-of-dict
- pandas.DataFrame
- pyarrow.Table or pyarrow.RecordBatch
-
schema
(Optional[Union[Schema, LanceModel]]
, default:None
) –Acceptable types are:
- pyarrow.Schema
- LanceModel
-
mode
(str
, default:'create'
) –The mode to use when creating the table. Can be either "create" or "overwrite". By default, if the table already exists, an exception is raised. If you want to overwrite the table, use mode="overwrite".
-
exist_ok
(bool
, default:False
) –If a table by the same name already exists, then raise an exception if exist_ok=False. If exist_ok=True, then open the existing table; it will not add the provided data but will validate against any schema that's specified.
-
on_bad_vectors
(str
, default:'error'
) –What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".
-
fill_value
(float
, default:0.0
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
-
storage_options
(Optional[Dict[str, str]]
, default:None
) –Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/
-
data_storage_version
(Optional[str]
, default:None
) –The version of the data storage format to use. Newer versions are more efficient but require newer versions of lance to read. The default is "stable" which will use the legacy v2 version. See the user guide for more details.
-
enable_v2_manifest_paths
(Optional[bool]
, default:None
) –Use the new V2 manifest paths. These paths provide more efficient opening of datasets with many versions on object stores. WARNING: turning this on will make the dataset unreadable for older versions of LanceDB (prior to 0.13.0). To migrate an existing dataset, instead use the Table.migrate_manifest_paths_v2 method.
Returns:
-
LanceTable
–A reference to the newly created table.
Note: The vector index is not created by default. To create the index, call the create_index method on the table.
Examples:
A table can be created from a list of tuples or dictionaries:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
... {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
>>> db.create_table("my_table", data)
LanceTable(name='my_table', version=1, ...)
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
You can also pass a pandas DataFrame:
>>> import pandas as pd
>>> data = pd.DataFrame({
... "vector": [[1.1, 1.2], [0.2, 1.8]],
... "lat": [45.5, 40.1],
... "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(name='table2', version=1, ...)
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.
>>> import pyarrow as pa
>>> custom_schema = pa.schema([
... pa.field("vector", pa.list_(pa.float32(), 2)),
... pa.field("lat", pa.float32()),
... pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(name='table3', version=1, ...)
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
It is also possible to create a table from an Iterable[pa.RecordBatch]:
>>> import pyarrow as pa
>>> def make_batches():
... for i in range(5):
... yield pa.RecordBatch.from_arrays(
... [
... pa.array([[3.1, 4.1], [5.9, 26.5]],
... pa.list_(pa.float32(), 2)),
... pa.array(["foo", "bar"]),
... pa.array([10.0, 20.0]),
... ],
... ["vector", "item", "price"],
... )
>>> schema=pa.schema([
... pa.field("vector", pa.list_(pa.float32(), 2)),
... pa.field("item", pa.utf8()),
... pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(name='table4', version=1, ...)
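The on_bad_vectors and fill_value parameters described above can be pictured with a small pure-Python model. This is a sketch of the documented behavior, not LanceDB's code: the function name handle_bad_vectors and the explicit dim argument are hypothetical, and replacing the whole bad vector with fill_value in "fill" mode is an assumption.

```python
import math
from typing import List, Sequence


def handle_bad_vectors(
    vectors: Sequence[Sequence[float]],
    dim: int,
    on_bad_vectors: str = "error",
    fill_value: float = 0.0,
) -> List[List[float]]:
    """Illustrative model of the "error" / "drop" / "fill" policies."""
    cleaned = []
    for vec in vectors:
        # A vector is "bad" if it has the wrong size or contains a NaN.
        bad = len(vec) != dim or any(math.isnan(x) for x in vec)
        if not bad:
            cleaned.append(list(vec))
        elif on_bad_vectors == "error":
            raise ValueError(f"bad vector: {vec!r}")
        elif on_bad_vectors == "drop":
            continue  # skip the offending row
        elif on_bad_vectors == "fill":
            cleaned.append([fill_value] * dim)  # assumed: replace whole vector
        else:
            raise ValueError(f"unknown on_bad_vectors: {on_bad_vectors!r}")
    return cleaned
```

With the default "error" policy the first bad vector aborts the write, which is why "drop" or "fill" must be requested explicitly.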
Source code in lancedb/db.py
open_table
open_table(name: str, *, storage_options: Optional[Dict[str, str]] = None, index_cache_size: Optional[int] = None) -> Table
Open a Lance Table in the database.
Parameters:
-
name
(str
) –The name of the table.
-
index_cache_size
(Optional[int]
, default:None
) –The size of the index cache, specified as a number of entries. The exact meaning of an "entry" depends on the type of index:
- IVF: there is one entry for each IVF partition
- BTREE: there is one entry for the entire index
This cache applies to the entire opened table, across all indices. Setting this value higher will increase performance on larger datasets at the expense of more RAM.
-
storage_options
(Optional[Dict[str, str]]
, default:None
) –Additional options for the storage backend. Options already set on the connection will be inherited by the table, but can be overridden here. See available options at https://lancedb.github.io/lancedb/guides/storage/
Returns:
-
LanceTable
–A LanceTable object representing the table.
Source code in lancedb/db.py
drop_table
rename_table
Rename a table in the database.
Parameters:
-
cur_name
(str
) –The current name of the table.
-
new_name
(str
) –The new name of the table.
Tables (Synchronous)
lancedb.table.Table
Bases: ABC
A Table is a collection of Records in a LanceDB Database.
Examples:
Create using DBConnection.create_table (more examples in that method's documentation).
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]
You can append new data with Table.add():
>>> table.add([{"vector": [0.5, 1.3], "b": 4}])
You can query the table with Table.search:
>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
b vector _distance
0 4 [0.5, 1.3] 0.82
1 2 [1.1, 1.2] 1.13
Search queries are much faster when an index is created. See Table.create_index.
Source code in lancedb/table.py
embedding_functions
abstractmethod
property
embedding_functions: Dict[str, EmbeddingFunctionConfig]
Get a mapping from vector column name to its configured embedding function.
count_rows
abstractmethod
Count the number of rows in the table.
Parameters:
-
filter
(Optional[str]
, default:None
) –A SQL where clause to filter the rows to count.
Source code in lancedb/table.py
to_pandas
to_arrow
abstractmethod
to_arrow() -> Table
create_index
create_index(metric='L2', num_partitions=256, num_sub_vectors=96, vector_column_name: str = VECTOR_COLUMN_NAME, replace: bool = True, accelerator: Optional[str] = None, index_cache_size: Optional[int] = None, *, index_type: Literal['IVF_FLAT', 'IVF_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_PQ'] = 'IVF_PQ', num_bits: int = 8, max_iterations: int = 50, sample_rate: int = 256, m: int = 20, ef_construction: int = 300)
Create an index on the table.
Parameters:
-
metric
–The distance metric to use when creating the index. Valid values are "L2", "cosine", "dot", or "hamming". L2 is euclidean distance. Hamming is available only for binary vectors.
-
num_partitions
–The number of IVF partitions to use when creating the index. Default is 256.
-
num_sub_vectors
–The number of PQ sub-vectors to use when creating the index. Default is 96.
-
vector_column_name
(str
, default:VECTOR_COLUMN_NAME
) –The name of the vector column to index.
-
replace
(bool
, default:True
) –If True, replace the existing index if it exists. If False, raise an error if a duplicate index exists.
-
accelerator
(Optional[str]
, default:None
) –If set, use the given accelerator to create the index. Only "cuda" is supported for now.
-
index_cache_size
(int
, default:None
) –The size of the index cache in number of entries. Default value is 256.
-
num_bits
(int
, default:8
) –The number of bits to encode sub-vectors. Only used with the IVF_PQ index. Only 4 and 8 are supported.
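Product quantization splits each vector into num_sub_vectors equal pieces, so the vector dimension generally has to be divisible by num_sub_vectors. Assuming that constraint applies to your index, the sketch below picks the largest divisor of the dimension that does not exceed the documented default of 96. The helper name largest_divisor_at_most is hypothetical and this heuristic is not part of LanceDB.

```python
def largest_divisor_at_most(dim: int, limit: int = 96) -> int:
    """Return the largest divisor of `dim` that is <= `limit`."""
    if dim < 1 or limit < 1:
        raise ValueError("dim and limit must be positive")
    for candidate in range(min(dim, limit), 0, -1):
        if dim % candidate == 0:
            return candidate
    return 1  # unreachable for positive inputs, since 1 divides everything


# Candidate num_sub_vectors values for a few common embedding sizes.
for dim in (128, 768, 1536):
    print(dim, largest_divisor_at_most(dim))
```

The result could then be passed as num_sub_vectors to create_index. Whether a smaller or larger value gives better recall for a given dataset is a separate tuning question.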
Source code in lancedb/table.py
create_scalar_index
abstractmethod
create_scalar_index(column: str, *, replace: bool = True, index_type: Literal['BTREE', 'BITMAP', 'LABEL_LIST'] = 'BTREE')
Create a scalar index on a column.
Parameters:
-
column
(str
) –The column to be indexed. Must be a boolean, integer, float, or string column.
-
replace
(bool
, default:True
) –Replace the existing index if it exists.
-
index_type
(Literal['BTREE', 'BITMAP', 'LABEL_LIST']
, default:'BTREE'
) –The type of index to create.
Examples:
Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:
>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> my_df = img_table.search().where("my_col = 7",
... prefilter=True).to_pandas()
Scalar indices can also speed up scans containing a vector search and a prefilter:
>>> import lancedb
>>> db = lancedb.connect("/data/lance")
>>> img_table = db.open_table("images")
>>> (img_table.search([1, 2, 3, 4], vector_column_name="vector")
... .where("my_col != 7", prefilter=True)
... .to_pandas())
Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2)).
Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col > 100).
Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index, then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.
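The index-friendly filter shapes listed above (range and AND-combined comparisons) are plain SQL strings, so they can be assembled with small helpers. This is a minimal sketch. The helper names between and all_of are hypothetical, and the bounds are assumed to be numeric so that no quoting is needed.

```python
def between(column: str, low: float, high: float) -> str:
    """Build a range filter such as 'my_col BETWEEN 0 AND 100'."""
    return f"{column} BETWEEN {low} AND {high}"


def all_of(*clauses: str) -> str:
    """AND several filter clauses together, parenthesizing each one."""
    if not clauses:
        raise ValueError("at least one clause is required")
    return " AND ".join(f"({clause})" for clause in clauses)


# A filter that touches two (assumed indexed) columns.
expr = all_of(between("my_col", 0, 100), "other_col > 100")
print(expr)
```

The resulting string can be passed to where(expr, prefilter=True) exactly like the literal filters in the examples above.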
Source code in lancedb/table.py
create_fts_index
create_fts_index(field_names: Union[str, List[str]], *, ordering_field_names: Optional[Union[str, List[str]]] = None, replace: bool = False, writer_heap_size: Optional[int] = 1024 * 1024 * 1024, use_tantivy: bool = True, tokenizer_name: Optional[str] = None, with_position: bool = True, base_tokenizer: Literal['simple', 'raw', 'whitespace'] = 'simple', language: str = 'English', max_token_length: Optional[int] = 40, lower_case: bool = True, stem: bool = False, remove_stop_words: bool = False, ascii_folding: bool = False)
Create a full-text search index on the table.
Warning: this API is highly experimental and is likely to change in the future.
Parameters:
-
field_names
(Union[str, List[str]]
) –The name(s) of the field(s) to index. For now, only a single str is accepted when use_tantivy=True.
-
replace
(bool
, default:False
) –If True, replace the existing index if it exists. Note that this is not yet an atomic operation; the index will be temporarily unavailable while the new index is being created.
-
writer_heap_size
(Optional[int]
, default:1024 * 1024 * 1024
) –Only available with use_tantivy=True.
-
ordering_field_names
(Optional[Union[str, List[str]]]
, default:None
) –A list of unsigned type fields to index to optionally order results on at search time. Only available with use_tantivy=True.
-
tokenizer_name
(Optional[str]
, default:None
) –The tokenizer to use for the index. Can be "raw", "default", or the two-letter language code followed by "_stem". For example, for English it would be "en_stem". For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
-
use_tantivy
(bool
, default:True
) –If True, use the legacy full-text search implementation based on tantivy. If False, use the new full-text search implementation based on lance-index.
-
with_position
(bool
, default:True
) –Only available with use_tantivy=False. If False, do not store the positions of the terms in the text. This can reduce the size of the index and improve indexing speed, but phrase queries will raise an exception.
-
base_tokenizer
(str
, default:"simple"
) –The base tokenizer to use for tokenization. Options are:
- "simple": Splits text by whitespace and punctuation.
- "whitespace": Splits text by whitespace, but not punctuation.
- "raw": No tokenization. The entire text is treated as a single token.
-
language
(str
, default:"English"
) –The language to use for tokenization.
-
max_token_length
(int
, default:40
) –The maximum token length to index. Tokens longer than this length will be ignored.
-
lower_case
(bool
, default:True
) –Whether to convert the token to lower case. This makes queries case-insensitive.
-
stem
(bool
, default:False
) –Whether to stem the token. Stemming reduces words to their root form. For example, in English "running" and "runs" would both be reduced to "run".
-
remove_stop_words
(bool
, default:False
) –Whether to remove stop words. Stop words are common words that are often removed from text before indexing. For example, in English "the" and "and".
-
ascii_folding
(bool
, default:False
) –Whether to fold ASCII characters. This converts accented characters to their ASCII equivalent. For example, "café" would be converted to "cafe".
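To make the tokenizer options above concrete, here is a toy model of how they might combine. It is not the tokenizer used by LanceDB: the function tokenize, the two-word STOP_WORDS set, and the order in which the steps are applied are all assumptions, and stemming is left out.

```python
import re
import unicodedata
from typing import List, Optional

STOP_WORDS = {"the", "and"}  # assumed tiny stop list, for illustration only


def tokenize(
    text: str,
    base_tokenizer: str = "simple",
    lower_case: bool = True,
    max_token_length: Optional[int] = 40,
    remove_stop_words: bool = False,
    ascii_folding: bool = False,
) -> List[str]:
    """Toy model of the documented tokenizer options (no stemming)."""
    if base_tokenizer == "simple":
        tokens = re.findall(r"\w+", text)  # split on whitespace and punctuation
    elif base_tokenizer == "whitespace":
        tokens = text.split()  # punctuation stays attached
    elif base_tokenizer == "raw":
        tokens = [text]  # the whole text is one token
    else:
        raise ValueError(f"unknown base_tokenizer: {base_tokenizer!r}")
    if max_token_length is not None:
        tokens = [t for t in tokens if len(t) <= max_token_length]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    if ascii_folding:
        # Decompose accented characters, then drop the non-ASCII marks.
        tokens = [
            unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
            for t in tokens
        ]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

For instance, with ascii_folding and remove_stop_words enabled, this model reduces "The café and the Bar!" to the tokens cafe and bar.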
Source code in lancedb/table.py
add
abstractmethod
Add more data to the Table.
Parameters:
-
data
(DATA
) –The data to insert into the table. Acceptable types are:
-
list-of-dict
-
pandas.DataFrame
-
pyarrow.Table or pyarrow.RecordBatch
-
-
mode
(str
, default:'append'
) –The mode to use when writing the data. Valid values are "append" and "overwrite".
-
on_bad_vectors
(str
, default:'error'
) –What to do if any of the vectors are not the same size or contain NaNs. One of "error", "drop", "fill".
-
fill_value
(float
, default:0.0
) –The value to use when filling vectors. Only used if on_bad_vectors="fill".
Source code in lancedb/table.py
merge_insert
merge_insert(on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder
Returns a LanceMergeInsertBuilder that can be used to create a "merge insert" operation.
This operation can add rows, update rows, and remove rows all in a single transaction. It is a very generic tool that can be used to create behaviors like "insert if not exists", "update or insert (i.e. upsert)", or even replace a portion of existing data with new data (e.g. replace all data where month="january").
The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records:
- "Matched" records exist in both the source table and the target table.
- "Not matched" records exist only in the source table (i.e. these are new data).
- "Not matched by source" records exist only in the target table (i.e. this is old data).
The builder returned by this method can be used to customize what should happen for each category of data.
Please note that the data may appear to be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values.
Parameters:
-
on
(Union[str, Iterable[str]]
) –A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.
Examples:
>>> import lancedb
>>> import pyarrow as pa
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation
>>> table.merge_insert("a") \
... .when_matched_update_all() \
... .when_not_matched_insert_all() \
... .execute(new_data)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
a b
0 1 b
1 2 x
2 3 y
3 4 z
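The three record categories can be illustrated with a pure-Python model of the join. This is a sketch of the semantics described above, not the LanceDB implementation. The function merge_insert_model and its keyword flags are hypothetical, and rows are modeled as plain dicts.

```python
from typing import Dict, List


def merge_insert_model(
    target: List[Dict],
    source: List[Dict],
    on: str,
    update_matched: bool = True,
    insert_not_matched: bool = True,
    delete_not_matched_by_source: bool = False,
) -> List[Dict]:
    """Toy model: classify rows by key and apply the chosen actions."""
    source_by_key = {row[on]: row for row in source}
    target_keys = {row[on] for row in target}
    kept, updated = [], []
    for row in target:
        match = source_by_key.get(row[on])
        if match is None:
            # "Not matched by source": exists only in the target table.
            if not delete_not_matched_by_source:
                kept.append(row)
        elif update_matched:
            # "Matched": replaced by the source row and moved to the end.
            updated.append(dict(match))
        else:
            kept.append(row)
    inserted = []
    if insert_not_matched:
        # "Not matched": exists only in the source table.
        inserted = [dict(r) for r in source if r[on] not in target_keys]
    return kept + updated + inserted


target = [{"a": 2, "b": "a"}, {"a": 1, "b": "b"}, {"a": 3, "b": "c"}]
source = [{"a": 2, "b": "x"}, {"a": 3, "b": "y"}, {"a": 4, "b": "z"}]
result = sorted(merge_insert_model(target, source, on="a"), key=lambda r: r["a"])
print(result)
```

With the default flags the model performs an upsert and, once sorted by a, matches the table shown in the example above.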
Source code in lancedb/table.py
search
abstractmethod
search(query: Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]] = None, vector_column_name: Optional[str] = None, query_type: QueryType = 'auto', ordering_field_name: Optional[str] = None, fts_columns: Optional[Union[str, List[str]]] = None) -> LanceQueryBuilder
Create a search query to find the nearest neighbors of the given query vector. We currently support vector search and full-text search (experimental).
All query options are defined in Query.
Examples:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
... {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
... {"original_width": 2000, "caption": "foo", "vector": [0.5, 3.4, 1.3]},
... {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
... .where("original_width > 1000", prefilter=True)
... .select(["caption", "original_width", "vector"])
... .limit(2)
... .to_pandas())
caption original_width vector _distance
0 foo 2000 [0.5, 3.4, 1.3] 5.220000
1 test 3000 [0.3, 6.2, 2.6] 23.089996
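The _distance values in the output above are consistent with the squared Euclidean distance between the query and each returned vector. That reading is an inference from the numbers shown, not a statement taken from the library. The helper squared_l2 below is a hypothetical name used only to redo the arithmetic.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared component-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [0.4, 1.4, 2.4]
dist_foo = squared_l2(query, [0.5, 3.4, 1.3])   # 0.01 + 4.00 + 1.21
dist_test = squared_l2(query, [0.3, 6.2, 2.6])  # 0.01 + 23.04 + 0.04
print(round(dist_foo, 2), round(dist_test, 2))
```

The small difference between 23.09 and the 23.089996 shown above would be expected if the stored vectors use 32-bit floats.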
Parameters:
-
query
(Optional[Union[VEC, str, 'PIL.Image.Image', Tuple]]
, default:None
) –The targeted vector to search for. Acceptable types are: list, np.ndarray, PIL.Image.Image. If None, then the select/where/limit clauses are applied to filter the table.
-
vector_column_name
(Optional[str]
, default:None
) –The name of the vector column to search. The vector column needs to be a pyarrow fixed size list type.
- If not specified, then the vector column is inferred from the table schema.
- If the table has multiple vector columns, then vector_column_name needs to be specified. Otherwise, an error is raised.
-
query_type
(QueryType
, default:'auto'
) –Acceptable types are: "vector", "fts", "hybrid", or "auto". If "auto", then the query type is inferred from the query:
- If query is a list/np.ndarray, then the query type is "vector".
- If query is a PIL.Image.Image, then either do vector search, or raise an error if no corresponding embedding function is found.
- If query is a string, then the query type is "vector" if the table has embedding functions; otherwise the query type is "fts".
Returns:
-
LanceQueryBuilder
–A query builder object representing the query. Once executed, the query returns:
- the selected columns
- the vector
- the "_distance" column, which is the distance between the query vector and the returned vector
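The "auto" inference rules for query_type can be written down as a small decision function. This is a sketch of the documented rules only. The name infer_query_type is hypothetical, np.ndarray and PIL.Image.Image inputs are left out to keep the example dependency-free, and a None query is not modeled.

```python
from typing import Sequence, Union


def infer_query_type(
    query: Union[Sequence[float], str],
    has_embedding_functions: bool,
) -> str:
    """Model of query_type="auto" for list and string queries."""
    if isinstance(query, str):
        # Strings are embedded when the table can embed them; otherwise
        # they are treated as a full-text search query.
        return "vector" if has_embedding_functions else "fts"
    if isinstance(query, (list, tuple)):
        return "vector"
    raise TypeError(f"unsupported query type: {type(query).__name__}")


print(infer_query_type([0.4, 1.4, 2.4], has_embedding_functions=False))
print(infer_query_type("a photo of a cat", has_embedding_functions=False))
```

Passing query_type explicitly avoids relying on this inference when a string query could plausibly mean either kind of search.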
Source code in lancedb/table.py
delete
abstractmethod
Delete rows from the table.
This can be used to delete a single row, many rows, all rows, or sometimes no rows (if your predicate matches nothing).
Parameters:
-
where
(str
) –The SQL where clause to use when deleting rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
Examples:
>>> import lancedb
>>> data = [
... {"x": 1, "vector": [1.0, 2]},
... {"x": 2, "vector": [3.0, 4]},
... {"x": 3, "vector": [5.0, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 2 [3.0, 4.0]
2 3 [5.0, 6.0]
>>> table.delete("x = 2")
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 3 [5.0, 6.0]
If you have a list of values to delete, you can combine them into a stringified list and use the IN operator:
>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
>>> table.to_pandas()
x vector
0 3 [5.0, 6.0]
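The stringified-list approach above works for numbers, but string values also need quoting. The sketch below builds an IN clause for both cases. The helper names sql_literal and in_clause are hypothetical, and escaping a single quote by doubling it is assumed to match the SQL dialect in use.

```python
from typing import Iterable, Union

Scalar = Union[bool, int, float, str]


def sql_literal(value: Scalar) -> str:
    """Render a Python scalar as a SQL literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # assumed escaping rule
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def in_clause(column: str, values: Iterable[Scalar]) -> str:
    """Build 'column IN (v1, v2, ...)' and reject an empty list."""
    rendered = [sql_literal(v) for v in values]
    if not rendered:
        raise ValueError("IN list must not be empty")
    return f"{column} IN ({', '.join(rendered)})"


print(in_clause("x", [1, 5]))
```

The returned string can be passed directly to table.delete. Rejecting an empty list up front mirrors the rule that the filter must not be empty.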
Source code in lancedb/table.py
update
abstractmethod
update(where: Optional[str] = None, values: Optional[dict] = None, *, values_sql: Optional[Dict[str, str]] = None)
This can be used to update zero to all rows, depending on how many rows match the where clause. If no where clause is provided, then all rows will be updated.
Either values or values_sql must be provided. You cannot provide both.
Parameters:
-
where
(Optional[str]
, default:None
) –The SQL where clause to use when updating rows. For example, 'x = 2' or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
-
values
(Optional[dict]
, default:None
) –The values to update. The keys are the column names and the values are the values to set.
-
values_sql
(Optional[Dict[str, str]]
, default:None
) –The values to update, expressed as SQL expression strings. These can reference existing columns. For example, {"x": "x + 1"} will increment the x column by 1.
Examples:
>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1.0, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 2 [3.0, 4.0]
2 3 [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10.0, 10]})
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 3 [5.0, 6.0]
2 2 [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
>>> table.to_pandas()
x vector
0 2 [1.0, 2.0]
1 4 [5.0, 6.0]
2 3 [10.0, 10.0]
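Two details of update can be modeled in plain Python: the rule that exactly one of values or values_sql is given, and the way the updated row ends up last in the example output above. This is an illustrative sketch. The names check_update_args and apply_update are hypothetical, the predicate is a Python callable rather than a SQL string, and the row ordering is only what the example shows, not a documented guarantee.

```python
from typing import Callable, Dict, List, Optional


def check_update_args(
    values: Optional[dict] = None,
    values_sql: Optional[Dict[str, str]] = None,
) -> None:
    """Raise unless exactly one of the two arguments is provided."""
    if (values is None) == (values_sql is None):
        raise ValueError("provide exactly one of values or values_sql")


def apply_update(
    rows: List[dict],
    matches: Callable[[dict], bool],
    values: dict,
) -> List[dict]:
    """Toy model: rewrite matching rows and place them after the others."""
    untouched = [row for row in rows if not matches(row)]
    updated = [{**row, **values} for row in rows if matches(row)]
    return untouched + updated


rows = [
    {"x": 1, "vector": [1.0, 2.0]},
    {"x": 2, "vector": [3.0, 4.0]},
    {"x": 3, "vector": [5.0, 6.0]},
]
out = apply_update(rows, lambda r: r["x"] == 2, {"vector": [10.0, 10.0]})
print([r["x"] for r in out])
```

If row order matters to a caller, sorting the result explicitly is safer than relying on this behavior.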
Source code in lancedb/table.py
cleanup_old_versions
abstractmethod
cleanup_old_versions(older_than: Optional[timedelta] = None, *, delete_unverified: bool = False) -> CleanupStats
Clean up old versions of the table, freeing disk space.
Parameters:
-
older_than
(Optional[timedelta]
, default:None
) –The minimum age of the version to delete. If None, then this defaults to two weeks.
-
delete_unverified
(bool
, default:False
) –Because they may be part of an in-progress transaction, files newer than 7 days old are not deleted by default. If you are sure that there are no in-progress transactions, then you can set this to True to delete all files older than older_than.
Returns:
-
CleanupStats
–The stats of the cleanup operation, including how many bytes were freed.
See Also
Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.
Notes
This function is not available in LanceDB Cloud (since LanceDB Cloud manages cleanup for you automatically).
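A minimal sketch of calling cleanup_old_versions with an explicit retention window. The helper name prune_versions and the 30-day default are illustrative assumptions, not part of LanceDB; table is assumed to be an open LanceDB OSS table exposing the method documented above.

```python
from datetime import timedelta


def prune_versions(table, keep_days: int = 30):
    """Delete table versions older than `keep_days` days (hypothetical helper).

    `table` is assumed to expose cleanup_old_versions as documented above.
    delete_unverified stays False so that recent files which might belong
    to an in-progress transaction are left alone.
    """
    return table.cleanup_old_versions(
        older_than=timedelta(days=keep_days),
        delete_unverified=False,
    )
```

The return value is whatever cleanup_old_versions returns (a CleanupStats object according to the signature above), so the caller can inspect how much space was freed.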
Source code in lancedb/table.py
compact_files
abstractmethod
Run the compaction process on the table. This can be run after making several small appends to optimize the table for faster reads.
Arguments are passed onto Lance's compact_files. For most cases, the default should be fine.
See Also
Table.optimize: A more comprehensive optimization operation that includes cleanup as well as other operations.
Notes
This function is not available in LanceDB Cloud (since LanceDB Cloud manages compaction for you automatically)
Source code in lancedb/table.py
optimize
abstractmethod
Optimize the on-disk data and indices for better performance.
Modeled after VACUUM in PostgreSQL.
Optimization covers three operations:
- Compaction: Merges small files into larger ones
- Prune: Removes old versions of the dataset
- Index: Optimizes the indices, adding new data to existing indices
Parameters:
-
cleanup_older_than
(Optional[timedelta]
, default:None
) –All files belonging to versions older than this will be removed. Set to 0 days to remove all versions except the latest. The latest version is never removed.
-
delete_unverified
(bool
, default:False
) –Files leftover from a failed transaction may appear to be part of an in-progress operation (e.g. appending new data) and these files will not be deleted unless they are at least 7 days old. If delete_unverified is True then these files will be deleted regardless of their age.
Experimental API
The optimization process is undergoing active development and may change. Our goal with these changes is to improve the performance of optimization and reduce the complexity.
That being said, it is essential today to run optimize if you want the best performance. It should be stable and safe to use in production, but it is our hope that the API may be simplified (or not even need to be called) in the future.
How often an application should call optimize depends on how often its data is modified. If data is frequently added, deleted, or updated, then optimize should be run frequently. A good rule of thumb is to run optimize if you have added or modified 100,000 or more records or run more than 20 data modification operations.
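The rule of thumb above can be sketched as a small counter that decides when to call optimize. The class name OptimizeScheduler and its method are hypothetical and not part of LanceDB; the only LanceDB call assumed is table.optimize() with its default arguments.

```python
class OptimizeScheduler:
    """Counts modifications and calls table.optimize() per the rule of thumb.

    Hypothetical helper: optimize once 100,000 or more records have been
    modified, or once more than 20 modification operations have run.
    """

    def __init__(self, table, max_records: int = 100_000, max_operations: int = 20):
        self.table = table
        self.max_records = max_records
        self.max_operations = max_operations
        self.records = 0
        self.operations = 0

    def record_modification(self, num_records: int) -> bool:
        """Register one data modification; return True if optimize() ran."""
        self.records += num_records
        self.operations += 1
        if self.records >= self.max_records or self.operations > self.max_operations:
            self.table.optimize()
            # Start counting again from the freshly optimized state.
            self.records = 0
            self.operations = 0
            return True
        return False
```

An application would call record_modification after each add, update, or delete, passing the number of rows affected. The thresholds are constructor arguments because the right values depend on the workload.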
Source code in lancedb/table.py
list_indices
abstractmethod
index_stats
abstractmethod
Retrieve statistics about an index
Parameters:
-
index_name
(str
) –The name of the index to retrieve statistics for
Returns:
-
IndexStatistics or None
–The statistics about the index. Returns None if the index does not exist.
Source code in lancedb/table.py
add_columns
abstractmethod
Add new columns with defined values.
Parameters:
-
transforms
(Dict[str, str]
) –A map of column name to a SQL expression to use to calculate the value of the new column. These expressions will be evaluated for each row in the table, and can reference existing columns.
Source code in lancedb/table.py
alter_columns
abstractmethod
Alter column names, data types, and nullability.
Parameters:
-
alterations
(Iterable[Dict[str, Any]]
, default:()
) –A sequence of dictionaries, each with the following keys:
- "path": str. The column path to alter. For a top-level column, this is the name. For a nested column, this is the dot-separated path, e.g. "a.b.c".
- "rename": str, optional. The new name of the column. If not specified, the column name is not changed.
- "data_type": pyarrow.DataType, optional. The new data type of the column. Existing values will be cast to this type. If not specified, the column data type is not changed.
- "nullable": bool, optional. Whether the column should be nullable. If not specified, the column nullability is not changed. Only non-nullable columns can be changed to nullable. Currently, you cannot change a nullable column to non-nullable.
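As an illustration of the dictionary shape described above, the sketch below builds two alterations and checks their keys before they are handed to alter_columns. The column names and the check_alterations helper are hypothetical; how the dictionaries are passed (shown in the final comment) is an assumption based on the default of () listed for the parameter.

```python
ALLOWED_KEYS = {"path", "rename", "data_type", "nullable"}


def check_alterations(alterations):
    """Validate alteration dicts against the documented keys (hypothetical helper)."""
    for alteration in alterations:
        if "path" not in alteration:
            raise ValueError(f"missing 'path' in {alteration!r}")
        unknown = set(alteration) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"unknown keys {sorted(unknown)} in {alteration!r}")
    return alterations


alterations = check_alterations([
    # Rename a top-level column.
    {"path": "id", "rename": "user_id"},
    # Make a nested column nullable, addressed by its dot-separated path.
    {"path": "profile.age", "nullable": True},
])

# Assumed call shape, not verified here: table.alter_columns(*alterations)
```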
Source code in lancedb/table.py
drop_columns
abstractmethod
Drop columns from the table.
Parameters:
-
columns
(Iterable[str]
) –The names of the columns to drop.
checkout
abstractmethod
Checks out a specific version of the Table
Any read operation on the table will now access the data at the checked out version. As a consequence, calling this method will disable any read consistency interval that was previously set.
This is a read-only operation that turns the table into a sort of "view" or "detached head". Other table instances will not be affected. To make the change permanent you can use the Table.restore method.
Any operation that modifies the table will fail while the table is in a checked out state.
To return the table to a normal state use Table.checkout_latest.
Source code in lancedb/table.py
checkout_latest
abstractmethod
Ensures the table is pointing at the latest version
This can be used to manually update a table when the read_consistency_interval is None.
It can also be used to undo a Table.checkout operation.
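Because a checked-out table rejects writes until checkout_latest is called, it can help to pair the two calls. The sketch below wraps them in a context manager; the name at_version is hypothetical, and passing an integer version number to checkout is an assumption since the signature is not shown here.

```python
from contextlib import contextmanager


@contextmanager
def at_version(table, version: int):
    """Temporarily view `table` at `version`, then return to the latest.

    Hypothetical helper: relies only on table.checkout(...) and
    table.checkout_latest() as described above.
    """
    table.checkout(version)
    try:
        yield table
    finally:
        # Always leave the table writable again, even if the body raised.
        table.checkout_latest()
```

Inside a `with at_version(table, 3) as old:` block, reads on `old` see the data at that version. Once the block exits, the table points at the latest version again.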
Source code in lancedb/table.py
list_versions
abstractmethod
uses_v2_manifest_paths
abstractmethod
Check if the table is using the new v2 manifest paths.
Returns:
-
bool
–True if the table is using the new v2 manifest paths, False otherwise.
migrate_v2_manifest_paths
abstractmethod
Migrate the manifest paths to the new format.
This will update the manifest to use the new v2 format for paths.
This function is idempotent, and can be run multiple times without changing the state of the object store.
Danger
This should not be run while other concurrent operations are happening. And it should also run until completion before resuming other operations.
You can use Table.uses_v2_manifest_paths to check if the table is already using the new path style.
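A minimal sketch of the check-then-migrate pattern described above. The helper name ensure_v2_manifest_paths is hypothetical; it assumes both methods are called without arguments and that the caller has already stopped all other operations on the table.

```python
def ensure_v2_manifest_paths(table) -> bool:
    """Migrate `table` to v2 manifest paths if needed (hypothetical helper).

    Returns True if a migration was performed, False if the table was
    already using the new path style. The caller is responsible for making
    sure no concurrent operations are running.
    """
    if table.uses_v2_manifest_paths():
        return False
    table.migrate_v2_manifest_paths()
    return True
```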
Source code in lancedb/table.py
Querying (Synchronous)
lancedb.query.Query
Bases: BaseModel
The LanceDB Query
Attributes:
-
vector
(List[float]
) –the vector to search for
-
filter
(Optional[str]
) –sql filter to refine the query with, optional
-
prefilter
(bool
) –if True then apply the filter before vector search
-
k
(int
) –top k results to return
-
metric
(str
) –The distance metric between a pair of vectors. Supports L2 (default), Cosine, and Dot. See metric definitions.
-
columns
(Optional[List[str]]
) –which columns to return in the results
-
nprobes
(int
) –The number of probes used (optional).
- A higher number makes search more accurate but also slower.
- See discussion in Querying an ANN Index for tuning advice.
-
refine_factor
(Optional[int]
) –Refine the results by reading extra elements and re-ranking them in memory.
- A higher number makes search more accurate but also slower.
- See discussion in Querying an ANN Index for tuning advice.
-
offset
(int
) –The offset to start fetching results from
-
fast_search
(bool
) –Skip a flat search of unindexed data. This will improve search performance but search results will not include unindexed data. Defaults to False.
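To show how prefilter interacts with k, here is a toy brute-force search in plain Python. It is a conceptual model only, not LanceDB's implementation: the search function, the row layout, and the is_even predicate are all made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, List[float]]  # (id, vector)


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(rows: List[Row], query: List[float], k: int,
           keep: Callable[[Row], bool], prefilter: bool) -> List[int]:
    """Return ids of the nearest rows, filtering before or after the top-k cut."""
    if prefilter:
        # Filter first, then take the k nearest of the survivors.
        candidates = [row for row in rows if keep(row)]
        ranked = sorted(candidates, key=lambda row: l2(row[1], query))
        return [row[0] for row in ranked[:k]]
    # Take the k nearest first, then drop those that fail the filter.
    ranked = sorted(rows, key=lambda row: l2(row[1], query))
    return [row[0] for row in ranked[:k] if keep(row)]


rows = [(1, [0.0, 0.0]), (2, [1.0, 0.0]), (3, [2.0, 0.0]), (4, [3.0, 0.0])]


def is_even(row: Row) -> bool:
    return row[0] % 2 == 0


print(search(rows, [0.0, 0.0], k=2, keep=is_even, prefilter=True))   # [2, 4]
print(search(rows, [0.0, 0.0], k=2, keep=is_even, prefilter=False))  # [2]
```

With prefiltering, the search returns k matching rows whenever enough exist. With postfiltering, rows that fail the filter still occupy slots in the top k, so fewer than k results can come back.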
Source code in lancedb/query.py
lancedb.query.LanceQueryBuilder
Bases: ABC
An abstract query builder. Subclasses are defined for vector search, full text search, hybrid, and plain SQL filtering.
Source code in lancedb/query.py