Lance Dataset

The core of Lance is the LanceDataset class. Users can open a dataset with lance.dataset().
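As a minimal sketch of typical calls, the helper below opens the same dataset three ways: the latest version, a pinned version number, and the latest version as of a point in time. The function name open_examples is hypothetical, and the sketch assumes a dataset already exists at the given URI, still has version 1, and was created more than a day ago.

```python
from datetime import datetime, timedelta


def open_examples(uri):
    """Open one dataset as latest, as a pinned version, and as of a timestamp."""
    # Imported here so the helper can be defined without Lance installed;
    # calling it requires the `pylance` package.
    import lance

    latest = lance.dataset(uri)              # no version given: latest version
    pinned = lance.dataset(uri, version=1)   # explicit version number (assumed to exist)

    # Latest version created on or before this time (assumes one exists).
    yesterday = datetime.now() - timedelta(days=1)
    as_of = lance.dataset(uri, asof=yesterday)

    return latest, pinned, as_of
```

Calling open_examples("/tmp/data.lance") would return three LanceDataset objects. A tag name (str) can be passed as version in place of the integer.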

lance.dataset(uri: str | Path | None = None, version: int | str | None = None, asof: ts_types | None = None, block_size: int | None = None, commit_lock: CommitLock | None = None, index_cache_size: int | None = None, storage_options: Dict[str, str] | None = None, default_scan_options: Dict[str, str] | None = None, metadata_cache_size_bytes: int | None = None, index_cache_size_bytes: int | None = None, read_params: Dict[str, any] | None = None, session: Session | None = None, namespace: any | None = None, table_id: list | None = None, ignore_namespace_table_storage_options: bool = False, s3_credentials_refresh_offset_seconds: int | None = None) → LanceDataset

Opens the Lance dataset from the address specified.

Parameters:
  • uri (str, optional) – Address of the Lance dataset. It can be a local file path, e.g., /tmp/data.lance, or a cloud object store URI, e.g., s3://bucket/data.lance. Either uri or (namespace + table_id) must be provided, but not both.

  • version (optional, int | str) – If specified, loads that version of the Lance dataset. Otherwise, loads the latest version. A version number (int) or a tag (str) can be provided.

  • asof (optional, datetime or str) – If specified, finds the latest version created at or before the given time. If version is also specified, this argument is ignored.

  • block_size (optional, int) – Block size in bytes. Provides a hint for the size of the minimal I/O request.

  • commit_lock (optional, lance.commit.CommitLock) – A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.

  • index_cache_size (optional, int) –

    Index cache size. The index cache is an LRU cache with TTL. This number specifies how many index pages, for example IVF partitions, are cached in host memory. Default value is 256.

    Roughly, for an IVF_PQ partition with n rows, the size of each index page is the combined size of the PQ codes (a uint8 array of shape [n, pq]) and the row ids (a uint64 array of shape [n]). Here n is approximately the total number of rows divided by the number of IVF partitions, and pq is the number of PQ sub-vectors.

  • storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

  • default_scan_options (optional, dict) –

    Default scan options that are used when scanning the dataset. This accepts the same arguments described in lance.LanceDataset.scanner(). The arguments will be applied to any scan operation.

    This can be useful to supply defaults for common parameters such as batch_size.

    It can also be used to create a view of the dataset that includes meta fields such as _rowid or _rowaddr. If default_scan_options is provided then the schema returned by lance.LanceDataset.schema() will include these fields if the appropriate scan options are set.

  • metadata_cache_size_bytes (optional, int) – Size of the metadata cache in bytes. This cache is used to store metadata information about the dataset, such as schema and statistics. If not specified, a default size will be used.

  • read_params (optional, dict) –

    Dictionary of read parameters. Currently supports:

    • cache_repetition_index (bool): Whether to cache repetition indices for large string/binary columns

    • validate_on_decode (bool): Whether to validate data during decoding

  • session (optional, lance.Session) – A session to use for this dataset. It contains the caches, which can be shared across multiple datasets.

  • namespace (optional) – A namespace instance from which to fetch table location and storage options. This can be any object with a describe_table(table_id, version) method that returns a dict with 'location' and 'storage_options' keys. For example, use lance_namespace.connect() from the lance_namespace package. Must be provided together with table_id. Cannot be used with uri. When provided, the table location will be fetched automatically from the namespace via describe_table().

  • table_id (optional, list of str) – The table identifier when using a namespace (e.g., ["my_table"]). Must be provided together with namespace. Cannot be used with uri.

  • ignore_namespace_table_storage_options (bool, default False) – Only applicable when using namespace and table_id. If True, storage options returned from the namespace’s describe_table() will be ignored (treated as None). If False (default), storage options from describe_table() will be used and a dynamic storage options provider will be created to automatically refresh credentials before they expire.

  • s3_credentials_refresh_offset_seconds (optional, int) – The number of seconds before credential expiration to trigger a refresh. Default is 60 seconds. Only applicable when using AWS S3 with temporary credentials. For example, if set to 60, credentials will be refreshed when they have less than 60 seconds remaining before expiration. This should be set shorter than the credential lifetime to avoid using expired credentials.
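The page-size rule of thumb given for index_cache_size can be turned into a quick memory calculation. The sketch below is plain arithmetic based on that estimate. The function name and the example figures (10 million rows, 1,000 IVF partitions, 16 PQ sub-vectors) are made up for illustration, and real memory use will differ because the estimate ignores any other overhead.

```python
def ivf_pq_page_bytes(total_rows, num_partitions, num_sub_vectors):
    """Rough size in bytes of one cached IVF_PQ partition (hypothetical helper)."""
    rows_per_partition = total_rows // num_partitions      # n
    pq_code_bytes = rows_per_partition * num_sub_vectors   # uint8: 1 byte per code
    row_id_bytes = rows_per_partition * 8                  # uint64: 8 bytes per row id
    return pq_code_bytes + row_id_bytes


page_bytes = ivf_pq_page_bytes(10_000_000, 1_000, 16)
cache_bytes = 256 * page_bytes  # 256 is the documented default index_cache_size

print(page_bytes)   # 240000
print(cache_bytes)  # 61440000, about 61 MB if every cached page is this size
```

Under these assumed figures, a full cache at the default setting holds roughly 61 MB of index pages. Fewer, larger partitions increase the size of each page proportionally.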

Notes

When using namespace and table_id:

  • The uri parameter does not need to be supplied; the table location is fetched from the namespace

  • Storage options from describe_table() will be used unless ignore_namespace_table_storage_options=True

  • Initial storage options from describe_table() will be merged with any provided storage_options
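To illustrate the namespace contract described above, here is a minimal sketch of an object exposing describe_table(table_id, version) and returning a dict with 'location' and 'storage_options' keys. The class StaticNamespace and the helper open_from_namespace are hypothetical names invented for this example, not part of Lance or lance_namespace. The sketch follows only the parameter description on this page and has not been checked against a real object store.

```python
class StaticNamespace:
    """Hypothetical in-memory namespace mapping table ids to locations."""

    def __init__(self, tables):
        # tables: {("my_table",): {"location": ..., "storage_options": ...}}
        self._tables = tables

    def describe_table(self, table_id, version=None):
        # Look up the table by its identifier; `version` is accepted but unused here.
        entry = self._tables[tuple(table_id)]
        return {
            "location": entry["location"],
            "storage_options": entry.get("storage_options"),
        }


def open_from_namespace(namespace, table_id, **kwargs):
    """Open a dataset by namespace and table id; no uri is passed."""
    # Imported here so the sketch can be defined without Lance installed.
    import lance

    return lance.dataset(namespace=namespace, table_id=table_id, **kwargs)


ns = StaticNamespace({("my_table",): {"location": "s3://bucket/data.lance"}})
print(ns.describe_table(["my_table"]))
```

With Lance installed and a dataset at that location, open_from_namespace(ns, ["my_table"]) would be expected to resolve the location through describe_table(). Passing ignore_namespace_table_storage_options=True would discard any storage options the namespace returns.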