Object Store Configuration¶
Lance supports object stores such as AWS S3 (and compatible stores), Azure Blob Store,
and Google Cloud Storage. Which object store to use is determined by the URI scheme of
the dataset path. For example, s3://bucket/path
will use S3, az://bucket/path
will use Azure, and gs://bucket/path
will use GCS.
Added in version 0.10.7: Passing options directly to storage options.
These object stores take additional configuration options. There are two ways to
specify them: by setting environment variables or by passing them to the
storage_options parameter of lance.dataset() and lance.write_dataset().
For example, to globally set a higher timeout, you would run in your shell:
export TIMEOUT=60s
If you only want to set the timeout for a single dataset, you can pass it as a storage option:
import lance
ds = lance.dataset("s3://path", storage_options={"timeout": "60s"})
General Configuration¶
These options apply to all object stores.
Key | Description
---|---
allow_http | Allow non-TLS, i.e. non-HTTPS connections. Default, False.
download_retry_count | Number of times to retry a download. Default, 3.
allow_invalid_certificates | Skip certificate validation on https connections. Default, False.
connect_timeout | Timeout for only the connect phase of a Client. Default, 5s.
timeout | Timeout for the entire request, from connection until the response body has finished. Default, 30s.
user_agent | User agent string to use in requests.
proxy_url | URL of a proxy server to use for requests. Default, None.
proxy_ca_certificate | PEM-formatted CA certificate for proxy connections.
proxy_excludes | List of hosts that bypass proxy. This is a comma separated list of domains and IP masks. Any subdomain of the provided domain will be bypassed. For example, example.com, 192.168.1.0/24 would bypass https://api.example.com, https://www.example.com, as well as any IP in the range 192.168.1.0/24.
client_max_retries | Number of times for an S3 client to retry the request. Default, 10.
client_retry_timeout | Timeout for an S3 client to retry the request in seconds. Default, 180.
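These options can be combined per dataset. A minimal sketch, assuming the keys listed above (the values are only illustrative):
import lance

ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "connect_timeout": "5s",    # fail fast when the store is unreachable
        "timeout": "60s",           # overall request timeout
        "client_max_retries": "5",  # retry transient request failures
    },
)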
S3 Configuration¶
S3 (and S3-compatible stores) have additional configuration options that control authorization and S3-specific features (such as server-side encryption).
AWS credentials can be set in the environment variables AWS_ACCESS_KEY_ID
,
AWS_SECRET_ACCESS_KEY
, and AWS_SESSION_TOKEN
. Alternatively, they can be
passed as keys in the storage_options
parameter:
import lance
ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "access_key_id": "my-access-key",
        "secret_access_key": "my-secret-key",
        "session_token": "my-session-token",
    }
)
If you are using AWS SSO, you can specify the AWS_PROFILE
environment variable.
It cannot be specified in the storage_options
parameter.
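For example, assuming an SSO profile named my-profile (the name is just a placeholder) has already been configured with the AWS CLI, you would run in your shell:
export AWS_PROFILE=my-profile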
The following keys can be used as either environment variables or keys in the
storage_options parameter:

Key | Description
---|---
region | The AWS region the bucket is in. This can be automatically detected when using AWS S3, but must be specified for S3-compatible stores.
access_key_id | The AWS access key ID to use.
secret_access_key | The AWS secret access key to use.
session_token | The AWS session token to use.
endpoint | The endpoint to use for S3-compatible stores.
virtual_hosted_style_request | Whether to use virtual hosted-style requests, where the bucket name is part of the endpoint. Meant to be used with endpoint. Default, False.
s3_express | Whether to use S3 Express One Zone endpoints. Default, False.
server_side_encryption | The server-side encryption algorithm to use. Must be one of "AES256", "aws:kms", or "aws:kms:dsse".
sse_kms_key_id | The KMS key ID to use for server-side encryption. If set, server_side_encryption must be "aws:kms" or "aws:kms:dsse".
sse_bucket_key_enabled | Whether to use bucket keys for server-side encryption.
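For example, a sketch of enabling SSE-KMS for a single dataset, assuming the keys above (the KMS key ID is a placeholder):
import lance

ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "server_side_encryption": "aws:kms",
        # Placeholder KMS key ID; substitute your own key.
        "sse_kms_key_id": "1234abcd-12ab-34cd-56ef-1234567890ab",
    },
)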
S3-compatible stores¶
Lance can also connect to S3-compatible stores, such as MinIO. To do so, you must specify both region and endpoint:
import lance
ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "region": "us-east-1",
        "endpoint": "http://minio:9000",
    }
)
This can also be done with the AWS_ENDPOINT
and AWS_DEFAULT_REGION
environment variables.
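For example:
export AWS_ENDPOINT=http://minio:9000
export AWS_DEFAULT_REGION=us-east-1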
S3 Express¶
Added in version 0.9.7.
Lance supports S3 Express One Zone endpoints, but this requires additional configuration. Also, S3 Express endpoints only support connections from an EC2 instance within the same region.
To configure Lance to use an S3 Express endpoint, you must set the storage option
s3_express. The bucket name in your dataset URI should include the availability-zone suffix (for example, --use1-az4--x-s3):
import lance
ds = lance.dataset(
    "s3://my-bucket--use1-az4--x-s3/path/imagenet.lance",
    storage_options={
        "region": "us-east-1",
        "s3_express": "true",
    }
)
Google Cloud Storage Configuration¶
GCS credentials are configured by setting the GOOGLE_SERVICE_ACCOUNT
environment
variable to the path of a JSON file containing the service account credentials.
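For example:
export GOOGLE_SERVICE_ACCOUNT=path/to/service-account.json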
Alternatively, you can pass the path to the JSON file in the storage_options parameter:
import lance
ds = lance.dataset(
    "gs://my-bucket/my-dataset",
    storage_options={
        "service_account": "path/to/service-account.json",
    }
)
Note
By default, GCS uses HTTP/1 for communication, as opposed to HTTP/2. This improves
maximum throughput significantly. However, if you wish to use HTTP/2 for some reason,
you can set the environment variable HTTP1_ONLY
to false
.
The following keys can be used as either environment variables or keys in the
storage_options parameter:

Key | Description
---|---
service_account | Path to the service account JSON file.
service_account_key | The serialized service account key.
application_credentials | Path to the application credentials.
Azure Blob Storage Configuration¶
Azure Blob Storage credentials can be configured by setting the AZURE_STORAGE_ACCOUNT_NAME
and AZURE_STORAGE_ACCOUNT_KEY
environment variables. Alternatively, you can pass
the account name and key in the storage_options
parameter:
import lance
ds = lance.dataset(
    "az://my-container/my-dataset",
    storage_options={
        "account_name": "some-account",
        "account_key": "some-key",
    }
)
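The same credentials can instead come from the environment (the values shown are placeholders):
export AZURE_STORAGE_ACCOUNT_NAME=some-account
export AZURE_STORAGE_ACCOUNT_KEY=some-key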
These keys can be used as either environment variables or keys in the storage_options
parameter:

Key | Description
---|---
account_name | The name of the Azure storage account.
account_key | The Azure storage account access key.
client_id | Service principal client id for authorizing requests.
client_secret | Service principal client secret for authorizing requests.
tenant_id | Tenant id used in OAuth flows.
sas_key | Shared access signature. The signature is expected to be percent-encoded, much like it is provided in the Azure storage explorer or Azure portal.
token | Bearer token.
use_emulator | Use object store with the Azurite storage emulator.
endpoint | Override the endpoint used to communicate with blob storage.
use_fabric_endpoint | Use object store with url scheme account.dfs.fabric.microsoft.com.
msi_endpoint | Endpoint to request an IMDS managed identity token.
object_id | Object id for use with managed identity authentication.
msi_resource_id | MSI resource id for use with managed identity authentication.
federated_token_file | File containing token for Azure AD workload identity federation.
use_azure_cli | Use the Azure CLI for acquiring an access token.
disable_tagging | Disables tagging objects. This can be desirable if not supported by the backing store.
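For example, a sketch of authenticating with a service principal instead of an account key, assuming the keys above (all values are placeholders):
import lance

ds = lance.dataset(
    "az://my-container/my-dataset",
    storage_options={
        "account_name": "some-account",
        "tenant_id": "my-tenant-id",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
    },
)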