Extension Arrays

Lance provides extensions for Arrow arrays and Pandas Series to represent data types for machine learning applications.

BFloat16

BFloat16 is a 16-bit floating point number that is designed for machine learning use cases. Intuitively, it only has 2-3 digits of precision, but it has the same range as a 32-bit float: ~1e-38 to ~1e38. By comparison, a 16-bit float has a range of ~5.96e-8 to 65504.
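To make the precision trade-off concrete, here is a minimal sketch that emulates BFloat16 rounding using only the Python standard library. It assumes BFloat16 is the upper 16 bits of an IEEE 754 float32, rounded to nearest even. The helper name to_bfloat16 is hypothetical and not part of Lance, and NaN and infinity handling is omitted.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Round a Python float to the nearest bfloat16 value (hypothetical helper).

    Assumes a finite input that fits in float32; NaN and inf are not handled.
    """
    # Reinterpret the float32 encoding of `value` as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest even: add 0x7FFF plus the lowest bit that will be kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep the sign (1 bit), exponent (8 bits), and top 7 mantissa bits.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


rounded = [to_bfloat16(x) for x in (1.1, 2.1, 3.4)]
print(rounded)  # [1.1015625, 2.09375, 3.40625]
```

Only 7 explicit mantissa bits survive, which is why the results carry roughly 2-3 significant decimal digits, while the 8 exponent bits preserve the float32 range.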

Lance provides an Arrow extension array (lance.arrow.BFloat16Array) and a Pandas extension array (lance.pandas.BFloat16Dtype) for BFloat16. These are compatible with the ml_dtypes bfloat16 NumPy extension array.

If you are using Pandas, you can use the lance.bfloat16 dtype string to create the array:

import pandas as pd
import lance.arrow

series = pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
series
0    1.1015625
1      2.09375
2      3.40625
dtype: lance.bfloat16

To create an Arrow array, use the lance.arrow.bfloat16_array() function:

from lance.arrow import bfloat16_array

array = bfloat16_array([1.1, 2.1, 3.4])
array
<lance.arrow.BFloat16Array object at 0x.+>
[1.1015625, 2.09375, 3.40625]

Finally, if you have a pre-existing NumPy array, you can convert it into either array type:

import numpy as np
from ml_dtypes import bfloat16
from lance.arrow import PandasBFloat16Array, BFloat16Array

np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
PandasBFloat16Array.from_numpy(np_array)
BFloat16Array.from_numpy(np_array)
<PandasBFloat16Array>
[1.1015625, 2.09375, 3.40625]
Length: 3, dtype: lance.bfloat16
<lance.arrow.BFloat16Array object at 0x.+>
[1.1015625, 2.09375, 3.40625]

When reading, these can be converted back to the NumPy bfloat16 dtype using each array class’s to_numpy method.

ImageURI

lance.arrow.ImageURIArray is an array that stores the URI location of images in some other storage system. For example, file:///path/to/image.png for a local filesystem or s3://bucket/path/image.jpeg for an image on AWS S3. Use this array type when you want to lazily load images from an existing storage medium.

It can be created by calling lance.arrow.ImageURIArray.from_uris() with a list of URIs represented by either pyarrow.StringArray or an iterable that yields strings. Note that the URIs are not strongly validated and images are not read into memory automatically.

from lance.arrow import ImageURIArray

ImageURIArray.from_uris([
    "/tmp/image1.jpg",
    "file:///tmp/image2.jpg",
    "s3://example/image3.jpg"
])
<lance.arrow.ImageURIArray object at 0x.+>
['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image3.jpg']
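
Because the URIs are not strongly validated, it can be useful to inspect them before building the array. The following is a minimal sketch using only the standard library; the helper uri_scheme is hypothetical and not part of Lance, and it assumes that a string without a scheme refers to a local path.

```python
from urllib.parse import urlparse


def uri_scheme(uri: str) -> str:
    """Return the URI scheme, treating scheme-less strings as local paths.

    Hypothetical helper; the "file" default is an assumption for illustration.
    """
    scheme = urlparse(uri).scheme
    return scheme if scheme else "file"


uris = ["/tmp/image1.jpg", "file:///tmp/image2.jpg", "s3://example/image3.jpg"]
schemes = [uri_scheme(u) for u in uris]
print(schemes)  # ['file', 'file', 's3']
```

Note that this simple check would misread a Windows drive path such as C:\images\1.png as having the scheme "c", so it is only a starting point.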

lance.arrow.ImageURIArray.read_uris() will read images into memory and return them as a new lance.arrow.EncodedImageArray object.

import os

from lance.arrow import ImageURIArray

relative_path = "images/1.png"
uris = [os.path.join(os.path.dirname(__file__), relative_path)]
ImageURIArray.from_uris(uris).read_uris()
<lance.arrow.EncodedImageArray object at 0x...>
[b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']

EncodedImage

lance.arrow.EncodedImageArray is an array that stores JPEG and PNG images in their encoded and compressed representation, as they would appear when written to disk. Use this array when you want to manipulate images in their compressed format, such as when you’re reading them from disk or embedding them into HTML.

It can be created by calling lance.arrow.ImageURIArray.read_uris() on an existing lance.arrow.ImageURIArray. This will read the referenced images into memory. It can also be created by calling lance.arrow.ImageArray.from_array() and passing it an array of encoded images already read into a pyarrow.BinaryArray, or by calling lance.arrow.FixedShapeImageTensorArray.to_encoded().

A lance.arrow.EncodedImageArray.to_tensor() method is provided to decode encoded images and return them as a lance.arrow.FixedShapeImageTensorArray, from which they can be converted to NumPy arrays or TensorFlow tensors. To decode images, it will first attempt to use a decoder provided via the optional function parameter. If no decoder is provided, it will attempt to use Pillow and then TensorFlow. If neither library nor a custom decoder is available, an exception will be raised.

import os

from lance.arrow import ImageURIArray

uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
encoded_images = ImageURIArray.from_uris(uris).read_uris()
print(encoded_images.to_tensor())

def tensorflow_decoder(images):
    import tensorflow as tf
    import numpy as np

    # np.stack requires a sequence, so build a list rather than a generator
    return np.stack([tf.io.decode_png(img.as_py(), channels=3) for img in images.storage])

print(encoded_images.to_tensor(tensorflow_decoder))
<lance.arrow.FixedShapeImageTensorArray object at 0x...>
[[42, 42, 42, 255]]
<lance.arrow.FixedShapeImageTensorArray object at 0x...>
[[42, 42, 42, 255]]
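
As a sketch of the HTML embedding use case mentioned above, encoded image bytes can be wrapped in a data URI. The example below uses only the standard library; the helper to_data_uri is hypothetical and not part of Lance, and the input is placeholder bytes rather than a real image. In practice the bytes would presumably come from an element of the array, for example via as_py() as in the decoder above.

```python
import base64

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"       # JPEG start-of-image marker


def to_data_uri(encoded: bytes) -> str:
    """Wrap encoded PNG or JPEG bytes in a base64 data URI (hypothetical helper)."""
    if encoded.startswith(PNG_MAGIC):
        mime = "image/png"
    elif encoded.startswith(JPEG_MAGIC):
        mime = "image/jpeg"
    else:
        raise ValueError("unrecognized image format")
    payload = base64.b64encode(encoded).decode("ascii")
    return f"data:{mime};base64,{payload}"


# Placeholder bytes: a PNG signature followed by dummy data, not a real image.
fake_png = PNG_MAGIC + b"\x00\x00\x00\rIHDR"
html = f'<img src="{to_data_uri(fake_png)}">'
print(html)
```

Sniffing the leading bytes avoids having to track the image format separately, since the array may hold a mix of JPEG and PNG data.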

FixedShapeImageTensor

lance.arrow.FixedShapeImageTensorArray is an array that stores images as tensors where each individual pixel is represented as a numeric value. Typically, images are stored as 3-dimensional tensors shaped (height, width, channels). In color images, each pixel is represented by three values (channels), following the RGB color model. Images from this array can be read out as NumPy arrays individually or stacked together into a single 4-dimensional NumPy array shaped (batch_size, height, width, channels).

It can be created by calling lance.arrow.EncodedImageArray.to_tensor() on a previously existing lance.arrow.EncodedImageArray. This will decode encoded images and return them as a lance.arrow.FixedShapeImageTensorArray. It can also be created by calling lance.arrow.ImageArray.from_array() and passing in a pyarrow.FixedShapeTensorArray.

It can be encoded into a lance.arrow.EncodedImageArray by calling lance.arrow.FixedShapeImageTensorArray.to_encoded() and optionally passing a custom encoder. If no encoder is provided, it will attempt to use TensorFlow and then Pillow. The default encoders encode to PNG. If neither library is available, it will raise an exception.

import os

import pyarrow as pa
from lance.arrow import ImageURIArray

def jpeg_encoder(images):
    import tensorflow as tf

    encoded_images = (
        tf.io.encode_jpeg(x).numpy() for x in tf.convert_to_tensor(images)
    )
    return pa.array(encoded_images, type=pa.binary())

uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
print(tensor_images.to_encoded())
print(tensor_images.to_encoded(jpeg_encoder))
<lance.arrow.EncodedImageArray object at 0x...>
[b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']
<lance.arrow.EncodedImageArray object at 0x...>
[b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x01...']
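
The tensor layout described at the start of this section can be illustrated with plain NumPy, independent of Lance. This is a minimal sketch with made-up image sizes: two (height, width, channels) images are stacked into one (batch_size, height, width, channels) array.

```python
import numpy as np

# Illustrative dimensions only: two tiny 2x3 RGB images.
height, width, channels = 2, 3, 3
image_a = np.zeros((height, width, channels), dtype=np.uint8)      # all black
image_b = np.full((height, width, channels), 255, dtype=np.uint8)  # all white

# Stacking along a new leading axis yields the 4-dimensional batch layout.
batch = np.stack([image_a, image_b])
print(batch.shape)  # (2, 2, 3, 3)
```

Stacking only works when every image has the same shape, which is the constraint the "fixed shape" in the array's name refers to.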