FiftyOne
FiftyOne is an open source toolkit that enables users to curate better data and build better models. It includes tools for data exploration, visualization, and management, as well as features for collaboration and sharing.
Developers, data scientists, and researchers who work with computer vision and machine learning can use FiftyOne to improve the quality of their datasets and gain insights about their models.
FiftyOne provides an API to create LanceDB tables and run similarity queries, both programmatically in Python and via point-and-click in the App.
Let's get started and see how to use LanceDB to create a similarity index on your FiftyOne datasets.
Overview
Embeddings are foundational to all of the vector search features. In FiftyOne, embeddings are managed by the FiftyOne Brain, which provides powerful machine learning techniques designed to transform how you curate your data from an art into a measurable science.
Have you ever wanted to find the images most similar to an image in your dataset?
The FiftyOne Brain makes computing visual similarity easy. You can compute the similarity of samples in your dataset using an embedding model and store the results under a brain key.
You can then sort your samples by similarity or use this information to find potential duplicate images.
Here we will be doing the following:

- Create Index - In order to run similarity queries against our media, we need to index the data. We can do this via the compute_similarity() function. In the function, specify the model you want to use to generate the embedding vectors, and which vector search engine you want to use on the backend (here, LanceDB).

  Tip
  You can also give the similarity index a name (brain_key), which is useful if you want to run vector searches against multiple indexes.

- Query - Once you have generated your similarity index, you can query your dataset with sort_by_similarity(), as sketched after this list. The query can be any of the following:
  - An ID (sample or patch)
  - A query vector of the same dimension as the index
  - A list of IDs (samples or patches)
  - A text prompt (searched semantically)
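For reference, here is a minimal sketch of the supported query types. It assumes a dataset with an index whose brain_key is "lancedb_index" already exists (as created in the Quick Example below) and that the indexed model is CLIP, which produces 512-dimensional embeddings and supports text prompts:

import numpy as np

# Query by a single sample ID
view = dataset.sort_by_similarity(dataset.first().id, brain_key="lancedb_index", k=10)

# Query by a list of sample IDs (take(2) just grabs a couple of samples for illustration)
ids = [sample.id for sample in dataset.take(2)]
view = dataset.sort_by_similarity(ids, brain_key="lancedb_index", k=10)

# Query by an arbitrary vector with the same dimension as the index (512 for CLIP ViT-B/32)
view = dataset.sort_by_similarity(np.random.rand(512), brain_key="lancedb_index", k=10)

# Query by a text prompt (only supported when the indexed model accepts prompts, e.g. CLIP)
view = dataset.sort_by_similarity("a photo of a dog", brain_key="lancedb_index", k=10)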
Prerequisites: install necessary dependencies
- Create and activate a virtual environment

  Install the virtualenv package and, from inside your project directory, run the commands shown below to create and then activate the virtual environment.

- Install the following packages in the virtual environment

  To install FiftyOne, ensure you have activated the virtual environment that you are using, then run the install command shown below.
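A minimal sketch of the commands referenced above, assuming a Unix-like shell and a virtual environment named venv (both arbitrary choices), and that the lancedb package is installed alongside fiftyone for the LanceDB backend:

pip install virtualenv            # install the virtualenv package
virtualenv venv                   # create a virtual environment named "venv" in the project directory
source venv/bin/activate          # activate the environment (Unix-like shells)
pip install fiftyone lancedb      # install FiftyOne and the LanceDB client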
Understand basic workflow
The basic workflow shown below uses LanceDB to create a similarity index on your FiftyOne datasets:
- Load a dataset into FiftyOne.
- Compute embedding vectors for samples or patches in your dataset, or select a model to use to generate embeddings.
- Use the compute_similarity() method to generate a LanceDB table for the sample or object patch embeddings in a dataset by setting the parameter backend="lancedb" and specifying a brain_key of your choice.
- Use this LanceDB table to query your data with sort_by_similarity().
- If desired, delete the table.
Quick Example
Let's jump on a quick example that demonstrates this workflow.
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
# Step 1: Load your data into FiftyOne
dataset = foz.load_zoo_dataset("quickstart")
# Steps 2 and 3: Compute embeddings and create a similarity index
lancedb_index = fob.compute_similarity(
dataset,
model="clip-vit-base32-torch",
brain_key="lancedb_index",
backend="lancedb",
)
Note
Running the code above will download the CLIP model (~2.6 GB).
Once the similarity index has been generated, we can query our data in FiftyOne by specifying the brain_key:
# Step 4: Query your data
query = dataset.first().id # query by sample ID
view = dataset.sort_by_similarity(
query,
brain_key="lancedb_index",
k=10, # limit to 10 most similar samples
)
The result of this query is a DatasetView.
Note
A DatasetView does not hold its contents in memory. Views simply store the rule(s) that are applied to extract the content of interest from the underlying Dataset when the view is iterated/aggregated on.
This means, for example, that the contents of a DatasetView may change as the underlying Dataset is modified.
Can you query a view instead of a dataset?
Yes, you can also query a view.
Performing a similarity search on a DatasetView will only return results from the view; if the view contains samples that were not included in the index, they will never be included in the result.
This means that you can index an entire Dataset once and then perform searches on subsets of the dataset by constructing views that contain the images of interest.
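For example, here is a minimal sketch that reuses the dataset and index from the Quick Example above (the take(100) subset is just an illustration):

# Build a view containing only the samples of interest (here, 100 random samples)
subset = dataset.take(100)

# Similarity search restricted to the view's samples
view = subset.sort_by_similarity(
    dataset.first().id,        # same ID-based query as above
    brain_key="lancedb_index",
    k=10,
)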
# Step 5 (optional): Cleanup
# Delete the LanceDB table
lancedb_index.cleanup()
# Delete run record from FiftyOne
dataset.delete_brain_run("lancedb_index")
Using LanceDB backend
By default, calling compute_similarity() or sort_by_similarity() will use the sklearn backend.
To use the LanceDB backend, simply set the optional backend parameter of compute_similarity() to "lancedb":
import fiftyone.brain as fob
#... rest of the code
fob.compute_similarity(..., backend="lancedb", ...)
Alternatively, you can configure FiftyOne to use the LanceDB backend by setting the following environment variable in your terminal:
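# Make LanceDB the default similarity backend for the FiftyOne Brain
export FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND=lancedb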
Note
This setting only persists for the current terminal session. Once the terminal is closed, the environment variable is discarded.
Alternatively, you can permanently configure FiftyOne to use the LanceDB backend by creating a brain_config.json file at ~/.fiftyone/brain_config.json. The JSON file may contain any desired subset of config fields that you wish to customize.
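For example, to make LanceDB the default similarity backend, the file could contain just that field (a minimal sketch using FiftyOne's brain config field name):

{
    "default_similarity_backend": "lancedb"
}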
FiftyOne will pick up this brain_config and set the configuration according to your customizations. You can check the current configuration by running the following code:
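import fiftyone.brain as fob

# Print the active brain config, including the default similarity backend
print(fob.brain_config)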
LanceDB config parameters
The LanceDB backend supports config parameters that can be used to customize your similarity indexes and queries. These parameters include:
| Name | Purpose | Default |
|---|---|---|
| table_name | The name of the LanceDB table to use. If none is provided, a new table will be created | None |
| metric | The embedding distance metric to use when creating a new table. The supported values are "cosine" and "euclidean" | "cosine" |
| uri | The database URI to use. Tables will be created under this database URI | "/tmp/lancedb" |
There are two ways to specify/customize these parameters:

- Using the brain_config.json file
- Directly passing them to compute_similarity() to configure a specific new index, as sketched below
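A minimal sketch of both approaches; the parameter values are placeholders, and the similarity_backends field name follows FiftyOne's brain config conventions.

Via ~/.fiftyone/brain_config.json:

{
    "similarity_backends": {
        "lancedb": {
            "table_name": "your-table",
            "metric": "euclidean",
            "uri": "/tmp/lancedb"
        }
    }
}

Or directly when creating a new index:

import fiftyone.brain as fob

fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    backend="lancedb",
    brain_key="lancedb_index",
    table_name="your-table",  # placeholder table name
    metric="euclidean",
    uri="/tmp/lancedb",
)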
For a much more in-depth walkthrough of the integration, visit the LanceDB x Voxel51 docs page.