LangChain
Quick Start
You can load your document data using LangChain's document loaders. In this example we use `TextLoader` to load the data and `OpenAIEmbeddings` as the embedding model. Check out the complete example here - LangChain demo.
```python
import os

from langchain.document_loaders import TextLoader
from langchain.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

os.environ["OPENAI_API_KEY"] = "sk-..."

loader = TextLoader("../../modules/state_of_the_union.txt")  # Replace with your data path
documents = loader.load()
documents = CharacterTextSplitter().split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = LanceDB.from_documents(documents, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
```
Documentation
In the above example, the `LanceDB` vector store object is created with the `from_documents()` classmethod, which returns an initialized instance. You can also use the `LanceDB.from_texts(texts: List[str], embedding: Embeddings)` classmethod.
The exhaustive list of parameters for the `LanceDB` vector store is:
- `connection`: (Optional) `lancedb.db.LanceDBConnection` connection object to use. If not provided, a new connection will be created.
- `embedding`: LangChain embedding model.
- `vector_key`: (Optional) Column name to use for vectors in the table. Defaults to `'vector'`.
- `id_key`: (Optional) Column name to use for ids in the table. Defaults to `'id'`.
- `text_key`: (Optional) Column name to use for text in the table. Defaults to `'text'`.
- `table_name`: (Optional) Name of your table in the database. Defaults to `'vectorstore'`.
- `api_key`: (Optional) API key to use for the LanceDB Cloud database. Defaults to `None`.
- `region`: (Optional) Region to use for the LanceDB Cloud database; only for LanceDB Cloud. Defaults to `None`.
- `mode`: (Optional) Mode to use for adding data to the table. Defaults to `'overwrite'`.
- `reranker`: (Optional) The reranker to use with LanceDB.
- `relevance_score_fn`: (`Optional[Callable[[float], float]]`) LangChain relevance score function to be used. Defaults to `None`.
```python
db_url = "db://lang_test"   # URL of the database you created
api_key = "xxxxx"           # your API key
region = "us-east-1-dev"    # your selected region

vector_store = LanceDB(
    uri=db_url,
    api_key=api_key,  # don't include for a local database
    region=region,    # don't include for a local database
    embedding=embeddings,
    table_name="langchain_test",  # optional
)
```
Methods
add_texts()
- `texts`: `Iterable` of strings to add to the vector store.
- `metadatas`: Optional `list[dict]` of metadata associated with the texts.
- `ids`: Optional `list` of ids to associate with the texts.
- `kwargs`: `Any`

This method adds texts and stores the respective embeddings automatically.
```python
vector_store.add_texts(texts=["test_123"], metadatas=[{"source": "wiki"}])

# Additionally, to explore the table you can load it into a DataFrame
# or save it to a CSV file:
tbl = vector_store.get_table()
print("tbl:", tbl)
pd_df = tbl.to_pandas()
pd_df.to_csv("docsearch.csv", index=False)

# You can also create a new vector store object using an existing connection object:
vector_store = LanceDB(connection=tbl, embedding=embeddings)
```
create_index()
- `col_name`: `Optional[str] = None`
- `vector_col`: `Optional[str] = None`
- `num_partitions`: `Optional[int] = 256`
- `num_sub_vectors`: `Optional[int] = 96`
- `index_cache_size`: `Optional[int] = None`

This method creates an index for the vector store. Make sure your table has enough data before creating an index: an ANN index is usually not needed for datasets of up to ~100K vectors. For large-scale (>1M) or higher-dimensional vectors, creating an ANN index is beneficial.
```python
# for creating a vector index
vector_store.create_index(vector_col="vector", metric="cosine")

# for creating a scalar index (for non-vector columns)
vector_store.create_index(col_name="text")
```
similarity_search()
- `query`: `str`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `fts`: `Optional[bool] = False`
- `name`: `Optional[str] = None`
- `kwargs`: `Any`

Returns documents most similar to the query, without relevance scores.
similarity_search_by_vector()
- `embedding`: `List[float]`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `name`: `Optional[str] = None`
- `kwargs`: `Any`

Returns documents most similar to the query vector.
similarity_search_with_score()
- `query`: `str`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `kwargs`: `Any`

Returns documents most similar to the query string together with relevance scores. It is called by the base class's `similarity_search_with_relevance_scores()`, which selects the relevance score based on the `_select_relevance_score_fn`.
```python
docs = docsearch.similarity_search_with_relevance_scores(query)
print("relevance score - ", docs[0][1])
print("text- ", docs[0][0].page_content[:1000])
```
similarity_search_by_vector_with_relevance_scores()
- `embedding`: `List[float]`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `name`: `Optional[str] = None`
- `kwargs`: `Any`

Returns documents most similar to the query vector, together with relevance scores.
```python
docs = docsearch.similarity_search_by_vector_with_relevance_scores(query_embedding)
print("relevance score - ", docs[0][1])
print("text- ", docs[0][0].page_content[:1000])
```
max_marginal_relevance_search()
- `query`: `str`
- `k`: `Optional[int] = None`
- `fetch_k`: `Optional[int] = None` - Number of documents to fetch and pass to the MMR algorithm.
- `lambda_mult`: `float = 0.5` - Number between 0 and 1 that determines the degree of diversity among the results, with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.
- `filter`: `Optional[Dict[str, str]] = None`
- `kwargs`: `Any`

Returns documents selected using maximal marginal relevance (MMR). Maximal marginal relevance optimizes for similarity to the query AND diversity among the selected documents.
Similarly, `max_marginal_relevance_search_by_vector()` returns the documents most similar to a given embedding using MMR; instead of a string query, you pass the embedding to search for.
```python
result = docsearch.max_marginal_relevance_search(query="text")
result_texts = [doc.page_content for doc in result]
print(result_texts)

# search by vector:
result = docsearch.max_marginal_relevance_search_by_vector(
    embeddings.embed_query("text")
)
result_texts = [doc.page_content for doc in result]
print(result_texts)
```
add_images()
- `uris`: `List[str]` of file paths to the images.
- `metadatas`: (`Optional[List[dict]]`) Optional list of metadata associated with the images.
- `ids`: (`Optional[List[str]]`) Optional list of IDs.

This method automatically creates embeddings for the images and adds them to the vector store.