In this article we will use a new feature in Azure Cosmos DB for MongoDB vCore: the Hierarchical Navigable Small World (HNSW) vector index. We will implement the HNSW index using LangChain and Python. HNSW lets us run fast approximate nearest neighbor searches while maintaining high precision.
Per Microsoft, using the HNSW vector index with large datasets can lead to memory resource depletion or decreased performance of concurrent database operations. To minimize the likelihood of these issues, it is advisable to:
- Only use HNSW indexes on a cluster tier of M50 or higher. (tier M40 may produce an error when creating the index – I will be adjusting this once the feature goes into GA)
- Scale to a higher cluster tier or reduce the size of the database if you encounter errors.
In this article
- Prerequisites
- What is Azure Cosmos DB for MongoDB vCore
- How Does HNSW Speed up Vector Search?
- Setting Up Python
- Using LangChain with Azure Cosmos DB for MongoDB vCore
- LangChain Vector Search on Azure Cosmos DB for MongoDB
Prerequisites
- If you don’t have an Azure subscription, create an Azure free account before you begin.
- Set up an account for the OpenAI API: Overview – OpenAI API
- Create an Azure Cosmos DB for MongoDB vCore by following this QuickStart.
What is Azure Cosmos DB for MongoDB vCore
Azure Cosmos DB for MongoDB vCore is a fully managed, MongoDB-compatible database service within Azure. It streamlines the deployment and management of MongoDB databases on the Azure cloud platform for applications that require MongoDB compatibility.
Enable Preview Feature
To make use of the HNSW vector index with Azure Cosmos DB, it is necessary to enable the “mongoHnswIndex” preview feature. This can be done by following the instructions provided in the following link: enable preview features.
Once enabled, the ‘HNSW index for vector search’ will be officially marked ‘Registered’, as depicted below.
How Does HNSW Speed up Vector Search?
Hierarchical Navigable Small World (HNSW) makes searching for similar vectors faster. It does this by arranging the vectors into a hierarchical graph. This helps to speed up the search by only looking at nearby vectors in the hierarchy, which reduces the number of comparisons needed. HNSW also uses a “small world” approach, where each vector is connected to a small number of other vectors. With this approach distant vectors can be reached in a limited number of steps – speeding up the search.
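To make the idea concrete, here is a minimal sketch of a greedy walk over a neighbor graph. It is a single-layer simplification written for this article, not the algorithm Azure Cosmos DB runs: real HNSW adds several layers and keeps a list of candidates during the search. The function names, the random data, and the value m=4 are all assumptions chosen for illustration.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_graph(vectors, m=4):
    """Link every vector to its m nearest neighbors (brute force, demo only)."""
    graph = {}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: distance(v, vectors[j]))
        graph[i] = others[:m]
    return graph


def greedy_search(vectors, graph, query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = distance(vectors[current], query)
    comparisons = 1
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = distance(vectors[neighbor], query)
            comparisons += 1
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:  # no neighbor is closer, so stop here
            return current, current_dist, comparisons
        current, current_dist = best, best_dist


random.seed(42)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
graph = build_graph(vectors, m=4)
query = [0.5] * 8

found, found_dist, comparisons = greedy_search(vectors, graph, query)
exact = min(range(len(vectors)), key=lambda i: distance(vectors[i], query))

print(f"greedy: node {found}, distance {found_dist:.3f}, comparisons {comparisons}")
print(f"exact:  node {exact}, distance {distance(vectors[exact], query):.3f}, comparisons {len(vectors)}")
```

The greedy walk can stop at a node that is close to the query but not the closest one, which is why the search is called approximate. The extra layers and candidate list in full HNSW exist to reduce how often that happens.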
Setting Up Python
This article uses Python, so it must be installed on your computer. We will use Python and LangChain to import vectors into Azure Cosmos DB for MongoDB vCore and perform a similarity search using the HNSW index.
Python version 3.11.4 was used for development and testing in this article.
The first step is to set up your Python virtual environment.
python -m venv venv
Create a new file called ‘requirements.txt’ in your working directory and add the following to the file:
pymongo
openai
python-dotenv
tiktoken
langchain-openai
azure-storage-blob
azure-identity
langchain==0.1.11
langchain-community==0.0.27
langchain-core==0.1.30
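Activate your environment and install the dependencies by running the following commands. The activation command shown is for Windows; on macOS or Linux, use ‘source venv/bin/activate’ instead.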
venv\Scripts\activate
python -m pip install -r requirements.txt
Create a file named ‘.env’ to store the environment variables required to connect to your LLM (OpenAI in this article) and Azure Cosmos DB.
OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
Environment Variable | Description |
---|---|
OPENAI_API_KEY | The key to connect to the OpenAI API. If you do not have an OpenAI API key, you can get one by following the guidelines outlined here. |
MONGO_CONNECTION_STRING | The connection string for Azure Cosmos DB for MongoDB vCore – demonstrated here. |
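A missing variable usually surfaces later as a confusing connection or authentication error, so it can help to check both values up front. The helper below is a small sketch using only the standard library; the name missing_variables is hypothetical and not part of any library. In a real script you would call load_dotenv() first so that the values from ‘.env’ are in the environment.

```python
from os import environ

# The two variables this article stores in '.env'
REQUIRED_VARIABLES = ("OPENAI_API_KEY", "MONGO_CONNECTION_STRING")


def missing_variables(env=environ, required=REQUIRED_VARIABLES):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```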
Using LangChain with Azure Cosmos DB for MongoDB vCore
We can utilize the identical JSONLoader as described in the article “LangChain Vector Search with Cosmos DB for MongoDB.” Please create the ‘jsonloader.py’ file referenced in the article in your working directory and proceed.
Download sample Dataset
Download and unzip the documents folder containing the sample JSON files into your working directory. These files contain snippets from books on rocket engineering and interstellar space propulsion for our vector search hands-on.
Download: Rocket_Propulsion_Data.zip.
Creating Cosmos DB Loader and HNSW Index
Create a Python file called ‘cosmosdbloader.py’ in your working directory and proceed to add the following code to create our Cosmos DB Loader.
from os import environ
from pathlib import Path
from typing import Union

from dotenv import load_dotenv
from pymongo import MongoClient
from jsonloader import JSONLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)

load_dotenv(override=True)


class CosmosDBLoader:
    def __init__(self, file_path: Union[str, Path], enable_HNSW: bool = False):
        self.file_path = Path(file_path).resolve()
        self.enable_HNSW = enable_HNSW

    def load(self):
        """Load embeddings from file_path into the Cosmos DB vector store."""
        # variables from the '.env' file
        OPENAI_API_KEY = environ.get("OPENAI_API_KEY")
        MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")
        # hardcoded variables
        DB_NAME = "research"
        COLLECTION_NAME = "resources"
        EMBEDDING_FIELD_NAME = "embedding"  # defined for reference; not passed to the vector store below
        INDEX_NAME = "vectorSearchIndex"

        client = MongoClient(MONGO_CONNECTION_STRING)
        db = client[DB_NAME]
        collection = db[COLLECTION_NAME]

        loader = JSONLoader(self.file_path)
        docs = loader.load()

        # load documents into the Cosmos DB vector store
        vector_store = AzureCosmosDBVectorSearch.from_documents(
            docs,
            OpenAIEmbeddings(disallowed_special=()),
            collection=collection,
            index_name=INDEX_NAME)

        if not vector_store.index_exists():
            # create an index for vector search
            # for a small demo, you can start with num_lists set to 1 to perform
            # a brute-force search across all vectors
            num_lists = 1
            dimensions = 1536
            if self.enable_HNSW:
                print('VECTOR_HNSW')
                # m=16 is the default; a higher value suits datasets with high
                # dimensionality and/or high accuracy requirements
                # ef_construction=64 is the default; a higher value gives better index
                # quality and accuracy, but increases the time required to build the index
                vector_store.create_index(similarity=CosmosDBSimilarityType.COS,
                                          kind=CosmosDBVectorSearchType.VECTOR_HNSW,
                                          m=16,
                                          ef_construction=64,
                                          dimensions=dimensions)
            else:
                print('VECTOR_IVF')
                vector_store.create_index(kind=CosmosDBVectorSearchType.VECTOR_IVF,
                                          num_lists=num_lists,
                                          dimensions=dimensions,
                                          similarity=CosmosDBSimilarityType.COS)
CosmosDBLoader is responsible for loading our sample documents into the Cosmos DB vector store. The constructor __init__ initializes the file path and an option to enable Hierarchical Navigable Small World (HNSW) indexing.
The load method utilizes the JSONLoader to create the LangChain documents, ‘docs’, for loading into Azure Cosmos DB for MongoDB. If the index for vector search does not exist, it creates one based on the specified parameters, such as the number of lists, dimensions, and similarity type. Additionally, it will create the HNSW index when self.enable_HNSW=True.
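For readers curious about what an index definition looks like at the database level, the sketch below builds (but does not send) a createIndexes command in the shape I understand Microsoft's vector search documentation to describe. Treat the field names as something to verify against the current documentation. The function name build_vector_index_command is hypothetical, and the embedding field name "vectorContent" is an assumption about LangChain's default rather than something this article sets.

```python
def build_vector_index_command(collection_name, index_name, embedding_key,
                               enable_hnsw, dimensions=1536, num_lists=1,
                               m=16, ef_construction=64):
    """Build (but do not send) a createIndexes command for a vector index."""
    if enable_hnsw:
        # HNSW-specific options
        options = {"kind": "vector-hnsw", "m": m, "efConstruction": ef_construction}
    else:
        # IVF-specific options
        options = {"kind": "vector-ivf", "numLists": num_lists}
    # options shared by both index kinds
    options.update({"similarity": "COS", "dimensions": dimensions})
    return {
        "createIndexes": collection_name,
        "indexes": [{
            "name": index_name,
            "key": {embedding_key: "cosmosSearch"},
            "cosmosSearchOptions": options,
        }],
    }


# "vectorContent" is an assumed field name, used here only for illustration
command = build_vector_index_command(
    "resources", "vectorSearchIndex", "vectorContent", enable_hnsw=True)
print(command)
```

Seeing the two variants side by side makes the difference between the branches in load clearer: the HNSW index takes m and efConstruction, while the IVF index takes numLists.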
Loading Data into Cosmos DB Vector Store
Finally, our data can be loaded into Azure Cosmos DB for MongoDB vCore utilizing the HNSW vector index.
To load the data, create another Python file in your working directory named ‘main.py’ and add the following code. The parameter enable_HNSW=True tells the CosmosDBLoader to create the HNSW index instead of the IVF index.
from cosmosdbloader import CosmosDBLoader

file_names = ['documents/Rocket_Propulsion_Elements_with_images.json',
              'documents/Introduction_To_Rocket_Science_And_Engineering_with_images.json']
# this doubles the list, so each file is loaded twice;
# remove the next line if you do not want duplicate documents
file_names += file_names

for file_name in file_names:
    CosmosDBLoader(file_name, enable_HNSW=True).load()
From your terminal, please proceed to execute the following command:
python main.py
Output: VECTOR_HNSW
Verify the successful loading of documents using MongoDB Compass (or a similar tool).
LangChain Vector Search on Azure Cosmos DB for MongoDB
Once our data is loaded and the search index is created, we can initiate a vector search. This type of search allows us to find similar items based on their data rather than relying on exact matches in a property field. For example, the following Python script connects to Azure Cosmos DB and performs a similarity search. It then retrieves and displays metadata and content for the top 10 documents matching the query “supersonic combustion.”
from os import environ
import time

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azure_cosmos_db import AzureCosmosDBVectorSearch

load_dotenv(override=True)

# variable from the '.env' file
MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")
# hardcoded variables
DB_NAME = "research"
COLLECTION_NAME = "resources"
INDEX_NAME = "vectorSearchIndex"

# connect to Azure Cosmos DB for vector search
vector_store = AzureCosmosDBVectorSearch.from_connection_string(
    MONGO_CONNECTION_STRING,
    DB_NAME + "." + COLLECTION_NAME,
    OpenAIEmbeddings(disallowed_special=()),
    index_name=INDEX_NAME)

start_time = time.time()
# query to use in similarity_search
query = "supersonic combustion"
docs = vector_store.similarity_search(query, k=10)

# format results from search
for doc in docs:
    print({'id': doc.metadata["page_id"],
           'title': doc.metadata["title"],
           'source': f"{doc.metadata['chapter']} (page-{doc.metadata['pagenumber']})",
           'content': doc.page_content})

print(f"--- {time.time() - start_time:.2f} seconds ---")
When executing the code, it will display the document content and metadata retrieved for the search query “supersonic combustion”.
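The index was created with cosine similarity (CosmosDBSimilarityType.COS), so it may help to see what that measure computes. The sketch below ranks a few made-up three-dimensional vectors against a query by brute force. The vectors and labels are invented for illustration only; real OpenAI embeddings have 1536 dimensions, and the HNSW index avoids comparing the query against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, items, k=2):
    """Return the k item names whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# hypothetical toy "embeddings", not produced by any model
toy_embeddings = {
    "scramjet": [0.9, 0.8, 0.1],
    "ion drive": [0.1, 0.2, 0.9],
    "ramjet": [0.8, 0.9, 0.2],
}
query_vector = [1.0, 0.9, 0.0]

print(top_k(query_vector, toy_embeddings, k=2))
```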
As an additional point, the similarity_search function lets you specify kind=CosmosDBVectorSearchType.VECTOR_HNSW in order to pass the ef_search value to Azure Cosmos DB for an HNSW vector search. However, this was causing an error, so I retained the default setting and will make the necessary adjustments once the issue is resolved.
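For context on where ef_search ends up, the sketch below builds (but does not run) an aggregation pipeline in the shape I understand Microsoft's documentation to describe for an HNSW search. It is not the code LangChain runs, and it does not address the error mentioned above. The function name build_hnsw_search_pipeline is hypothetical, the embedding field name "vectorContent" is an assumption, and the default of 40 for ef_search should be checked against the current documentation.

```python
def build_hnsw_search_pipeline(query_vector, embedding_key, k=10, ef_search=40):
    """Build (but do not run) an aggregation pipeline for an HNSW vector search."""
    return [
        {
            "$search": {
                "cosmosSearch": {
                    "vector": query_vector,   # the embedded query text
                    "path": embedding_key,    # field holding the stored embeddings
                    "k": k,                   # number of results to return
                    "efSearch": ef_search,    # size of the candidate list during search
                }
            }
        },
        {
            "$project": {
                "similarityScore": {"$meta": "searchScore"},
                "document": "$$ROOT",
            }
        },
    ]


# a three-element vector stands in for a real 1536-dimension embedding
pipeline = build_hnsw_search_pipeline([0.1, 0.2, 0.3], "vectorContent", k=10, ef_search=40)
print(pipeline)
```

A larger ef_search generally trades query speed for recall, in the same way ef_construction trades build time for index quality.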
Clean-Up! In this article we used an Azure Cosmos DB for MongoDB cluster tier of M50. Please delete the cluster in order to avoid additional charges.
Great job! You’ve successfully utilized LangChain to upload documents from various JSON files into Azure Cosmos DB for MongoDB vCore, incorporating an HNSW vector index. Next, you can look at using a RAG pattern with the following series: LangChain RAG with React, FastAPI, Cosmos DB Vector: Part 1.