Azure Cosmos DB for MongoDB HNSW Vector Search

This article discusses leveraging the new HNSW Vector Index feature in Azure Cosmos DB for MongoDB vCore using Python and LangChain. It covers enabling the HNSW index, setting up Python, loading data, and conducting a vector search.

This article uses a new feature in Azure Cosmos DB for MongoDB vCore: the Hierarchical Navigable Small World (HNSW) vector index. We will implement the HNSW index using LangChain and Python. HNSW enables fast approximate nearest neighbor searches while maintaining high accuracy.

Per Microsoft, using the HNSW vector index with large datasets can lead to memory resource depletion or decreased performance of concurrent database operations. To minimize the likelihood of these issues, it is advisable to:

  • Only use HNSW indexes on a cluster tier of M50 or higher. (Tier M40 may produce an error when creating the index; I will adjust this once the feature reaches GA.)
  • Scale to a higher cluster tier or reduce the size of the database if you encounter errors.

Prerequisites

What is Azure Cosmos DB for MongoDB vCore

Azure Cosmos DB for MongoDB vCore is a fully managed, MongoDB-compatible database service in Azure. It simplifies deploying and managing MongoDB databases on the Azure cloud platform for applications that require MongoDB compatibility.

Enable Preview Feature

To use the HNSW vector index with Azure Cosmos DB, you must first enable the “mongoHnswIndex” preview feature by following the instructions in: enable preview features.

Once enabled, the ‘HNSW index for vector search’ feature will be marked ‘Registered’, as depicted below.

Enable Azure Preview feature HNSW index for vector search for Azure Cosmos DB

Hierarchical Navigable Small World (HNSW) makes searching for similar vectors faster. It does this by arranging the vectors into a hierarchical graph. This helps to speed up the search by only looking at nearby vectors in the hierarchy, which reduces the number of comparisons needed. HNSW also uses a “small world” approach, where each vector is connected to a small number of other vectors. With this approach distant vectors can be reached in a limited number of steps – speeding up the search.
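To make the “small world” idea concrete, here is a minimal sketch of a greedy walk over a single-layer neighbor graph. It is a simplification for illustration only: real HNSW uses several layers and keeps a list of candidates rather than a single current node. The toy points, the neighbor links, and the function name greedy_search are all made up for this example.

```python
import math


def greedy_search(points, neighbors, query, entry):
    """Walk the graph from `entry`, always moving to the neighbor closest to `query`.

    Returns the index of the point where the walk stops and the number of
    distance evaluations performed.
    """
    current = entry
    current_dist = math.dist(points[current], query)
    evaluations = 1
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = math.dist(points[n], query)
            evaluations += 1
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            # No neighbor is closer than the current node, so stop here.
            return current, evaluations
        current, current_dist = best, best_dist


# Toy data (hypothetical): 8 points on a line. Each point links to its
# adjacent points, and points 0 and 4 share a long-range "shortcut" link.
points = [(float(i), 0.0) for i in range(8)]
neighbors = {
    0: [1, 4],
    1: [0, 2],
    2: [1, 3],
    3: [2, 4],
    4: [3, 5, 0],
    5: [4, 6],
    6: [5, 7],
    7: [6],
}

nearest, evaluations = greedy_search(points, neighbors, query=(5.2, 0.0), entry=0)
print(nearest)
```

Starting from point 0, the shortcut link lets the walk jump straight to point 4 and then step to point 5 instead of visiting every point in between. That is the effect the long-range connections are meant to provide.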

Setting Up Python

This article uses Python, so it must be installed on your computer. We will use Python and LangChain to load vectors into Azure Cosmos DB for MongoDB vCore and perform a similarity search using the HNSW index.

Python version 3.11.4 was used during development and testing for this article.

The first step is to set up your Python virtual environment.

python -m venv venv

Create a new file called ‘requirements.txt’ in your working directory and add the following to the file:

pymongo 
openai
python-dotenv
tiktoken
langchain-openai
azure-storage-blob
azure-identity
langchain==0.1.11
langchain-community==0.0.27
langchain-core==0.1.30

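Activate your environment and install the dependencies by running the following commands. The activation command shown is for Windows; on macOS or Linux, use source venv/bin/activate instead.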

venv\Scripts\activate
python -m pip install -r requirements.txt

Create a file named ‘.env’ to store the environment variables required to connect to your LLM (OpenAI in this article) and Azure Cosmos DB.

OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
Environment Variable | Description
OPENAI_API_KEY | The key to connect to the OpenAI API. If you do not have an OpenAI API key, you can get one by following the guidelines outlined here.
MONGO_CONNECTION_STRING | The connection string for Azure Cosmos DB for MongoDB vCore, demonstrated here.
.env file variables
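Before running the later scripts, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name missing_env_vars and the placeholder key value are hypothetical and not part of the article's code.

```python
from os import environ

# The two variables this article stores in the '.env' file.
REQUIRED_VARS = ("OPENAI_API_KEY", "MONGO_CONNECTION_STRING")


def missing_env_vars(env=environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Example with a plain dict standing in for the real environment.
example_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(missing_env_vars(example_env))  # ['MONGO_CONNECTION_STRING']
```

In a real script you would call load_dotenv() first and then check missing_env_vars() with no arguments, stopping early if the returned list is not empty.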

Using LangChain with Azure Cosmos DB for MongoDB vCore

We can reuse the JSONLoader described in the article “LangChain Vector Search with Cosmos DB for MongoDB.” Create the ‘jsonloader.py’ file from that article in your working directory before proceeding.

Download sample Dataset

Download and unzip the documents folder containing the sample JSON files into your working directory. These files contain snippets from books on rocket engineering and interstellar space propulsion for our vector search hands-on.

Download: Rocket_Propulsion_Data.zip.

Creating Cosmos DB Loader and HNSW Index

Create a Python file called ‘cosmosdbloader.py’ in your working directory and proceed to add the following code to create our Cosmos DB Loader.

from os import environ
from pathlib import Path
from typing import List, Optional, Union
from dotenv import load_dotenv
from pymongo import MongoClient
from jsonloader import JSONLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)


load_dotenv(override=True)


class CosmosDBLoader():
    def __init__(
            self,
            file_path: Union[str, Path],
            enable_HNSW: bool = False):
        self.file_path = Path(file_path).resolve()
        self.enable_HNSW = enable_HNSW

    def load(self):
        """load embeddings from file_path into cosmosDB vector store"""

        #variable from '.env' file
        OPENAI_API_KEY = environ.get("OPENAI_API_KEY")
        MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")

        #hardcoded variables
        DB_NAME = "research"
        COLLECTION_NAME = "resources"
        EMBEDDING_FIELD_NAME = "embedding"
        INDEX_NAME = "vectorSearchIndex"

        client = MongoClient(MONGO_CONNECTION_STRING)
        db = client[DB_NAME]
        collection = db[COLLECTION_NAME]

        loader = JSONLoader(self.file_path)

        docs = loader.load()

        #load documents into Cosmos DB Vector Store
        vector_store = AzureCosmosDBVectorSearch.from_documents(
            docs,
            OpenAIEmbeddings(disallowed_special=()),
            collection=collection,
            index_name=INDEX_NAME)

        if not vector_store.index_exists():
            #Create an index for vector search
            num_lists = 1 #for a small demo, you can start with numLists set to 1 to perform a brute-force search across all vectors.
            dimensions = 1536

            if self.enable_HNSW:
                print('VECTOR_HNSW')
                # m= 16 is default; higher value is suitable for datasets with high dimensionality and/or high accuracy requirements
                # ef_construction= 64 is default; Higher value will result in better index quality and higher accuracy, but it will also increase the time required to build the index.
                vector_store.create_index(similarity=CosmosDBSimilarityType.COS,
                                          kind=CosmosDBVectorSearchType.VECTOR_HNSW,
                                          m=16,
                                          ef_construction=64,
                                          dimensions=dimensions)
            else:
                print('VECTOR_IVF')
                vector_store.create_index(kind=CosmosDBVectorSearchType.VECTOR_IVF,
                                          num_lists=num_lists,
                                          dimensions=dimensions,
                                          similarity=CosmosDBSimilarityType.COS)


CosmosDBLoader loads our sample documents into the Cosmos DB vector store. The constructor, __init__, stores the file path and a flag that enables Hierarchical Navigable Small World (HNSW) indexing.

The load method uses the JSONLoader to create the LangChain documents, ‘docs’, and loads them into Azure Cosmos DB for MongoDB. If the vector search index does not already exist, it creates one: an HNSW index (with m, ef_construction, dimensions, and similarity type) when self.enable_HNSW=True, or an IVF index (with the number of lists, dimensions, and similarity type) otherwise.
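For readers curious what the HNSW branch roughly corresponds to on the database side, here is a sketch of the index definition as I understand it from Microsoft's documentation for the createIndexes command with cosmosSearchOptions. Treat it as an assumption rather than a description of LangChain's internals. The helper name hnsw_index_command is hypothetical, and the vector field name "vectorContent" is what I believe LangChain uses by default, so verify it against your own documents.

```python
def hnsw_index_command(collection_name, index_name, vector_field,
                       dimensions=1536, m=16, ef_construction=64,
                       similarity="COS"):
    """Build a createIndexes command document for an HNSW vector index.

    The shape follows Microsoft's documented cosmosSearchOptions format
    (an assumption on my part; check the current docs before relying on it).
    """
    return {
        "createIndexes": collection_name,
        "indexes": [
            {
                "name": index_name,
                # The field that stores the embedding vector.
                "key": {vector_field: "cosmosSearch"},
                "cosmosSearchOptions": {
                    "kind": "vector-hnsw",
                    "m": m,
                    "efConstruction": ef_construction,
                    "similarity": similarity,
                    "dimensions": dimensions,
                },
            }
        ],
    }


# "vectorContent" is assumed to be the embedding field name; adjust as needed.
command = hnsw_index_command("resources", "vectorSearchIndex", "vectorContent")
print(command["indexes"][0]["cosmosSearchOptions"])
```

With a live pymongo connection, a document like this could be sent with db.command(command). In this article the index is created through vector_store.create_index instead, so the sketch is only meant to show where m, ef_construction, dimensions, and the similarity type end up.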

Loading Data into Cosmos DB Vector Store

Finally, our data can be loaded into Azure Cosmos DB for MongoDB vCore utilizing the HNSW vector index.

To load the data, create another Python file in your working directory named ‘main.py’ and add the following code. Passing enable_HNSW=True tells the CosmosDBLoader to create an HNSW index instead of the default IVF index.

from cosmosdbloader import CosmosDBLoader


file_names = ['documents/Rocket_Propulsion_Elements_with_images.json',
              'documents/Introduction_To_Rocket_Science_And_Engineering_with_images.json']

# Doubles the list, so each file is loaded twice.
file_names += file_names

for file_name in file_names:
    CosmosDBLoader(file_name, enable_HNSW=True).load()

From your terminal, please proceed to execute the following command:

python main.py
Output: VECTOR_HNSW

Verify the successful loading of documents using MongoDB Compass (or a similar tool).

MongoDB Compass showing loaded data in Azure Cosmos DB for MongoDB

LangChain Vector Search on Azure Cosmos DB for MongoDB

Once our data is loaded and the search index is created, we can initiate a vector search. This type of search allows us to find similar items based on their data rather than relying on exact matches in a property field. For example, the following Python script connects to Azure Cosmos DB and performs a similarity search. It then retrieves and displays metadata and content for the top 10 documents matching the query “supersonic combustion.”

from os import environ
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azure_cosmos_db import AzureCosmosDBVectorSearch
import time

load_dotenv(override=True)

#variable from '.env' file
MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")

#hardcoded variables
DB_NAME = "research"
COLLECTION_NAME = "resources"
INDEX_NAME = "vectorSearchIndex"

#connect to Azure Cosmos DB for vector search
vector_store = AzureCosmosDBVectorSearch.from_connection_string(
    MONGO_CONNECTION_STRING,
    DB_NAME + "." + COLLECTION_NAME,
    OpenAIEmbeddings(disallowed_special=()),
    index_name=INDEX_NAME)


start_time = time.time()

#query to use in similarity_search
query = "supersonic combustion"

docs = vector_store.similarity_search(query,k=10)

#format results from search
for doc in docs:
    print({'id': doc.metadata["page_id"],
           'title': doc.metadata["title"],
           'source': f"{doc.metadata['chapter']} (page-{doc.metadata['pagenumber']})",
           'content': doc.page_content})


print(f"--- {time.time() - start_time:.2f} seconds ---")

When executing the code, it will display the document content and metadata retrieved for the search query “supersonic combustion”.
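The index in this article is created with cosine similarity (CosmosDBSimilarityType.COS). As a rough illustration of what “similar” means here, the following sketch ranks a few vectors by cosine similarity using an exhaustive comparison, which is the exact result that an HNSW index approximates. The three-dimensional vectors and their labels are invented for the example; real OpenAI embeddings in this article have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the names of the k vectors most similar to `query`, best first."""
    scored = [(cosine_similarity(query, v), name) for name, v in vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up toy "embeddings" for illustration only.
vectors = {
    "scramjet": [0.9, 0.1, 0.0],
    "ion drive": [0.1, 0.9, 0.1],
    "ramjet": [0.8, 0.3, 0.1],
}

results = top_k([1.0, 0.0, 0.0], vectors, k=2)
print(results)  # ['scramjet', 'ramjet']
```

An exhaustive comparison like this scales linearly with the number of stored vectors, which is why an approximate index such as HNSW becomes attractive as the collection grows.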

As an additional point, the similarity_search function accepts kind=CosmosDBVectorSearchType.VECTOR_HNSW so that the ef_search value can be passed to Azure Cosmos DB for an HNSW vector search. However, this caused an error, so I kept the default setting and will update the code once the issue is resolved.

Clean-Up! In this article we used an Azure Cosmos DB for MongoDB cluster tier of M50. Please delete the cluster to avoid additional charges.

Great job! You’ve successfully utilized LangChain to upload documents from various JSON files into Azure Cosmos DB for MongoDB vCore, incorporating an HNSW vector index. Next, you can look at using a RAG pattern with the following series: LangChain RAG with React, FastAPI, Cosmos DB Vector: Part 1.

2 responses to “Azure Cosmos DB for MongoDB HNSW Vector Search”

  1. Unable to create index
    here is the error
    OperationFailure: cosmosSearchOptions, full error: {‘ok’: 0.0, ‘errmsg’: ‘cosmosSearchOptions’, ‘code’: 197, ‘codeName’: ‘InvalidIndexSpecificationOption’}

    1. I receive a similar error when using an M30 cluster tier. Are you using at least an M40 cluster tier – the minimum tier size: Create a vector index using HNSW

      In addition, have you enabled the Preview Feature for HNSW?
