The second part of the LangChain RAG Pattern with React, FastAPI, and Cosmos DB Vector Store series continues from Part 1 by creating the FastAPI interface. This part builds a robust and efficient web API with FastAPI that will be consumed by the React web application developed in Part 3. FastAPI lets us handle requests and responses cleanly while maintaining strong performance, and it sets the stage for the full implementation of the LangChain RAG pattern.

In this article
- Prerequisites
- FastAPI: What is it?
- Download the Project
- Setting Up Python Environment
- Perform a Vector Search
- Leverage RAG to Answer Questions
- Walkthrough LangChain Q&A with RAG
Prerequisites
- If you don’t have an Azure subscription, create an Azure free account before you begin.
- Complete part 1 – LangChain RAG with React, FastAPI, Cosmos DB Vector: Part 1
FastAPI: What is it?
FastAPI is a cutting-edge web framework that enables developers to swiftly build APIs using Python 3.8+. Renowned for its remarkable performance, it rivals the speed of NodeJS and Go, thanks to its integration with Starlette and Pydantic. Not only does FastAPI expedite development by up to 300%*, but it also significantly reduces human-induced errors by 40%*. With excellent editor support and comprehensive documentation, FastAPI ensures an intuitive and easy-to-use experience, minimizing code duplication and maximizing productivity.
*estimation based on tests on an internal development team at FastAPI, building production applications
Download the Project
All the code and sample datasets are accessible to you on my GitHub.

Clone the Demo API project (aptly named demo_api) GitHub Repo
The demo_api project is set up to handle web requests and process data transactions against the Azure Cosmos DB for MongoDB vector database and an Azure Storage Account. This design keeps the code organized, making it easier to develop new features and maintain existing functionality.

Setting Up Python Environment
This tutorial uses Python to develop the FastAPI web interface, so Python must be set up on your computer. Python and LangChain are used for the vector search against Azure Cosmos DB for MongoDB vCore, as well as for executing the Q&A RAG chains. Python version 3.11.4 was used throughout the development and testing of this walkthrough.
First, set up your Python virtual environment in the demo_api directory.
python -m venv venv
Activate your environment and install the dependencies using the requirements file in the demo_api directory (the activation command below is for Windows; on macOS/Linux, use source venv/bin/activate):
venv\Scripts\activate
python -m pip install -r requirements.txt
Create a file named '.env' in the demo_api directory to store your environment variables.
VECTORDB_ENDPOINT='https://[your-web-app].azurewebsites.net/'
VECTORDB_API_KEY='**your_key**'
OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
AZURE_STORAGE_CONNECTION_STRING="**"
AZURE_STORAGE_CONTAINER="images"
| Environment Variable | Description |
|---|---|
| VECTORDB_ENDPOINT | The URL endpoint for Weaviate. You can leave the default value: 'https://[your-web-app].azurewebsites.net/', as Weaviate is not used in this exercise. |
| VECTORDB_API_KEY | The Weaviate authentication api key. You can leave the default value: ‘**your_key**’ as Weaviate is not used in this exercise. |
| OPENAI_API_KEY | The key used to connect to the OpenAI API. If you do not possess an OpenAI API key, you can obtain one by following the guidelines outlined here. |
| MONGO_CONNECTION_STRING | The Connection string for Azure Cosmos DB for MongoDB vCore (see here) |
| AZURE_STORAGE_CONNECTION_STRING | The Connection string for Azure Storage Account (see here) |
| AZURE_STORAGE_CONTAINER | The container name used from part 1, defaults to ‘images‘. |
In the GitHub repository, the setup is arranged to facilitate the transition between different vector storage systems, specifically Azure Cosmos DB for MongoDB and Weaviate. However, a thorough exploration of this capability is slated to be introduced in an upcoming article.
With the environment configured and variables set up, we are ready to start the FastAPI server. Run the following command from the demo_api directory:
python main.py
The FastAPI server launches on the localhost loopback 127.0.0.1, port 8000, by default. You can access the Swagger documentation at http://127.0.0.1:8000/docs
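As a quick sanity check, you can call the root endpoint (defined in main.py later in this article), which simply returns "running". A minimal sketch using the requests package, assuming the server is running locally on port 8000:

import requests

# The demo_api root endpoint returns a plain "running" message.
response = requests.get("http://127.0.0.1:8000/")
print(response.status_code, response.json())  # expected: 200 running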

Perform a Vector Search
With our FastAPI service running locally, we can test a vector search against our vector database using the /search/{query} endpoint. In this scenario, we submit the query "what is a turbofan" and the endpoint returns a list of matching resources.
Click Try It out for /search/{query}.

Enter “what is a turbofan” for the query and click Execute.

The response consists of a JSON array containing resources depicted below.
{resource_id": "b801d68a8a0d4f1dac72dc6c170c395b",
"page_id": "cd3eea19611c423aaaf85e6da691f23d",
"title": "Rocket Propulsion Elements",
"source": "Chapter 1 - Classification (page-2)",
"content":"..text.."}
Leverage RAG to Answer Questions
The retrieved resources (documents) can serve as grounding for an LLM when answering questions. This is the purpose of our Q&A (question and answer) RAG endpoint: /search/qa/{query}. Upon submitting the query "what is a turbofan", the response includes a 'text' value in addition to the relevant documents. This 'text' is the LLM's answer, grounded in the documents returned by the vector search for the query.

The response from our RAG endpoint includes the LLM ‘answer’ to our query stored in the ‘text’ value along with a list of the resources used to support the answer under ‘ResourceCollection’.
{
    "text": "A turbofan is a type of air-breathing engine that uses a fan to compress air and mix it with fuel for combustion. It is a type of ducted engine that is more fuel-efficient than a turbojet. Turbofans are commonly used in commercial aircraft for propulsion.",
    "ResourceCollection": [
        {
            "resource_id": "b801d68a8a0d4f1dac72dc6c170c395b",
            "page_id": "cd3eea19611c423aaaf85e6da691f23d",
            "title": "Rocket Propulsion Elements",
            "source": "Chapter 1 - Classification (page-2)",
            "content": "....text..."
        },
        ...
    ]
}
The endpoint first retrieves documents via a vector search, then prompts the LLM to produce the answer using both the documents and the query.
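The Q&A endpoint can be exercised the same way. A minimal, hypothetical client sketch that prints the grounded answer along with its supporting sources:

import requests
from urllib.parse import quote

query = "what is a turbofan"
result = requests.get(f"http://127.0.0.1:8000/search/qa/{quote(query)}").json()

# 'text' holds the LLM answer; 'ResourceCollection' lists the documents that ground it.
print(result["text"])
for resource in result["ResourceCollection"]:
    print("-", resource["title"], resource["source"])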

Walkthrough LangChain Q&A with RAG
For the RAG endpoints in the API, requests flow from the web search components to the search service and finally to the data components. In our case, the MongoDB data search component connects to Azure Cosmos DB for MongoDB vCore. The layers exchange model objects, and most of the LangChain code resides in the service layer. I implemented this approach to make data sources interchangeable while reusing the same chains.

Web Layer
The Web layer is responsible for routing requests and managing communication with the caller. In the context of the Q&A RAG, we only have two code files: the primary FastAPI application (main.py) and the API router for search (search.py).
main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from web import search, content

app = FastAPI()

origins = ["*"]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(search.router)
app.include_router(content.router)


@app.get("/")
def get() -> str:
    return "running"


if __name__ == "__main__":
    import uvicorn

    uvicorn.run("main:app", reload=True)
This code, main.py, sets up the FastAPI web application with CORS middleware enabled, allowing it to handle requests from different origins. It includes the routers for the search and content-related endpoints and defines a default endpoint ("/") that returns a simple "running" message, which I used as a quick health check during development and testing. Finally, uvicorn runs the FastAPI application with auto-reload enabled, which shortens the edit-and-test loop during development.
web/search.py
from fastapi import APIRouter
from service import search as search
from model.resource import Resource
from model.airesults import AIResults

router = APIRouter(prefix="/search")


@router.get("/{query}")
def get_search(query) -> list[Resource]:
    return search.get_query(query)


@router.get("/summary/{query}")
def get_query_summary(query) -> AIResults:
    return search.get_query_summary(query)


@router.get("/qa/{query}")
def get_query_qa(query) -> AIResults:
    return search.get_qa_from_query(query)
The web search code manages the search-related endpoints. It imports functions from the search service module, along with the models for resources (Resource) and AI results (AIResults). The router is prefixed with "/search" and exposes three GET routes, each serving a distinct purpose.
The first endpoint, /search/{query}, retrieves search results for the provided query. It invokes the get_query function from the search service module, which returns a list of Resource objects representing the matches.
The /search/summary/{query} route fetches a summary of the search results for the provided query. It calls the get_query_summary function from the search service module and returns an AIResults object containing the summarized outcome.
Lastly, the /search/qa/{query} route returns question and answer results using the LangChain RAG pattern, by invoking the get_qa_from_query function from the search service module.
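The Resource and AIResults models live in the repository's model package and are not reproduced in this article. Based on the fields used in the code and in the JSON responses above, a minimal Pydantic-style sketch of their shape might look like the following (the exact definitions in the repository may differ):

from pydantic import BaseModel


class Resource(BaseModel):
    # One matching document chunk returned by the vector search.
    resource_id: str
    page_id: str
    title: str
    source: str
    content: str


class AIResults(BaseModel):
    # The LLM answer ('text') plus the resources used to ground it.
    text: str
    ResourceCollection: list[Resource]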
Service Layer
The service layer is the foundation for our primary business logic. In this particular use case, it is also where the LangChain code lives, making it a central component of the application.
service/search.py
from data.mongodb import search as search
from langchain.docstore.document import Document
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from model.airesults import AIResults
from model.resource import Resource

template: str = """Use the following pieces of context to answer the question at the end.
If none of the pieces of context answer the question, just say you don't know.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
{context}
Question: {question}
Answer:"""


def get_query(query: str) -> list[Resource]:
    resources, docs = search.similarity_search(query)
    return resources


def get_query_summary(query: str) -> AIResults:
    prompt_template = """Write a summary of the following:
"{text}"
CONCISE SUMMARY:"""
    prompt = PromptTemplate.from_template(prompt_template)
    resources, docs = search.similarity_search(query)
    if len(resources) == 0:
        return AIResults(text="No Documents Found", ResourceCollection=resources)
    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")
    return AIResults(stuff_chain.run(docs), resources)


def get_qa_from_query(query: str) -> AIResults:
    resources, docs = search.similarity_search(query)
    if len(resources) == 0:
        return AIResults(text="No Documents Found", ResourceCollection=resources)
    custom_rag_prompt = PromptTemplate.from_template(template)
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    content = format_docs(docs)
    rag_chain = (
        {"context": lambda x: content, "question": RunnablePassthrough()}
        | custom_rag_prompt
        | llm
        | StrOutputParser()
    )
    return AIResults(text=rag_chain.invoke(query), ResourceCollection=resources)
The service search.py integrates database interactions, language models, and prompt templates to deliver summarization and question-answering based on user queries. It first defines a template string named template that serves as the blueprint for the RAG question-and-answer prompt, with placeholders for the context and the question. The search service comprises three functions:
The first function, get_query(query: str) -> list[Resource], accepts a query string, conducts a vector search in the database for resources similar to the query, and returns a list of matching resources.
The second function, get_query_summary(query: str) -> AIResults, generates a concise summary of the documents related to the given query. It first retrieves relevant documents via a vector search, then uses a StuffDocumentsChain to summarize them.
The final function, get_qa_from_query(query: str) -> AIResults, retrieves documents related to the query and crafts an answer based on them. It uses the global template along with a custom_rag_prompt to pass the documents returned from the vector search into the LLM.
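For reference, these service functions can also be exercised directly from a Python shell, without the web layer, which is handy when debugging chains. This sketch assumes the .env file is configured and the interpreter is started from the demo_api directory:

from service import search

# Runs the vector search plus the RAG chain and returns an AIResults object.
result = search.get_qa_from_query("what is a turbofan")
print(result.text)
for resource in result.ResourceCollection:
    print("-", resource.title, resource.source)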
Data Layer
Finally, we have reached the Data Layer. Although the GitHub repository includes code for both Weaviate and Cosmos DB for MongoDB (the code simply refers to it as MongoDB), our focus in this series is exclusively on connecting to Cosmos DB. In the Data Layer, I use the module as a singleton to handle database-related connections in the init.py file. While this approach is clean, I often contemplate the use of global variables. In addition to the init.py file, there is also the search.py file, which executes the vector search against Cosmos DB.
data/mongodb/init.py
from os import environ
from dotenv import load_dotenv
from pymongo import MongoClient
from pymongo.collection import Collection
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azure_cosmos_db import AzureCosmosDBVectorSearch

load_dotenv(override=True)

collection: Collection | None = None
vector_store: AzureCosmosDBVectorSearch | None = None


def mongodb_init():
    MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")
    DB_NAME = "research"
    COLLECTION_NAME = "resources"
    INDEX_NAME = "vectorSearchIndex"

    global collection, vector_store
    client = MongoClient(MONGO_CONNECTION_STRING)
    db = client[DB_NAME]
    collection = db[COLLECTION_NAME]
    vector_store = AzureCosmosDBVectorSearch.from_connection_string(
        MONGO_CONNECTION_STRING,
        DB_NAME + "." + COLLECTION_NAME,
        OpenAIEmbeddings(disallowed_special=()),
        index_name=INDEX_NAME,
    )


mongodb_init()
The init.py file first loads environment variables from a .env file using load_dotenv(override=True). It then declares the global variables collection (representing the Cosmos DB for MongoDB collection) and vector_store (representing the Cosmos DB vector search). The MongoClient is initialized with the MongoDB connection string, and the database and collection objects are retrieved. Finally, the connection string, the namespace (database.collection), and an OpenAIEmbeddings instance are passed into the AzureCosmosDBVectorSearch.from_connection_string() method to create the vector store.
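The "module as a singleton" approach works because Python caches imported modules in sys.modules, so mongodb_init() runs only once per process no matter how many modules import from it. A small illustrative sketch, assuming the demo_api package layout described above:

import sys
from data.mongodb import init as first_import
from data.mongodb import init as second_import

# Both names refer to the same cached module object, so the MongoClient and
# AzureCosmosDBVectorSearch created in mongodb_init() were constructed only once.
assert first_import is second_import
assert first_import.vector_store is second_import.vector_store
print("data.mongodb.init" in sys.modules)  # True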
data/mongodb/search.py
from model.resource import Resource
from .init import collection, vector_store
from langchain.docstore.document import Document
from typing import List, Optional, Union


def results_to_model(result: Document) -> Resource:
    return Resource(resource_id=result.metadata["resource_id"],
                    page_id=result.metadata["page_id"],
                    title=result.metadata["title"],
                    source=f"{result.metadata['chapter']} (page-{result.metadata['pagenumber']})",
                    content=result.page_content)


def similarity_search(query: str) -> tuple[list[Resource], list[Document]]:
    docs = vector_store.similarity_search_with_score(query, 4)

    # Cosine Similarity:
    # It measures the cosine of the angle between two vectors in an n-dimensional space.
    # The values of similarity metrics typically range between 0 and 1, with higher values
    # indicating greater similarity between the vectors.
    docs_filters = [doc for doc, score in docs if score >= .75]

    # List the scores for documents
    for doc, score in docs:
        print(score)

    # Print number of documents passing score threshold
    print(len(docs_filters))

    return [results_to_model(document) for document in docs_filters], docs_filters
The data retrieval code accesses the global variables (collection, vector_store) from our singleton module and currently contains two functions, one for data transformation and one for similarity search.
The first function, results_to_model(result: Document) -> Resource, accepts a LangChain Document object and returns a Resource object. It extracts metadata attributes from the Document, such as resource_id, page_id, title, chapter, and page number. This keeps the resources returned by the vector search usable outside of LangChain.
The second function, similarity_search(query: str) -> tuple[list[Resource], list[Document]], conducts a similarity search for the given query. It invokes the similarity_search_with_score method of vector_store to retrieve documents similar to the query, along with their similarity scores. It filters the retrieved documents using a threshold score of 0.75, transforms the filtered documents into Resource objects via results_to_model, and returns both the transformed Resource objects and the original Document objects.
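A quick way to see the score filtering in action is to call the data layer directly (again assuming the .env configuration from earlier); the scores printed inside the function show which documents clear the 0.75 threshold:

from data.mongodb import search

# Returns (list[Resource], list[Document]); similarity scores are printed inside the function.
resources, documents = search.similarity_search("what is a turbofan")
print(f"{len(resources)} resources passed the 0.75 similarity threshold")
for resource in resources:
    print("-", resource.source)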
Congratulations on completing the setup of the web framework! It took significant effort, but it broadens our foundation for the upcoming tasks in Part 3.
