Chat With Your PDFs — Deploying a Simple RAG Application with Langchain, OpenAI and Choreo
Large Language Models (LLMs) have turned the technology world upside down in the past two years and continue to disrupt the landscape. What ChatGPT started has been followed by other technology giants with LLMs like Claude, Gemini, and Llama. Both tech-savvy users and those less familiar with technology have integrated LLMs into their daily lives.
One caveat of LLMs is that they do not have context about our own data or custom information. LLMs are trained on a fixed data set and can answer almost any question we ask using their language skills, but if we ask about a specific piece of custom data that lies outside their knowledge base, they struggle and hallucinate. RAG, or Retrieval-Augmented Generation, is a type of application designed to overcome this challenge: it combines a custom knowledge base of our own with an LLM's language ability to refine the output of the LLM and answer queries about custom or specific scenarios more accurately.
A bird's-eye view of a RAG system is shown below.
In simple terms, when a user asks a question, a RAG application queries the connected knowledge base for relevant context, bundles that context with the user's query, and sends both to the LLM. The LLM then processes the data and provides an appropriate answer.
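Conceptually the whole loop is only a few lines. Here is a rough Python sketch, where retrieve_context and call_llm are hypothetical stand-ins for the knowledge-base lookup and the LLM call we build later in this article:
# A rough sketch of the retrieve-then-generate flow described above.
# retrieve_context() and call_llm() are hypothetical placeholders, not
# functions from any specific library.
def retrieve_context(question: str) -> str:
    return "...chunks fetched from the vector database..."

def call_llm(prompt: str) -> str:
    return "...the LLM's answer..."

def answer(question: str) -> str:
    context = retrieve_context(question)          # look up relevant chunks
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)                       # LLM answers using that context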
Steps of Creating a RAG Application
We can divide the creation of a simple RAG application into 3 parts.
- Data Collection & Preparation
- Embedding and indexing data
- Retrieving indexed data and querying the LLM
Let's look at the 3 steps in detail below.
Data Collection and Preparation
In this step, all relevant data must be collected and pre-processed if necessary. This data can be PDFs, Markdown files, text files, or even plain text.
Embedding and Indexing Data
In this step, we send the collected data to what is called an "embedding model", which returns a vector representation, a numerical representation of the provided data. These "vector embeddings" are then stored in a "vector database", a database built to store and search mathematical representations of structured and unstructured data.
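To make this concrete, here is a minimal sketch using the OpenAI embedding model we rely on later in this article; it assumes OPENAI_API_KEY is already set in the environment:
# Minimal sketch: turn a sentence into a vector embedding.
# Assumes OPENAI_API_KEY is set in the environment.
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
vector = embedding.embed_query("Dark energy drives the universe's accelerating expansion.")
print(len(vector))  # 1536 numbers for this model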
Retrieving Data Embeddings and Querying the LLM
When a prompt is provided, the vector database is queried for context relevant to the prompt, and the retrieved context is sent along with the prompt to an LLM, which takes both into account and responds with a relevant answer.
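Later in this article, Langchain's RetrievalQA chain handles this step for us, but under the hood it boils down to a similarity search like the sketch below (it assumes a Pinecone index named pdf-index that already holds the embeddings, which is what we create further down):
# Sketch of the retrieval step against an existing Pinecone index.
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = PineconeVectorStore.from_existing_index("pdf-index", embedding)

# The most similar chunks become the context sent to the LLM.
docs = vectorstore.similarity_search("What is dark energy?", k=4)
context = "\n\n".join(doc.page_content for doc in docs)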
As the title already says, in this article let's see how we can create a RAG application to ask questions about a PDF file we provide. In a sense, we could “chat” with our PDF file by utilizing this application.
To create the RAG application we use Langchain, a popular Python framework for creating RAG applications. As the underlying models, we use OpenAI's GPT and embedding models. As the knowledge base, we use Pinecone, a popular vector database that you can try online for free. Finally, we need to expose our application to the internet for it to be useful; for this, we use WSO2's state-of-the-art integration platform as a service, Choreo. We will deploy our Python/Langchain application on Choreo and expose it as an API anyone can invoke.
Overall, our tech stack looks like the following.
- Langchain — For creating the RAG application
- OpenAI Embedding Model — Embedding data
- Pinecone — Storing vector embeddings
- OpenAI GPT — LLM
- WSO2 Choreo — Building, Deploying, and Exposing our application to the public.
Simple RAG Architecture
Let's get coding!
Before we start coding our application, you will need to create three accounts.
- Choreo Account — lets you build and deploy your application
- OpenAI Platform Account — allows you to access OpenAI APIs
- Pinecone Account — allows you to access the Pinecone serverless vector database.
Then you need to get API keys for OpenAI and Pinecone.
- OpenAI — Go to https://platform.openai.com/api-keys and create a new secret key. Copy and save it in a safe location.
- Pinecone — Left Side Pane → Manage → API Keys → Create API Key. Copy and save it.
I am using Python 3.11 and PyCharm IDE to develop this project.
First, you need to create a new project in PyCharm (or your preferred IDE) and add the following requirements.txt.
langchain~=0.1.12
langchain-community
pinecone-client
langchain-openai
openai
langchain_pinecone
uvicorn~=0.30.5
fastapi~=0.112.0
pydantic~=1.10.17
pypdf
Then add your Pinecone API Key and OpenAI API key to a .env file. Make sure NEVER to push this file to git!
OPENAI_API_KEY=""
PINECONE_API_KEY=""
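A minimal .gitignore (you will see it in the final directory listing later in this article) only needs to keep the .env file and the usual Python artifacts out of version control; something like this is enough:
.env
__pycache__/
.venv/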
After that, copy this PDF I've extracted from NASA [1], about dark energy, to your directory. After this, your directory will look like this:
.
├── .env
├── dark-energy.pdf
└── requirements.txt
If you refer to our first step of creating a RAG application, the data collection (i.e., the PDF) and the project setup are now done. Next, we have to create the embeddings and store them in Pinecone. Let's create a file named embed.py for this purpose.
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")


def create_or_update_index(texts, index_name, embedding):
    # Create the Pinecone index if it doesn't exist yet, then store the embeddings
    pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            metric='cosine',
            dimension=1536,
            spec=ServerlessSpec(
                cloud='aws',
                region='us-east-1'
            )
        )
        print(f"Created new index: {index_name}")
        vectorstore = PineconeVectorStore.from_documents(
            texts,
            embedding=embedding,
            index_name=index_name
        )
    else:
        print(f"Index {index_name} already exists")
        vectorstore = PineconeVectorStore.from_existing_index(
            index_name=index_name,
            embedding=embedding
        )
    return vectorstore


# Load the PDF and split it into overlapping chunks
pdf_path = "dark-energy.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# Embed the chunks and store them in the Pinecone index
embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
index_name = "pdf-index"
vectorstore = create_or_update_index(texts, index_name, embedding)
Here we utilize the PyPDFLoader from the langchain_community library to load our stored PDF. Then we need to break our large PDF down into small chunks of text; for this we use CharacterTextSplitter, which takes a chunk_size (the maximum size of a chunk) and a chunk_overlap (the amount of overlap between two consecutive chunks). The web application "ChunkViz" visualizes the chunking and the overlap between chunks neatly.
The dark green parts are overlapping portions of the chunks.
After defining the text splitter, we can feed the documents into it and get back the chunked text.
texts = text_splitter.split_documents(documents)
After that, the next step is the embedding process. We need to send the chunked texts to an embedding model and get vector embeddings in return. In this case, we are using OpenAI's text-embedding-ada-002 model.
embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
After defining the embedding model, we can create a new index in the Pinecone vector database and store our embeddings using the create_or_update_index method. This method uses the Pinecone Python library's pc.create_index to create a new serverless index and PineconeVectorStore.from_documents to upload the embeddings to it.
After creating the embed.py file, install the dependencies by running:
pip install -r requirements.txt
Then you can run embed.py. (Make sure the two API keys from your .env file are available as environment variables when running locally, since os.getenv does not read the .env file by itself.)
python3 embed.py
This would take some time and if all goes well, you will see the following message.
Created new index: pdf-index
And now, if you go to the Pinecone UI, you'll see that a new index has been created!
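If you prefer to verify from code rather than the UI, the Pinecone client can also report index statistics; a quick sketch, assuming PINECONE_API_KEY is exported in your shell:
# Optional sanity check: confirm the chunks landed in Pinecone.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
stats = pc.Index("pdf-index").describe_index_stats()
print(stats.total_vector_count)  # number of stored chunks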
Now we have to create an API that allows a user to ask questions about the contents of the PDF. To build the API, we will use FastAPI, a powerful web framework for Python.
Create the following api.py file.
import os

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from pinecone import Pinecone

app = FastAPI()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)
embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
index_name = "pdf-index"
vectorstore = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embedding
)

# Initialize ChatOpenAI
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-4o',
    temperature=0.0
)

# Creating Prompt
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Provide a concise answer in 1-4 sentences:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# Creating a Langchain QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": PROMPT}
)


class Query(BaseModel):
    question: str


# Create API Endpoint
@app.post("/ask")
async def ask_question(query: Query):
    response = qa.invoke(query.question)
    return {"answer": response['result']}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
First, we do the necessary initialization: setting up the API keys, the Pinecone connection, and the LLM connection using OpenAI's gpt-4o model.
After that, we create a prompt template which will be sent to the LLM. Here you can see two replaceable parts, {context} and {question}; context is the data we receive from Pinecone regarding the user's question. This template is populated with the original question and the relevant context retrieved from Pinecone and then sent to the LLM.
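To see what actually reaches the model, you can render the template yourself; here is a small illustration where the context string is just a made-up placeholder:
# Illustration only: the context here is a made-up placeholder;
# in the real flow it comes from the Pinecone similarity search.
filled = PROMPT.format(
    context="Dark energy is the name given to the unknown driver of the universe's accelerating expansion.",
    question="What is dark energy?",
)
print(filled)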
Then we need to create a QA (Question Answering) chain using Langchain, wiring up the LLM, the retriever, and the prompt we created above. This QA chain will answer questions based on the provided context, using the LLM.
Now we can invoke this qa object inside an HTTP resource, /ask, to finalize our application.
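Before deploying, you can sanity-check the chain locally from a Python shell in the project directory (a quick sketch; it assumes both API keys are exported as environment variables):
# Quick local check of the QA chain before deploying.
from api import qa

print(qa.invoke("What is dark energy?")["result"])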
Deploying the RAG Application
Now that we have finished developing our application, we need to deploy it. For that, we will use Choreo, a Kubernetes-based internal developer platform as a service by WSO2. We can deploy this application as a Python service without having to create any Dockerfiles, images, or Kubernetes configurations; all the underlying plumbing is managed by Choreo.
First, you need to create a Procfile, which tells the buildpack how to start the application.
web: uvicorn api:app --host=0.0.0.0 --port=${PORT:-8080}
Then create a component-config.yaml file inside a .choreo directory. This is a Choreo-specific configuration that defines the endpoint exposed to the internet.
apiVersion: core.choreo.dev/v1beta1
kind: ComponentConfig
spec:
  inbound:
    - name: RAG API
      port: 8080
      type: REST
      networkVisibility: Public
This would be the final directory structure.
.
├── .choreo
│   └── component-config.yaml
├── .env
├── .gitignore
├── Procfile
├── api.py
├── dark-energy.pdf
├── embed.py
└── requirements.txt
Now push the code to a GitHub repository. You can find the full application at https://github.com/rashm1n/RAG101
Now you are all set to deploy your application. Create a Python service in Choreo by following the steps below.
- Create a choreo.dev account if you don't have one already.
- Create a project → Create a new Service and enter the following details.
- Component Display Name — RAGApp
- Public Repository URL — <Your Git repository URL>
- Buildpack — Python
- Python Project Directory — /
- Language Version — 3.11.x
And build the service component.
Now navigate to the deploy menu and click Configure & Deploy. Create two Environment Variables (Mark as Secret) and add your API Keys. Choreo will securely store these as Kubernetes Secrets.
Then click Next and deploy the component. The deployed application will look like the following.
Let's navigate to the Choreo Test Console and acquire a test API key to invoke our application. Since Choreo applications are protected by OAuth by default, you will need this API key to test the API. Copy the invoke URL as well.
Then use the API key and invoke URL in a curl command as below and invoke your component.
curl --location 'https://5d1fa89f-bb68-4cee-a1c9-5ca459e02768-dev.e1-us-cdp-2.choreoapis.dev/springproject/ragapp/v1.0/ask' \
--header 'accept: */*' \
--header 'API-Key: <API-Key>' \
--header 'Content-Type: application/json' \
--data '{"question": "What is dark energy?"}'
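The response is a small JSON object whose shape is defined by the /ask handler above; the answer text itself will vary with the model and the retrieved context:
{"answer": "<concise answer generated from the PDF's content>"}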
And you will get a response according to your provided data! Within a few minutes, you are able to go from code to cloud and invoke your application over the internet.
In this article, we discussed creating and deploying a RAG application that lets you ask questions about a PDF document of your choice. This is a very simple example of RAG and LLMs, but you can build much more complex applications using these tools and use Choreo to deploy them securely.
Please comment below if you have any questions. See you in another article.
Resources
[2] — https://choreo.dev