
At a basic level, how does a document chatbot work? At its core, it works just like ChatGPT. In ChatGPT, you can paste a block of text into the prompt and then ask it to summarize the text for you or generate answers based on the text.
Interacting with a single document, such as a PDF, Microsoft Word, or text file, works similarly. We extract all of the text from the document, pass it into an LLM prompt (here, ChatGPT), and then ask questions about the text. This is exactly how the ChatGPT example above works.
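As a concrete illustration, here is a minimal sketch of that idea using the OpenAI Python client (version 0.27, as pinned later in this blog); the file name and question are just placeholders:

```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # assumption: key set inline for brevity

# Read the full document text (placeholder file name).
with open("my_document.txt", "r") as f:
    document_text = f.read()

# Stuff the whole document into the prompt and ask a question about it.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer questions using only the provided text."},
        {"role": "user", "content": f"Text:\n{document_text}\n\nQuestion: What is this document about?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```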
Before delving into the following proof of concept, which employs the RAG concept, readers interested in learning about training methods for GPT can refer to the blog below: https://blogs.sap.com/2023/12/01/transforming-enterprise-ai-mastering-chatgpt-or-any-llm-training-fo...
RAG (Retrieval-Augmented Generation) -
Where it gets a little more interesting is when the document is very large, or when there are many documents we want to interact with. Passing all of the information from these documents into a single request to an LLM (Large Language Model) is impossible, since these requests usually have size (token) limits, and the request would fail if we tried to pass in too much information.
To overcome this, we need a way to send only the relevant bits of information from our documents to the LLM prompt. But how do we extract only the relevant information from our documents? This is where embeddings and vector stores come in.
Embeddings may be a little confusing if you have not heard of them before, so don't worry if they seem foreign at first. A bit of explanation, and using them as part of our setup, should make their purpose clearer.
An embedding allows us to organize and categorize text based on its semantic meaning. We split our documents into lots of small text chunks and use embeddings to characterize each chunk by its semantic meaning. An embedding transformer is used to convert each chunk of text into an embedding.
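For illustration, here is a small sketch of that chunking step using LangChain's RecursiveCharacterTextSplitter (the chunk sizes are arbitrary values you would tune for your own documents):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative sizes: ~1000 characters per chunk, with 100 characters of
# overlap so that sentences spanning a chunk boundary are not lost.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

long_text = "..."  # the full text extracted from a document
chunks = splitter.split_text(long_text)  # a list of small text chunks
```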
An embedding categorizes a piece of text by giving it a vector (coordinate) representation. Vectors (coordinates) that are close to each other represent pieces of information with similar meanings. The embedding vectors are stored in a vector store, along with the chunks of text corresponding to each embedding.
Once we have a prompt, we can use the embedding transformer to match it with the chunks of text that are most semantically relevant to it, which gives us a way to match our prompt with related chunks from the vector store. In our case, we use the OpenAI embedding transformer, and cosine similarity is used to calculate the similarity between the document chunks and a question.
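To make this concrete, here is a hedged sketch of computing two embeddings and their cosine similarity with LangChain's OpenAIEmbeddings wrapper (numpy is assumed to be available; it is not part of the install command below, and the two sentences are made up):

```python
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY in the environment

vec_a = np.array(embeddings.embed_query("How do I reset my password?"))
vec_b = np.array(embeddings.embed_query("Steps to recover account access"))

# Cosine similarity: a value close to 1.0 means the two texts are
# semantically similar, close to 0.0 means they are unrelated.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(similarity)
```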
Now that we have a smaller subset of the information which is relevant to our prompt, we can query the LLM with our initial prompt, while passing in only the relevant information as the context to our prompt.
This is what allows us to overcome the size limitation on LLM prompts: we use embeddings and a vector store to pass in only the information relevant to our query and let the LLM answer based on that.
So, how do we do this in LangChain? Fortunately, LangChain provides this functionality out of the box, and with a few short method calls, we are good to go. Let’s get started!
Let's start by processing a single PDF; we will move on to processing multiple documents later.
The first step is to create a Document from the PDF. A Document is the base class in LangChain that chains use to interact with information. If we look at the class definition of a Document, it is a very simple class, with just a page_content attribute that lets us access the text content of the Document.
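For example, a Document can be constructed by hand (a minimal sketch; the text and metadata here are made up):

```python
from langchain.schema import Document

# A Document is essentially just text plus optional metadata.
doc = Document(
    page_content="LangChain is a framework for building LLM applications.",
    metadata={"source": "example.txt", "page": 1},
)
print(doc.page_content)
```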
We use the DocumentLoaders that LangChain provides to convert a content source into a list of Documents, with one Document per page.
For example, there are DocumentLoaders that can convert PDFs, Word docs, text files, CSVs, Reddit, Twitter, and Discord sources, and much more, into a list of Documents that the LangChain chains can then work with. Those are some cool sources, so there is lots to play around with once you have these basics set up.
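Every loader follows the same pattern: construct it with a source and call load() to get a list of Documents. As a quick sketch with a plain text file (the file name is hypothetical):

```python
from langchain.document_loaders import TextLoader

loader = TextLoader("notes.txt")  # hypothetical text file
docs = loader.load()              # returns a list of Document objects
print(docs[0].page_content[:100])
```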
First, let’s create a directory for our project. You can create all this as we go along or clone the GitHub repository with this link https://github.com/AbhijeetKankani/CHATGPT/
Instead of LangChain, one can also use LlamaIndex; the differences are covered in the blog below -
https://blogs.sap.com/2023/12/01/langchain-vs-llamaindex-enhancing-llm-applications-on-sap-btp/
For the setup, we'll need to install several packages using the command below. While both Business Application Studio (BAS) and Visual Studio Code (VS Code) can be used, I recommend VS Code. When I ran this proof of concept (POC) in BAS, several dependencies were missing, which VS Code handled better.
pip install flask==2.0.1 openai==0.27.0 langchain==0.0.220 unstructured==0.7.11 chromadb==0.3.26 tiktoken==0.4.0
All of these libraries are also listed in the requirements.txt file.
Now that our project folders are set up, let's convert our PDF into a Document. We will use the PyPDFLoader class. Also, let's set up our OpenAI API key now; we will need it later.
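A minimal sketch of both steps might look like this (the PDF path is a placeholder, and note that PyPDFLoader additionally requires the pypdf package, which is not part of the install command above):

```python
import os
from langchain.document_loaders import PyPDFLoader

# Assumption: the key is set via an environment variable; use your own key.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

loader = PyPDFLoader("docs/my_file.pdf")  # placeholder path to your PDF
pages = loader.load()                     # one Document per page of the PDF
print(f"Loaded {len(pages)} pages")
```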
As explained earlier, we can use embeddings and vector stores to send only relevant information to our prompt. The steps we will need to follow are (see the sketch after this list):

1. Split the pages of the PDF into smaller text chunks.
2. Create an embedding for each chunk using the OpenAI embedding transformer.
3. Store the embeddings, together with their chunks, in a vector store.
4. For each question, retrieve the chunks most similar to it and pass only those to the LLM as context.
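Putting these steps together, here is a hedged end-to-end sketch using the Chroma vector store and a RetrievalQA chain (it reuses the pages list from the PyPDFLoader sketch above; the question is a placeholder):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Steps 1-3: embed the page chunks and store them in a Chroma vector store.
db = Chroma.from_documents(pages, OpenAIEmbeddings())

# Step 4: retrieve the chunks most similar to the question and "stuff" them
# into the prompt as context for the LLM.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
print(qa.run("What is this document about?"))
```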
How to use the API key is covered in the blog series below -
https://blogs.sap.com/2023/08/31/hello-world-openai-crafting-accurate-chatgpt-like-custom-search-for...