Artificial Intelligence and Machine Learning Blogs
Explore AI and ML blogs. Discover use cases, advancements, and the transformative potential of AI for businesses. Stay informed of trends and applications.
Rohit_KumarB
Advisor

In this blog post, we explore how the Streamlit Python library can be used to create an interactive chat interface that handles multiple PDF files. The walkthrough uses Retrieval-Augmented Generation (RAG) with the SAP HANA Cloud vector engine as the vector store. The end result is a conversational experience with a chatbot that answers questions using context retrieved from the uploaded PDF files.

There are some prerequisites that you must be familiar with. These include:

  • HANA Cloud Instance: You should have an active HANA Cloud Instance available for storing and retrieving vector embeddings.
  • SAP Gen AI HUB: The application uses pre-trained text embedding models provided through the SAP Gen AI HUB. Ensure that you have access to it.
  • Gen AI SDKs for LLMs: The language models used in the application are accessed through the Gen AI SDKs, which facilitate interaction with the AI models.
  • Basic Knowledge of Python: As the whole application is written in Python, you should be comfortable working with Python and its libraries.
  • LangChain: This Python package handles text chunking and the conversational retrieval chain in this application. Some knowledge of how it works is beneficial but not mandatory.

To build this application, you can use either Business Application Studio (BAS) or Visual Studio. Here's a step-by-step guide on how to proceed:

Creation of a New Folder: Start by creating a new folder. It serves as the central location for all the required files.

Virtual Environment Setup: Next, create a virtual environment and activate it. This gives your Python project an isolated sandbox in which you control the versions of the packages it uses, separate from the global Python interpreter, and it prevents conflicts between the version requirements of different projects.
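A virtual environment is typically created with python -m venv followed by a folder name, and then activated with the script for your shell. As a small sanity check, the sketch below reports whether the interpreter running it belongs to a virtual environment. The function name in_virtual_environment is my own and not part of the application.

```python
import sys


def in_virtual_environment() -> bool:
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
```

If this prints False after activation, the terminal is most likely still using the global interpreter.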

Creation of app.py: Then create a new file named app.py within the same folder. This file will hold the application code.

Refer to the Attached File: Lastly, a file containing the HTML and CSS templates for the chat messages is attached below. Download it and copy it into the same folder.

PS: The attachment is named html_template.txt. Change its extension to .py before adding it to the folder, and make sure the file name matches the import used later (from htmlTemplate import ..., so the file should be htmlTemplate.py).
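If the attachment is not at hand, the following is a minimal stand-in for htmlTemplate.py. It is my own sketch and not the author's attached file: the only things taken from the post are the three names (css, bot_template, user_template) and the {{MSG}} placeholder that the application replaces with the message text. The class names and colours are assumptions.

```python
# htmlTemplate.py (hypothetical stand-in for the attached template file)

css = """
<style>
.chat-message { padding: 1rem; border-radius: 0.5rem; margin-bottom: 1rem; }
.chat-message.user { background-color: #2b313e; color: #ffffff; }
.chat-message.bot { background-color: #475063; color: #ffffff; }
</style>
"""

# {{MSG}} is replaced with the bot's reply before rendering
bot_template = """
<div class="chat-message bot">
    <div class="message">{{MSG}}</div>
</div>
"""

# {{MSG}} is replaced with the user's question before rendering
user_template = """
<div class="chat-message user">
    <div class="message">{{MSG}}</div>
</div>
"""
```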

Python Libraries Needed:

We need a specific set of Python libraries. Make sure to install them in your Python environment before proceeding. The import statements are as follows:

import os
import uuid

import streamlit as st
from dotenv import load_dotenv
from hdbcli import dbapi
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.memory import ConversationBufferMemory
from langchain.chains.conversational_retrieval.base import ConversationalRetrievalChain
from langchain_community.vectorstores.hanavector import HanaDB
from gen_ai_hub.proxy.langchain.openai import OpenAIEmbeddings
from gen_ai_hub.proxy.langchain.init_models import init_llm
from htmlTemplate import css, bot_template, user_template

HANA Connection:

We use environment variables to store the HANA instance details. These variables are then used to establish a connection with HANA through the dbapi.connect() function. The address, port, user and password are available in the service key of the instance (BTP > Services > Instances > service key).

 

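load_dotenv()  # load variables from a local .env file into the environment
# Use connection settings from the environment
connection = dbapi.connect(
    address=os.environ.get("HANA_HOST_VECTOR"),
    # Environment variables are strings, so convert the port to an integer
    # (443 is the usual SQL port of SAP HANA Cloud)
    port=int(os.environ.get("HANA_PORT_VECTOR", "443")),
    user=os.environ.get("HANA_VECTOR_USER"),
    password=os.environ.get("HANA_VECTOR_PASS"),
    autocommit=True,
    # Certificate validation is switched off for convenience; enable it for production use
    sslValidateCertificate=False,
)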
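Because os.environ.get() silently returns None for a missing variable, a connection attempt with an incomplete .env file can fail with a confusing error. The sketch below checks the four variable names used above before connecting. The helper name missing_hana_settings is hypothetical and not part of the original application.

```python
import os

# Variable names taken from the connection code in this post
REQUIRED_SETTINGS = (
    "HANA_HOST_VECTOR",
    "HANA_PORT_VECTOR",
    "HANA_VECTOR_USER",
    "HANA_VECTOR_PASS",
)


def missing_hana_settings(env=os.environ):
    # Return the names of required settings that are unset or empty
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_hana_settings()
if missing:
    print("Missing HANA settings:", ", ".join(missing))
```

One way to use it is to call the helper right after load_dotenv() and stop with a clear message if the returned list is not empty.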

 

The get_pdf_text(pdf_docs) function reads multiple PDFs, extracting and concatenating all of their textual content into a single string, which is then returned.

 

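def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # extract_text() can return None or an empty string for pages
            # without a text layer (for example scanned images)
            text += page.extract_text() or ""
    return text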

 

The get_text_chunks(text) function takes a large string of text and divides it into smaller chunks. It uses the RecursiveCharacterTextSplitter with a chunk size of at most 1000 characters and an overlap of 200 characters between consecutive chunks, and returns the resulting list of chunks.

 

def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_text(text)
    return chunks
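To make the effect of chunk size and overlap concrete, here is a simplified sliding-window splitter in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to split at separators such as paragraph breaks, line breaks and spaces, so real chunk boundaries will differ). The function name sliding_window_chunks is my own.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    # Each new chunk starts (chunk_size - chunk_overlap) characters after
    # the previous one, so neighbouring chunks share chunk_overlap characters.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(chunk) for chunk in chunks])    # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears complete in at least one chunk, which helps retrieval.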

 

Note: The RecursiveCharacterTextSplitter class from the LangChain library breaks larger text inputs down into smaller, manageable chunks. This is essential in contexts like document retrieval or chatbot conversations, where large volumes of text are involved and each chunk must be small enough to embed and to pass to the model as context.

For more information on the different types of text splitters, please refer to this link: Text Splitters | 🦜️🔗 Langchain

The Document class serves as a data structure to bundle together a unit of text and its associated metadata. The purpose is to neatly encapsulate each text chunk and its additional attributes from the parsed PDFs into individual 'Document' objects for easier management and processing in the application.

 

 

class Document:
    def __init__(self, text, metadata=None):
        self.page_content = text
        self.metadata = metadata if metadata is not None else {}
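The application creates each Document with text only, so the metadata stays empty. As an illustration of what the metadata field could carry, the sketch below attaches the source file name and the chunk position to each chunk. The helper chunks_to_documents and the metadata keys are assumptions of mine, not part of the original code.

```python
class Document:
    def __init__(self, text, metadata=None):
        self.page_content = text
        self.metadata = metadata if metadata is not None else {}


def chunks_to_documents(chunks, source):
    # Hypothetical helper: record where each chunk came from
    return [
        Document(chunk, {"source": source, "chunk_index": index})
        for index, chunk in enumerate(chunks)
    ]


docs = chunks_to_documents(["first chunk", "second chunk"], "report.pdf")
print(docs[1].page_content, docs[1].metadata)
```

With metadata like this stored alongside the embeddings, an answer could later be traced back to the PDF it came from.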

 

The get_vectorstore(chunks, table_name) function wraps each text chunk in a Document object and generates embeddings with the OpenAI model text-embedding-ada-002, accessed through SAP's Gen AI Hub. It stores the documents and their embeddings in a HANA table via HanaDB and returns the resulting vectordb object for further use.

 

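def get_vectorstore(chunks, table_name):
    if not chunks:
        # Nothing to embed, for example when the PDFs contain no extractable text
        raise ValueError("No text chunks to embed")

    # chunk_size here is the number of texts sent per embedding request
    embeddings = OpenAIEmbeddings(proxy_model_name='text-embedding-ada-002', chunk_size=200)

    documents = [Document(chunk) for chunk in chunks]

    # Embed the documents and store them in the given HANA table
    vectordb = HanaDB.from_documents(connection=connection, documents=documents, embedding=embeddings, table_name=table_name)
    return vectordb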
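Once the embeddings are stored, retrieval means finding the stored vectors closest to the embedding of the question. The toy sketch below shows that idea with cosine similarity, which I believe is the default distance strategy of the HanaDB vector store. In the real application this comparison runs inside HANA, and the three-dimensional vectors here are made up (real text-embedding-ada-002 vectors have 1536 dimensions).

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    # stored is a list of (text, vector) pairs; return the k closest texts
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


stored = [
    ("invoice terms", [0.9, 0.1, 0.0]),
    ("holiday policy", [0.0, 0.2, 0.9]),
    ("payment deadline", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], stored))  # ['invoice terms', 'payment deadline']
```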

 

Models currently available in Gen AI Hub: Generative AI Hub | SAP Help Portal

The get_conversation_chain(vectordb) function initiates a process to enable a conversation chain in the context of a document retrieval task. It initializes a language model ('gpt-4'), defines a ConversationBufferMemory to store 'chat_history', and sets the retriever to operate on the previously created vector database, vectordb. The function then creates a conversational retrieval chain and returns this chain for maintaining a fluent conversation flow based on document content.

 

def get_conversation_chain(vectordb):
    # max_tokens caps the length of each generated answer
    llm = init_llm('gpt-4', max_tokens=100)
    # Keep the full chat history so follow-up questions have context
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectordb.as_retriever(),
        memory=memory,
    )
    return conversation_chain
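The memory object is what lets the chain answer follow-up questions: after each turn, the question and the answer are appended to the history as a pair of messages. The toy class below mimics only that behaviour in plain Python. It is a stand-in written for illustration, not LangChain's implementation, and the names Message and BufferMemorySketch are my own.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "human" or "ai"
    content: str


@dataclass
class BufferMemorySketch:
    messages: list = field(default_factory=list)

    def save_turn(self, question, answer):
        # One conversation turn adds two messages: the question, then the answer
        self.messages.append(Message("human", question))
        self.messages.append(Message("ai", answer))


memory = BufferMemorySketch()
memory.save_turn("What is the notice period?", "The contract states 30 days.")
memory.save_turn("And for managers?", "For managers it is 90 days.")
print([message.role for message in memory.messages])
```

Because messages are always stored in question and answer pairs, even positions in the history hold user messages and odd positions hold bot replies.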

 

Note: Import paths may vary between LangChain versions, since some features have been deprecated, so please refer to the latest documentation. This import worked for me:

from langchain.chains.conversational_retrieval.base import ConversationalRetrievalChain

The `handle_user_input(user_question)` function manages interaction with the Chatbot. Upon receiving a user question, it updates the conversation session state and the chat history. It then iterates over the chat history: displaying the user's messages and the bot's corresponding responses alternately using pre-defined templates, effectively handling the user's input and the bot's output within the conversational interface.

 

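def handle_user_input(user_question):
    # The chain only exists after documents have been processed
    if st.session_state.conversation is None:
        st.warning("Please upload and process your PDFs first.")
        return

    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']
    # Even positions hold user messages, odd positions hold bot replies
    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)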
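Since the templates are rendered with unsafe_allow_html=True, any HTML inside a question or an answer is interpreted by the browser. One possible hardening step, sketched below under my own assumptions, is to escape the message text before placing it in the template. The helper name render_message and the one-line template are hypothetical.

```python
import html


def render_message(template, message_text):
    # Escape <, >, & and quotes so the message is shown as text, not markup
    return template.replace("{{MSG}}", html.escape(message_text))


example_template = '<div class="message">{{MSG}}</div>'
print(render_message(example_template, "<script>alert(1)</script>"))
```

In handle_user_input, this would mean passing render_message(user_template, message.content) to st.write instead of calling replace directly.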

 

The main() function sets page configurations, initiates session states, handles user inputs, and manages PDF document uploads. Once documents are uploaded and processed, it utilizes the chat model to generate responses based on user queries.

After you click the 'Process' button, the raw text is extracted from the PDFs and broken down into chunks. Subsequently, vector embeddings are computed and a conversational chain is created. The 'Delete Embeddings' button allows users to remove the embeddings when they are no longer needed.

Therefore, by running this application, you will be able to chat with an AI model that generates responses directly from the content in the set of your uploaded PDF files. This is especially useful for exploring and interacting with large sets of documents in a conversational manner.

 

def main():
    st.set_page_config(page_title="Chat with multiple PDFs")

    st.write(css, unsafe_allow_html=True)

    # Streamlit reruns the script on every interaction, so anything that
    # must survive a rerun is kept in session state
    for key in ("conversation", "chat_history", "vectordb"):
        if key not in st.session_state:
            st.session_state[key] = None

    st.header("Chat with multiple PDFs")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_user_input(user_question)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'",
            accept_multiple_files=True,
        )

        if st.button("Process"):
            if not pdf_docs:
                st.warning("Please upload at least one PDF first.")
            else:
                with st.spinner("Processing"):
                    raw_text = get_pdf_text(pdf_docs)
                    text_chunks = get_text_chunks(raw_text)

                    # One table per processing run; .hex avoids hyphens in the name
                    table_name = "Embeddings" + uuid.uuid4().hex
                    vectordb = get_vectorstore(text_chunks, table_name)

                    st.session_state.vectordb = vectordb
                    st.session_state.conversation = get_conversation_chain(vectordb)

        if st.session_state.vectordb is not None and st.button("Delete Embeddings"):
            with st.spinner("Deleting"):
                # HanaDB.delete with an empty filter removes all rows from the table
                st.session_state.vectordb.delete(filter={})
                st.session_state.vectordb = None
                st.session_state.conversation = None


if __name__ == '__main__':
    main()

 

For now, the current version provides a working solution for extracting and discussing information from multiple PDFs, significantly reducing the time and effort needed to go through each document manually. Streamlit provides the interactive UI, and SAP HANA Cloud stores the vector embeddings.


A key reflection from this process is the remarkable potential of chatbots not just as a customer interaction tool, but also as a way to interact with and extract insights from text documents. This use case can be extended to a wide array of applications across any sector dealing with large volumes of textual data.

Possible future enhancements could involve integrating more sophisticated NLP techniques and machine learning algorithms to improve accuracy.

5 Comments
VijayakumarU
Associate

Very Informative blog and well put together. Thanks!

Simmaco
Associate

Hi Rohit, your blog is very interesting. Could you also share a Github repository (if you have one) with the entire code and Python dependencies?

Thanks,

Simmaco

SureshChinnasamy
Employee

Great Summary and well documented. 

Rishika
Associate

Great to see a detailed walkthrough in this blog!

VishnuHemant
Associate

Appreciate the well documented walkthrough Rohit!