Language model-powered applications, such as chatbots, have had a transformative impact on the field of AI. With the advent of advanced models like OpenAI’s GPT-4, the capabilities of these applications have reached new heights. In this article, we explore LangChain, a framework tailored for developing language model-powered applications, including chatbots. We delve into the core concepts of LangChain, emphasizing the significance of data-awareness and agentic capabilities, and show how LangChain empowers developers to create chatbot systems that deliver contextually informed responses. Join us on this concise exploration to unlock the potential of LangChain and harness the power of GPT-4, Python, and PDF integration in building remarkable chatbot applications.
First, let’s define a few things.
LangChain is a powerful tool that can be used to work with Large Language Models (LLMs). LLMs are very general in nature, which means that while they can perform many tasks effectively, they may not be able to provide specific answers to questions or tasks that require deep domain knowledge or expertise. For example, imagine you want to use an LLM to answer questions about a specific field, like medicine or law. While the LLM may be able to answer general questions about the field, it may not be able to provide more detailed or nuanced answers that require specialized knowledge or expertise.
Check out this guide if you’re looking for more details about LangChain and its use case with databases.
LangChain fills a crucial need in the realm of Large Language Models (LLMs). While LLMs excel at a wide range of tasks, they may fall short when it comes to providing specific answers or deep domain expertise. To address this limitation, LangChain adds data-awareness and agentic capabilities to LLMs, allowing applications to connect to external data sources and engage in interactive, context-aware interactions. This integration empowers developers to build more powerful and specialized language model applications that can provide targeted and nuanced responses, bridging the gap between LLMs and domain-specific knowledge requirements.
GPT-4 and LangChain bring together the power of PDF processing, Python programming, and chatbot development to create an advanced language model-powered chatbot. With the integration of GPT-4, LangChain provides a comprehensive framework for building intelligent chatbot applications that can seamlessly interact with PDF documents.
The article will cover the following steps:

1. Install dependencies
2. Load a PDF with a LangChain document loader
3. Split the document into chunks with a text splitter
4. Generate embeddings and store them in a vector database
5. Build a retrieval-based question-answering chain with GPT-4
6. Chat with the PDF from a simple command-line loop
Install Dependencies
First, ensure you have Python version 3.8 or newer. Check your Python version by running this command:
python -V
Create your project directory. Let’s call it pdf-chatbot
mkdir pdf-chatbot
cd pdf-chatbot
Create a Python environment
python -m venv env
Activate the environment
Windows:
env\Scripts\activate
Linux:
source env/bin/activate
Install the necessary dependencies. Execute the following command to install the required packages:
pip install langchain openai chromadb pymupdf tiktoken
LangChain offers powerful data loader functionality that allows you to combine your own text data with language models, enabling you to create customized and differentiated applications. The first step in this process is loading the data into “Documents,” which are essentially pieces of text. The document loader module in LangChain simplifies this task, making it effortless to load and preprocess your data.
LangChain provides a variety of document loaders, including transform loaders that convert data from specific formats into the Document format. For example, there are transformers available for CSV and SQL data. These loaders can take input from files or even URLs, offering flexibility in data source selection.
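For instance, here is a minimal sketch of loading a CSV file with LangChain’s CSVLoader; the file path is a placeholder for your own data:

from langchain.document_loaders import CSVLoader

# Hypothetical file path – replace with your own CSV
loader = CSVLoader(file_path="./docs/example.csv")
csv_documents = loader.load()  # returns one Document per row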
Today, we will focus on a specific data loader provided by LangChain – PyMuPDF. This loader is designed to handle PDF documents, enabling you to seamlessly integrate PDF data into your language model applications.
With PyMuPDFLoader, you can load a PDF document. According to the LangChain documentation, this is the fastest of the PDF parsing options; it extracts detailed metadata about the PDF and its pages, and returns one document per page.
import os
from langchain.document_loaders import PyMuPDFLoader

# Set your OpenAI API key (needed later for embeddings and chat)
os.environ['OPENAI_API_KEY'] = 'ENTER YOUR API KEY'

# Load the PDF; each page becomes its own document
loader = PyMuPDFLoader("./docs/example.pdf")
documents = loader.load()
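To confirm the load worked as expected, you can inspect the returned documents; a quick sketch:

# One Document per page, each carrying its own metadata
print(len(documents))                    # number of pages loaded
print(documents[0].metadata)             # source file, page number, etc.
print(documents[0].page_content[:200])   # first 200 characters of page 1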
Now that we have our PDF loaded into a list of documents, let’s talk about TextSplitters.
Text Splitters
The concept of TextSplitters revolves around the need to break down long pieces of text into smaller, meaningful chunks. While this task may seem straightforward, there are various complexities involved. The goal is to split the text in a way that keeps semantically related pieces together, with the definition of “semantically related” depending on the specific type of text being processed. Below, we walk through one common approach.
At a high level, text splitters operate as follows:

1. Split the text into small, semantically meaningful units (often sentences).
2. Combine those small units into a chunk until the chunk reaches a certain size, as measured by a length function.
3. Once that size is reached, start a new chunk, keeping some overlap with the previous one to preserve context.

Text splitters allow customization along two axes:

1. How the text is split (for example, by character, word, or token).
2. How the chunk size is measured (for example, by character count or token count).
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
In the given implementation, the TextSplitter utilizes two parameters: chunk_size and chunk_overlap.
The chunk_size parameter sets the maximum size of each chunk, measured by the splitter’s length function (character count, by default, for RecursiveCharacterTextSplitter). This parameter allows you to control the granularity of the chunks and how much text is processed together at once.
The chunk_overlap parameter sets the maximum overlap between consecutive chunks. By including some overlap, in the style of a sliding window, the TextSplitter ensures that continuity and context are maintained between chunks. This can be particularly useful when dealing with long pieces of text, as it helps to preserve the flow of information and avoids abrupt transitions between chunks.
By adjusting the values of chunk_size and chunk_overlap, you can fine-tune the behavior of the TextSplitter according to your specific requirements and the nature of the text data being processed.
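To see these two parameters interact, here is a small illustrative sketch on a plain string (the sample text and sizes are arbitrary):

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "LangChain makes it easy to combine language models with your own data. " * 10

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(sample)  # returns a list of strings

print(len(chunks))   # number of chunks produced
print(chunks[0])     # each chunk is at most ~100 characters
print(chunks[1])     # may begin with text repeated from the end of chunks[0]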
Text embeddings play a crucial role in representing textual information in a numerical vector format. The Embedding class in LangChain serves as a standardized interface for various embedding providers, including OpenAI, Cohere, Hugging Face, and more.
By generating embeddings, text is transformed into a vector representation in a high-dimensional vector space. This vector representation allows for semantic analysis, enabling tasks such as semantic search, where similar pieces of text can be identified based on their proximity in the vector space.
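As a rough illustration of what these vectors enable, the sketch below embeds two sentences and compares them with cosine similarity. This calls the OpenAI API (so OPENAI_API_KEY must be set), and the example sentences are arbitrary:

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
v1 = embeddings.embed_query("How do I file my tax return?")
v2 = embeddings.embed_query("What is the deadline for submitting Form 1040?")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Semantically related sentences score closer to 1
print(cosine(v1, v2))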
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
persist_directory = "./storage"
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embeddings,
                                 persist_directory=persist_directory)
vectordb.persist()
In the provided code snippet, we can see the following actions:

- A persist_directory variable is defined, specifying the directory path where the collection will be persisted.
- An instance of OpenAIEmbeddings is created, representing the embedding provider. This instance will handle the embedding generation for the collection of texts.
- The Chroma class is utilized to create a vector database (vectordb) from the collection of text chunks (texts). The embedding parameter is set to the embeddings instance created earlier.
- The persist_directory parameter is passed to the Chroma.from_documents method, indicating that if a specific directory is specified, the collection and its embeddings will be persisted to that directory. If no directory is specified, the data will be stored in memory and will not persist across sessions.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
retriever = vectordb.as_retriever()
llm = ChatOpenAI(model_name='gpt-4')
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
Here we define the llm object (ChatOpenAI) and its parameters. This instance uses the gpt-4 model. The RetrievalQA chain then ties the retriever and the model together; the “stuff” chain type simply stuffs the retrieved chunks into the prompt sent to the model.
As of the writing of this article, there is a waitlist for the gpt-4 model. If you do not yet have access, change the model name to gpt-3.5-turbo.
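For example, the only change needed is the model name:

llm = ChatOpenAI(model_name='gpt-3.5-turbo')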
while True:
    user_input = input("Enter a query: ")
    if user_input == "exit":
        break

    query = f"###Prompt {user_input}"
    try:
        llm_response = qa(query)
        print(llm_response["result"])
    except Exception as err:
        print('Exception occurred. Please try again', str(err))
Here we create a loop that prompts the user for queries. The loop continues until the user enters “exit”. Each query is processed by a language model using the qa(query)
function. The result of the query is then printed. If an exception occurs during the process, an error message is displayed.
Let’s put it all together
This example uses the IRS 1040 Instructions PDF (fun, I know!).
import os
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
os.environ["OPENAI_API_KEY"] = 'ENTER YOUR API KEY'
persist_directory = "./storage"
pdf_path = "./docs/i1040gi.pdf"
loader = PyMuPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts,
embedding=embeddings,
persist_directory=persist_directory)
vectordb.persist()
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(model_name='gpt-4')
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
while True:
    user_input = input("Enter a query: ")
    if user_input == "exit":
        break

    query = f"###Prompt {user_input}"
    try:
        llm_response = qa(query)
        print(llm_response["result"])
    except Exception as err:
        print('Exception occurred. Please try again', str(err))
As mentioned above, we are using the IRS 1040 instruction PDF to chat with. Let’s ask some questions.
Enter a query: what is this document about?
This document is about the legal requirements for the Internal Revenue Service (IRS) to ask for and use information from taxpayers related to their eligibility for benefits or the repayment of loans. It also explains how the IRS may disclose this information to other countries, federal and state agencies, and federal law enforcement and intelligence agencies.
Enter a query:
Let’s get more specific. The document talks about “The Marketplace”. Let’s see if our chatbot can answer questions about this particular topic.
Enter a query: By what date is The Marketplace required to send Form 1095-A?
The Marketplace is required to send Form 1095-A by January 31, 2023.
Enter a query:
As you can see, the chatbot finds the proper context that the question (prompt) refers to and returns the correct answer.
To sum it up, LangChain offers a powerful framework for building applications that can handle and interact with text data intelligently. By leveraging tools like OpenAI GPT-4, LangChain helps developers create systems that can understand and process information from PDF documents effectively.
But having the right team to implement these technologies is just as important. Our pre-vetted LatAm developers are experts in the latest tech and can help you make the most of LangChain, Python and other advanced tools. With our nearshore staffing services, you get skilled professionals who fit your project needs and can deliver top-notch results.
Contact us today to build your team!