Language model-powered applications, such as chatbots, have had a transformative impact on the field of AI. With the advent of advanced models like OpenAI’s GPT-4, the capabilities of these applications have reached new heights. In this article, we explore LangChain, a framework tailored for developing language model-powered applications, including chatbots. We delve into the core concepts of LangChain, emphasizing the significance of data-awareness and agentic capabilities, and show how LangChain empowers developers to create chatbot systems that deliver contextually informed responses. Join us on this concise exploration to unlock the potential of LangChain and harness the power of GPT-4, Python, and PDF integration in building remarkable chatbot applications.
First, let’s define a few things.
LangChain is a powerful tool that can be used to work with Large Language Models (LLMs). LLMs are very general in nature, which means that while they can perform many tasks effectively, they may not be able to provide specific answers to questions or tasks that require deep domain knowledge or expertise. For example, imagine you want to use an LLM to answer questions about a specific field, like medicine or law. While the LLM may be able to answer general questions about the field, it may not be able to provide more detailed or nuanced answers that require specialized knowledge or expertise.
Check out this guide if you’re looking for more details about LangChain and its use case with databases.
LangChain fills a crucial need in the realm of Large Language Models (LLMs). While LLMs excel at a wide range of tasks, they may fall short when it comes to providing specific answers or deep domain expertise. To address this limitation, LangChain adds data-awareness and agentic capabilities to LLMs, allowing applications to connect to external data sources and engage in interactive, context-aware interactions. This integration empowers developers to build more powerful and specialized language model applications that can provide targeted and nuanced responses, bridging the gap between LLMs and domain-specific knowledge requirements.
GPT-4 and LangChain bring together the power of PDF processing, Python programming, and chatbot development to create an advanced language model-powered chatbot. With the integration of GPT-4, LangChain provides a comprehensive framework for building intelligent chatbot applications that can seamlessly interact with PDF documents.
The article will cover the following steps:

1. Install dependencies
2. Load a PDF with a LangChain document loader
3. Split the document into chunks with a text splitter
4. Generate embeddings and store them in a vector database
5. Build a retrieval-based question-answering chain with GPT-4
6. Chat with the PDF from a simple command-line loop
Install Dependencies
First, ensure you have Python version 3.8 or newer. Check your Python version by running this command:
python -V
Create your project directory. Let’s call it pdf-chatbot
mkdir pdf-chatbot
cd pdf-chatbot
Create a Python environment
python -m venv env
Activate the environment
Windows:
env\Scripts\activate
Linux:
source env/bin/activate
Install the necessary dependencies. Execute the following command to install the required packages:
pip install langchain openai chromadb pymupdf tiktoken
LangChain offers powerful data loader functionality that allows you to combine your own text data with language models, enabling you to create customized and differentiated applications. The first step in this process is loading the data into “Documents,” which are essentially pieces of text. The document loader module in LangChain simplifies this task, making it effortless to load and preprocess your data.
LangChain provides a variety of document loaders, including transform loaders that convert data from specific formats into the Document format. For example, there are transformers available for CSV and SQL data. These loaders can take input from files or even URLs, offering flexibility in data source selection.
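For instance, here is a minimal sketch of loading a CSV file with LangChain’s CSVLoader; the file path is a placeholder for your own data:

from langchain.document_loaders import CSVLoader

# Hypothetical file path – replace with your own CSV
loader = CSVLoader(file_path="./docs/example.csv")
csv_documents = loader.load()  # returns one Document per row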
Today, we will focus on a specific data loader provided by LangChain – PyMuPDF. This loader is designed to handle PDF documents, enabling you to seamlessly integrate PDF data into your language model applications.
With PyMuPDFLoader, you can load a PDF document. According to the LangChain documentation, this is the fastest of the PDF parsing options; it extracts detailed metadata about the PDF and its pages, and returns one document per page.
import os
from langchain.document_loaders import PyMuPDFLoader

# Set your OpenAI API key (needed later for embeddings and chat)
os.environ['OPENAI_API_KEY'] = 'ENTER YOUR API KEY'

# Load the PDF; each page becomes its own document
loader = PyMuPDFLoader("./docs/example.pdf")
documents = loader.load()
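To confirm the load worked as expected, you can inspect the returned documents; a quick sketch:

# One Document per page, each carrying its own metadata
print(len(documents))                    # number of pages loaded
print(documents[0].metadata)             # source file, page number, etc.
print(documents[0].page_content[:200])   # first 200 characters of page 1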
Now that we have our PDF loaded into a list of documents, let’s talk about TextSplitters.
Text Splitters
The concept of TextSplitters revolves around the need to break down long pieces of text into smaller, meaningful chunks. While this task may seem straightforward, there are various complexities involved. The goal is to split the text in a way that keeps semantically related pieces together, with the definition of “semantically related” depending on the specific type of text being processed. Below, we walk through one common approach.
At a high level, text splitters operate as follows:

1. Split the text into small, semantically meaningful units (often sentences).
2. Combine those small units into a chunk until the chunk reaches a certain size, as measured by a length function.
3. Once that size is reached, start a new chunk, keeping some overlap with the previous one to preserve context.

Text splitters allow customization along two axes:

1. How the text is split (for example, by character, word, or token).
2. How the chunk size is measured (for example, by character count or token count).
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
In the given implementation, the TextSplitter utilizes two parameters: chunk_size and chunk_overlap.
The chunk_size parameter sets the maximum size of each chunk, measured by the splitter’s length function (character count, by default, for RecursiveCharacterTextSplitter). This parameter allows you to control the granularity of the chunks and how much text is processed together at once.
The chunk_overlap parameter sets the maximum overlap between consecutive chunks. By including some overlap, in the style of a sliding window, the TextSplitter ensures that continuity and context are maintained between chunks. This can be particularly useful when dealing with long pieces of text, as it helps to preserve the flow of information and avoids abrupt transitions between chunks.
By adjusting the values of chunk_size and chunk_overlap, you can fine-tune the behavior of the TextSplitter according to your specific requirements and the nature of the text data being processed.
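To see these two parameters interact, here is a small illustrative sketch on a plain string (the sample text and sizes are arbitrary):

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "LangChain makes it easy to combine language models with your own data. " * 10

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(sample)  # returns a list of strings

print(len(chunks))   # number of chunks produced
print(chunks[0])     # each chunk is at most ~100 characters
print(chunks[1])     # may begin with text repeated from the end of chunks[0]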
Text embeddings play a crucial role in representing textual information in a numerical vector format. The Embedding class in LangChain serves as a standardized interface for various embedding providers, including OpenAI, Cohere, Hugging Face, and more.
By generating embeddings, text is transformed into a vector representation in a high-dimensional vector space. This vector representation allows for semantic analysis, enabling tasks such as semantic search, where similar pieces of text can be identified based on their proximity in the vector space.
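As a rough illustration of what these vectors enable, the sketch below embeds two sentences and compares them with cosine similarity. This calls the OpenAI API (so OPENAI_API_KEY must be set), and the example sentences are arbitrary:

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
v1 = embeddings.embed_query("How do I file my tax return?")
v2 = embeddings.embed_query("What is the deadline for submitting Form 1040?")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Semantically related sentences score closer to 1
print(cosine(v1, v2))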
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
persist_directory = "./storage"
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embeddings,
                                 persist_directory=persist_directory)
vectordb.persist()
In the provided code snippet, we can see the following actions:

- A persist_directory variable is defined, specifying the directory path where the collection will be persisted.
- An instance of OpenAIEmbeddings is created, representing the embedding provider. This instance will handle the embedding generation for the collection of texts.
- The Chroma class is utilized to create a vector database (vectordb) from the collection of text chunks (texts). The embedding parameter is set to the embeddings instance created earlier.
- The persist_directory parameter is passed to the Chroma.from_documents method, indicating that if a specific directory is specified, the collection and its embeddings will be persisted to that directory. If no directory is specified, the data will be stored in memory and will not persist across sessions.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
retriever = vectordb.as_retriever()
llm = ChatOpenAI(model_name='gpt-4')
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
Here we define the llm object (ChatOpenAI) and its parameters. This instance uses the gpt-4 model. The RetrievalQA chain then ties the retriever and the model together; the “stuff” chain type simply stuffs the retrieved chunks into the prompt sent to the model.
As of the writing of this article, there is a waitlist for the gpt-4 model. If you do not yet have access, change the model name to gpt-3.5-turbo.
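For example, the only change needed is the model name:

llm = ChatOpenAI(model_name='gpt-3.5-turbo')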
while True:
    user_input = input("Enter a query: ")
    if user_input == "exit":
        break

    query = f"###Prompt {user_input}"
    try:
        llm_response = qa(query)
        print(llm_response["result"])
    except Exception as err:
        print('Exception occurred. Please try again', str(err))
Here we create a loop that prompts the user for queries. The loop continues until the user enters “exit”. Each query is processed by a language model using the qa(query)
function. The result of the query is then printed. If an exception occurs during the process, an error message is displayed.
Let’s put it all together
This example uses the IRS 1040 Instructions PDF (fun, I know!).
import os
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
os.environ["OPENAI_API_KEY"] = 'ENTER YOUR API KEY'
persist_directory = "./storage"
pdf_path = "./docs/i1040gi.pdf"
loader = PyMuPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=10)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts,
embedding=embeddings,
persist_directory=persist_directory)
vectordb.persist()
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(model_name='gpt-4')
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
while True:
    user_input = input("Enter a query: ")
    if user_input == "exit":
        break

    query = f"###Prompt {user_input}"
    try:
        llm_response = qa(query)
        print(llm_response["result"])
    except Exception as err:
        print('Exception occurred. Please try again', str(err))
As mentioned above, we are using the IRS 1040 instruction PDF to chat with. Let’s ask some questions.
Enter a query: what is this document about?
This document is about the legal requirements for the Internal Revenue Service (IRS) to ask for and use information from taxpayers related to their eligibility for benefits or the repayment of loans. It also explains how the IRS may disclose this information to other countries, federal and state agencies, and federal law enforcement and intelligence agencies.
Enter a query:
Let’s get more specific. The document talks about “The Marketplace”. Let’s see if our chatbot can answer questions about this particular topic.
Enter a query: By what date is The Marketplace required to send Form 1095-A?
The Marketplace is required to send Form 1095-A by January 31, 2023.
Enter a query:
As you can see, the chatbot finds the proper context that the question (prompt) refers to and returns the correct answer.
To sum it up, LangChain offers a powerful framework for building applications that can handle and interact with text data intelligently. By leveraging tools like OpenAI GPT-4, LangChain helps developers create systems that can understand and process information from PDF documents effectively.
But having the right team to implement these technologies is just as important. Our pre-vetted LatAm developers are experts in the latest tech and can help you make the most of LangChain, Python and other advanced tools. With our nearshore staffing services, you get skilled professionals who fit your project needs and can deliver top-notch results.
Contact us today to build your team!