Communicating with Embedded Data in Documents

Illustration of the problem


A company named Tech Solutions has a large database of technical documents related to their products.

These documents contain information about product specifications, troubleshooting guides, and user manuals.

The company wants to provide better customer support by allowing users to ask questions about their products and get instant answers.

Could they achieve their goal?

The answer is yes: they achieved it by using an AI chatbot capable of comprehending and retrieving information from these documents.

So, the aim is to enable conversations like the following:

User: Hi, I’m having trouble setting up my new Tech Solutions router. Can you help?

AI Chatbot: Of course, I’m here to help. Could you please tell me the model of your router?

User: It’s the Tech Solutions Router Model X100.

AI Chatbot: Thank you for the information. Let me fetch the setup guide for you.

The AI chatbot scans the database of documents, finds the setup guide for the Tech Solutions Router Model X100, and extracts the relevant information.

AI Chatbot:

Here are the basic steps to set up your Tech Solutions Router Model X100:

  1. Connect the router to your modem using the provided Ethernet cable.
  2. Plug the router into a power source and turn it on.
  3. Connect your device to the router’s network. The default network name and password are on the bottom of the router.
  4. Open a web browser and type in the router’s IP address (found on the bottom of the router) to access the setup page.
  5. Follow the on-screen instructions to complete the setup.

Please let me know if you need help with any of these steps.

This scenario illustrates how an AI chatbot can engage with data within a document to offer real-time assistance to users. The AI chatbot leverages natural language processing to comprehend user queries, and machine learning algorithms to locate and extract pertinent information from the documents.

Now, let us try to create this using LangChain.

The Concept of Communicating with Embedded Data in Documents in LangChain

LangChain provides a powerful way to interact with your data by enabling a large language model (LLM) to answer questions based on the content of your documents.

Here’s an overview of the process:

  1. Load The Document
  2. Store Document to Memory
  3. Querying Data

Load The Document

The first step is to load your document into LangChain. This can be done using LangChain’s document loaders, which can handle data from a variety of sources. The document could be a text file, a CSV file, a webpage, or any other type of document that contains the data you want the LLM to interact with.
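
To illustrate this flexibility, the sketch below shows a few other LangChain document loaders that expose the same load() interface. The file names and URL here are placeholders, not files used in this tutorial:

# Sketch: different LangChain document loaders share the same interface.
# The file names and URL below are placeholders.
from langchain.document_loaders import TextLoader, CSVLoader, WebBaseLoader

text_docs = TextLoader("manual.txt").load()                   # plain-text file
csv_docs  = CSVLoader("product_specs.csv").load()             # one Document per CSV row
web_docs  = WebBaseLoader("https://example.com/faq").load()   # web page

print(len(text_docs), len(csv_docs), len(web_docs))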

As an example, we will load a text file containing the text of the book Alice in Wonderland by Lewis Carroll. The text file can be downloaded from the following gist and saved into a local documents directory:

https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt

from langchain.document_loaders import TextLoader

# Load the plain-text file from the local documents directory
loader = TextLoader("./documents/alice_in_wonderland.txt")
pages = loader.load()

Each item returned by the loader is a Document.

A Document contains text (page_content) and metadata. For a plain-text file like this one, TextLoader returns a single Document containing the entire book.

len(pages)   # number of Document objects returned by the loader

page = pages[0]
print(page.page_content[200:500])   # print a slice of the text
page.metadata                        # e.g. the source file path

Document Splitting

Document splitting is a crucial step in the process of preparing data for AI models.

It involves breaking down large documents into smaller, manageable chunks.

This process may seem straightforward, but it’s filled with subtleties that can significantly impact the performance of AI models down the line.

Why is Document Splitting Important?

Consider a document containing information about the specifications of a Toyota Camry. If we split the document incorrectly, we might end up with one chunk containing half a sentence about the car’s specifications and another chunk containing the rest.

This split could prevent an AI model from correctly answering a question about the car’s specifications because the relevant information is spread across two chunks.

To avoid such issues, we need to split documents in a way that keeps semantically relevant information together. This process involves defining a chunk size and a chunk overlap.

Chunk Size and Chunk Overlap

The chunk size refers to the size of each chunk, which can be measured in various ways, such as the number of characters or tokens.

The chunk overlap is a small overlap between two chunks, like a sliding window, which ensures some consistency and continuity between chunks.

Example Chunking

Suppose we have the following sentence:

Artificial intelligence is a branch of computer science that aims to create intelligent machines that work and react like humans.

If we split this sentence into two chunks of at most 12 words each, without any overlap, we would end up with the following chunks:

Chunk 1: Artificial intelligence is a branch of computer science that aims to create
Chunk 2: intelligent machines that work and react like humans.

The first chunk contains the first twelve words of the sentence, and the second chunk contains the remaining words. This split is not ideal because the sentence is cut mid-thought, which can prevent an AI model from understanding its meaning.

Now, let’s split the sentence again, this time with an overlap of four words between the chunks. We would end up with the following chunks:

Chunk 1: Artificial intelligence is a branch of computer science that aims to create
Chunk 2: that aims to create intelligent machines that work and react like humans.

Here, “that aims to create” is the overlap between the two chunks. This overlap helps maintain context when the chunks are processed independently.
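
To make this concrete, here is a minimal, self-contained sketch of word-based chunking with overlap. It is for illustration only; LangChain’s splitters, shown below, typically measure chunks in characters or tokens rather than words.

# Illustrative only: naive word-based chunking with overlap.
def chunk_words(text, chunk_size, overlap):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

sentence = ("Artificial intelligence is a branch of computer science that aims "
            "to create intelligent machines that work and react like humans.")
print(chunk_words(sentence, chunk_size=12, overlap=4))
# ['Artificial intelligence is a branch of computer science that aims to create',
#  'that aims to create intelligent machines that work and react like humans.']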

Text Splitters

Text splitters in LangChain split documents into chunks based on the defined chunk size and overlap.

They can vary in how they split the chunks and measure the length of the chunks.

Some splitters even use smaller models to determine the end of a sentence and use that as a splitting point.

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# Recursive splitter: tries larger separators first, then falls back to smaller ones
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4,      # maximum characters per chunk
    chunk_overlap=2    # characters shared between consecutive chunks
)

text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)

text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

# Character splitter: splits on a single separator (here, a space)
c_splitter = CharacterTextSplitter(
    separator=' ',
    chunk_size=10,
    chunk_overlap=2
)
c_splitter.split_text("hello world \n I am a text splitter \n please split me \n thank you")
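
Chunk length can also be measured in tokens rather than characters. As a sketch, assuming the tiktoken package is installed, LangChain’s TokenTextSplitter can be used in the same way:

# Sketch: splitting by tokens instead of characters (requires the tiktoken package).
from langchain.text_splitter import TokenTextSplitter

token_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=2)
token_splitter.split_text(page.page_content[:500])
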
Metadata and Document Type

Maintaining metadata across all chunks and adding new pieces of metadata when relevant is another crucial aspect of document splitting. The type of document we’re working with can also influence how we split it. For instance, code documents might require different splitting strategies compared to text documents.
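
For example, LangChain provides language-aware variants of the recursive splitter that split source code on boundaries such as class and function definitions. A small sketch, using a made-up Python snippet:

# Sketch: splitting Python source code on language-aware separators.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_code = """
def add(a, b):
    return a + b

class Calculator:
    def multiply(self, a, b):
        return a * b
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=60,
    chunk_overlap=0
)
python_splitter.split_text(python_code)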

Split The Document

We will take the document above and split it into chunks.

We can use the recursive character text splitter and the character text splitter, two common types of text splitters in LangChain.

We can experiment with different chunk sizes and overlaps to see how they affect the splitting.

After splitting, we can compare the length of the original document with the lengths of the chunks to see how many chunks we’ve created. We can also check the metadata of the chunks to ensure it matches the metadata of the original document.

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 10
chunk_overlap = 2

# Recursive splitter with a deliberately small chunk size to make the effect visible
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
text_chunks = r_splitter.split_text(page.page_content)
print(text_chunks)

# Character splitter that splits on newlines, with larger chunks
c_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=100,
    chunk_overlap=10
)
char_chunks = c_splitter.split_text(page.page_content)
print(char_chunks)
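
The split_text calls above return plain strings, so they drop metadata. To check that chunk metadata matches the original document, as mentioned above, we can split the Document objects themselves with split_documents:

# Split the Document objects so each chunk keeps the original metadata.
doc_chunks = r_splitter.split_documents(pages)

print(len(pages), "original document(s) ->", len(doc_chunks), "chunks")
print(doc_chunks[0].metadata)   # should match page.metadata, e.g. the source file path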

Store Document to Memory

Storing and searching over unstructured data often involves embedding the data and storing the resulting embedding vectors. At the time of a query, the unstructured query is embedded and the embedding vectors that are ‘most similar’ to the embedded query are retrieved. This process is managed by a vector store.

A vector store is responsible for storing embedded data and performing vector search. It holds vector representations, or embeddings, of the data; these embeddings capture the semantic meaning of the text, enabling the large language model (LLM) to comprehend and interact with the content of the document.
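
To build intuition for what “most similar” means, here is a toy sketch of vector search using cosine similarity. The vectors below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

# Toy illustration of vector search: the chunk whose embedding is most
# similar (by cosine similarity) to the query embedding is retrieved.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional "embeddings" of two document chunks
chunk_vectors = {
    "router setup guide": np.array([0.9, 0.1, 0.2]),
    "warranty policy":    np.array([0.1, 0.8, 0.3]),
}
query_vector = np.array([0.85, 0.15, 0.25])  # e.g. "how do I set up my router?"

# Pick the chunk with the highest similarity to the query
best_chunk = max(chunk_vectors, key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best_chunk)  # -> router setup guide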

Why is Vector Search Important?

Vector search is a crucial component of many AI applications. It enables the AI model to retrieve relevant information from a large corpus of documents.


For example, a chatbot can use vector search to retrieve information from a database of documents and answer user queries.

Vector Store in LangChain

LangChain provides a robust and flexible platform that supports a multitude of integration methods with Vector Stores. This versatility allows users to choose the most suitable vector store for their specific needs. In this particular instance, we will delve into the process of integrating with FAISS, a library developed by Facebook AI that is renowned for efficient similarity search and clustering of dense vectors.

FAISS is particularly beneficial for users who need to manage large databases and perform quick nearest-neighbor searches on high dimensional vectors. It’s a powerful tool that can significantly enhance the efficiency of handling unstructured data.

For those interested in exploring other options, LangChain supports a wide range of Vector Stores. A comprehensive list of these supported Vector Stores can be found at the following link: LangChain Vector Stores.

import os
import sys
import openai

sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read the local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

# Split the page text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=25
)
splits = text_splitter.split_text(page.page_content)
len(splits)  # number of chunks produced

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader

# Reload the document and split it into Document chunks so metadata is preserved
loader = TextLoader("./documents/alice_in_wonderland.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
docs = text_splitter.split_documents(documents)

# Embed the chunks and store the vectors in a FAISS index
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)

Querying Data

With the document stored in memory, you can now query the data. This involves creating a query and passing it to the vector store. The vector store will return the most relevant documents based on the query. These documents can then be passed to the LLM to generate a response.
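
Before involving the LLM, we can query the FAISS index directly with similarity_search to see which chunks would be retrieved for a given question (the question below is just an example):

# Retrieve the chunks most similar to a query, straight from the vector store.
question = "Who does Alice follow down the rabbit hole?"
relevant_docs = db.similarity_search(question, k=3)   # top 3 most similar chunks

for doc in relevant_docs:
    print(doc.page_content[:100], "...")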

In this way, LangChain allows you to chat with your data, enabling the LLM to answer questions based on the content of your documents. This can be particularly useful when dealing with large amounts of data or proprietary documents that the LLM was not originally trained on.

The first step is to import the RetrievalQA class from the langchain.chains package.

RetrievalQA stands for Retrieval Question Answering. It is a chain that answers questions by first retrieving relevant context from stored data.

In the context of LangChain, this module is used to retrieve the most relevant information from the stored data based on the user’s query.

We also import the ChatOpenAI class from the langchain.chat_models package.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

After that, we set up the language model and the retrieval chain.

OpenAIModel = "gpt-4"
llm = ChatOpenAI(model=OpenAIModel, temperature=0.1)  # low temperature for more factual answers

# Build a question-answering chain that retrieves relevant chunks from the vector store
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())

After creating the chain, we can pass a query to its run method. This method retrieves the most relevant chunks from the vector store and uses them to generate an answer.

query = "please mention all of the characters?"

qa.run(query)
query = "who is the main character?"

qa.run(query)
query = "please create synopsis from the story?"

qa.run(query)
query = "what is the ending of the story?"

qa.run(query)

If the query is not relevant to the stored data, the chain will return answers like “I don't know” or “I don't understand.”

query = "who is president of indonesia?"

qa.run(query)
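
Optionally, the retrieval chain can be configured to return the source chunks alongside the answer, which makes it easier to verify where an answer came from. A sketch, reusing the llm and db objects from above:

# Sketch: return the supporting chunks together with the generated answer.
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                # stuff retrieved chunks into a single prompt
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_with_sources({"query": "who is the main character?"})
print(result["result"])                         # the generated answer
print(result["source_documents"][0].metadata)   # where the supporting chunk came from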