
PDF data search with Python, OpenAI, Langchain, Faiss, and Streamlit

8 min read

In the previous tutorial, we learned how to make a document search tool using Node.js and Pinecone that lets us search private data the way you would with ChatGPT. In this tutorial, we will do the same thing with Python and Faiss. We will also use Streamlit to quickly spin up a simple user interface so we can ask questions dynamically and get answers.

The tools we will need are as follows:

  • Python. We need Python 3.
  • Langchain. Langchain gives us libraries in JavaScript and Python to interact with LLMs more easily. It enables us to use the power of LLMs on our own private data.
  • OpenAI API Key. To use OpenAI's LLMs we need an API key.

To get an OpenAI API key, follow these steps.

  1. Go to the OpenAI platform. First, set up billing. Don't worry, a decently accurate model like GPT-3.5 Turbo only costs $0.0015 per 1000 tokens. Each token corresponds to a group of characters; a general rule of thumb is 1 token = 4 characters, so a 2,000-character prompt works out to roughly 500 tokens, or about $0.00075.

API pricing

  2. Click on your account and go to the API keys section. Here, you can create a new secret key and make sure to save the key value. We will plug this into our code later.

Create key

Assuming you already have Python installed on your machine, it's time to get started.

We will have a requirements.txt file for all the dependencies.

requirements.txt
langchain==0.0.216
PyPDF2==3.0.1
python-dotenv==1.0.0
streamlit==1.18.1
faiss-cpu==1.7.4
altair==4.1.0
streamlit-extras

As you can see, we will also rely on PyPDF2 to parse data from documents and faiss-cpu to give us access to the FAISS library. We will use FAISS to store our vectors.
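To install everything, run pip install -r requirements.txt from your project directory (ideally inside a virtual environment).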

Let's create a file called upload_any_doc.py and do some imports.

imports
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain import LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.prompts import (SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate)

import pickle
import os

Here, we are importing LLMChain from Langchain. It will allow us to instantiate a chain object and run our query later.

We import the RecursiveCharacterTextSplitter, OpenAIEmbeddings, and OpenAI from Langchain as well.

Since the LLM has an input token limit, the recursive text splitter splits longer text into chunks so they can be digested by the LLM. We will specify parameters for how to split the text. OpenAIEmbeddings allows us to embed the split text into vectors, which will then be stored via FAISS. OpenAI gives us many LLMs to play with. We will see this all in action momentarily.
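To get a feel for how chunk_size and chunk_overlap behave before we wire the splitter into the app, here is a minimal standalone sketch. The sample string and the tiny chunk sizes are made up purely for illustration; the real code below uses chunk_size=1000 and chunk_overlap=100.

text splitter sketch
from langchain.text_splitter import RecursiveCharacterTextSplitter

# A made-up sample string and deliberately small chunk sizes, purely for illustration
sample_text = "alpha, bravo, charlie, delta, echo, foxtrot, golf, hotel, india, juliet"

splitter = RecursiveCharacterTextSplitter(["\n\n", "\n", ".", ","], chunk_size=30, chunk_overlap=20)
chunks = splitter.split_text(text=sample_text)

# Print each chunk; neighbouring chunks repeat some pieces because of the overlap
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))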

We will be using the technique of chaining. This gives the LLM more context. There are more advanced things we can do with chaining, such as involving agents, but that's for another tutorial.

Now we can write our main function:

main function
def main():
    load_dotenv()

    st.header("Simple search of documents")

    pdf = st.file_uploader("Upload a PDF file", type=["pdf"])

In this function, we load our environment variables, and use Streamlit to set up a simple UI. We have a header called Simple search of documents. We also have a file uploader that accepts any PDF.
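load_dotenv() reads variables from a .env file in the project directory. Langchain's OpenAI integrations look for the OPENAI_API_KEY environment variable, so the .env file only needs one line (replace the placeholder with the secret key you saved earlier):

.env
OPENAI_API_KEY=your-secret-key-here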

Before we proceed, let's write a helper function that processes this PDF. In the helper function, we will check if the PDF is valid. If it is, we proceed to read its data and create a vector embedding array from its text.

Before we create the vector embeddings, we check whether we have already processed this PDF. If we have never seen this PDF before, we proceed. If we have processed it already, we simply retrieve the embedding from our file system. This simple caching technique will save you some CPU. Here we only check the PDF name; you can enhance this by checking a hash of the content so that updated PDFs also get processed properly.
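As a rough sketch of that enhancement (the helper name and the choice of MD5 here are just illustrative), you could hash the uploaded file's bytes and fold the hash into the cache file name, so a changed PDF with the same name is still re-processed:

content hash sketch (illustrative)
import hashlib

def get_cache_key(pdf):
    # Streamlit's UploadedFile exposes its raw bytes via getvalue()
    content_hash = hashlib.md5(pdf.getvalue()).hexdigest()
    # Combine the file name (minus ".pdf") with a hash of the content
    return f"{pdf.name[:-4]}_{content_hash}"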

Assuming we have never processed the PDF file before, we do the following steps:

  1. Extract text
  2. Split the text into chunks. We are setting chunk size to 1000 and chunk overlap to 100. A chunk size of 1000 means each chunk will have a maximum of 1000 characters (think letters of a word, or a comma). Chunk overlap means some characters will exist in more than one chunk. This is important because the goal is to keep track of the semantics (the meaning) behind the input text. Since we are splitting chunks at periods, commas, and newline characters, having overlap helps preserve meaning when sentences get cut off.
  3. Once we have the chunks, we call the OpenAIEmbeddings library to generate vectors out of the text. We use the default embedding model, text-embedding-ada-002; chat models such as GPT-3.5 Turbo are not valid for the embeddings endpoint.
  4. Return the embedding as well as the file name.
create_vector_store_from_pdf function
def create_vector_store_from_pdf(pdf):
    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        # Get rid of .pdf from name
        store_file_name = pdf.name[:-4]

        if os.path.exists(f"{store_file_name}.pkl"):
            # Read the cached vector store from disk
            with open(f"{store_file_name}.pkl", "rb") as f:
                vectorStore = pickle.load(f)

            st.write('File already exists, loaded embeddings from disk')
        else:
            text = ""
            for page in pdf_reader.pages:
                text += page.extract_text()

            # Use langchain to split into chunks
            splitter = RecursiveCharacterTextSplitter(["\n\n", "\n", ".", ","], chunk_size=1000, chunk_overlap=100)
            chunks = splitter.split_text(text=text)
            st.write("File broken into chunks: ", len(chunks))
            st.write("Chunk:")
            st.write(chunks)

            # Embed chunks (OpenAIEmbeddings defaults to the text-embedding-ada-002 model)
            embeddings = OpenAIEmbeddings()

            # Create vector store using Meta's FAISS store
            vectorStore = FAISS.from_texts(chunks, embedding=embeddings)

        return vectorStore, store_file_name
    else:
        return None, None

We will write another helper function to query for answers.

In this function, we pass in k for how many responses we want to get back. k=3 means we will get the top 3 most similar results based on vector similarity.

We use the GPT-3.5 Turbo model as our LLM, passing a temperature of 0.2 for creativity (a range between 0 and 1, with 1 being the most creative) and max_tokens of 3000, which means the answer will contain up to 3000 tokens. As a general rule of thumb, each token roughly maps to 4 characters.

We also chain our prompt templates and reduce hallucination by explicitly telling the LLM to say "I don't know" if it doesn't know.

Before calling the chain, we perform a similarity search by embedding our query into a vector and comparing it against the stored vectors from our document text. The most similar chunks are then passed to the LLMChain along with the question, and the chain returns the response. This response could be "I don't know."

get_response_from_query function
def get_response_from_query(db, query, k=3):
    docs = db.similarity_search(query=query, k=k)
    docs_page_content = " ".join([doc.page_content for doc in docs])

    # Ask LLM to give final result
    llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0.2, max_tokens=3000)

    system_template = """
    Only use factual information from the document {docs}.

    If you don't have enough information, just say "I don't know".
    """
    system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

    human_template = """
    Answer the question: {question}
    """
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

    chain = LLMChain(llm=llm, prompt=chat_prompt)
    response = chain.run(docs=docs_page_content, question=query)

    return response, docs
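As an optional tweak, Langchain's FAISS wrapper also exposes similarity_search_with_score, which returns each matching chunk together with its distance (with the default FAISS index, lower means closer). Inside get_response_from_query you could swap it in like this to inspect how close the matches are:

optional: inspect similarity scores
    # Replace the similarity_search call with the scored variant
    docs_and_scores = db.similarity_search_with_score(query=query, k=k)
    for doc, score in docs_and_scores:
        st.write(score, doc.page_content[:100])
    docs = [doc for doc, _ in docs_and_scores]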

Finally, back in the main function, we call our helper functions and put it all together, as shown below. In addition to displaying the answer, we also show the top 3 most similar results.

def main():
    load_dotenv()

    st.header("Simple search of documents")

    pdf = st.file_uploader("Upload a PDF file", type=["pdf"])

    vector_store, store_file_name = create_vector_store_from_pdf(pdf)
    if vector_store is not None and store_file_name is not None:
        num_vectors = vector_store.index.ntotal
        st.write(f"New file, created vector store with {num_vectors} vectors")

        with open(f"{store_file_name}.pkl", "wb") as f:
            pickle.dump(vector_store, f)

        # Take in search query
        query = st.text_input("Search for:")
        if query:
            # Get top results
            k = 3
            response, docs = get_response_from_query(vector_store, query, k=k)
            st.write("Answer:")
            st.write(response)

            st.write(f"Top {k} results:")
            cleaned_docs = [doc.page_content.replace('\n', ' ') for doc in docs]
            st.write(cleaned_docs)

Don't forget to call the main function.

if __name__ == "__main__":
    main()

If you run streamlit run upload_any_doc.py, you should see it in action. Mine is running on port 8501.

Dashboard

Now let's upload a document. For this test, we are using a 10-K report from Blackstone from 2022. This is a complex financial document with lots of data and jargon, so let's see how it performs!

Upload file

Once the file is uploaded, the program starts chunking the data. It is a large PDF with 350 pages of text, and we created almost 1,400 vectors.

Chunks

We can also expand each chunk to see the chunk data.

Chunk data

Now if we ask a question, we will get an answer.

Answer