
Document search tool with Node, OpenAI, Langchain, and Pinecone

9 min read

Ever wanted to look up any information from a document and ask it questions like you would with ChatGPT? With the power of Langchain, it's not only doable but fairly straightforward. I will show you how to get this up and running in 20 minutes.

The tools we will need are as follows:

  • Node.js. We need v18 or above.
  • Langchain. Langchain gives us libraries in JavaScript and Python to interact with LLMs more easily. It enables us to use the power of LLMs on our own private data.
  • Pinecone API Key. Pinecone is a vector database that allows us to store and search for vectors. We can use the starter version for free.
  • OpenAI API Key. To use OpenAI's LLMs we need an API key.

To get a Pinecone API key, go to Pinecone and sign up for an account. Once you are logged in, go to the API keys section and create a new key.

Pinecone API key

Make sure to save the environment name as well as the key value.

To get an OpenAI API key, follow these steps.

  1. Go to the OpenAI platform. First, set up billing. Don't worry, a decently accurate model like GPT-3.5 Turbo only costs $0.0015 per 1,000 tokens. Each token corresponds to a group of characters; a general rule of thumb is 1 token ≈ 4 characters, so a 4,000-character document works out to roughly 1,000 tokens, or about $0.0015.

API pricing

  2. Click on your account and go to the API keys section. Here, you can create a new secret key; make sure to save the key value. We will plug this into our code later.

Create key

Assuming you already have Node and NPM installed on your machine, it's time to get started.

First, install the packages we need, as listed in this package.json file.

package.json
{
  "name": "simple-search",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Alex Casella",
  "license": "ISC",
  "type": "module",
  "dependencies": {
    "@pinecone-database/pinecone": "^0.1.6",
    "body-parser": "^1.20.2",
    "cors": "^2.8.5",
    "dotenv": "^16.3.1",
    "express": "^4.18.2",
    "langchain": "^0.0.96",
    "openai": "^3.3.0",
    "pdf-parse": "^1.1.1",
    "uuid": "^9.0.0"
  }
}

Make sure to put your OpenAI API key and Pinecone API key in a .env file to keep them out of your code.

OPENAI_API_KEY=
PINECONE_API_KEY=
PINECONE_ENVIRONMENT=

Once you have installed all the packages, we need to create a few files. Create an index.js, which is our entry point to the application. For simplicity, we can divide the rest of the files by their responsibilities: create createPineconeIndex.js, updatePinecone.js, and queryPineconeAndLLM.js.

Let's first write the vector database index creation logic in our createPineconeIndex.js file.

createPineconeIndex function
export const createPineconeIndex = async (indexName, dimension, client) => {
  const existingIndices = await client.listIndexes();

  if (!existingIndices.includes(indexName)) {
    // Create the index, using cosine similarity as the distance metric
    await client.createIndex({
      createRequest: {
        name: indexName,
        dimension: dimension,
        metric: "cosine",
      },
    });

    // Index creation is asynchronous on Pinecone's side, so give it time to initialize
    await new Promise((resolve) => setTimeout(resolve, 50000));
  } else {
    console.log(`Index ${indexName} already exists`);
  }
};

As you can see, the function takes in an index name, dimension, and client as input parameters. Index name is the name of our index in the vector database. Dimension is the number of dimensions of our vectors. Here is a great article on what dimensions and vectors mean and why they are needed. Client is the Pinecone client object that we will initialize in index.js. We will get to that momentarily.

In this function, we check whether the index already exists. If it does, we just log that and move on; if not, we call our client to create the index for us. Note the 50-second wait after creation: Pinecone provisions the index asynchronously, so we give it time to initialize before writing to it.
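
We will wire everything up in index.js shortly, but as a quick standalone sanity check, this is roughly how the function could be called once the Pinecone client is initialized (the index name and dimension below are placeholders for the values we use later):

calling createPineconeIndex (sketch)
import { PineconeClient } from "@pinecone-database/pinecone";
import { createPineconeIndex } from "./createPineconeIndex.js";
import * as dotenv from "dotenv";

dotenv.config();

const client = new PineconeClient();
await client.init({
  apiKey: process.env.PINECONE_API_KEY,
  environment: process.env.PINECONE_ENVIRONMENT,
});

// "test-index" and 1536 are placeholders; 1536 matches OpenAI's text-embedding-ada-002 embeddings
await createPineconeIndex("test-index", 1536, client);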

Next, we will write the logic to update our Pinecone index. This is where the bulk of the code lives. First, let's import a couple of libraries.

imports
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { v4 as uuidv4 } from 'uuid';

We imported OpenAIEmbeddings from Langchain, which will help us embed the document text into vectors that will then be stored in our Pinecone index. We also imported RecursiveCharacterTextSplitter from Langchain, which is used for parsing the document and splitting it into chunks. Because OpenAI LLMs have a token limit, we need to split larger amounts of text into smaller chunks.
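
As a quick aside, here is a minimal sketch of what embedDocuments returns, assuming OPENAI_API_KEY is set in your environment. Each piece of text becomes an array of 1536 numbers with OpenAI's default text-embedding-ada-002 model, which is why we will create our Pinecone index with a dimension of 1536:

embedDocuments sketch
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

// Embed two short strings and inspect the result
const embeddings = await new OpenAIEmbeddings().embedDocuments([
  "Blackstone is an alternative asset manager.",
  "Total revenue is reported in the 10-K filing.",
]);

console.log(embeddings.length);    // 2 (one vector per input text)
console.log(embeddings[0].length); // 1536 (dimension of text-embedding-ada-002)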

Now, we can proceed to write our function to update the Pinecone index we created.

updatePinecone function
export const updatePinecone = async (indexName, docs, client) => {
  const index = client.Index(indexName);

  for (const doc of docs) {
    const filePath = doc.metadata.source;
    const text = doc.pageContent;

    // Split the document text into overlapping chunks
    const textSplitter = new RecursiveCharacterTextSplitter({
      separators: [".", "\n", "\n\n"],
      chunkSize: 1000,
      chunkOverlap: 200,
    });
    const chunks = await textSplitter.createDocuments([text]);

    // Embed each chunk, replacing newlines with spaces first
    const embeddingsArrays = await new OpenAIEmbeddings().embedDocuments(
      chunks.map((chunk) => chunk.pageContent.replace(/\n/g, " "))
    );

    await upsertVectors(filePath, chunks, embeddingsArrays, index);
  }
};

In this function, we take in indexName, which is the name of the index we created earlier; docs, which are the documents we need to parse; and the same Pinecone client object used in createPineconeIndex.js.

We go through all the documents given, keep track of each file path, and extract the text by calling doc.pageContent. Once we have the text, we call RecursiveCharacterTextSplitter to split it into chunks. Each chunk will have a maximum of 1,000 characters; note that this is not 1,000 tokens. You can read more about text splitters here. Chunk overlap means some characters exist in more than one chunk. This is important because the goal is to preserve the semantics (or meaning) of the input text: since we split chunks on periods and newline characters, some sentences will get cut off, and the overlap preserves their meaning across chunk boundaries.
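
To make the overlap concrete, here is a small sketch with a much smaller chunk size than we use above; the sample sentence and sizes are made up purely for illustration:

chunk overlap sketch
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 100,   // tiny chunks, just for this example
  chunkOverlap: 20, // roughly 20 characters repeated between neighboring chunks
});

const chunks = await splitter.createDocuments([
  "Revenue grew year over year. Expenses also increased. Net income, however, remained roughly flat compared to the prior period.",
]);

// Each chunk is a Document; print them to see the shared text at the boundaries
chunks.forEach((chunk, i) => console.log(i, chunk.pageContent));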

Once we have the chunks, we call OpenAIEmbeddings' embedDocuments function to create an array of embeddings, one per chunk. We will write a helper function to add these embedded vectors to our vector database.

upsertVectors function
export const upsertVectors = async (filePath, chunks, embeddingsArrays, index) => {
  const batchSize = 100;
  let batch = [];

  for (let i = 0; i < chunks.length; i++) {
    const chunk = chunks[i];
    const vector = {
      id: `${filePath}-${i}-${uuidv4()}`,
      values: embeddingsArrays[i],
      metadata: {
        ...chunk.metadata,
        loc: JSON.stringify(chunk.metadata.loc), // where the chunk is in the document
        pageContent: chunk.pageContent,
        filePath: filePath,
      },
    };

    batch.push(vector);

    // upsert vectors in batches of 100
    if (batch.length === batchSize || i === chunks.length - 1) {
      try {
        await index.upsert({
          upsertRequest: {
            vectors: batch,
          },
        });
      } catch (err) {
        console.log("error upserting vectors", err);
      }

      // reset batch
      batch = [];
    }
  }
};

To avoid rate limiting, we will upload our vectors in batches of 100. A simple for loop goes through the embeddings array, grabs each chunk, adds its info to a vector, and pushes that vector to our batch. The information we store in each vector includes a vector ID, which must be unique. We also capture the location of the chunk within the document, so we can reference it later, as well as the actual text content of the chunk. We save the text alongside the vector embedding so we can query against the vectors but display the text.

Once our batch reaches the maximum size of 100, we add it to our Pinecone index, and reset the batch.

tip

The vector ID has to be unique or you will overwrite the vectors in your index over and over again. This is why we also use the uuid library here.

  id: `${filePath}-${i}-${uuidv4()}`,

The next step is to write our query logic.

queryPineconeAndQueryLLM function
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { OpenAI } from "langchain/llms/openai";
import { loadQAStuffChain } from "langchain/chains";
import { Document } from "langchain/document";

export const queryPineconeAndQueryLLM = async (question, indexName, client) => {
  const index = client.Index(indexName);

  // Embed the question with the same model we used for the documents
  const queryEmbedding = await new OpenAIEmbeddings().embedQuery(question);

  let queryResponse = await index.query({
    queryRequest: {
      vector: queryEmbedding,
      topK: 10,
      includeMetadata: true,
      includeValues: true,
    },
  });

  if (queryResponse.matches.length) {
    const result = await queryLLM(question, queryResponse);
    console.log(result.text);
  } else {
    console.log("No matches found");
  }
};

This is fairly straightforward: we take in the question we are asking, the indexName of our index, and the Pinecone client once again. We embed our question into a vector using the same embedding model, and then compare it against the document vectors we stored in the previous step.

We ask for the top 10 results, so the 10 closest matches will be returned along with their document text.

let queryResponse = await index.query({
  queryRequest: {
    vector: queryEmbedding,
    topK: 10,
    includeMetadata: true,
    includeValues: true,
  },
});

We can then pass the top 10 matches to the OpenAI LLM, and it will give us the final answer to our question. The loadQAStuffChain used below simply "stuffs" the retrieved text into a single prompt along with our question.

queryLLM helper function
const queryLLM = async (question, queryResponse) => {
  const llm = new OpenAI();
  const chain = loadQAStuffChain(llm);

  // concatenate the text of the top matches into a single context string
  const concatenatedPageContent = queryResponse.matches
    .map((match) => match.metadata.pageContent)
    .join(" ");

  const result = await chain.call({
    input_documents: [new Document({ pageContent: concatenatedPageContent })],
    question: question,
  });

  return result;
};

Finally, we will put it all together in our index.js file.

index.js
import { PineconeClient } from "@pinecone-database/pinecone";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { createPineconeIndex } from "./createPineconeIndex.js";
import { updatePinecone } from "./updatePinecone.js";
import { queryPineconeAndQueryLLM } from "./queryPineconeAndLLM.js";
import * as dotenv from "dotenv";

dotenv.config();

// Load every .txt and .pdf file from the documents folder
const loader = new DirectoryLoader("./documents", {
  ".txt": (path) => new TextLoader(path),
  ".pdf": (path) => new PDFLoader(path),
});

const docs = await loader.load();

const question = "what is blackstone's total revenue for the year";
const indexName = "test-index";
const vectorDimension = 1536; // dimension of OpenAI's text-embedding-ada-002 embeddings

const client = new PineconeClient();
await client.init({
  apiKey: process.env.PINECONE_API_KEY,
  environment: process.env.PINECONE_ENVIRONMENT,
});

(async () => {
  await createPineconeIndex(indexName, vectorDimension, client);
  await updatePinecone(indexName, docs, client);
  await queryPineconeAndQueryLLM(question, indexName, client);
})();

The document I tested with is a 10-K report from Blackstone. Once the code runs (node index.js), you should get the answer to "what is blackstone's total revenue for the year".

You can easily expand on this and build a nice UI that prompts the user for questions dynamically. Also, feel free to play with the chunk size, overlap size, and topK value to see if you get a different result. You can also try out different GPT models; the default here is GPT-3 (text-davinci-003), but you can use any of them, as sketched below.
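
For example, a minimal tweak to the queryLLM helper, assuming the Langchain version pinned in our package.json, is to pass a modelName (and optionally a temperature) when constructing the LLM:

using a different model (sketch)
// In queryLLM, swap the default GPT-3 model for GPT-3.5 Turbo
const llm = new OpenAI({
  modelName: "gpt-3.5-turbo", // assumes your OpenAI account has access to this model
  temperature: 0,             // lower temperature keeps answers closer to the document text
});
const chain = loadQAStuffChain(llm);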

Happy coding!