Comparing OCR data extraction tools - EasyOCR, Tesseract-OCR, and AWS Textract


Parsing printed text data is easy, but parsing data from PDFs, scans, and images is hard because they can come in various formats and quality. There may also be handwriting involved. To help solve this challenge, we need to rely on computer vision and optical character recognition (OCR). Let's explore the most popular OCR text extraction tools and compare them.

It is worth noting that OCR is not a silver bullet. While it can handle virtually any incoming document, language, font, format, or layout, its accuracy depends heavily on the complexity of these factors, as well as on image and text quality. To reduce misinterpretation, we can fine-tune an OCR engine and train it on a specific format over time. We will not do that in this tutorial, but in a future one.

Compared with CSVs, which are more structured, and video/audio files, which are unstructured, PDFs generally fall into the category of semi-structured data. The PDF we will test with today is this example Bill of Lading, a document used in supply chains as a shipment receipt for carriers.

Bill of Lading

We will put this file into a directory called documents.

The tools we will explore are as follows:

- EasyOCR
- Pytesseract (Tesseract-OCR)
- AWS Textract

It's time to get started. First let's check out EasyOCR.

We will grab our file from the documents directory, parse each page of the document, enhance it for better resolution, and convert it to an image. Our OCR library will then process the image and extract the text. Once the text is extracted, we will put it into a pandas DataFrame to better display and visualize the result. The code below also works if you have multiple documents in the directory.

EasyOCR
import easyocr
import glob, fitz  # fitz is PyMuPDF
import pandas as pd

# To get better resolution
zoom_x = 2.0  # horizontal zoom
zoom_y = 2.0  # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)

path = 'documents/'
all_files = glob.glob(path + "*.pdf")

reader = easyocr.Reader(['en'], gpu=False)

for filename in all_files:
    doc = fitz.open(filename)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)  # render page to an image
        outputname = "data/out/page-%i.png" % page.number
        pix.save(outputname)  # store image
        result = reader.readtext(outputname)

        dataframe = pd.DataFrame(result, columns=['bound_box', 'text', 'confidence'])
        pd.set_option('display.max_rows', None)
        print(dataframe)

This gives us the following output on our file. (Additional data is cropped out of the screenshot below.)

Easy OCR output

As you can see, we get three columns: bounding box, text, and confidence. The bounding box is the region of the document the text was extracted from; text is, obviously, the text; and confidence is the engine's estimate of the accuracy of the extraction. EasyOCR correctly retrieved fields like "Name" and "Address" for "Ship From", and the Bill of Lading Number, but it did less well on other fields such as the date: instead of 2/25/2016, the engine extracted 212512016, confusing each / for a 1. It also failed to follow the layout of the customer order information section, among other things.
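Because EasyOCR returns plain Python tuples, the confidence column makes it easy to flag shaky extractions for manual review. The sketch below uses hypothetical rows in the same shape as EasyOCR's output, with an arbitrary 0.5 threshold:

```python
import pandas as pd

# Hypothetical rows in the shape EasyOCR returns: (bounding box, text, confidence)
result = [
    ([[0, 0], [100, 0], [100, 20], [0, 20]], "Bill of Lading", 0.98),
    ([[0, 30], [80, 30], [80, 50], [0, 50]], "212512016", 0.41),
]
dataframe = pd.DataFrame(result, columns=['bound_box', 'text', 'confidence'])

# Flag rows below an arbitrary confidence threshold for manual review
suspect = dataframe[dataframe['confidence'] < 0.5]
print(suspect['text'].tolist())  # ['212512016']
```

A threshold like this won't catch every error (the garbled date above happens to score low, but a confident misread would slip through), yet it is a cheap first filter before human review.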

The second tool we will look at is Pytesseract, a Python wrapper for Google's Tesseract OCR engine.

Just like earlier, we convert the PDF to an image, and then pass the image to our pytesseract library. We extract text using the image_to_string method.

Pytesseract
import pytesseract
import glob, fitz  # fitz is PyMuPDF

# To get better resolution
zoom_x = 2.0  # horizontal zoom
zoom_y = 2.0  # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)

path = 'documents/'
all_files = glob.glob(path + "*.pdf")

for filename in all_files:
    doc = fitz.open(filename)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)  # render page to an image
        outputname = "data/out/page-%i.png" % page.number
        pix.save(outputname)  # store image

        result = pytesseract.image_to_string(outputname, lang='eng')
        print(result)

Running this code gives us the following output.

PyTesseract Output

There is more to the extraction than the screenshot above shows, but you get the idea. We got the correct date this time! However, the engine did not associate the bill of lading number itself with the bill of lading number field. Clearly, improvements can still be made.
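Since image_to_string returns one raw string, associating labels with values has to happen in post-processing. One simple approach is to scan for a label and take the value either on the same line or the line below. This is a rough sketch using hypothetical OCR output, not Tesseract's actual result for our document:

```python
# Hypothetical raw OCR text; real output depends on the document and engine
raw = """BILL OF LADING
Date: 2/25/2016
Bill of Lading Number:
00123456789
"""

def value_after_label(text, label):
    """Return the text following `label`, on the same line or the next one."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if line.lower().startswith(label.lower()):
            remainder = line[len(label):].strip(" :")
            if remainder:                    # value on the same line
                return remainder
            if i + 1 < len(lines):           # value on the next line
                return lines[i + 1]
    return None

print(value_after_label(raw, "Bill of Lading Number"))  # 00123456789
print(value_after_label(raw, "Date"))  # 2/25/2016
```

Heuristics like this are brittle across layouts, which is exactly why form-aware services such as Textract (next section) extract key-value pairs directly.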

The third tool we will examine is AWS Textract. Before we get started, head over to AWS and follow the steps here to get an access key ID and a secret access key.

We will use boto3, the AWS SDK for Python, to communicate with the AWS service. We will also use Textractor to make extraction and data visualization easier.

AWS Textract
import boto3
import glob, fitz  # fitz is PyMuPDF
from textractor import Textractor
from textractor.data.constants import TextractFeatures
import pandas as pd

# To get better resolution
zoom_x = 2.0  # horizontal zoom
zoom_y = 2.0  # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)  # zoom factor 2 in each dimension

path = 'documents/'
all_files = glob.glob(path + "*.pdf")

# AWS client (replace the placeholders with your own credentials);
# Textractor below reads credentials from the "default" AWS profile
client = boto3.client('textract', region_name='us-east-1',
                      aws_access_key_id='YOUR KEY ID',
                      aws_secret_access_key='YOUR SECRET ACCESS KEY')
extractor = Textractor(profile_name="default")

for filename in all_files:
    doc = fitz.open(filename)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)  # render page to an image
        outputname = "data/out/page-%i.png" % page.number
        pix.save(outputname)  # store image

        document = extractor.analyze_document(file_source=outputname,
                                              features=[TextractFeatures.FORMS])

        # Get the list of keys and the list of values
        keys = [kv.key.text for kv in document.key_values]
        values = [str(kv.value) for kv in document.key_values]

        # Create the data frame with the values, using the keys as columns
        df = pd.DataFrame([values], columns=keys)

        # Write the dataframe into a CSV file
        df.to_csv("output.csv", index=None, header=True)

# To verify the CSV was created, read it back into a dataframe object
df = pd.read_csv("output.csv")
print(df)

Textract gives us the extracted data as key-value pairs. We will use the pandas DataFrame's to_csv method to generate a CSV file, which looks like this:

Textract output

It correctly extracted the name fields, but it doesn't tell us which name is the "ship from" name and which is the "ship to" name.
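If the single wide row from the code above is hard to scan, the same keys and values can also be written in a long, two-column layout, one field per row. The sketch below uses hypothetical extracted pairs in the same shape as the keys and values lists built earlier:

```python
import pandas as pd

# Hypothetical key/value pairs, in the shape built from document.key_values above
keys = ['Ship From Name', 'Ship To Name', 'Bill of Lading Number']
values = ['Acme Corp', 'Globex Inc', '00123456789']

# One row per extracted field is often easier to read than one wide row
df_long = pd.DataFrame({'field': keys, 'value': values})
df_long.to_csv('output_long.csv', index=False)
print(df_long)
```

The long format also survives documents whose fields vary from page to page, since there is no fixed set of columns.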

You may be disappointed by the accuracy, but remember that these engines need to be trained on specific formats to perform better. Once we have extracted the data, however, we can pass it to an LLM for embeddings and other cool tricks! There are also OCR-free data extraction techniques proposed in recent years, such as the Donut model. As a challenge, try running these OCR engines on your own document, and try one with handwriting.