Using GPT-4o for Optical Character Recognition: An Experience

June 10, 2024

Optical Character Recognition (OCR) has been a cornerstone technology for digitizing text from physical documents, and industries have been striving for greater efficiency, accuracy, and intelligence from OCR solutions. Enter GPT-4o Vision, the latest advancement from OpenAI, which combines the power of GPT-4's natural language understanding with cutting-edge visual recognition capabilities.

We at Techjays recently worked in the OCR domain for a custom AI solution project, a Data Extraction & Visualization one. A PDF and an Excel sheet document were the data sources from which we were required to extract data by performing OCR.

‍

But before getting into real experiences, let's quickly glance at the promises that the GPT-4o makes regarding OCR.

‍

What Does GPT-4o Promise for OCR?

‍

The latest GPT-4o model overall boasts enhanced performance and understanding than its predecessors and competitors.

It promises improved Accuracy with better recognition capabilities, especially in noisy or distorted text cases. Similarly, an improved contextual understanding of the model can help reduce errors by correcting based on the surrounding text.

‍

The model also claims to consume only optimal resources which they say can help reduce computational costs and improve processing speeds. It is also said to handle large volumes of OCR tasks without a visible drop in performance.

‍

GPT-4o also exhibits its multi-language and dialect support for a wider range of languages and dialects in global contexts supporting in creating custom AI solutions. It is also said to be better at recognizing and processing complex structures, such as tables, forms, and mixed media documents. It is also said to have improved entity recognition for extracting meaningful information such as dates, names, and locations.

‍

Real-Time Experience at Techjays:
‍

We being an AI services company, recently needed OCR for converting a PDF of 220 pages in length. The pattern of the pdf was mid-complex to grab the content.
‍

Initially, we started using pdfplumber for reading the PDF and Pytesseract for Optical Character Recognition but got incomplete results and the combination could not recognize 80% of the characters in our use case.
‍

This is when we planned to move to the OpenAI vision. For our work, we took most of the output in JSON format, so that later manipulations can be handled easily. OpenAI offers two models for the vision service, GPT-4o and GPT-4 and we chose GPT-4o for this.
‍

The initial observation about GPT-4o was that it gave us close to 80% accurate conversion when in the previous case we only got close to 30%.
‍

The model could successfully recognize different types of data from the image such as pin codes, email addresses, Names, telephone numbers, etc. On top of that, there were other advantages OpenAI vision offers, especially the capacity to extract the text and return output in the format we desired. Only very few times was a manual rectification needed
‍

The model was definitely faster than any we have used till now and was reli
able. Also, the ability to recognize distorted images and extract data was commendable.
‍

As far as cost is concerned, while Tesseract is completely free and open-source, using GPT-4o can be expensive, especially at scale, due to API usage fees and the infrastructure needed. But do remember that costs are primarily associated with the computational resources required for various projects.
‍

For us, it costs $0.03 to $0.05 per page depending on the resolution of that page and an average of 1 minute execution time per page. Also, significant time and technical expertise may be required for the initial integration and customization.
‍

On the other hand, we did notice some limitations to the model, when the contents started increasing and becoming much more complex. This was sort of expected as GPT-4o is still not a dedicated OCR solution. Generally, Tesseract is faster for basic OCR tasks, especially when using pre-configured settings. The slowing of GPT-4o’s Processing speed can be due to the computational demands of its advanced AI capabilities.

While GPT-4o is highly versatile, it is not specifically designed for OCR, meaning it might not be as optimized for this task as specialized OCR engines. When compared to such specialized OCR engines, the processing also might be slower due to the complexity of the model.

‍

Similarly, when it comes to customization possibilities, GPT-4o has limited customization space even though the model is in itself designed to handle a wide range of OCR tasks without the need for extensive configuration.
‍

Conclusion
‍

GPT-4o gave us some amazing results where certain other models failed, giving us more than 80% efficiency and accuracy in converting data from images to text. Equally impressive was its capability to recognize different types of data and give output in the format and pattern that we desired.
‍

Even if it is a paid model, the smartness of the model seems to be worthy enough, just that balance needs to be struck when it comes to larger projects.
‍

At the same time, another observation is the fact that while GPT-4o is highly versatile, it is not specifically designed for OCR and may not provide optimized solutions as specialized OCR engines can. Also, difficulties can arise in cases of highly structured text, especially with rigid formatting, though it might not be ideal for cases requiring data extraction pipelines to process large volumes of complex raw data.
‍

While models ^like Tesseract are highly customizable, if you want high-accuracy results with a minimal setup, GPT-4o might be your best choice.

‍