Document OCR (Optical Character Recognition)

What is the Document OCR module?

The Document OCR module is an integration interface for OCR services such as Google Document AI. It extracts the contents of image and PDF files as text and, where the OCR service supports it, exports parts of the text as properties. The module then maps those properties and imports the content into Drupal entities.
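
The mapping step can be pictured with a small sketch. It is written in Python purely for illustration: the module itself is a Drupal (PHP) module, and every name below (`map_properties`, `invoice_id`, `field_invoice_number`, and so on) is hypothetical rather than part of the module's API.

```python
# Concept sketch only: NOT the module's API. All names are hypothetical.

def map_properties(extracted, mapping):
    """Return entity field values for the extracted properties that have a mapping."""
    fields = {}
    for prop, field_name in mapping.items():
        # Properties the OCR service did not return are simply skipped.
        if prop in extracted:
            fields[field_name] = extracted[prop]
    return fields


# Properties as an OCR service might return them (made-up sample data).
extracted = {"text": "Invoice 1001 ...", "invoice_id": "1001", "total": "42.00"}

# Which document property goes into which entity field (assumed field names).
mapping = {"text": "body", "invoice_id": "field_invoice_number"}

entity_values = map_properties(extracted, mapping)
```

Properties without a mapping entry (here `total`) are left out of the resulting entity values.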

The module is extensible and supports many ways to transform data from files before it is saved into Drupal entities. By default, the module ships with the Basic, Pipeline, and OpenAI transformer plugins.

Key Features

Integration with Google Document AI (with support for large PDF files via Google Cloud Storage).
Integration with Google Translation to translate extracted text.
Integration with Google Text-to-Speech to generate an MP3 file from the extracted text and attach it to a file field.
Integration with the PDF Parser PHP library.
Integration with the Tesseract OCR engine.
Integration with the PDFtoText (Poppler) engine.
Integration with docconv to extract text from PDF, DOC, DOCX, XML, HTML, RTF, ODT, PAGES files.
Integration with OpenAI to transcribe audio files.
OpenAI integration for summarizing PDF files, translating content, and many more options.
Integration with Microsoft Azure OpenAI using the same OpenAI plugins (see the README for details).
Integration with Microsoft Azure Translation to translate extracted text.
Drupal entity and document properties mapping tool.
Transformer plugins to preprocess data before saving it into Drupal entities.
The Pipeline transformer plugin lets you stack multiple transformers and change their execution order.
One-time import tool.
Real-time processing and Queue processing.
Option to store API response as JSON.
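
The idea behind stacking transformers in a pipeline and reordering them can be sketched as follows. This is a minimal concept illustration in Python, not the module's PHP plugin API: the `Step` class, the `weight` field used for ordering, and the two sample transformers are all assumptions made for the example.

```python
# Concept sketch only: NOT the module's plugin API. All names are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    weight: int  # lower weight runs first (an assumed ordering convention)
    transform: Callable[[str], str]


def run_pipeline(text, steps):
    """Apply each transformer to the output of the previous one, in weight order."""
    for step in sorted(steps, key=lambda s: s.weight):
        text = step.transform(text)
    return text


steps = [
    Step("truncate", 10, lambda t: t[:12]),
    Step("collapse_whitespace", 0, lambda t: " ".join(t.split())),
]

# Whitespace is collapsed first (weight 0), then the text is truncated (weight 10).
result = run_pipeline("  Total   due:  42.00  USD ", steps)
```

Swapping the two weights changes the result, because truncation would then run on the raw text before whitespace is collapsed. That is why the execution order is worth exposing as a setting.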

Use Cases

The following are some example use cases; more can be added via the module's plugin system.

Process PDF files to extract their contents as text.
Process PDF forms and extract each element as a property using services like Google Document AI. Such services also let you train extraction on your own custom forms.
Get a summary of extracted PDF contents via OpenAI.
Create entities and store PDF file contents in Drupal.
Process receipts, tax forms and other documents.
Transcribe audio files and translate content into different languages via the Google Translate transformer plugin or OpenAI.
Generate an audio file (MP3) from the text extracted from a PDF via the Google Text-to-Speech API.