What is the Document OCR module?
The Document OCR module is an integration interface for OCR services such as Google Document AI. It retrieves the contents of image and PDF files as text and, where the OCR service supports it, exports the text as properties. The module then maps those properties and imports the content into Drupal entities.
The module is extensible and supports many ways to transform data from files before saving it into Drupal entities. By default, the module comes with the Basic, Pipeline, and OpenAI transformer plugins.
Key Features
- Integration with Google Document AI (with support for large PDF files via Google Cloud Storage).
- Integration with Google Translation to translate extracted text.
- Integration with Google Text to Speech to generate an MP3 file from the extracted text and attach it to a file field.
- Integration with PDF Parser PHP library.
- Integration with Tesseract OCR engine.
- Integration with PDFtoText (Poppler) engine.
- Integration with docconv to extract text from PDF, DOC, DOCX, XML, HTML, RTF, ODT, and PAGES files.
- Integration with OpenAI to transcribe audio files.
- OpenAI integration support (get summaries of PDF files, translate content, and many more options).
- Integration with Microsoft Azure OpenAI using the same OpenAI plugins (see the README for details).
- Integration with Microsoft Azure Translation to translate extracted text.
- Drupal entity and document properties mapping tool.
- Transformer plugins to preprocess data before saving it into Drupal entities.
- Pipeline transformer plugin that lets you stack multiple transformers and change their execution order.
- One-time import tool.
- Real-time processing and Queue processing.
- Option to store API response as JSON.
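The stacking behaviour of the Pipeline transformer can be pictured with a minimal, language-neutral sketch. This is not the module's PHP plugin API: the `Transformer` class, the `weight` field, the `run_pipeline` function, and the three example steps are all hypothetical names invented for illustration, and using a numeric weight to decide execution order is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transformer:
    """Hypothetical transformer step: a name, an ordering weight, a function."""
    name: str
    weight: int  # assumption: lower weight runs earlier
    apply: Callable[[str], str]


def run_pipeline(value: str, steps: List[Transformer]) -> str:
    """Run steps in ascending weight order, passing each output to the next."""
    for step in sorted(steps, key=lambda s: s.weight):
        value = step.apply(value)
    return value


# Three illustrative steps, deliberately listed out of order.
steps = [
    Transformer("truncate", weight=10, apply=lambda text: text[:20]),
    Transformer("trim", weight=0, apply=str.strip),
    Transformer("collapse_whitespace", weight=5,
                apply=lambda text: " ".join(text.split())),
]

# Sample extracted text; the value is made up for this example.
result = run_pipeline("  Invoice   No. 1234   from ACME Corp  ", steps)
print(result)
```

Changing a step's weight changes where it runs in the stack, which is why the order matters: truncating before trimming keeps a different part of the text than truncating after.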
Use Cases
The following are some use cases; more can be added via the module's plugin system.
- Process PDF files to extract their contents as text.
- Process PDF forms and extract each element as a property using services like Google Document AI. Such services also let you train extraction on your own custom forms.
- Get a summary of extracted PDF contents via OpenAI.
- Create entities and have PDF file contents stored in Drupal.
- Process receipts, tax forms and other documents.
- Transcribe audio files and translate content into different languages via the Google Translate transformer plugin or OpenAI.
- Generate an audio file (MP3) from the extracted PDF text via the Google Text to Speech API.
Integration Modules
Configuration
Please read the README.md file for more details. It includes configuration details for Google Document AI, OpenAI, and other integrations.
Examples
Mapping listing page:

Mapping settings:

Drupal entity and document property mapping tool:

One-time import form (with the same Drupal entity and document property mapping tool):

Document processor listing page:

Add processor form:

Transformers listing page:

Basic property transformer plugin:

OpenAI property transformer plugin:

Document processing tasks listing page:

Installation
Run composer require 'drupal/document_ocr:^1.0' to install the module and all required dependencies.
Project information
- Project categories: Content editing experience, Developer tools, Import and export
- 29 sites report using this module
- Created by minnur
Stable releases for this project are covered by the security advisory policy.
