AI File to Text

This project is not covered by Drupal’s security advisory policy.

AI File to Text automatically extracts content from uploaded document files and converts it to plain text, HTML, Markdown, or structured JSON. Built on an extensible extractor architecture, it integrates with the Drupal AI module and the Document Loader module, providing three consumers that all flow through the same extraction pipeline: AI Automator plugins, an AI Agent function call, and a Document Loader plugin. No external services, APIs, or server-side applications are required; everything runs in pure PHP on your server.

Features

10 file extensions supported out of the box across 7 extractors: Word (.docx, .doc), OpenDocument (.odt), PDF (.pdf), Spreadsheets (.xlsx, .xls, .ods, .csv), Plain Text (.txt), and Markdown (.md).
4 output formats: plain text, styled HTML, Markdown, or structured JSON.
HTML output preserves headings, bold, italic, underline, font sizes, colors, links, lists, and tables.
JSON output produces a structured DOM tree ({"tag": "p", "attributes": {...}, "children": [...]}) for non-tabular files, or an array of objects keyed by column headers for spreadsheets/CSV.
Document Loader plugin (document_loader:file) — enables any third-party code to load documents programmatically via the Document Loader API.
AI Automator plugins for text_long and string_long fields — upload a file and the text is extracted automatically on entity save.
AI Agent function call (file_to_text) — AI agents can read and process documents autonomously.
Extensible architecture — other modules can register new file-type extractors as tagged services without modifying this module.
Dynamic type registration — extractor types and output capabilities are automatically discovered and registered with the Document Loader plugin system.
Per-type output accuracy — when extractors have different output capabilities, plugin definitions are automatically split into capability groups so that getLoaderByType() returns accurate results (no false positives).
No external services, APIs, or server-side applications required.
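
To illustrate the two JSON shapes described above, here is a minimal sketch of how downstream PHP code might consume them. The sample payloads are hypothetical, not captured module output, and the sketch assumes that text leaves appear as plain strings inside "children"; the module's actual representation of text nodes may differ.

```php
<?php

declare(strict_types=1);

/**
 * Concatenates the text found in a node tree shaped like
 * {"tag": ..., "attributes": {...}, "children": [...]}.
 *
 * Assumption: text leaves are plain strings inside "children".
 */
function collect_text(array $node): string {
  $parts = [];
  foreach ($node['children'] ?? [] as $child) {
    if (is_string($child)) {
      // A text leaf: keep it as-is.
      $parts[] = $child;
    }
    elseif (is_array($child)) {
      // A nested element: recurse into its own children.
      $parts[] = collect_text($child);
    }
  }
  return implode('', $parts);
}

// Hypothetical DOM-tree payload for a non-tabular file.
$json = '{"tag":"p","attributes":{},"children":["Hello, ",{"tag":"strong","attributes":{},"children":["world"]},"!"]}';
$tree = json_decode($json, TRUE, 512, JSON_THROW_ON_ERROR);
$text = collect_text($tree);

// Hypothetical spreadsheet/CSV payload: rows keyed by column headers.
$rows_json = '[{"Name":"Ada","Role":"Engineer"},{"Name":"Linus","Role":"Maintainer"}]';
$rows = json_decode($rows_json, TRUE, 512, JSON_THROW_ON_ERROR);
$names = array_column($rows, 'Name');
```

Because spreadsheet rows are keyed by column header, a single column can be read with array_column() without knowing the column order.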

Architecture

All consumers go through a single unified path:

Consumers (Automator / FunctionCall / Document Loader API)
  └─ FileDocumentLoader  (document_loader plugin)
       └─ FileExtractorManager  (routes by file extension)
            └─ Extractors  (auto-discovered tagged services)
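
The "routes by file extension" step in the diagram above can be sketched in plain PHP as follows. This is an illustration only: ExtractorSketchInterface, ExtractorRouterSketch, and PlainTextExtractorSketch are hypothetical names invented for this example, and the module's real FileExtractorManager and extractor contract may look quite different.

```php
<?php

declare(strict_types=1);

// Hypothetical contract; the module's real extractor interface may differ.
interface ExtractorSketchInterface {

  /** @return string[] Lowercase extensions without the leading dot. */
  public function extensions(): array;

  public function extract(string $path): string;

}

// Hypothetical stand-in for the manager that routes by file extension.
final class ExtractorRouterSketch {

  /** @var array<string, ExtractorSketchInterface> */
  private array $byExtension = [];

  public function add(ExtractorSketchInterface $extractor): void {
    // Index the extractor under every extension it declares.
    foreach ($extractor->extensions() as $extension) {
      $this->byExtension[strtolower($extension)] = $extractor;
    }
  }

  public function extract(string $path): string {
    // Match case-insensitively so "REPORT.TXT" and "report.txt" agree.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($this->byExtension[$extension])) {
      throw new InvalidArgumentException("No extractor registered for .$extension");
    }
    return $this->byExtension[$extension]->extract($path);
  }

}

// Toy extractor used only to exercise the router.
final class PlainTextExtractorSketch implements ExtractorSketchInterface {

  public function extensions(): array {
    return ['txt', 'md'];
  }

  public function extract(string $path): string {
    return (string) file_get_contents($path);
  }

}

$router = new ExtractorRouterSketch();
$router->add(new PlainTextExtractorSketch());

// Write a temporary file with an upper-case extension and route it.
$file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'extractor_sketch_' . uniqid() . '.TXT';
file_put_contents($file, "hello\n");
$routed = $router->extract($file);
unlink($file);
```

In the module itself the extractors are described as auto-discovered tagged services, so the explicit add() call here stands in for whatever the service container does at build time.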

Installation

To enable Poppler support in a DDEV environment, add the following to .ddev/config.yaml:

webimage_extra_packages:
  - poppler-utils

Extending — Adding Custom Extractors

Other modules can add support for new file types by registering their own extractor as a tagged service, without modifying this module.
See the module's README.md for a full extractor implementation example.

Requirements

Drupal 10 or 11
AI module >= 1.1.0
Document Loader >= 1.0

PHP libraries (installed automatically via Composer):

phpoffice/phpword >= 1.4 — Word (.docx, .doc) and ODT extraction
phpoffice/phpspreadsheet >= 2.0 — Spreadsheet (.xlsx, .xls, .ods, .csv) extraction
smalot/pdfparser >= 2.12 — PDF extraction (pure PHP)
league/html-to-markdown >= 5.1 — Markdown output conversion

Optional system package for improved PDF extraction:

poppler-utils — When installed, the module uses pdftohtml for higher-quality PDF output with better table, link, and style detection. Falls back to smalot/pdfparser if not available.
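
The fallback described above can be pictured with a small sketch. Both functions below are hypothetical helpers written for illustration, not the module's own detection code; the lookup is passed in as a callable so the choice can be exercised on a machine without Poppler installed.

```php
<?php

declare(strict_types=1);

/**
 * Returns the name of the PDF backend to use.
 *
 * Hypothetical helper: $locate receives a binary name and returns its
 * path, or NULL when the binary cannot be found.
 */
function choose_pdf_backend(callable $locate): string {
  $path = $locate('pdftohtml');
  return (is_string($path) && $path !== '') ? 'pdftohtml' : 'smalot/pdfparser';
}

/**
 * Searches the directories listed in PATH for an executable file.
 */
function locate_on_path(string $binary): ?string {
  $path = getenv('PATH');
  foreach (explode(PATH_SEPARATOR, $path === FALSE ? '' : $path) as $dir) {
    $candidate = $dir . DIRECTORY_SEPARATOR . $binary;
    if ($dir !== '' && is_file($candidate) && is_executable($candidate)) {
      return $candidate;
    }
  }
  return NULL;
}

// Result depends on whether poppler-utils is installed on this machine.
$backend = choose_pdf_backend('locate_on_path');
```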

Recommended modules

AI Agents — Enables AI agent workflows where agents can call file_to_text to read and process uploaded documents autonomously.
AI Context — Provides additional context to AI operations, useful when combining file extraction with other AI tasks.

Similar projects

AI Simple PDF to Text — Handles PDF files, converts to plain text only.
Unstructured — Supports a similar range of file types but requires an external Unstructured API server or a SaaS account.

Supporting this Module

Contributions, bug reports, and feature requests are welcome in the issue queue.

Community Documentation

Documentation, architecture details, and usage examples are included in the module's README.md file.

Supporting organizations:

Vardot: Development
FreelyGive

Project information

Project categories: Artificial Intelligence (AI), Import and export, Media
Ecosystem: AI (Artificial Intelligence)
Created by ahmad khader on 11 February 2026, updated 19 March 2026
This project is not covered by the security advisory policy.
Use at your own risk! It may have publicly disclosed vulnerabilities.
