This project is not covered by Drupal’s security advisory policy.
AI File to Text automatically extracts content from uploaded document files and converts it to plain text, HTML, Markdown, or structured JSON. Built on an extensible extractor architecture, it integrates with the Drupal AI module and the Document Loader module, providing three consumers that all flow through the same extraction pipeline: AI Automator plugins, an AI Agent function call, and a Document Loader plugin. No external services, APIs, or server-side applications are required; everything runs in pure PHP on your server.
Features
- 10 file extensions supported out of the box across 7 extractors: Word (.docx, .doc), OpenDocument (.odt), PDF (.pdf), Spreadsheets (.xlsx, .xls, .ods, .csv), Plain Text (.txt), and Markdown (.md).
- 4 output formats: plain text, styled HTML, Markdown, or structured JSON.
- HTML output preserves headings, bold, italic, underline, font sizes, colors, links, lists, and tables.
- JSON output produces a structured DOM tree ({"tag": "p", "attributes": {...}, "children": [...]}) for non-tabular files, or an array of objects keyed by column headers for spreadsheets/CSV.
- Document Loader plugin (document_loader:file): enables any third-party code to load documents programmatically via the Document Loader API.
- AI Automator plugins for text_long and string_long fields: upload a file and the text is extracted automatically on entity save.
- AI Agent function call (file_to_text): AI agents can read and process documents autonomously.
- Extensible architecture: other modules can register new file-type extractors as tagged services without modifying this module.
- Dynamic type registration: extractor types and output capabilities are automatically discovered and registered with the Document Loader plugin system.
- Per-type output accuracy: when extractors have different output capabilities, plugin definitions are automatically split into capability groups so that getLoaderByType() returns accurate results (no false positives).
- No external services, APIs, or server-side applications required.
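To make the two JSON shapes described above more concrete, here is a minimal sketch in plain PHP of how a consumer might read them. The sample payloads are invented for illustration, and the sketch assumes that a text leaf appears as a plain string inside "children"; the module's actual leaf representation may differ, so check the README before relying on it.

```php
<?php
declare(strict_types=1);

/**
 * Recursively collects the text of a decoded DOM-tree node.
 * Assumption (not confirmed by the module): each child is either a nested
 * node array with a "children" key, or a plain string holding text.
 */
function node_text(array|string $node): string {
  if (is_string($node)) {
    return $node;
  }
  $text = '';
  foreach ($node['children'] ?? [] as $child) {
    $text .= node_text($child);
  }
  return $text;
}

// Hypothetical sample shaped like the JSON output for a non-tabular file.
$json = '{"tag":"p","attributes":{"class":"intro"},"children":'
  . '["Hello, ",{"tag":"strong","attributes":{},"children":["world"]},"!"]}';
$tree = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
echo node_text($tree), PHP_EOL;

// Hypothetical sample shaped like the JSON output for a spreadsheet or CSV:
// one object per row, keyed by column header.
$rows = json_decode(
  '[{"Name":"Ada","Score":95},{"Name":"Linus","Score":88}]',
  true,
  512,
  JSON_THROW_ON_ERROR
);
$names = array_column($rows, 'Name');
```

Because spreadsheet rows are keyed by column header, standard array helpers such as array_column() can pull out a single column without any knowledge of the original file format.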
Architecture
All consumers go through a single unified path:
Consumers (Automator / FunctionCall / Document Loader API)
└─ FileDocumentLoader (document_loader plugin)
   └─ FileExtractorManager (routes by file extension)
      └─ Extractors (auto-discovered tagged services)
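The routing step in this pipeline can be sketched in plain PHP as follows. This is an illustration of extension-based dispatch only, not the module's code: ExtractorSketch, PlainTextExtractorSketch, and ExtractorManagerSketch are hypothetical names, and the real FileExtractorManager and its extractor interface may look quite different. In Drupal, the list of extractors would typically be supplied by the service container from tagged services rather than built by hand.

```php
<?php
declare(strict_types=1);

/** Hypothetical extractor contract; the module's real interface may differ. */
interface ExtractorSketch {
  /** @return string[] Lowercase extensions without the leading dot. */
  public function extensions(): array;

  public function extract(string $path): string;
}

/** Hypothetical extractor that returns the file contents unchanged. */
final class PlainTextExtractorSketch implements ExtractorSketch {
  public function extensions(): array {
    return ['txt', 'md'];
  }

  public function extract(string $path): string {
    return (string) file_get_contents($path);
  }
}

/** Hypothetical manager that maps each extension to one extractor. */
final class ExtractorManagerSketch {
  /** @var array<string, ExtractorSketch> */
  private array $byExtension = [];

  /** @param iterable<ExtractorSketch> $extractors */
  public function __construct(iterable $extractors) {
    foreach ($extractors as $extractor) {
      foreach ($extractor->extensions() as $ext) {
        $this->byExtension[strtolower($ext)] = $extractor;
      }
    }
  }

  /** Returns the extractor for a file path, or null if none is registered. */
  public function forFile(string $path): ?ExtractorSketch {
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $this->byExtension[$ext] ?? null;
  }
}

$manager = new ExtractorManagerSketch([new PlainTextExtractorSketch()]);
```

With this shape, supporting a new file type means adding one more extractor to the list; the manager and every consumer above it stay unchanged, which is the property the single unified path is meant to provide.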
Installation
To enable Poppler support in a DDEV environment, add the following to .ddev/config.yaml:
webimage_extra_packages:
  - poppler-utils
Extending — Adding Custom Extractors
Other modules can add support for new file types by registering their own extractor as a tagged service, without modifying this module.
See the module's README.md for a full extractor implementation example.
Requirements
- Drupal 10 or 11
- AI module >= 1.1.0
- Document Loader >= 1.0
PHP libraries (installed automatically via Composer):
- phpoffice/phpword >= 1.4 — Word (.docx, .doc) and ODT extraction
- phpoffice/phpspreadsheet >= 2.0 — Spreadsheet (.xlsx, .xls, .ods, .csv) extraction
- smalot/pdfparser >= 2.12 — PDF extraction (pure PHP)
- league/html-to-markdown >= 5.1 — Markdown output conversion
Optional system package for improved PDF extraction:
- poppler-utils: when installed, the module uses pdftohtml for higher-quality PDF output with better table, link, and style detection. It falls back to smalot/pdfparser if poppler-utils is not available.
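The fallback described above could be approximated as in the sketch below. It is not the module's detection logic, which is not documented here: find_binary() and choose_pdf_backend() are hypothetical helpers, and the sketch simply assumes that "installed" means a pdftohtml executable can be found on the PATH.

```php
<?php
declare(strict_types=1);

/**
 * Returns the full path of an executable found on PATH, or null.
 * Hypothetical helper; $path can be overridden to make testing easier.
 */
function find_binary(string $name, ?string $path = null): ?string {
  $path ??= (string) getenv('PATH');
  foreach (explode(PATH_SEPARATOR, $path) as $dir) {
    if ($dir === '') {
      continue;
    }
    $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
    if (is_file($candidate) && is_executable($candidate)) {
      return $candidate;
    }
  }
  return null;
}

/** Picks the PDF backend: pdftohtml when available, otherwise the PHP parser. */
function choose_pdf_backend(?string $pdftohtml): string {
  return $pdftohtml !== null ? 'pdftohtml' : 'smalot/pdfparser';
}

echo choose_pdf_backend(find_binary('pdftohtml')), PHP_EOL;
```

Running this on the web server (or inside the DDEV web container) gives a quick indication of which backend a PATH-based check would select.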
Recommended modules
- AI Agents: enables AI agent workflows where agents can call file_to_text to read and process uploaded documents autonomously.
- AI Context: provides additional context to AI operations, useful when combining file extraction with other AI tasks.
Similar projects
- AI Simple PDF to Text — Handles PDF files, converts to plain text only.
- Unstructured — Supports a similar range of file types but requires an external Unstructured API server or a SaaS account.
Supporting this Module
Contributions, bug reports, and feature requests are welcome in the issue queue.
Community Documentation
Documentation, architecture details, and usage examples are included in the module's README.md file.
Project information
- Project categories: Artificial Intelligence (AI), Import and export, Media
- Ecosystem: AI (Artificial Intelligence)
- Created by ahmad khader



