This project is not covered by Drupal’s security advisory policy.

AI File to Text automatically extracts content from uploaded document files and converts it to plain text, HTML, Markdown, or structured JSON. Built on an extensible extractor architecture, it integrates with the Drupal AI module and the Document Loader module. It provides three consumers (AI Automator plugins, an AI Agent function call, and a Document Loader plugin) that all flow through the same extraction pipeline. No external services, APIs, or server-side applications are required: extraction runs in pure PHP on your server, with poppler-utils as an optional system package for higher-quality PDF output.

Features

  • 10 file extensions supported out of the box across 7 extractors: Word (.docx, .doc), OpenDocument (.odt), PDF (.pdf), Spreadsheets (.xlsx, .xls, .ods, .csv), Plain Text (.txt), and Markdown (.md).
  • 4 output formats: plain text, styled HTML, Markdown, or structured JSON.
  • HTML output preserves headings, bold, italic, underline, font sizes, colors, links, lists, and tables.
  • JSON output produces a structured DOM tree ({"tag": "p", "attributes": {...}, "children": [...]}) for non-tabular files, or an array of objects keyed by column headers for spreadsheets/CSV.
  • Document Loader plugin (document_loader:file) — enables any third-party code to load documents programmatically via the Document Loader API.
  • AI Automator plugins for text_long and string_long fields — upload a file and the text is extracted automatically on entity save.
  • AI Agent function call (file_to_text) — AI agents can read and process documents autonomously.
  • Extensible architecture — other modules can register new file-type extractors as tagged services without modifying this module.
  • Dynamic type registration — extractor types and output capabilities are automatically discovered and registered with the Document Loader plugin system.
  • Per-type output accuracy — when extractors have different output capabilities, plugin definitions are automatically split into capability groups so that getLoaderByType() returns accurate results (no false positives).
  • No external services, APIs, or server-side applications required.
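
The two JSON shapes listed above can be sketched in plain PHP by building each structure by hand and encoding it. This is only an illustration of the shapes, not the module's actual output: the sample text, the `class` attribute, the use of plain strings for text children, and the column names are all assumptions.

```php
<?php

// Hypothetical DOM-tree node for a non-tabular file, following the
// {"tag", "attributes", "children"} shape described in the feature list.
// Assumption: text children are represented as plain strings.
$paragraph = [
    'tag' => 'p',
    'attributes' => ['class' => 'intro'],
    'children' => [
        'Quarterly results were ',
        // (object) [] encodes as {} rather than [] for empty attributes.
        ['tag' => 'strong', 'attributes' => (object) [], 'children' => ['strong']],
        '.',
    ],
];

// Hypothetical spreadsheet/CSV result: one object per data row, keyed by
// the column headers (assumed here to come from the first row).
$header = ['name', 'qty'];
$dataRows = [['Widget', '3'], ['Gadget', '5']];
$rows = array_map(
    static fn (array $row): array => array_combine($header, $row),
    $dataRows
);

$documentJson = json_encode($paragraph, JSON_PRETTY_PRINT);
$tableJson = json_encode($rows);

echo $documentJson, PHP_EOL, $tableJson, PHP_EOL;
```

The tabular shape is convenient for code that iterates over rows, while the tree shape keeps enough structure to rebuild headings, lists, and inline styling.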

Architecture

All consumers go through a single unified path:

Consumers (Automator / FunctionCall / Document Loader API)
  └─ FileDocumentLoader  (document_loader plugin)
       └─ FileExtractorManager  (routes by file extension)
            └─ Extractors  (auto-discovered tagged services)
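
The routing step in the diagram above (a manager that picks an extractor by file extension) can be sketched in plain PHP without Drupal. Every name here (`SketchExtractor`, `SketchPlainTextExtractor`, `SketchExtractorManager`) is hypothetical and is not the module's real interface; in the module, extractors are discovered as tagged services rather than added by hand.

```php
<?php

declare(strict_types=1);

// All names below are hypothetical; they are not the module's real classes.
interface SketchExtractor
{
    /** @return string[] Lowercase extensions without the leading dot. */
    public function extensions(): array;

    public function extract(string $path): string;
}

final class SketchPlainTextExtractor implements SketchExtractor
{
    public function extensions(): array
    {
        return ['txt', 'md'];
    }

    public function extract(string $path): string
    {
        $text = file_get_contents($path);
        if ($text === false) {
            throw new RuntimeException("Cannot read $path");
        }
        return $text;
    }
}

final class SketchExtractorManager
{
    /** @var array<string, SketchExtractor> Extractors keyed by extension. */
    private array $byExtension = [];

    public function add(SketchExtractor $extractor): void
    {
        foreach ($extractor->extensions() as $extension) {
            $this->byExtension[strtolower($extension)] = $extractor;
        }
    }

    public function extract(string $path): string
    {
        // Route on the lowercased file extension.
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (!isset($this->byExtension[$extension])) {
            throw new InvalidArgumentException("No extractor for .$extension");
        }
        return $this->byExtension[$extension]->extract($path);
    }
}

$manager = new SketchExtractorManager();
$manager->add(new SketchPlainTextExtractor());

// Write a temporary file with an upper-case extension to show that
// routing is case-insensitive in this sketch.
$path = sys_get_temp_dir() . '/sketch-' . uniqid() . '.TXT';
file_put_contents($path, 'hello');
$result = $manager->extract($path);
unlink($path);

echo $result, PHP_EOL;
```

Keeping the extension lookup in one place is what lets all three consumers share the same behavior: adding an extractor changes what every consumer can read.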

Installation

To enable Poppler support in a DDEV environment, add the following to .ddev/config.yaml:

webimage_extra_packages:
  - poppler-utils

Extending — Adding Custom Extractors

Other modules can add support for new file types by registering their own extractors as tagged services, without modifying this module. See the module's README.md for a full extractor implementation example.

Requirements

PHP libraries (installed automatically via Composer):

Optional system package for improved PDF extraction:

  • poppler-utils — When installed, the module uses pdftohtml for higher-quality PDF output with better table, link, and style detection. Falls back to smalot/pdfparser if not available.
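
The fallback described above can be pictured as a simple availability check. The sketch below is an assumption about how such a check could work, not the module's actual detection logic: the helper names `sketch_find_binary()` and `sketch_choose_pdf_backend()` are hypothetical, and the check only looks for an executable `pdftohtml` file in the directories of a PATH-style string.

```php
<?php

// Hypothetical helper: look for an executable named $binary in the
// directories of a PATH-style string. Returns the full path or null.
function sketch_find_binary(string $binary, string $pathEnv): ?string
{
    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue;
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $binary;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Hypothetical chooser: prefer pdftohtml when Poppler is installed,
// otherwise fall back to the pure-PHP parser.
function sketch_choose_pdf_backend(string $pathEnv): string
{
    return sketch_find_binary('pdftohtml', $pathEnv) !== null
        ? 'pdftohtml'
        : 'smalot/pdfparser';
}

// getenv() returns false when PATH is unset; the cast turns that into ''.
echo sketch_choose_pdf_backend((string) getenv('PATH')), PHP_EOL;
```

On a server without poppler-utils this prints `smalot/pdfparser`, which matches the documented behavior of PDF extraction still working without the optional package.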

Recommended modules

  • AI Agents — Enables AI agent workflows where agents can call file_to_text to read and process uploaded documents autonomously.
  • AI Context — Provides additional context to AI operations, useful when combining file extraction with other AI tasks.

Similar projects

  • AI Simple PDF to Text — Handles PDF files, converts to plain text only.
  • Unstructured — Supports a similar range of file types but requires an external Unstructured API server or a SaaS account.

Supporting this Module

Contributions, bug reports, and feature requests are welcome in the issue queue.

Community Documentation

Documentation, architecture details, and usage examples are included in the module's README.md file.
