[Tracker]
Update Summary: [One-line status update for stakeholders]
Check-in Date: MM/DD/YYYY
Additional Collaborators:
Metadata is used by the AI Tracker. Docs and additional fields here.
[/Tracker]

Problem/Motivation

We currently have a Simple PDF module for Automators and Tools at https://www.drupal.org/project/ai_simple_pdf_to_text.

We should create a similar one for Word, based on https://github.com/PHPOffice/PHPWord.

Proposed resolution

Release the module.

Target date or deadline

Remaining tasks

AI usage (if applicable)

[ ] AI Assisted Issue
This issue was generated with AI assistance, but was reviewed and refined by the creator.

[ ] AI Assisted Code
This code was mainly generated by a human, with AI autocompleting or parts AI generated, but under full human supervision.

[ ] AI Generated Code
This code was mainly generated by an AI with human guidance, and reviewed, tested, and refined by a human.

[ ] Vibe Coded
This code was generated by an AI and has only been functionally tested.

Comments

marcus_johansson created an issue. See original summary.

marcus_johansson’s picture

This should also handle csv and txt.

We should also decide whether to merge this with https://www.drupal.org/project/ai_simple_pdf_to_text, though I think it's good that they are separate, given the number of dependencies they pull in.

arianraeesi’s picture

ahmad khader’s picture

Assigned: Unassigned » ahmad khader
ahmad khader’s picture

Assigned: ahmad khader » Unassigned
Status: Active » Needs review

@marcus_johansson I've been working on this and went a bit broader — instead of a Word-only module, I created AI File to Text (ai_file_to_text) which handles Word (.docx, .doc), ODT, ODS, CSV, TXT, Markdown, and PDF all in one module.

Regarding the naming discussion and whether to merge with ai_simple_pdf_to_text: ai_file_to_text covers all the file types mentioned here (Word, CSV, TXT) plus PDF, ODT, ODS, and Markdown. The name ai_file_to_text reflects that it's not limited to one format.

PDF is included because it made sense to have one module that handles all common document types rather than requiring users to install two separate modules and have two separate fields. The file extractor also adds styled HTML output with table detection, heading recognition, and font styling on top of what ai_simple_pdf_to_text provides.

It supports three output formats (plain text, HTML, and Markdown), and extraction is pure PHP — no external services needed.

Please take a look: https://www.drupal.org/project/ai_file_to_text

ahmad khader’s picture

It's also easy to provide support for new file types, as each file type extractor is a service type controlled by an extractor manager; therefore, the main functionality is independent of the file type.
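
The extractor-manager arrangement described above can be sketched roughly as follows. This is only a minimal illustration, not the module's actual code: the names FileExtractorInterface, PlainTextExtractor, ExtractorManager, supportedExtensions() and addExtractor() are all hypothetical. In Drupal, extractors like these would typically be registered as tagged services and collected by the manager, but whether ai_file_to_text wires them that way is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical contract for a single file-type extractor.
 */
interface FileExtractorInterface {

  /**
   * @return string[]
   *   Lowercase file extensions (no dot) this extractor handles.
   */
  public function supportedExtensions(): array;

  /**
   * Returns the plain-text content of the file.
   */
  public function extractText(string $filePath, string $extension): string;

}

/**
 * Illustrative extractor for plain-text files.
 */
final class PlainTextExtractor implements FileExtractorInterface {

  public function supportedExtensions(): array {
    return ['txt', 'md'];
  }

  public function extractText(string $filePath, string $extension): string {
    $contents = file_get_contents($filePath);
    // An unreadable file yields an empty string in this sketch.
    return $contents === FALSE ? '' : $contents;
  }

}

/**
 * Illustrative manager: knows only the interface, never a concrete format.
 */
final class ExtractorManager {

  /**
   * @var array<string, FileExtractorInterface>
   */
  private array $extensionMap = [];

  /**
   * Registers an extractor for every extension it declares.
   */
  public function addExtractor(FileExtractorInterface $extractor): void {
    foreach ($extractor->supportedExtensions() as $extension) {
      $this->extensionMap[strtolower($extension)] = $extractor;
    }
  }

  /**
   * Returns the extracted text, or NULL if the extension is unsupported.
   */
  public function extract(string $filePath, string $extension): ?string {
    $extractor = $this->extensionMap[strtolower($extension)] ?? NULL;
    return $extractor?->extractText($filePath, $extension);
  }

}
```

With this shape, supporting a new file type means adding one class that implements the interface and registering it; the manager itself does not change.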

kristen pol’s picture

What about Word to MD? This would be helpful for CCC

Or perhaps it's Word => Text => MD?

ahmad khader’s picture

Actually, yes. It also supports MD (Markdown):
file -> HTML -> md format
file.md -> HTML format
file.md -> text format
file.md -> md format

ahmad khader’s picture

  /**
   * Extract content from a file in the requested format.
   *
   * @param string $filePath
   *   The real filesystem path to the file.
   * @param string $extension
   *   The file extension (lowercase, no dot).
   * @param string $format
   *   The output format: 'text', 'html', or 'markdown'.
   *
   * @return string|null
   *   The extracted content, or NULL if the extension is unsupported.
   */
  public function extract(string $filePath, string $extension, string $format = 'text'): ?string {
    if (!isset($this->extensionMap[$extension])) {
      return NULL;
    }

    $extractor = $this->extensionMap[$extension];

    if ($format === 'text') {
      return $extractor->extractText($filePath, $extension);
    }

    // Both 'html' and 'markdown' need HTML first.
    $html = $extractor->extractHtml($filePath, $extension);

    if ($format === 'markdown') {
      return $this->htmlToMarkdown($html);
    }

    return $html;
  }

ahmad khader’s picture

Attached file (new, 29.8 MB)

@kristen pol
I uploaded a video showing the results.

kristen pol’s picture

Very cool. We definitely want this 🙏

marcus_johansson’s picture

@ahmad - the reason the issue was written as Word/text-to-text was dependency hell: if you only need Word loaders, it would only load phpoffice/phpword, and if you only need PDF, it would only load smalot/pdfparser.

The problem right now is that if any of these have conflicts or become outdated, both pieces of functionality stop working. But I guess the ease of having one way of loading those files is nice.

I think adding plugins for https://www.drupal.org/project/document_loader would be the next step for this and Unstructured. That way the AI module or CCC can simply assume that some document loader exists, while the implementation stays abstracted.

But we can do that in follow-up issues; the module is there and it's working.

Also, if you are OK with it, could you add @unqunq as a maintainer, since he maintains the Document Loader module? And conversely, if you want to help out with Document Loader, that would be great.

marcus_johansson’s picture

Status: Needs review » Reviewed & tested by the community

So, RTBC from me. I'll add some follow-up issues.

marcus_johansson’s picture

Status: Reviewed & tested by the community » Fixed

It's not code in this repo, so Fixed it is :)

arianraeesi’s picture

Issue tags: -AI Product Development

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.