[Tracker]
Update Summary: [One-line status update for stakeholders]
Check-in Date: MM/DD/YYYY
Additional Collaborators:
Metadata is used by the AI Tracker. Docs and additional fields here.
[/Tracker]
Problem/Motivation
We currently have a Simple PDF module for Automators and Tools at https://www.drupal.org/project/ai_simple_pdf_to_text.
We should create a similar one for Word, using https://github.com/PHPOffice/PHPWord.
Proposed resolution
Release the module.
Target date or deadline
Remaining tasks
AI usage (if applicable)
[ ] AI Assisted Issue
This issue was generated with AI assistance, but was reviewed and refined by the creator.
[ ] AI Assisted Code
This code was mainly generated by a human, with AI autocompleting or parts AI generated, but under full human supervision.
[ ] AI Generated Code
This code was mainly generated by an AI with human guidance, and reviewed, tested, and refined by a human.
[ ] Vibe Coded
This code was generated by an AI and has only been functionally tested.
| Comment | File | Size | Author |
|---|---|---|---|
| #10 | Screencast from 11-02-26 08_15_27.mp4 | 29.8 MB | ahmad khader |
Comments
Comment #2
marcus_johansson commented
This should also handle CSV and TXT.
We should also decide whether to merge this with https://www.drupal.org/project/ai_simple_pdf_to_text, though I think it's good that they are separate, given the number of dependencies each pulls in.
Comment #3
arianraeesi commented
Comment #4
ahmad khader commented
Comment #5
ahmad khader commented
@marcus_johansson I've been working on this and went a bit broader: instead of a Word-only module, I created AI File to Text (ai_file_to_text), which handles Word (.docx, .doc), ODT, ODS, CSV, TXT, Markdown, and PDF, all in one module.
Regarding the naming discussion and whether to merge with ai_simple_pdf_to_text: ai_file_to_text covers all the file types mentioned here (Word, CSV, TXT) plus PDF, ODT, ODS, and Markdown. The name ai_file_to_text reflects that it's not limited to one format.
PDF is included because it made sense to have one module that handles all common document types, rather than requiring users to install two separate modules and maintain two separate fields. The file extractor also adds styled HTML output with table detection, heading recognition, and font styling on top of what ai_simple_pdf_to_text provides.
It supports three output formats (plain text, HTML, and Markdown), and extraction is pure PHP, with no external services needed.
Please take a look: https://www.drupal.org/project/ai_file_to_text
Comment #6
ahmad khader commented
It's also easy to add support for new file types: each file type extractor is a service managed by an extractor manager, so the main functionality is independent of the file type.
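A minimal sketch of the extractor-manager idea described above, written in Python purely for illustration (the module itself is PHP). Every name here (`ExtractorManager`, `register`, `extract_txt`, `extract_csv`) is hypothetical and not taken from ai_file_to_text; the point is only that the manager dispatches on file extension, so supporting a new format means registering one more extractor.

```python
import csv
import io
from typing import Callable, Dict

# Hypothetical names throughout; this is NOT the module's actual API.
Extractor = Callable[[bytes], str]


class ExtractorManager:
    """Maps file extensions to extractor callables."""

    def __init__(self) -> None:
        self._extractors: Dict[str, Extractor] = {}

    def register(self, extension: str, extractor: Extractor) -> None:
        # Normalize ".CSV" and "csv" to the same key.
        self._extractors[extension.lower().lstrip(".")] = extractor

    def supports(self, filename: str) -> bool:
        return self._extension(filename) in self._extractors

    def extract(self, filename: str, data: bytes) -> str:
        ext = self._extension(filename)
        if ext not in self._extractors:
            raise ValueError(f"No extractor registered for .{ext}")
        return self._extractors[ext](data)

    @staticmethod
    def _extension(filename: str) -> str:
        return filename.rsplit(".", 1)[-1].lower() if "." in filename else ""


def extract_txt(data: bytes) -> str:
    """Plain text needs decoding only."""
    return data.decode("utf-8")


def extract_csv(data: bytes) -> str:
    """Render each CSV row as cells joined by ' | '."""
    rows = csv.reader(io.StringIO(data.decode("utf-8")))
    return "\n".join(" | ".join(row) for row in rows)


manager = ExtractorManager()
manager.register("txt", extract_txt)
manager.register("csv", extract_csv)
```

Calling `manager.extract("data.csv", raw_bytes)` never touches the TXT extractor, and code that calls the manager does not change when a new format is registered.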
Comment #7
kristen pol
What about Word to MD? This would be helpful for CCC.
Or perhaps it's Word => Text => MD?
Comment #8
ahmad khader commented
Actually yes, it also supports MD (Markdown):
file -> HTML -> MD format
file.md -> HTML format
file.md -> text format
file.md -> MD format
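The "file -> HTML -> MD" chain listed above can be pictured as two composed steps. The sketch below is a Python illustration only, not the module's PHP code. `text_to_html` is a hypothetical stand-in for a real extractor and simply treats the first block as a heading, and `HtmlToMarkdown` handles nothing beyond `h1` to `h3` and `p`.

```python
import html
from html.parser import HTMLParser


class HtmlToMarkdown(HTMLParser):
    """Tiny HTML -> Markdown converter: headings and paragraphs only."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self) -> None:
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self._prefix = None  # Markdown prefix of the block being read
        self._text = []      # text fragments of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._prefix, self._text = self.HEADINGS[tag], []
        elif tag == "p":
            self._prefix, self._text = "", []

    def handle_data(self, data):
        if self._prefix is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prefix is not None and (tag in self.HEADINGS or tag == "p"):
            self.blocks.append(self._prefix + "".join(self._text).strip())
            self._prefix = None


def text_to_html(text: str) -> str:
    """Hypothetical 'file -> HTML' step: first block becomes <h1>, the rest <p>."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    if not blocks:
        return ""
    parts = [f"<h1>{html.escape(blocks[0])}</h1>"]
    parts += [f"<p>{html.escape(b)}</p>" for b in blocks[1:]]
    return "".join(parts)


def html_to_markdown(markup: str) -> str:
    """'HTML -> MD' step: feed the markup through the parser and join blocks."""
    parser = HtmlToMarkdown()
    parser.feed(markup)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Going through HTML as the intermediate format means structure such as headings only has to be detected once, and each output format is derived from that single representation.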
Comment #9
ahmad khader commented
Comment #10
ahmad khader commented
@kristen pol
I uploaded a video showing the results.
Comment #11
kristen pol
Very cool. We definitely want this 🙏
Comment #12
marcus_johansson commented
@ahmad - the reason the issue was originally written as Word/text-to-text was dependency hell: if you only need Word loaders, it would only load phpoffice/phpword, and if you only need PDF, it would only load smalot/pdfparser.
The problem right now is that if either of these has conflicts or becomes outdated, both pieces of functionality stop working. But I guess the ease of having one way of loading those files is nice.
I think adding plugins for https://www.drupal.org/project/document_loader would be the next step for this and Unstructured. That way the AI module or CCC can simply assume that some document loader exists, while the implementation is abstracted.
But we can do that in follow-up issues; the module is there and it's working.
Also, if you are OK with it, could you add @unqunq as a maintainer, since he maintains the Document Loader module? And the other way around: if you want to help out with this, it would be great.
Comment #13
marcus_johansson commented
So, RTBC for me - I'll add some follow-up issues.
Comment #14
marcus_johansson commented
It's not code in this repo - so fixed it is :)
Comment #16
arianraeesi commented