--- AI TRACKER METADATA ---
Update Summary: Research PDF upload support for CCC
Blocked by: [#XXXXXX] (New issues on new lines)
Additional Collaborators:
AI Tracker found here: https://www.drupalstarforge.ai/
--- END METADATA ---

Problem/Motivation

We want to allow pulling context content from a PDF file via PDF-to-Markdown (MD) conversion, and we need to know our options.

Proposed resolution

Research options and report findings

Remaining tasks

  • Research and write up notes
  • Review with others
  • Create follow up issues

Comment | File | Size | Author
#31 | Screenshot from 2026-02-16 15-59-44.png | 145.96 KB | robloach

Issue fork ai_context-3547035

Comments

kristen pol created an issue. See original summary.

kristen pol’s picture

Component: Tracks » Code

switching to code component

kristen pol’s picture

Project: Drupal AI Initiative » Context Control Center (CCC)
Version: » 1.0.x-dev

Moving to CCC module

kristen pol’s picture

Assigned: kristen pol » Unassigned
Category: Task » Feature request
marcus_johansson’s picture

Please be aware of #3528673: Create Document Loader Normalization Layer. When that is solved, this issue should be a lot simpler.

kristen pol’s picture

Status: Active » Postponed
Related issues: +#3559379: [META] CCC rearchitecture and roadmap

Postponing until we've prioritized in the roadmap and defined the new architecture:

#3559379: [META] CCC rearchitecture and roadmap

kristen pol’s picture

Issue tags: +AI Innovation
kristen pol’s picture

Issue tags: +sprint candidate
kristen pol’s picture

Status: Postponed » Active
Issue tags: -sprint candidate +AI Initiative Sprint
kristen pol’s picture

Assigned: Unassigned » kristen pol

If someone wants to focus on this issue, feel free to DM me to discuss the approach.

afoster’s picture

The quality of PDF-to-MD conversion varies greatly depending on the tool used. A frontier multimodal model that supports PDF can do it, but simpler PDF-to-MD tools often do things like writing the header and footer ("copyright someone someone") into the MD output for every page.

(which would pollute the context)

I've tested tools like https://github.com/datalab-to/marker, which need an LLM to run, but I think we'll need {#3528673} to really be done before this can be used.
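
The header/footer pollution described above can be reduced with a cheap post-processing pass. The following is a minimal sketch, not part of any module mentioned in this thread: it assumes the converter returns one text string per page, and the function name `strip_repeated_lines` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_fraction` of the pages."""
    if len(pages) < 2:
        return pages
    # Count each distinct non-empty line once per page.
    seen_on = Counter()
    for page in pages:
        seen_on.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    repeated = {line for line, count in seen_on.items() if count >= threshold}
    # Rebuild every page without the lines judged to be running headers/footers.
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]


# Illustrative input: the same copyright line appears on every page.
pages = [
    "Copyright Someone\n# Intro\nFirst page text.",
    "Copyright Someone\nSecond page text.",
    "Copyright Someone\nThird page text.",
]
cleaned = strip_repeated_lines(pages)
```

This only catches lines that are identical across pages. Footers that change per page (for example, page numbers) would need extra normalization before counting.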

kristen pol’s picture

Thanks Aidan. Perhaps this becomes a spike issue then.

Side note: you can use square brackets instead of curly brackets to have the issue title spelled out, but I think you know that, so perhaps that was intentional.

kristen pol’s picture

Title: Add PDF upload support to AI CCC » [Spike] Research PDF upload support for CCC
kristen pol’s picture

Assigned: kristen pol » Unassigned

Opening up for a contributor.

kristen pol’s picture

Issue summary: View changes
robloach’s picture

Assigned: Unassigned » robloach

Can do some research around this, while paired with #3528673: Create Document Loader Normalization Layer

ETD 1w

kristen pol’s picture

Hi Rob :) If you don't think you will get to this one early next week, please unassign, in case someone else can pick it up. Thanks!

kristen pol’s picture

Check out progress here:

#3569027: Simple Word to Text

ahmad khader’s picture

Hi @hristem,

The AI File to Text module currently supports PDF to MD conversion. While Smalot\PdfParser (a pure PHP parser) may not provide perfect quality, it performs adequately. Table conversion presents challenges, though, because tables are just text segments placed at (X/Y) positions, making consistent results impossible.
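
To illustrate why positional table data is fragile, here is a minimal sketch of rebuilding rows from positioned text fragments. It is not how Smalot\PdfParser works internally; the `Fragment` type, the `rows_from_fragments` function, and the `y_tolerance` value are all hypothetical, and the sketch assumes Y grows downward.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with the position it was drawn at (hypothetical type)."""
    x: float
    y: float
    text: str


def rows_from_fragments(fragments: list[Fragment], y_tolerance: float = 2.0) -> list[list[str]]:
    """Group fragments into rows by similar Y, then order each row by X."""
    rows: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # Join the current row only if this fragment sits close to the row's first fragment.
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Illustrative data: cells in the same visual row have slightly different Y values.
fragments = [
    Fragment(120, 10.4, "Price"), Fragment(20, 10.0, "Item"),
    Fragment(20, 30.0, "Widget"), Fragment(120, 30.9, "4.50"),
]
table = rows_from_fragments(fragments)
```

The result depends entirely on the tolerance: a value that works for one PDF splits rows apart or merges them in another, which is one way the inconsistency shows up.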

Poppler support has been added as an option. However, many hosting environments may not support the required system dependencies. So this is about the maximum capability achievable without relying on paid APIs/LLMs or heavy system dependencies.

The issue mentioned in #15 may occur, as it's a non-AI API.

We are planning to add a document loader (#3573054) to the AI file-to-text module, which I think is worth checking out.

robloach’s picture

Most excellent, thanks. In the meantime, I've pushed the minimal one that I had put together over at:
https://www.drupal.org/project/document_loader_pdfparser

It aims to only introduce the PDF Plugin, with minimal dependencies. Can add all as maintainers, just let me know.

Other Thoughts:

  • Many instances of Drupal I've found used Media to upload and select PDFs
  • Within AI Context, any context should be able to have multiple files
  • CKEditor does have Markdown Output support, with the CKEditor Markdown Editor module
  • CKEditor also has a few plugins to reference uploads/Media too
  • I understand we'd need MDXEditor for a more React-like interface elsewhere in AI, but AI Context is just a form, not an interactive chat window. Something to consider.
kristen pol’s picture

Thanks for the info! Note, we are going to integrate MDXEditor:

#3547033: AI CCC markdown editor integration

But we will need a way to attach files

kristen pol’s picture

It would be good to have a high-level overview of how we'd go about using whichever approach makes sense. Are we needing to wait for:

#3573054: Create Document Loader Plugin

or using:

https://www.drupal.org/project/document_loader_pdfparser

Fill out the steps we need to do in the module, e.g.

  • Depend on module(s) xyz
  • Add a file upload field
  • Configure xyz
  • User will be able to go through workflow xyz
  • End result: User uploads a PDF, it's converted into MD, and becomes available in the add/edit form as context

UPDATED: It would also be nice to have some demos/screenshots of how this currently works

robloach’s picture

Spoke through this today at the stand up. A few possible approaches, each using similar modules...

1. AI File To Text

The AI File To Text module leverages AI tooling to port a file to text.

  1. Add AI File To Text module
  2. Add a File Field to the Context Item Entity
  3. Introduce a Field Widget Action and Automator that populates the Context Entity's Content upon Save, using the AI File To Text's Automators

2. AI File To Text with Document Loader

The Document Loader module introduces a plugin architecture to instruct transformations between any kind of document type, even web scraping. This is similar to the approach taken by other AI solutions, like LangChain's Document Loaders.

  1. Handle AI File To Text issue: #3573054: Create Document Loader Plugin
  2. Add AI File To Text module, along with Document Loader
  3. Add a File Field to the Context Item Entity
  4. Add a Field Widget Action that leverages File To Text's Document Loader architecture to populate the Context Entity's Content upon Save

3. Document Loader, with AI Automators

While we have the Document Loader module, there currently isn't AI Tooling directly introduced for it. This would enable transforming pretty much any document format directly through AI Automators.

  1. Handle Document Loader issue: #3573331: Add standalone Automators for file to formatted text and JSON outputs
  2. Add Document Loader module
  3. Add a File Field to the Context Item Entity
  4. Add a Field Widget Action that uses Document Loader Automators to populate the Context Item's Content field

Summary

All three approaches are similar, leveraging automators and field widget actions to fill in the content of the context entities. The benefit of leveraging Document Loader for this is that it's extensible through plugins, even allowing scraping of webpages. That said, we are currently not blocked from moving forward with a File Field attached to the Context Entity, even with AI File To Text, since the Document Loader architecture will eventually be introduced to it.
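
The pluggable idea behind all three approaches can be sketched in a few lines. This is only an illustration of the pattern, not the Document Loader module's actual plugin API: the registry, `register_loader`, and `load_document` are hypothetical names, and a real PDF loader would be registered the same way as the two trivial ones shown.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a MIME type to a function turning file bytes into Markdown.
Loader = Callable[[bytes], str]
_LOADERS: Dict[str, Loader] = {}


def register_loader(mime_type: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a loader for one MIME type."""
    def decorator(func: Loader) -> Loader:
        _LOADERS[mime_type] = func
        return func
    return decorator


@register_loader("text/markdown")
def load_markdown(data: bytes) -> str:
    # Markdown needs no conversion, only decoding.
    return data.decode("utf-8")


@register_loader("text/plain")
def load_plain_text(data: bytes) -> str:
    # Plain text is decoded and trimmed.
    return data.decode("utf-8").strip()


def load_document(mime_type: str, data: bytes) -> str:
    """Dispatch to whichever loader is registered for the file's MIME type."""
    try:
        loader = _LOADERS[mime_type]
    except KeyError:
        raise ValueError(f"No document loader registered for {mime_type}") from None
    return loader(data)
```

With this shape, the code that fills the context's content field only calls `load_document`, so adding a new format means registering one more loader rather than changing the form logic.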

robloach’s picture

Status: new
File size: 145.96 KB

Just pushed up a hardcoded demonstration of using File To Text in Merge Request !66:

Again, this is just a quick demonstration. Ideally, we would take on one of the above approaches in #29:

  • Use Automators/Field Widget Actions
  • Adopt Document Loaders to allow a pluggable architecture around loading the different documents
  • Have it append the content to the end so that it doesn't eat all the Context's content
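
The last point above, appending rather than overwriting, might look like the following minimal sketch. The function name `append_converted` is hypothetical, and marking the source file with an HTML comment is an assumption made for illustration, not something Merge Request !66 does.

```python
def append_converted(existing: str, converted: str, source_name: str) -> str:
    """Append converted Markdown to existing content instead of replacing it."""
    # Assumption: an HTML comment records which file the section came from.
    section = f"<!-- source: {source_name} -->\n{converted.strip()}"
    if not existing.strip():
        return section + "\n"
    # Keep what the user already wrote and add the new section after a blank line.
    return f"{existing.rstrip()}\n\n{section}\n"


# Illustrative values only.
combined = append_converted("Brand voice: friendly.", "# Guide\nText\n", "guide.pdf")
```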
marcus_johansson’s picture

My preference is Document Loader, so it's future-proof and can use enterprise document loaders like Unstructured.io or Docling.

I think you should keep dependencies at a minimum and solve it without Automators/FWA - one thing to note and to think about as well is loading animations. Unstructured's best models can take up to a minute to get everything correctly handled.

I'm wondering whether, at a later stage, we want to actually upcast the MDXEditor so it can load files in general when the document loader is installed. It would be cool if you could drag a file to the textarea and it's filled in :)

kristen pol’s picture

Status: Active » Needs review

Rob, Artem, and I just discussed at the standup. It seems like there are four approaches to move forward on the UX side.

All of these use Document Loader, but the module does not have security coverage.

1. Add a file field to the context form; the user uploads the PDF and saves the form, the PDF is parsed, and the context is added into the content field; they have to go back and edit the form to see the changes

I do not like this workflow at all from a UX perspective; it's unclear and feels like a bug

2. Add a field, automator, and FWA; the user uploads the PDF, the PDF is parsed, and context is added without the user having to save the form

Better UX, but note the timing that Marcus points out of potentially 1 minute processing

3. Add a button in MDXEditor; the user clicks the button, and otherwise it works like option 2

Nice UX, but we want to track the source document as well, so we need to keep that in mind

4. Add an interstitial page that shows a list of context sources that can be added:

- Document upload (PDF, Docx, MD, etc)
- URL
- Manual (same as now)
- Dynamic integration 1 (e.g. Google Analytics)
- Dynamic integration 2 (e.g. SEMRush)
- etc

This option doesn't preclude using options 2 or 3 as well.

And, it supports having other sources that may be handled differently.

My vote is probably #4 for now

kristen pol’s picture

Assigned: robloach » Unassigned
Issue tags: +Needs UX review
robloach’s picture

I had one project a while ago where the site had 100MB PDFs. We needed to use a background task on cron to process them. While I'm sure that kind of thing is out-of-scope for ai_context, I'm sure there are users who will run into it.
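
The background approach described above can be sketched as a queue that is drained within a fixed time budget on each run. This is an illustration only: in Drupal, the Queue API with a cron-run queue worker would be the natural home for it, and the function name, the file names, and the 30-second budget here are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


def process_queue(queue: Deque[str], convert: Callable[[str], None], time_budget_seconds: float) -> int:
    """Convert queued files until the queue is empty or the time budget is spent."""
    deadline = time.monotonic() + time_budget_seconds
    processed = 0
    while queue and time.monotonic() < deadline:
        # Items left in the queue are picked up again on the next run.
        convert(queue.popleft())
        processed += 1
    return processed


# Illustrative run: `converted.append` stands in for a slow PDF conversion.
pending = deque(["a.pdf", "b.pdf", "c.pdf"])
converted: list[str] = []
done = process_queue(pending, converted.append, time_budget_seconds=30.0)
```

The budget is only checked between items, so a single very large PDF can still overrun it. Splitting the work per page would be one way to bound that.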

kristen pol’s picture

Status: Needs review » Fixed

Thanks, everyone!

We have options for a good way forward.

I will make a follow-up issue to do the work, but will leave out the details, because it requires a UX direction first.

My feeling is this won't be prioritized before Chicago, given all the other work, but maybe we can squeeze it in right beforehand.

kristen pol’s picture

Status: Fixed » Closed (fixed)