[Spike] Research PDF upload support for CCC [#3547035]

The quality of PDF to MD conversion varies greatly depending on the tools used. A frontier multimodal model that supports PDF can do it, but simpler PDF to MD tools often write each page's header and footer (e.g. "copyright someone someone") into the MD output for every page, which would pollute the context.
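To illustrate the header/footer problem, here is a minimal Python sketch of one possible cleanup pass. It assumes the converter already returns one text string per page; the function name `strip_repeated_lines` and the `min_share` threshold are hypothetical, not part of any tool mentioned here.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages (likely running headers/footers)."""
    # Count each distinct non-empty line once per page.
    seen_on = Counter()
    for page in pages:
        seen_on.update({line.strip() for line in page.splitlines() if line.strip()})

    # A line must appear on at least two pages, and on min_share of all pages.
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, count in seen_on.items() if count >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Copyright Someone Someone\nIntro text on page one.",
    "Copyright Someone Someone\nMore detail on page two.",
    "Copyright Someone Someone\nClosing notes on page three.",
]
cleaned = strip_repeated_lines(pages)
```

This only catches lines that are identical across pages; footers that change per page (such as page numbers) would need extra handling.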

I've tested tools like https://github.com/datalab-to/marker, which needs an LLM to run, but I think we'll need {#3528673} to really be done before this can be used.

Comment #16

kristen pol (she/her, English, Santa Cruz, CA, USA) commented 26 January 2026 at 19:22

Thanks Aidan. Perhaps this becomes a spike issue then

Side note: you can use square brackets instead of curly brackets to have it spelled out, but I think you know that, so perhaps that was intentional.

Comment #17

kristen pol commented 26 January 2026 at 19:23

Title:	Add PDF upload support to AI CCC	» [Spike] Research PDF upload support for CCC

Comment #18

kristen pol commented 27 January 2026 at 17:30

Assigned:	kristen pol	» Unassigned

Opening up for a contributor.

Comment #19

kristen pol commented 27 January 2026 at 17:34

Issue summary:	View changes

Comment #20

robloach (he/him) commented 29 January 2026 at 19:00

Assigned:	Unassigned	» robloach

Can do some research around this, while paired with #3528673: Create Document Loader Normalization Layer

ETD: 1 week

Comment #21

kristen pol commented 6 February 2026 at 21:34

Hi Rob :) If you don't think you will get to this one early next week, please unassign, in case someone else can pick it up. Thanks!

Comment #22

kristen pol commented 8 February 2026 at 00:26

Parent issue:	#3559379: [META] CCC rearchitecture and roadmap	» #3567798: [META] CCC MVP 1.0 roadmap

Comment #23

kristen pol commented 11 February 2026 at 06:59

Check out progress here:

#3569027: Simple Word to Text

Comment #24

kristen pol commented 11 February 2026 at 07:00

Related issues:	+#3569027: Simple Word to Text

Comment #25

ahmad khader commented 12 February 2026 at 16:00

Hi @hristem,

The AI File to Text module currently supports PDF to MD conversion. While Smalot\PdfParser (a pure PHP parser) may not provide perfect quality, it performs adequately. Table conversion presents challenges, though, because tables in a PDF are just text segments positioned by (X/Y) coordinates, which makes consistent results impossible.

Poppler support has been added as an option. However, many hosting environments may not support the required system dependencies. So this is about the maximum capability achievable without relying on paid LLM APIs or heavy system dependencies.
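As a rough illustration of why positional text makes tables fragile, here is a minimal Python sketch that rebuilds rows from positioned fragments. The `Fragment` type, the `y_tolerance` value, and the top-down Y axis are all assumptions for the example; this is not how Smalot\PdfParser or Poppler work internally.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float
    y: float
    text: str


def rows_from_fragments(
    fragments: list[Fragment], y_tolerance: float = 2.0
) -> list[list[str]]:
    """Cluster fragments into rows by Y, then order each row by X."""
    rows: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Join the current row only if this fragment sits close to its first cell.
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


fragments = [
    Fragment(200, 100.5, "Size"),
    Fragment(50, 100.0, "File"),
    Fragment(50, 120.0, "report.pdf"),
    Fragment(200, 119.4, "2 MB"),
]
table = rows_from_fragments(fragments)
```

The result depends entirely on the tolerance: multi-line cells, merged cells, or slightly misaligned baselines split or merge rows, which is one reason output quality varies between documents.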

The issue mentioned in #15 may occur, as it's a non-AI API.

We are planning to add a document loader (#3573054) to the AI file-to-text module, which I think is worth checking out.

Comment #26

robloach commented 12 February 2026 at 17:20

Most excellent, thanks. In the meantime, I've pushed the minimal one that I had put together over at:
https://www.drupal.org/project/document_loader_pdfparser

It aims to only introduce the PDF Plugin, with minimal dependencies. Can add all as maintainers, just let me know.

Other Thoughts:

- Many Drupal instances I've seen use Media to upload and select PDFs
- Within AI Context, any context should be able to have multiple files
- CKEditor does have Markdown output support, with the CKEditor Markdown Editor module
- CKEditor also has a few plugins to reference uploads/Media too
- I understand we'd need MDXEditor for a more React-like interface elsewhere in AI, but AI Context is just a form, not an interactive chat window. Something to consider.

Comment #27

kristen pol commented 15 February 2026 at 15:43

Thanks for the info! Note, we are going to integrate MDXEditor:

#3547033: AI CCC markdown editor integration

But we will need a way to attach files

Comment #28

kristen pol commented 15 February 2026 at 15:48

It would be good to have a high-level overview of how we'd go about using whichever approach makes sense. Are we needing to wait for:

#3573054: Create Document Loader Plugin

or using:

https://www.drupal.org/project/document_loader_pdfparser

Fill out the steps we need to do in the module, e.g.

- Depend on module(s) xyz
- Add a file upload field
- Configure xyz
- User will be able to go through workflow xyz
- End result: User uploads a PDF, it's converted into MD, and becomes available in the add/edit form as context

UPDATED: It would also be nice to have some demos/screenshots of how this currently works

Comment #29

robloach commented 16 February 2026 at 18:40

Spoke through this today at the stand up. A few possible approaches, each using similar modules...

1. AI File To Text

The AI File To Text module leverages AI tooling to port a file to text.

- Add AI File To Text module
- Add a File Field to the Context Item Entity
- Introduce a Field Widget Action and Automator that populates the Context Entity's Content upon Save, using the AI File To Text's Automators

2. AI File To Text with Document Loader

The Document Loader module introduces a plugin architecture for transforming any kind of document type, even scraped web pages. This is similar to the approach taken by other AI solutions, like LangChain's Document Loaders.

- Handle AI File To Text issue: #3573054: Create Document Loader Plugin
- Add AI File To Text module, along with Document Loader
- Add a File Field to the Context Item Entity
- Add a Field Widget Action that leverages File To Text's Document Loader architecture to populate the Context Entity's Content upon Save

3. Document Loader, with AI Automators

While we have the Document Loader module, there currently isn't AI Tooling directly introduced for it. This would enable transforming pretty much any document format directly through AI Automators.

- Handle Document Loader issue: #3573331: Add standalone Automators for file to formatted text and JSON outputs
- Add Document Loader module
- Add a File Field to the Context Item Entity
- Add a Field Widget Action that uses Document Loader Automators to populate the Context Item's Content field

Summary

All three approaches are similar, leveraging automators and field widget actions to fill in the content of the context entities. The benefit of leveraging Document Loader for this is that it's extensible through plugins, even allowing scraping of webpages. That said, we are currently not blocked on moving forward with a File Field attached to the Context Entity, even with AI File To Text, since the Document Loader architecture will eventually be introduced to it.
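To make the plugin idea concrete, here is a minimal, language-neutral sketch in Python (not Drupal's plugin system, and not the Document Loader module's actual API). The `LOADERS` registry, the `loader` decorator, and `to_markdown` are hypothetical names; the point is only that callers ask for Markdown and the matching plugin is chosen by file type.

```python
from typing import Callable

# Hypothetical registry: maps a file extension to a loader callable.
LOADERS: dict[str, Callable[[bytes], str]] = {}


def loader(extension: str):
    """Register a function as the loader plugin for one file extension."""
    def register(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        LOADERS[extension.lower()] = func
        return func
    return register


@loader("md")
def load_markdown(data: bytes) -> str:
    # Markdown passes through unchanged.
    return data.decode("utf-8")


@loader("txt")
def load_text(data: bytes) -> str:
    # Plain text only gets surrounding whitespace trimmed in this sketch.
    return data.decode("utf-8").strip()


def to_markdown(filename: str, data: bytes) -> str:
    """Pick a loader by extension; fail loudly when no plugin is registered."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension not in LOADERS:
        raise ValueError(f"No document loader registered for '.{extension}'")
    return LOADERS[extension](data)
```

Under this shape, adding PDF support later means registering one more loader, without touching the code that fills the context's content field.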

Comment #30

16 February 2026 at 20:42

robloach opened merge request !66

Comment #31

robloach commented 16 February 2026 at 21:02

Status	File	Size
new	Screenshot from 2026-02-16 15-59-44.png	145.96 KB

Just pushed up a hardcoded demonstration of using File To Text in Merge Request !66.

Again, this is just a quick demonstration. Ideally, we would take on one of the approaches from #29 and:

- Use Automators/Field Widget Actions
- Adopt Document Loaders to allow a pluggable architecture around loading the different documents
- Have it append the content to the end so that it doesn't eat all the Context's content
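The last point, appending instead of overwriting, could look roughly like the Python sketch below. The function name `append_extracted` and the "Imported from" heading are assumptions made for illustration; they are not taken from Merge Request !66.

```python
def append_extracted(existing: str, extracted: str, source_name: str) -> str:
    """Append extracted text under a source heading, keeping existing content."""
    extracted = extracted.strip()
    if not extracted:
        # Nothing was extracted, so leave the context untouched.
        return existing
    section = f"## Imported from {source_name}\n\n{extracted}\n"
    if not existing.strip():
        return section
    return f"{existing.rstrip()}\n\n{section}"


combined = append_extracted("Brand voice: friendly.", "Page text", "guide.pdf")
```

Labelling each appended section with its file name also leaves a trace of where the text came from.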

Comment #32

marcus_johansson commented 17 February 2026 at 07:46

My preference is to go with Document Loader, so it's future-proof for enterprise document loaders like Unstructured.io or Docling.

I think you should keep dependencies at a minimum and solve it without Automators/FWA. One thing to note and think about as well is loading animations: Unstructured's best models can take up to a minute to get everything correctly handled.

I'm wondering whether, at a later stage, we want to actually upcast the MDXEditor so it can load files in general when the document loader is installed. It would be cool if you could drag a file to the textarea and it's filled in :)

Comment #33

kristen pol commented 17 February 2026 at 17:45

Status:	Active	» Needs review

Rob, Artem, and I just discussed at the standup. It seems like there are four approaches to move forward on the UX side.

All of these use Document Loader, but the module does not have security coverage.

1. Add a file field to the context form; the user uploads the PDF and saves the form, the PDF is parsed, and the context is added into the content field; they have to go back and edit the form to see the changes

I do not like this workflow at all from a UX perspective; it's unclear and feels like a bug

2. Add a field, automator, and FWA; the user uploads the PDF, the PDF is parsed, and context is added without the user having to save the form

Better UX, but note the timing that Marcus points out of potentially 1 minute processing

3. Add a button in MDXEditor; the user clicks the button, and otherwise it works like option 2

Nice UX but we want to track the source document as well so need to keep that in mind

4. Add an interstitial page that shows a list of context sources that can be added:

- Document upload (PDF, Docx, MD, etc)
- URL
- Manual (same as now)
- Dynamic integration 1 (e.g. Google Analytics)
- Dynamic integration 2 (e.g. SEMRush)
- etc

This option doesn't preclude using options 2 or 3 as well.

And, it supports having other sources that may be handled differently.

My vote is probably #4 for now

Comment #34

kristen pol commented 17 February 2026 at 17:46

Assigned:	robloach	» Unassigned
Issue tags:		+Needs UX review

Comment #35

robloach commented 18 February 2026 at 22:28

I had one project a while ago where the site had 100MB PDFs. We needed to use a background task on cron to process them. While I'm sure that kind of thing is out-of-scope for ai_context, I'm sure there are users who will run into it.
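For reference, the general shape of that kind of background processing is sketched below in Python. This is a generic illustration, not Drupal's Queue API or the code from that project; `run_cron`, `convert`, and the `time_budget` parameter are hypothetical names.

```python
import time
from collections import deque
from typing import Callable


def run_cron(
    queue: deque, convert: Callable[[str], str], time_budget: float = 30.0
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert queued files until the queue is empty or the time budget runs out."""
    deadline = time.monotonic() + time_budget
    done: dict[str, str] = {}
    failed: dict[str, str] = {}
    while queue and time.monotonic() < deadline:
        path = queue.popleft()
        try:
            done[path] = convert(path)
        except Exception as error:
            # Keep going: one bad file should not block the rest of the queue.
            failed[path] = str(error)
    return done, failed


pending = deque(["a.pdf", "b.pdf"])
done, failed = run_cron(pending, lambda path: path.upper())
```

Anything still in the queue when the budget runs out is simply picked up on the next run, which keeps each run short even when individual files are very large.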

Comment #36

kristen pol commented 18 February 2026 at 23:56

Status:	Needs review	» Fixed

Thanks, everyone!

We have options for a good way forward.

I will make a follow-up issue to do the work, but will leave out the details, because it requires a UX direction first.

My feeling is this won't be prioritized before Chicago, given all the other work, but maybe we can squeeze it in right beforehand.

Comment #38

kristen pol commented 18 February 2026 at 23:57

Status:	Fixed	» Closed (fixed)

Comment #39

kristen pol commented 19 February 2026 at 00:00

Issue tags:	-Needs UX review

Follow-up issue:

#3574413: Add PDF context source plugin to CCC

Issue tags:		+mvp
Parent issue:	#3545824: Create demo Context Control Center for Vienna 2025	» #3559379: [META] CCC rearchitecture and roadmap
Related issues:	-#3559379: [META] CCC rearchitecture and roadmap

Status:	Postponed	» Active
Issue tags:	-sprint candidate	+AI Initiative Sprint
