Closed (fixed)
Project:
Context Control Center (CCC)
Version:
1.0.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Issue tags:
Reporter:
Created:
16 Sep 2025 at 18:16 UTC
Updated:
19 Feb 2026 at 00:00 UTC
Comments
Comment #2
kristen pol
Switching to the Code component.
Comment #3
kristen pol
Moving to the CCC module.
Comment #4
kristen pol
Comment #5
marcus_johansson commented
Please be aware of #3528673: Create Document Loader Normalization Layer. When that is solved, this issue should be a lot simpler.
Comment #6
kristen pol
Postponing until we've prioritized in the roadmap and defined the new architecture:
#3559379: [META] CCC rearchitecture and roadmap
Comment #7
kristen pol
Comment #8
kristen pol
Comment #9
kristen pol
Comment #10
kristen pol
Comment #11
kristen pol
See https://www.drupal.org/project/ai_simple_pdf_to_text and https://www.drupal.org/project/ai_initiative/issues/3569027
Comment #12
kristen pol
Comment #13
kristen pol
Related:
#3547034: [Spike] Research URL support for CCC
#3569310: [META] Context source plugin feature (context from PDF/MD/Word/URL/etc)
Comment #14
kristen pol
If someone wants to focus on this issue, feel free to DM me to discuss the approach.
Comment #15
afoster commented
The quality of PDF-to-MD conversion varies greatly depending on the tools used. Frontier multimodal models that support PDF input can do it, but simpler PDF-to-MD tools often write each page's header and footer (e.g. "copyright someone someone") into the MD output for every page, which would pollute the context.
I've tested tools like https://github.com/datalab-to/marker, which need an LLM to run, but I think we'll need {#3528673} to really be done before this can be used.
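To illustrate the header/footer pollution described above, here is a minimal sketch of one common mitigation: dropping lines that recur on most pages. It is not taken from any module mentioned in this thread; the function name `strip_repeated_lines` and the `min_ratio` parameter are hypothetical, and it assumes the converter already returns one text string per page.

```python
import math
from collections import Counter


def strip_repeated_lines(pages: list[str], min_ratio: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_ratio` of the pages.

    Assumption: headers/footers are byte-identical on each page, so
    varying text such as "Page 3 of 10" is NOT caught by this sketch.
    """
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]

    # Count each distinct non-empty line at most once per page.
    counts: Counter = Counter()
    for lines in page_lines:
        counts.update({line for line in lines if line})

    # Require at least two occurrences so single-page files are left untouched.
    threshold = max(2, math.ceil(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if n >= threshold}

    return [
        "\n".join(line for line in lines if line not in repeated)
        for lines in page_lines
    ]


pages = [
    "Intro\nCopyright Someone",
    "Body text\nCopyright Someone",
    "End\nCopyright Someone",
]
cleaned = strip_repeated_lines(pages)
```

A frequency heuristic like this can also remove legitimate repeated content (for example a recurring table heading), so it is a trade-off rather than a fix.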
Comment #16
kristen pol
Thanks Aidan. Perhaps this becomes a spike issue then.
Side note: you can use square brackets instead of curly brackets to have it spell out the issue title, but I think you know that, so perhaps that was intentional.
Comment #17
kristen pol
Comment #18
kristen pol
Opening up for a contributor.
Comment #19
kristen pol
Comment #20
robloach
Can do some research around this, while paired with #3528673: Create Document Loader Normalization Layer
ETD 1w
Comment #21
kristen pol
Hi Rob :) If you don't think you will get to this one early next week, please unassign, in case someone else can pick it up. Thanks!
Comment #22
kristen pol
Comment #23
kristen pol
Check out progress here:
#3569027: Simple Word to Text
Comment #24
kristen pol
Comment #25
ahmad khader commented
Hi @hristem,
The AI File to Text module currently supports PDF to MD conversion. While Smalot\PdfParser (a pure PHP parser) may not provide perfect quality, it performs adequately. Table conversion presents challenges, though, because a PDF stores tables only as text segments positioned by (X/Y) coordinates, making consistent results impossible.
Poppler support has been added as an option. However, many hosting environments may not support the required system dependencies. So this is about the maximum capability achievable without relying on paid LLM APIs or heavy system dependencies.
The issue mentioned in #15 may occur, as it's a non-AI API.
We are planning to add a document loader (#3573054) to the AI file-to-text module, which I think is worth checking out.
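The table problem mentioned above can be sketched as follows. This is not how Smalot\PdfParser or Poppler work internally; it is a hypothetical illustration, assuming the parser yields text fragments with `x`/`y` coordinates and that `y` grows downward. The `Fragment` class, `group_into_rows`, and `y_tolerance` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float
    y: float
    text: str


def group_into_rows(fragments: list[Fragment], y_tolerance: float = 2.0) -> list[list[str]]:
    """Cluster fragments with nearly equal y into rows, ordered left to right."""
    rows: list[list[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # Compare against the first fragment of the current row.
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


cells = [
    Fragment(100, 10.5, "B"),
    Fragment(10, 10.0, "A"),
    Fragment(10, 30.0, "C"),
    Fragment(100, 29.2, "D"),
]
table = group_into_rows(cells)              # two rows of two cells
broken = group_into_rows(cells, 0.1)        # every cell becomes its own row
```

The result depends entirely on the chosen tolerance, which is one reason position-based table extraction tends to be inconsistent across documents.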
Comment #26
robloach
Most excellent, thanks. In the meantime, I've pushed the minimal one that I had put together over at:
https://www.drupal.org/project/document_loader_pdfparser
It aims to only introduce the PDF Plugin, with minimal dependencies. Can add all as maintainers, just let me know.
Other Thoughts:
Comment #27
kristen pol
Thanks for the info! Note, we are going to integrate MDXEditor:
#3547033: AI CCC markdown editor integration
But we will need a way to attach files
Comment #28
kristen pol
It would be good to have a high-level overview of how we'd go about using whichever approach makes sense. Are we needing to wait for:
#3573054: Create Document Loader Plugin
or using:
https://www.drupal.org/project/document_loader_pdfparser
Fill out the steps we need to do in the module, e.g.
UPDATED: It would also be nice to have some demos/screenshots of how this currently works
Comment #29
robloach
Spoke through this today at the stand up. A few possible approaches, each using similar modules...
1. AI File To Text
The AI File To Text module leverages AI tooling to port a file to text.
2. AI File To Text with Document Loader
The Document Loader module introduces a plugin architecture to instruct transformations between any kind of document type, even web scraping. This is a similar approach taken with other AI solutions, like LangChain's Document Loaders.
3. Document Loader, with AI Automators
While we have the Document Loader module, there currently isn't AI Tooling directly introduced for it. This would enable transforming pretty much any document format directly through AI Automators.
Summary
All three approaches are similar, leveraging automators and field widget actions to fill in the content of the context entities. The benefit of leveraging Document Loader for this is that it's extensible through plugins, even allowing scraping of webpages. That said, we are currently not blocked on moving forward with having a File Field attached to the Context Entity, even with AI File To Text, since the Document Loader architecture will eventually be introduced to it.
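For readers unfamiliar with the loader-plugin idea discussed above, here is a language-agnostic sketch in Python. It does not reflect the Document Loader module's actual API or Drupal's plugin system; `register_loader`, `load_document`, and the two example loaders are hypothetical names, and dispatching by file extension is an assumption made for brevity.

```python
from typing import Callable, Dict

# A loader takes raw file bytes and returns extracted text.
Loader = Callable[[bytes], str]

_LOADERS: Dict[str, Loader] = {}


def register_loader(extension: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a loader for one file extension."""
    def decorator(func: Loader) -> Loader:
        _LOADERS[extension.lower()] = func
        return func
    return decorator


@register_loader("md")
def load_markdown(data: bytes) -> str:
    return data.decode("utf-8")


@register_loader("txt")
def load_plain_text(data: bytes) -> str:
    return data.decode("utf-8")


def load_document(filename: str, data: bytes) -> str:
    """Pick a loader by file extension and return the extracted text."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    loader = _LOADERS.get(extension)
    if loader is None:
        raise ValueError(f"No loader registered for '.{extension}' files")
    return loader(data)


text = load_document("notes.MD", b"# Title")
```

The point of the pattern is that the caller (here, whatever fills the context entity) never changes when a new format is supported; only a new loader is registered.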
Comment #31
robloach
Just pushed up a hardcoded demonstration of using File To Text in Merge Request !66:

Again, this is just a quick demonstration, optimally we would take on one of the above approaches in #29...
Comment #32
marcus_johansson commented
My preference is that you go with Document Loader, so it's future-proof for enterprise document loaders like Unstructured.io or Docling.
I think you should keep dependencies to a minimum and solve it without Automators/FWA. One more thing to think about is loading animations: Unstructured's best models can take up to a minute to get everything correctly handled.
I'm wondering whether, at a later stage, we want to actually upcast the MDXEditor so it can load files in general when the document loader is installed. It would be cool if you could drag a file onto the textarea and have it filled in :)
Comment #33
kristen pol
Rob, Artem, and I just discussed at the standup. It seems like there are four approaches to move forward on the UX side.
All of these use Document Loader, but the module does not have security coverage.
1. Add a file field to the context form; the user uploads the PDF and saves the form, the PDF is parsed, and the context is added into the content field; they have to go back and edit the form to see the changes
I do not like this workflow at all from a UX perspective; it's unclear and feels like a bug
2. Add a field, automator, and FWA; the user uploads the PDF, the PDF is parsed, and context is added without the user having to save the form
Better UX, but note the timing that Marcus points out of potentially 1 minute processing
3. Add a button in MDXEditor; the user clicks the button, and otherwise it works like option 2
Nice UX but we want to track the source document as well so need to keep that in mind
4. Add an interstitial page that shows a list of context sources that can be added:
- Document upload (PDF, Docx, MD, etc)
- URL
- Manual (same as now)
- Dynamic integration 1 (e.g. Google Analytics)
- Dynamic integration 2 (e.g. SEMRush)
- etc
This option doesn't preclude using options 2 or 3 as well.
And, it supports having other sources that may be handled differently.
My vote is probably #4 for now
Comment #34
kristen pol
Comment #35
robloach
I had one project a while ago where the site had 100MB PDFs. We needed to use a background task on cron to process them. While I'm sure that kind of thing is out-of-scope for ai_context, I'm sure there are users who will run into it.
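The background-processing pattern mentioned above can be sketched like this. It is not Drupal's Queue API and not code from the project referred to; `submit`, `run_cron`, and the 5 MB `SIZE_LIMIT_BYTES` cutoff are hypothetical, and the in-memory `deque` stands in for a persistent queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

SIZE_LIMIT_BYTES = 5 * 1024 * 1024  # hypothetical cutoff for inline parsing

# In-memory stand-in for a persistent queue (assumption for this sketch).
_pending: Deque[Tuple[str, bytes]] = deque()


def submit(name: str, data: bytes, parse: Callable[[bytes], str]) -> Optional[str]:
    """Parse small files immediately; defer large ones to the next cron run."""
    if len(data) <= SIZE_LIMIT_BYTES:
        return parse(data)
    _pending.append((name, data))
    return None  # caller should show a "processing" state


def run_cron(parse: Callable[[bytes], str], max_items: int = 1) -> Dict[str, str]:
    """Process at most `max_items` queued files per invocation."""
    results: Dict[str, str] = {}
    for _ in range(min(max_items, len(_pending))):
        name, data = _pending.popleft()
        results[name] = parse(data)
    return results


def fake_parse(data: bytes) -> str:
    # Placeholder for a real PDF-to-text call.
    return f"{len(data)} bytes"


inline = submit("small.pdf", b"abc", fake_parse)
deferred = submit("big.pdf", b"x" * (SIZE_LIMIT_BYTES + 1), fake_parse)
processed = run_cron(fake_parse)
```

Capping the number of items per run keeps a single cron invocation from being monopolized by a backlog of large files.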
Comment #36
kristen pol
Thanks, everyone!
We have options for a good way forward.
I will make a follow-up issue to do the work, but will leave out the details, because it requires a UX direction first.
My feeling is this won't be prioritized before Chicago, given all the other work, but maybe we can squeeze it in right beforehand.
Comment #38
kristen pol
Comment #39
kristen pol
Follow-up issue:
#3574413: Add PDF context source plugin to CCC