Closed (fixed)
Project:
Context Control Center (CCC)
Version:
1.0.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Issue tags:
Reporter:
Created:
16 Sep 2025 at 18:15 UTC
Updated:
19 Feb 2026 at 00:10 UTC
Comments
Comment #2
kristen pol
Switching to code component
Comment #3
kristen pol
Moving to CCC module
Comment #4
kristen pol
Comment #5
kristen pol
Comment #6
kristen pol
Postponing until we've prioritized in the roadmap
Comment #7
kristen pol
Comment #8
kristen pol
Related:
#3547035: [Spike] Research PDF upload support for CCC
#3569310: [META] Context source plugin feature (context from PDF/MD/Word/URL/etc)
Comment #9
kristen pol
If someone wants to focus on this issue, feel free to DM me to discuss the approach.
Comment #10
kristen pol
Switching to spike for research and prototyping, like the PDF issue.
Comment #11
kristen pol
Opening up for a contributor.
Comment #12
kristen pol
Comment #13
webbywe commented
Quick glance and have some initial questions regarding "pulling context content from a URL"...
Comment #14
kristen pol
Header and footer content likely wouldn't make sense.
Let’s assume some use cases. The user has separate external web pages with:
- Brand guidelines
- Writing guide
- Product catalog
The user can provide a URL and it will take the main content of the page and convert it to markdown.
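The single-page flow described above (fetch a URL, keep the main content, convert it to markdown) could be sketched roughly as follows. This is a minimal, hypothetical stdlib-only Python illustration of the conversion step; an actual CCC implementation would more likely use Guzzle for fetching and a PHP library such as league/html-to-markdown for conversion.

```python
# Hypothetical sketch of the HTML-to-markdown step, stdlib only.
# Covers only a few block tags; a real converter would handle far more.
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    # Maps handled block tags to the markdown prefix they open with.
    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIXES:
            self.parts.append(self.PREFIXES[tag])

    def handle_endtag(self, tag):
        if tag in self.PREFIXES:
            self.parts.append("\n")  # end each block element on its own line

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.parts.append(text)

def html_to_markdown(html: str) -> str:
    converter = MarkdownConverter()
    converter.feed(html)
    return "".join(converter.parts).strip()
```

For example, a small brand-guidelines page with a heading, a paragraph, and a bulleted list comes back as `# ...`, plain text, and `- ...` lines.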
Comment #15
afoster commented
Maybe we could use the Guzzle approach described here?
https://brightdata.com/blog/web-data/php-web-scraping-libraries
It would be great to do something like syncing a copy of what's here as clean MD:
https://www.gov.uk/government/publications/government-functional-standar...
Comment #16
ajv009 commented
Just sharing a quick thought: what if the page is an index sort of page for a bunch of guidelines...
Since this issue is still under discussion, I suggest that instead of going for a simple grab-and-clean approach, which will later make us work on crawling multiple layers and such, we maybe use an agentic approach here? Maybe I'm overcomplicating it.
BUT imagine the time when you have given something like Claude Code or a similar tool a page to extract something from and it realises that a specific sub-page is also important and grabs that as well...
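To make the crawl idea concrete: before any tool can decide that a sub-page is important, it first has to discover the candidate links on the index page. A minimal, hypothetical stdlib-Python sketch of same-site link discovery (the agentic part, judging relevance, would sit on top of this):

```python
# Hypothetical sketch: collect <a href> links from an index page and
# keep only those on the same host, i.e. the sub-pages a crawler or
# agent might consider fetching next. Stdlib only, for illustration.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def same_site_links(base_url: str, html: str) -> list[str]:
    """Resolve every href against base_url; keep same-host links, deduplicated."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Relative links are resolved against the page URL, and off-site links are dropped, so the result is the queue of same-site sub-pages an agent would then rank for relevance.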
Comment #17
kristen pol
Thanks! Yes, we definitely need to also consider a whole site of docs, but that may have a different workflow and architecture. Like you mentioned, indexing and storing metadata about those external web pages might work well.
For this issue, we can have the simple use case of one page. We can make a follow up issue to explore handling multiple pages together or a whole site.
Comment #18
kristen pol
Regarding a tree of documentation, the comments on this issue are relevant:
#3567568: [Discuss] Look at Progressive Disclosure for Context items using a progressive disclosure inspired by Claude Skills
Comment #19
rakhimandhania commented
Comment #20
webbywe commented
These are options to achieve the single-page criteria.
To also note, https://www.drupal.org/project/markdownify is a module that uses the thephpleague/html-to-markdown library and can be reviewed for implementation ideas. I did not test it to see whether it's an option to use.
Potential Gotchas
I did try a simple "drush scr" script, but Guzzle might run into issues, such as when I initially tried it with "https://www.drupal.org/drupalorg/style-guide/brand": I kept getting a 403 because the site uses advanced security to prevent scraping and bots. This is perhaps less of an issue if companies themselves are using CCC to get brand content and can whitelist internal IP ranges.
For content, I realized the markdown may contain a lot of cruft, since a page may have multiple regions and the actual content may live in a specific div, depending on how the page is constructed. There may be a need to pass a selector to a function for grabbing specific content (e.g. "div.content").
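The selector suggestion could be sketched as below. This is a hypothetical stdlib-Python illustration that hard-codes the "div.content" case; a real implementation would accept an arbitrary selector (in PHP, e.g., via symfony/css-selector or symfony/dom-crawler), so "div.content" here is just an assumed example.

```python
# Hypothetical sketch: keep only the markup inside <div class="content">,
# dropping header/footer regions around it. Tag and class are hard-coded
# for illustration instead of parsing a CSS selector string.
from html.parser import HTMLParser

class RegionExtractor(HTMLParser):
    def __init__(self, tag="div", cls="content"):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0   # nesting depth of self.tag inside the target region
        self.inner = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.inner.append(self.get_starttag_text())
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1  # entered the target region

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                return  # closed the target region itself
        if self.depth:
            self.inner.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.inner.append(data)

def extract_region(html: str) -> str:
    extractor = RegionExtractor()
    extractor.feed(html)
    return "".join(extractor.inner).strip()
```

The extracted fragment would then feed the HTML-to-markdown step, so surrounding navigation and footer cruft never reaches the context.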
Comment #21
kristen pol
Thanks, @webbywe! Some great information in here.
Opening this up to review by others.
Comment #22
unqunq commented
I've been working on the Document Loader (https://www.drupal.org/project/document_loader), which is mentioned in https://www.drupal.org/project/ai_context/issues/3547035
The Document Loader initially had a bunch of submodules (see older commits), one of which was for loading content from web pages. It suits the need mentioned in this issue.
I took all the submodules out for now, with the idea of making each of them a standalone module like document_loader_plugin_api.
I have been working on the https://www.drupal.org/project/ai_simple_pdf_to_text module in the last few days to upgrade it to use the new Tools API and Document Loader. It's working; I only need to do some cleanup and push it for review.
Comment #23
kristen pol
Oh, this is very good to know! Thanks, Nick!
Comment #24
robloach commented
What @unqunq said. The idea is that we would expose the Document Loader API to AI Context, so that you could feed it any document loader plugin, and it would try to transform what it was given into text/markdown for AI Context to understand.
We will need to consider the UI around AI Context for all of these plugins though. Might want a cleaner Forms API component in Document Loader, or something to make it easier to use this.
Comment #25
kristen pol
Thanks. Are there any demos or screenshots somewhere of how this works?
Comment #26
kristen pol
Thanks, everyone!
See #3547035: [Spike] Research PDF upload support for CCC for additional discussions, including a demo Rob put together.
I think we should do the PDF one first and then tackle this one, but we have a lot of good input here, which I think is sufficient for now.
See also:
#3569310: [META] Context source plugin feature (context from PDF/MD/Word/URL/etc)
Comment #28
kristen pol
Comment #29
kristen pol
Follow-up issue:
#3574414: Add webpage (URL) context source plugin to CCC