[Spike] Research URL support for CCC [#3547034]

Is the intent to convert the entire HTML to Markdown, to convert only a section of the HTML (e.g. a div with an ID), or something more specific?
Nothing in the head would be relevant; the context is within the body tag.

Comment #14

kristen pol

she/her

English

Santa Cruz, CA, USA

commented 28 January 2026 at 00:54

Header and footer content likely wouldn't make sense to include.

Let’s assume some use cases. The user has separate external web pages with:

- Brand guidelines
- Writing guide
- Product catalog

The user can provide a URL, and the tool will take the main content of the page and convert it to Markdown.
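
A minimal sketch of that flow, written in Python with only the standard library for brevity (a Drupal implementation would be PHP). It assumes the page wraps its main content in a `<main>` element, which many pages do not, and it handles only headings, paragraphs, and list items. The names `MainContentToMarkdown` and `main_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser


class MainContentToMarkdown(HTMLParser):
    """Collects text inside <main> and emits very basic Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.in_main = 0    # nesting depth of <main>
        self.prefix = None  # Markdown prefix of the open block, if any
        self.buffer = []    # text pieces of the open block
        self.lines = []     # finished Markdown lines

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.in_main += 1
        elif self.in_main and tag in self.BLOCKS:
            self.prefix = self.HEADINGS.get(tag, "- " if tag == "li" else "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "main":
            self.in_main = max(0, self.in_main - 1)
        elif self.in_main and tag in self.BLOCKS and self.prefix is not None:
            # Collapse runs of whitespace left over from the HTML layout.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.in_main and self.prefix is not None:
            self.buffer.append(data)


def main_to_markdown(html):
    """Returns Markdown for the <main> region; header and footer are skipped."""
    parser = MainContentToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)
```

Anything outside `<main>`, such as the site header and footer, is ignored, which matches the use cases above. Inline formatting (bold, links) is flattened to plain text in this sketch.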

Comment #15

afoster commented 28 January 2026 at 19:20

Maybe we could use the Guzzle approach described here?
https://brightdata.com/blog/web-data/php-web-scraping-libraries

It would be great to be able to sync a copy of what's here as clean Markdown:
https://www.gov.uk/government/publications/government-functional-standar...

Comment #16

ajv009 commented 29 January 2026 at 01:28

Just sharing a quick thought: what if the page is an index page for a bunch of guidelines?
Since this issue is still under discussion, I suggest that instead of a simple grab-and-clean approach, which would later have us working on a multi-layer crawling approach, we consider an agentic approach here. Maybe I'm overcomplicating it.
But imagine giving something like Claude Code or a similar tool a page to extract something from, and it realises that a specific sub-page is also important and grabs that as well.
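
To make the index-page case concrete, here is a small sketch (Python, standard library only) of the first step any crawl or agentic approach would need: listing the same-site sub-pages an index page links to. Deciding which of those links matter is the hard part and is not attempted here. The names `LinkCollector` and `same_site_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Records the href of every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(index_url, html):
    """Returns absolute, de-duplicated http(s) links on the index page's host."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    host = urlparse(index_url).netloc
    seen, result = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(index_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc != host or absolute == index_url:
            continue  # skip other sites and links back to the index itself
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```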

Comment #17

kristen pol commented 29 January 2026 at 03:25

Thanks! Yes, we definitely need to also consider a whole site of docs, but that may have a different workflow and architecture. Like you mentioned, indexing and storing metadata about those external web pages might work well.

For this issue, we can stick to the simple use case of one page. We can make a follow-up issue to explore handling multiple pages together or a whole site.

Comment #18

kristen pol commented 29 January 2026 at 19:09

Regarding a tree of documentation, the issue comments here are relevant:

#3567568: [Discuss] Look at Progressive Disclosure for Context items using a progressive disclosure inspired by Claude Skills

Comment #19

rakhimandhania commented 31 January 2026 at 20:56

Issue tags:	-AI Initiative Sprint	+sprint candidate

Comment #20

webbywe commented 2 February 2026 at 08:36

Here are options for meeting the single-page criteria:

- https://github.com/thephpleague/html-to-markdown: This is already a required Composer dependency of the "AI" module. The thephpleague packages are highly regarded and dependable.
- Guzzle: This is a required Drupal dependency, so it can be used to fetch content from remote URLs.
- Content: Various methods, such as regex, DOMDocument, or Drupal's HTML load, can be used to parse out the content (or another region).

Also note that https://www.drupal.org/project/markdownify is a module that uses thephpleague/html-to-markdown and can be reviewed for implementation ideas. I did not test it to see whether it is an option to use.

Potential Gotchas

I wrote a simple "drush scr" script, but Guzzle might run into issues, as it did when I first tried it with "https://www.drupal.org/drupalorg/style-guide/brand": I kept getting a 403, as the site uses advanced security to prevent scraping or bots. This is perhaps less of an issue if companies themselves are using CCC to get brand content and are able to whitelist internal IP ranges.
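
One way to make this gotcha visible to the person entering the URL is to turn the HTTP status into an actionable message instead of failing silently. The sketch below uses Python's standard library rather than Guzzle, purely for illustration; the function names, the messages, and the user-agent string are all hypothetical, and sending a custom user agent is not guaranteed to get past bot protection.

```python
import urllib.error
import urllib.request


def describe_fetch_failure(status):
    """Maps an HTTP status code to a hint the user could act on."""
    if status in (401, 403):
        return ("blocked: the site refused the request; "
                "the fetcher's IP range or user agent may need to be allowlisted")
    if status == 404:
        return "missing: check the URL"
    if status == 429:
        return "throttled: retry later"
    if 500 <= status < 600:
        return "server error: retry later"
    return f"unexpected status {status}"


def fetch_html(url, user_agent="ccc-url-fetcher/0.1 (hypothetical)", timeout=10):
    """Returns (html, None) on success or (None, message) on failure."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace"), None
    except urllib.error.HTTPError as error:  # must precede URLError (its parent)
        return None, describe_fetch_failure(error.code)
    except urllib.error.URLError as error:
        return None, f"network error: {error.reason}"
    except TimeoutError:
        return None, "network error: timed out while reading the response"
```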

For content, I realized the Markdown may contain a lot of cruft, since the page may have several regions and the actual content would be in a specific div, depending on how the page is constructed. There may be a need to pass a selector to a function for grabbing specific content (e.g. "div.content").
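
A rough sketch of what such a selector-taking function could look like, in Python with only the standard library. It supports just `tag`, `tag.class`, and `tag#id` (not full CSS selectors), returns the inner HTML of the first match, and would hand that fragment to the Markdown converter. The names `parse_selector`, `FragmentExtractor`, and `extract_fragment` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


def parse_selector(selector):
    """Splits 'div.content' or 'div#main' into (tag, attribute, value)."""
    for symbol, attribute in ((".", "class"), ("#", "id")):
        if symbol in selector:
            tag, value = selector.split(symbol, 1)
            return tag, attribute, value
    return selector, None, None


class FragmentExtractor(HTMLParser):
    """Rebuilds the inner HTML of the first element matching the selector."""

    def __init__(self, selector):
        super().__init__()
        self.tag, self.attribute, self.value = parse_selector(selector)
        self.depth = 0        # open elements with the matched tag name
        self.matched = False
        self.done = False
        self.parts = []

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        if self.attribute is None:
            return True
        found = dict(attrs).get(self.attribute) or ""
        return self.value in found.split()

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif self._matches(tag, attrs):
            self.depth, self.matched = 1, True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.depth:
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            # The parser decodes entities, so re-escape the text.
            self.parts.append(escape(data, quote=False))


def extract_fragment(html, selector):
    """Returns the inner HTML of the first match, or None if nothing matches."""
    extractor = FragmentExtractor(selector)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts).strip() if extractor.matched else None
```

Returning `None` when nothing matches lets the caller fall back to the whole body, or tell the user that the selector did not match anything on the page.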

Comment #21

kristen pol commented 6 February 2026 at 21:43

Status:	Active	» Needs review
Issue tags:	-sprint candidate	+AI Initiative Sprint

Thanks, @webbywe! Some great information in here.

Opening this up to review by others.

Comment #22

unqunq

He/Him

English

commented 8 February 2026 at 09:54

I've been working on the Document Loader (https://www.drupal.org/project/document_loader) which is mentioned in https://www.drupal.org/project/ai_context/issues/3547035

The Document Loader initially had a bunch of submodules (see older commit), one of which was for loading content from web pages. It suits the need mentioned in this issue.
I took all the submodules out for now, with the idea of making each of them a standalone module, like document_loader_plugin_api.

I have been working on the https://www.drupal.org/project/ai_simple_pdf_to_text module over the last few days to upgrade it to use the new Tools API and Document Loader. It's working; I only need to do some cleanup and push it for review.

Comment #23

kristen pol commented 8 February 2026 at 16:08

Oh, this is very good to know! Thanks, Nick!

Comment #24

robloach

he/him

commented 16 February 2026 at 00:01

What @unqunq said. The idea is that we would expose the Document Loader API to AI Context, so that you could feed it any document loader plugin, and it would try to transform what was given into text/markdown for AI Context to understand.
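
As a purely illustrative sketch of that shape (this is not the Document Loader API, whose actual interfaces are not described in this thread), the idea of trying each available loader plugin until one can handle the input might look like the following in Python. Every name here is hypothetical.

```python
from typing import Protocol, Sequence


class DocumentLoader(Protocol):
    """Hypothetical plugin contract: say what you support, then convert it."""

    def supports(self, media_type: str) -> bool: ...

    def load(self, content: str) -> str: ...


class MarkdownPassthroughLoader:
    """Markdown input needs no conversion."""

    def supports(self, media_type):
        return media_type == "text/markdown"

    def load(self, content):
        return content


class PlainTextLoader:
    """Treats blank-line-separated chunks of text as Markdown paragraphs."""

    def supports(self, media_type):
        return media_type == "text/plain"

    def load(self, content):
        chunks = (chunk.strip() for chunk in content.split("\n\n"))
        return "\n\n".join(chunk for chunk in chunks if chunk)


def to_markdown(media_type: str, content: str,
                loaders: Sequence[DocumentLoader]) -> str:
    """Uses the first loader that claims the media type."""
    for loader in loaders:
        if loader.supports(media_type):
            return loader.load(content)
    raise LookupError(f"no loader available for {media_type}")
```

With this shape, a web page loader or a PDF loader would be one more entry in the `loaders` list, and the consumer would not need to know which one handled the input.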

We will need to consider the UI around AI Context for all of these plugins though. Might want a cleaner Forms API component in Document Loader, or something to make it easier to use this.