--- AI TRACKER METADATA ---
Update Summary: Research URL support for CCC
Blocked by: [#XXXXXX] (New issues on new lines)
Additional Collaborators:
AI Tracker found here: https://www.drupalstarforge.ai/
--- END METADATA ---

Problem/Motivation

We want to allow pulling context content from a URL and converting it from HTML to Markdown (MD), and we need to know our options.

Proposed resolution

Research options and report findings

Remaining tasks

  • Research and write up notes
  • Review with others
  • Create follow up issues

Comments

kristen pol created an issue. See original summary.

kristen pol’s picture

Component: Tracks » Code

switching to code component

kristen pol’s picture

Project: Drupal AI Initiative » Context Control Center (CCC)
Version: » 1.0.x-dev

Moving to CCC module

kristen pol’s picture

Assigned: kristen pol » Unassigned
Category: Task » Feature request

kristen pol’s picture

Status: Active » Postponed

Postponing until we've prioritized in the roadmap

kristen pol’s picture

Status: Postponed » Active
Issue tags: +mvp, +AI Initiative Sprint, +AI Innovation
kristen pol’s picture

Assigned: Unassigned » kristen pol

If someone wants to focus on this issue, feel free to DM me to discuss the approach.

kristen pol’s picture

Title: Add URL support to AI CCC » [Spike] Research URL support for CCC

Switching to spike for research and prototyping like the PDF issue

kristen pol’s picture

Assigned: kristen pol » Unassigned

Opening up for a contributor.

kristen pol’s picture

Issue summary: View changes
webbywe’s picture

After a quick glance, I have some initial questions about "pulling context content from a URL":

  1. Is the intent to convert the entire HTML to MD, to convert only a section of the HTML (e.g. a div with an ID), or something else specific?
  2. I assume nothing in the head would be relevant and the context is within the body tag?
kristen pol’s picture

Header and footer content likely wouldn’t make sense to include.

Let’s assume some use cases. The user has separate external web pages with:

- Brand guidelines
- Writing guide
- Product catalog

The user can provide a URL and it will take the main content of the page and convert it to markdown.
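
As a rough illustration of "take the main content and convert it to markdown", here is a minimal, language-neutral sketch in Python using only the standard library. It is not the CCC implementation: the set of skipped elements, the handful of supported tags, and the names `MainContentToMarkdown` and `html_to_markdown` are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: these elements are page chrome and never hold main content.
SKIPPED = {"head", "header", "footer", "nav", "aside", "script", "style"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### ", "h4": "#### "}
BLOCKS = {"p", "li"} | set(HEADINGS)


class MainContentToMarkdown(HTMLParser):
    """Collects headings, paragraphs and list items outside skipped regions."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.prefix = None    # Markdown prefix of the block being read
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCKS:
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCKS and self.prefix is not None:
            # Collapse runs of whitespace so each block is a single line.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html):
    """Return a Markdown string built from the page's main text blocks."""
    parser = MainContentToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

The sketch drops inline formatting and links, does not handle block elements nested inside list items, and would also discard a `header` element used inside an article, so a real converter library would be the better choice for anything beyond a prototype.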

afoster’s picture

Maybe we could use the Guzzle approach in here?
https://brightdata.com/blog/web-data/php-web-scraping-libraries

It would be great if we could sync a copy of what's here as clean MD.
https://www.gov.uk/government/publications/government-functional-standar...

ajv009’s picture

Just sharing a quick thought: what if the page is an index-style page for a bunch of guidelines?
Since this issue is still under discussion, I suggest that instead of a simple grab-and-clean approach, which would later force us to build a multi-layer crawling approach, we consider an agentic approach here. Maybe I'm overcomplicating it.
But imagine giving a page to something like Claude Code or a similar tool to extract something from, and it realises that a specific sub-page is also important and grabs that as well...

kristen pol’s picture

Thanks! Yes, we definitely need to also consider a whole site of docs, but that may have a different workflow and architecture. Like you mentioned, indexing and storing metadata about those external web pages might work well.

For this issue, we can have the simple use case of one page. We can make a follow up issue to explore handling multiple pages together or a whole site.

webbywe’s picture

These are options to achieve the single page criteria.

  • https://github.com/thephpleague/html-to-markdown: This is already a required Composer dependency of the "AI" module. The thephpleague packages are highly regarded and dependable.
  • Guzzle: This is already a required Drupal dependency, so it can be used for fetching content from remote URLs.
  • Content: Various methods, such as regex, DOMDocument, or Drupal's Html::load(), can be used to parse out the content (or another region).

Also note that https://www.drupal.org/project/markdownify is a module that uses thephpleague/html-to-markdown and can be reviewed for implementation ideas. I did not test it to see whether it is a viable option.

Potential Gotchas

I ran a simple "drush scr" script, but Guzzle may run into issues. When I initially tried it with "https://www.drupal.org/drupalorg/style-guide/brand", I kept getting a 403 because the site uses advanced security to block scraping and bots. This is perhaps less of an issue if companies are using CCC to fetch their own brand content and are able to whitelist internal IP ranges.
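
To make the 403 case concrete, here is a hedged Python sketch (standard library only, not the Guzzle code from the script mentioned above) of a fetch helper that turns a refusal into a clear, catchable error instead of passing an error page on to the converter. The names `fetch_html` and `FetchBlocked`, the user-agent string, and the choice of 401/403/429 as "blocked" statuses are assumptions for illustration.

```python
import urllib.error
import urllib.request


class FetchBlocked(Exception):
    """The remote server refused the request (for example with a 403)."""


def fetch_html(url, user_agent="ccc-url-spike/0.1", timeout=10,
               opener=urllib.request.urlopen):
    """Fetch a page and return its body as text.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib's urlopen.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as error:
        # Assumption: these statuses mean "blocked", not "page is broken".
        if error.code in (401, 403, 429):
            raise FetchBlocked(
                f"{url} refused the request with HTTP {error.code}"
            ) from error
        raise
```

A caller could catch `FetchBlocked` and tell the user that the site blocks automated requests, which is the situation where allowlisting or pasting the content manually becomes the fallback.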

For content, I realized the conversion may produce a lot of cruft in the Markdown, since a page may have several regions and the actual content would sit in a specific div depending on how the page is constructed. There may be a need to pass a selector to a function for grabbing specific content (e.g. "div.content").
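
The selector idea could look roughly like the following Python sketch, which returns the inner HTML of the first element matching a very small selector subset ("tag", "tag.class" or "tag#id"). This is only an illustration: `extract_fragment`, `FragmentExtractor` and `parse_selector` are hypothetical names, and the sketch assumes well-formed markup with explicit closing tags. A DOM parser with proper CSS or XPath selection would be more robust.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split 'div.content' or 'div#main' into (tag, attribute, value)."""
    for separator, attribute in ((".", "class"), ("#", "id")):
        if separator in selector:
            tag, value = selector.split(separator, 1)
            return tag, attribute, value
    return selector, None, None


class FragmentExtractor(HTMLParser):
    def __init__(self, selector):
        super().__init__(convert_charrefs=False)
        self.tag, self.attr, self.value = parse_selector(selector)
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.parts = []

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        if self.attr is None:
            return True
        found = dict(attrs).get(self.attr) or ""
        return self.value in found.split()

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or self.done or tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&#{name};")


def extract_fragment(html, selector):
    """Return the inner HTML of the first match, or None if nothing matched."""
    extractor = FragmentExtractor(selector)
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts) if extractor.done else None
```

The extracted fragment would then be passed to the HTML-to-Markdown step, so that navigation, sidebars and other regions never reach the converter.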

kristen pol’s picture

Status: Active » Needs review
Issue tags: -sprint candidate +AI Initiative Sprint

Thanks, @webbywe! Some great information in here.

Opening this up to review by others.

unqunq’s picture

I've been working on the Document Loader (https://www.drupal.org/project/document_loader) which is mentioned in https://www.drupal.org/project/ai_context/issues/3547035

The Document Loader initially had a bunch of submodules (see older commit), one of which was for loading content from web pages. It suits the need mentioned in this issue.
I took all the submodules out for now, with the idea of making each of them a standalone module like document_loader_plugin_api.

I have been working on the https://www.drupal.org/project/ai_simple_pdf_to_text module for the last few days to upgrade it to use the new Tools API and Document Loader. It's working; I only need to do some cleanup and push it for review.

kristen pol’s picture

Oh, this is very good to know! Thanks, Nick!

robloach’s picture

What @unqunq said. The idea is that we would expose the Document Loader API to AI Context, so that you could feed it any document loader plugin, and it would try to transform what was given into text/markdown for AI Context to understand.

We will need to consider the UI around AI Context for all of these plugins though. Might want a cleaner Forms API component in Document Loader, or something to make it easier to use this.

kristen pol’s picture

Thanks. Are there any demos or screenshots somewhere of how this works?

kristen pol’s picture

Status: Needs review » Fixed

Thanks, everyone!

See #3547035: [Spike] Research PDF upload support for CCC for additional discussions, including a demo Rob put together.

I think we should do the PDF one first and then tackle this one, but we have a lot of good input here, which I think is sufficient for now.

See also:

#3569310: [META] Context source plugin feature (context from PDF/MD/Word/URL/etc)

kristen pol’s picture

Status: Fixed » Closed (fixed)