Problem/Motivation

Users should be protected from AI bot scraping by default. If they want to allow it, they can opt in after the fact by editing robots.txt or using modules like RobotsTxt. This would protect users and teams who are unaware of AI bot scraping, who don't want it, or who don't realize they need to take action in this manner.

From OpenAI's GPTBot documentation: "Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."

I think this is a sensible change; it also means we are not placing full trust in ChatGPT or Google Bard to avoid ingesting things, or to correctly recognize sensitive content as sensitive.

Proposed resolution

Add the following to the default robots.txt to block the major AI crawlers (OpenAI's GPTBot, Common Crawl's CCBot, Anthropic's crawlers, and Google's AI-training crawler):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

It may also be relevant to ship an ai.txt by default for a broader disallow (is this an established, growing standard or just a proposal?).

See also: https://site.spawning.ai/spawning-ai-txt

Comments

kevinquillen created an issue. See original summary.

kevinquillen’s picture

Title: Disallow GPTBot by default » Disallow GPTBot by default in robots.txt
Project: RobotsTxt » Drupal core
Version: 8.x-1.5 » 11.x-dev
Component: Code » base system
Issue summary: View changes

Moving to core queue since RobotsTxt module uses the default robots.txt file to create initial configuration.

kevinquillen’s picture

Issue summary: View changes
cilefen’s picture

This would be the first crawler blocked by default in robots.txt.

kevinquillen’s picture

Here?

https://git.drupalcode.org/project/drupal/-/blob/11.x/robots.txt?ref_typ...

Because, as I understand it, you'd need to declare it again to prevent it from scraping content paths; the current defaults only cover assets and admin paths:

# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
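One thing worth noting: under the robots exclusion standard, a crawler follows only the most specific User-agent group that matches it, so a GPTBot group would not inherit the existing rules. A rough sketch of how the two groups coexist (paths abbreviated from the current default file):

```
# Existing default group: applies to every crawler not matched by a
# more specific group below.
User-agent: *
Disallow: /admin/
Disallow: /core/

# GPTBot matches this group INSTEAD of the * group above, so it needs
# its own complete rule set.
User-agent: GPTBot
Disallow: /
```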

cilefen’s picture

I don't understand what #5 is communicating (to me?). If it is to me, then maybe my comment wasn't clear.

kevinquillen’s picture

I thought you were saying that GPTBot is already blocked from scraping a Drupal site by default; however, I am not seeing that.

cilefen’s picture

That explains the confusion. That's not what I was saying.

This issue would introduce, for the first time, blocking a specific crawler by default. By that I mean, Drupal AFAIK has never done that, and it comes with some downsides. Notably, there are many crawlers, and doing this could open the door to requests to add other hated crawlers.

So all I'm saying is that there is another decision implicit in this issue: whether to block specific crawlers.

mindaugasd’s picture

@kevinquillen why? What does this achieve long term?
Is there an article that goes into this in depth, weighing both sides of the argument?
This would be a significant decision and needs solid reasons.

kevinquillen’s picture

https://searchengineland.com/more-popular-websites-blocking-gptbot-432531

https://www.kevin-indig.com/most-sites-will-block-chat-gpt/

https://searchengineland.com/websites-blocking-gptbot-431183

It's not a knock against AI. I am simply saying that a CMS should likely have this by default to protect users from their content entering models either prematurely or at all. This all largely entered the public consciousness in the last 12 months, so it's brand new to most people. But I can see sites finding their content parroted back by ChatGPT models without consent, and IMO it would look bad on Drupal to assume users would just know to add this to robots.txt post-install. This is different from search engine crawlers, which have been around a long time. I think it should be something users opt into ("okay, I am ready to allow GPTBot now, remove it") or not at all, rather than waking up one day to find your content assimilated.

I can't see far enough down the road yet, but it seems like once something is in an LLM, it's permanent. Depending on the type of site, you may not want this at all, but at the same time it may not even occur to people until it's too late to do anything about it. If you get indexed by Google by accident, you can work through a process with them to delist it. I haven't seen anything suggesting the same is true for AI.

NPR, for example: https://www.npr.org/robots.txt

larowlan’s picture

Is there a wider list we should be considering, rather than just one AI-based crawler?
I agree with @cilefen we shouldn't just pick one.

kevinquillen’s picture

Reuters has even more. Is there a standard somewhere anyone knows of?

User-agent: PiplBot
Disallow: /

User-agent: CCbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

https://searchengineland.com/robots-txt-new-meta-tag-llm-ai-429510

kevinquillen’s picture

Title: Disallow GPTBot by default in robots.txt » Disallow AI bots by default in robots.txt
Issue summary: View changes
mindaugasd’s picture

I read the articles.

Summary:

  • websites block AI out of financial interest.
  • this is a massive trend among top (i.e., profitable) websites.

Conclusion:

  • this is slowing AI progress (maybe a good thing?)
  • this is slowing AI usefulness, which is probably very bad for users: good for business, bad for users.

I seem to agree it can be blocked by default, because:

  1. It's a massive trend anyway.
  2. As Kevin has said, it's not possible to remove data after it's in the model. So data can end up in a model without people having had enough time to consent to sharing it (if bots are not blocked by default). People should be given time to decide.

Counter-arguments?

kevinquillen’s picture

Issue summary: View changes
kevinquillen’s picture

Issue summary: View changes
mindaugasd’s picture

scott_euser’s picture

At my agency (Soapbox) we work with Think Tanks and other organisations that generally work to positively influence policy and decision making. We have consulted with a number of them on their thoughts since the Robots.txt blocking was added as an option. The general feeling we gather is that they want their information used to better inform the end user, rather than blocking the information and having AI bots produce a less accurate result. There are concerns about citations and credits and prioritisation of research driven and informed answers of course, but the general idea is that they prefer to allow AI bots.

There is a danger that this goes unnoticed if it is not front and centre in a site builder's installation steps, resulting in an eventually large number of Drupal websites no longer contributing to the quality of the responses AI bots give. I'm not sure if there are any statistics, but I would expect a higher percentage of Drupal sites to come from organisations that influence policy and decision making, compared to possibly lower-quality content from WordPress sites that may be more prone to the opinions of individuals, so there is a possible danger of a reduction in AI bot response quality.

Not against having to opt in to tracking for our clients, of course, but I suppose this needs a general Drupal policy consideration (if we do not already have one).

cilefen’s picture

My organization is similarly-minded. All of its published work is for everyone. I imagine scholars here would consider bad or missing citations as an oversight on the user’s end rather than a reason to block anyone. But that is only what I think. I will ask around.

mindbet’s picture

To answer @cilefen

The proposal above seems analogous to Drupal's blocking of Google FLoC

https://www.drupal.org/project/drupal/issues/3209628

FLoC has now gone away and so have the blocking headers.

https://www.drupal.org/project/drupal/issues/3260401

That said, my preference is that blocking AI bots shouldn't be in core;
it should be in contrib for those who want to opt out.

Perhaps I am still on a sugar high from drinking the AI Kool-Aid, but I think the benefits of LLMs will greatly outweigh the risks.

AI bots reading content should be considered transformative use.

Imagine multiple superintelligences reading the medical literature, coming up with new treatments and models -- but wait -- a content troll company like Elsevier prevents this with exorbitant fees or a complete blockade.

For your consideration, an essay from Benedict Evans:

Generative AI and intellectual property

https://www.ben-evans.com/benedictevans/2023/8/27/generative-ai-ad-intel...

cilefen’s picture

I've confirmed in a technical staff committee meeting that blocking this crawler from accessing public information would be perceived as an anti-pattern by my organization. Unattributed citations are the LLM user's responsibility.

frazras’s picture

I see this as an opinionated and somewhat intrusive one-sided stance on artificial intelligence as a whole.

The majority of public-facing websites exist to share content, not for profit. Most of the people with issues about their content ending up in a model are a minority of powerful, profit-seeking content creators who will suffer from the progress of technology, as they already have with the internet killing print.
Although not apples to apples, this is like blocking the Internet Archive bot from accessing your site because you might post content you want removed in the future.

Besides, as weak barriers like this are set up, a market for subversion will emerge, with people using lesser-known models to bypass the blacklist. With models like the open-source LLaMA, and the many others with close to the power of GPT-4, these non-commercial options will be just as capable as the popular ones, and hacks will be used to employ them. You will end up chasing a constantly growing list of bots.

I say still include the code, but commented out, for those who care about their content being indexed by AI; they should have the resources to uncomment these lines if they really need them.

cilefen’s picture

dan_metille’s picture

Unless explicitly requested by a customer, I am generally comfortable with feeding data to AI. However, what really frustrates me is that the Anthropic/Claude bots in particular are not adhering to robots.txt instructions. Furthermore, these bots send so many redundant and repetitive requests that I had to spend nearly an entire day blocking what can almost be considered a DDoS attack.

I will refrain from delving into the ethical and trust issues in AI that Anthropic claims to uphold, in order to keep this thread on a positive note.

scott_euser’s picture

Yeah, we've had a couple of sites where we've had to temporarily block ClaudeBot as well (then remove the block), as it was leading to degraded performance for visitors; agreed, it sometimes feels DoS-attack-like. Beyond ClaudeBot, the rest seem to be better throttled at the source, which seems to reflect the vibe on Reddit/X on the topic.

kevinquillen’s picture

We recently had to block on a site because the AI site crawler was effectively taking the site down.
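Since robots.txt is only advisory, blocking a misbehaving crawler ultimately has to happen at the web server or CDN level. For anyone in the same situation, a minimal sketch for Apache with mod_rewrite (e.g. in .htaccess; the bot names here are examples only, match them against your own access logs):

```apache
# Return 403 Forbidden to crawlers that ignore robots.txt.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot|GPTBot|Bytespider) [NC]
RewriteRule ^ - [F,L]
```

Equivalent user-agent rules can be expressed in nginx or at a CDN such as Cloudflare.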

alvarodemendoza’s picture

After seeing millions of visits from these bots, especially from ClaudeBot, I think the following list in robots.txt will make the solution more robust.

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: YouBot
Disallow: /

cilefen’s picture

I believe Drupal Core should be neutral on this matter. Blocking specific bots should be a contributed module feature.

kentr’s picture

Adding to what @scott_euser said, the topic of model collapse has been in the news recently. From Wikipedia: "Model collapse is a phenomenon where machine learning models gradually degrade due to errors coming from uncurated training on the outputs of another model, including prior versions of itself."

I'd hypothesize that the majority of Drupal sites are providers of fresh, quality, human-generated content in many fields. Society appears to be hell-bent on using generative AI as a general tool for real work, so I echo @scott_euser's comment about the danger of adversely affecting AI output.

karolus’s picture

Any updates on this?

I understand that this is a controversial issue, but site operators should have the option of changing what robots.txt allows. On some sites I manage, what I'm doing now is copying in a customized robots.txt every time I update core or contrib. It's just another item in the checklist, but it would be great if there were an easier way to manage this.
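One possibly easier route, assuming a Composer-managed site using the drupal/core-composer-scaffold plugin: exclude robots.txt from scaffolding in the project's composer.json, so core updates stop overwriting the customized copy:

```json
{
    "extra": {
        "drupal-scaffold": {
            "file-mapping": {
                "[web-root]/robots.txt": false
            }
        }
    }
}
```

With that in place, the site's own robots.txt survives core updates without manual re-copying.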

Regarding content governance, as posted upthread, some site operators are fine with anyone scraping their content; others certainly are not.

cilefen’s picture

Thanks for that comment. I just don't think this feature request should be taken on by core. The existence of https://www.drupal.org/project/robotstxt with 40,000 installs and of https://www.drupal.org/project/aitxt (10 installs) means the flexibility to adjust AI content consumption is already well supported within the Drupal ecosystem. IMO we don't want to be chasing specific bots in this core file, both because doing so by default is controversial and because of the commit noise it would incur. The contrib modules are a better avenue, even more so if those modules featured a way to append the latest bot list.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.