Problem/Motivation
Users should be protected from AI bot scraping by default. If they want to allow it, they can opt in after the fact by editing robots.txt or using modules like RobotsTxt. This would protect users and teams who are unaware of AI bot scraping, who don't want it, or who don't realize they need to take action.
OpenAI's own documentation for GPTBot states: "Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."
I think this is a sensible change; it also avoids placing full trust in ChatGPT or Google Bard to refrain from ingesting things, or to correctly treat sensitive content as sensitive.
Proposed resolution
Add the following to the default robots.txt to block the major AI crawlers (OpenAI's GPTBot, Common Crawl's CCBot, Anthropic's crawlers, and Google-Extended):
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /
It may also be relevant to add an ai.txt by default for a broader disallow (is this a real, growing standard, or only a proposal?).
See also: https://site.spawning.ai/spawning-ai-txt
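For anyone who wants to sanity-check rules like the ones proposed above, Python's standard-library `urllib.robotparser` can evaluate a robots.txt against a given user agent. A minimal sketch (the example URL is hypothetical, and only a subset of the proposed rules is shown):

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules proposed in this issue.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The listed AI crawlers are blocked from every path...
print(rp.can_fetch("GPTBot", "https://example.com/node/1"))     # False
# ...while crawlers not named in the file remain unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/node/1"))  # True
```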
Comments
Comment #2
kevinquillen commented:
Moving to the core queue, since the RobotsTxt module uses the default robots.txt file to create its initial configuration.
Comment #3
kevinquillen commented:
Comment #4
cilefen commented:
This would be the first crawler blocked by default in robots.txt.
Comment #5
kevinquillen commented:
Here?
https://git.drupalcode.org/project/drupal/-/blob/11.x/robots.txt?ref_typ...
Because as I understand it you'd need to declare it again to prevent it from scraping content paths, where the default is currently for assets and admin paths:
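The point in comment #5 is correct: under the robots exclusion protocol (RFC 9309), a crawler obeys only the group that matches its user agent, so a GPTBot-specific group fully replaces the generic `User-agent: *` rules for GPTBot. A sketch with Python's `urllib.robotparser` (file contents and URLs are illustrative, not Drupal's actual defaults):

```python
from urllib.robotparser import RobotFileParser

# A generic group (like Drupal's admin-path defaults) plus a
# crawler-specific group, as discussed in comment #5.
robots = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots)

# GPTBot matches its own group, so it is blocked everywhere,
# not just under /admin/.
print(rp.can_fetch("GPTBot", "https://example.com/node/1"))    # False

# Other crawlers fall back to the generic group.
print(rp.can_fetch("OtherBot", "https://example.com/node/1"))  # True
print(rp.can_fetch("OtherBot", "https://example.com/admin/"))  # False
```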
Comment #6
cilefen commented:
I don't understand what #5 is communicating (to me?). If it is to me, then maybe my comment wasn't clear.
Comment #7
kevinquillen commented:
I thought you were saying that GPTBot is already blocked from scraping a Drupal site by default; however, I am not seeing that.
Comment #8
cilefen commented:
That explains the confusion. That's not what I was saying.
This issue would introduce, for the first time, blocking a specific crawler by default. By that I mean, Drupal AFAIK has never done that, and it comes with some downsides. Notably, there are many crawlers; doing this could open the door to requests to add other hated crawlers.
So all I'm saying is that there is another decision implicit in this issue: whether to block specific crawlers.
Comment #9
mindaugasd commented:
@kevinquillen why? What does this achieve (long term)?
Is there some article that goes in depth into this? (measuring both sides of the argument)
This would be significant decision and needs solid reasons.
Comment #10
kevinquillen commented:
https://searchengineland.com/more-popular-websites-blocking-gptbot-432531
https://www.kevin-indig.com/most-sites-will-block-chat-gpt/
https://searchengineland.com/websites-blocking-gptbot-431183
It's not a knock against AI. I am simply saying that a CMS should likely ship with this by default to protect users from their content entering models, either prematurely or at all. This all largely entered the public consciousness in the last 12 months, so it's brand new to most people. But I can see sites finding their content parroted back by ChatGPT models without consent, and IMO it would look bad on Drupal to assume users would just know to add that to robots.txt post-install. This is different from search engine crawlers, which have been around a long time. I think it should be something users decide ("okay, I am ready to allow GPTBot now, remove it") or not at all, versus waking up one day to find your content assimilated.
I can't see far enough down the road yet, but it seems that once something is in an LLM, it's permanent. Depending on the type of site, you may not want this at all, but it may not even occur to people until it's too late to do anything about it. If you get indexed by Google by accident, you can work through a process with them to delist it. I don't think the same is true for AI, not from what I have seen.
NPR, for example: https://www.npr.org/robots.txt
Comment #11
larowlan commented:
Is there a wider list we should be considering, rather than just one AI-based crawler?
I agree with @cilefen we shouldn't just pick one.
Comment #12
kevinquillen commented:
Reuters has even more... is there a standard somewhere anyone knows of?
https://searchengineland.com/robots-txt-new-meta-tag-llm-ai-429510
Comment #13
kevinquillen commented:
Comment #14
mindaugasd commented:
I read the articles.
Summary:
Conclusion:
I seem to agree it can be blocked by default, because:
Counter-arguments?
Comment #15
kevinquillen commented:
Comment #16
kevinquillen commented:
Comment #17
mindaugasd commented:
Comment #18
scott_euser commented:
At my agency (Soapbox) we work with think tanks and other organisations that generally work to positively influence policy and decision making. We have consulted with a number of them on their thoughts since the robots.txt blocking was added as an option. The general feeling we gather is that they want their information used to better inform the end user, rather than blocking the information and having AI bots produce a less accurate result. There are concerns about citations, credits, and the prioritisation of research-driven and informed answers, of course, but the general idea is that they prefer to allow AI bots.
There is a danger that this goes unnoticed if it is not front and centre in a site builder's installation steps, resulting in an eventual large number of Drupal websites no longer contributing to the quality of the responses that AI bots give. I'm not sure if there are any statistics, but I would expect, e.g., a higher percentage of Drupal sites to come from organisations that influence policy and decision making, compared to possibly lower-quality content from WordPress sites that may be more prone to the opinions of individuals, so there is a possible danger of reducing AI bot response quality.
I'm not against having to opt in to tracking for our clients, of course, but I suppose this needs a general Drupal policy consideration (if we do not already have one).
Comment #19
cilefen commented:
My organization is similarly minded. All of its published work is for everyone. I imagine scholars here would consider bad or missing citations an oversight on the user's end rather than a reason to block anyone. But that is only what I think. I will ask around.
Comment #20
mindbet commented:
To answer @cilefen: the proposal above seems analogous to Drupal's blocking of Google FLoC:
https://www.drupal.org/project/drupal/issues/3209628
FLoC has now gone away and so have the blocking headers.
https://www.drupal.org/project/drupal/issues/3260401
That said, my preference is that blocking AI bots shouldn't be in core; it should be in contrib for those who want to opt out.
Perhaps I am still on a sugar high from drinking the AI Kool-Aid, but I think the benefits of LLMs will greatly outweigh the risks. AI bots reading content should be considered transformative use. Imagine multiple superintelligences reading the medical literature, coming up with new treatments and models -- but wait -- a content-troll company like Elsevier prevents this with exorbitant fees or a complete blockade.
For your consideration, an essay from Benedict Evans:
Generative AI and intellectual property
https://www.ben-evans.com/benedictevans/2023/8/27/generative-ai-ad-intel...
Comment #21
cilefen commented:
I've confirmed in a technical staff committee meeting that blocking this crawler from accessing public information would be perceived as an anti-pattern by my organization. Unattributed citations are the LLM user's responsibility.
Comment #22
frazras commented:
I see this as an opinionated and somewhat intrusive one-sided stance on artificial intelligence as a whole.
The majority of public-facing websites exist to share content, not for profit. Most of the people with issues about their content ending up in a model are the minority: powerful, profit-seeking content creators who will suffer from the progress of technology, as they already have from the internet killing paper.
Although not apples to apples, this is like blocking the internet archive bot from accessing information about your site because you might post content you want to be removed in the future.
Besides, as weak barriers like this are set up, a market for subversion will emerge, with people using lesser-known models to bypass the blacklist. With models like the open-source LLaMA, and many others with close to the power of GPT-4, these non-commercial options will be just as powerful as the popular ones, and hacks will be used to employ them. You will end up chasing a constantly growing list of bots.
I say still put the code in, but commented out, for those who care about their content being indexed by AI; they should have the resources to uncomment these lines if they really need to.
Comment #23
cilefen commented:
An article about this topic: https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-s...
Comment #24
dan_metille commented:
Unless explicitly requested by a customer, I am generally comfortable with feeding data to AI. However, what really frustrates me is that the Anthropic/Claude bots in particular are not adhering to robots.txt instructions. Furthermore, these bots send so many redundant and repetitive requests that I had to spend nearly an entire day blocking what can almost be considered a DDoS attack.
I will refrain from delving into the ethical and trust issues in AI that Anthropic claims to uphold, in order to keep this thread on a positive note.
Comment #25
scott_euser commented:
Yeah, we've had a couple where we've had to temporarily block ClaudeBot as well (then remove the block), as it was leading to degraded performance for visitors; agreed, it sometimes feels DoS-attack-like. Beyond ClaudeBot, the rest seem to be better throttled at source, which seems to reflect the vibe on reddit/X on the topic.
Comment #26
kevinquillen commented:
We recently had to block on a site because the AI site crawler was effectively taking the site down.
Comment #27
mindaugasd commented:
Similar issue posted: #3457500: [policy, no patch] Strengthening Drupal: Protecting Content from AI Scrapers and Bots
Comment #28
alvarodemendoza commented:
After seeing millions of visits from these bots, especially from ClaudeBot, I think the following list in robots.txt will make the solution more robust.
Comment #29
cilefen commented:
I believe Drupal Core should be neutral on this matter. Blocking specific bots should be a contributed-module feature.
Comment #30
kentr commented:
Adding to what @scott_euser said, the topic of model collapse has been in the news recently. From Wikipedia: "Model collapse is a phenomenon where machine learning models gradually degrade due to errors coming from uncurated training on the outputs of another model, including prior versions of itself."
I'd hypothesize that the majority of Drupal sites are providers of fresh, quality, human-generated content in many fields. Society appears to be hell-bent on using generative AI as a general tool for real work. So I echo @scott_euser's comment about the danger of adversely affecting AI output.
Comment #31
karolus commented:
Any updates on this?
I understand that this is a controversial issue, but site operators should have the option of changing what robots.txt allows. On some sites I manage, what I'm doing now is copying a customized robots.txt back in every time I update core or contrib. It's just another item in the checklist, but it would be great if there were an easier way to manage this.
Regarding content governance, as posted upthread, some site operators are fine with anyone scraping their content; others certainly are not.
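One way to avoid re-copying a customized robots.txt on every update, assuming the site is managed with Composer and the drupal/core-composer-scaffold plugin: tell the scaffold plugin to skip the file, so core updates no longer overwrite it. A sketch of the relevant composer.json fragment (the exact layout depends on your project):

```json
{
    "extra": {
        "drupal-scaffold": {
            "file-mapping": {
                "[web-root]/robots.txt": false
            }
        }
    }
}
```

With this in place, the customized robots.txt committed to the project is left untouched when core is updated.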
Comment #32
cilefen commented:
Thanks for that comment. I just don't think this feature request should be taken on by core. The existence of https://www.drupal.org/project/robotstxt with 40,000 installs, and of https://www.drupal.org/project/aitxt (10 installs), means the flexibility to adjust AI content consumption is already well supported within the Drupal ecosystem. IMO we don't want to be chasing specific bots in this core file, because doing so by default is controversial, and because of the commit noise it would incur. The contrib modules are a better avenue, even more so if those modules featured a way to append the latest bots list.