Block bad bots and crawlers

Last updated on
9 April 2026

You never know when a horde of bots may hit your site. This page shows several methods for monitoring your server and preventing it from being overloaded by aggressive bots and crawlers.

For tips and strategies, see Building a Firewall in your Drupal Application by @bburg and AI Bot Abuse: Our Evolving Strategy for Taming Performance Nightmares on Drupal Faceted Search Pages by @capellic.

These great modules can block bots using different approaches (ordered by number of installs):

For Facets 3, turning links into checkboxes might help. Facets 2 still uses links, so Tarpit might work there by catching bots that follow hidden links; see #2833166: Coding Standards for modern Drupal.

Block bots by name in .htaccess

Bots rotate IP ranges, so IP-based blocking is brittle. Drupal node pages are generally well cached, so blocking entire sites is often unnecessary. Instead, target high-server-cost, low-bot-value paths and block known AI crawlers by User-Agent in .htaccess, via a patch.

You may immediately see a big overall drop in server resource usage. The bots can still appear in your log files, but they will receive an "HTTP 403 Forbidden" client error response status code, indicating that the server understood the request but refused to process it. Most importantly, Drupal will not process the request.

# Block expensive listing routes for known AI crawlers at origin.
RewriteCond %{REQUEST_URI} ^/(search|archive|featured-content|past-events|glossary)(/|$) [NC,OR]
RewriteCond %{REQUEST_URI} ^/taxonomy/term/[^/]+(/feed)?/?$ [NC]
RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Claude-SearchBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OAI-SearchBot [NC]
RewriteRule .* - [F,L]

Check if the rules work

Look for recent 403 errors, signifying that the bot was blocked:

$ tail -n 100 /var/log/apache2/access.log | grep "HTTP/1.1\" 403"
216.73.216.223 - - [06/Jun/2025:13:20:17 +0200] "GET /my-path HTTP/1.1" 403 456 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
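
To see which crawlers are actually being refused, the same log can be grouped by User-Agent. The following is a minimal sketch, assuming Apache's combined log format as in the line above; the function name `blocked_by_agent` is hypothetical, and requests whose URL contains a double quote would be miscounted.

```shell
#!/bin/bash
# Count 403 responses per User-Agent in a combined-format access log.
# Hypothetical helper; usage: blocked_by_agent /var/log/apache2/access.log
blocked_by_agent() {
  # Splitting each line on double quotes gives:
  #   $2 = request line, $3 = " status bytes ", $6 = User-Agent.
  awk -F'"' '
    { split($3, part, " ") }
    part[1] == "403" { count[$6]++ }
    END { for (agent in count) print count[agent], agent }
  ' "$1" | sort -nr
}
```

To trigger a test request yourself, something like `curl -I -A "GPTBot" https://example.com/search` (with your own host name in place of example.com) should return a 403 once the rules are active.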

Block bots by IP

Increasingly, bots use a wide range of IP addresses, so this method can become very labour-intensive and is probably ineffective in the long run.

If you still want to block by IP, you can patch the .htaccess file with something like this:

 <IfModule mod_rewrite.c>
   RewriteEngine on
 
+  # Block aggressive crawlers
+  # GPTBot 20 (1), bingbot 40 52 (2), ClaudeBot 216 (1)
+  # SerpstatBot 148 (1), Facebook 57 (1)
+  RewriteCond %{REMOTE_ADDR} ^20\.171\.207\. [OR]
+  RewriteCond %{REMOTE_ADDR} ^40\.77\.167\. [OR]
+  RewriteCond %{REMOTE_ADDR} ^52\.167\.144\. [OR]
+  RewriteCond %{REMOTE_ADDR} ^216\.73\.216\. [OR]
+  RewriteCond %{REMOTE_ADDR} ^148\.251\.241\.12$ [OR]
+  RewriteCond %{REMOTE_ADDR} ^57\.141\.0\.
+  RewriteRule ^ - [F]

Ensure empty search by default

Faceted search endpoints that return results without a query attract bots. Configure /search to return no results until a keyword or filter has been submitted through the search form. Most AI crawlers won’t execute searches or submit forms. Note: This is ineffective if your site exposes prebuilt links to populated search results.

use Drupal\search_api\Plugin\views\query\SearchApiQuery;
use Drupal\views\ViewExecutable;

/**
 * Implements hook_views_query_alter().
 */
function mymodule_views_query_alter(ViewExecutable $view, $query): void {
  if (!$query instanceof SearchApiQuery || $view->id() !== 'search_content' || $view->current_display !== 'page_1') {
    return;
  }

  if (mymodule_request_has_criteria()) {
    return;
  }

  // Skip execution entirely when the search page is loaded without user input.
  $query->abort();
}

/**
 * Determines whether the current request contains active search criteria.
 */
function mymodule_request_has_criteria(): bool {
  $request_query = \Drupal::request()->query;
  $keys = trim((string) $request_query->get('keys', ''));
  if ($keys !== '') {
    return TRUE;
  }

  $filters = $request_query->all()['f'] ?? [];
  if (!is_array($filters)) {
    return FALSE;
  }

  foreach ($filters as $filter) {
    if (is_scalar($filter) && trim((string) $filter) !== '') {
      return TRUE;
    }
  }

  return FALSE;
}

Allow one value per facet

In some cases, granting bots access to faceted search makes sense. To prevent them from generating exponential URL permutations:

1. Configure each facet to allow only one selection. This may affect search UX (the facet checkboxes then behave more like radio buttons), but it is a small configuration change with a large impact, and it is built into Facets v2, so no upgrade to v3 or facet redesign is required.

2. Optionally, enforce the one-selection rule by creating a facet validator that sends a fast 404 if more than one value for the same facet is found in a URL. This requires custom code in an EventSubscriber or similar hook-in.

# OK: pass through when the URL contains a single facet
?keys=keyword&f[]=facet1:foo

# OK: pass through because there is one facet1 value and one facet2 value
?keys=keyword&f[]=facet1:foo&f[]=facet2:bar

# 404: only one facet1 value is allowed
?keys=keyword&f[]=facet1:foo&f[]=facet1:baz
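
The rule in these examples can be tried out from the shell, for instance to audit logged URLs before writing the Drupal validator. This is only a sketch of the check itself, not the EventSubscriber: the function name `facets_ok` is hypothetical, and it assumes facets arrive as `f[]=alias:value` or `f[N]=alias:value`, optionally with percent-encoded brackets and colons.

```shell
#!/bin/bash
# Hypothetical helper: exits 0 when every facet alias occurs at most once
# in the given query string, and 1 when any alias is repeated.
facets_ok() {
  local query=${1#\?}   # drop a leading "?" if present
  # One parameter per line; decode %5B, %5D and %3A so all forms look alike.
  printf '%s\n' "$query" | tr '&' '\n' \
    | sed -e 's/%5[Bb]/[/g' -e 's/%5[Dd]/]/g' -e 's/%3[Aa]/:/g' \
    | awk -F'[=:]' '
        # For "f[]=facet1:foo", $2 is the facet alias ("facet1").
        /^f\[[0-9]*\]=/ { if (++seen[$2] > 1) dup = 1 }
        END { exit (dup ? 1 : 0) }'
}
```

For example, `facets_ok '?keys=keyword&f[]=facet1:foo&f[]=facet1:baz'` fails, which corresponds to the 404 case above.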

Store data in Solr or Redis to help MySQL

A lot of server load can be avoided by storing data in Solr and serving it via Ajax, so that MySQL is not touched. The method is described in Create a search view that doesn’t load entities from the database. Caching in Redis can also help.

Monitor your server

There are many monitoring services and self-hosted monitoring tools, such as Beszel, which is easy to set up and can send email alerts when RAM, CPU, or network thresholds are exceeded.

A simpler method is a short script on your server that tracks the server load and sends an email if a threshold is exceeded.

Save this as ~/.loadMon.sh on your server, adjusting the trigger (according to the number of CPUs), the email address, etc.:

#!/bin/bash

# From https://www.inmotionhosting.com/support/server/server-usage/create-server-load-monitoring-bash-script/
# We set a trigger for how high the load can get before we're
# alerted via e-mail from this script.
trigger=4.00

# We set a load variable to the average server load read from the
# second column of /proc/loadavg, which is the average load over
# the last five minutes.
# https://www.scoutapm.com/blog/understanding-load-averages
load=$(awk '{print $2}' /proc/loadavg)

# We set a response variable to the word "greater" if the current
# load is greater than the trigger that we set.
response=$(awk -v T="$trigger" -v L="$load" 'BEGIN { if (L > T) print "greater" }')

# If the response is set to "greater", we pipe the current contents
# of /proc/loadavg to the mail command, which sends an e-mail with
# the server's load averages to myemail@example.org.
if [[ $response = "greater" ]]; then
  echo "Load is currently $(cat /proc/loadavg)" | mail -s "High load on server - [ $load ]" myemail@example.org
fi

Adapted from https://www.inmotionhosting.com/support/server/server-usage/create-serve...
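
Instead of hard-coding the trigger, it can be derived from the number of CPUs. The sketch below assumes the common rule of thumb that a load of about 1.0 per core means the machine is fully busy; the function name `load_too_high` and the per-core factor are illustrative assumptions, not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: exits 0 when the load exceeds cores x per_core.
# Arguments: load average, number of cores, optional per-core factor (default 1.0).
load_too_high() {
  local load=$1 cores=$2 per_core=${3:-1.0}
  # awk handles the floating-point comparison; exit status 0 means "too high".
  awk -v L="$load" -v C="$cores" -v P="$per_core" 'BEGIN { exit !(L > C * P) }'
}

# Possible wiring on Linux, assuming coreutils' nproc is available:
#   if load_too_high "$(awk '{print $2}' /proc/loadavg)" "$(nproc)"; then
#     ... send the alert e-mail ...
#   fi
```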

Add it to your crontab to run every five minutes or so:

# email alert if average load is too high
*/5 * * * * bash /home/myuser/.loadMon.sh > /dev/null 2>&1

Check which IPs made the most requests recently:

$ tail -n 500 /var/log/apache2/access.log | cut -d " " -f1 | sort | uniq -c | sort -nr | head -n 20
    357 216.73.216.223
     89 21.171.27.68
     26 112.170.0.2
      5 192.178.4.105
      4 102.178.4.105
      3 148.251.21.12
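
Because bots rotate addresses within a range, it can help to group the same data by the first three octets, which is also the form used in the RewriteCond patterns earlier on this page. This is a minimal sketch; the function name `top_prefixes` is hypothetical, and IPv6 clients are simply skipped.

```shell
#!/bin/bash
# Count requests per IPv4 /24 prefix (first three octets) in an access log.
# Hypothetical helper; usage: top_prefixes /var/log/apache2/access.log [limit]
top_prefixes() {
  local log=$1 limit=${2:-20}
  # Take the client address (first column), keep only dotted IPv4
  # addresses, and reduce each one to its first three octets.
  awk '{ print $1 }' "$log" \
    | awk -F. 'NF == 4 { print $1 "." $2 "." $3 }' \
    | sort | uniq -c | sort -nr | head -n "$limit"
}
```

To look only at recent traffic, write the output of `tail -n 500` to a temporary file and pass that file to the function.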

Automated bot blocking via firewall

A somewhat more advanced solution is to block aggressive bots automatically by extracting their IP addresses from the log files and adding them to a firewall rule, for example with iptables, a firewall that allows you to define rulesets.

See Auto-ban website spammers via the Apache access_log for details, and the code at check-apache-access-log-spammers.sh.
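
The general idea can be sketched as follows. This is not the linked script, only an illustration under stated assumptions: the function name `suggest_blocks` and the default threshold of 100 requests are hypothetical, and the rules are printed rather than executed so that they can be reviewed first.

```shell
#!/bin/bash
# Print (do not run) an iptables DROP rule for every client IP that made
# more than a given number of requests in an access log.
# Hypothetical helper; usage: suggest_blocks /var/log/apache2/access.log [threshold]
suggest_blocks() {
  local log=$1 threshold=${2:-100}
  # Count requests per client address (first column of each log line).
  awk -v max="$threshold" '
    { hits[$1]++ }
    END { for (ip in hits) if (hits[ip] > max) print ip }
  ' "$log" | sort | while read -r ip; do
    printf 'iptables -I INPUT -s %s -j DROP\n' "$ip"
  done
}
```

Review the output before applying it as root, so that you do not block your own address, a monitoring service, or a search engine you want to keep.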
