Problem/Motivation

We recently encountered a crawler that did not identify itself as a bot (it posed as human traffic) and rapidly crawled our site from hundreds of unique IPs and user agent strings. Because the crawl was distributed, it bypassed both our bot and regular-traffic request limits: no single "visitor" exceeded our established limits, but the crawler's combined request volume was far over them.

Further analysis revealed that all traffic from this crawler was coming in under a single autonomous system number (ASN), identifying the network of the cloud-computing platform the crawler was running from. To keep distributed crawlers like this from completely bypassing all limits (and slowing down our site), I would like the optional ability to rate-limit regular traffic at the ASN level. Clearly a limit at this level would need to anticipate the shared nature of any given ASN, but I feel a sane limit at this level could really help.

Steps to reproduce

N/a.

Proposed resolution

Open up the ability to rate-limit regular traffic at the ASN level, e.g.:

$settings['crawler_rate_limit.settings']['regular_traffic'] = [
  'interval' => 600,
  'requests' => 300,
  'asn_interval' => 600,
  'asn_requests' => 800,
];

When enabled, use a tool like GeoIP2-php to obtain the ASN for each regular-traffic IP making a request, and enforce ASN-level rate limits. Cache the IP->ASN mapping so the lookup is only necessary on the first request from a given IP.
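The idea above can be sketched roughly as follows. This is only an illustration, not the module's implementation: the class name `AsnRateLimiter`, its methods, and the injected lookup callable are all hypothetical, plain PHP arrays stand in for whatever shared cache backend would really be used, and a simple fixed window is assumed. In practice the lookup callable would wrap a GeoIP2 ASN database reader.

```php
<?php

/**
 * Hypothetical sketch: fixed-window rate limiting keyed by ASN.
 * None of these names come from the Crawler Rate Limit module.
 */
final class AsnRateLimiter
{
    /** @var array<string, int|null> Cached IP => ASN (null = not found). */
    private array $asnByIp = [];

    /** @var array<int, array{start: int, count: int}> Window state per ASN. */
    private array $windows = [];

    /** @var callable Takes an IP string, returns an ASN (int) or null. */
    private $lookup;

    private int $interval;
    private int $requests;

    public function __construct(callable $lookup, int $interval, int $requests)
    {
        $this->lookup = $lookup;
        $this->interval = $interval;
        $this->requests = $requests;
    }

    /** Returns true if a request from $ip at time $now is within the ASN limit. */
    public function allow(string $ip, int $now): bool
    {
        // Look the ASN up only on the first request seen from this IP.
        if (!array_key_exists($ip, $this->asnByIp)) {
            $this->asnByIp[$ip] = ($this->lookup)($ip);
        }
        $asn = $this->asnByIp[$ip];
        if ($asn === null) {
            // Unknown network: leave it to the visitor-level limit.
            return true;
        }

        // Start a fresh window if none exists or the current one has expired.
        $window = $this->windows[$asn] ?? ['start' => $now, 'count' => 0];
        if ($now - $window['start'] >= $this->interval) {
            $window = ['start' => $now, 'count' => 0];
        }

        $window['count']++;
        $this->windows[$asn] = $window;

        return $window['count'] <= $this->requests;
    }
}

// Usage with a stand-in lookup table (documentation IPs, private-use ASNs)
// and the example limits from the settings above.
$exampleTable = ['203.0.113.10' => 64500, '203.0.113.11' => 64500];
$exampleLimiter = new AsnRateLimiter(
    function (string $ip) use ($exampleTable): ?int {
        return $exampleTable[$ip] ?? null;
    },
    600, // asn_interval
    800  // asn_requests
);
$allowed = $exampleLimiter->allow('203.0.113.10', time());
```

Because both example IPs map to the same ASN, their requests count against one shared budget, which is what catches a crawl spread across many addresses.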

Remaining tasks

Discuss, patch, review.

User interface changes

N/a.

API changes

N/a.

Data model changes

N/a.

Comments

chrisolof created an issue.

chrisolof’s picture

Status: Active » Needs review

vaish made their first commit to this issue’s fork.

chrisolof’s picture

Notes from testing this MR against real traffic for about two weeks now:

This is currently performing very well / as designed against real traffic. The addition of the ASN lookup on those requests where it is actually necessary (requester not openly identifying as bot and not already blocked at the visitor-level) is so fast, even without the optional C extension, that it is imperceptible to our end-users. On the other hand, rate-limiting all crawlers, including those that horizontally spread out across multiple IPs and/or user agent strings, is a perceptible performance boost for our end-users.

darrell_ulm’s picture

This looks good, although I'm now getting this message in a development environment for a site:
Missing dependencies: In order to rate-limit regular traffic at the ASN-level, you need to install the GeoIP2 PHP API.
I did download a test ASN database, GeoLite2-ASN-Test.mmdb.

Full message is:

CRAWLER RATE LIMIT Enabled

  • Configured to use memcached backend.
  • Rate limiting bot/crawler requests at the bot/crawler-level. 100 requests allowed per bot/crawler over a 600-second interval.
  • Rate limiting regular traffic requests at the visitor-level. 200 requests allowed per visitor over a 1400-second interval.
  • Rate limiting regular traffic requests at the ASN-level. 600 requests allowed per ASN over a 600-second interval.
  • Issue(s) detected that prevent rate limiter from functioning. In order to prevent fatal errors rate limiting has been disabled. You must fix all the errors or disable the Crawler Rate Limit.
  • Missing dependencies: In order to rate-limit regular traffic at the ASN-level, you need to install the GeoIP2 PHP API.

vaish’s picture

Darrell, there are two steps you need to complete before you can use ASN-level rate limiting. You have already downloaded the ASN database. What's left is to install the PHP package geoip2/geoip2; that's what the error message you got is about. You can install this package via Composer, as usual.

composer require geoip2/geoip2

Please let me know if you run into any other issues with this feature. I'm about to merge this MR, but feel free to open a follow-up issue if you find any bugs.

vaish’s picture

Status: Needs review » Fixed

Thanks @chrisolof. Everything works great. I just made a few minor tweaks.

darrell_ulm’s picture

That makes sense, thank you. I'll give it another try from the dev branch.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.