Problem/Motivation
We just encountered a crawler not identifying itself as a bot (pretending to be regular human traffic) and rapidly crawling our site from hundreds of unique IPs and user agent strings. Because of the distributed nature of this crawl, the bot was able to bypass both our bot and regular-traffic request limits: no single "visitor" exceeded our established limits, but combined the crawler was far over them.
Further analysis revealed that all traffic from this crawler was coming in under a single autonomous system number (ASN), identifying the network of the cloud-computing platform the crawler was running from. To keep distributed crawlers like this from completely bypassing all limits (and slowing down our site), I would like the optional ability to rate-limit regular traffic at the ASN level. Such a limit would clearly need to account for the shared nature of any given ASN, but I feel a sane value here could really help.
Steps to reproduce
N/a.
Proposed resolution
Open up the ability to rate-limit regular traffic at the ASN level, e.g.:
$settings['crawler_rate_limit.settings']['regular_traffic'] = [
  'interval' => 600,
  'requests' => 300,
  'asn_interval' => 600,
  'asn_requests' => 800,
];
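With the example values above, a crawler spread across (say) 100 IPs within a single ASN would be capped at 800 requests per 10 minutes for the whole network, rather than the 100 × 300 = 30,000 requests the per-visitor limit alone would allow.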
When enabled, use a tool like the GeoIP2 PHP API to obtain the ASN for requesting regular-traffic IPs and enforce the ASN-level rate limits. Cache the IP->ASN mapping so the lookup is only necessary on the first request from a given IP.
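A rough sketch of how the lookup and caching could work, assuming the geoip2/geoip2 Reader, a local GeoLite2-ASN.mmdb file, and a Drupal cache bin (the function name, cache ID format, and database path are illustrative, not the module's actual API):

use GeoIp2\Database\Reader;
use GeoIp2\Exception\AddressNotFoundException;

/**
 * Looks up the ASN for an IP, caching the result so only the first
 * request from a given IP pays for the database read.
 */
function crawler_rate_limit_lookup_asn(string $ip, \Drupal\Core\Cache\CacheBackendInterface $cache): ?int {
  $cid = 'crawler_rate_limit:asn:' . $ip;
  if ($cached = $cache->get($cid)) {
    return $cached->data;
  }
  try {
    $reader = new Reader('/path/to/GeoLite2-ASN.mmdb');
    $asn = $reader->asn($ip)->autonomousSystemNumber;
  }
  catch (AddressNotFoundException $e) {
    // Private or unknown addresses have no ASN; skip the ASN-level limit.
    $asn = NULL;
  }
  $cache->set($cid, $asn, time() + 86400);
  return $asn;
}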
Remaining tasks
Discuss, patch, review.
User interface changes
N/a.
API changes
N/a.
Data model changes
N/a.
Comments
Comment #3
chrisolof
Comment #5
chrisolof commented
Notes from testing this MR against real traffic for about two weeks now:
This is currently performing very well (as designed) against real traffic. The addition of the ASN lookup on those requests where it is actually necessary (requester not openly identifying as a bot and not already blocked at the visitor level) is so fast, even without the optional C extension, that it is imperceptible to our end users. On the other hand, rate-limiting all crawlers, including those that spread out horizontally across multiple IPs and/or user agent strings, is a perceptible performance boost for our end users.
Comment #6
darrell_ulm commented
This looks good, although I'm getting this message now on a development environment for a site:
Missing dependencies: In order to rate-limit regular traffic at the ASN-level, you need to install the GeoIP2 PHP API.
I did download a test ASN database, GeoLite2-ASN-Test.mmdb.
Full message is:
CRAWLER RATE LIMIT Enabled
Comment #7
vaish commented
Darrell, there are two steps you need to complete before you can use ASN-level rate limiting. You have already downloaded the ASN database. What's left is to install the PHP package geoip2/geoip2; that's what the error message you got is about. You can install this package via Composer, as usual:
composer require geoip2/geoip2
Please let me know if you run into any other issues with this feature. I'm about to merge this MR, but feel free to open a follow-up issue if you find any bugs.
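A quick way to confirm both pieces are in place is to read a known IP back from the database (the path and sample IP below are only placeholders):

use GeoIp2\Database\Reader;

// Throws if the geoip2/geoip2 package is missing or the database is unreadable.
$reader = new Reader('/path/to/GeoLite2-ASN.mmdb');
$record = $reader->asn('1.2.3.4'); // Replace with any real public IP.
print $record->autonomousSystemNumber . ' ' . $record->autonomousSystemOrganization . PHP_EOL;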
Comment #9
vaish commented
Thanks @chrisolof. Everything works great. I just made a few minor tweaks.
Comment #10
darrell_ulm commented
That makes sense, thank you. I'll give it another try from the dev branch.