This case study was written as a collaboration between Drupal Association staff and Technology Supporting Partner Distil Networks.
Drupal.org is the home of one of the largest open source communities in the world. We've been online for more than 13 years and collectively we build the Drupal software, provide support, write documentation, share networking opportunities, and more. The open source spirit pushes the Drupal project forward, and new members are always welcome. It falls to us to maintain our community home and preserve the welcoming atmosphere that leads people to say, "Come for the code, stay for the community."
As stewards of Drupal.org, it's our responsibility to give the community a voice and welcome everyone who wants to participate in the project. At the same time, there are bad actors who would take advantage of our open community and platform for abusive purposes.
Drupal.org's long-standing presence on the web has given it authority in the eyes of search engines. The site hosts millions of pages of content, all generated by our users. This combination of authority and open access for users to create content makes us a high-value target for phishers and spammers.
Spam is a nuisance to our existing community, devalues our project to the newcomers we are hoping to welcome, and left unchecked could degrade our search presence.
Spammers create bogus accounts to post their junk content
Only registered members can post content to the Drupal.org website, so there's a continuous onslaught of actors attempting to create accounts for the purpose of inserting link spam and other bad content onto the site. In the past, we've implemented a variety of strategies such as content analysis, behavioral analysis, social moderation, and rate limiting. And while these measures have been effective at reducing some of the spam we've seen, the onslaught continues.
The reason for that? Much of our attempted spam is not coming from bots. These are real people using tools to cloak their identity and manually creating accounts en masse. In many cases they may not even post junk content immediately. They will often sit on "sleeper" accounts waiting to be paid by somebody to promote malicious content.
It's too time consuming to manually remove spam content
Spam fighting is also a thankless task. All time spent fighting spam, whether by members of the engineering staff or our incredibly dedicated community volunteers, is time not spent on the project. Spam fighting has an opportunity cost that creates burn-out among staff and volunteers, and is not something we can afford to leave to manual moderation.
This is especially true for our community volunteers, who want to spend their time helping people with Drupal technical questions, not deleting spam.
Fake accounts and spam pollute the community engagement metrics
There are 1.9 million user accounts in the Drupal.org database, but using this data to measure community engagement is challenging because of the number of spammer accounts that have been registered over the years. When we have to work around so many illegitimate accounts, it's difficult to determine community health metrics, such as whether our legitimate user growth is increasing or decreasing. We need cleaner user account data to give us more reliable community metrics and help us make informed decisions.
Before reaching out to Distil Networks, Drupal.org relied primarily on two modules to help us fight spam. Mollom is a Drupal stand-by: a content analysis tool that looks at what users are posting and compares it against known bad actor patterns. This content analysis helps us identify and block new waves of spam patterns, but it doesn't prevent these waves from being posted in the first place.
The second module we use is Honeypot, which uses a combination of honeypot and rate limiting methods to prevent bot spam. Honeypot does a good job of preventing mass spam attacks by bots, but when real people are creating the underlying accounts, Honeypot can't help us.
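To illustrate the general idea behind a honeypot check combined with a timing limit, here is a minimal Python sketch. It is not the Honeypot module's actual code (the module is written for Drupal); the field name `homepage_url`, the five-second threshold, and the `Submission` type are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed threshold: a person rarely completes a registration form this fast.
MIN_SECONDS_TO_FILL = 5.0


@dataclass
class Submission:
    fields: dict          # submitted form values, keyed by field name
    rendered_at: float    # time the form was served (seconds)
    submitted_at: float   # time the form came back (seconds)


def looks_like_bot(sub: Submission, honeypot_field: str = "homepage_url") -> bool:
    """Return True if the submission trips the honeypot or the timing check."""
    # The honeypot field is hidden from people, so any value suggests a script.
    if sub.fields.get(honeypot_field, "").strip():
        return True
    # Forms completed faster than the threshold are treated as automated.
    return (sub.submitted_at - sub.rendered_at) < MIN_SECONDS_TO_FILL
```

A person working through the form by hand leaves the hidden field empty and takes longer than the threshold, so a check like this passes them. That is why it does little against manually created accounts.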
As we researched ways to prevent spam, we discovered that all of the bad actors we wanted to keep out had one thing in common: they were hiding their identities behind proxies. This prevents us from simply blacklisting certain IP addresses or ranges. So instead, we began researching ways to unmask the users behind these proxies and block them before they can even create an account.
Our research led us to Distil Networks. We now run the Drupal.org registration pages (and only the registration pages) through the Distil Cloud CDN. Distil's service gathers device fingerprints for the users trying to create the accounts, and we're able to leverage those fingerprints to block users who would otherwise generate dozens or hundreds of accounts by rotating through proxies. To preserve the privacy of our legitimate users, this fingerprinting process is limited to a hashed, unique identifier and only affects our registration process.
What the Distil data shows
After enabling Distil's service for our registration process we were able to capture fingerprints for about 20,000 account registrations over the course of nine months. We were immediately able to identify more than 10% of those account registrations as duplicate registrations by the same user, hiding behind a proxy. As we dug into the data further, we realized that thousands of the spam accounts that spammers are attempting to register are actually created by only 200-300 real individuals.
By blocking these 200-300 individuals by their Distil fingerprint, we can block thousands of account registrations, and tens of thousands of spam posts that would have been created had these 'sleeper accounts' been activated.
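The grouping described above can be sketched in a few lines of Python. This is an illustration only, not Distil's API or our production tooling: the `(username, fingerprint)` record format, the function names, the threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def group_by_fingerprint(registrations):
    """Map each hashed fingerprint to the usernames registered with it."""
    groups = defaultdict(list)
    for username, fingerprint in registrations:
        groups[fingerprint].append(username)
    return dict(groups)


def repeat_fingerprints(groups, threshold=3):
    """Keep fingerprints tied to at least `threshold` accounts (assumed cutoff)."""
    return {fp: users for fp, users in groups.items() if len(users) >= threshold}


# Hypothetical sample: three accounts share one fingerprint, one stands alone.
registrations = [
    ("u1", "fp_a"),
    ("u2", "fp_a"),
    ("u3", "fp_a"),
    ("u4", "fp_b"),
]

flagged = repeat_fingerprints(group_by_fingerprint(registrations))
for fingerprint, users in flagged.items():
    print(f"{fingerprint}: {len(users)} accounts -> review before blocking")
```

The sketch flags fingerprints for review rather than blocking them outright, because a shared fingerprint is a strong signal but not proof of abuse.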
Even with Distil's sophisticated profiling tool available to us, we knew that the spam fighting process would continue to have a manual component. First, there are still thousands of 'sleeper' accounts, registered before we implemented Distil, that could be activated. Second, we know that we cannot simply rely on proxy detection and fingerprint collisions to identify spam accounts. Some of our users are in countries where a proxy is the only way to access a free and open internet. Other users are in environments that have identical device fingerprints and a shared IP, such as a classroom computer lab.
However, by taking advantage of the tools that Distil offers, we can now stop many of the account registrations at the source. In the same time that it once took us to moderate a single new user account that had just posted spam, we can now block a unique ID that would have been used to create a dozen or even a hundred more accounts.
We've seen trends in our account registration logs that show that the new methods are working. As we block spammers in ways they can't circumvent through proxies, their ability to register multiple accounts diminishes. Without being able to mass register accounts to later activate when selling link spam, Drupal.org becomes a less viable target.
While some spam still gets through, whether from old sleeper accounts, or lucky new spammers that manage to slip by, the overall reduction in spam has been significant. This lets our volunteers and internal staff direct more of their efforts at moving the project forward, rather than fighting spam.
With fewer illegitimate account registrations, we're also able to improve the metrics we use to measure our community health and engagement, by lowering the noise-to-signal ratio in user activity.
We want to thank Distil Networks for joining the Drupal Association as a Premium Technology Supporter. The tools that Distil Networks provides enable us to take better care of the home of the community. Fighting spam is a never-ending challenge: as long as there is a financial incentive to post spam, bad actors will continue to evolve their methods, but with a partner like Distil Networks we are now equipped to stay one step ahead.
To learn more about how Drupal.org and Distil Networks partnered to tackle spam, and to learn how you could leverage a similar solution for your own site, please join us at our webinar on April 5th, at 10am Pacific.
Distil Networks will be joining us at DrupalCon Baltimore from April 24 to 28. We invite the community to join us there and learn more about our partnership.