Over the last two weeks, usage stats for all my projects have fallen dramatically. Doing some random sampling, it looks like this is not affecting only my projects.

For instance, Advanced help had a 14% drop compared to the previous week in the June 7th stats, and a 49% drop compared to the previous week in the June 14th stats.

I don't believe such huge drops are because the module (or Drupal itself) has become significantly less popular in just one week - so the most likely explanation is that this is caused by an error.

Project usage overview


Comments

gisle’s picture

Project: Drupal.org customizations » Drupal.org infrastructure
Version: 7.x-3.x-dev »
Component: Miscellaneous » Other
Status: Needs work » Active

Moving to right queue.

antongp’s picture

Checked a few projects I maintain - each decreased by ~50%. Yep, looks like it's some error; I don't believe Views, Token, ctools and others really lost about half of their installations.

sylus’s picture

Yeah, noticed this as well for large distros such as panopoly and commerce_kickstart.

David_Rothstein’s picture

To add another data point, Drupal core doesn't have any usage stats at all listed from that time period (the most recent data displayed at https://www.drupal.org/project/usage/drupal is from May 31) so something does seem to be wrong...

DerekAhmedzai’s picture

FileSize
32.05 KB

My module (Fitvids) had a 10% drop too. I checked some other projects (Panels, Views) and they had dropped too.

I just checked today to see if it was fixed; now it's a 50% drop! What's going on?!

Showing 50% drop in usage!

This needs to be sorted out asap, because if the numbers aren't accurate, then they are meaningless.

sylus’s picture

Priority: Normal » Major

Increasing the priority. Ideally if usage statistics are incorrect we could just disable the metrics until they are working again. I believe there was a related issue about that but can't seem to find it now.

gisle’s picture

sylus wrote:

Ideally if usage statistics are incorrect we could just disable the metrics until they are working again. I believe there was a related issue about that but can't seem to find it now.

Maybe you're thinking about: #2270127: Show info about incorrect usage stats

basic’s picture

We've found the issue on our end: the update stats processing is ingesting incomplete log files, and it isn't smart enough to notice changes in the filesize of those incomplete files. We are going to rewrite and reprocess the usage stats to account for this. In the meantime usage stats will be broken, and they will take some time to catch up to the current day once the processing is fixed.
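
(For illustration only: a minimal sketch of the kind of guard that avoids ingesting still-growing files, assuming a plain directory of daily log files. The path, filenames, and the one-minute re-check interval are made up here and are not the actual drupal.org setup.)

    #!/bin/bash
    # Illustrative sketch: only hand a log file to the stats pipeline once
    # its size has stopped changing between two checks, so partial
    # (current-day) files are never processed before they are complete.
    for f in /var/log/updates/*.log; do
      size1=$(stat -c %s "$f")
      sleep 60                                  # re-check a minute later
      size2=$(stat -c %s "$f")
      if [ "$size1" -eq "$size2" ]; then
        echo "processing $f"                    # size is stable: safe to ingest
      else
        echo "skipping $f (still being written)"
      fi
    done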

sylus’s picture

I was curious whether we have started to reindex the incomplete files. I've already had a few emails about the usage drop for a few projects.

Thanks @gisle for the link; we should definitely try to push that issue forward should this happen again.

Of course thank you very much for looking into this :)

basic’s picture

I've begun indexing files starting at June 10th this morning. We are looking at ~1.5 hours of processing per day of logs, so it will take some time to catch up. Hopefully by tomorrow morning we'll have caught up with most of the processing and the stats will start to come back to life a bit.

mlhess’s picture

Status: Active » Fixed

Marking this fixed.

anrikun’s picture

Status: Fixed » Active

Looks broken again...

Mixologic’s picture

Thanks @Anrikun - we had another issue with our loghost, and were missing a couple of days of log data. That data has been pulled from our backup source and is processing now.

markhalliwell’s picture

I appreciate the diligence and quick responses the Infra team has shown over this issue; it is quite a headache.

I also feel, however, that it is becoming increasingly obvious that the current method by which these logs are being processed is quite unreliable.

Is there, or will there be, any thought/action toward providing a more stable system for parsing these logs? (i.e. putting proper contingencies in place for handling random errors, missing logs, etc.) The entire way these are being parsed seems (from my perspective) rather opaque and manual when something goes awry.

This has been an issue for over a year now (seemingly since D7/server upgrades maybe?), spanning several issues.

Isn't it time that this issue becomes properly fixed, instead of just slapping band-aids on it?

lolandese’s picture

Isn't it time that this issue becomes properly fixed?

Please. And meanwhile #2270127: Show info about incorrect usage stats.

basic’s picture

Of course, we've made about 6 changes over the last 6 months to fix each individual problem we've had with the usage stats. These changes included:

  • Moving the mongodb processing off of OpenStack when it was decommissioned
  • Moving the mongodb processing to EC2
  • Moving the mongodb processing back to a dedicated machine because latency to EC2 was too high / processing was too slow
  • Moving updates traffic from EdgeCast to Fastly and rewriting parts of the processing to work with Fastly logs
  • Fixing issues with the Fastly log processing ingesting incomplete (current-day) log files
  • Troubleshooting rsyslog issues on loghost that prevented us from obtaining the logs, and adding an AWS S3 bucket with additional logs

Since all of this has happened, we (specifically @Mixologic) have also begun work on removing the mongodb processing component, which is inefficient, and replacing it with awk and other gnu tools.

markhalliwell’s picture

Awesome. Yeah, not dogging what y'all have done (I know it's been a lot), just wasn't sure exactly _what_ the plan was.

FWIW, I actually just visited https://www.drupal.org/project/usage/bootstrap to see the current stats, and the drop still appears really drastic?

July 18, 2015	46,381
July 11, 2015	70,627

24k drop and no data since 7/18?

Mixologic’s picture

The usage statistics are a canary in the coal mine that is symptomatic of any number of failures elsewhere in the system. The stats are derived from our updates traffic, which is a firehose of data - it's responsible for about 75% of the bandwidth we use each month (about 12 TB or so). In an effort to reduce costs and provide a more reliable updates service, we moved updates.drupal.org to a CDN (EdgeCast) in May of last year. Recently we've been transitioning to a different CDN (Fastly) and therefore had to update the processing methodology (we moved from rsyncing logs from EdgeCast's servers to having Fastly communicate directly with our loghost). As with any change, there is a risk that new bugs will surface and that assumptions made in the former process no longer apply in the new process.

Each time the process has failed it has been for a different reason that was not previously anticipated. Each time we run into those failures we apply a proper fix so it does not happen again in exactly the same way.

This particular instance was a result of the loghost not responding to fastly's direct logging. Our monitoring and alerting was not tracking that process specifically, so we didn't know the loghost was non-responsive - that will be rectified today. In anticipation of this sort of contingency, we were already dual-logging the fastly logs to S3. Had we not anticipated that the loghost might fail, we would not have backups of that data.

In any case, we have been looking at ways to improve the processing of these stats. Currently we end up with about 40 million records a day, which get moved to another server to process with drush -> mongodb -> the drupal.org database. This process takes about 3-4 hours to run each day, and is a big reason why, when something breaks (like the loghost failing over the 4th of July), it becomes difficult to catch up.

So I've begun a rewrite of the process that removes mongodb from the equation (it was really only acting as a deduplicating key-value store). The rewrite gets back to some unix file-processing roots - awk, cut, uniq -c and sort are going to be the tools we use to handle the fastly and edgecast log files. That part is done, and I was able to reformat every file from mid-February to date. This reformat was able to process in about 6 hours - i.e. about a month's worth of data an hour.

The next step is to do the deduplication and aggregation - preliminary tests show that these processes take about 30 minutes per week's worth of data to run.
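
(A minimal sketch of what such a gnu-tool pass could look like, purely for illustration: the input format - tab-separated site_key, project and version columns in a file named updates-week.tsv - is an assumption made up here, not the actual fastly/edgecast log layout.)

    # Illustrative only. Assumed input: one row per check-in with
    # tab-separated columns: site_key <TAB> project <TAB> version.
    # 1) sort -u keeps a single row per site/project/version (deduplication);
    # 2) dropping the site key and running uniq -c counts how many distinct
    #    sites report each project/version pair (aggregation).
    cut -f1,2,3 updates-week.tsv | sort -u | cut -f2,3 | sort | uniq -c | sort -rn \
      > usage-counts.txt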

The next step will be to load those pre-aggregated count files into the drupal.org database - this is essentially taking the code that's already there, removing 90% of it, and changing one or two things around.

The final step is to ensure this process is running on jenkins.

Some additional value we're going to get out of this: we were discarding a *lot* of data that didn't perfectly match our release names. For example, drupal 8.0.0-dev is being reported to us, but drupal 8.0.x-dev is what is in the database - so we are not counting the 30,000 or so users who have a d8 dev install up and running. The same goes for virtually every development version of a module in contrib. So we're going to see a good bump in numbers that were always there.
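
(To illustrate the kind of release-name normalization described above - the sed rule and file name here are hypothetical, not the actual mapping used on drupal.org:)

    # Illustrative only: fold reported core dev versions like "8.0.0-dev"
    # into the branch-style release names stored on drupal.org ("8.0.x-dev").
    sed -E 's/^([0-9]+\.[0-9]+)\.[0-9]+-dev$/\1.x-dev/' reported-versions.txt
    # e.g. "8.0.0-dev" becomes "8.0.x-dev"; contrib-style versions such as
    # "7.x-3.0-dev" do not match this rule and would need a similar one.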

Secondly, a feature was added a long time ago to provide stats on submodule usage. That data was being sent to drupal.org but ignored; now we've at least got it parsed and counted. When the priorities align, we'll be able to do something like this: https://www.drupal.org/node/1627676#comment-7683233 with it...

Anyhow, hope that helps - we have been focused on these fixes and want to get away from cleanup efforts that take half a day whenever any little thing derails the update-stats freight train. Thanks for being patient.

markhalliwell’s picture

Thanks @Mixologic! That is an awesome overview, definitely helps clarify quite a few questions (and answers some that haven't even been asked yet).

Like I said above, I definitely appreciate y'alls attentiveness to this issue. I wasn't suggesting otherwise.

From what I gathered in that reply, it sounds like y'all are still processing ~6 months' worth of data. That will obviously take a bit, understood. One more question though: I'm assuming, based on what you said above, that the issue I described in #17 will automatically resolve itself once d.o's db stats have been re-imported with the proper numbers, yes?

lolandese’s picture

In an effort to reduce costs and provide a more reliable updates service we had moved updates.drupal.org to a CDN in May of last year (to edgecast) - recently we've been transitioning to a different CDN (fastly) and therefore had to update the processing methodology.

That is definitely news and should be made public to a wider audience than just the twenty-something followers of this issue. A slightly edited version of #18 could go into Drupal News. Being informed makes inconveniences easier to bear. It has been shown that travellers who know the reason for a train delay are less likely to complain about it. See also #2270127: Show info about incorrect usage stats.

Secondly, a feature was added a long time ago to provide stats on submodule usage. That data was being sent to drupal.org but ignored; now we've at least got it parsed and counted.

That is good news.

Thanks for the efforts on this.

sylus’s picture

mqanneh’s picture

+1

derjochenmeyer’s picture

Category: Task » Bug report
Issue summary: View changes
FileSize
345.31 KB

Project usage (including Drupal core) has dropped 50%. This seems to be a bug, not a task.

sylus’s picture

This has been happening off and on for a year, and it is starting to get very frustrating that we can't seem to fix this.

drumm’s picture

We're working on a new method for aggregating usage statistics as we fix this week's issue.

sylus’s picture

Thanks @drumm that is exciting news, appreciate it!

markhalliwell’s picture

Priority: Major » Critical

Progress update? Both for this latest issue, whatever it may be, as well as an overall update. It's been like this for over a week now, and it does seem like it's taking longer and longer to fix these errors when they do appear. Maybe that's just a side effect of developing/implementing the "new method", idk.

I know this is probably like the last thing on y'alls list, but I have to be honest... it's not very reassuring to consistently just hear "we're working on it" when it breaks.

drumm’s picture

Mixologic’s picture

If you'd like to follow along, the issue and code where this is being reworked are here: #2575425: Project side of new d.o usage processing method, which has been implemented on the following dev site:

https://mixologic-updatestats-drupal.redesign.devdrupal.org/project/usage (There is a sharp drop in Feb, as that was the earliest we had data and we only had a partial week; that partial week will not be used in production)

All data from Feb onwards has been reprocessed on that site utilizing the new method (which also properly accounts for -dev sites, which the former method did not).

There is one other big caveat as to why this has been "taking over a week". Earlier in this thread I said:

This particular instance was a result of the loghost not responding to fastly's direct logging. Our monitoring and alerting was not tracking that process specifically, so we didn't know the loghost was non-responsive - that will be rectified today. In anticipation of this sort of contingency, we were already dual-logging the fastly logs to S3. Had we not anticipated that the loghost might fail, we would not have backups of that data.

It turned out that pushing all of our log data from fastly to S3 once a day resulted in a silent API timeout from AWS that we were unaware of. The loghost stopped responding again, and we had to rely on the S3 backups - except the S3 backups were only 80-85% complete as a result of the timeout. Additionally, we lost an entire day's worth of data on the 20th of September: our S3 backups were only kept for 14 days, the failure happened right at the start of DrupalCon BCN, and everybody took time off after the con, so we missed the backup window there. We've since extended that to 45 days of S3 plus Amazon Glacier forever.

In other words, the update stats will not be accurate for the weeks of:

Sep 19th-26th (missing data for the 20th)
Sep 27th- Oct 2nd
Oct 3rd-10th (using S3 backups from the 28th->6th with only about 80% data).

We're still working with fastly to get the backup log data sorted out (despite switching from saving every 24 hours to every 2 hours, we're still seeing lost data).

Mixologic’s picture

Status: Active » Fixed

The new process is in place, and the stats have been updated.

Please let us know if there are any wild discrepancies and we can adjust the process further (in a new issue, of course).

greggles’s picture

Thanks, Mixologic! I reviewed a few projects and they look reasonably good.

pingwin4eg’s picture

Component: Other » Updates System
Status: Fixed » Active
FileSize
60.83 KB

Some trouble has happened again. Or is some work still in progress?

Please see this project's stats: https://www.drupal.org/project/usage/styleswitcher . Yesterday (after the fix was provided) the last 2 weeks' numbers were normal - more than 300 usages in total - but now they have changed: they fell dramatically (just as before the fix).

The numbers for September 20, 2015 and earlier seem OK.

As far as I know, stats were previously never recounted for earlier weeks. This latest fix recounted stats once again from Feb of this year (again, as far as I know), and the stats were OK yesterday.

Mixologic’s picture

Thanks @pingwin4eg - a job from the old process was still configured to run, and it overwrote the better stats. I've rerun the correct process and shut off the old one so that won't happen again. Have a look again.

pingwin4eg’s picture

Status: Active » Fixed

Yes, stats seem OK now. I also checked other popular contribs - they're OK too. Thank you!

mqanneh’s picture

Status: Fixed » Active

It happened again: the usage stats weren't updated for October 11, 2015, and all projects have 0 reported installs for the last week.

check https://www.drupal.org/project/usage/facebook_comments_block

hass’s picture

There is also a bug in the "project" module's stats. It has peaks of 10,000 and an install base of ~500.

markhalliwell’s picture

Status: Active » Fixed
hass’s picture

Status: Fixed » Active

How would this fix the project peaks that make the graph unreadable?

markhalliwell’s picture

Status: Active » Fixed

In response to #35 (which is what re-opened the issue), I linked and closed a separate issue.

@hass, the issue you brought up is also a separate issue. Create a new issue (as @Mixologic has asked people to do); that appears to be bad data for a single project and should be treated as such. It does not indicate that the entire process is broken, which would affect all projects and is what this issue has been about.

Mixologic’s picture

It is not bad data; it is a bad definition of what "usage" really means. Peaks like that are sometimes created when a lot of sites get created as part of some testing issue, or when a class is learning drupal and they create tons of sites. We do not currently have a way to separate a "real" site from a CI or testing site. Lest you think this is new, have a look at project stats on an old dev site with the old processing methodology - the spikes are there too: https://composer-drupal.redesign.devdrupal.org/project/usage/project

Someday, we will take the update stats data and turn it into a proper dataset. As it is right now, what we really have is a precomputed aggregate report with some built-in limitations - it is *only* the data that is reported to us, and a site sometimes does not request everything (I'll see the same site ask for a slightly different group of modules every five minutes, always omitting a couple - i.e. not every request always gets through).

When we have it as a full dataset, we can add better rules such as "exclude sites that have only contacted us once" or "only count sites that have been around for longer than a month". Or "how many drupal sites use *both* rules and context?" - things like that, where our reporting is currently bound to our limited definition of 'usage'.
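
(For example - hypothetical, assuming a raw per-check-in file with tab-separated site_key and date columns; the filename and threshold are made up - a rule like "exclude sites that have only contacted us once" could be a one-liner over the full dataset:)

    # Illustrative only: count check-ins per site and keep only the site
    # keys that were seen more than once.
    cut -f1 checkins.tsv | sort | uniq -c | awk '$1 > 1 { print $2 }' > real-sites.txt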

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.