It appears that this week the usage statistics of many Drupal projects have gone down by about 40%.

Examples:

https://drupal.org/project/usage/context
May 4, 2014 4 8,912 64,038 72,954
April 27, 2014 6 12,952 113,255 126,213

https://drupal.org/project/usage/bean
May 3, 2014 3,175 3,175
April 26, 2014 5,966 5,966

https://drupal.org/project/usage/nodequeue
May 3, 2014 100 6,369 16,327 22,796
April 26, 2014 136 10,007 28,592 38,735

It's not likely that there has been a large, across-the-board decline in the usage of all Drupal projects, so perhaps there is a bug.
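
For reference, the week-over-week drop can be computed directly from the totals quoted above. This is a minimal Python sketch, assuming the last number in each row is the weekly total; the function name is only illustrative.

```python
def percent_drop(previous, current):
    """Return the week-over-week decline as a percentage of the previous value."""
    return (previous - current) / previous * 100

# Totals copied from the usage pages quoted above: (previous week, this week).
totals = {
    "context": (126213, 72954),
    "bean": (5966, 3175),
    "nodequeue": (38735, 22796),
}

for project, (previous, current) in totals.items():
    print(f"{project}: {percent_drop(previous, current):.1f}% drop")
```

For these three projects the drop comes out between roughly 41% and 47%.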

Comments

lolandese’s picture

Yes. This seems to go beyond the usual temporary drop seen over the weekend. The numbers went back up, but only slightly, hence the huge drop.

This does not seem to be a duplicate of any of the numerous existing issues about the usage statistics.

sylus’s picture

Priority: Normal » Major

I am raising the priority of this issue to major. A lot of people use these statistics to judge both project health and community strength.

markhalliwell’s picture

Project: Drupal.org site moderators » Drupal.org infrastructure
Component: Project/Git problem » Webserver

Moving to infra and assigning to "webserver" component (saw in IRC: "the rebuilt webnodes aren't allowing the varnish logs to be collected").

drumm’s picture

http://localhost:8080/view/BZR%20Vendors/job/util_sync_varnishlogs/1297/... has the underlying problem:

rsync: failed to connect to www2.drupal.bak: Connection refused (111)
rsync: failed to connect to www6.drupal.bak: Connection refused (111)
rsync: failed to connect to www7.drupal.bak: Connection refused (111)

Additionally, http://drupalcode.org/project/infrastructure.git/blob/refs/heads/master:... should probably be a bit noisier. || true allowed the job to look like it succeeded.
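
To illustrate the point about || true: a job that swallows the exit status of every sync command will look green even when every host refuses the connection. Below is a rough Python sketch of the noisier alternative. It is not the actual Jenkins job; the function names are made up, and the two commands are placeholders standing in for the per-host rsync calls.

```python
import subprocess
import sys

def run_all(commands):
    """Run every command and return (command, exit code) for each one that failed."""
    failures = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            failures.append((command, result.returncode))
    return failures

def exit_status(failures):
    """Non-zero when anything failed, so the CI job is marked as failed."""
    return 1 if failures else 0

# Placeholder commands: the first succeeds, the second exits with status 1.
ok = [sys.executable, "-c", "pass"]
bad = [sys.executable, "-c", "import sys; sys.exit(1)"]

failures = run_all([ok, bad])
for command, code in failures:
    print(f"FAILED (exit {code}): {command}", file=sys.stderr)
```

A real job would finish with sys.exit(exit_status(failures)), so that all hosts are still attempted but a single failure is enough to turn the build red.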

basic’s picture

The recently rebuilt webnodes were missing their rsync config, which was added a couple of days ago, but they were also found to be missing a firewall rule that allows these logs to be synced. This firewall rule has now been added, and the statistics should fix themselves in a couple of days.

markhalliwell’s picture

It's been more than a couple of days. I'm hoping it'll get fixed after today (since I know that Sundays are usually when these statistics update). However, it isn't encouraging to keep seeing these numbers drop right before they do...

TR’s picture

This Sunday's statistics were even worse than last Sunday's. The original poster reported a 40% drop, but this week the drop is closer to 60%.

lolandese’s picture

As the usage statistics are broken from time to time, I propose having a message ready to display on the project pages whenever that happens. It is preferable not to show information when it is incorrect.

Something like:
Reported installs: Accurate numbers are currently unavailable. View usage statistics.

Note that a link to the details should be kept in place. Opened a separate issue for this: #2270127: Show info about incorrect usage stats.

Thanks.

nnewton’s picture

Having a message is not a bad idea. Usage stats are quite brittle and are a low priority for us. Having a way to mark when they are broken would help with the confusion.

-N

webchick’s picture

Issue tags: +Needs tests

RdeBoer’s picture

I agree with the message, but not because they are low priority. On the contrary.

I feel usage stats are of huge importance. They are one of the indicators for deciding which module to choose when there are multiple alternatives. They are not the only deciding factor and not one to follow blindly (it's not all about popularity), but they say something important about the module.

The message would be helpful. Better still would be to make the process less brittle (if it is). I appreciate that reliable stats are hard to do and resource-hungry.
I for one would happily Gittip anyone working on this.

rjacobs’s picture

"Usage stats are quite brittle and are a low priority for us."

Is there any public material about what the current pain points are? Also, is there any documentation about the end-to-end functional setup for collecting, processing and displaying these statistics (e.g. what happens between the logic in the update module on individual sites and the project module on drupal.org)? I've followed along on some recent related issues (e.g. #2128619, #2129811, #2176153, etc.) but have only been able to glean some of the details. It sounds like it's not a fully automated process and there are some problematic manual steps involved, but it's not clear whether this is due to software issues, hardware issues, etc.

I remember seeing some diagrams about the overall d.o. infrastructure, and people involved, but I also recall they were quite old and not specific to this topic.

I'm just curious about all this and would like to personally understand things better, if possible.

basic’s picture

The plot has thickened: the missing rsync config was only the beginning. When a web node is rebuilt, its existing logs disappear from the parsing server as well, because rsync runs with --delete when it syncs to the parsing server. When the log parsing jobs run, there are no previous log files to parse and the stats drop.
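
This failure mode can be modelled without rsync at all. The following Python sketch treats each side as a set of file names and mimics a one-way mirror with and without deletion. The file names are invented for illustration, and the function is a simplified model of the effect, not a description of how rsync works internally.

```python
def mirror(source, destination, delete):
    """Simplified model of a one-way sync between two sets of file names.

    With delete=True, files missing from the source are removed from the
    destination, similar in effect to rsync's --delete option.
    """
    result = set(destination) | set(source)
    if delete:
        result &= set(source)
    return result

# Hypothetical state: the parsing server already holds three days of logs.
parsing_server = {"log-day1.gz", "log-day2.gz", "log-day3.gz"}

# After a rebuild the web node starts over with a single fresh log.
rebuilt_node = {"log-day4.gz"}

print(sorted(mirror(rebuilt_node, parsing_server, delete=True)))
# ['log-day4.gz']
print(sorted(mirror(rebuilt_node, parsing_server, delete=False)))
# ['log-day1.gz', 'log-day2.gz', 'log-day3.gz', 'log-day4.gz']
```

With deletion enabled, the first sync after a rebuild leaves the parsing server with only the new log, which matches the drop described here.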

Secondly, the names of the log files changed, so the Jenkins job was not picking up any new logs after the servers had been rebuilt. This job has been updated to reflect the correct log file names.
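
A renamed log file is easy to miss because a pattern that matches nothing is not an error by itself. One possible guard, sketched here in Python, is to treat an empty match as a failure. The stale pattern is made up, since the thread does not say what the old names were, and the function name is hypothetical.

```python
import fnmatch

def select_logs(filenames, pattern):
    """Return the names matching pattern, raising if nothing matches."""
    matches = sorted(fnmatch.filter(filenames, pattern))
    if not matches:
        raise FileNotFoundError(f"no log files match {pattern!r}")
    return matches

# Hypothetical directory listing after the rename.
listing = ["varnishncsa.log-20140425.gz", "varnishncsa.log-20140426.gz"]

print(select_logs(listing, "varnishncsa.log-*.gz"))

try:
    # Hypothetical stale pattern that no longer matches anything.
    select_logs(listing, "varnish-access-*.gz")
except FileNotFoundError as error:
    print(f"sync job should fail: {error}")
```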

I was able to pull the log files from our log host, copy them to each web node individually, rename them to the new format, and kick off a sync+parse job in Jenkins. It may take a few days for this to complete since there are a lot of logs to parse. I started the job about 12 hours ago and it is about 25% complete, so I expect parsing to take about 48 hours in total, after which we will know whether the stats are fixed.
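
The 48-hour figure is a simple linear extrapolation, assuming the parser keeps working at a constant rate. A quick check in Python (the function name is only illustrative):

```python
def estimated_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of total run time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done

total = estimated_total_hours(12, 0.25)
print(total)       # 48.0
print(total - 12)  # 36.0 hours still to go
```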

As for the current process, it is automated by two Jenkins jobs (one to sync varnish logs and one to run the parsing with drush). It relies on varnish logs which prevents us from using the CDN to serve update traffic; we have to set no-cache and pass all updates back to the origin servers to record the requests in our varnish logs. The updates traffic alone accounts for 30-45Mbps of sustained traffic. It would be great to move away from the current system to something that allows us to use the CDN, but I don't think there are plans to do this currently.
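
To put the bandwidth figure in perspective, a sustained rate can be converted to a daily volume. This is a back-of-the-envelope Python sketch, assuming decimal units (1 Mbps = 10^6 bits per second, 1 GB = 10^9 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60

def gigabytes_per_day(megabits_per_second):
    """Convert a sustained rate in Mbps to gigabytes transferred per day."""
    bits_per_day = megabits_per_second * 1_000_000 * SECONDS_PER_DAY
    return bits_per_day / 8 / 1_000_000_000

for rate in (30, 45):
    print(f"{rate} Mbps sustained is about {gigabytes_per_day(rate):.0f} GB per day")
```

Under those assumptions, 30-45 Mbps works out to roughly 324-486 GB per day that has to be served from the origin servers.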

nnewton’s picture

@RdeBoer,

Hi, I understand they are useful. However, they kind of have to remain low priority for now. They don't take the site down, they don't remove a core service, they don't stop development work. We are a small team for the size and complexity of our infrastructure....something has to be low priority :). That said, as you can see above we do have someone working on this.

@rjacobs,

They are brittle mostly because of lack of time and documentation I think. The scripts break easily, because they assume things and the entire process is very prone to gaming of the system (which module authors do attempt). This makes the process occasionally break and makes the numbers....questionable...at times. That is just my view though and I'm not the most knowledgeable of the system. I mostly interact with it when I change something that breaks it :).

-N

dqd’s picture

I don't 100% agree with "low prio", since this is also a matter of module maintainer motivation. Add the debate around the D8 changes for theme and module developers and some maintainer frustration, and you will see the whole picture. We need module maintainers to stay motivated in the long run ... now. Such a drop can rub salt into any frustration they already feel. I agree: I wouldn't say it's Major prio, but it should definitely NOT be THAT low ... now. (?)

webchick’s picture

This issue is categorized as "major" and has a DA staff member looking into it, so I think we can stop debating the relative priority. :) Looking forward to seeing if Jenkins is able to sort it! Thanks, basic.

basic’s picture

The job is still running as of right now, and has moved on to Processing /var/log/DROP/www6/varnishncsa.log-20140425.gz, so we're a little over halfway. If this continues at the current speed, it should be done by the end of the week, hopefully within the 48-hour time frame I estimated.

rjacobs’s picture

Thanks for the info basic, drumm and nnewton.

I'm assuming you are using the varnish logs for this, and offloading everything into an isolated batch process, because doing any usage tallying while update requests come in has the potential to overload the update server?

In addition to reporting any observed abnormalities with the stats, are there any small things that d.o. site users and module maintainers (like me) can do to keep the stat updates rolling along more smoothly?

nnewton’s picture

@webchick, I'm not debating priority. I'm trying to be more active in the queue to explain how our decision making works and be a bit more transparent.

@Digidog I fully agree they are important and, as webchick noted, there are quite a few resources working on fixing them at the moment. I'm merely explaining why they are occasionally relegated behind other tasks. Security and the stability of other services that would literally stop development do derail them sometimes.

webchick’s picture

@nnewton: Sorry, I meant #18 to be directed at #17. :( Was trying to refocus the issue on the positive: the thing is being fixed. yay! :)

dqd’s picture

#22: webchick: Sorry. My comment @#17 may have sounded a little exaggerated, but I didn't mean it that way. I fully agree with you. :) Respect for all the effort in this. No debate. :) I was just adding some weight so that some different opinions (from the community) get echoed. It was meant more as "fighting for Drupal" as a whole. All positive. Greetings from Berlin.

basic’s picture

The Jenkins jobs have finished parsing the missing logs. I wonder if there is some caching in the way of the data being updated, or if it didn't help at all. @drumm may know if there are caches that need to be cleared or if the run just didn't help.

drumm’s picture

I ran

cache_clear_all('*', 'cache_project_usage', TRUE);

Numbers for the last week are now down by ~10%.

basic’s picture

Status: Active » Fixed

webchick’s picture

Hm. While that's definitely much better than losing 40% of our usage traffic in one week ;), 10% in one week is still pretty alarming, too. Are we sure that none of the web heads/log files is being missed, and/or is there some other thing that could account for that dramatic a drop-off?

rjacobs’s picture

Great! Thanks for taking care of this.

Just curious, but what accounts for that 10% (which seems to be a common irregularity that shows up briefly each week)? Perhaps only some of the varnish logs make it into each batch tally, and the rest catch up later?

Also, if anyone has any comments about my questions from #20, I'm still curious to know more about all this :)

Cheers

basic’s picture

My understanding of the 10% issue is that it "just happens". I don't think anyone has taken the time to figure out what causes it. I'm also not an expert on the project statistics and just picked up all of the pieces that appeared to be broken. It is possible that the 10% this time comes from missing files, or from files that had not been synced to the loghost before the web nodes were rebuilt. Prior to these rebuilds, I and others on the infra team weren't aware of the requirement to back up the varnishncsa logs. Unfortunately, the logs that were lost can't be recovered.

That said, I also only have a good understanding of the glue that makes the system work, and not specifics about the system itself. I am not sure anyone has a full understanding of the entire system. I can tell you that:

- Jenkins has jobs which automate these tasks
- webnodes write varnishncsa logs for all webnode traffic
- those logs are copied with rsync -av --delete to util.drupal.org, where they are queued for processing
- those logs are also copied daily to loghost.drupal.org (this is where I was able to pull the backups which let me re-run stats)
- util runs a drush command which parses the logs
- util connects to the buildvm and does processing in MongoDB
- MongoDB has stats that are then read by Drupal and stored in the Drupal database
- the statistics have a special 24 hour cache since they aren't changing frequently
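
As a very rough illustration of the parsing step in the list above, the Python sketch below counts distinct sites per project from NCSA-style access log lines. It is not the drush command that runs on util. The log lines are fabricated, and the URL layout (/release-history/<project>/<core version> with a site_key query parameter) is an assumption about what update requests look like.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Matches the request part of an NCSA-style log line: "GET <url> HTTP/1.1"
REQUEST = re.compile(r'"GET (?P<url>\S+) HTTP/[\d.]+"')

def tally_usage(lines):
    """Return {project: number of distinct site keys seen}."""
    sites = defaultdict(set)
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue
        url = urlsplit(match.group("url"))
        parts = url.path.strip("/").split("/")
        # Assumed path layout: release-history/<project>/<core version>
        if len(parts) != 3 or parts[0] != "release-history":
            continue
        site_keys = parse_qs(url.query).get("site_key")
        if site_keys:
            sites[parts[1]].add(site_keys[0])
    return {project: len(keys) for project, keys in sites.items()}

# Fabricated sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [25/Apr/2014:10:00:00 +0000] "GET /release-history/context/7.x?site_key=aaa HTTP/1.1" 200 512',
    '192.0.2.2 - - [25/Apr/2014:10:00:05 +0000] "GET /release-history/context/7.x?site_key=bbb HTTP/1.1" 200 512',
    '192.0.2.1 - - [25/Apr/2014:11:00:00 +0000] "GET /release-history/context/7.x?site_key=aaa HTTP/1.1" 200 512',
    '192.0.2.3 - - [25/Apr/2014:11:30:00 +0000] "GET /release-history/bean/7.x?site_key=ccc HTTP/1.1" 200 512',
]

print(tally_usage(sample))
```

Counting distinct site keys rather than raw requests means a site that checks for updates several times in one period is only counted once. It also shows why a missing log file lowers the totals directly: sites whose requests only appear in that file are never seen.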

Alan D.’s picture

I assume that webchick was looking for tests? New issue or reopen this one? (not that I am in a position to handle this!)

webchick’s picture

Not particularly, no. I try and add that tag to forehead-smacking Drupal.org issues that we'd really rather not have broken while doing routine maintenance, so that hopefully when the Drupal Association hires a QA engineer they have a nice list of places to start covering with tests. :)

donSchoe’s picture

Hey, is this really fixed?

After this reported 40% drop (later corrected to 10%), my module's usage statistics dropped to zero this week. See https://drupal.org/project/usage/views_dynamic_fields

Is this related?

gisle’s picture

The project usage statistics always drop to zero on Sunday, and then start to make sense again towards the end of the week. Relax until at least Friday.

drumm’s picture

Yep, see #2176757: Hide the last date row in project stats page when it has no values.

I double checked the usual spots and everything does look okay.

drumm’s picture

Status: Fixed » Active

Actually, the log format did subtly change. This will be the same situation as #14-24 again - reparsing the log files will take some time. There is a strong possibility this will remove the 10% drop too.

basic’s picture

These appear to be fixed now, after re-running against the broken log format with the fix drumm added.

markhalliwell’s picture

Status: Active » Fixed

Confirmed

RdeBoer’s picture

Yep. Stats are up! (mostly)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Component: Webserver » Servers

lolandese’s picture

Status: Closed (fixed) » Active

The numbers have been down by more than 50% for two weeks now.

Also, please review #2270127: Show info about incorrect usage stats.

It is preferable not to show information when it is incorrect.

gisle’s picture

Status: Active » Closed (duplicate)
Parent issue: » #2509574: Project usage stats have probably gone bad (again)

As I understand it, since the cause of these occasional glitches varies, closed issues should not be reopened.

There is already an active issue open for the present glitch.

lolandese’s picture

Missed that one.

Thanks for the link.