Over the last two weeks, usage stats for all my projects have fallen dramatically. Doing some random sampling, it looks like this is not affecting only my projects.

For instance, Advanced help had a 14% drop compared to the previous week in the June 7th stats, and a 49% drop compared to the previous week in the June 14th stats.

I don't believe such huge drops are because the module (or Drupal itself) has become significantly less popular in just one week - so the most likely explanation is that this is caused by an error.

Project usage overview


Comments

gisle’s picture

Project: Drupal.org customizations » Drupal.org infrastructure
Version: 7.x-3.x-dev »
Component: Miscellaneous » Other
Status: Needs work » Active

Moving to right queue.

antongp’s picture

Checked a few projects I maintain - each decreased by ~50%. Yep, looks like it's some error; I don't believe Views, Token, ctools and others really lost about half of their installations.

sylus’s picture

Yeah, noticed this as well for large distros such as panopoly and commerce_kickstart.

David_Rothstein’s picture

To add another data point, Drupal core doesn't have any usage stats at all listed from that time period (the most recent data displayed at https://www.drupal.org/project/usage/drupal is from May 31) so something does seem to be wrong...

DerekAhmedzai’s picture

FileSize
32.05 KB

My module (Fitvids) had a 10% drop too. I checked some other projects (Panels, Views) and they had dropped too.

I just checked today to see if it was fixed; now it's a 50% drop! What's going on?!

Showing 50% drop in usage!

This needs to be sorted out asap, because if the numbers aren't accurate, then they are meaningless.

sylus’s picture

Priority: Normal » Major

Increasing the priority. Ideally if usage statistics are incorrect we could just disable the metrics until they are working again. I believe there was a related issue about that but can't seem to find it now.

gisle’s picture

sylus wrote:

Ideally if usage statistics are incorrect we could just disable the metrics until they are working again. I believe there was a related issue about that but can't seem to find it now.

Maybe you're thinking about: #2270127: Show info about incorrect usage stats

basic’s picture

We've found the issue on our end: the update stats processing is ingesting incomplete log files, and it isn't smart enough to notice changes in the filesize of those incomplete files. We are going to rewrite and reprocess the usage stats to account for this. In the meantime usage stats will be broken, and they will take some time to catch up to the current day once the processing is fixed.
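
(For illustration only: a minimal sketch of the kind of guard that avoids ingesting still-growing files, assuming a plain directory of daily log files. The path, filenames, and the one-minute re-check interval are made up here and are not the actual drupal.org setup.)

    #!/bin/bash
    # Illustrative sketch: only hand a log file to the stats pipeline once
    # its size has stopped changing between two checks, so partial
    # (current-day) files are never processed before they are complete.
    for f in /var/log/updates/*.log; do
      size1=$(stat -c %s "$f")
      sleep 60                                  # re-check a minute later
      size2=$(stat -c %s "$f")
      if [ "$size1" -eq "$size2" ]; then
        echo "processing $f"                    # size is stable: safe to ingest
      else
        echo "skipping $f (still being written)"
      fi
    done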

sylus’s picture

I was curious whether we have started to reindex the incomplete files. I've already had a few emails about the usage drop for a few projects.

Thanks @gisle for the link; we should definitely try to push that issue forward should this happen again.

Of course thank you very much for looking into this :)

basic’s picture

I've begun indexing files starting at June 10th this morning. We are looking at ~1.5 hours of processing per day of logs, so it will take some time to catch up. Hopefully by tomorrow morning we'll have caught up with most of the processing and the stats will start to come back to life a bit.

mlhess’s picture

Status: Active » Fixed

Marking this fixed.

anrikun’s picture

Status: Fixed » Active

Looks broken again...

Mixologic’s picture

Thanks @Anrikun - we had another issue with our loghost, and were missing a couple of days of log data. That data has been pulled from our backup source and is processing now.

markhalliwell’s picture

I appreciate the diligence and quick responses the Infra team has shown over this issue; it is quite a headache.

I also feel, however, that it is becoming increasingly obvious that the current method by which these logs are being processed is quite unreliable.

Is there, or will there be, any thought/action toward providing a more stable system for parsing these logs? (i.e. putting proper contingencies in place for handling random errors, missing logs, etc.) The entire way these are being parsed seems (from my perspective) rather opaque and manual when something goes awry.

This has been an issue for over a year now (seemingly since D7/server upgrades maybe?), spanning several issues.

Isn't it time that this issue becomes properly fixed, instead of just slapping band-aids on it?

lolandese’s picture

Isn't it time that this issue becomes properly fixed?

Please. And meanwhile #2270127: Show info about incorrect usage stats.

basic’s picture

Of course, we've made about 6 changes over the last 6 months to fix each individual problem we've had with the usage stats. These changes included:

  • Moving the mongodb processing off of OpenStack when it was decommissioned
  • Moving the mongodb processing to EC2
  • Moving the mongodb processing back to a dedicated machine because latency to EC2 was too high / processing was too slow
  • Moving updates traffic from EdgeCast to Fastly and rewriting parts of the processing to work with Fastly logs
  • Fixing issues with the Fastly log processing ingesting incomplete (current-day) log files
  • Troubleshooting rsyslog issues on loghost that prevented us from obtaining the logs, and adding an AWS S3 bucket with additional logs

Since all of this has happened, we (specifically @Mixologic) have also begun work on removing the mongodb processing component, which is inefficient, and replacing it with awk and other gnu tools.

markhalliwell’s picture

Awesome. Yeah, not dogging what y'all have done (I know it's been a lot), just wasn't sure exactly _what_ the plan was.

FWIW, I actually just visited https://www.drupal.org/project/usage/bootstrap to see the current stats, and the drop still appears really drastic?

July 18, 2015	46,381
July 11, 2015	70,627

24k drop and no data since 7/18?

Mixologic’s picture

The usage statistics are a canary in the coal mine that is symptomatic of any number of failures elsewhere in the system. The stats are derived from our updates traffic, which is a firehose of data - it's responsible for about 75% of the bandwidth we use each month (about 12 TB or so). In an effort to reduce costs and provide a more reliable updates service, we moved updates.drupal.org to a CDN (EdgeCast) in May of last year. Recently we've been transitioning to a different CDN (Fastly) and therefore had to update the processing methodology (we moved from rsyncing logs from EdgeCast's servers to having Fastly communicate directly with our loghost). As with any change, there is a risk that new bugs will surface and that assumptions made in the former process no longer apply in the new process.

Each time the process has failed it has been for a different reason that was not previously anticipated. Each time we run into those failures we apply a proper fix so it does not happen again in exactly the same way.

This particular instance was a result of the loghost not responding to fastly's direct logging. Our monitoring and alerting was not tracking that process specifically, so we didn't know the loghost was non-responsive - that will be rectified today. In anticipation of this sort of contingency, we were already dual-logging the fastly logs to S3. Had we not anticipated that the loghost might fail, we would not have backups of that data.

In any case, we have been looking at ways to improve the processing of these stats. Currently we end up with about 40 million records a day, which get moved to another server to process with drush -> mongodb -> the drupal.org database. This process takes about 3-4 hours to run each day, and is a big reason why, when something breaks (like the loghost failing over the 4th of July), it becomes difficult to catch up.

So I've begun a rewrite of the process that removes mongodb from the equation (it was really only acting as a deduplicating key-value store). The rewrite gets back to some unix file-processing roots - awk, cut, uniq -c and sort are going to be the tools we use to handle the fastly and edgecast log files. That part is done, and I was able to reformat every file from mid-February to date. This reformat was able to process in about 6 hours - i.e. about a month's worth of data an hour.

The next step is to do the deduplication and aggregation - preliminary tests show that these processes take about 30 minutes per week's worth of data to run.
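
(A minimal sketch of what such a gnu-tool pass could look like, purely for illustration: the input format - tab-separated site_key, project and version columns in a file named updates-week.tsv - is an assumption made up here, not the actual fastly/edgecast log layout.)

    # Illustrative only. Assumed input: one row per check-in with
    # tab-separated columns: site_key <TAB> project <TAB> version.
    # 1) sort -u keeps a single row per site/project/version (deduplication);
    # 2) dropping the site key and running uniq -c counts how many distinct
    #    sites report each project/version pair (aggregation).
    cut -f1,2,3 updates-week.tsv | sort -u | cut -f2,3 | sort | uniq -c | sort -rn \
      > usage-counts.txt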

The next step will be to load those pre-aggregated count files into the drupal.org database - this is essentially taking the code that's already there, removing 90% of it, and changing one or two things around.

The final step is to ensure this process is running on jenkins.

Some additional value we're going to get out of this: we were discarding a *lot* of data that didn't perfectly match our release names. For example, drupal 8.0.0-dev is being reported to us, but drupal 8.0.x-dev is what is in the database - so we are not counting the 30,000 or so users who have a d8 dev install up and running. The same goes for virtually every development version of a module in contrib. So we're going to see a good bump in numbers that were always there.
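
(To illustrate the kind of release-name normalization described above - the sed rule and file name here are hypothetical, not the actual mapping used on drupal.org:)

    # Illustrative only: fold reported core dev versions like "8.0.0-dev"
    # into the branch-style release names stored on drupal.org ("8.0.x-dev").
    sed -E 's/^([0-9]+\.[0-9]+)\.[0-9]+-dev$/\1.x-dev/' reported-versions.txt
    # e.g. "8.0.0-dev" becomes "8.0.x-dev"; contrib-style versions such as
    # "7.x-3.0-dev" do not match this rule and would need a similar one.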

Secondly, a feature was added a long time ago to provide stats on submodule usage. That data was being sent to drupal.org but ignored; now we've at least got it parsed and counted. When the priorities align, we'll be able to do something like this: https://www.drupal.org/node/1627676#comment-7683233 with it...

Anyhow, hope that helps - we have been focused on these fixes and want to get away from cleanup efforts that take half a day whenever any little thing derails the update-stats freight train. Thanks for being patient.

markhalliwell’s picture

Thanks @Mixologic! That is an awesome overview, definitely helps clarify quite a few questions (and answers some that haven't even been asked yet).

Like I said above, I definitely appreciate y'alls attentiveness to this issue. I wasn't suggesting otherwise.

From what I gathered in that reply, it sounds like y'all are still processing ~6 months' worth of data. That will obviously take a bit, understood. One more question though: I'm assuming, based on what you said above, that the issue I described in #17 will automatically resolve itself once d.o's db stats have been re-imported with the proper numbers, yes?

lolandese’s picture

In an effort to reduce costs and provide a more reliable updates service we had moved updates.drupal.org to a CDN in May of last year (to edgecast) - recently we've been transitioning to a different CDN (fastly) and therefore had to update the processing methodology.

That is definitely news and should be made public to a wider audience than just the twenty-something followers of this issue. A slightly edited version of #18 could go into Drupal News. Being informed makes inconveniences easier to bear. It has been shown that travellers who know the reason for a train delay are less likely to complain about it. See also #2270127: Show info about incorrect usage stats.

Secondly, a feature was added a long time ago to provide stats on submodule usage. That data was being sent to drupal.org but ignored; now we've at least got it parsed and counted.

That is good news.

Thanks for the efforts on this.

sylus’s picture

mqanneh’s picture

+1

derjochenmeyer’s picture

Category: Task » Bug report
Issue summary: View changes
FileSize
345.31 KB

Project usage (including Drupal core) has dropped 50%. This seems to be a bug, not a task.

sylus’s picture

This has been happening off and on for a year, and it is starting to get very frustrating that we can't seem to fix this.

drumm’s picture

We're working on a new method for aggregating usage statistics as we fix this week's issue.

sylus’s picture

Thanks @drumm that is exciting news, appreciate it!

markhalliwell’s picture

Priority: Major » Critical

Progress update? Both for this latest issue, whatever it may be, as well as an overall update. It's been like this for over a week now, and it does seem like it's taking longer and longer to fix these errors when they do appear. Maybe that's just a side effect of developing/implementing the "new method", idk.

I know this is probably like the last thing on y'alls list, but I have to be honest... it's not very reassuring to consistently just hear "we're working on it" when it breaks.

drumm’s picture

Mixologic’s picture

If you'd like to follow along, the issue and code where this is being reworked are here: #2575425: Project side of new d.o usage processing method, which has been implemented on the following dev site:

https://mixologic-updatestats-drupal.redesign.devdrupal.org/project/usage (There is a sharp drop in Feb, as that was the earliest we had data and we only had a partial week; that partial week will not be used in production)

All data from Feb onwards has been reprocessed on that site utilizing the new method (which also properly accounts for -dev sites, which the former method did not).

There is one other big caveat as to why this has been "taking over a week". Earlier in this thread I said:

This particular instance was a result of the loghost not responding to fastly's direct logging. Our monitoring and alerting was not tracking that process specifically, so we didn't know the loghost was non-responsive - that will be rectified today. In anticipation of this sort of contingency, we were already dual-logging the fastly logs to S3. Had we not anticipated that the loghost might fail, we would not have backups of that data.

It turned out that pushing all of our log data from fastly to S3 once a day resulted in a silent API timeout from AWS that we were unaware of. The loghost stopped responding again, and we had to rely on the S3 backups - except the S3 backups were only 80-85% complete as a result of the timeout. Additionally, we lost an entire day's worth of data on the 20th of September: our S3 backups were only kept for 14 days, the failure happened right at the start of DrupalCon BCN, and everybody took time off after the con, so we missed the backup window there. We've since extended that to 45 days of S3 plus Amazon Glacier forever.

In other words, the update stats will not be accurate for the weeks of:

Sep 19th-26th (missing data for the 20th)
Sep 27th- Oct 2nd
Oct 3rd-10th (using S3 backups from the 28th->6th with only about 80% data).

We're still working with fastly to get the backup log data sorted out (despite switching from saving every 24 hours to every 2 hours, we're still seeing lost data).

Mixologic’s picture

Status: Active » Fixed

The new process is in place, and the stats have been updated.

Please let us know if there are any wild discrepancies and we can adjust the process further (in a new issue, of course).

greggles’s picture

Thanks, Mixologic! I reviewed a few projects and they look reasonably good.

pingwin4eg’s picture

Component: Other » Updates System
Status: Fixed » Active
FileSize
60.83 KB

Some trouble has happened again. Or is some work still in progress?

Please see this project's stats: https://www.drupal.org/project/usage/styleswitcher . Yesterday (after the fix was provided) the last 2 weeks' numbers were normal - more than 300 usages in total - but now they have changed: they fell dramatically (just as before the fix).

The numbers for September 20, 2015 and earlier seem OK.

As far as I know, stats were previously never recounted for earlier weeks. This latest fix recounted stats once again from Feb of this year (again, as far as I know), and the stats were OK yesterday.

Mixologic’s picture

Thanks @pingwin4eg - a job from the old process was still configured to run, and it overwrote the better stats. I've rerun the correct process and shut off the old one so that won't happen again. Have a look again.

pingwin4eg’s picture

Status: Active » Fixed

Yes, stats seem OK now. I also checked other popular contribs - they're OK too. Thank you!

mqanneh’s picture

Status: Fixed » Active

It happened again: the usage stats weren't updated for October 11, 2015, and all projects have 0 reported installs for the last week.

check https://www.drupal.org/project/usage/facebook_comments_block

hass’s picture

There is also a bug in the "project" module's stats. It has peaks of 10,000 and an install base of ~500.

markhalliwell’s picture

Status: Active » Fixed
hass’s picture

Status: Fixed » Active

How would this fix the project peaks that make the graph unreadable?

markhalliwell’s picture

Status: Active » Fixed

In response to #35 (which is what re-opened the issue), I linked and closed a separate issue.

@hass, the issue you brought up is also a separate issue. Create a new issue (as @Mixologic has asked people to do); that appears to be bad data for a single project and should be treated as such. It does not indicate that the entire process is broken, which would affect all projects and is what this issue has been about.

Mixologic’s picture

It is not bad data; it is a bad definition of what "usage" really means. Peaks like that are sometimes created when a lot of sites get created as part of some testing issue, or when a class is learning drupal and they create tons of sites. We do not currently have a way to separate a "real" site from a CI or testing site. Lest you think this is new, have a look at project stats on an old dev site with the old processing methodology - the spikes are there too: https://composer-drupal.redesign.devdrupal.org/project/usage/project

Someday, we will take the update stats data and turn it into a proper dataset. As it is right now, what we really have is a precomputed aggregate report with some built-in limitations - it is *only* the data that is reported to us, and a site sometimes does not request everything (I'll see the same site ask for a slightly different group of modules every five minutes, always omitting a couple - i.e. not every request always gets through).

When we have it as a full dataset, we can add better rules such as "exclude sites that have only contacted us once" or "only count sites that have been around for longer than a month". Or "how many drupal sites use *both* rules and context?" - things like that, where our reporting is currently bound to our limited definition of 'usage'.
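
(For example - hypothetical, assuming a raw per-check-in file with tab-separated site_key and date columns; the filename and threshold are made up - a rule like "exclude sites that have only contacted us once" could be a one-liner over the full dataset:)

    # Illustrative only: count check-ins per site and keep only the site
    # keys that were seen more than once.
    cut -f1 checkins.tsv | sort | uniq -c | awk '$1 > 1 { print $2 }' > real-sites.txt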

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.