The new usage stats collection is working nicely. Time for me to get it upstream.

I'll be describing the procedure I used to transition from the old system as well, in case anyone else cares to implement it in their scenario. (It's not a case of flipping a switch; switching while preserving data from the "old" system is rather involved.)

Comment  File              Size       Author
#8       new_stats.patch   25.69 KB   bdragon
#2       new_stats.patch   16.39 KB   bdragon
#1       dumpdate.sh_.txt  358 bytes  bdragon
#1       fetch1.sh_.txt    677 bytes  bdragon

Comments

bdragon’s picture

Files: fetch1.sh_.txt (677 bytes), dumpdate.sh_.txt (358 bytes)

Dumping data from the "old" system:

Since the new system is based around parsing log files, I implemented loading the pre-logfile data as a parser as well.

This means that the data needs to be dumped to file in a specific format. I used the following shell scripts to make these "daydump" files so I could load individual days of data into the new system.
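To give the flavor of it (the real scripts are attached above as dumpdate.sh_.txt and fetch1.sh_.txt), a daydump helper boils down to selecting one UTC day's worth of rows out of {project_usage_raw}. The sketch below is an illustration only; the function name and the query shape (`SELECT *`, column order) are assumptions, not the attached script:

```shell
# Hypothetical sketch of a daydump helper; see the attached dumpdate.sh_.txt
# for the real thing. The query shape is an assumption.
daydump_sql() {
  day="$1"                                   # e.g. 2009-12-13 (a UTC date)
  start=$(date -u -d "$day 00:00:00" +%s)    # first second of that day
  end=$((start + 86400))                     # first second of the next day
  # Emit the SQL for one day of raw check-in rows; pipe this into mysql
  # and redirect the output to a "daydump" file for the sqldump parser.
  printf 'SELECT * FROM project_usage_raw WHERE timestamp >= %s AND timestamp < %s;\n' \
    "$start" "$end"
}

daydump_sql 2009-12-13
```

A second script (like the attached fetch1.sh_.txt) would then loop over the days you want to dump, calling a helper like this once per day.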

bdragon’s picture

Status: Active » Needs review
File: new_stats.patch (16.39 KB)

Finally, the main patch.

This is a significantly cleaned up version of what I've been running stats on every week.

The cleanups have been tested in the sense that I ran the cleaned-up code on util, processed a couple of days of data, and didn't get any errors. End-of-week has not been tested, but I do not expect issues.

I have removed d.o specific information from the config file.

bdragon’s picture

Running the new statistics:

A) Configure varnish (or squid, or apache, or your balancer, or whatever) to store access logs, broken down by date or some other changing name. MAKE SURE that files are renamed to something else after they are closed out -- if the new system processes a file that is appended to later, it will not notice the changes, as it only processes each specific filename once.
B) Configure usage/project-usage-config.inc. Make UPDATES_GLOB match the files stored in A. (If you prefer, you can copy the logs to another place and point to there. rsync in cron jobs, etc.)
C) Modify release/project-release-serve-history.php to skip writing to {project_usage_raw}. (Optional but recommended; the new script no longer uses {project_usage_raw}, and the table will continue to grow if you are still writing to it.)
D) Run "php project-usage-process-varnish.php" from the usage folder periodically. (It's okay to run more often than once a week, it will not reprocess files that have been processed already.)
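For step D, a crontab entry along these lines does the job. The path is an assumption for your install (point it at wherever your usage/ folder actually lives); running nightly is safe because already-processed files are skipped:

```shell
# Hypothetical crontab entry -- the install path is an assumption.
# Runs nightly at 02:15; re-running is harmless since each log file
# is only ever processed once.
15 2 * * * cd /path/to/project/usage && php project-usage-process-varnish.php
```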

bdragon’s picture

Prerequisites:

* Mongo, and PHP MongoDB support.
* Mongo must run on a server with sufficient free RAM for your workload. For updates.drupal.org, mongo currently uses somewhere around 4.3 gigabytes of RAM. Smaller sites will not have nearly as high a requirement.
* An existing Drupal setup to bootstrap inside.
* Log data in files that are not appended to without being renamed afterwards. Gzipped files are okay; they will be automatically ungzipped during processing.

More notes about the log naming:

It is okay for the usage script to run on a file that is still being appended to; just make sure to rename the file after you close it out, so that the script picks up the remaining data under the new name.
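A minimal rotation sketch, assuming the live log can simply be moved aside (a real setup may also need to tell varnish/squid to reopen its log file; that part is omitted here, and the filenames are examples):

```shell
# Hypothetical log rotation sketch; filenames are examples.
# Moving the closed-out log to a dated name gives the processor a filename
# it has never seen, so any data appended since the last run gets picked up.
rotate_log() {
  live="$1"                                  # e.g. /var/log/varnish/updates.log
  stamp=$(date -u +%Y%m%d%H%M%S)
  closed="${live%.log}-$stamp.log"
  mv "$live" "$closed"                       # new name => processed as a new file
  : > "$live"                                # start a fresh live log
  gzip "$closed"                             # gzipped logs are ungzipped automatically
}
```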

bdragon’s picture

Mongo schema:

The database used by the update script is update-statistics.

The collections used by the update script are:

processed-files

Used to keep track of when files were processed.

  • filename: Name of file processed.
  • timestamp: Timestamp of the start of the processing *run* the file was processed in. (Multiple files processed in the same run have the same timestamp.)

projects

pid / uri cross reference. This is copied into mongo from the database at the start of a processing run.

  • pid: Project ID. (example: "3060")
  • uri: Project URI. (example: "drupal")

releases

Information about releases. This is copied into mongo from the database at the start of a processing run.

  • uri: Project URI.
  • nid: Release NID.
  • pid: Project ID.
  • version: Release version string ("6.x-1.0", etc.)
  • tid: The version_api_tid the release is compatible with.

terms

tid / name cross reference. This is copied into mongo from the database at the start of a processing run.

  • tid: Taxonomy TID.
  • name: Core version name ("6.x", etc.)

In addition, there will be collections with numeric names. These correspond to the Project module's concept of a "week".

nnnnnnnnnn

Weekly storage for site state. "Newer" checkins take precedence over older ones, so the final tally is based on whatever the site was last reported as using before the week ended.

  • site_key: The site_key being used to check in for updates.
  • modules: The collection of modules in use on the site.
    • nnnn: The project nid.
      • timestamp: Time last checked in.
      • release: Release nid.
      • core: Core version during last checkin for this module.
  • ip: IP address last seen. Used to assist in cleaning up if a site happens to go crazy and check in with thousands of different site_keys in the same week. (Yes, this has happened.)
  • core: The core version last reported. When tallying stuff, checkins that don't match this core are not tallied. It is presumed that the site in question upgraded to a new release mid-week if this happens.

bdragon’s picture

Stripping a broken site's counts out of the system, assuming the IP of the broken/malicious site did not change:
(Me having to do this in the past: http://drupal.org/node/351022#comment-2299588)

Run mongo.

MongoDB shell version: 1.1.3
url: test
connecting to: test
type "help" for help
> use update-statistics
switched to db update-statistics
> show collections
1260662400
1261267200
1261872000
1262476800
1263081600
1263686400
1264291200
1264896000
processed-files
projects
releases
system.indexes
terms
>

For each week (the numeric collections), check whether that week is affected, and if so, delete the site data for that ip:

> db["1260662400"].find({ip: "10.11.222.33"}).count()
5
> db["1261267200"].find({ip: "10.11.222.33"}).count()
8005
> db["1261267200"].remove({ip: "10.11.222.33"})

Alternatively, you can drop the affected collections, remove relevant rows from the processed-files collection, strip the bad data out of the logs, and reprocess them.
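If you take the reprocessing route, the "strip the bad data out of the logs" step is just filtering the bad client out of a copy of each affected log file. A minimal sketch, assuming NCSA-style logs where the client IP is the first whitespace-delimited field (check your actual log format first):

```shell
# Hypothetical helper: write a copy of a log with one client's lines removed.
# Assumes the client IP is the first field (NCSA-style access logs).
strip_ip() {
  ip="$1"; in="$2"; out="$3"
  awk -v bad="$ip" '$1 != bad' "$in" > "$out"
}
```

After scrubbing, drop the affected weekly collections and remove the matching entries from processed-files in the mongo shell, then re-run the processor over the cleaned logs.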

Actual detection of site naughtiness is not coded yet, unfortunately, so you have to do some searching around in the logs to figure out exactly what happened. (It's probably possible to do some probing of the data in mongo, but I don't have any premade queries to demonstrate here.)

bdragon’s picture

Transitioning:

The first week of using the new system is a bit weird, because the two systems work very differently.
The most seamless way to transition is to have a period where you are collecting log data and running the old system simultaneously.
* Collect logfiles for more than a week before you disable the old system.
* Back up {project_usage_week_release} and {project_usage_week_project} before firing up the new script.
* Restore any weeks that did not have complete logs from the backup.

|-----weekA-----|-----weekB-----|-----weekC-----|
<<-----{project_usage_raw} logging---]
                              [-------- varnish logging------>>

* weekA will use the old statistics because there aren't logs to build that week.
* weekB will use the old statistics because the logs to build that week are incomplete.
* weekC will use the new statistics because we have complete logs.

After running the new script, weekB will have partial data and needs to be reloaded from the backup.
At this point, you should also use the mongo shell to drop the partial collection.

> use update-statistics
> db['1111111111'].drop()

Any past weeks that have a collection in mongo are assumed to be managed by the new statistics script.
Any weeks without a collection will be left alone on the SQL side of things (except for purging of expired data).

Loading SQL "daydumps" when transitioning to the new system:

* Back up mongo and your {project_usage_week_release} and {project_usage_week_project}.
* Get all your daydumps together and stick them in a folder.
* Set UPDATES_GLOB to glob them, and change UPDATES_LOADER to 'sqldump'.
* Run php project-usage-process-varnish.php from the usage/ folder.

bdragon’s picture

File: new_stats.patch (25.69 KB)

I just noticed I didn't diff all the needed files. Try #2.

bdragon’s picture

A final note for people currently using project-usage-process.php:

There is no need to switch to this new system in the near future. The old statistics system still works fine, and has fewer moving parts to break. Setting up this new system is a complex undertaking.

dww’s picture

Status: Needs review » Needs work

Cool, great work here!

A) Just to confirm, nothing here breaks the simple system, right? This is just an alternative to running project-usage-process.php. Do we have a reasonable kill-switch for project-release-serve-history.php to decide if it should attempt to log data to the {project_usage_raw} table, etc?

B) I only skimmed, but the docs in this issue look great. However, they'd be much more useful to the world either as a handbook page (or set of pages) or as README text in project/usage itself. A handbook page with a link from the README would probably be the best, since then it's easier to change the instructions as needed without stale copies being out in the wild, yet people looking through the project/usage folder will at least get an initial explanation for what all the pieces are and a link to read the gory details.

I haven't done a thorough review of the code (I was AFK for 3 days and am now way behind on stuff), but I hope to go over it in the near future.

Thanks!
-Derek

nnewton’s picture

@bdragon: Is this system now fully deployed? Can we now safely start looking at varnish caching of the update calls?

-N

bdragon’s picture

nnewton: Not yet, I'm still running it from my homedir. Once it's upstream though, it will be easier to set up a cron job -- can merge it into drupal.org and run the cron job against the /var/webroot/drupal.org copy.

bdragon’s picture

@dww #10:

A) I was avoiding touching project-release-serve-history.php. I suppose adding in a define or something at the top would be a low-impact way of having a killswitch. variable_get() would work okay for reading from settings.php, but we don't have our "proper" variables available until later. We could read directly from {variable} I suppose. I don't want to introduce that kind of overhead unless it's absolutely necessary though.

B) Yeah, planning on copying my comments out to proper form, I was just doing a brain dump there.

bdragon’s picture

Oh, for the actual question in A):

What I DO need to do is add a mutually exclusive killswitch to both project-usage-process.php and project-usage-process-varnish.php. I'll go stick in a variable.

bdragon’s picture

bdragon’s picture

Issue tags: +needs drupal.org deployment

Tagging.

Status: Fixed » Closed (fixed)
Issue tags: -needs drupal.org deployment

Automatically closed -- issue fixed for 2 weeks with no activity.

drumm’s picture

Issue summary: View changes
Issue tags: -needs drupal.org deployment