The new usage stats collection is working nicely. Time for me to get it upstream.
I'll also be describing the procedure I used to transition from the old system, in case anyone else cares to implement it in their scenario. (It's not a case of flipping a switch; switching while preserving data from the "old" system is rather involved.)
Comment | File | Size | Author
---|---|---|---
#8 | new_stats.patch | 25.69 KB | bdragon
#2 | new_stats.patch | 16.39 KB | bdragon
#1 | dumpdate.sh_.txt | 358 bytes | bdragon
#1 | fetch1.sh_.txt | 677 bytes | bdragon
Comments
Comment #1
bdragon CreditAttribution: bdragon commented
Dumping data from the "old" system:
Since the new system is based around parsing log files, I implemented loading the pre-logfile data as a parser as well.
This means that the data needs to be dumped to file in a specific format. I used the following shell scripts to make these "daydump" files so I could load individual days of data into the new system.
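The attached shell scripts define the actual daydump format. Purely as a hypothetical illustration (the tab-separated layout, field order, and function names below are my assumptions, not the real format), a loader for such per-day files might look like:

```python
# Hypothetical sketch of a daydump loader. The real format is defined by the
# attached dumpdate.sh / fetch1.sh scripts; the tab-separated layout here is
# an assumption for illustration only. Gzipped dumps are handled too, since
# the new system auto-ungzips its inputs.
import gzip

WEEK_SECONDS = 7 * 24 * 3600

def week_of(timestamp):
    """Map a unix timestamp to a week bucket (how project computes week
    boundaries may differ in detail)."""
    return timestamp - (timestamp % WEEK_SECONDS)

def load_daydump(path):
    """Yield (week, site_key, project, release) tuples from one daydump file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            ts, site_key, project, release = line.rstrip("\n").split("\t")
            yield week_of(int(ts)), site_key, project, release
```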
Comment #2
bdragon CreditAttribution: bdragon commented
Finally, the main patch.
This is a significantly cleaned up version of what I've been running stats on every week.
The cleanups have been tested in the sense that I ran the cleaned-up code on util, processed a couple of days of data, and didn't get any errors. End-of-week has not been tested, but I do not expect issues.
I have removed d.o specific information from the config file.
Comment #3
bdragon CreditAttribution: bdragon commented
Running the new statistics:
A) Configure varnish (or squid, or apache, or your balancer, or whatever) to store access logs, broken down by date or some other changing name. MAKE SURE that files are renamed to something else after they are closed out -- if the new system processes a file that is appended to later, it will not notice the changes, as it only processes each specific filename once.
B) Configure usage/project-usage-config.inc. Make UPDATES_GLOB match the files stored in A. (If you prefer, you can copy the logs to another place and point to there. rsync in cron jobs, etc.)
C) Modify release/project-release-serve-history.php to skip writing to {project_usage_raw}. (Optional but recommended: the new script no longer uses {project_usage_raw}, and the table will continue to grow if you are still writing to it.)
D) Run "php project-usage-process-varnish.php" from the usage folder periodically. (It's okay to run it more often than once a week; it will not reprocess files that have already been processed.)
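The steps above rely on two rules: each filename is processed exactly once (which is why rotated files must be renamed), and within a week the newest checkin for a site_key wins. Here is an illustrative Python model of those two rules — the actual implementation is PHP backed by mongo, and names like `process_run` are invented for illustration:

```python
# Illustrative model (not the actual PHP) of the processing run:
# - a filename already recorded in processed_files is never read again,
#   so data appended after processing is invisible until the file is renamed;
# - within a week, the newest checkin for a site_key overwrites older ones.
import glob
import time

processed_files = {}   # filename -> timestamp of the run that processed it
weeks = {}             # week number -> {site_key: latest checkin record}

def process_run(pattern, parse_line):
    """One processing run: pick up any not-yet-seen files matching pattern."""
    run_ts = int(time.time())
    for path in sorted(glob.glob(pattern)):
        if path in processed_files:
            continue  # a renamed (rotated) file shows up as a new name
        with open(path) as fh:
            for line in fh:
                week, site_key, record = parse_line(line)
                bucket = weeks.setdefault(week, {})
                prev = bucket.get(site_key)
                # "Newer" checkins take precedence over older ones.
                if prev is None or record["timestamp"] >= prev["timestamp"]:
                    bucket[site_key] = record
        processed_files[path] = run_ts
```

Running this twice over the same glob is harmless, which is why the cron job can fire more often than once a week.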
Comment #4
bdragon CreditAttribution: bdragon commented
Prerequisites:
* Mongo, and PHP MongoDB support.
* Mongo must run on a server with sufficient free RAM for your workload. For updates.drupal.org, mongo currently uses somewhere around 4.3 gigabytes of RAM. Smaller sites will have a far lower requirement.
* An existing Drupal setup to bootstrap inside.
* Log data in files that are not appended to without being renamed afterwards. Gzipped files are okay; they will be automatically ungzipped during processing.
More notes about the log naming:
It is okay for the usage script to run on a file that's still being appended to; just make sure to rename it after you close it out, so the script "sees" it again under the new name.
Comment #5
bdragon CreditAttribution: bdragon commented
Mongo schema:
The database used by the update script is update-statistics. The collections used by the update script are:

processed-files
  Used to keep track of when files were processed.
  filename: Name of the file processed.
  timestamp: Timestamp of the start of the processing *run* the file was processed in. (Multiple files processed in the same run have the same timestamp.)

projects
  pid / uri cross reference. This is copied into mongo from the database at the start of a processing run.
  pid: Project ID. (example: "3060")
  uri: Project URI. (example: "drupal")

releases
  Information about releases. This is copied into mongo from the database at the start of a processing run.
  uri: Project URI.
  nid: Release NID.
  pid: Project ID.
  version: Release version string ("6.x-1.0", etc.)
  tid: The version_api_tid the release is compatible with.

terms
  tid / name cross reference. This is copied into mongo from the database at the start of a processing run.
  tid: Taxonomy TID.
  name: Core version name ("6.x", etc.)

In addition, there will be collections with numeric names. These correspond to project's concept of a "week".

nnnnnnnnnn
  Weekly storage for site state. "Newer" checkins take precedence over older ones, so the final tally is based on whatever the site last reported using before the week ended.
  site_key: The site_key being used to check in for updates.
  modules: The collection of modules in use on the site, keyed by nnnn, the project nid. Each entry holds:
    timestamp: Time last checked in.
    release: Release nid.
    core: Core version during the last checkin for this module.
  ip: IP address last seen. Used to assist in cleaning up if a site happens to go crazy and checks in with thousands of different site_keys in the same week. (Yes, this has happened.)
  core: The core version last reported. When tallying, checkins that don't match this core are not counted; it is presumed that the site in question upgraded to a new release mid-week.
Comment #6
bdragon CreditAttribution: bdragon commented
Stripping a broken site's counts out of the system, assuming the IP of the broken/malicious site did not change:
(Me having to do this in the past: http://drupal.org/node/351022#comment-2299588)
Run mongo.
For each week (the numeric collections), check whether that week is affected, and if so, delete the site data for that ip:
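A minimal sketch of that per-week deletion, here using plain dicts in place of the weekly mongo collections (against the real database this would be a remove-by-ip query run on each affected numeric collection; `strip_ip` is an invented name for illustration):

```python
# Sketch of the cleanup rule: drop every site entry whose last-seen IP
# matches the bad IP, in every affected weekly bucket. The dicts stand in
# for mongo's numeric weekly collections.
def strip_ip(weeks, bad_ip):
    """Delete site entries matching bad_ip; return how many were removed."""
    removed = 0
    for week, sites in weeks.items():
        bad_keys = [key for key, rec in sites.items() if rec.get("ip") == bad_ip]
        for key in bad_keys:
            del sites[key]
        removed += len(bad_keys)
    return removed
```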
Alternatively, you can drop the affected collections, remove the relevant rows from the processed-files collection, strip the bad data out of the logs, and reprocess them.
Actual detection of site naughtiness is not coded yet, unfortunately, so you have to do some searching around in the logs to figure out exactly what happened. (It's probably possible to do some probing of the data in mongo, but I don't have any premade queries to demonstrate here.)
Comment #7
bdragon CreditAttribution: bdragon commented
Transitioning:
The first week of using the new system is a bit weird, because the two systems work very differently.
The most seamless way to transition is to have a period where you are collecting log data and running the old system simultaneously.
* Collect logfiles for more than a week before you disable the old system.
* Back up {project_usage_week_release} and {project_usage_week_project} before firing up the new script.
* Restore any weeks that did not have complete logs from the backup.
* weekA will use the old statistics because there aren't logs to build that week.
* weekB will use the old statistics because the logs to build that week are incomplete.
* weekC will use the new statistics because we have complete logs.
After running the new script, weekB will have partial data and needs to be reloaded from the backup.
At this point, you should also use the mongo shell to drop the partial collection.
Any past weeks that have a collection in mongo are assumed to be managed by the new statistics script.
Any weeks without a collection will be left alone on the SQL side of things (excepting purging of expired data.)
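The ownership rule in the last two points can be sketched as follows (`stats_source` is a hypothetical helper name, not part of the actual scripts):

```python
# Sketch of which system is authoritative for a given week during the
# transition: a week with a mongo collection (collections are named by week
# number) belongs to the new script; weeks without one stay on the SQL side.
def stats_source(week, mongo_collections):
    """Return "mongo" or "sql" for a given week number."""
    return "mongo" if str(week) in mongo_collections else "sql"
```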
Loading SQL "daydumps" when transitioning to the new system:
* Back up mongo and your {project_usage_week_release} and {project_usage_week_project}.
* Get all your daydumps together and stick them in a folder.
* Set UPDATES_GLOB to glob them, and change UPDATES_LOADER to 'sqldump'.
* Run "php project-usage-process-varnish.php" from the usage/ folder.
Comment #8
bdragon CreditAttribution: bdragon commented
I just noticed I didn't diff all the needed files. Try #2.
Comment #9
bdragon CreditAttribution: bdragon commented
A final note for people currently using project-usage-process.php:
There is no need to switch to this new system in the near future. The old statistics system still works fine, and has fewer moving parts to break. Setting up this new system is a complex undertaking.
Comment #10
dww commented
Cool, great work here!
A) Just to confirm, nothing here breaks the simple system, right? This is just an alternative to running project-usage-process.php. Do we have a reasonable kill-switch for project-release-serve-history.php to decide if it should attempt to log data to the {project_usage_raw} table, etc?
B) I only skimmed, but the docs in this issue look great. However, they'd be much more useful to the world either as a handbook page (or set of pages) or as README text in project/usage itself. A handbook page with a link from the README would probably be the best, since then it's easier to change the instructions as needed without stale copies being out in the wild, yet people looking through the project/usage folder will at least get an initial explanation for what all the pieces are and a link to read the gory details.
I haven't done a thorough review of the code (I was AFK for 3 days and am now way behind on stuff), but I hope to go over it in the near future.
Thanks!
-Derek
Comment #11
nnewton CreditAttribution: nnewton commented
@bdragon: Is this system now fully deployed? Can we now safely start looking at varnish caching of the update calls?
-N
Comment #12
bdragon CreditAttribution: bdragon commented
nnewton: Not yet, I'm still running it from my homedir. Once it's upstream, though, it will be easier to set up a cron job -- we can merge it into drupal.org and run the cron job against the /var/webroot/drupal.org copy.
Comment #13
bdragon CreditAttribution: bdragon commented
@dww #10:
A) I was avoiding touching project-release-serve-history.php. I suppose adding in a define or something at the top would be a low-impact way of having a killswitch. variable_get() would work okay for reading from settings.php, but we don't have our "proper" variables available until later. We could read directly from {variable} I suppose. I don't want to introduce that kind of overhead unless it's absolutely necessary though.
B) Yeah, planning on copying my comments out to proper form, I was just doing a brain dump there.
Comment #14
bdragon CreditAttribution: bdragon commented
Oh, for the actual question in A):
What I DO need to do is add a mutually exclusive killswitch to both project-usage-process.php and project-usage-process-varnish.php. I'll go stick in a variable.
Comment #15
bdragon CreditAttribution: bdragon commented
Fixed (finally.)
http://drupalcode.org/project/project.git/commit/24c42efab932e9468e027a9...
Comment #16
bdragon CreditAttribution: bdragon commented
Tagging.
Comment #19
drumm