_registry_parse_file() and the {registry} table do not account at all for the fact that a class definition might move between files when a module is updated.

If you move a class definition from one file to another, and the file with the new definition is processed before the file that previously held it, you get an SQL error that breaks the site:

PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'Example-class' for key 'PRIMARY': INSERT INTO {registry} (name, type, filename, module, weight) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4); Array ( [:db_insert_placeholder_0] => Example [:db_insert_placeholder_1] => class [:db_insert_placeholder_2] => sites/all/modules/example/example.module [:db_insert_placeholder_3] => example [:db_insert_placeholder_4] => 0 ) in _registry_parse_file() (line 180 of includes/registry.inc).
Comment / File / Size / Author:
#49: stale.filetimes.kill_.kittens.patch, 1.61 KB, Anonymous (not verified)
#48: stale.filetimes.kill_.kittens.patch, 1.93 KB, Anonymous (not verified)
#18: registry.transaction.patch, 6.28 KB, Anonymous (not verified)
#9: registry.transaction.patch, 6.28 KB, Anonymous (not verified)
#3: 797024-registry-rebuild-merge-3.patch, 1.17 KB, pwolanin
#2: 797024-registry-rebuild-ordering-2.patch, 1.71 KB, pwolanin

Comments

pwolanin’s picture

One possible approach is to alter the {registry} primary key so that the filename is a third element.

pwolanin’s picture

Status: Active » Needs work
FileSize
1.71 KB

hmm, this patch was my first attempt, but it doesn't seem to fully resolve the problem. In particular, it will not resolve it if the class moves between two existing files.

pwolanin’s picture

Status: Needs work » Needs review
FileSize
1.17 KB

Changing the insert query to a merge query resolves the problem for me. Obviously that's less efficient, but the name and type should be unique, so adding filename to the Primary key seems like a bad hack.
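The idea can be sketched with Drupal 7's query builder (a minimal sketch, not the attached patch; the variable names assume the scope of _registry_parse_file()):

```php
// Sketch, not the committed patch: upsert the registry entry keyed on
// (name, type), so re-parsing a class that moved to a new file updates
// its filename instead of raising a duplicate-key error on the primary
// key.
db_merge('registry')
  ->key(array(
    'name' => $name,
    'type' => $type,
  ))
  ->fields(array(
    'filename' => $filename,
    'module' => $module,
    'weight' => $weight,
  ))
  ->execute();
```

db_merge() issues an update if the keyed row exists and an insert otherwise, which is exactly the semantics needed here at the cost of an extra lookup per entry.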

Crell’s picture

Subscribing. No time to look at this right now, but I will try to soon.

Anonymous’s picture

hmm, ouch. i think the cleanest way to do it is to clear the resources of all to-be-parsed files before adding any new ones. no way we could contemplate that without wrapping this whole thing in a transaction, though.

pwolanin’s picture

@justinrandell - that might work too, but given the way the code is structured it seemed safer to do the merge query. Given that not all DBs will actually do transactions, can we rely on that to protect us against an empty table when we need the data?

Anonymous’s picture

pwolanin: without transactions, this whole process is open to creating WSODs on a run that doesn't complete. there's kind of a nod to that in the code comment "We update the filectime/filemtime after we've saved the files resources rather than here, so if we don't make it through this rebuild, the next run will reparse the file." it's a bigger window if we delete them all at once, but it's the same window.

its also rare for large numbers of files to change between runs, so the window is not that much bigger.

either way, there's no good reason not to wrap this in a transaction, but that's scope creep.

pwolanin’s picture

Sure, we could wrap the above patch with a transaction too.

Anonymous’s picture

FileSize
6.28 KB

patch attached that wraps a registry rebuild in a transaction, and deletes all resources for to-be-parsed files upfront before adding any new resources.
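The pattern described here can be sketched roughly as follows (a hypothetical simplification of _registry_parse_files(); names and structure are illustrative, not the patch itself):

```php
// Rough sketch of the rebuild pattern: wrap the whole rebuild in a
// transaction, delete stale resources for every changed file up front,
// then re-parse. A Drupal 7 transaction commits when $transaction goes
// out of scope and rolls back on ->rollback().
$transaction = db_transaction();
try {
  // Remove all registry entries for files that will be re-parsed,
  // before any new rows are inserted. This is what lets a class move
  // between two existing files without a duplicate-key collision.
  db_delete('registry')
    ->condition('filename', array_keys($files_to_parse), 'IN')
    ->execute();

  foreach ($files_to_parse as $filename => $file) {
    _registry_parse_file($filename, file_get_contents($filename), $file->module, $file->weight);
  }
}
catch (Exception $e) {
  $transaction->rollback();
  throw $e;
}
```

On a transactional engine, a failed run rolls back to the previous registry state; on MyISAM the delete-then-insert window remains, as discussed above.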

very lightly tested: only verified that a clean install works and that the registry tests pass. let's see what the bot thinks.

i guess we need a test for the "move a class from one file to another case".

Status: Needs review » Needs work

The last submitted patch, registry.transaction.patch, failed testing.

pwolanin’s picture

That's going to be a rather tricky test to write.

Anonymous’s picture

Status: Needs work » Needs review

#9: registry.transaction.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, registry.transaction.patch, failed testing.

pwolanin’s picture

There is only one fail, but the patch looks like it changes a lot more than really needed. Why not just add the transaction around each file?

Anonymous’s picture

@pwolanin: because we want it all to work or none of it to work? there's very little that has changed with the patch, it's mainly just whitespace.

that failing patch smells of some nasty coupled code somewhere. i'll have a look today.

Crell’s picture

While registry rebuild is a great example of where we want transactions, it's not per-file. It's per-rebuild.

The problem here is:

Before:
foo.php: class A {}
bar.php: nothing

After:
foo.php: nothing
bar.php: class A {}

If the registry rebuild hits foo.php first, then there is a brief period where class A exists nowhere, which is OK. If it hits bar.php first, then it tries to insert class A in bar.php and chokes on a duplicate key error.

Possible solutions include:

1) Wrap the entire rebuild in a transaction: This wouldn't protect us against the above breaking stuff, but the breakage wouldn't be as bad because we're not left with a half-built registry table. So net win, but only part of the solution.

2) Wrap each file in a transaction. This doesn't help us here unless one file contains the same name twice, which would break PHP anyway. What it would do if we parsed bar.php first is roll back bar.php's changes entirely, then process foo.php (unless we told it to stop) and clear out class A, and then on the next rebuild it would find the new location for A. But between those rebuilds (which are two user-triggered actions) Drupal won't know where A is, and that's bad.

3) Use a merge query instead of a multi-insert. This is actually what we did originally when we were still tracking functions. It's a bit slower, but is not going to fatal out on us. The side effect is that if a class really is defined in more than one location, whichever one parses second will win. That makes "define it twice and include just one of 'em" break, but that's OK since that's a poor pattern anyway, especially for classes. That's not supported in the first place. We will still need to delete all entries in a file before parsing so that we clear out removals.

So we want to do #1 and #3, I'd say.

Now, I should go look at the patch... :-)

Anonymous’s picture

@Crell - the example you raise in #16 is why i think the cleanest solution is to delete all resources for files that have changed upfront, before doing any inserts, and after starting a transaction. a failed rebuild can still lead to WSODs, but at a known state, giving some chance to roll back to the previous code and try again.

using inserts instead of merge does enforce "you can only have that class defined once" from 3. above. bug or feature?

Anonymous’s picture

Status: Needs work » Needs review
FileSize
6.28 KB

attaching patch again, i can't reproduce the fail locally. maybe the testbot is still running myisam? if so, maybe that's the issue? do we have a case here of transactions for the win, but we're still hobbled by myisam? sad panda :-(

the patch still uses the "delete all then insert" pattern.

i'm not sure we can support the "allow a class to be declared twice, then pick one at runtime" pattern and use autoloading. this is not a registry limitation, but a php one. to support that, we'd have to allow for a global flag to be set so that drupal_autoload_class() could check it and choose the right class. that is just evil and ugly. i'd rather we blow up early than silently load an arbitrary version of the class.

Status: Needs review » Needs work

The last submitted patch, registry.transaction.patch, failed testing.

catch’s picture

I'm pretty sure the test bot runs myisam for performance reasons.

Anonymous’s picture

sad panda. i just hacked includes/database/mysql/schema.inc to create myisam tables, and i still can't reproduce the flood fail.

pwolanin’s picture

no - as best I know, the testbot should be running on InnoDB, but that may vary by test instance.

catch’s picture

hmm, I double checked and according to http://qa.drupal.org/performance-tuning-tips-for-D7 it is indeed innoDB.

Anonymous’s picture

allllrighty. can anyone else confirm that the flood test passes on their local setup? if so, then we need to investigate the bots. if not, then we need to look at the difference between my setup (stock ubuntu lucid) and whoever can reliably get the test to fail.

Berdir’s picture

Just wanted to note that I can't reproduce the test fail either on my local system.

mikeryan’s picture

Status: Needs work » Needs review

#18: registry.transaction.patch queued for re-testing.

mikeryan’s picture

The test passes in my environment.

Status: Needs review » Needs work

The last submitted patch, registry.transaction.patch, failed testing.

Anonymous’s picture

i've pinged boombatower about this. we'll need to figure out why the bot is failing when three different developers can't reproduce the fail.

Berdir’s picture

Status: Needs work » Needs review

#18: registry.transaction.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, registry.transaction.patch, failed testing.

Damien Tournoud’s picture

Status: Needs work » Reviewed & tested by the community
YesCT’s picture

if it gets committed like this... won't all patches start failing the bot testing?

catch’s picture

Last patch now passes the test bot.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

It looks like all this really does is wrap the existing code in a transaction, which seems reasonable. This won't help those on MyISAM, of course, but at least we un-screw part of our userbase. :P It'd be nice to have test coverage for this, but I can't really envision a way for that to happen.

I'm a bit nervous about whether that test passage was a fluke or whether the previous fails were flukes. I guess if testbot starts crapping out after this, we'll have a good place to start on where to point the finger of BLAAAAAME. ;)

Committed to HEAD!

pwolanin’s picture

Status: Fixed » Needs work

ugh, wait.

I clearly did not look carefully enough at the last patch. If ALL it does is wrap the current code in a transaction, then it does not solve the bug. Every attempt to rebuild this will cause an exception and rollback, right?

Per #16, to solve the bug we MUST use a merge query like my original patch did.

pwolanin’s picture

also, this looks like a bug to me:

      $modified_file = !isset($file['filectime']) || !isset($file['filemtime'])
                  || $filectime != $file['filectime'] || $filemtime != $file['filemtime'];

It does not seem to account for the fact that the rebuild may be run on different webnodes, or that a DB dump may be pulled from a different server and run locally. Looks like a good use for hash functions instead.
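A hash-based check might look something like this (a hypothetical sketch, not the eventual patch from the follow-up issue; the `filehash` column is an assumed replacement for the stored filectime/filemtime):

```php
// Sketch: detect changed files by content hash instead of ctime/mtime,
// so the comparison still works after moving a DB dump between servers
// or running the rebuild on a different webnode.
// Assumes a hypothetical 'filehash' value stored per file.
$filehash = hash_file('sha256', $filename);
$modified_file = !isset($file['filehash']) || $filehash != $file['filehash'];
```

Hashing costs a full read of each file per rebuild, but it is immune to clock skew and filesystem-metadata differences across machines.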

webchick’s picture

There is a merge query hunk at the bottom, but I can't parse the earlier patches well enough to tell if it's equivalent.

Anonymous’s picture

Status: Needs work » Fixed

discussed this with pwolanin in #drupal, now marking as fixed.

we delete all resources for files we are going to parse, so we should be good.

pwolanin’s picture

Yes, I think this is ok as long as we assume we have reasonable transactions.

new issue for #37: #819388: Switch registry back to using hashes not file mtime and ctime

David_Rothstein’s picture

Status: Fixed » Needs work

This patch led to a massive increase in the amount of time it takes to install Drupal (from ~8 seconds to ~22 seconds on my machine). Granted, installations only happen once, so their performance is not the most important thing in the world, but I recall a fair amount of work a few months ago in the D7 dev cycle to shave some seconds off the installer so people didn't have to stare at that progress bar forever, and now we just doubled or even tripled the entire installation time :(

@pwolanin's earlier patch in this issue does not have a noticeable effect on the time it takes to install Drupal.

Also, the flood test failure seems to be happening out in the wild. It happened to a patch of mine earlier today even though that patch touched nothing but JavaScript (#819724: Improve tabledrag accessibility for screen-reader users with WAI-ARIA).

rfay’s picture

This massive increase of time has two other very serious effects:

1. It uses far more resources in our testing environment. Remember that we do hundreds or thousands of installs in each D7 test suite. I note that we've doubled the time it takes to run the test suite. Not so good.

2. The important work in #686196: Meta issue: Install failures on various hosting environments aims to make Drupal install before timeouts on run-of-the-mill shared hosting. This would be a step back for us.

catch’s picture

I'm surprised that just adding a transaction would take so much time.

Didn't profile, but ran the installer via drush as 'time drush -v si --db-url:d7 -y' with HEAD and with this patch reverted:

HEAD:


real	1m19.336s
user	0m47.867s
sys	0m15.121s

Patch reverted:


real	0m44.215s
user	0m26.682s
sys	0m10.165s

About what rfay reports.

That almost certainly means a regression on #686196: Meta issue: Install failures on various hosting environments then.

Crell’s picture

My gut intuition is that it's not the transaction that slows it down as much as the changes to the DB queries that result, although I've not tested to confirm that. But normally transactions should speed things up if anything since the DB doesn't need to commit to disk as many times.

Anonymous’s picture

investigationing.

Anonymous’s picture

ouch. can anyone reporting speeds also give info on their setup? chatting to chx in #drupal, he's using skip-innodb, so that's important to know. i'm using innodb on ubuntu lucid.

with patch reverted:

4.15user 0.35system 1:12.09elapsed 6%CPU (0avgtext+0avgdata 202448maxresident)k
64inputs+0outputs (1major+22878minor)pagefaults 0swaps
3.98user 0.30system 1:16.01elapsed 5%CPU (0avgtext+0avgdata 202432maxresident)k
0inputs+0outputs (0major+22892minor)pagefaults 0swaps

with patch:

10.60user 0.50system 1:16.04elapsed 14%CPU (0avgtext+0avgdata 203232maxresident)k
0inputs+0outputs (0major+23215minor)pagefaults 0swaps
10.26user 0.49system 1:16.44elapsed 14%CPU (0avgtext+0avgdata 203232maxresident)k
0inputs+0outputs (0major+23211minor)pagefaults 0swaps

Anonymous’s picture

hmm. on innodb, it's even slower without the transaction but with the same code reorganisation.

in other words, take HEAD and comment out the line that starts the transaction, and the try catch block.

14.31user 0.59system 3:33.73elapsed 6%CPU (0avgtext+0avgdata 203232maxresident)k
0inputs+0outputs (0major+23004minor)pagefaults 0swaps

Anonymous’s picture

Status: Needs work » Needs review
FileSize
1.93 KB

oh, dear. bit embarrassing how obvious this was.

the second loop in _registry_parse_files() was using stale $filectime and $filemtime values. so, umm, yeah, we were reparsing every file bar one, every run through.

attached patch keeps a list, and we're back to normal install times.

4.41user 0.23system 1:14.69elapsed 6%CPU (0avgtext+0avgdata 202336maxresident)k
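The shape of the fix is roughly this (a hypothetical sketch of the two-loop structure in _registry_parse_files(); variable names are illustrative, not the attached patch):

```php
// Sketch of the bug and fix: the first loop decides which files changed;
// the second loop must use each file's own recorded timestamps, not the
// stale $filectime/$filemtime left over from the last loop iteration.
$parsed_files = array();
foreach ($files as $filename => $file) {
  $filectime = filectime($filename);
  $filemtime = filemtime($filename);
  if (!isset($file['filectime']) || !isset($file['filemtime'])
      || $filectime != $file['filectime'] || $filemtime != $file['filemtime']) {
    // Remember this file's fresh timestamps for the second loop.
    $parsed_files[$filename] = array(
      'filectime' => $filectime,
      'filemtime' => $filemtime,
    );
    // ... queue $filename for re-parsing ...
  }
}
// After parsing succeeds, save each file's own times, so unchanged
// files are not reparsed on the next rebuild.
foreach ($parsed_files as $filename => $times) {
  // e.g. db_merge('registry_file')->key(array('filename' => $filename))
  //   ->fields($times)->execute();
}
```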

Anonymous’s picture

without the debug.

chx’s picture

Status: Needs review » Reviewed & tested by the community

Before patch: real 0m12.535s. After patch: real 0m4.164s.

catch’s picture

Title: Moving a class between files can cause SQL errors on registry rebuild » Moving a class between files can cause SQL errors on registry rebuild (TESTING BROKEN)

Extra extra critical, patch looks fine.

pwolanin’s picture

ok, well let's get this in and then deal with the other issue of using file hashes instead of times

webchick’s picture

Title: Moving a class between files can cause SQL errors on registry rebuild (TESTING BROKEN) » Moving a class between files can cause SQL errors on registry rebuild
Status: Reviewed & tested by the community » Fixed

Oh, dear lord! :\ Thanks for the sleuthing, folks.

Committed to HEAD!

rfay’s picture

Yay! PIFR test time on client #32 is down to 15 minutes again from 40 minutes.

David_Rothstein’s picture

Hehe, awesome bugfix :) Does this fix the sporadic flood API test failure also? Well, if not, I guess we'll know about it soon.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.