taxonomy_update_7005() can go on an infinite search [#930708]

As I mentioned at #438224-79: "Post comments without approval" permission name is completely misleading, I had a fairly small D6 test site DB that I was trying to upgrade to the latest D7 HEAD. update.php enters what would have been an infinite quest during taxonomy_update_7005(). After a few minutes, I looked at {batch} and noticed these fun values:

total = 364, count = 348, last = 121437000

The thing that triggers this is that my {term_node} table had 16 records pointing to tid 0. The reason for this is that I had run devel_generate on the site and I had 16 forum posts in the '0' forum. Granted, that's not exactly a common case, but it's certainly possible people could have somewhat bogus input like this, and we can't have update.php hang indefinitely in this case.

The bug appears to have been introduced in #706842: Improve comments for the taxonomy upgrade path (although I'm not sure of that, haven't done a more thorough cvs archeological dig). Basically, we do this:

    $query = db_select('taxonomy_term_node', 't');
    $sandbox['total'] = $query->countQuery()->execute()->fetchField();

to initialize the count by just counting the total records in {taxonomy_term_node}.

However, our logic query is doing something more restrictive:

    $query = 'SELECT td.vid AS vocab_id, td.tid, tn.nid, tn.vid, n.type, n2.created, n2.sticky, n2.nid AS is_current FROM {taxonomy_term_data} td INNER JOIN {taxonomy_term_node} tn ON td.tid = tn.tid INNER JOIN {node} n ON tn.nid = n.nid LEFT JOIN {node} n2 ON tn.vid = n2.vid ORDER BY tn.vid, td.weight ASC, tn.tid';

It's extremely dangerous to use a different count query than what we're actually processing. So, that's A.

B) We should probably notice if we've hit an insane state like "total = 364, count = 348, last = 121437000" and bail eventually. Granted, the last is a node revision, while the count is the total # of nodes, and we can't make too many assumptions here since revisions can be sparse, etc. But, we could consider doing *something* to abort in case of trouble.

C) We should consider purging {taxonomy_term_node} of entries that point to non-existant {taxonomy_term_data} records before starting this operation.

I know this doesn't effect everyone, since you need some slightly bogus data in the first place to trigger it, but the fix is relatively easy, and I don't think it's wise to ship a beta with such a fragile DB update that can send update.php into fatal death. Hence, critical. (catch told me to in IRC, so he can share the blame if people want to come after me for this classification). ;)

Comment	File	Size	Author
#11	930708-11.infinite.patch	3.15 KB	dww
#11
#7	930708-7.infinite.patch	1.93 KB	catch
#7
#3	930708-3.infinite.patch	1.68 KB	dww
#3
#1	infinite.patch	966 bytes	catch
#1

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

catch

he/him

English

CreditAttribution: catch commented 4 October 2010 at 03:00

Status:

Active

» Needs review

File	Size
infinite.patch	966 bytes

Here's an untested patch for the count query, if we fix that I'm not sure we should ever get to (B).

I'm about to go out, neutral about (c) - if it's a cheap-ish query I think we should probably do it though. Although might we also want to do that for records that don't have equivalents in {node_revision} too?

The reason I think this should be critical is because we have no idea how many sites might run into this, and because the symptom is the update 'hangs', which means bug reports about it are going to be very obscure and hard to track down.

Comment #2

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 06:03

Status:

Needs review

» Needs work

Yup, that works on my faulty data set. I agree that if we have a reliable solution to A that we don't need B (which is a can of worms), and C is maybe more trouble than it's worth.

To appease the testing culture, I'm going to work on fixing one of the test data sets to trigger this bug, although I'm *not* going to post a patch with the test and not the patch to taxonomy.install, since that'd hang a test bot. ;) Stay tuned...

Comment #3

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 06:18

Status:

Needs work

» Needs review

File	Size
930708-3.infinite.patch	1.68 KB

I didn't even have to touch any tests. I just added 1 additional record in {term_node} inside drupal-6.filled.database.php. That's enough to cause the "Taxonomy upgrade path" test to go into the same infinite search (further proof that this is a critical upgrade path bug, IMHO). Once you kill that off and clean the test environment, if you apply the patch to taxonomy.install, the test completes normally and all assertions pass. Again, for obvious reasons, I'm not going to post patches that attempt to demo this automatically. ;) I'd RTBC but it's my patch, so I'll let catch verify my claims and press the green button.

And again, this fix for taxonomy.install also worked on the D6 test site data that prompted this entire investigation...

And, I revoke my proposals for B and C. A is sufficient.

Comment #4

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 06:30

Assuming this is going to be one of the last issues fixed before beta1, here's an utterly shameless plug for a few other RTBC'ed issues that would make me really happy to see before beta, since they're breaking aspects of the Update manager and I've spent most of the weekend trying to get that part of core knocked into shape:

#929756: Update manager UI uses stale project data
#930122: Regression: temp directory handling broken by confusion between file_directory_temp and file_temporary_path

Less urgent, but still nice since it'll somewhat prevent major UX WTF:

#612546: Remove the broken FileTransferFTPWrapper (file stream) class

At least I'm only indirectly abusing the critical queue... I didn't actually mark any of those critical. ;)

/me ducks...

Comment #5

catch

he/him

English

CreditAttribution: catch commented 4 October 2010 at 07:12

This looks fine to me, I can't run tests locally until much later this evening, and the test bot is stalled, apart from that, RTBC.

Comment #6

Damien Tournoud CreditAttribution: Damien Tournoud commented 4 October 2010 at 07:35

Thanks for looking into this. I saw that happening too (I think it was on the d.o database).

Could we:

(1) have some sort of comment explaining the join in the count query?
(2) separate the test for the standard case from the test for the faulty case?

Comment #7

catch

he/him

English

CreditAttribution: catch commented 4 October 2010 at 07:57

File	Size
930708-7.infinite.patch	1.93 KB

#1 added.
#2. I don't see why - the test changes here are three lines to the test data, all we need is one bad record in amongst correct data and the test to still complete. Adding a new test would require another dump and a new test case, which seems overkill for what is otherwise a 1.6k patch.

Comment #8

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 08:00

@DamZ:

Re: 1) The count query now exactly matches the business logic query, as it should be. I don't know why we need a comment to explain this.

Re: 2) You want a completely duplicate taxonomy upgrade test and test DB for a single record that's different? That seems like overkill and bloat to me. We're doing functional testing of the upgrade path. Our sample DB now has some hostile data, just like (apparently d.o) and the test site I was using that led me to find this bug in the first place. That way, when this test runs on the data, we know everything is working, even for other sites with the unfortunate starting conditions.

Therefore, -1 to both requests in #6. #3 is still RTBC by my eyes.

Comment #9

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 08:01

Status:

Needs review

» Reviewed & tested by the community

Okay, sure, that comment is fine. #7 (not my patch) really is RTBC. ;)

Comment #10

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 08:17

Status:

Reviewed & tested by the community

» Needs work

Bah, in IRC DamZ and catch and I made a compromise. I'll add another assertion to the existing test to make sure there's no tid 0 associated with nid 1 after the upgrade... Stay tuned.

Comment #11

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 08:34

Assigned:	Unassigned	» dww
Status:	Needs work	» Needs review

File	Size
930708-11.infinite.patch	3.15 KB

Comment #12

juan_g CreditAttribution: juan_g commented 4 October 2010 at 08:39

Assigned:	dww	» Unassigned
Status:	Needs review	» Needs work

About today's comment #4, of those three issues, two are RTBC, and one just fixed. That's speed. :)

Comment #13

juan_g CreditAttribution: juan_g commented 4 October 2010 at 08:42

Status:

Needs work

» Needs review

Sorry, cross-post. I can't restore dww's assignment.

Comment #14

babbage CreditAttribution: babbage commented 4 October 2010 at 08:42

Doesn't look like that status change was intentional.

Edit: beaten to the fix.

Comment #15

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 08:45

Actually, screw it. I need to sleep. This extra assertion seems stupid. catch tells me the test in #11 isn't valid since "term reference won't try to link to a non existing term". I can't just inspect $node->taxonomy anymore, can I? I couldn't get anything useful trying to debug() or $this->verbose(var_export($node)) inside this test, so it's hard for me to see what I can actually assert in here. Adding an assertion that taxonomy_term_load(0) returns FALSE right here also seems stupid.

I'm still in favor of just committing #7. If someone wants to go crazy with more tests and assertions later, so be it, but that shouldn't block fixing this critical nor the release of a beta.

The critical bug here is that faulty data sends update.php into an infinite loop. This patch a) fixes that bug and b) adds faulty data to our upgrade test to ensure that the upgrades complete with the faulty data. The existing tests already assert that the upgraded data matches what we expect from the known-good input.

Comment #16

dww

we/he/they

CreditAttribution: dww commented 4 October 2010 at 08:49

Status:

Needs review

» Reviewed & tested by the community

(I was trying to unassign myself, so no harm in the x-post). However, I'm just going to be bold and say #7 is RTBC. Bedtime, folks.

p.s. @juan_g: all 3 were RTBC when I wrote that comment. But yes, hurray that at least 1 is already fixed. ;) Let's hope the others meet a similar fate while I sleep...

Comment #18

tstoeckler

he/him

German

Essen, Germany

CreditAttribution: tstoeckler commented 4 October 2010 at 12:45

+1 RTBC.
Didn't try the patch myself, but the test passes.
I verified that the query is identical to the one below in taxonomy_update_7005(). As you can see on the API page, the JOINs are documented extensively on the query below. Therefore, I think the comment in #7 is sufficient.

Comment #19

webchick

she/they

English

Vancouver 🇨🇦

CreditAttribution: webchick commented 4 October 2010 at 14:51

Status:

Reviewed & tested by the community

» Fixed

I agree with #7 being RTBC. If at some point we decide on a policy which separates "edge case" tests from "mainline" tests, then I think Damien's suggestion makes sense. In D7 though, we have a mish-mash and so keeping with that here is fine. I actually appreciate the bunk data being part of the "mainline" database in this case, because it means we'll continue to test to ensure this condition doesn't bite people going forward.

Committed #7 to HEAD.