Doing some work on a site that uses Taxonomy Defaults (http://drupal.org/project/taxonomy_defaults) made me realize something annoying.

Taxonomy in D6 and earlier was kind of flexible regarding vocabs and content types.
The fact that a vocab was not enabled on a node type just meant that the node edit form didn't display the select input for the vocab, but terms from the vocab could still be added programmatically. Taxonomy module would look no further and save them in {term_node}, load them on node_load() and display them on node_view().

In D7, a node type (a bundle) either has an instance of a taxo field, or it doesn't. If it doesn't, there's no saving or loading data from that vocab, whatever you stick in the db tables.

I guess these kind of 'hidden' taxo fields can possibly be done in D7 using some hook_field_access() tricks.

Bigger problem is that our upgrade path only imports term/node associations if the vocab is officially enabled for the node type, and ditches the rest (db_drop_table('taxonomy_term_node');). The existence of Taxonomy Defaults module ensures us there are sites out there that with nodes having terms in vocabs that are not actually enabled for their node type.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

yched’s picture

Well, the current upgrade path happens to be broken anyway.

- taxonomy_update_7005() has a query with a ':vocabulary_id' => $vocabulary->vid placeholder, while $vocabulary is undefined. I guess at some point the code in there got factored out of taxonomy_update_7004(), which iterates on vocabularies to create the fields and instances.

- taxonomy_update_7005() also fails to set $sandbox['#finished'], which actually makes it single step, batch API assumes it's done after the first iteration - which is probably why the above didn't raise any warnings so far. The function currently is a no-op (doesn't even wipe the taxonomy_term_node table :-p)

yched’s picture

Title: "taxo as field" update path wipes some node / term associations » "taxo as field" update broken + wipes some node/term associations

Massage title a bit.

+ for clarity :
Comment #1 is "taxo as field upgrade path is broken".
OP is "even if it successfully did what it intended to do, it would ditch some user data".

catch’s picture

OK so we should fix #1 first.

For #2 that's going to be tricky. We need to get all the vocabularies, then get the node type of every node attached a term in each vocabulary - can be done, but it's a big DISTINCT query joining on three tables so not nice. Then we could just use that to create the fields and instances instead of $vocabulary->node_types

joachim’s picture

Why do you need the type of nodes? Surely you only need to know which types a vocabulary has been assigned to, irrespective of whether any nodes actually have terms from it.

If my D6 site had vocab Foo set to story and page types, and because I'd not tagged any pages yet I didn't get that vocab applied to pages after upgrade, I would consider the upgrade to have lost something of my site's data.

See #366364: [meta] Data migration from D6 contrib CCK fields to D7 Field API for code that does the field creation for vocabs based on the existing data.

catch’s picture

You need the actual node types because various contrib modules allow you to assign terms to nodes without the node type being associated with the vocabulary. Also removing a vocabulary from a node type doesn't remove term_node records either. Possibly we need to check both the empty and unassigned vocabulary cases though.

bradweikel’s picture

Are there other modules besides Taxonomy Defaults that do this? If not, maybe this issue belongs in the Taxonomy Defaults queue. I adopted that module late in its life-cycle and discovered its "hidden vocabulary" approach to be a nasty hack that broke compatibility with many other taxonomy modules, and I was planning to refactor it before fields-in-core happened. (And the latest 6.x-dev version of Taxonomy Defaults might actually pave the way for a clean D7 taxonomy upgrade already -- not sure, though, because I haven't looked at it in months).

douggreen’s picture

As part of DrupalCon SF, I promised to work a D7 critical issue to completion, I asked catch for one, and this is it.... so subscribing, I'll be back later to say more ;)

jpmckinney’s picture

@#6: Even if no modules besides Taxonomy Defaults do this, you can be sure many people have been doing it in import scripts. You don't need a module to programmatically add a term to a node, even if that node's type is not associated to that term's vocabulary. That's just part of the D6 API. It's not part of the D7 API, but we shouldn't leave data behind just for that reason.

hefox’s picture

(Bit late, but Content taxonomy does not enforce vocabulary node types settings; http://drupal.org/project/usage/content_taxonomy 40k.)

sime’s picture

Content taxonomy (CT for convenience) should be ok. In any configuration it always has it's tids in cck fields. This means the upgrade path is separate. CT does also have an option to keep tids *additionally* in term_data table. We're going to end up creating a field for these, not really knowing where these orphaned tids are coming from but, to be clear, CT will get it's data from the cck fields.

catch’s picture

Issue tags: +D7 upgrade path

tagging.

yoroy’s picture

- taxonomy_update_7005() has a query with a ':vocabulary_id' => $vocabulary->vid placeholder, while $vocabulary is undefined. I guess at some point the code in there got factored out of taxonomy_update_7004(), which iterates on vocabularies to create the fields and instances.

douggreen, any news on this? :)

AaronBauman’s picture

This patch addresses the issues in comment #1, taxonomy_update_7005 is a no-op, by implementing the batch logic that was overlooked when update 7004 was re-factored.

  • Collect all vocabularies in initial batch step and dump into sandbox.
  • When terms for one vocabulary are exhausted, remove vocabulary from sandbox.
  • Update $sandbox['#finished'] on each pass, setting to TRUE when done.
AaronBauman’s picture

Status: Active » Needs review

This is "needs review" until comment #1 is addressed, when it should go back to "needs work" or "active" to address the OP

marcingy’s picture

#13: 706842-taxonomy_update_7005.patch queued for re-testing.

Anonymous’s picture

I've made a DB for D6 to work on this issue. It has 51000 terms because I'm also hearing reports of WSODs on upgrades with deep vocabularies. All the terms and nodes are generated by devel_generate. http://drop.io/bangpound/asset/drupal6-mysql-bz2 (Feel free to use or suggest things I should add to make it useful for other issues.)

To test the upgrade path, use the DB as is.

To test the upgrade path with the problem reported in this issue (terms applied to nodes whose vocabulary doesn't include the node type), delete a row or two from the vocabulary_node_types table before running upgrade.php.

Now it's time for me to dig into this issue.

Damien Tournoud’s picture

Here is what it actually takes to fix this upgrade path.

The test database is only random content generated with Devel generate for now, chx is working on a way to make this repeatable. Just skip the test database when reviewing the actual patch.

Status: Needs review » Needs work

The last submitted patch, 706842-taxonomy-update-broken.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
chx’s picture

FileSize
3.91 KB

Here is a d6 generate script. Still needs to generate comments and poll. But can be used to test this issue just right. Creates 6 users, 24 vocabularies, 48 terms, 24 nodes, 36 node revisions and 1656 rows in term_node.

Edit: the node_revisions table uid is all 0. I will debug why later. Until then, before dump update node_revisions nr set uid = 3+(select max(uid) from node n where nr.nid=n.nid);

chx’s picture

This is the generate script which IMO does not belong to D7 and then the tested upgrade path. The dump contains 24 vocabularies, 48 terms (not two vocab, it's 1 or 2 or 3 depending on hierarchy, multiple parents need three etc), 24 nodes, 48 node revisions, and 1656 entries in term_node.

Other than that, it's mostly DamZ's patch, rerolled against HEAD, removing the occassional fatal :)

Edit: props to SteveL for figuring out that it's better to have the actual node author loaded in the generating script.

moshe weitzman’s picture

So now we have to store a whole drupal database as proof that a single update function works as designed? At this rate, we are going to balloon the drupal download even further. This is a 500K patch that fixes a handful of lines. I'm not saying that this is definately wrong, but I am not sure its right either.

chx’s picture

This database dump can be used for many other updates that involve nodes and user and taxonomy and path aliases.

chx’s picture

Also we might want to consider excluding this directory from the Drupal tarball and only those who checkout will get it.

webchick’s picture

Looked more into my webchick.net upgrade and saw that I lost term associations, so checking this out later.

Moshe's concern is legitimate. I talked to chx about this on IRC. A couple of things:

1. We're only going to have two database dumps: a "barebones" one (for catching things like the upgrade path fails because it assumes the term_data table exists) and a "fancy" one with lots of data (for exhaustively checking all the various core modules' paths). So every test that tests some aspect of the upgrade path will NOT have a 500KB dump associated with it, though we probably will have a single, 1MB dump at the end.

2. mikey_p suggested gzipping the file and having SimpleTest read it in as part of the test. I think this would mitigate filesize concerns. The downside is diffs then become useless. :\

3. We could also do chx's suggestion of excluding this one mega-file from the tarball. I'm not a fan of tarballs and CVS mis-matching in this way, though, but it seems most core developers would not be affected since we all work from CVS (or bzr/git whatever).

moshe weitzman’s picture

two database dumps as described is a reasonable approach.

i'm jumping the gun perhaps, but it would be nice if our db dump also served as a representative dataset for performance testing. See http://drupal.org/project/northdrop

chx’s picture

I talked to dww and it's possible to have a 'normal' and a 'developers' tarball attached to every node. We special case core already apparently anyways.

Stevel’s picture

+++ modules/simpletest/simpletest.info	2010-07-06 04:37:14 +0000
@@ -39,3 +39,4 @@ files[] = tests/unicode.test
=== added file 'modules/simpletest/tests/upgrade/drupal-6.taxonomy.database.php'

=== added file 'modules/simpletest/tests/upgrade/drupal-6.taxonomy.database.php'
--- modules/simpletest/tests/upgrade/drupal-6.taxonomy.database.php	1970-01-01 00:00:00 +0000

--- modules/simpletest/tests/upgrade/drupal-6.taxonomy.database.php	1970-01-01 00:00:00 +0000
+++ modules/simpletest/tests/upgrade/drupal-6.taxonomy.database.php	2010-07-06 04:37:53 +0000

As the generated database file will be one big database that will serve for different tests, it should be renamed to something more generic.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test	2010-07-06 04:37:14 +0000
@@ -0,0 +1,32 @@
+    $this->drupalGet('node/89');
+    $this->drupalGet('node/89/edit');

The node id can vary when the database is updated to contain more (other) types of nodes. Using a static path (i.e. term/foo or something) in the generation code would remedy this.

+++ modules/taxonomy/taxonomy.install	2010-07-06 04:37:14 +0000
@@ -423,14 +414,51 @@ function taxonomy_update_7005(&$sandbox)
+    } ¶

trailing whitespace

+++ modules/taxonomy/taxonomy.install	2010-07-06 04:37:14 +0000
@@ -449,20 +477,38 @@ function taxonomy_update_7005(&$sandbox)
+    ¶
+    // If there were no rows returned, we're finished with the current vocab. ¶

Trailing whitespace

Apart from the -mostly- coding style glitches, the generation-script should probably be included in the patch as well. I would prefer to have it in the same directory as the .test-file (so we can add "on-first-test-using-it" database generation using it later)

Stevel’s picture

Status: Needs review » Needs work

This thus needs a little bit of work.

int’s picture

Issue tags: +beta blocker

add beta blocker tag

int’s picture

bump... this is the last critical "D7 upgrade path" to be fixed

chx’s picture

Whee, poll is in now we can fix this one! Next patchissue can gzip down the db dumps and include compress.zlib://filename i checked it works , the manual explicitly says it's local, not restricted by allow_url_fopen.

Damien Tournoud’s picture

Rerolled now that the standard database is in.

Needs work, for the following:

  • we need to add a taxonomy reference field to node types that aren't explicitly associated to a vocabulary but have some associations in term_node
  • we need to modify the generation script so as not to associate nodes of type 'poll' to any term, so that we can test the three cases: (1 - 'page') node type is associated to a vocabulary, (2 - 'story') node type is not associated to a vocabulary but have associations in term_node, (3 - 'poll') node type is not associated to a vocabulary and have no associations in term_node
  • we need to implement the tests that I outlined as TODOs in UpgradePathTaxonomyTestCase::testTaxonomyUpgrade()

Have fun!

webchick’s picture

Last I heard, bangpound was working on this. Any update, bangpound? :)

int’s picture

Status: Needs work » Needs review

let's the bot review the last patch

chx’s picture

bangpound promised a patch today. there is nothing to review.

webchick’s picture

Status: Needs review » Needs work

sad ping. :(

Anonymous’s picture

Status: Needs work » Needs review
FileSize
12.81 KB

Here's the justification for the one field that captures all of the "extra" term node associations that are not "valid" according to the relationships between vocabularies and node types. The values in D6's vocabulary_node_type were translated into a combination of field (cardinality, required), widget (autocomplete or select) and instance (help, bundle) settings.

The only thing that enforced these constraints in D6 were the form elements themselves--through form validation. If the vocabulary "sections" was only allowed to have 1 term and was configured for the "story" node type, extra term_node values for a particular story would be wiped out if the node were edited and saved again. If "page" node type were not allowed according to core's vocabulary_node_type table, who knows if it should accept multiple values or one value? It's not supposed to accept any values, but obviously we're trying to cope with this reality in this patch. So this is why I can't just add new instances of the regular taxonomy field (which I'll call the "valid" field) to the extra node types.

I'm not keen to create double the number of taxonomy term fields either. They'd all have to be the same: multiple values, autocomplete widget, not required. If I create extra fields for each vocabulary to cope with the "overflow" of term_node associations, I'm saying "I know how these terms are related to those nodes." That's not true. I don't know anything.

So the outcome is a field that can capture all of the "excess" values and which can allow nodes to be loaded and saved without losing data. It's named "taxonomyextra_upgrade." It can't be named "taxonomy_something", because there's potential for collision with a machine name. Perhaps I can use taxonomy__something? Whatever, for now let's just call it the "excess" field.

The fun part of this patch was trying to figure out how to handle deltas. If I want to translate from D6 to D7 in groups defined by vocabulary ID and node type, I'm going to switch back and forth between the valid and the excess fields. I don't want to be passing around a larger and larger array of deltas in the batch sandbox... it's entirely possible that every term_node value for an upgrade is going to be put into the "excess" field, and it could be a database with millions of terms. So to avoid an impossible rewriting the upgrade code that Catch worked hard on (and that I remember he also struggled with handling deltas), I just rely on the maximum delta value for that particular entity type, entity ID/revision ID and language combination. This requires extra DB hits but ONLY for the excess field.

Oh and I noticed that the allowed values setting for taxonomy term reference fields are using database primary keys (vocabulary ID) instead of vocabulary machine names. Of course, it also uses a parent term ID for which there is no machine name equivalent. But for common taxonomy term reference fields whose allowed values are one or more complete vocabularies, this is an exportability problem. Bug?

Remaining work needed:

  • Tests
  • Code comments
chx’s picture

Taxonomy settings do not matter. What matters is that every node shows every term it had. We can't do otherwise. Broken or not, we can't get away with some nodes not showing terms hoping for a mythical contrib to smooth this out. Between extending your taxonomy settings and not showing some of your node-term associations I have no doubts which one to choose.

If we need to iterate over every node twice, first to figure out the configuration and then second to actually upgrade, so be it.

chx’s picture

Sorry! I misunderstood the situation. OK. I get it. That extra field is still a term reference so we always show all terms it's just in a different field.

The question is this.

Do we want this excess field - type upgrade or do we want to adjust the cardinality of the taxonomy field so that it can accomodate the excess? Will post more later as I talk with bangpound.

Anonymous’s picture

Just to clarify:

  • The patch does not lose any data.
  • All the term values are/can be displayed in the node.
  • The "excess" values are stored in a new field together with all the other excess values.
  • The values could be edited in the node form, too. But the widget for that field should be disabled. If a contrib module was exposing values without Drupal core's node form integration, bypassing its validation and storing these associations in Drupal core's taxonomy term tables, Drupal core can't know whether these values should have a UI widget.
  • "Excess" values are still added to taxonomy_index table even though they're in a different field. This means they still appear for taxonomy/term/%term paths.
  • Excess values are segregated. It's easier to analyze. Contrib modules can decide what to do with those values without stepping data that core could migrate correctly.

Vocabulary settings in D6 are migrated to D7 widget and instance settings. While it is possible to change the cardinality and the 'required' setting on an existing field, it means users who upgrade to D7 may go through an apparently fault-free upgrade only to discover later that a field that was in D6 single valued has been mysteriously upgraded to multiple values.

But something tells me this can be worked around with Field API: taxonomy term reference field cardinality should always be unlimited? Widget/instance settings should control number of values allowed?

chx’s picture

OK, that was a great summary. It's never going to be "ideal" but this is still the best we can do, I agree. Limiting stuff in widgets is against every core field API and it won't work really. Let's finish this w tests.

Anonymous’s picture

Here's a new patch. I touches all the bases EXCEPT not all tests pass.

Why don't all tests pass?

The demo data provided by Damien includes data that is invalid for the valid fields derived from the taxonomy_vocabulary_node_type table. The generated data doesn't seem to care that some of the vocabularies are defined as "multiple" and some are not. However, the migration script DOES care... sorta. It creates single value fields for these vocabularies that are not defined as multiple, but then it goes ahead and puts multiple values from taxonomy_term_node in those field data tables anyway.

The final test uses a simple algorithm to determine which terms should be in a field and which should not. If the term ID is the same as the node ID, it shouldn't be on the field. Likewise, if the term ID and the node ID add up to 49, the value shouldn't be in the field. However, every other term should be on every page and story node in the test data set. This isn't possible when many of the fields do not support multiple values.

Result: 1020 passes, 224 fails, 0 exceptions, and 40 debug messages

I tried to create all the vocabularies as multiple, but that didn't work.

Result: 1140 passes, 92 fails, 0 exceptions, and 34 debug messages

I'll still finish this up today, though! Not gonna let this last another day.

I'm beginning to think though that we're going to break many hearts with this upgrade path. Perhaps the best solution is the exceedingly careful one: migrate values that we can and delete rows of valid migrated values from the taxonomy_term_node table. If the table is empty? Hooray, drop the table. If not? Someone has a mess to clean up... but hey! Your data is "safe."

Status: Needs review » Needs work

The last submitted patch, 706842-43-taxonomy-update-path-broken.patch, failed testing.

int’s picture

the #43 patch work fine in my d6 to d7 upgrade.

All terms were migrated fine.

This patch is all what I need to my site upgrade work.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
15.37 KB

This patch is 100% functional.

It needs extensive commenting and it could probably be a bit more elegant. Give me another pass through it.

It produces a wave of warnings:

Undefined property: stdClass::$vocabulary_machine_name	

These only appear during one iteration of a loop of tests.

I'm so curious to find out how many people will end up with the "excess" field after they upgrade. I suspect it will be quite a lot.

Status: Needs review » Needs work

The last submitted patch, 706842-46-taxonomy-update-path-broken.patch, failed testing.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
16.31 KB

Here's a patch. I avoided the "warnings" so the patch would pass all tests. I've added comments to explain what's going on in the new taxonomy_upgrade_7005(). I hope it's not going to ruin performance.

TODO: We can just assume that an overflow field is necessary and create it with all vocabularies. This saves us one expensive query.

TODO: Name this overflow field. It can't be named 'taxonomy_' + something because that will collide with other potential names.

Status: Needs review » Needs work
Issue tags: -D7 upgrade path, -beta blocker

The last submitted patch, 706842-48-taxonomy-update-path-broken.patch, failed testing.

Anonymous’s picture

Status: Needs work » Needs review
Issue tags: +D7 upgrade path, +beta blocker
Anonymous’s picture

The patch basically rewrites taxonomy_update_7005() so it's probably hard to read without applying it. See http://drupalbin.com/node/15761 for the full function.

EDIT: Corrected URL

tsvenson’s picture

@bangpound

That link gives Page not found. I tried http://drupalbin.com/node/15761 and it opens the update 7005 page. Is the correct one?

Anonymous’s picture

Hmm.. that's odd! Yes. http://drupalbin.com/node/15761 is the right one.

chx’s picture

Status: Needs review » Reviewed & tested by the community

The solution is the best we can get out of such a mess (ie converting things not adhering to a foreign key constraint into a stricter system), the new code is nice and the tests are solid. Go!

mlncn’s picture

Status: Reviewed & tested by the community » Needs work

With both an existing real-world Drupal 6 site and a freshly minted d6 site with some devel generated taxonomy terms and nodes fail when upgrading with this patch:

Failed: PDOException: SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '(etid, entity_id, revision_id, bundle, language, delta, ) VALUES ('1', NULL, NUL' at line 1: INSERT INTO {} (etid, entity_id, revision_id, bundle, language, delta, ) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => 1 [:db_insert_placeholder_1] => [:db_insert_placeholder_2] => [:db_insert_placeholder_3] => [:db_insert_placeholder_4] => und [:db_insert_placeholder_5] => 0 [:db_insert_placeholder_6] => 2922 ) in taxonomy_update_7005() (line 571 of /home/ben/workspace/d7b/modules/taxonomy/taxonomy.install).

This variable:

      $value_column = $field['storage']['details']['sql'][FIELD_LOAD_REVISION][$table]['tid'];

is not getting any value. Haven't figured out yet.

Trying to list terms in any vocabulary results in this, but this is just a symptom:

Error message
PDOException: SQLSTATE[42S22]: Column not found: 1054 Unknown column 'base.format' in 'field list': SELECT base.tid AS tid, base.vid AS vid, base.name AS name, base.description AS description, base.format AS format, base.weight AS weight, v.machine_name AS vocabulary_machine_name FROM {taxonomy_term_data} base INNER JOIN {taxonomy_vocabulary} v ON base.vid = v.vid WHERE (base.tid IN (:db_condition_placeholder_0, :db_condition_placeholder_1)) ; Array ( [:db_condition_placeholder_0] => 30 [:db_condition_placeholder_1] => 31 ) in DrupalDefaultEntityController->load() (line 185 of /home/ben/workspace/d7b/includes/entity.inc).

mlncn’s picture

Pretty sure the failure is that 'taxonomyextra_upgrade' is sometimes assigned here, about line 527 of the patched code:

      // Use the valid field for this vocabulary and node type or use the
      // overflow vocabulary if there is no valid field.
      $field_name = isset($sandbox['vocabularies'][$record->vocab_id][$record->type]) ? $sandbox['vocabularies'][$record->vocab_id][$record->type] : 'taxonomyextra_upgrade';
      $field = $field_info[$field_name];

when it has not been created within the if (!empty($extra_vocabularies)) { statement.

Specifically, given the failure on normal upgrades but not with the test data, this seems to happen when all terms should be fitting into normal fields.

webchick’s picture

This means we need further additions to those tests to catch this case.

mlncn’s picture

FileSize
1.13 MB

Here is the mostly devel-generated database that causes errors on upgrade.

catch’s picture

chx asked me to review this, but I'm flat out at work and haven't had time to do a proper review. Just to say I'm completely happy with the 'extras' field and don't see any other way apart from leaving the rows in term_node and letting contrib handle it, which doesn't seem so good anyway.

On naming can we call it something like taxonomyextras (no underscore) to keep it namespaced without the collisions?

bjaspan’s picture

Subscribe. The basic approach based on reading only comments seems sound. I'll try to look at the code on the plane to CPH.

Anonymous’s picture

I'll be updating the patch tomorrow (Saturday, Central US) to try to fix the problems found by Benjamin Melançon in #56-58.

Anonymous’s picture

I'm having inexplicable trouble with this patch now. When I run the upgrade path tests, I am given this error (via the "verbose response"):

Unresolved dependency Text (Disabled)
Comment requires this module.
Unresolved dependency Field (Disabled)
Comment requires this module.
Unresolved dependency Field SQL storage (Disabled)
Comment requires this module.

EDIT: This issue is blocked by http://drupal.org/node/877690

Anonymous’s picture

Status: Needs work » Needs review
FileSize
16.26 KB

New patch.

The error Benjamin Melançon confronted was that his database had terms that were not applied to any nodes. This led to the SQL results for the second query in taxonomy_update_7005() including rows that had no reason to be migrated... no node ID, no type, thus no value column name.

This cannot be tested with the current test suite's upgrade database because EVERY term has a link to almost every node.

The field 'taxonomyextra_upgrade' has been renamed to 'taxonomyextra' so that it's namespace-less.

Anonymous’s picture

I just realized that if the ONLY excess values are those which exceed the cardinality of the otherwise valid destination field, there will be no "taxonomyextra" field created (because the vocabulary node type table and the extras calculation query in taxonomy_upgrade_7004 would have matched up identically) but during the upgrade, the excess values would be found and they'd be added to a non-existant taxonomyextra field.

What's the best way to handle this?

1. Always create the "taxonomyextra" field with every vocabulary as an allowed value even if it ends up with no data.

2. Create the taxonomyextra field at the first instance when it's needed in taxonomy_upgrade_7005.

Anonymous’s picture

Assigned: Unassigned »
mlncn’s picture

Assigned: » Unassigned

Patch in #63 did not work for me. Or rather, it allowed the upgrade, but it didn't do right by the data (using the same test database attached in #58). The term_node table has quite a lot of things in that database, but after upgrade the taxonomy_index table had nothing at all. (Which ended a long time of clicking to see if i could find any piece of content with a taxonomy term, or vice versa!)

bjaspan’s picture

re #64: Why not always create the "taxonomyextra" field with every vocabulary as an allowed value even if it ends up with no data, and then when all done do a field query and delete the field if it has no data items?

mlncn’s picture

I take #66 back. The patch in #63 works for everything i can test.

mlncn’s picture

Status: Needs review » Needs work

Apologies again for the noise in #66. Upgrading a real Drupal 6 site that does crazy things with taxonomy still has the problems reported in #55. Given my theory in #56, that may be a vote for Barry's catch-all approach in #67.

Got that? In short, i'll test Bangpound's next patch, but for everyone's info the test database in #58 and Drupal's core tests don't catch all the awful possibilities out there, so i at least would like to be able to test again before this patch is committed.

Anonymous’s picture

Assigned: Unassigned »
Status: Needs work » Needs review
FileSize
16.06 KB

New patch!

Now the upgrade process creates the overflow field and migrates the data. Then it deletes instances of the overflow field when they aren't necessary. If the overflow field's revision table contains NO data, the whole field is deleted.

Anonymous’s picture

I wish I could undo that last patch. It's not going to work because I made it from git without --no-prefix.

mlncn’s picture

My complex real-world database still fails, Update #7005, but the error is more precise?

Failed: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'delta' cannot be null: INSERT INTO {field_revision_taxonomyextra} (etid, entity_id, revision_id, bundle, language, delta, taxonomyextra_tid) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => 1 [:db_insert_placeholder_1] => 3 [:db_insert_placeholder_2] => 16 [:db_insert_placeholder_3] => project [:db_insert_placeholder_4] => und [:db_insert_placeholder_5] => [:db_insert_placeholder_6] => 31 ) in taxonomy_update_7005() (line 565 of /home/ben/workspace/d7t/modules/taxonomy/taxonomy.install).

Anonymous’s picture

Fixed.

I don't really know how to create tests for all these different problems unless there's data in the test upgrade database.

The current dataset is very formulaic and it has a huge number of tests. If we want to rewrite the test database, it needs to have these cases covered:

All the current tests in testTaxonomyUpgrade().
A node which has 2 term_node entries for terms in a vocabulary which only supports single values.
Terms which have no entries in the term_node table.
A node which has term X, Y and Z in its first revision but only term X and Y in the current one.

Anonymous’s picture

And the patch...

Status: Needs review » Needs work

The last submitted patch, 706842-73-taxonomy-update-path-broken.patch, failed testing.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
16.14 KB

ARGH. No prefix bit me again.

mlncn’s picture

Status: Needs review » Reviewed & tested by the community

So without my mentioning to him in advance, Bangpound has upgraded a site that did many unusual things with Taxonomy module, and then merged in a site that went back to Drupal 4.7 and... the contrib Category module. I dare say, that having upgraded all the vocabularies on each node in this monster means that there's not much out there in the world that the taxonomy upgrade could have trouble with.

My hat is off to Benjamin Doherty.

mlncn’s picture

Status: Reviewed & tested by the community » Needs work

Oh hell the database attached in #58 no longer works:

* Failed: DatabaseSchemaObjectExistsException: Table field_data_taxonomyextra already exists. in DatabaseSchema->createTable() (line 621 of /home/ben/workspace/d7t/includes/database/schema.inc).

mlncn’s picture

Status: Needs work » Reviewed & tested by the community

Never mind, i hadn't cleared the database (went to, but never completed that step) before bringing in the test database for a new try. It does work!! Please commit this before i mess up another test.

Status: Reviewed & tested by the community » Needs work
Issue tags: -D7 upgrade path, -beta blocker

The last submitted patch, 706842-76-taxonomy-update-path-broken.patch, failed testing.

Anonymous’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work
Issue tags: +D7 upgrade path, +beta blocker

The last submitted patch, 706842-76-taxonomy-update-path-broken.patch, failed testing.

chx’s picture

That's a weird failure. I have instructed the testbot to retest HEAD.

chx’s picture

Status: Needs work » Needs review
bjaspan’s picture

Partial review; next I will apply the patch to review update 7005.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test
@@ -0,0 +1,122 @@
+          // Prefer valid taxonomy term reference fields for a given vocabulary
+          // when they exist.
+          if (empty($instances[$tree['vid']]) || $instances[$tree['vid']] == 'taxonomyextra') {
+            $instances[$tree['vid']] = $field['field_name'];

$instances is not initialized to array(). It shouldn't matter b/c page will always have a taxo reference, but if it ever doesn't this will cause an E_NOTICE exception.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test
@@ -0,0 +1,122 @@
+    // Check that the node type 'story' has been associated to a taxonomy
+    // reference field for each vocabulary. It was not explicitely in
+    // $vocabulary->nodes but each node of type 'story' was associated to
+    // one or more terms.

The assertions following this comment say that story nodes should only be associated with the taxonomyextra field. Okay, I see that the second sentence of this comment is saying that, but only indirectly, and the first sentence seems to say the opposite. So, the comment should probably be clarified.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test
@@ -0,0 +1,122 @@
+    $this->assertEqual(count($field_names), 1, t('Only one taxonomy term field is used for on story nodes'));
+    $this->assertEqual(key($field_names), 'taxonomyextra', t('Only the excess taxonomy term field is used for on story nodes'));

Grammatical error: "used for" in both lines.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test
@@ -0,0 +1,122 @@
+    $instances = array();
+    foreach (field_info_instances('node', 'poll') as $instance) {

This loop appears three times identically. It could stand to be factored out. You knew someone would say that, right? :-)

+++ modules/taxonomy/taxonomy.install
@@ -412,7 +412,54 @@ function taxonomy_update_7004() {
+  // Allowed values for this extra vocabs field is every vocabulary.
+  foreach (taxonomy_get_vocabularies() as $vocabulary) {
+    $allowed_values[] = array(
+      'vid' => $vocabulary->vid,
+      'parent' => 0,
+    );
+  }

$allowed_values is not initialized; later code will fail if there are no vocs.

Powered by Dreditor.

bjaspan’s picture

Status: Needs review » Needs work

Review of 7005:

    $result = db_query('SELECT v.vid, v.machine_name, n.type FROM {taxonomy_vocabulary} v INNER JOIN {taxonomy_vocabulary_node_type} n ON v.vid = n.vid');
    $vocabularies = array();
    foreach ($result as $record) {
      // If no node types are associated with a vocabulary, the LEFT JOIN will
      // return a NULL value for type.
      if (isset($record->type)) {
        $vocabularies[$record->vid][$record->type] = 'taxonomy_'. $record->machine_name;
      }

The query is an INNER JOIN, not a LEFT JOIN. Is the comment and if statement a holdover from previous "extra" field approaches?

    // Query selects all revisions at once and processes them in revision and
    // term weight order.
    $query = 'SELECT td.vid AS vocab_id, td.tid, tn.nid, tn.vid, n.type, n2.created, n2.sticky, n2.nid AS is_current FROM {taxonomy_term_data} td INNER JOIN {taxonomy_term_node} tn ON td.tid = tn.tid LEFT JOIN {node} n ON tn.nid = n.nid LEFT JOIN {node} n2 ON tn.vid = n2.vid ORDER BY tn.vid, td.weight ASC';

Confused. LEFT JOIN {node} n2 ON vid will only get the most current revision, not all revisions at once. Thinking... the LEFT JOIN {node} ON nid should grab the node the term is associated with, which presumably should always exist unless a bad node deletion occurred, so INNER JOIN should be safe but LEFT JOIN is probably prudence. But the LEFT JOIN {node} n2 ON vid will only ever match if the term is associated to the most current revision of the node. Okay, so this means it is possible for {node} n to match with data and {node} n2 not to. Maybe I'll see from the code below that this was intended, though I do not see how it "selects all revisions at once." More explanation, please?

    if (isset($sandbox['cursor'])) {
      $values = $sandbox['cursor']['values'];
      $deltas = $sandbox['cursor']['deltas'];
    }

It would be helpful to document the "rep invariant" for $sandbox, e.g.: all the possible keys of $sandbox, their format, and what they are guaranteed to look like after each invocation of the function. Otherwise, at this point in the code, I have no idea what $values and $deltas are.

      if (isset($values)) {

        // If the last inserted revision_id is the same as the current record,
        // use the previous deltas to calculate the next delta.
        if ($record->vid == $values[2]) {

"Huh?" See also, please document $values. :-)

          if ($field['cardinality'] != FIELD_CARDINALITY_UNLIMITED && ($deltas[$field_name] + 1) >= $field['cardinality']) {

            // For excess values for a single value field, switch over to the
            // overflow field.

Comment should be "For excess values for a fixed-cardinality field" since you are (properly) handling N>1 fixed cardinality fields.

      // Table and column found in the field's storage details. During upgrades,
      // it's always SQL.
      $table = key($field['storage']['details']['sql'][FIELD_LOAD_REVISION]);
      $value_column = $field['storage']['details']['sql'][FIELD_LOAD_REVISION][$table]['tid'];

I understand that it is safe to assume SQL storage here, but why do it? It seems like it would be simpler and clearer to construct a pseudo-entity and use field_attach_insert().

Hmmm, okay, the reason it is not easier is that you do not always have all the the terms for a single nid/vid at a time, which the Field Attach API requires. I think this whole update function would be simpler if you based it on "SELECT DISTINCT vid FROM {taxonomy_term_node}" then queried and handled all terms for each nid/vid as a separate pass. It would clearly require more queries, but who cares? Of course it would also be a complete rewrite of the function. So if we achieve done-ness soon, don't bother. If not, consider it. :-)

        // Update the {taxonomy_index} table.
        db_insert('taxonomy_index')
          ->fields(array('nid', 'tid', 'sticky', 'created',))
          ->values(array($record->nid, $record->tid, $record->sticky, $record->created))
          ->execute();

Hey, now I know what {node} n2 in the primary query is for! But now I also really think it should be {node_revision} n2 since otherwise won't created and sticky always be NULL for a term attached to a non-current revision?

    // Determine necessity of taxonomyextras field.
    $field = $field_info['taxonomyextra'];
    $table = key($field['storage']['details']['sql'][FIELD_LOAD_REVISION]);
    $node_types = db_select($table)->distinct()->fields($table, array('bundle'))
      ->execute()->fetchCol();

It seems like this could use a FieldEntityQuery in either count or limit-1 mode for each bundle separately, dropping the instance on no data and then no dropping the field if all instances had no data. Again, I guess it isn't necessary/important if the current code really is done.

Setting to CNW because unless I'm mistaken at least the {node} n2 vs {node_revisions} n2 is a real bug.

bjaspan’s picture

Assigned: » bjaspan

I'm working on this today.

pbuyle’s picture

While testing the patch in #76 on a real D6 database, each taxonomy_update_7005 step took about 2min. Starting a transaction in taxonomy_update_7005 reduced each step of 1000 inserts to 2sec. I started the transaction at line 518 (just before starting the foreach loop). I guess it can be started at the begin of taxonomy_update_7005. Or if we don't want to run everything in taxonomy_update_7005 in a transaction, move the loop in an helper function to only do the taxonomy_index inserts in the transaction.

pfrenssen’s picture

Assigned: bjaspan » Unassigned

small problem in the patch: $field_info is used on line 602:

    // Determine necessity of taxonomyextras field.
    $field = $field_info['taxonomyextra'];

but it is only defined when $sandbox['total'] is set. It caused my upgrade to fail. Moving it back up to the top of the function fixes this:

***************
*** 471,476 ****
--- 471,478 ----
   */
  function taxonomy_update_7005(&$sandbox) {
  
+   $field_info = field_info_fields();
+     
    // This is a multi-pass update. On the first call we need to initialize some
    // variables.
    if (!isset($sandbox['total'])) {
***************
*** 498,504 ****
      }
    }
    else {
-     $field_info = field_info_fields();
      $etid = _field_sql_storage_etid('node');
  
      // We do each pass in batches of 1000.
--- 500,505 ----
marvil07’s picture

Just to mention that relations between terms seems to not being conserved.

To reproduce it, just create a relation between two terms, upgrade and the relation is not going to be there.

mrfelton’s picture

+++ modules/taxonomy/taxonomy.install
@@ -412,7 +412,54 @@ function taxonomy_update_7004() {
+
+  // Some contrib projects stored term node associations without regard for the
+  // selections in the taxonomy_vocabulary_node_types table.
+
+  // Allowed values for this extra vocabs field is every vocabulary.
+  foreach (taxonomy_get_vocabularies() as $vocabulary) {
+    $allowed_values[] = array(
+      'vid' => $vocabulary->vid,
+      'parent' => 0,
+    );
+  }

This part is failing my upgrade of a clean D6 site, where no vocbs have been defined. What should allowed values default to?

Powered by Dreditor.

EDIT: To clarify, the $allowed_values problem is only a notice, but still. And to confirm, the change in #89 gets me past this update hook.

mrfelton’s picture

FileSize
707 bytes

I honestly don't know if this is related. But, with the patch applies, and the change made from #89. I get the following error:

Fatal error: Call to undefined function field_read_fields() in /home/tom/workspace/d7/modules/field/modules/field_sql_storage/field_sql_storage.install on line 39

The following patch resolves.

mrfelton’s picture

More debugging:

For what ever reason, the site that I am upgrading has some entries in the term_node table, which relate to a non-existant node. This is screwing up the upgrade process with the following error:

Failed: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'bundle' cannot be null: INSERT INTO {field_revision_taxonomyextra} (etid, entity_id, revision_id, bundle, language, delta, taxonomyextra_tid) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => 1 [:db_insert_placeholder_1] => 176 [:db_insert_placeholder_2] => 497 [:db_insert_placeholder_3] => [:db_insert_placeholder_4] => und [:db_insert_placeholder_5] => 0 [:db_insert_placeholder_6] => 80 ) in taxonomy_update_7005() (line 569 of /home/tom/workspace/d7/modules/taxonomy/taxonomy.install).

so it is attempting to insert a record with the bundle field as null. This is because the following query in taxonomy.install tries to join term_node on node.

$query = 'SELECT td.vid AS vocab_id, td.tid, tn.nid, tn.vid, n.type, n2.created, n2.sticky, n2.nid AS is_current FROM {taxonomy_term_data} td INNER JOIN {taxonomy_term_node} tn ON td.tid = tn.tid LEFT JOIN {node} n ON tn.nid = n.nid LEFT JOIN {node} n2 ON tn.vid = n2.vid ORDER BY tn.vid, td.weight ASC';

Since there is no entry in the node table in this case, n.type comes back blank, and so when we attempt to insert into field_revision_taxonomyextra, the bundle field is blank - which causes the error.

Note: I can only get this far with the patches from #84 and #92 applied. I also have http://drupal.org/node/895014#comment-3379240 applied

EDIT: And just to confirm, that after doing all the above, and then manually removing those rouge entries from the term_node table just prior to running the upgrade, I am able to get all the way to the end of the upgrade process - for the first time all day :D

mrfelton’s picture

+  // Allowed values for this extra vocabs field is every vocabulary.
+  foreach (taxonomy_get_vocabularies() as $vocabulary) {
+    $allowed_values[] = array(
+      'vid' => $vocabulary->vid,

I don't think we can use taxonomy_get_vocabularies() here. Not with the current state of things anyway. If someone has disabled taxonomy.module, the upgrade results in the following error:

Fatal error: Call to undefined function taxonomy_get_vocabularies() in /home/tom/workspace/d7/modules/taxonomy/taxonomy.install on line 420

Or at least, we need explicitly to load that function first.

Powered by Dreditor.

pfrenssen’s picture

I discovered an issue when investigating this: the upgrade fails if taxonomy terms are present in the database but the taxonomy module was disabled in D6. If filed a new issue for this: #895386: Clean-up the upgrade path: taxonomy

bjaspan’s picture

New patch. FYI, I declare myself very impressed with bangpound's initial patch from #76; several of the problems I "identified" in my review were not in fact problems.

Changes:

#85, #86: All of my correctly-identified problems are fixed, so anything from those comments was either not a problem or is fixed.
#89: Undefined $field_info problem fixed.
#91: I identified this issue in my review in #85, and it is fixed.
#93: term node relations to non-existence nodes are no longer migrated, so this is fixed.
#94, #95: Moved to #895386: Clean-up the upgrade path: taxonomy.

Listed as @todos:

#88: Wrapping a transaction around update 7005. This is a performance improvement that is not critical and can be a separate issue.
#90: Term relations are still not migrated.

Not done:

#92: mrfelton, can you reproduce this?

bjaspan’s picture

re #90: Apparently core no longer supports term relations. taxo update 7001 renames term_relation table to taxonomy_term_relation, and beyond that core ignores it. So yes, the upgrade path does not preserve it.

I removed the @todo. New patch.

bjaspan’s picture

Actually, there are more changes in #96/97:

I added new data to the D6 upgrade test database for nodes that have different terms associated to different versions. thought I was doing this to reveal a flaw in bangpound's update function but it turns out his code handled it perfectly. :-) So, this added small changes to generate-d6-content.sh, and ENORMOUS changes to drupal-6.filled.database.php because dump-database-d6.sh does not order things and I guess my system is dumping in a different order than the previous submitter's did.

With the removal of the @todo from #97, I think this code is ready for a followup review, then RTBC.

bjaspan’s picture

Status: Needs work » Needs review

Oops, CNW. To be clear, it is the patch in #97 that is CNR and I think RTBC.

c960657’s picture

This is not a complete review, but it did notice two minor issues:

+    $query = 'SELECT td.vid AS vocab_id, td.tid, tn.nid, tn.vid, n.type, n2.created, n2.sticky, n2.nid AS is_current FROM {taxonomy_term_data} td INNER JOIN {taxonomy_term_node} tn ON td.tid = tn.tid INNER JOIN {node} n ON tn.nid = n.nid LEFT JOIN {node} n2 ON tn.vid = n2.vid ORDER BY tn.vid, td.weight ASC';
+    $result = db_query_range($query, $sandbox['last'], $batch);

The tn.vid-td.weight tuple is not guaranteed to be unique (AFAICT weight is not used in Tags taxonomies, and even in non-Tags taxonomies there may be duplicates due to weight being just a tinyint), so consecutive queries are not guaranteed to return the terms in the same order. This is required when iterating over the whole result using multiple range queries.

+    $this->assertEqual(sort(array_keys($vocabularies)), sort(array_keys($instances)), t('Node type page has instances for every vocabulary.'));

This does not do what it claims to to. sort() works by reference and just returns a boolean. But the keys of the two arrays are in fact equal, though.

bjaspan’s picture

Status: Needs review » Needs work

Sounds like CNW to me.

ctmattice1’s picture

Tried #97 on a D6 upgrade after applying the patch in #891028: Orphan node types lose bodies on upgrade. After applying #97 compared the D6 taxonomy to newly update D7. It caught all of the terms and even one that was not showing up in D6 which should have. Still may not be complete though dblog is showing

Type php
Date Saturday, August 28, 2010 - 5:37pm
User ctmattice
Location http://d6d7.local/taxonomy/term/9/edit?destination=admin/structure/taxon...
Referrer http://d6d7.local/admin/structure/taxonomy/vocabulary_7
Message Warning: Invalid argument supplied for foreach() in filter_list_format() (line 637 of C:\sites\d6d7\modules\filter\filter.module).
Severity warning
Hostname 127.0.0.1

bjaspan’s picture

Status: Needs work » Needs review
FileSize
666 KB

New patch, fixes both issues from #100, and includes a lengthy comment for one since it is not otherwise obvious what is going on:

    // This query must return a consistent ordering across multiple calls.  We
    // need them ordered by node vid (since we use that to decide when to reset
    // the delta counters) and by term weight so they appear within each node
    // in weight order. However, tn.vid,td.weight is not guaranteed to be
    // unique, so we add tn.tid as an additional sort key because tn.tid,tn.vid
    // is the primary key of the D6 term_node table and so is guaranteed
    // unique. Unfortunately it also happens to be in the wrong order which is
    // less efficient, but c'est la vie.
    $query = 'SELECT td.vid AS vocab_id, td.tid, tn.nid, tn.vid, n.type, n2.created, n2.sticky, n2.nid AS is_current FROM {taxonomy_term_data} td INNER JOIN {taxonomy_term_node} tn ON td.tid = tn.tid INNER JOIN {node} n ON tn.nid = n.nid LEFT JOIN {node} n2 ON tn.vid = n2.vid ORDER BY tn.vid, td.weight ASC, tn.tid';
Berdir’s picture

I haven't really read the patch so I don't know how complex the queries actually are but I've confirmed that they pass on PostgreSQL (with the patch from the changeField() issue...) and SQLite... :)

chx’s picture

Status: Needs review » Reviewed & tested by the community

I think we are ready to go. It'll probably need a slight fixup once the field helpers are in for updates BUT right now it's impossible for a normal user to test taxonomy update w/o this patch. Let's get this in. Even if we find problems later, that can be fixed on. This is an incredible good step ahead.

mlncn’s picture

Tested with the real-world and devel generated databases i've been working with. The improved patch still works for them, too. Beautifully. Agree with chx that this should go in.

RTBC++

webchick’s picture

Priority: Critical » Normal
Status: Reviewed & tested by the community » Needs work

AWESOME work, folks!!! So glad to see this one polished off. :)

The comments in some of these update functions are works of poetry.

Committed to HEAD. Here's some minor clean-up I spotted going through, which would be nice to have a quick follow-up patch for.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test	31 Aug 2010 18:44:11 -0000
@@ -0,0 +1,145 @@
+   * Retrieve an array mapping allowed vocabulary id to field name for
+   * all taxonomy_term_reference fields for which an instance exists
+   * for the specified entity type and bundle.
...
+  function instanceVocabularies($entity_type, $bundle) {

This function name appears to be missing a verb. "getInstanceVocabularies"?

Additionally, is there a way to tighten up this doc in a summary that fits in one line?

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test	31 Aug 2010 18:44:11 -0000
@@ -0,0 +1,145 @@
+          if (empty($instances[$tree['vid']]) || $instances[$tree['vid']] == 'taxonomyextra') {

What is "taxonomyextra"? Is that a contrib module we're special casing? A comment here wouldn't hurt.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test	31 Aug 2010 18:44:11 -0000
@@ -0,0 +1,145 @@
+    // Check that taxonomy_vocabulary_node_type and taxonomy_term_node have been
+    // removed.
+    $this->assertFalse(db_table_exists('taxonomy_vocabulary_node_type'), t('taxonomy_vocabulary_node_type has been removed.'));
+    $this->assertFalse(db_table_exists('taxonomy_term_node'), t('taxonomy_term_node has been removed.'));

Is this a good idea? Might contrib modules need to do something with the data in these tables?

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test	31 Aug 2010 18:44:11 -0000
@@ -0,0 +1,145 @@
+      debug("Testing node $nid");

Debug.

+++ modules/simpletest/tests/upgrade/upgrade.taxonomy.test	31 Aug 2010 18:44:11 -0000
@@ -0,0 +1,145 @@
+      // The first 12 nodes have two revisions. For nodes with
+      // revisions, check that the oldest revision is associated only
+      // to terms whose ID is equal to the node ID or 49 less the node ID.

Uh. That might explain exactly what the code below is doing, but doesn't help me at all with "why".

+++ modules/taxonomy/taxonomy.install	31 Aug 2010 18:44:11 -0000
@@ -412,7 +412,57 @@ function taxonomy_update_7004() {
+  // Some contrib projects stored term node associations without regard for the
+  // selections in the taxonomy_vocabulary_node_types table, or have more terms
+  // for a single node than the vocabulary allowed. We construct the
+  // taxonomyextra field to store all the extra stuff.
...
+      'description' => 'Debris left over after upgrade from Drupal 6',

Oh, ok. Now taxonomyextra is explained. Maybe just a pointer to it in that other place, then.

Also, what exactly does "store all the extra stuff" mean? The description here could use some work, to help inform people who end up with this field in their database what to do next.

+++ modules/taxonomy/taxonomy.install	31 Aug 2010 18:44:11 -0000
@@ -440,47 +520,164 @@ function taxonomy_update_7005(&$sandbox)
+        $vocabularies[$record->vid][$record->type] = 'taxonomy_'. $record->machine_name;

Concatenation.

+++ modules/taxonomy/taxonomy.install	31 Aug 2010 18:44:11 -0000
@@ -440,47 +520,164 @@ function taxonomy_update_7005(&$sandbox)
+      // Table and column found in the field's storage details. During upgrades,
+      // it's always SQL.
+      $table = key($field['storage']['details']['sql'][FIELD_LOAD_REVISION]);
+      $value_column = $field['storage']['details']['sql'][FIELD_LOAD_REVISION][$table]['tid'];

Interesting. So what do MongoDB people do?

+++ modules/taxonomy/taxonomy.install	31 Aug 2010 18:44:11 -0000
--- scripts/generate-d6-content.sh	30 Jul 2010 01:28:00 -0000	1.1
+++ scripts/generate-d6-content.sh	31 Aug 2010 18:44:11 -0000

This file could use a heck of a lot more comments to explain what it's generating and why.

As a side-note, it's weird to have a file that ends in .sh and have it not be shell script code, but PHP. Not a huge deal, either way, but odd.

+++ scripts/generate-d6-content.sh	31 Aug 2010 18:44:11 -0000
@@ -113,16 +121,26 @@ for ($i = 0; $i < 24; $i++) {
+  // Make every term association different a little. For nodes with revisions,

"a little different"

Powered by Dreditor.

webchick’s picture

Issue tags: -beta blocker

no longer a beta blocker. yay! :D

catch’s picture

At some point there's going to have to be a module/script to migrate from from field_sql_storage to mongodb, that'd make the D6/CCK -> D7/mongodb upgrade process D6/CCK -> d7/field_sql_storage -> mongodb, but it'd also account for D7sql->mongodb at the same time. It's going to be the same for any other pluggable storage like per-bundle too.

I wasn't able to give this issue half the attention I wanted to, so thanks to those who actually found time to work on it.

chx’s picture

A migration script from SQL to any other field storage is very possible -- once you have upped to D7, it's not too hard, the biggest challenge is keeping the field names, you likely want to up the field definitions and then run bare SQL to load and run field_attach_save and batch the process. However, on upgrade it's next to impossible.

yched’s picture

or, for a generic storage migration tool (no assumption on SQL as a starting point) :
update the field definitions,
[batch]
fetch field data with a direct call to [old_storage]_field_storage_load(),
save to new storage with field_attach_save(),
remove old data with [old_storage]_field_storage_purge()
[/batch]
remove old storage (e.g SQL tables) with [old_storage]_field_storage_purge_field()

andypost’s picture

Is Migrate module a better place to discus a field storage data migration?

catch’s picture

Title: "taxo as field" update broken + wipes some node/term associations » Improve comments for the taxonomy upgrade path
Category: bug » task
Priority: Normal » Minor
Issue tags: -D7 upgrade path +Novice

Keep seeing this and thinking it is still broken.

lionel.a’s picture

Version: 7.x-dev » 7.8
Priority: Minor » Normal

Hello,
I'm getting exactly the same message than post #55 when I upgrade from D6 to 7:
PDOException: SQLSTATE[42S22]: Column not found: 1054 Unknown column 'base.format' in 'field list': SELECT base.tid AS tid, base.vid AS vid, base.name AS name, base.description AS description, base.format AS format, base.weight AS weight, v.machine_name AS vocabulary_machine_name FROM {taxonomy_term_data} base INNER JOIN {taxonomy_vocabulary} v ON base.vid = v.vid WHERE (base.tid IN (:db_condition_placeholder_0)) ; Array ( [:db_condition_placeholder_0] => 29 ) in DrupalDefaultEntityController->load() (line 196 of /xxxxx/includes/entity.inc).

But what I don't understand is that I've upgraded succesfully my multisite, except one site. This one is using taxomony and I got an error message while upgrade :

Update #7005
Failed: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'node-64-78-0-0-und' for key 'PRIMARY': INSERT INTO {field_revision_taxonomy_vocabulary_16} (entity_type, entity_id, revision_id, bundle, language, delta, taxonomy_vocabulary_16_tid) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => node [:db_insert_placeholder_1] => 64 [:db_insert_placeholder_2] => 78 [:db_insert_placeholder_3] => story [:db_insert_placeholder_4] => und [:db_insert_placeholder_5] => 0 [:db_insert_placeholder_6] => 30 ) in taxonomy_update_7005() (line 686 of /xxxxx/modules/taxonomy/taxonomy.install).

I still remain with 6 warnings, but not critical.

What is strange, is that I got exactly the same message for another of my sites, but I succeed to upgrade it (and then to apply migration fields).

Thanks if you get any idea,
lionel

---
Post edit : I finaly avoid this error by deleting all taxonomy and upgrading the site.

xjm’s picture

Status: Needs work » Closed (fixed)

#114: If this still occurs when upgrading to Drupal 7.9, I'd suggest searching for an open issue or filing a new one. This one was open for comment cleanup. Thanks!

I opened #1345420: Improve comments for the taxonomy upgrade path to avoid further confusion here, and since folks who might otherwise do this cleanup might be scared off by the 115 comments. :)