Submit a patch. Advocate for the change. Garner and respond to feedback. Changes to taxonomy can languish in contrib forever and ever, or they can find champions who push them into core. Be a champion.

Comments

sdrycroft’s picture

Pushing this into core means Drupal 7, which I have no experience with, and sadly don't have the time to start. I'd love to, although I'm really not sure how many people this would benefit, certainly Googling, and searching the Drupal site itself there is a lot of comment on the use of "Large taxonomies", but what percentage of the total users this is, is hard to say.

Thanks for your comments anyway Robert, I'll leave this issue open as a reminder, of what I should be doing with this module.

coupet’s picture

Great contrib! If possible, a patch in the queue for review would be a good start.

PepeMty’s picture

+1 to Robert

Great contrib!

Warm regards from sunny México!
:-)
Pepe

giorgio79’s picture

Have you guys seen this one?

http://drupal.org/project/lineage

It is meant to address the same issue of super large taxos with the nested tree algorithm.

sdrycroft’s picture

Sorry giorgio79, you've not read the module's project page properly. It definitely isn't designed to improve Drupal's handling of large taxonomies.

giorgio79’s picture

Thanks but I think I have. :) Taxonomy Lineage as I understand it is meant to handle super large hierarchical taxonomies, and the emphasis there is on hierarchy...Currently Drupal Taxonomy module stores only the parent child relationships and it has to rejoin itself to get the full hierarchical taxonomy line for a multi level hierarchy (like geo hierarchy such as continent->country->city->locality...). Taxonomy Lineage solves this issue. As I understand yours is good for free tagging single level taxos as well...

robertDouglass’s picture

Back to the topic of this issue:

The module requires a small patch to "taxonomy.module" which is fully backwards compatible, meaning that it will work with, or without leftandright installed. Installation of the module requires the taxonomy.module file to be replaced with the one provided.

Is there a link to the D7 issue queue where you are proposing this change?

sdrycroft’s picture

Version: 5.x-1.1 » 7.x-1.x-dev

The 7.x version of this module is much more likely to fit into core than the 6.x version. For this reason, I'll leave this issue open and possibly look at suggesting it in the future.

lifepillar’s picture

Honestly, I hope this will never get into the core.

Let me be clear: the author has made a decent job in developing a nested set implementation for taxonomies (although some parts, like the insertion of terms, might be improved) and the module works as expected, so I am not blaming the author. It is the nested set model itself that is flawed, in principle and in practice. First of all, in the relational model, relationships are established by equality of values, not by implicit rules as the one dictated by the nested set model. A consequence of the use of an implicit rule is that referential integrity is not possible. Second, taxonomies in Drupal are not tree-like structures, but, in general, they are directed acyclic graphs (directed graphs without cycles), so nested sets do not fit. Third, queries in the nested set model are not so intuitive (this may also be seen as implied by the first point I've made). Fourth, the nested set model is not as efficient as it may seem. A simpler, cleaner, more general and more effective solution would consist in explicitly maintaining the ancestor-descendant relationship between taxonomy terms (or, more precisely, the transitive closure of the term graph), in a table like AD(ancestor,descendant). The trade-off is that more space is needed to store the AD table (but, even at 16 bytes/row, it's maybe a few 10MB for millions of terms).

The last point is worth elaborating a bit, as the goal of the module is to provide a scalable solution for taxonomies. Assume that hierarchies are tree-like (otherwise, nested sets cannot be applied) and consider just two operations: the insertion of a node and the query “all the ancestors of a given node n”. How do the nested set model and the transitive closure model compare wrt those two tasks?

  • Insertion: with nested sets, you need to update all the nodes above and to the right of the node that you are inserting. In the worst case and on the average, that means O(n) updates, with n number of terms, no matter what the shape of the taxonomy is. In the transitive closure model, your need to insert a number of rows proportional to the depth of the taxonomy. This is also O(n) insertions in the (rather unusual) worst-case of a linear chain of terms, but, if the taxonomy is approximately a balanced tree, it is O(log n) insertions, and it may become even less in practice if the taxonomy is shallow. Besides, in the former case, you have to query all the nodes satisfying certain inequality conditions, while in the latter case, you just have to search for the pairs (ancestor,descendant) where the descendant is a specific node, thus making a better use of indexing structures.
  • Ancestor query: in the nested set model, you need to find all the nodes whose [lft,rgt] interval contains the [lft,rgt] interval of the given node, which requires a join of the corresponding table by itself on an inequality condition. The possibilities for optimization are somewhat limited. In the transitive closure model, all you need to do is essentially “select ancestor from AD where descendant = n”, which, again, can make effective use of indexes.

The above are just two examples, but they are significant because (1) insertion is fundamental to populate the taxonomy, and (2) the ancestor/descendant queries are one of the main reasons to improve over the current implementation.

The transitive-closure approach has been widely studied in the scientific literature: see, for example, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.53 or http://www.cs.wright.edu/people/faculty/gdong/ViewMaintenanceSurveySigmo... or, if you do not want to delve into too much technicality, take a look at this presentation: http://www.slideshare.net/billkarwin/models-for-hierarchical-data.

So, my suggestion to the author is to change the implementation into one based on a transitive-closure table, with a gain in design, clarity and performance.

jantoine’s picture

First I would like to thank sdrycroft for his contribution. I spent a lot of time yesterday researching the different options for storing hierarchical data in a relational database. The following link had a lot of good information: http://stackoverflow.com/questions/4048151/what-are-the-options-for-stor...

#9 by druido caused me to investigate the closure table method and I must say after having researched it, it seems to offer the best balance. I also happened to find a d.o project that is already using this method here: http://drupal.org/project/taxonomy_edge.

Is it possible that we could merge this project into the Taxonomy Edge module, or is there a specific reason to keep this as a separate project? Many hands make light work, and if we can merge all the modules that attempt to fix the taxonomy issues in Drupal, we may have a chance of getting them into core for D8: #1207326: Refactor taxonomy hierarchy API for performance, consistency, and convenience

fgm’s picture

Axel.rutx mentioned my closure table implementation for the tree module about this issue so here is the backlink: http://drupal.org/node/1504720

Note that this is the "closuse table with length" variant, not the pure closure table model, which suffers from difficulties in accessing direct parent and direct children as opposed to full super- and sub-trees.

sdrycroft’s picture

From a quick investigation, it looks like the Taxonomy edge module offers the best solution for this problem, although a better solution that fixes the underlying issue, why are we calling taxonomy_get_tree() on very large vocabularies, is required.

sdrycroft’s picture

Issue summary: View changes
Status: Active » Closed (fixed)