Hello,

Here is a tree.inc include file wich implements mixed approach between Using rational numbers to key nested sets and the classical taxonomy_hierarchy method, currently used in drupal.
this can be useful for taxonomy and maybe comments.
tree.inc includes the database schema
test.inc.php needs to be edited to correct the include path for tree.inc.
I did not create a patch as this is experimental for the moment. Any remarks will be appreciated :).

Members fund testing for the Drupal project. Drupal Association Learn more

Comments

drico’s picture

FileSize
11.23 KB

patch file

drico’s picture

patch file

drico’s picture

FileSize
2.04 KB

txt version of the test.inc.php file for testing purposes.

catch’s picture

Status: Active » Needs work

So some more background on this. Since the EOL taxonomy sprint, we've been discussing a lot about ways to denormalise taxonomy module's hierarchy storage so we don't need to do crazy recursive functions any more. There's the http://drupal.org/project/leftandright module by sdrycroft which implements nested sets, and we already have some materialised path in core with the menu system and comment's threading. Materialised path has storage issues with very deep taxonomies, not to mention multiple parents. Nested sets need to be rebuilt when something is inserted or moved within the tree.

However the nested sets keyed with rational numbers implementation allows for the light storage needs and fast retrieval of nested sets, without the terrible update woes - since there's always an available rational number you can use. This gives us the best of both approaches, and while the patch needs tidying up, it looks like not too much code weight either.

Suggested steps for this:
* Get the patch in line with core coding standards and using dbtng for query building.
* At the same time write tests for taxonomy_get_tree taxonomy_get_children/get_parents (we have some coverage, could use more).
* Convert taxonomy_get_tree() to use the new API, hopefully combined with multiple load so we only get the tree then load the objects into it.
* Looks at comments and other core hierarchies to see if they could benefit.

pwolanin’s picture

catch - if this is really the right solution, then we should move menu links to it too?

catch’s picture

Title: new implementation for database tree parsing ( taxonomy / comment ) » New implementation for database tree parsing ( taxonomy / comment )
Priority: Normal » Critical

pwolanin: if it does everything which that paper promises, and especially if we can distill it into an easy to use API, then it'd be great to have everything using the same system. Menu is still pretty fresh though, so that seems like a consistency patch after it goes in to fix the monsters though.

Wim Leers’s picture

Now *this* is interesting stuff! First time that I'll actually *like* reading a mathematical paper :)

Thanks drico & catch!

Subscribing.

chx’s picture

pwolanin, wtf? When I wanted to add a very similar notation to the menu system, you ran screaming. See http://drupal.org/node/141866/revisions/159158/view .

The problem with this system (and any other similar) is that we need to understand what depth can we describe with floats as used by database systems supported by Drupal. That will be fun. The patch is full of whitespace errors as well. I would ask Mr. Tropashko (he answers emails :) ) as well what he thinks of Mr. Hazel's encoding.

drico’s picture

Hi

I will be working tonight to make the patch more drupal 7 compliant ( query whitespace etc ).
About the float problem, I used bcdiv php function and bcscale to determine precision on the float value.
So for the float value field i'am thinking on putting it as a varchar(30) instead of a DOUBLE (10,9). That would clear the float depth problem maybe.

sdrycroft’s picture

Correct me if I'm wrong, but a path of 8.32.15.15.4.1.3.43.4 (an example from the Catalogue of Life classification) would give rise to values for the denominator and numerator of ~170M and ~1.5G respectively. This means that we'd be forced to use the float data type, rather than int, resulting in slower queries. I've investigated using the Tropashko encoding before, and found it resulted in much slower search queries compared to both regular nested sets, and materialised path tree encoding.

drico’s picture

I will make a test with this value and give feedback

drico’s picture

FileSize
3.01 KB
14.18 KB

Hi
This is a new patch file, more compliant I hope. And also a new test file, using values given by sdrycroft .
I needed to add some large bcscale value, as suspected.
So for the test I ran, I inserted 2454 rows in 8 seconds, and moved 2105 rows in 3s inside the tree.

chx’s picture

You do not need to build relatively simple SQL queries like this -- unless you want to make sure they are alterable which is not the case here IMO. count($parents->parent)>0 hardly needed just use $parents->parent is enough. Kindly please run coder module over your code it's a pain to read!

Edit: you are using normal maths operators instead of bcmath at places? Won't that be a problem?

drico’s picture

FileSize
14.34 KB

Hello,
Here is a coder compliant ( I hope ) patch.
I'm gonna make some optimization now, add some other functions, and correct bugs.
Any remarks appreciated.

catch’s picture

This is looking better - some more code style issues though:

All comments should start with a capital letter and end with a full stop.

Is it worth putting the database table creation into system.install at this point? If so it needs to use schema API - can copy the way cache does it.

* @param $nv
*  numerator value of the term

Should be

* @param $nv
*   Numerator value of the term.

pflv nv, dv etc. - all of these need to be unabbreviated for both variable names and column names.

+    #top level insert
+    #we need in this case the vid in parents

We don't use C style comments, // is fine.

Did you do any benchmarking on tree selection yet, or just insert/update?

@chx, we need to use db_select() for everything here because $table is dynamic - that gets us pluggable/re-usable tree storage same as cache.inc

drico’s picture

@catch : thank you very much for your reply, I'm gonna add some more comments, the right way :). And i'm gonna benchmark some tree retrieval also.

drico’s picture

Just for info :
the retrieval of the tree of a term took 0.03s with a tree of 3 parents and 1804 children
the retrieval of the tree of a term took 0.02s with a tree of 6 parents and 608 children

drico’s picture

FileSize
18.65 KB

ok naming seems ok now,
time for comments then : database creation, functions renaming (?) and parameter change,optimization.

chx’s picture

I think I mentioned already that using the SELECT query builder for these relatively simple queries is unnecessary and bloat. Rule of thumb: only use the select builder if any code, be that yours or someone else's needs to change the query. A static SELECT... query can be ran just fine through db_query.

drico’s picture

@chx, sorry but like said catch, i thought it was needed because of '$table' witch I think is useful for other module (than taxonomy) usage.
I'm ok to change it, but as I'm not really used to the database layer yet, can you check with catch, that it is really not usefuf ?
thanks

catch’s picture

chx: we can't/shouldn't do db_query('SELECT * FROM {' . $table . '}...' so have to use the query builder here as a matter of good practice, even if it's unnecessary otherwise.

chx’s picture

Oh I missed that! Fine then. Next question, why are we using db_and() when they are ANDed by default together anyways?

David Strauss’s picture

I think this approach is over-complicated and unnecessary. The "crazy recursive functions" can be avoided by walking up the taxonomy tree when you save a node's terms. The term_node table would get an additional column like "steps." Terms directly applied to a node would have a value of zero in this column. As we recurse up the tree, the number of "steps" increases. We end up with a row mapping every term associated with the node, no matter how far up the tree, to the node.

We then index (tid, steps). It becomes a fast query to find items linked to a tid with any recursive depth limit.

The only tricky part is when you modify a term's parent(s). You have to clear the non-zero steps for the term and insert the updated parents.

catch’s picture

@David, while that might help with taxonomy/term/%/$depth pages, it doesn't help at all for rendering the entire taxononomy tree, which is currently done for all non-autocomplete term selection (both node/add and term administration), and has a very wide range of uses cases and implementations in contrib.

David Strauss’s picture

@catch If the problem is having so many terms that it becomes slow to build a "select" element offering the whole tree and the question is "How can we build the tree faster?", we're asking the wrong question. If building the tree is slow, then we're putting way too many elements in the select element. We should find a different, better interface for selecting the terms.

(But I'm not saying this work won't see use elsewhere.)

catch’s picture

David, yes for the form widgets I'm in favour of getting something along the lines of hierarchical select module or chx's menu chooser patch into core - ideally to replace taxonomy, book and menu select lists. These have both performance and usability issues as they stand now, and the usability ones hit long, long before the queries start getting expensive.

That still leaves http://api.drupal.org/api/function/forum_get_forums/7 which uses taxonomy_get_tree to load the actual hierarchy rather than just calculating term/node associations. Now for forum.module we could do something specific - use materialised path, cache the tree etc. - since there's no multiple parents and likely to be < 1000 discussion forums and limited depth - but that still leaves us with a crappy API for getting trees and branches of trees even if we work around it for all the places it's actually used in core. comment.module's threading desperately needs a revamp too, and it looks like this approach could also be applicable there.

Now it might turn out that this solution won't scale up as much as it needs to but if it does, then we potentially fix a lot of related issues in one go.

drico’s picture

FileSize
21.47 KB

Hello,

A new file, with some more comments that I hope will make it clearer.

David Strauss’s picture

@catch Forums needs to have its own denormalized implementation for topic management. Even a perfectly optimized taxonomy system wouldn't fix the scalability issues in the forum module.

pwolanin’s picture

@catch - does the approach here support multiple parents?

If not, is it feasible to consider different implementations of materialized path? Assuming we used a 255 varchar column (rather than individual int column), stored numbers as hex, and had 32 bit ints as IDs (i.e. max of 8 chars), plus used a separator character, we could be guaranteed to handle a depth of 28. MySQl 5.0+ and Postgres support longer varchar columns, and of course SQLite is its own beast in terms of this.

I'm not sure what the performance of indexed varchar columns looks like in terms of matching and sorting, but at least all the algorithms involved are immediately understandable.

catch’s picture

@pwolanin - as far as I know it does - it'll treat each instance of the term as a different entity.

David Strauss’s picture

@pwoloanin: Official limits on VARCHAR column length are deceptive. Many relational databases only store the first X bytes in the main row record and store the rest elsewhere, which is little better than TEXT. Indexes have their own limits, independent of VARCHAR ones.

drico’s picture

Yes it does,
I'm quite sorry, i'm still finishing it, I hope it's gonna be ok in less than a week.

sun’s picture

subscribing

drico’s picture

FileSize
27.33 KB

Hello,

Here is an intermediate patch file.
I need to :
- optimize calculation in tree_move
- finish the tree_rebuild function, to be putted into cron to clean the tree after a subtree move.

As we are doing insertion into a level as the rightest son of this level, and manage children ordering with a position field, we can do subtree move quite efficiently :
we don't need to move every subtree of a level to insert a subtree between both of them: we just update the position field of the other children of this level.
we can continue to use inegality for filtering descendant of a node
we put all expensive calculations into cron, so insert move and delete should be quick for the end user.

Then, as some joins are made at the moment into taxonomy, I think we need to pass to all the filtering functions another table name, the field to join on, and the field we want to get back from the second table.
So we won't have to make two query to get a descendant and his name ( as with taxonomy_term_data and taxonomy_term_hierarchy ).

natuk’s picture

subscribing

sdrycroft’s picture

Priority: Critical » Normal

Please tell me this isn't seriously being considered for core.

drico’s picture

Why ?
It's pretty efficient even if it's not finished yet.

sdrycroft’s picture

I guess what needs to be asked is, what issues is this trying to fix with Drupal's current taxonomy? I wrote the leftandright (nested set) module as Drupal is unable to handle large taxonomies efficiently (or at all), this patch doesn't fix that issue.

I've attempted to load a large (2M terms) tree into a database using the algorithm that this patch is using, only to find that it results in denominator and numerator values of ~10^15. This results in having to use bigint columns (or in the case of the patch, varchar), which in turn results in having to use a high level of precision for the float_value column. Comparing varchar columns against each other is never going to be as fast as comparing int columns. If we can't handle large taxonomies using this patch, then what is the point in using it at all.

It's pretty efficient...

Really? I doubt that one very much, in fact, looking at the table layout, I'd say it's far from being efficient.

Assuming the purpose of this patch is to improve the speed at which taxonomies can be loaded into Drupal, then it's a large trade off against the speed at which taxonomies can be edited/added.

Drupal's current taxonomy system works for the vast majority of users, why is it necessary to complicate it, and slow it down, for what appears to be zero gain. What needs looking at, instead of the tree storage algorithm, is how taxonomy is being used, and whether or not we can work without methods like drupal_get_tree.

Mgccl’s picture

subscribe.
Nice idea. So this is still nested sets instead of nested intervals?
I don't know if what I'm going to say can also apply to nested sets. but nested intervals are also used to represent hierarchy.
The way to pick those rational numbers doesn't look optimal. The size grows pretty fast. I don't know if there is other ways to pick rational numbers. maybe nested intervals with Farey Fractions? with that, 1M records produce numerator/denominator below 30k, at least the data used in the article.
I could work on this for GSoC... if Farey fractions could work...

Scott Reynolds’s picture

Good to see people doing nested set work. Recently came across this concept when reading MySQL book. varchar columns seems to not be the way to go. I haven't had a chance to review this patch, but I am putting it on my radar.

I believe that this technique will allow the taxonomy landing pages to pull in all nodes with child tags of that term efficiently.

catch’s picture

Issue tags: +Performance

tagging.

catch’s picture

Version: 7.x-dev » 8.x-dev

This won't get into Drupal 7, we need to seriously revisit this in Drupal 8 to get it properly sorted out though.

giorgio79’s picture

Consider this as well for hierarchical taxonomy handling:
http://drupal.org/project/lineage
I believe it is the best solution for large hierarchical taxo handling.

xjm’s picture

Ran across this in checking to see what issues are already open about the taxonomy API.

Re: #43. Um. I'm not sure lineage is the best solution for anything other than building taxonomy views more simply. It stores strings in database fields, regexes them to render, and loops over big arrays to sort an render hierarchical taxonomy lists. The storage method may be faster for large, complicated hierarchies, but the module itself, while useful for views plugins, contains redundancies and probably isn't that performant.

xjm’s picture

pwolanin’s picture

We should look at entity_tree as the basis for a new (performant) tree API be used across menu_links, book, taxonomy, comment, etc

giorgio79’s picture

ivanjaros’s picture

pwolanin’s picture

@ivanjaros - we already use a variant on materialized path for menu links since Drupal 6.

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.