Problem/Motivation

This issue is to propose two significant optimizations to \Drupal\taxonomy\Plugin\views\argument\IndexTidDepth and \Drupal\taxonomy\Plugin\views\filter\TaxonomyIndexTidDepth.

  • the subselect the code generates is a multi-level left-join which is inefficient on MySQL because it cannot make use of indexes
  • in particular, the left join generates many duplicates
  • it generates a WHERE ... IN ( subselect ) clause, which is inefficient on MySQL if the subselect returns anything but a trivially small result set

Here is the complete query HEAD generates for a single TID query with a depth of 3.

SELECT DISTINCT node_field_data.created AS node_field_data_created, node_field_data.nid AS nid FROM node_field_data node_field_data LEFT JOIN taxonomy_index taxonomy_index ON node_field_data.nid = taxonomy_index.nid LEFT JOIN node__field_promote_states node__field_promote_states ON node_field_data.nid = node__field_promote_states.entity_id AND node__field_promote_states.field_promote_states_value = 'landing' WHERE (((node__field_promote_states.field_promote_states_value IS NULL)) AND (node_field_data.nid IN (SELECT tn.nid AS nid FROM taxonomy_index tn LEFT OUTER JOIN taxonomy_term__parent th ON th.entity_id = tn.tid LEFT OUTER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id WHERE (tn.tid = '353') OR (th1.entity_id = '353')))) AND ((taxonomy_index.status = '1') AND (node_field_data.status = '1') AND (node_field_data.type IN ('article', 'recipe'))) ORDER BY node_field_data_created DESC LIMIT 6 OFFSET 5;
+-------------------------+--------+
| node_field_data_created | nid    |
+-------------------------+--------+
|              1594378800 | 110686 |
|              1594360800 | 110446 |
|              1593946800 | 110321 |
|              1593342000 | 109946 |
|              1593241200 | 109896 |
|              1592823600 | 109591 |
+-------------------------+--------+
6 rows in set (0.49 sec)

Proposed resolution

The patch changes the code in two aspects:

taxonomy hierarchy subselect with union of specific inner joins instead of left join

Existing code

The existing code generates subselect following this pattern (tid 353, +3 levels):

explain SELECT tn.nid AS nid FROM taxonomy_index tn LEFT OUTER JOIN taxonomy_term__parent th ON th.entity_id = tn.tid LEFT OUTER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id LEFT OUTER JOIN taxonomy_term__parent th2 ON th1.parent_target_id = th2.entity_id WHERE (tn.tid = '353') OR (th1.entity_id = '353') OR (th2.entity_id = '353');
+----+-------------+-------+------------+-------+---------------+-----------+---------+------------------------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key       | key_len | ref                          | rows   | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+-----------+---------+------------------------------+--------+----------+--------------------------+
|  1 | SIMPLE      | tn    | NULL       | index | term_node     | term_node | 14      | NULL                         | 124367 |   100.00 | Using index              |
|  1 | SIMPLE      | th    | NULL       | ref   | PRIMARY       | PRIMARY   | 4       | instyle.tn.tid               |      1 |   100.00 | NULL                     |
|  1 | SIMPLE      | th1   | NULL       | ref   | PRIMARY       | PRIMARY   | 4       | instyle.th.parent_target_id  |      1 |   100.00 | NULL                     |
|  1 | SIMPLE      | th2   | NULL       | ref   | PRIMARY       | PRIMARY   | 4       | instyle.th1.parent_target_id |      1 |   100.00 | Using where; Using index |
+----+-------------+-------+------------+-------+---------------+-----------+---------+------------------------------+--------+----------+--------------------------+

The code in head has to scan all of the taxonomy_index rows.

New code

The patch generates much more efficient code using a UNION of specific inner joins, the replacement for the example above is:

explain SELECT tn.nid AS nid FROM taxonomy_index tn WHERE tn.tid = '353' UNION SELECT tn.nid AS nid FROM taxonomy_index tn INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id WHERE th1.entity_id = '353' UNION SELECT tn.nid AS nid FROM taxonomy_index tn INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id INNER JOIN taxonomy_term__parent th2 ON th1.parent_target_id = th2.entity_id WHERE th2.entity_id = '353';
+----+--------------+--------------+------------+------+--------------------------+------------------+---------+-----------------------+------+----------+--------------------------+
| id | select_type  | table        | partitions | type | possible_keys            | key              | key_len | ref                   | rows | filtered | Extra                    |
+----+--------------+--------------+------------+------+--------------------------+------------------+---------+-----------------------+------+----------+--------------------------+
|  1 | PRIMARY      | tn           | NULL       | ref  | PRIMARY,term_node,bundle | term_node        | 4       | const                 |  102 |   100.00 | Using index              |
|  2 | UNION        | th1          | NULL       | ref  | PRIMARY                  | PRIMARY          | 4       | const                 |    1 |   100.00 | Using index              |
|  2 | UNION        | th           | NULL       | ref  | PRIMARY,parent_target_id | parent_target_id | 4       | const                 |    5 |   100.00 | Using where; Using index |
|  2 | UNION        | tn           | NULL       | ref  | term_node                | term_node        | 4       | instyle.th.entity_id  |   17 |   100.00 | Using index              |
|  3 | UNION        | th2          | NULL       | ref  | PRIMARY                  | PRIMARY          | 4       | const                 |    1 |   100.00 | Using index              |
|  3 | UNION        | th1          | NULL       | ref  | PRIMARY,parent_target_id | parent_target_id | 4       | const                 |    5 |   100.00 | Using where; Using index |
|  3 | UNION        | th           | NULL       | ref  | PRIMARY,parent_target_id | parent_target_id | 4       | instyle.th1.entity_id |  223 |   100.00 | Using index              |
|  3 | UNION        | tn           | NULL       | ref  | term_node                | term_node        | 4       | instyle.th.entity_id  |   17 |   100.00 | Using index              |
| NULL | UNION RESULT | <union1,2,3> | NULL       | ALL  | NULL                     | NULL             | NULL    | NULL                  | NULL |     NULL | Using temporary          |
+----+--------------+--------------+------------+------+--------------------------+------------------+---------+-----------------------+------+----------+--------------------------+

Far fewer rows are scanned and indexes are used.

using an INNER join instead of a WHERE IN (...) for the taxonomy filter

if the outer query constructed by views is inefficient (as can easily be the case with any non-trivial LEFT-join), the query as a whole will still remain inefficient because the efficient subselect is used in a WHERE IN (...) which MySQL only uses as a filter on the outer query result.

This can be made much more efficient by using an INNER join. Here's an example for how the patch will change the SQL (reformatted for readability again)

Existing code + the above fix
SELECT DISTINCT node_field_data.created AS node_field_data_created, node_field_data.nid AS nid FROM node_field_data node_field_data LEFT JOIN taxonomy_index taxonomy_index ON node_field_data.nid = taxonomy_index.nid LEFT JOIN node__field_promote_states node__field_promote_states ON node_field_data.nid = node__field_promote_states.entity_id AND node__field_promote_states.field_promote_states_value = 'landing' WHERE (((node__field_promote_states.field_promote_states_value IS NULL)) AND (node_field_data.nid IN (SELECT tn.nid AS nid FROM taxonomy_index tn WHERE tn.tid = '353' UNION SELECT tn.nid AS nid FROM taxonomy_index tn INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id WHERE th1.entity_id = '353'))) AND ((taxonomy_index.status = '1') AND (node_field_data.status = '1') AND (node_field_data.type IN ('article', 'recipe'))) ORDER BY node_field_data_created DESC LIMIT 6 OFFSET 5;
+-------------------------+--------+
| node_field_data_created | nid    |
+-------------------------+--------+
|              1594378800 | 110686 |
|              1594360800 | 110446 |
|              1593946800 | 110321 |
|              1593342000 | 109946 |
|              1593241200 | 109896 |
|              1592823600 | 109591 |
+-------------------------+--------+
6 rows in set (0.08 sec)
New code
SELECT DISTINCT node_field_data.created AS node_field_data_created, node_field_data.nid AS nid FROM node_field_data node_field_data LEFT JOIN taxonomy_index taxonomy_index ON node_field_data.nid = taxonomy_index.nid LEFT JOIN node__field_promote_states node__field_promote_states ON node_field_data.nid = node__field_promote_states.entity_id AND node__field_promote_states.field_promote_states_value = 'landing' INNER JOIN (SELECT tn.nid AS nid FROM taxonomy_index tn WHERE tn.tid = '353' UNION SELECT tn.nid AS nid FROM taxonomy_index tn INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id WHERE th1.entity_id = '353') node_taxonomy_depth ON node_taxonomy_depth.nid = node_field_data.nid WHERE ((node__field_promote_states.field_promote_states_value IS NULL)) AND ((taxonomy_index.status = '1') AND (node_field_data.status = '1') AND (node_field_data.type IN ('article', 'recipe'))) ORDER BY node_field_data_created DESC LIMIT 6 OFFSET 5;
+-------------------------+--------+
| node_field_data_created | nid    |
+-------------------------+--------+
|              1594378800 | 110686 |
|              1594360800 | 110446 |
|              1593946800 | 110321 |
|              1593342000 | 109946 |
|              1593241200 | 109896 |
|              1592823600 | 109591 |
+-------------------------+--------+
6 rows in set (0.01 sec)

EXPLAIN is not so useful for this one because the join via query makes for complicated reading - but you can clearly see the performance improvement when running the query.

Overall gains

Use MySQL profiling we can compare the original query versus the final query...

+----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                                                                                                                                                                                                                                        |
+----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        1 | 0.49241525 | SELECT DISTINCT node_field_data.created AS node_field_data_created, node_field_data.nid AS nid FROM node_field_data node_field_data LEFT JOIN taxonomy_index taxonomy_index ON node_field_data.nid = taxonomy_index.nid LEFT JOIN node__field_promote_states node__field_promote_states ON node_field_data.n |
|        2 | 0.00789575 | SELECT DISTINCT node_field_data.created AS node_field_data_created, node_field_data.nid AS nid
FROM
node_field_data node_field_data
LEFT JOIN taxonomy_index taxonomy_index ON node_field_data.nid = taxonomy_index.nid
LEFT JOIN node__field_promote_states node__field_promote_states ON node_field_data.n |
+----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This shows that original query took 0.49241525 seconds to execute and the new query takes 0.00789575 seconds. This is 98% decrease in the time take to run the query!

Performance testing notes

node_field_data contains 34299 rows
taxonomy_index contains 129814 rows
taxonomy_term__parent contains 6712 rows
node__field_promote_states contains 96159 rows

Remaining tasks

User interface changes

N/a

API changes

N/a

Data model changes

N/a

Release notes snippet

Views that use either the taxonomy depth filter or taxonomy depth argument are massively more performant.

CommentFileSizeAuthor
#54 2133215.53_54.interdiff.txt7.37 KBdww
#54 2133215-54.patch23.9 KBdww
#53 2133215-53.patch24.05 KBalexpott
#53 50-53-interdiff.txt1.39 KBalexpott
#50 2133215-50.patch23.92 KBalexpott
#50 47-50-interdiff.txt5.21 KBalexpott
#47 2133215-47.patch23.92 KBalexpott
#47 42-47-interdiff.txt3.33 KBalexpott
#42 2133215-42.patch22.78 KBalexpott
#42 2133215-8.9.x-42.patch25.42 KBalexpott
#42 36-42-interdiff.txt4.77 KBalexpott
#40 small-site-graph.png383.59 KBalexpott
#39 taxo_plugin_perf.php_.txt4.61 KBalexpott
#39 head_performance graph.png414.41 KBalexpott
#39 performance graph.png370.03 KBalexpott
#36 2133215-36.patch26.46 KBalexpott
#36 31-36-interdiff.txt1.34 KBalexpott
#31 2133215-23.patch25.83 KBalexpott
#26 2133215-8.9.x-26.patch28.48 KBalexpott
#26 25-26-interdiff.txt645 bytesalexpott
#25 2133215-8.9.x-25.patch28.49 KBalexpott
#23 2133215-23.patch25.83 KBalexpott
#23 18-23-interdiff.txt2.48 KBalexpott
#18 2133215-18.patch25.84 KBalexpott
#18 16-18-interdiff.txt4.78 KBalexpott
#16 2133215-16.patch25.86 KBalexpott
#16 13-16-interdiff.txt3.79 KBalexpott
#14 2133215-8.9.x-14.patch26.64 KBalexpott
#13 2133215-13.patch23.85 KBalexpott
#13 9-13-interdiff.txt8.74 KBalexpott
#12 11-12-interdiff.txt1.27 KBalexpott
#12 2133215-8.9.x-12.patch23.33 KBalexpott
#11 2133215-8.9.x-11.patch22.07 KBalexpott
#9 2133215-9.patch20.54 KBalexpott
#9 8-9-interdiff.txt3.12 KBalexpott
#8 2133215-8.patch20.85 KBalexpott
#8 6-8-interdiff.txt8.63 KBalexpott
#6 2133215-6.patch16.94 KBalexpott
#2 term-hierarchy.php_.txt1.66 KBdawehner
#1 0001-Issue-2133215-generate-efficient-SQL.patch3.56 KBuplex_slink
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

uplex_slink’s picture

Status: Active » Needs review
FileSize
3.56 KB
dawehner’s picture

FileSize
1.66 KB

I totally believe that these kind of queries could get quite slow, but the question is whether its worth to change
the behaviour by default.

I tried to do some basic benchmarking setting up the content using the following script, which was still pretty fast.
How many entries do you have in the taxonom_index and term tables?

+++ b/modules/taxonomy/views_handler_argument_term_node_tid_depth.inc
@@ -107,31 +107,56 @@ class views_handler_argument_term_node_tid_depth extends views_handler_argument
+     *
+     * TODO: is there a cleaner way to do this?
+     */

Regarding this, I don't really know a much cleaner way to do it, table_queue is considered actually to be a internal data structure, so there was never really a reason to change it.

chr.fritsch’s picture

mysql> select count(*) from taxonomy_term_data;
+----------+
| count(*) |
+----------+
| 109702 |
+----------+
1 row in set (0.36 sec)

mysql> select count(*) from taxonomy_term_hierarchy;
+----------+
| count(*) |
+----------+
| 109429 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from taxonomy_index;
+----------+
| count(*) |
+----------+
| 4204213 |
+----------+
1 row in set (0.00 sec)

MustangGB’s picture

Category: Task » Feature request
Issue tags: +Performance
alexpott’s picture

Project: Views (for Drupal 7) » Drupal core
Version: 7.x-3.x-dev » 9.3.x-dev
Component: Code » taxonomy.module

This is definitely an issue - #1358412: Content: Has taxonomy term ID (with depth) query performance landed in Views 3 and it's a popular core patch. The problem with it is that it slower for MySQL only backends - although it might also be more performant when the db server is under load. Given we're unsure of the performance characteristics I think we should revisit the approach here - or at least something similar.

Moving this issue to the core queue as this is where the latest version of views live and this code is now part of the taxonomy module.

alexpott’s picture

Here's a patch that has quite a big impact on a large site with lots of content and plenty of rows in taxonomy_index - it's the scanning of this table that make the query slow. Also need to fix up TaxonomyIndexTidDepth as it uss the same logic. Perhaps we can move the logic into a trait to make it easy to share.

Status: Needs review » Needs work

The last submitted patch, 6: 2133215-6.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
8.63 KB
20.85 KB

Refactored the common stuff into a trait so \Drupal\taxonomy\Plugin\views\argument\IndexTidDepth and \Drupal\taxonomy\Plugin\views\filter\TaxonomyIndexTidDepth can share the common functionality.

alexpott’s picture

Hmmm... okay the approach in #8 doesn't result in any performance improvements. So I've moved to the union approach in the issue summary.

This results in a test site view going from 537.49 ms to under 200ms so that's quite the performance improvement.

Looking at the queries involved...

HEAD

explain SELECT tn.nid AS nid FROM taxonomy_index tn LEFT OUTER JOIN taxonomy_term__parent th ON th.entity_id = tn.tid LEFT OUTER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id LEFT OUTER JOIN taxonomy_term__parent th2 ON th1.parent_target_id = th2.entity_id WHERE (tn.tid = '353') OR (th1.entity_id = '353') OR (th2.entity_id = '353');
+----+-------------+-------+------------+-------+---------------+-----------+---------+------------------------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key       | key_len | ref                          | rows   | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+-----------+---------+------------------------------+--------+----------+--------------------------+
|  1 | SIMPLE      | tn    | NULL       | index | term_node     | term_node | 14      | NULL                         | 124367 |   100.00 | Using index              |
|  1 | SIMPLE      | th    | NULL       | ref   | PRIMARY       | PRIMARY   | 4       | instyle.tn.tid               |      1 |   100.00 | NULL                     |
|  1 | SIMPLE      | th1   | NULL       | ref   | PRIMARY       | PRIMARY   | 4       | instyle.th.parent_target_id  |      1 |   100.00 | NULL                     |
|  1 | SIMPLE      | th2   | NULL       | ref   | PRIMARY       | PRIMARY   | 4       | instyle.th1.parent_target_id |      1 |   100.00 | Using where; Using index |
+----+-------------+-------+------------+-------+---------------+-----------+---------+------------------------------+--------+----------+--------------------------+

PATCH

explain SELECT tn.nid AS nid FROM taxonomy_index tn WHERE tn.tid = '353' UNION SELECT tn.nid AS nid FROM taxonomy_index tn INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id WHERE th1.entity_id = '353' UNION SELECT tn.nid AS nid FROM taxonomy_index tn INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id INNER JOIN taxonomy_term__parent th2 ON th1.parent_target_id = th2.entity_id WHERE th2.entity_id = '353';
+----+--------------+--------------+------------+------+--------------------------+------------------+---------+-----------------------+------+----------+--------------------------+
| id | select_type  | table        | partitions | type | possible_keys            | key              | key_len | ref                   | rows | filtered | Extra                    |
+----+--------------+--------------+------------+------+--------------------------+------------------+---------+-----------------------+------+----------+--------------------------+
|  1 | PRIMARY      | tn           | NULL       | ref  | PRIMARY,term_node,bundle | term_node        | 4       | const                 |  102 |   100.00 | Using index              |
|  2 | UNION        | th1          | NULL       | ref  | PRIMARY                  | PRIMARY          | 4       | const                 |    1 |   100.00 | Using index              |
|  2 | UNION        | th           | NULL       | ref  | PRIMARY,parent_target_id | parent_target_id | 4       | const                 |    5 |   100.00 | Using where; Using index |
|  2 | UNION        | tn           | NULL       | ref  | term_node                | term_node        | 4       | instyle.th.entity_id  |   17 |   100.00 | Using index              |
|  3 | UNION        | th2          | NULL       | ref  | PRIMARY                  | PRIMARY          | 4       | const                 |    1 |   100.00 | Using index              |
|  3 | UNION        | th1          | NULL       | ref  | PRIMARY,parent_target_id | parent_target_id | 4       | const                 |    5 |   100.00 | Using where; Using index |
|  3 | UNION        | th           | NULL       | ref  | PRIMARY,parent_target_id | parent_target_id | 4       | instyle.th1.entity_id |  223 |   100.00 | Using index              |
|  3 | UNION        | tn           | NULL       | ref  | term_node                | term_node        | 4       | instyle.th.entity_id  |   17 |   100.00 | Using index              |
| NULL | UNION RESULT | <union1,2,3> | NULL       | ALL  | NULL                     | NULL             | NULL    | NULL                  | NULL |     NULL | Using temporary          |
+----+--------------+--------------+------------+------+--------------------------+------------------+---------+-----------------------+------+----------+--------------------------+
alexpott’s picture

Another advantage of the union approach is that the sub query returns less rows. On the site I'm testing on the sub query in HEAD returns 462 rows whereas the patch with unions returns 219. If I add a distinct the HEAD approach returns 219 and I've checked they contain existing the same node IDs.

alexpott’s picture

Here's a patch for 8.9.x to enable testing on sites that are using #1358412: Content: Has taxonomy term ID (with depth) query performance - this will help us test this patch on production sites that are using that patch.

alexpott’s picture

Whoops forgot to add the testing plugins into views data.

alexpott’s picture

Back to implementing everything in the issue summary - this does part 2:

using an INNER join instead of a WHERE IN (...) for the taxonomy filter

alexpott’s picture

Here's the patch for 8.9.x

The last submitted patch, 13: 2133215-13.patch, failed testing. View results

alexpott’s picture

Fixing the random fail during to ordering. Currently all nodes have the same created time because REQUEST_TIME doesn't change in the test runner. Added a sort and fixed up the created times so that we testing what we think we're testing rather than the somewhat arbitrary DBs emit data when the sort is sorting on data that is equal.

daffie’s picture

Status: Needs review » Needs work

First quick review:

  1. +++ b/core/modules/taxonomy/src/Plugin/views/argument/IndexTidDepth.php
    @@ -101,40 +103,13 @@ public function query($group_by = FALSE) {
    +    $this->addJoin(Database::getConnection(), $tids, /*$this->options['depth'], $this->tableAlias*/);
    
    +++ b/core/modules/taxonomy/src/Plugin/views/filter/TaxonomyIndexTidDepth.php
    @@ -70,32 +68,7 @@ public function query() {
    +    $this->addJoin(Database::getConnection(), $this->value);
    

    Can we replace Database::getConnection() with $this->query->getConnection(). Maybe not needed as a parameter.

  2. +++ b/core/modules/taxonomy/tests/src/Kernel/Views/TaxonomyTermArgumentDepthTest.php
    @@ -0,0 +1,201 @@
    +    // Create a hierarchy 3 deep. Note the parent setup function creates two
    +    // top-level terms w/o children.
    

    I do not think that this comment is correct. The hierarchy is 5 levels deep.

  3. +++ b/core/modules/views/src/Plugin/views/join/InSubquery.php
    @@ -0,0 +1,86 @@
    + *   'left_table_query' => $query,
    

    In the view subquery plugin we have the parameter "left_query" and here we add the parameter "left_table_query". Maybe better for consistency to name it similar?

alexpott’s picture

Category: Feature request » Bug report
Priority: Normal » Major
Status: Needs work » Needs review
FileSize
4.78 KB
25.84 KB

Thanks for the review @daffie.
Re:
1. Sure - fixed.
2. Fixed
3. The "left_query" in \Drupal\views\Plugin\views\join\Subquery is badly named. It's actually being as the left field in a join. What we need here is a subquery being used as the left table. The join plugin added here is badly named... I'm changing it to LeftTableSubquery as that is accurate.


Here are some performance stats from a real site on 8.9.x - 34299 nodes, 6712 terms and 129814 rows in the taxonomy index table.

HEAD

Run 1:
Query build time 10.39 ms
Query execute time 836.03 ms

Run 2:
Query build time 7.07 ms
Query execute time 520.63 ms

#1358412: Content: Has taxonomy term ID (with depth) query performance

Run 1:
Query build time 18.42 ms
Query execute time 13.16 ms

Run 2:
Query build time 41.45 ms
Query execute time 13.9 ms

THIS PATCH

Run 1:
Query build time 14.6 ms
Query execute time 19.3 ms

Run 2:
Query build time 10.14 ms
Query execute time 14.23 ms


Given that views using this filter or argument can cause sites to become very unresponsive I'm changing this to be a major bug with the views system. The fact that this issue exists and that #1358412: Content: Has taxonomy term ID (with depth) query performance has been committed to Views 3 for Drupal 7 shows this is a real problem that people are hitting.

I think the solution on this issue is superior to #1358412: Content: Has taxonomy term ID (with depth) query performance because it results in less queries or accesses to cache - since that one has to load all the taxonomy terms involved and it is obviously a massive improvement on HEAD. Additionally I think solutions like the new module linked in #1358412-84: Content: Has taxonomy term ID (with depth) query performance because maintaining more indexes and tables to speed up taxonomy hierarchy has additional costs at taxonomy and node save time,

alexpott’s picture

Issue summary: View changes

Fixing issue summary up.

alexpott’s picture

Issue summary: View changes

More issue summary fixes.

daffie’s picture

Patch looks good!

  1. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,64 @@
    +    $definition['left_table'] = 'node_taxonomy_depth';
    

    Why do we have this line? It is not used in the plugin. Can it be removed?

  2. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,64 @@
    +    $definition['left_table_query'] = $subquery;
    +    $join = Views::pluginManager('join')->createInstance('left_table_subquery', $definition);
    
    +++ b/core/modules/views/src/Plugin/views/join/LeftTableSubquery.php
    @@ -0,0 +1,86 @@
    + *   'left_table_query' => $query,
    ...
    + * - left_table_query: The subquery to use as the left table of the join clause.
    ...
    + * @ViewsJoin("left_table_subquery")
    ...
    +    assert($this->configuration['left_table_query'] instanceof SelectInterface);
    +    $this->leftTableQuery = $this->configuration['left_table_query'];
    

    Can we choose either for "left_table_query" or for "left_table_subquery"? We have have just one then there is just one mistake less to make. Either one is fine for me.

alexpott’s picture

Issue summary: View changes

Updated the first set of queries for D9. Working on changing the second part.

alexpott’s picture

@daffie nice review - great points. Patch attached addresses both of them and fleshes out the documentation on LeftQueryJoin.

alexpott’s picture

Title: make views_handler_argument_term_node_tid_depth efficient » Fix IndexTidDepth views argument plugin and TaxonomyIndexTidDepth views filter plugin performance
alexpott’s picture

Here's the patch for 8.9.x that also works with the new plugins created by the alternative approach patch... I need it for testing on real sites...

alexpott’s picture

Hmmm #25 doesn't work on D8 because it's getConnection() doesn't exist... ho hum.

alexpott’s picture

Issue summary: View changes

Updated the remaining queries in the issue summary to D9.

alexpott’s picture

Issue summary: View changes
alexpott’s picture

Issue summary: View changes

.

alexpott’s picture

Issue summary: View changes
alexpott’s picture

Re-upping the D9 patch so it's the latest one on the issue so people know what to review.

The last submitted patch, 25: 2133215-8.9.x-25.patch, failed testing. View results

alexpott’s picture

Issue summary: View changes

Adding CR for the new join plugin to the issue summary.

alexpott’s picture

Issue summary: View changes
daffie’s picture

All the code changes look good to me.
There is testing added test the code changes.
The IS and the CR are in order.
For me it is RTBC.

alexpott’s picture

Thanks for the reviews @daffie!

We're missing an empty post update to ensure the new views join plugin can be found and we should also address the @todo in the new join plugin. I sat in xdebug to see what is going on with the adjusted flag and reading \Drupal\views\Plugin\views\query\Sql::adjustJoin() makes it clear that setting this to TRUE is a good idea since in this case the left table is a query and the alias is decided by the call to addRelationship(). Maybe this will need adjusting it we start allowing such joins to be configured via hook_views_data() but that feels like a very odd use-case. This type of join needs to be added programatically via some other sort of plugin. Like the existing \Drupal\views\Plugin\views\join\Subquery plugin.

daniel.bosen’s picture

Status: Reviewed & tested by the community » Needs work

I tested the new approach with different terms on our projects. The results were very situation-dependent. When I choose a term, that is the root of a bigger tree, the subselect of the WHERE ... IN() returns a lot of rows, and we actually see a performance regression.

The results for HEAD query:

SELECT DISTINCT node_field_data.created AS node_field_data_created,
                node_field_data.nid     AS nid
FROM   node_field_data node_field_data
           LEFT JOIN taxonomy_index taxonomy_index
                     ON node_field_data.nid = taxonomy_index.nid
           LEFT JOIN node__field_promote_states node__field_promote_states
                     ON node_field_data.nid = node__field_promote_states.entity_id
                         AND node__field_promote_states.field_promote_states_value =
                             'landing'
WHERE  ( (( node__field_promote_states.field_promote_states_value IS NULL ))
    AND ( node_field_data.nid IN (SELECT tn.nid AS nid
                                  FROM   taxonomy_index tn
                                             LEFT OUTER JOIN taxonomy_term__parent th
                                                             ON th.entity_id = tn.tid
                                             LEFT OUTER JOIN taxonomy_term__parent th1
                                                             ON th.parent_target_id =
                                                                th1.entity_id
                                  WHERE  ( tn.tid = '718' )
                                     OR ( th1.entity_id = '718' )) ) )
  AND ( ( taxonomy_index.status = '1' )
    AND ( node_field_data.status = '1' )
    AND ( node_field_data.type IN ( 'article', 'recipe' ) ) )
ORDER  BY node_field_data_created DESC
    LIMIT  6 offset 5;

+-------------------------+--------+
| node_field_data_created | nid    |
+-------------------------+--------+
|              1625284800 | 118893 |
|              1625207400 | 118916 |
|              1625126400 | 118902 |
|              1625122800 | 118394 |
|              1624968000 | 118888 |
|              1624867200 | 118756 |
+-------------------------+--------+
6 rows in set (0.014 sec)

With new code:

SELECT DISTINCT node_field_data.created AS node_field_data_created,
                node_field_data.nid     AS nid
FROM   node_field_data node_field_data
           LEFT JOIN taxonomy_index taxonomy_index
                     ON node_field_data.nid = taxonomy_index.nid
           LEFT JOIN node__field_promote_states node__field_promote_states
                     ON node_field_data.nid = node__field_promote_states.entity_id
                         AND node__field_promote_states.field_promote_states_value =
                             'landing'
           INNER JOIN (SELECT tn.nid AS nid
                       FROM   taxonomy_index tn
                       WHERE  tn.tid = '718'
                       UNION
                       SELECT tn.nid AS nid
                       FROM   taxonomy_index tn
                                  INNER JOIN taxonomy_term__parent th
                                             ON tn.tid = th.entity_id
                                  INNER JOIN taxonomy_term__parent th1
                                             ON th.parent_target_id = th1.entity_id
                       WHERE  th1.entity_id = '718') node_taxonomy_depth
                      ON node_taxonomy_depth.nid = node_field_data.nid
WHERE  (( node__field_promote_states.field_promote_states_value IS NULL ))
  AND ( ( taxonomy_index.status = '1' )
    AND ( node_field_data.status = '1' )
    AND ( node_field_data.type IN ( 'article', 'recipe' ) ) )
ORDER  BY node_field_data_created DESC
    LIMIT  6 offset 5;

+-------------------------+--------+
| node_field_data_created | nid    |
+-------------------------+--------+
|              1625284800 | 118893 |
|              1625207400 | 118916 |
|              1625126400 | 118902 |
|              1625122800 | 118394 |
|              1624968000 | 118888 |
|              1624867200 | 118756 |
+-------------------------+--------+
6 rows in set (0.124 sec)

The queries are the same as in the issue description, just different tids

alexpott’s picture

Thanks for the review @daniel.bosen. It seems that the MySQL is very sensitive to this part...

it generates a WHERE ... IN ( subselect ) clause, which is inefficient on MySQL if the subselect returns anything but a trivially small result set

I think we can remove this part of patch and discuss when and how to add that. Without the left join using a query I see a speed up for both 353 and 718 with the dataset we're looking at.

alexpott’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
370.03 KB
414.41 KB
4.61 KB

I've some extensive performance testing of this patch on every term with results on the production site I have access to. The difference in the performance of JOIN (subselect) ON and the WHERE ... IN ( subselect ) is determined by the number of terms returned by the sub-select.

JOIN (subselect) ON vs WHERE ... IN ( subselect )

Graph showing the performance of 'WHERE ... IN ( subselect )' in blue pluses and 'JOIN (subselect) ON' in green circles. The x axis is the number of rows returned by the subselect. The y axis is the time in milliseconds for the query.
EDIT: the x-axis is supposed to be labelled "Number of node IDs returned by the subquery" as that's what the subquery lists.

This shows that the variation in the performance of JOIN (subselect) ON is a linear growth compared to the number of terms. This is very different for the WHERE ... IN ( subselect ) which shows an exponential decrease in time taken as the number of terms returned by the subquery increases.

Therefore we have a tradeoff to make. I think the fact that linear growth in slow and even when there a over 10000 nids return by the subquery the query still only takes 50ms means that we should continue with the UNIONS subquery + JOIN (subselect) ON approach (i.e. the current patch in #36).

HEAD vs JOIN (subselect) ON

Just to show how HEAD compares to the new code...
Graph showing the performance of 'HEAD' in magenta squares and 'JOIN (subselect) ON' in green circles. The x axis is the number of rows returned by the subselect. The y axis is the time in milliseconds for the query.
EDIT: the x-axis is supposed to be labelled "Number of node IDs returned by the subquery" as that's what the subquery lists.

The performance of HEAD when the subsquery only returns a few rows is dire.

Notes

  • The script to generate the numbers is attached.
  • When analysing the query it was found that the production site is unnecessary adding a filter on $data['taxonomy_index']['status'] - this results in an additional join to taxonomy_index to filter by taxonomy_index.status - but this is the node's status and doesn't need this join because we have a filter on node_field_data.status already. This is a problem that can be fixed on the production site I'm performance testing. Removing this unnecessary filter narrows the gap between JOIN (subselect) ON and the WHERE ... IN ( subselect ) when dealing with subquery that return large numbers of node IDs.
alexpott’s picture

FileSize
383.59 KB

I've also looked at a much smaller production site with only 1874 nodes and 14925 rows in taxonomy term index and 2227 terms. The performance problems are not as significant but there are still benefits of moving to this approach.

  • For HEAD the slowest TID was 92.7ms and the fastest was 1.8ms
  • For WHERE ... IN ( subselect ) the slowest was 56.7ms and the fastest was 1.9ms
  • For JOIN (subselect) ON the slowest was 5.2ms and the fastest was 1.6ms

The graph shows a very similar distribution of times...

Graph showing query performance. X-axis is the number node IDs returned by the sub query. Y-axis is the time for the query in milliseconds. The green circle is for the query generated by HEAD. The blue triangle is for the query generated by the patch in comment #11. The grey cross is for the query generated by the patch in comment #36.

daniel.bosen’s picture

Status: Needs review » Reviewed & tested by the community

@alexpott Thanks for all the nice graphs and performance investigations! I can confirm, that the example I had, was an outlier and that most of the other terms are quicker with the new code.
So back to RTBC.

alexpott’s picture

I was just reviewing #3221933: PHP Notice when using "left_formula" in views join and through that I learnt about \Drupal\taxonomy\Plugin\views\relationship\NodeTermData::query() and in there we do a subquery join! Using the standard views join! So there's no need to add the new plugin. Which means no update function and this could potentially go in a patch release.

daffie’s picture

Status: Needs review » Reviewed & tested by the community

The previous added new plugin has been removed.
Also from the IS and the CR.
All code changes look good to me.
The added testing is not removed.
Back to RTBC.

alexpott’s picture

I've been investigate why we see the patch in #42 slow down whereas HEAD and the WHERE ... IN ( subselect ) gets quicker. It turns out that this is due to how MySQL's query planner is working. It chooses a less optimal query plan for JOIN (subselect) ON when the number of nodes returned by the subquery increases. However it possible to use MySQL query hints to improve this. Against my test site:

SELECT
DISTINCT node_field_data.created AS node_field_data_created, node_field_data.nid AS nid
FROM
node_field_data node_field_data
LEFT JOIN taxonomy_index taxonomy_index ON node_field_data.nid = taxonomy_index.nid
LEFT JOIN node__field_promote_states node__field_promote_states ON node_field_data.nid = node__field_promote_states.entity_id AND node__field_promote_states.field_promote_states_value = 'landing'
INNER JOIN (SELECT tn.nid AS nid
FROM
taxonomy_index tn
WHERE tn.tid = '718' UNION SELECT tn.nid AS nid
FROM
taxonomy_index tn
INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id
INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id
WHERE th1.entity_id = '718' UNION SELECT tn.nid AS nid
FROM
taxonomy_index tn
INNER JOIN taxonomy_term__parent th ON tn.tid = th.entity_id
INNER JOIN taxonomy_term__parent th1 ON th.parent_target_id = th1.entity_id
INNER JOIN taxonomy_term__parent th2 ON th1.parent_target_id = th2.entity_id
WHERE th2.entity_id = '718') node_taxonomy_depth ON node_field_data.nid = node_taxonomy_depth.nid
WHERE ((node__field_promote_states.field_promote_states_value IS NULL)) AND ((taxonomy_index.status = '1') AND (node_field_data.status = '1') AND (node_field_data.type IN ('article', 'recipe')))
ORDER BY node_field_data_created DESC
LIMIT 11 OFFSET 0

Takes 442 ms.

If you start the query with...

SELECT
/*+ JOIN_PREFIX(node_field_data) */

it takes 28 ms!

Unfortunately core does not support adding query optimisation hints at all so I'm going to create a follow-up to discuss adding support so we can leverage them here to massively improve performance even when there are loads and loads of nodes that have the terms. I don't think any of this revokes the rtbc as the new query is much much more performant and consistently performant.

daniel.bosen’s picture

I am not so sure about the query hints, they tend to be very use-case specific. But yes, let's discuss this separately. This patch is a huge performance improvement RTBC +1

Lendude’s picture

This looks really great, and LOVE the graphs.

But for all the things that the new method does here, it feels a little light on documentation, a little bit like the code in HEAD ;)

Just some things I think would benefit from some extra explanation:

  1. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,66 @@
    +   * Builds a performant depth subquery and adds it as a join to the query.
    +   *
    +   * @param string|array $tids
    +   *   The terms IDs to do a depth search for.
    

    We have some great examples of the structure of the query that is being build here, maybe add that to the doc block?

  2. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,66 @@
    +  protected function addJoin($tids): void {
    

    Seems like a pretty minimalist name for all the work it is doing, feels like it deserves something better ;)

  3. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,66 @@
    +    if ($this->options['depth'] !== 0) {
    +      // Create the queries to build the union.
    

    'the union' is the first reference to the fact that we will be using a UNION here, so it feels like this could use more introduction, but maybe adding the structure to the main docblock would suffice?

  4. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,66 @@
    +      if ($this->options['depth'] > 0) {
    ...
    +      foreach (range(1, abs($this->options['depth'])) as $count) {
    ...
    +        if ($this->options['depth'] > 0) {
    

    We probably could use some explanation as to what is going on with the juggling with depth and negative depth.

alexpott’s picture

Thanks for the review @Lendude the patch attached addresses #46 - good points!

Lendude’s picture

@alexpott, thanks!

#47 totally satisfies my craving for docs, leaving RTBC

larowlan’s picture

This is some really great work folks. I've got one question about whether we can adapt this to make it more useful for contrib, and another that' s just a code-style nit.

  1. +++ b/core/modules/taxonomy/src/NodeDepthQueryTrait.php
    @@ -0,0 +1,92 @@
    +trait NodeDepthQueryTrait {
    ...
    +    $subquery = $connection->select('taxonomy_index', 'tn');
    +    $subquery->addField('tn', 'nid');
    ...
    +    $definition['left_field'] = 'nid';
    +    $definition['field'] = 'nid';
    ...
    +    $this->query->addRelationship('node_taxonomy_depth', $join, 'node', $this->relationship);
    

    As a maintainer of taxonomy_entity_index, it would be good for us to replicate this logic over there to improve performance of querying other entity-types.

    To that effect, is there any merit in making this not specific to nodes?

    I guess it would be tricky because the base table would need to change to taxonomy_entity_index and then we'd need to also have a condition for entity-type. So all of these parts would need to be customisable, plus additional conditions in the join.

  2. +++ b/core/modules/taxonomy/tests/src/Kernel/Views/TaxonomyTermArgumentDepthTest.php
    @@ -0,0 +1,201 @@
    +    $this->terms[0] = $first;
    +    $this->terms[1] = $second;
    +    $this->terms[2] = $third;
    +    $this->terms[3] = $fourth;
    +    $this->terms[4] = $fifth;
    

    Nit: Could any reason not to write this as

    $this->terms = [$first, $second, $third, $fourth, $fifth];
    
alexpott’s picture

Thanks for the review @larowlan.

1. I had a go at this... I changed the signature to protected function addSubQueryJoin($tids, string $index_table = 'taxonomy_index', string $index_join_entity_field = 'nid', string $index_taxonomy_field = 'tid'): void {. But then I realised that the taxonomy_entity_index module also adds revisions into the mix. I certainly think there might be scope for that module to leverage similar logic but making this re-usable for that will add much complexity and we'd basically have to add taxonomy_entity_index to core to test it.

2. Fixed - it's kinda from another test in core - \Drupal\Tests\taxonomy\Functional\Views\TaxonomyTermFilterDepthTest but we don't change that here so leaving that one alone.

Inspired by 1 I've removed a few of the hardcoded things that can be replaced by information on the class and I've renamed the trait to be about the taxonomy_index table because that's the table it is joining in.

dww’s picture

This all looks really great. Thanks to everyone who's worked on it, especially @alexpott!

Initial review. So far, only trivial doc nits. I've only skimmed the test coverage so far. I'll try to give that a closer look in the next few days...

  1. +++ b/core/modules/taxonomy/src/TaxonomyIndexDepthQueryTrait.php
    @@ -0,0 +1,93 @@
    + * - it uses inner joins and unions to select all the node IDs with the terms in
    

    Given the next line, s/inner joins/INNER JOIN/ and s/unions/UNION/?

  2. +++ b/core/modules/taxonomy/src/TaxonomyIndexDepthQueryTrait.php
    @@ -0,0 +1,93 @@
    + * - it uses INNER JOIN (subquery) rather instead of a WHERE IN (subquery).
    

    "rather instead" seem redundant. I'd remove "rather" and keep "instead of".

  3. +++ b/core/modules/taxonomy/src/TaxonomyIndexDepthQueryTrait.php
    @@ -0,0 +1,93 @@
    + * resulting join to node_field_data will be:
    

    JOIN?
    'node_field_data' (with single quotes)?

  4. +++ b/core/modules/taxonomy/src/TaxonomyIndexDepthQueryTrait.php
    @@ -0,0 +1,93 @@
    +   * Builds a performant depth subquery and adds it as a join to the query.
    ...
    +    // Add the subquery as a join.
    

    JOIN?

  5. +++ b/core/modules/taxonomy/src/TaxonomyIndexDepthQueryTrait.php
    @@ -0,0 +1,93 @@
    +   * @param string|array $tids
    +   *   The terms IDs to do a depth search for.
    

    Do we need to explain how the string option works here?

dww’s picture

Issue tags: +Bug Smash Initiative
alexpott’s picture

Fixed #51 - not in the way asked for - I rewrote the docs to include more detail.

dww’s picture

@alexpott re: #53: Of course you did. ;) That's even better. Thanks!

I finished my pedantic review of the tests, too. Had a brief head-scratch over point 1, then realized it's string view ID vs. ViewExecutable. I think I cleaned this up well in the attached patch:

  1. +++ b/core/modules/taxonomy/tests/src/Kernel/Views/TaxonomyTermArgumentDepthTest.php
    @@ -0,0 +1,197 @@
    +   * Views used by this test.
    ...
    +   * Stores the view used in the test.
    

    Huh? Oh, types. ;)

  2. +++ b/core/modules/taxonomy/tests/src/Kernel/Views/TaxonomyTermArgumentDepthTest.php
    @@ -0,0 +1,197 @@
    +    // Third-level and second-level term, depth -1, using a comma. Note that due
    +    // to performance the "and" meaning of comma is not supported.
    

    🤔 I don't see this limitation mentioned anywhere else in the patch. Not clear what it means here. Maybe this needs more explanation?

Would love to hear about #2 before this is truly RTBC.

Instead of commenting on everything else and having Alex re-roll it, rest of my review in the form of a new patch an interdiff. Basically:
A. Minor copy editing to comments.
B. Move a few things around to put related things closer together (e.g. at first I didn't notice we initialized $this->terms).
C. Add :void to all the new methods in the new test classes.
D. Remove $column_map and $assert_method local variables (which are only used once) and pass the right values directly to assertIdenticalResultsetHelper() (just like assertTermWithDepthResult() was already doing).

daffie’s picture

Status: Needs review » Reviewed & tested by the community

All the changes look good to me.
Back to RTBC.

alexpott’s picture

Re #54.2 This is a restriction based on the HEAD.


    $form['break_phrase'] = [
      '#type' => 'checkbox',
      '#title' => $this->t('Allow multiple values'),
      '#description' => $this->t('If selected, users can enter multiple values in the form of 1+2+3. Due to the number of JOINs it would require, AND will be treated as OR with this filter.'),
      '#default_value' => !empty($this->options['break_phrase']),
    ];

Here's the documentation in HEAD already about this.

dww’s picture

Re: #55: Thanks!
Re: #56: Ahh, right. Makes sense. Thanks again!

$RTBC++;

As the maintainer of views_taxonomy_term_name_into_id I'm extremely in favor of this improvement to these views plugins!

Thanks,
-Derek

  • catch committed 923b2e6 on 9.3.x
    Issue #2133215 by alexpott, dww, uplex_slink, dawehner, daffie, daniel....
catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed 923b2e6 and pushed to 9.3.x. Thanks!

This looks great, couldn't find anything to complain about and thanks for all the detailed performance investigation.

Seems like a small chance this could break hook_views_query_alter() implementations, so not backported to 9.2.x for now.

alexpott’s picture

Filed a fix for the unfortunate views random sorting fail this issue causes.... #3229665: \Drupal\Tests\taxonomy\Functional\Views\TaxonomyTermFilterDepthTest can randomly fail due to views sorts

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.