Problem/Motivation

IN #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content we joined the subquery to the using a join on the entity ID but #3548313-52: Updating to 10.5.3 causes gateway timeouts on revisioned content makes a really interesting point about correlated vs non-correlated subqueries that potentially is the problem here and the committed fix on #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content is not going to address.

Steps to reproduce

Proposed resolution

Revert the fix from #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content and move to an inner join with an optimisation for \Drupal\Core\Entity\RevisionableStorageInterface::getLatestRevisionId()

Remaining tasks

User interface changes

None

Introduced terminology

N/a

API changes

None

Data model changes

None

Release notes snippet

N/a

Issue fork drupal-3555720

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

alexpott created an issue. See original summary.

alexpott’s picture

Status: Active » Needs review

I'm going to credit @adam-vessey on this issue because this work has come directly from their comment.

kristiaanvandeneynde’s picture

So far we had:

  1. A join that blew things up when entities had many revisions
  2. A subquery with GROUP BY that blew things up when a DB had many entities
  3. A subquery without GROUP BY that fixes the above, but is at risk of running correlated subqueries

Reading up on correlated subqueries in a very recent article, they advocate using joins to avoid performance issues. That would bring us full circle.

I'm not sure we want to use a workaround to avoid a deliberate feature of SQL. I'd rather ask myself if we're even supposed to use said deliberate feature here.

Circling back to the query from the MR, after adjusting some names for readability, we get this:

SELECT nr.vid, nr.nid FROM node_revision nr
INNER JOIN node_field_data nfd
ON nfd.nid = nr.nid
WHERE
  (nr.vid IN (
    SELECT "sq".* FROM (SELECT MAX(nr.vid) AS "expression" FROM node_revision nr GROUP BY nr.nid) "sq")
  )
  AND nfd.nid = "1"

Some technical observations:

  1. I don't think this would trigger a correlated subquery because there is no link to the outer table. For that to happen, we would need to add WHERE nr.nid = nfd.nid to the subquery.
  2. We are effectively using a query that would get the max vid for every node, by using a subquery, even though all of this information is in one of the tables we're already using (node_revision). So if we could eliminate "node_field_data" in this case (not sure because entity query), then the query would become really simple: SELECT nr.nid, MAX(nr.vid) AS vid FROM node_revision nr GROUP BY nr.nid WHERE nr.nid = 1
  3. Now I think the WHERE clause on the nid might be a red herring because not all entity queries might have that condition. So in that case the above simplified query would not work because we might not have all the condition fields on the node_revision table
  4. If we stick with the subquery, and add the where clause to it, we need to realize that it is optimized by the DB engine because it's usually used to generate reports (as shown in the article: Employees earning more than the average salary in their department). Wrapping it in another query to prevent the DB engine from optimizing it seems like we're kicking the can further down the road and we will get bitten by DB engine updates again at a future time.

All of this just screams to use a join to me, like we originally did. But we should optimize it so the thing we're joining is far smaller than it was before. So something like:

SELECT nfd.nid, nfd.whatever_you_want FROM node_field_data nfd
INNER JOIN
( SELECT nid, MAX(vid) as maximum FROM node_revision GROUP BY nid) nr
ON nfd.nid = nr.nid AND nfd.vid=nr.maximum
kristiaanvandeneynde’s picture

On a DB with 266k entries in node_revision and 24k rows in node_field_data, the above suggestion ran in 0.02 seconds and gave me the correct max vid for every row in node_field_data.

alexpott’s picture

@kristiaanvandeneynde this issue was opened in case #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content turns out to not work well. How does the current query in HEAD perform for you? I think it will probably be fine.

kristiaanvandeneynde’s picture

1.01 seconds (HEAD) vs 0.02 seconds (my suggestion)

Query I ran was:

SELECT nr.vid, nr.nid FROM node_revision nr
INNER JOIN node_field_data nfd ON nfd.nid = nr.nid
WHERE (nr.vid IN (SELECT MAX(subquery_nr.vid) AS expression FROM node_revision subquery_nr
WHERE (nr.nid = subquery_nr.nid)))

I dropped the nid 1 check to keep it a fair comparison and because entity queries won't always check on a single entity ID.

kristiaanvandeneynde’s picture

Tried it a few more times, HEAD is always ~1 second and my version is always between 0.02 and 0.07 seconds. Tried my version with both INNER and LEFT join for good measure.

alexpott’s picture

And the query from this issue? Ie...

SELECT nr.vid, nr.nid FROM node_revision nr
INNER JOIN node_field_data nfd
ON nfd.nid = nr.nid
WHERE
  (nr.vid IN (
    SELECT "sq".* FROM (SELECT MAX(nr.vid) AS "expression" FROM node_revision nr GROUP BY nr.nid) "sq")
  )

I think that the join approach is very similar to this FWIW because you have to do ( SELECT nid, MAX(vid) as maximum FROM node_revision GROUP BY nid) before any join... and I'm concerned that the JOIN approach will hit the MARIADB bug as detailed on https://stackoverflow.com/a/6157797

kristiaanvandeneynde’s picture

Re #10 about as fast as what I suggested. So HEAD is definitely slower than the MR or my suggestion here right now.

But the MR here drops the comparison between a column from the subquery and a column from the outer query so it should not run into the correlated subquery thing at all. You may just as well run the following query:

SELECT nr.vid, nr.nid FROM node_revision nr
INNER JOIN node_field_data nfd ON nfd.nid = nr.nid
WHERE nr.vid IN (SELECT MAX(nr_inner.vid) AS vid FROM node_revision nr_inner GROUP BY nr_inner.nid);

Keep in mind you can't use the nr alias inside the subquery or it might trigger the correlated subqueries after all.

So I now have the following on my DB:
This MR, with adjusted table aliases.
SELECT nr.vid, nr.nid FROM node_revision nr INNER JOIN node_field_data nfd ON nfd.nid = nr.nid WHERE (nr.vid IN (SELECT sq.* FROM (SELECT MAX(nr2.vid) AS "expression" FROM node_revision nr2 GROUP BY nr2.nid) sq));
Average of 0.09s, sometimes going lower to 0.06s

This MR, dropping the extra subquery:

SELECT nr.vid, nr.nid FROM node_revision nr INNER JOIN node_field_data nfd ON nfd.nid = nr.nid WHERE nr.vid IN (SELECT MAX(nr_inner.vid) AS vid FROM node_revision nr_inner GROUP BY nr_inner.nid);

Average of 0.09s, sometimes going higher to 0.13s. So maybe some DB engine trickery is still going on, even though it shouldn't.

My suggestion, using a regular JOIN:
SELECT nfd.vid, nfd.nid FROM node_field_data nfd INNER JOIN ( SELECT nid, MAX(vid) as maximum FROM node_revision GROUP BY nid) nr ON nfd.nid = nr.nid AND nfd.vid=nr.maximum;
Average of 0.04s, never going higher than 0.07s.

So we'd be using a simple join rather than subqueries, which should already be faster, and we would not have to worry about any DB specifics regarding how these subqueries are evaluated.

Side observation:
Also, my suggestion only joins those rows on the highest vid, where the 2 other queries seem to still select all nodes because this part is always true: WHERE (nr.vid IN (SELECT sq.*. You'd then have to select the row with the highest vid yourself (nr.vid > nfd.vid).

If you change that to nfd.vid, the result set shrinks to what my INNER JOIN returns. Not sure if that's the desired outcome, changing my code to a LEFT join gives the same results as HEAD and the current MR.

We may want to look into what we exactly want from the query here. As that might also be a point of optimization (shrinking the result set).

alexpott’s picture

#11 that's the problem...

just as well run the following query:

SELECT nr.vid, nr.nid FROM node_revision nr
INNER JOIN node_field_data nfd ON nfd.nid = nr.nid
WHERE nr.vid IN (SELECT MAX(nr_inner.vid) AS vid FROM node_revision nr_inner GROUP BY nr_inner.nid);

Is not true. There is a known bug in MariaDB that we were falling into and we need to be sure that that inner join version does not as well.

kristiaanvandeneynde’s picture

Okay, good to know. Given the timings on my local it seems like the join is not suffering from it. Is there anything else I can post to confirm this?

Here's an explain query:

 EXPLAIN  SELECT nfd.vid, nfd.nid FROM node_field_data nfd INNER JOIN ( SELECT nid, MAX(vid) as maximum FROM node_revision GROUP BY nid) nr ON nfd.nid = nr.nid AND nfd.vid=nr.maximum;
+----+-------------+---------------+------------+-------+--------------------------------------------------------+-----------+---------+-------------------+-------+----------+--------------------------+
| id | select_type | table         | partitions | type  | possible_keys                                          | key       | key_len | ref               | rows  | filtered | Extra                    |
+----+-------------+---------------+------------+-------+--------------------------------------------------------+-----------+---------+-------------------+-------+----------+--------------------------+
|  1 | PRIMARY     | <derived2>    | NULL       | ALL   | NULL                                                   | NULL      | NULL    | NULL              | 23016 |   100.00 | Using where              |
|  1 | PRIMARY     | nfd           | NULL       | ref   | PRIMARY,node__id__default_langcode__langcode,node__vid | node__vid | 8       | nr.maximum,nr.nid |     1 |   100.00 | Using index              |
|  2 | DERIVED     | node_revision | NULL       | range | node__nid                                              | node__nid | 4       | NULL              | 23016 |   100.00 | Using index for group-by |
+----+-------------+---------------+------------+-------+--------------------------------------------------------+-----------+---------+-------------------+-------+----------+--------------------------+

Only the query in HEAD has a type of "DEPENDENT SUBQUERY" in its explanation. But as you said, different versions of MariaDB might be treating other queries incorrectly still.

alexpott’s picture

@kristiaanvandeneynde so I can confirm that your query is the most performant on mariadb 10.5, 10.6 and mysql 8.4 ... but I can't reproduce the problem that was reported in [ #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content which super frustrating.

I've got a db with 40000+ nodes and 4000000 revisions and your approach is good for this. I think we might also want to see if we can detect a hardcoded ID and use that in the inner join query if we can... that would make it extremely quick for the most common use-case - the \Drupal\Core\Entity\ContentEntityStorageBase::getLatestRevisionId() case...

alexpott changed the visibility of the branch 3555720-latest-revision-subquery to hidden.

alexpott’s picture

Priority: Normal » Critical

Okay I've implemented #15 - so now I think we have the best of all worlds - a general latestRevision() query across many nodes will be as performant as possible but also we'll have great performance for the most common case - \Drupal\Core\Entity\ContentEntityStorageBase::getLatestRevisionId()

I'm bumping the priority to critical here because I think the query we landed in #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content is very problematic in rare use-cases and this improves things for both use-cases.

alexpott’s picture

Here's an explain for node latest revision ID query where the node has 4001 revisions...

 explain SELECT base_table.vid AS vid, base_table.nid AS nid FROM node_revision base_table   INNER JOIN (SELECT subquery_base_table.nid AS nid, MAX(subquery_base_table.vid) AS maximum_revision_id FROM node_revision subquery_base_table WHERE nid = "3" GROUP BY subquery_base_table.nid) sq_base_table ON base_table.nid = sq_base_table.nid AND base_table.vid = sq_base_table.maximum_revision_id   INNER JOIN node_field_data node_field_data ON node_field_data.nid = base_table.nid WHERE node_field_data.nid = "3"\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
   partitions: NULL
         type: system
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: base_table
   partitions: NULL
         type: const
possible_keys: PRIMARY,node__nid
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: node_field_data
   partitions: NULL
         type: ref
possible_keys: PRIMARY,node__id__default_langcode__langcode
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Using index
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: subquery_base_table
   partitions: NULL
         type: ref
possible_keys: node__nid
          key: node__nid
      key_len: 4
          ref: const
         rows: 4001
     filtered: 100.00
        Extra: Using index
4 rows in set, 1 warning (0.01 sec)

As you can see we limit the query to only looking at the 4001 revisions...


Here's an explain for query across nodes... note even with 4000000 revisions and 40000 nodes it is outputting 44373 rows in set (0.52 sec)
explain SELECT base_table.vid AS vid, base_table.nid AS nid FROM node_revision base_table   INNER JOIN (SELECT subquery_base_table.nid AS nid, MAX(subquery_base_table.vid) AS maximum_revision_id FROM node_revision subquery_base_table GROUP BY subquery_base_table.nid) sq_base_table ON base_table.nid = sq_base_table.nid AND base_table.vid = sq_base_table.maximum_revision_id   INNER JOIN node_field_data node_field_data ON node_field_data.nid = base_table.nid \G;
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 43377
     filtered: 100.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: base_table
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,node__nid
          key: PRIMARY
      key_len: 4
          ref: sq_base_table.maximum_revision_id
         rows: 1
     filtered: 5.00
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: node_field_data
   partitions: NULL
         type: ref
possible_keys: PRIMARY,node__id__default_langcode__langcode
          key: node__id__default_langcode__langcode
      key_len: 4
          ref: sq_base_table.nid
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: subquery_base_table
   partitions: NULL
         type: range
possible_keys: node__nid
          key: node__nid
      key_len: 4
          ref: NULL
         rows: 43377
     filtered: 100.00
        Extra: Using index for group-by
4 rows in set, 1 warning (0.00 sec)
alexpott’s picture

With the approach in HEAD post #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content the second query takes 48 secs! 44373 rows in set (48.80 sec) and even a query where there is a node ID takes 1.23 secs.

As a comparison with this MR the query when we have a node ID takes 1 row in set (0.00 sec)....

alexpott’s picture

Tests are green on Mysql, Postgres and SQLite.

kristiaanvandeneynde’s picture

Priority: Critical » Normal

Code looks good to me and so does the outcome. I do see some test failures, but not sure what those are about. Will RTBC if someone can explain why the overall tests claim everything is fine yet the right-most pipeline is full of red.

kristiaanvandeneynde’s picture

Priority: Normal » Critical
alexpott’s picture

@kristiaanvandeneynde because I ran a test on an unsupported MySQL version :D

kristiaanvandeneynde’s picture

Status: Needs review » Reviewed & tested by the community

Looks good to me.

  • catch committed 6db79ab8 on 10.5.x
    Issue #3555720 by alexpott, adam-vessey, kristiaanvandeneynde: Latest...

  • catch committed e8010d54 on 10.6.x
    Issue #3555720 by alexpott, adam-vessey, kristiaanvandeneynde: Latest...

  • catch committed 7f08cb00 on 11.2.x
    Issue #3555720 by alexpott, adam-vessey, kristiaanvandeneynde: Latest...

  • catch committed e2b6b825 on 11.3.x
    Issue #3555720 by alexpott, adam-vessey, kristiaanvandeneynde: Latest...

  • catch committed 40471acf on 11.x
    Issue #3555720 by alexpott, adam-vessey, kristiaanvandeneynde: Latest...
catch’s picture

Version: 11.x-dev » 10.5.x-dev
Status: Reviewed & tested by the community » Fixed

OK this is now committed/pushed/cherry-picked to 11.x, 11.3.x, 11.2.x, 10.6.x, and 10.5.x.

I think it might be worth a follow-up to split the individual vs collection logic into two methods if we can, then we could decide if we want to deprecate the collection logic. Anything that needs multiple latest revisions could individually request the latest revision for each one and skip multiple loading.

Glad we were able to finally find a good solution that works for both kinds of large datasets.

Now that this issue is closed, review the contribution record.

As a contributor, attribute any organization that helped you, or if you volunteered your own time.

Maintainers, credit people who helped resolve this issue.

alexpott’s picture

Issue summary: View changes
neclimdul’s picture

Tested this on a site. It was causing serious performance problems on a node with ~20k revisions. The slow queries that where showing up didn't make sense either b/c they should have been the optimized case.

After discussing with Alex on slack, and confirming the poor performance went away when the database was copied to a different server(by way of a database dump) we decided this was likely an issue with the storage causing it to choose a bad subquery logic.

After hours I ran the following. A warning for those that follow, after hours was a good instinct. The queries ran immediately but then took down the site as all other queries touching those tables where queued until it finished running background tasks on the table.

ANALYZE TABLE node_field_data PERSISTENT FOR ALL;
ANALYZE TABLE node_revision PERSISTENT FOR ALL;

This seems to have fixed the problem and the slow node that was timing out now loads immediately and the query the slow query returned instantly.

alexpott’s picture

catch’s picture

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.