Problem/Motivation
IN #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content we joined the subquery to the using a join on the entity ID but #3548313-52: Updating to 10.5.3 causes gateway timeouts on revisioned content makes a really interesting point about correlated vs non-correlated subqueries that potentially is the problem here and the committed fix on #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content is not going to address.
Steps to reproduce
Proposed resolution
Revert the fix from #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content and move to an inner join with an optimisation for \Drupal\Core\Entity\RevisionableStorageInterface::getLatestRevisionId()
Remaining tasks
User interface changes
None
Introduced terminology
N/a
API changes
None
Data model changes
None
Release notes snippet
N/a
Issue fork drupal-3555720
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
alexpottComment #4
alexpottI'm going to credit @adam-vessey on this issue because this work has come directly from their comment.
Comment #5
kristiaanvandeneyndeSo far we had:
Reading up on correlated subqueries in a very recent article, they advocate using joins to avoid performance issues. That would bring us full circle.
I'm not sure we want to use a workaround to avoid a deliberate feature of SQL. I'd rather ask myself if we're even supposed to use said deliberate feature here.
Circling back to the query from the MR, after adjusting some names for readability, we get this:
Some technical observations:
WHERE nr.nid = nfd.nidto the subquery.SELECT nr.nid, MAX(nr.vid) AS vid FROM node_revision nr GROUP BY nr.nid WHERE nr.nid = 1All of this just screams to use a join to me, like we originally did. But we should optimize it so the thing we're joining is far smaller than it was before. So something like:
Comment #6
kristiaanvandeneyndeOn a DB with 266k entries in node_revision and 24k rows in node_field_data, the above suggestion ran in 0.02 seconds and gave me the correct max vid for every row in node_field_data.
Comment #7
alexpott@kristiaanvandeneynde this issue was opened in case #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content turns out to not work well. How does the current query in HEAD perform for you? I think it will probably be fine.
Comment #8
kristiaanvandeneynde1.01 seconds (HEAD) vs 0.02 seconds (my suggestion)
Query I ran was:
I dropped the nid 1 check to keep it a fair comparison and because entity queries won't always check on a single entity ID.
Comment #9
kristiaanvandeneyndeTried it a few more times, HEAD is always ~1 second and my version is always between 0.02 and 0.07 seconds. Tried my version with both INNER and LEFT join for good measure.
Comment #10
alexpottAnd the query from this issue? Ie...
I think that the join approach is very similar to this FWIW because you have to do
( SELECT nid, MAX(vid) as maximum FROM node_revision GROUP BY nid)before any join... and I'm concerned that the JOIN approach will hit the MARIADB bug as detailed on https://stackoverflow.com/a/6157797Comment #11
kristiaanvandeneyndeRe #10 about as fast as what I suggested. So HEAD is definitely slower than the MR or my suggestion here right now.
But the MR here drops the comparison between a column from the subquery and a column from the outer query so it should not run into the correlated subquery thing at all. You may just as well run the following query:
Keep in mind you can't use the nr alias inside the subquery or it might trigger the correlated subqueries after all.
So I now have the following on my DB:
This MR, with adjusted table aliases.
SELECT nr.vid, nr.nid FROM node_revision nr INNER JOIN node_field_data nfd ON nfd.nid = nr.nid WHERE (nr.vid IN (SELECT sq.* FROM (SELECT MAX(nr2.vid) AS "expression" FROM node_revision nr2 GROUP BY nr2.nid) sq));Average of 0.09s, sometimes going lower to 0.06s
This MR, dropping the extra subquery:
Average of 0.09s, sometimes going higher to 0.13s. So maybe some DB engine trickery is still going on, even though it shouldn't.
My suggestion, using a regular JOIN:
SELECT nfd.vid, nfd.nid FROM node_field_data nfd INNER JOIN ( SELECT nid, MAX(vid) as maximum FROM node_revision GROUP BY nid) nr ON nfd.nid = nr.nid AND nfd.vid=nr.maximum;Average of 0.04s, never going higher than 0.07s.
So we'd be using a simple join rather than subqueries, which should already be faster, and we would not have to worry about any DB specifics regarding how these subqueries are evaluated.
Side observation:
Also, my suggestion only joins those rows on the highest vid, where the 2 other queries seem to still select all nodes because this part is always true:
WHERE (nr.vid IN (SELECT sq.*. You'd then have to select the row with the highest vid yourself (nr.vid > nfd.vid).If you change that to nfd.vid, the result set shrinks to what my INNER JOIN returns. Not sure if that's the desired outcome, changing my code to a LEFT join gives the same results as HEAD and the current MR.
We may want to look into what we exactly want from the query here. As that might also be a point of optimization (shrinking the result set).
Comment #12
alexpott#11 that's the problem...
Is not true. There is a known bug in MariaDB that we were falling into and we need to be sure that that inner join version does not as well.
Comment #13
kristiaanvandeneyndeOkay, good to know. Given the timings on my local it seems like the join is not suffering from it. Is there anything else I can post to confirm this?
Here's an explain query:
Only the query in HEAD has a type of "DEPENDENT SUBQUERY" in its explanation. But as you said, different versions of MariaDB might be treating other queries incorrectly still.
Comment #15
alexpott@kristiaanvandeneynde so I can confirm that your query is the most performant on mariadb 10.5, 10.6 and mysql 8.4 ... but I can't reproduce the problem that was reported in [ #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content which super frustrating.
I've got a db with 40000+ nodes and 4000000 revisions and your approach is good for this. I think we might also want to see if we can detect a hardcoded ID and use that in the inner join query if we can... that would make it extremely quick for the most common use-case - the \Drupal\Core\Entity\ContentEntityStorageBase::getLatestRevisionId() case...
Comment #17
alexpottOkay I've implemented #15 - so now I think we have the best of all worlds - a general latestRevision() query across many nodes will be as performant as possible but also we'll have great performance for the most common case - \Drupal\Core\Entity\ContentEntityStorageBase::getLatestRevisionId()
I'm bumping the priority to critical here because I think the query we landed in #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content is very problematic in rare use-cases and this improves things for both use-cases.
Comment #18
alexpottHere's an explain for node latest revision ID query where the node has 4001 revisions...
As you can see we limit the query to only looking at the 4001 revisions...
Here's an explain for query across nodes... note even with 4000000 revisions and 40000 nodes it is outputting 44373 rows in set (0.52 sec)
Comment #19
alexpottWith the approach in HEAD post #3548313: Updating to 10.5.3 causes gateway timeouts on revisioned content the second query takes 48 secs!
44373 rows in set (48.80 sec)and even a query where there is a node ID takes 1.23 secs.As a comparison with this MR the query when we have a node ID takes
1 row in set (0.00 sec)....Comment #20
alexpottTests are green on Mysql, Postgres and SQLite.
Comment #22
kristiaanvandeneyndeCode looks good to me and so does the outcome. I do see some test failures, but not sure what those are about. Will RTBC if someone can explain why the overall tests claim everything is fine yet the right-most pipeline is full of red.
Comment #23
kristiaanvandeneyndeComment #24
alexpott@kristiaanvandeneynde because I ran a test on an unsupported MySQL version :D
Comment #26
kristiaanvandeneyndeLooks good to me.
Comment #35
catchOK this is now committed/pushed/cherry-picked to 11.x, 11.3.x, 11.2.x, 10.6.x, and 10.5.x.
I think it might be worth a follow-up to split the individual vs collection logic into two methods if we can, then we could decide if we want to deprecate the collection logic. Anything that needs multiple latest revisions could individually request the latest revision for each one and skip multiple loading.
Glad we were able to finally find a good solution that works for both kinds of large datasets.
Comment #37
alexpottComment #38
neclimdulTested this on a site. It was causing serious performance problems on a node with ~20k revisions. The slow queries that where showing up didn't make sense either b/c they should have been the optimized case.
After discussing with Alex on slack, and confirming the poor performance went away when the database was copied to a different server(by way of a database dump) we decided this was likely an issue with the storage causing it to choose a bad subquery logic.
After hours I ran the following. A warning for those that follow, after hours was a good instinct. The queries ran immediately but then took down the site as all other queries touching those tables where queued until it finished running background tasks on the table.
This seems to have fixed the problem and the slow node that was timing out now loads immediately and the query the slow query returned instantly.
Comment #39
alexpottCreated a follow-up #3556351: Stale table statistics can result in extremely slow queries
Comment #40
catchI've added a summary of #38 to the release notes for https://www.drupal.org/project/drupal/releases/11.2.6 and https://www.drupal.org/project/drupal/releases/10.5.5