I currently have a client that has over 700k entries in the advagg_aggregation table. This causes the first query in advagg_remove_old_unused_aggregates() to take ~10 minutes to run. While browsing through the code it looks like it does a subselect and "group by" which causes all of the rows to be pulled (and in the case of the subselect, pulled multiple times). With a large table this query is taking me about 10 minutes to run due to the large number of rows.

I'd like review/feedback/suggestions about the attached patch. I've verified it has worked in my setup but I would appreciate someone else looking it over to make sure my shortened query isn't missing any use cases. The shortened query in the patch takes about 2 seconds to run on my setup.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Andrew M Riley’s picture

mikeytown2’s picture

Status: Active » Needs review
DamienMcKenna’s picture

Issue tags: +Performance
Andrew M Riley’s picture

Status: Needs review » Closed (won't fix)

Looks like the 7.x-2.x branch has this now. Closing out

mikeytown2’s picture

Status: Closed (won't fix) » Needs review
mikeytown2’s picture

Status: Needs review » Closed (works as designed)

The patch removes the ability to find orphaned aggregate versions entries in the database. Swapping the tables around defeats the purpose of a LEFT OUTER JOIN in this situation http://www.khankennels.com/blog/index.php/archives/2007/04/20/getting-jo...

A left outer join combines the ideas behind a left join and an outer join. Basically – if you use a left outer join you will get the records in the left table that DO NOT have a match in the right table.

Query before the patch

SELECT aav.aggregate_filenames_hash AS aggregate_filenames_hash
FROM advagg_aggregates_versions AS aav
LEFT OUTER JOIN advagg_aggregates AS aa 
  ON aa.aggregate_filenames_hash=aav.aggregate_filenames_hash
WHERE aa.aggregate_filenames_hash IS NULL
GROUP BY aav.aggregate_filenames_hash

Query after the patch

SELECT aa.aggregate_filenames_hash AS aggregate_filenames_hash
FROM advagg_aggregates AS aa
LEFT OUTER JOIN advagg_aggregates_versions AS aav 
  ON aa.aggregate_filenames_hash = aav.aggregate_filenames_hash
WHERE aa.aggregate_filenames_hash IS NULL

Local testing shows it takes around 100ms to run the original query; something I'm ok with in a cron situation. What happens when you remove the group by in the original query; that is something I could consider doing.

Silicon.Valet’s picture

There's a HUGE performance difference between 2.7 and what is in dev, I think that the original need was aimed at 2.7 as it contained a rather poorly performing subquery.

SELECT aa.aggregate_filenames_hash AS aggregate_filenames_hash
FROM 
advagg_aggregates aa
LEFT OUTER JOIN (SELECT aav.aggregate_filenames_hash AS aggregate_filenames_hash
FROM 
advagg_aggregates_versions aav) aav ON aa.aggregate_filenames_hash=aav.aggregate_filenames_hash
WHERE  (aav.aggregate_filenames_hash IS NULL ) 
GROUP BY aa.aggregate_filenames_hash;

The query in develop seems fine and takes about 340ms on average, so I'm satisfied with where it is there. I'll use dev until a stable release is available.