Large number of advagg_aggregate rows (slow cron) [#2374077]

I currently have a client that has over 700k entries in the advagg_aggregation table. This causes the first query in advagg_remove_old_unused_aggregates() to take ~10 minutes to run. While browsing through the code it looks like it does a subselect and "group by" which causes all of the rows to be pulled (and in the case of the subselect, pulled multiple times). With a large table this query is taking me about 10 minutes to run due to the large number of rows.

I'd like review/feedback/suggestions about the attached patch. I've verified it has worked in my setup but I would appreciate someone else looking it over to make sure my shortened query isn't missing any use cases. The shortened query in the patch takes about 2 seconds to run on my setup.

Comment	File	Size	Author
#1	advagg-unused_aggregates-2374077-0.patch	986 bytes	Andrew M Riley
#1

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

Andrew M Riley CreditAttribution: Andrew M Riley commented 12 November 2014 at 17:45

File	Size
advagg-unused_aggregates-2374077-0.patch	986 bytes

Comment #2

mikeytown2 CreditAttribution: mikeytown2 commented 12 November 2014 at 17:46

Status:

Active

» Needs review

Comment #3

DamienMcKenna

NH, USA

CreditAttribution: DamienMcKenna commented 12 November 2014 at 22:11

Issue tags:

+Performance

Comment #4

Andrew M Riley CreditAttribution: Andrew M Riley commented 26 November 2014 at 02:33

Status:

Needs review

» Closed (won't fix)

Looks like the 7.x-2.x branch has this now. Closing out

Comment #5

mikeytown2 CreditAttribution: mikeytown2 commented 28 November 2014 at 20:17

Status:

Closed (won't fix)

» Needs review

GroupBy is still in the latest dev
http://cgit.drupalcode.org/advagg/tree/advagg.cache.inc#n390

Comment #6

mikeytown2 CreditAttribution: mikeytown2 commented 4 December 2014 at 02:35

Status:

Needs review

» Closed (works as designed)

The patch removes the ability to find orphaned aggregate versions entries in the database. Swapping the tables around defeats the purpose of a LEFT OUTER JOIN in this situation http://www.khankennels.com/blog/index.php/archives/2007/04/20/getting-jo...

A left outer join combines the ideas behind a left join and an outer join. Basically – if you use a left outer join you will get the records in the left table that DO NOT have a match in the right table.

Query before the patch

SELECT aav.aggregate_filenames_hash AS aggregate_filenames_hash
FROM advagg_aggregates_versions AS aav
LEFT OUTER JOIN advagg_aggregates AS aa 
  ON aa.aggregate_filenames_hash=aav.aggregate_filenames_hash
WHERE aa.aggregate_filenames_hash IS NULL
GROUP BY aav.aggregate_filenames_hash

Query after the patch

SELECT aa.aggregate_filenames_hash AS aggregate_filenames_hash
FROM advagg_aggregates AS aa
LEFT OUTER JOIN advagg_aggregates_versions AS aav 
  ON aa.aggregate_filenames_hash = aav.aggregate_filenames_hash
WHERE aa.aggregate_filenames_hash IS NULL

Local testing shows it takes around 100ms to run the original query; something I'm ok with in a cron situation. What happens when you remove the group by in the original query; that is something I could consider doing.

Comment #7

Silicon.Valet CreditAttribution: Silicon.Valet commented 12 December 2014 at 19:35

There's a HUGE performance difference between 2.7 and what is in dev, I think that the original need was aimed at 2.7 as it contained a rather poorly performing subquery.

SELECT aa.aggregate_filenames_hash AS aggregate_filenames_hash
FROM 
advagg_aggregates aa
LEFT OUTER JOIN (SELECT aav.aggregate_filenames_hash AS aggregate_filenames_hash
FROM 
advagg_aggregates_versions aav) aav ON aa.aggregate_filenames_hash=aav.aggregate_filenames_hash
WHERE  (aav.aggregate_filenames_hash IS NULL ) 
GROUP BY aa.aggregate_filenames_hash;

The query in develop seems fine and takes about 340ms on average, so I'm satisfied with where it is there. I'll use dev until a stable release is available.