Recently had to use Apache Solr to index ~26 Million records. Trying to index that many records exposed one particular bottleneck. The following query takes 42 seconds on our ~26 million row table:

SELECT aie.*
FROM   apachesolr_index_entities aie
WHERE  ( aie.bundle IN ( 'awesome_entity' ) )
       AND ( aie.status = '1' )
       AND ( ( aie.changed > '1391294908' )
              OR ( ( aie.changed = '1391294908' )
                   AND ( aie.entity_id > '91002536' ) ) )
       AND ( aie.entity_type = 'awesome_entity' )
ORDER  BY aie.changed ASC,
          aie.entity_id ASC
LIMIT  6000 offset 0

This query runs for every interval of "apachesolr_cron_limit". So for a dataset of 26m it runs A LOT!

I simplified to this query, which takes only 16.8 milliseconds:

SELECT aie.*
FROM   apachesolr_index_entities aie
WHERE  ( aie.bundle IN ( 'awesome_entity' ) )
       AND ( aie.changed >= '1391294908' )
       AND ( aie.status = '1' )
LIMIT  6000 offset 0

The only difference that I can see is that entities which were changed in the same second as the last changed value might get re-indexed. However, overall it will be MUCH faster to re-index ObjectA than to avoid reindexing ObjectA.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

adam.weingarten’s picture

adam.weingarten’s picture

adam.weingarten’s picture

adam.weingarten’s picture

Issue summary: View changes
adam.weingarten’s picture

Assigned: adam.weingarten » Unassigned
Status: Active » Needs review

Status: Needs review » Needs work

The last submitted patch, 3: apachesolr-speedup-slow-select-query-2457953_b.patch, failed testing.

adam.weingarten’s picture

Nick_vh’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 7: apachesolr-speedup-slow-select-query-2457953_c.patch, failed testing.

kducharm’s picture

Status: Needs work » Needs review
FileSize
34.26 KB

I created a patch to allow an alternative ApacheSolrQueue architecture to run in parallel with the existing apachesolr_index_entities tables. This allows time consuming queries on the apachesolr_index_entities tables to instead be based on a queue table that is empty when indexing is considered complete, rather than having to filter the entire apachesolr_index_entities table.

This feature can be enabled/disabled via the admin UI, or using:
drush vset apachesolr_use_index_queue 1
---------

Status: Needs review » Needs work

The last submitted patch, 10: apachesolr-slow_queries_reindex-10.patch, failed testing.

kducharm’s picture

FileSize
34.18 KB
kducharm’s picture

kducharm’s picture

Nick_vh’s picture

A queue, by definition, is a bad idea for a problem of tracking what to index. The problem is that if you update a node twice, it will add it to the queue twice since you can't search a queue (see interface of the drupalQueue).

Have you somehow solved that problem? I saw you added a bunch of functionality in the queue class, which is based on the reliableQueueInterface. I've created a couple of queue modules before and deleting items in the queue means you have control over the queue. This is not true when you look at queues such as amazon SQS. Drupal is supposed to work with other queue systems and they do not have lookup functions. You now basically created the same tracking table in a queue with assumptions that the queue will work in mySQL and that you have access to what data is in your queue?

Sorry, but this is not the right way to go. I do advice you to take a deeper look in the tracking table functionality and see if there are performance optimizations to make. Perhaps with indexes in mysql or with a flag that makes the query easier.

Thanks for your effort!

adam.weingarten’s picture

Hey Nick,
We found that the current approach of trying to prevent reindexing resulted in considerably worse performance than the process of reindexing. If you have to reindex a record a couple of times the cost is negligible. Using the approach in kducharm's patch we were able to dramatically increase the performance of the index process. This patch reduces each batch run by about 42 seconds.

brian.reese’s picture

Addressing an issue where saving a taxonomy term not attached to a node causes a PDO exception.

brian.reese’s picture

#18 replaces #17