I believe I have found a serious bug in the update_index function of the ApacheSolrUpdate class in apachesolr.module. The cleanup code:
  if (is_object($solr)) {
    // remaining documents
    try {
      watchdog('Apache Solr', "Adding ". count($documents). " docs in cleanup");
      $solr->addDocuments($documents);
      $solr->commit();
      $solr->optimize(FALSE, FALSE);
    }
    catch (Exception $e) {
      watchdog('Apache Solr', $e->getMessage(), WATCHDOG_ERROR);
    }
    self::success('apachesolr', $solr_last_change, $solr_last_id);
  }
is called inside the loop over the node list result! This means that, in addition to the modulus check that submits 20 documents at a time, every pending document (up to 20) is resubmitted, committed, and optimized each time one more node is processed. Instead of submitting 20 documents at a time until all n documents are done and then committing and optimizing once, the script submits 1 document, commits, optimizes, then 2, commits, optimizes, then 3, and so on up to 20, then starts again at 1, until all n are finished. For each batch of 20 that is 1 + 2 + ... + 20 = 210 documents sent, so roughly 10 times as many documents are added as needed, all redundantly, about 20 times as many add requests are made, and n times as many commits and optimizes are performed.

I did try to double-check that this isn't in fact necessary functionality, and I could still be wrong, as I am not as familiar with the inner workings of this module as its creators. However, I looked at the apachesolr services solr module and it seemed to confirm that these calls are extraneous, and the misuse of the document array makes me think it was in fact an error. If someone could confirm this, I think the problem should be fixed as quickly as possible in both the release and development code.

With this changed on our site, both apachesolr and Drupal report that the proper number of pages is still being indexed, and cron's search indexing has gone from an untenable 3000 seconds for 100 submits to less than 90 seconds for 500.
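To make the cost concrete, here is a rough simulation of the two control flows. It is only a sketch under assumptions: CountingSolr, cleanup() and simulate_update_index() are hypothetical names made up for illustration, the stub just counts calls instead of talking to Solr, and the exact ordering of the batch check inside the real update_index loop is guessed from the description above.

```php
<?php
// Hypothetical stand-in for the Solr service object: it only counts calls.
class CountingSolr {
  public $docsSent = 0;
  public $addCalls = 0;
  public $commits = 0;
  public $optimizes = 0;

  public function addDocuments(array $documents) {
    $this->addCalls++;
    $this->docsSent += count($documents);
  }

  public function commit() {
    $this->commits++;
  }

  public function optimize($waitFlush = TRUE, $waitSearcher = TRUE) {
    $this->optimizes++;
  }
}

// Same three calls as the cleanup block quoted in the report.
function cleanup($solr, array $documents) {
  $solr->addDocuments($documents);
  $solr->commit();
  $solr->optimize(FALSE, FALSE);
}

// Assumed shape of the indexing loop. Passing TRUE mimics the reported bug
// (cleanup after every node); FALSE runs the cleanup once after the loop.
function simulate_update_index($node_count, $cleanup_inside_loop, $batch_size = 20) {
  $solr = new CountingSolr();
  $documents = array();
  for ($nid = 1; $nid <= $node_count; $nid++) {
    $documents[] = "node-$nid";
    // Regular batch submission every $batch_size documents.
    if (count($documents) % $batch_size == 0) {
      $solr->addDocuments($documents);
      $documents = array();
    }
    if ($cleanup_inside_loop) {
      cleanup($solr, $documents);
    }
  }
  if (!$cleanup_inside_loop) {
    cleanup($solr, $documents);
  }
  return $solr;
}

foreach (array('buggy' => TRUE, 'fixed' => FALSE) as $label => $inside) {
  $s = simulate_update_index(50, $inside);
  printf("%s: %d docs sent, %d add calls, %d commits, %d optimizes\n",
    $label, $s->docsSent, $s->addCalls, $s->commits, $s->optimizes);
}
// buggy: 475 docs sent, 52 add calls, 50 commits, 50 optimizes
// fixed: 50 docs sent, 3 add calls, 1 commits, 1 optimizes
```

For 50 nodes the simulated buggy flow sends 475 documents where 50 would do, which is in line with the "roughly 10 times" estimate. The real numbers depend on how the actual loop is ordered, so treat these as illustrative only.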
Comments
Comment #1
robertdouglass commented: Thanks, it's on my radar. My time with this module has been limited recently, but we've got a Summer of Code project dealing with the module, and bugfixes can be looked at within that context too, so we'll hopefully be looking at a nice round of fixes/features in the near future.
Comment #2
janusman commented: I fixed this on my instance and yes, it works =) It was just a matter of adding a bracket and removing another.
Robert: I'd hate for development to pause because of the Google SOC thing =) Can the people doing SOC and us collaborate somehow?
Comment #3
robertdouglass commented: Fixed in D6.
Comment #4
robertdouglass commented: And now in D5.
Comment #5
Anonymous (not verified) commented: Automatically closed -- issue fixed for two weeks with no activity.