This is a MAJOR bug.

Symptom:

If the Apache Solr indexing queue holds more *updated* items than one cron run will index, the "index all queued content" option indexes only as many items as a single cron run would.

How to reproduce:

- You have indexed 100 documents from Drupal.
- Your Apachesolr settings say you should index 50 items per cron run.
- Force an update of the apache solr index by setting the apachesolr_index_entities_node.changed column beyond the last update for 90 of your already indexed items.
- The Apachesolr status page will now say 90 items remain for indexing.
- If you attempt to "index all queued content", only 50 items are actually sent to Solr.
- The other 40 items will never be reindexed.

Why does this happen?

The algorithm that selects items for updating uses only the last entity_id or the last changed date (apachesolr_index_get_entities_to_index). After the equivalent of one cron run, the last changed date is set to the date of the last indexed item (apachesolr_index_entities). If several entries share the same timestamp (not at all unthinkable in big custom bulk operations), the remaining entries with that timestamp are skipped: you risk not getting your data indexed, and no error message tells you so.
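To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not the module's actual code: it assumes the cursor is a single timestamp and that selection picks rows strictly newer than it. The names `Row`, `select_batch` and `index_all` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    entity_id: int
    changed: int  # timestamp from the (simulated) index table


def select_batch(rows, last_changed, limit):
    """Assumed selection rule: only rows strictly newer than the cursor."""
    candidates = sorted(
        (r for r in rows if r.changed > last_changed),
        key=lambda r: (r.changed, r.entity_id),
    )
    return candidates[:limit]


def index_all(rows, last_changed, limit):
    """Repeat batches until nothing is selected, like 'index all queued content'."""
    indexed = []
    while True:
        batch = select_batch(rows, last_changed, limit)
        if not batch:
            break
        indexed.extend(r.entity_id for r in batch)
        # The cursor moves to the timestamp of the last indexed item.
        last_changed = batch[-1].changed
    return indexed


# 10 untouched rows plus 90 rows bumped to the same newer timestamp.
rows = [Row(i, 1000) for i in range(1, 11)] + [Row(i, 2000) for i in range(11, 101)]
indexed = index_all(rows, last_changed=1000, limit=50)
skipped = [r.entity_id for r in rows if r.changed > 1000 and r.entity_id not in indexed]

print(len(indexed), "indexed,", len(skipped), "silently skipped")
```

In this model the second batch finds nothing, because the 40 leftover rows carry exactly the cursor's timestamp and are therefore not "newer".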

Proposed solution

I propose adding a dirty-bit column named "pending" to the apachesolr_index_entities_node table. If it is set to 1, the row needs to be updated; if it is set to 0, the row is left alone. The function apachesolr_index_entities could then run a bulk db_update on all rows that were successfully indexed, setting the dirty bit to 0.

You lose efficiency with the db_update, but gain efficiency in the function that selects items to reindex.
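A rough sketch of the dirty-bit idea, written in Python against an in-memory SQLite table so it can be run on its own. The table name `index_entities`, the helper names and the `send_to_solr` callback are hypothetical stand-ins; a real patch would use Drupal's database API on apachesolr_index_entities_node.

```python
import sqlite3


def make_table(conn, n_rows):
    """Create a stand-in index table where every row starts out pending."""
    conn.execute(
        "CREATE TABLE index_entities ("
        " entity_id INTEGER PRIMARY KEY,"
        " changed INTEGER NOT NULL,"
        " pending INTEGER NOT NULL DEFAULT 1)"
    )
    conn.executemany(
        "INSERT INTO index_entities (entity_id, changed, pending) VALUES (?, ?, 1)",
        [(i, 2000) for i in range(1, n_rows + 1)],  # all rows share one timestamp
    )


def index_pending(conn, limit, send_to_solr):
    """Index one batch of pending rows and clear the flag on the successful ones."""
    rows = conn.execute(
        "SELECT entity_id FROM index_entities"
        " WHERE pending = 1 ORDER BY entity_id LIMIT ?",
        (limit,),
    ).fetchall()
    ok_ids = [eid for (eid,) in rows if send_to_solr(eid)]
    conn.executemany(
        "UPDATE index_entities SET pending = 0 WHERE entity_id = ?",
        [(eid,) for eid in ok_ids],
    )
    return len(ok_ids)


conn = sqlite3.connect(":memory:")
make_table(conn, 90)

total = 0
while True:
    done = index_pending(conn, limit=50, send_to_solr=lambda eid: True)
    if done == 0:
        break
    total += done

remaining = conn.execute(
    "SELECT COUNT(*) FROM index_entities WHERE pending = 1"
).fetchone()[0]
print(total, "indexed,", remaining, "still pending")
```

Selection no longer depends on timestamps at all, so rows sharing a changed value cannot be skipped. Rows that fail to index keep pending = 1 and are picked up again on a later run.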

I have implemented a custom solution myself (1 changed and 3 added lines of code) that doesn't require changes to the database or API. It's much less efficient and not worthy of publication. I can mail it to anyone interested, though :)

Remi

Comments

remimikalsen

Issue summary: View changes

I found the bug. I was in the vicinity in my first bug report, but have now narrowed it down to the exact spot.

Nick_vh

Status: Active » Closed (duplicate)

Very valid point, but it is a duplicate of #1828014: Mass re-indexation can miss (a lot of) content. Could you please add your 2 cents there?

Nick_vh

Issue summary: View changes

Fixed typos

jazio

Hi @remimikalsen, I am interested in your solution. Would you be so kind as to attach the solution you have found so far? Many thanks in advance.

remimikalsen

Hi. Since this problem started (and returned every time I updated the module), I have given up following upstream updates. I've forked the module and incorporated a whole lot of changes (among other things, multi-language indexing to suit our special data structures). So right now I can't pinpoint the changes needed to make this work on the current version of the module (or say whether they are still necessary).

My initial change was to subtract 1 millisecond from the changed date of every indexed document. This was a simple database call for each indexed document against the Apache Solr MySQL table "apachesolr_index_entities_node", made within the PHP function "apachesolr_index_entities". I then used that new timestamp as the last index time by calling apachesolr_set_last_index_position within apachesolr_index_entities. This way, the next time the indexing callback was called, it would resume with the documents that had shared a timestamp with the most recently indexed documents.
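Roughly, the workaround behaves like the following Python sketch. It is a simplified model and not the original PHP: the table is a plain dict, the cursor is a single timestamp, selection picks rows strictly newer than the cursor, and "1 millisecond" is modelled as one unit of whatever the changed column stores. The function name `run_batch` is made up for illustration.

```python
def run_batch(changed, cursor, limit):
    """Index one batch, shifting each indexed row's timestamp back by one unit.

    changed: dict mapping entity_id -> timestamp (stand-in for the index table).
    Returns (indexed_ids, new_cursor).
    """
    batch = sorted(
        (eid for eid, ts in changed.items() if ts > cursor),
        key=lambda eid: (changed[eid], eid),
    )[:limit]
    if not batch:
        return [], cursor
    for eid in batch:
        changed[eid] -= 1  # the per-document database call
    # The cursor becomes the shifted timestamp of the last indexed document,
    # so rows still carrying the original timestamp stay "newer" than it.
    return batch, changed[batch[-1]]


# 90 rows that share one timestamp, 50 items per run.
changed = {eid: 2000 for eid in range(1, 91)}
cursor = 1000
all_indexed = []
while True:
    batch, cursor = run_batch(changed, cursor, limit=50)
    if not batch:
        break
    all_indexed.extend(batch)

print(len(all_indexed), "indexed over all runs, cursor ends at", cursor)
```

In this model the second run picks up the 40 rows that the first run left behind, and the already indexed rows are not selected again because their shifted timestamp equals the cursor.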

Remi