The current queries for index status do not take account of node translations: the site can think it is fully indexed even though it has only indexed one language for particular nodes. The current queries are on node left join search_dataset on nid.
This is partially mitigated by the fact that the site will add entries to search_dataset when nodes are edited, which will often cause new translations to get indexed. However if you call search_index_clear() with a langcode parameter and another language is indexed for that node then there is nothing that will cause the deleted language to be reindexed.
I have managed to work around this by changing the query to look at node_data instead of node, as this has the langcode available, but node_data will have many times the number of rows that node has, so we need to be aware of the performance implications. Performance problems have already been raised with the existing queries, see #312395: Queries on search admin and node indexing are slow for many-node sites
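To make the difference concrete, here is a minimal sketch in Python with an in-memory SQLite database. The table layouts are heavily simplified assumptions (only the columns needed for the join), and the two SQL strings are illustrative approximations of the per-node query and the node_data workaround, not the actual Drupal queries.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Simplified, assumed schemas: only the columns the joins need.
    CREATE TABLE node (nid INTEGER PRIMARY KEY);
    CREATE TABLE node_data (nid INTEGER, langcode TEXT);
    CREATE TABLE search_dataset (sid INTEGER, langcode TEXT, type TEXT, reindex INTEGER);

    INSERT INTO node VALUES (1);
    INSERT INTO node_data VALUES (1, 'en'), (1, 'es');
    -- Only English is in the index, e.g. after 'es' was cleared by langcode.
    INSERT INTO search_dataset VALUES (1, 'en', 'node_search', 0);
""")

# Per-node query: any indexed language satisfies the join for the whole node.
PER_NODE = """
    SELECT n.nid FROM node n
    LEFT JOIN search_dataset sd
      ON sd.sid = n.nid AND sd.type = 'node_search'
    WHERE sd.sid IS NULL OR sd.reindex <> 0
    GROUP BY n.nid
"""

# Per-translation query: each (nid, langcode) pair needs its own index entry.
PER_TRANSLATION = """
    SELECT nd.nid, nd.langcode FROM node_data nd
    LEFT JOIN search_dataset sd
      ON sd.sid = nd.nid AND sd.langcode = nd.langcode AND sd.type = 'node_search'
    WHERE sd.sid IS NULL OR sd.reindex <> 0
    ORDER BY nd.nid, nd.langcode
"""

per_node = db.execute(PER_NODE).fetchall()
per_translation = db.execute(PER_TRANSLATION).fetchall()
print(per_node)         # [] : the node looks fully indexed
print(per_translation)  # [(1, 'es')] : the missing translation is found
```

The per-translation form scans one row per translation rather than one per node, which is where the performance concern comes from.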
Comments
Comment #1
jhodgdon
OK, let's think about what the Node plugin/module are doing here.
In NodeSearch::updateIndex(), the query basically says "Give me all nodes for which either there is no entry in {search_dataset} or for which {search_dataset}.reindex is non-zero". This query, as you say, groups by langcode, so the "no entry" part will not be triggered if any language has been indexed for a given node.
Then in NodeSearch::indexNode() it goes through and indexes all languages for each node it indexes.
NodeSearch::indexStatus() does the same query as NodeSearch::updateIndex() except it's a count query to check for how many nodes are left vs. how many nodes are in the {node} table.
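As a rough illustration of that count, the following Python/SQLite sketch computes "remaining" and "total" the way the description above suggests. The function name index_status, the simplified table layouts, and the sample timestamp are assumptions made for the example; this is not the NodeSearch code itself.

```python
import sqlite3

def index_status(db):
    """Return how many nodes still need indexing versus how many exist."""
    total = db.execute("SELECT COUNT(*) FROM node").fetchone()[0]
    remaining = db.execute("""
        SELECT COUNT(DISTINCT n.nid) FROM node n
        LEFT JOIN search_dataset sd
          ON sd.sid = n.nid AND sd.type = 'node_search'
        WHERE sd.sid IS NULL OR sd.reindex <> 0
    """).fetchone()[0]
    return {"remaining": remaining, "total": total}

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Simplified, assumed schemas.
    CREATE TABLE node (nid INTEGER PRIMARY KEY);
    CREATE TABLE search_dataset (sid INTEGER, langcode TEXT, type TEXT, reindex INTEGER);

    INSERT INTO node VALUES (1), (2), (3);
    -- Node 1 is indexed and up to date.
    INSERT INTO search_dataset VALUES (1, 'en', 'node_search', 0);
    -- Node 2 is indexed but marked for reindex (non-zero timestamp).
    INSERT INTO search_dataset VALUES (2, 'en', 'node_search', 1417000000);
    -- Node 3 has no entry at all.
""")

status = index_status(db)
print(status)  # {'remaining': 2, 'total': 3}
```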
In the Node entity class, Node::postSave() calls node_reindex_node_search(), if the save() is an update, and Node::preDelete() calls search_index_clear('node_search', $nid) when a node is being deleted. Nothing special is done when a node is added.
Content translations use the standard entity add/edit forms (see ContentTranslationController::add() and ::edit()). So adding or editing a translation will trigger a $node->save(). Deleting a translation also triggers a $node->save() -- see ContentTranslationDeleteForm::submitForm().
In node.module, node_reindex_node_search() calls search_mark_for_reindex('node_search', $nid), without the $langcode argument, so it is marking all existing $langcode entries for that $nid as needing reindex.
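The effect of omitting the langcode can be modelled with a small Python/SQLite sketch. The mark_for_reindex function below is a hypothetical stand-in that only mimics the behaviour described above (an UPDATE narrowed by whichever arguments are given); the table layout is a simplified assumption.

```python
import sqlite3
import time

def mark_for_reindex(db, type_, sid=None, langcode=None):
    """Hypothetical model: flag existing index rows as needing reindex."""
    sql = "UPDATE search_dataset SET reindex = ? WHERE type = ?"
    args = [int(time.time()), type_]
    if sid is not None:
        sql += " AND sid = ?"
        args.append(sid)
    if langcode is not None:
        sql += " AND langcode = ?"
        args.append(langcode)
    db.execute(sql, args)

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Simplified, assumed schema.
    CREATE TABLE search_dataset (sid INTEGER, langcode TEXT, type TEXT, reindex INTEGER);
    INSERT INTO search_dataset VALUES
        (1, 'en', 'node_search', 0),
        (1, 'es', 'node_search', 0),
        (2, 'en', 'node_search', 0);
""")

# No langcode argument: every language row for node 1 is marked.
mark_for_reindex(db, "node_search", 1)

marked = db.execute("""
    SELECT sid, langcode FROM search_dataset
    WHERE reindex <> 0 ORDER BY sid, langcode
""").fetchall()
print(marked)  # [(1, 'en'), (1, 'es')]
```

Note that an UPDATE like this can only touch rows that already exist, so a language with no entry in the table is not marked by it.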
So. Consider this scenario:
a) Node 1 is added in English. It is not added to {search_dataset}. updateIndex() catches it next cron run (due to it not being in {search_dataset}) and indexes it.
b) Node 1 is translated into Spanish. When the translation is added, $node->save() is called, so search_mark_for_reindex('node_search', $nid) is called and the node ID is marked for reindex and will be caught at the next cron run.
c) Node 1's Spanish translation is deleted. This also triggers a $node->save() and the node will be marked for reindex.
So... Hm... In scenario (c) it doesn't look like the Spanish translation for the node is ever deleted from the search index. So that's a bug -- I think in NodeSearch::indexNode() we should be calling search_index_clear('node_search', $nid) before we start indexing the various languages for the node.
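Scenario (a) to (c) can be walked through with a toy model in Python. The index here is just a dict keyed by (nid, langcode), and index_node and run_scenario are hypothetical helpers invented for this sketch; the clear_first flag stands in for the proposed search_index_clear() call before indexing.

```python
def index_node(index, nid, translations, clear_first):
    """Write one index entry per current translation of the node.

    index maps (nid, langcode) -> indexed text.
    """
    if clear_first:
        # Proposed fix: drop every existing entry for this node first.
        for key in [k for k in index if k[0] == nid]:
            del index[key]
    for langcode, text in translations.items():
        index[(nid, langcode)] = text

def run_scenario(clear_first):
    index = {}
    node = {"en": "Hello"}
    index_node(index, 1, node, clear_first)  # (a) node added in English
    node["es"] = "Hola"
    index_node(index, 1, node, clear_first)  # (b) Spanish translation added
    del node["es"]
    index_node(index, 1, node, clear_first)  # (c) Spanish translation deleted
    return sorted(langcode for (_, langcode) in index)

stale = run_scenario(clear_first=False)
fixed = run_scenario(clear_first=True)
print(stale)  # ['en', 'es'] : the deleted translation is still indexed
print(fixed)  # ['en']
```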
But I don't think there is a problem with added translations not being picked up, is there? In your issue report, you said the add translation problem was "partially mitigated" by the node save, but I think it's fully mitigated and will work correctly. You also said "if you call search_index_clear() with a langcode parameter...", but the Node module is never doing that. I think the Node module is using the Search API correctly, and the only bug I can see is that we would not clear out translations in a particular language when they are deleted.
Did I miss something? Is there some scenario in which a translation can be added without it ever being picked up? I don't think so...
Comment #2
jhodgdon
I filed a separate issue for the case of deleting a translation and it never being deleted from the search index:
#2381799: Deleted node translations are never removed from the index
Comment #3
ianthomas_uk
I guess you're right: with the way that NodeSearch is currently using the API, this won't cause any problems. I remain concerned that it's fragile and not the best use of the API, but that's not a good enough reason to add significantly more work to a query that's already reported as being slow (or to slow indexing by treating each language independently).
Comment #4
jhodgdon
OK. Thanks for bringing it up, in any case! It was worth spending a few minutes going through the analysis to make sure the use of the API was adequate, and it did uncover that other issue about never deleting stale translations.