Problem/Motivation
The Search settings page shows indexing status. There are inconsistencies in how it is calculated, which can lead to sometimes seeing negative percentages.
Proposed resolution
Make the calculation of settings consistent, and make sure the throttling settings are consistent with the numbers being displayed. Currently some calculations are using nodes and some are using translations of nodes. They should all use nodes.
Remaining tasks
Make a patch that makes sure throttling and counts are per-node and not per-translation. Also add tests to verify that throttling is being done that way. (See SearchMultilingualEntityTest, which is currently the only test for throttling and index status -- it would also be good if that test was updated to use the UI to set the value so that we would have more complete testing.)
User interface changes
Throttling and indexing status will consistently count nodes and not individual translations of them.
API changes
None.
Original report by @ianthomas_uk
I have a fully indexed site consisting of 6 nodes with some translations.
The indexing status box at admin/config/search/settings says "100% of the site has been indexed. There are 0 items left to index."
If I click the 'Re-index site' button, the message changes to "-100% of the site has been indexed. There are 6 items left to index."
That should read 0%.
Comment | File | Size | Author |
---|---|---|---|
#54 | 2178643-54.patch | 10.21 KB | ianthomas_uk |
#22 | explain-12-count.png | 20.74 KB | jhodgdon |
#22 | explain-18-find-nodes.png | 17.69 KB | jhodgdon |
#22 | explain-12-find-nodes.png | 22.23 KB | jhodgdon |
Comments
Comment #1
ianthomas_ukI've can confirm that this is related to the translations, the bug is in NodeSearch::indexStatus(), where the first query will count one per node but the second will count one per translation.
A simple fix would be to change the second COUNT(*) to DISTINCT(nid).
A better fix might be for all indexing logic to count based on translations. The would also remove the need for the special case in NodeSearch::updateIndex() ($counter). There's probably a reason it doesn't do this already, I'll need to have a closer look.
Comment #2
jhodgdonGood catch. There are probably several things in the search module that weren't updated for languages fully.
Comment #3
ianthomas_ukI'm not sure which way to go with this, at the moment there is a weird hybird of sometimes counting nodes, and sometimes counting translations. The confusion centres around the code in NodeSearch::updateIndex(), which was introduced in #1669876-10: Add missing language functionality in search module. I assume the thinking was that the throttle would be set to a suitable value for a single language, and multi-language nodes will take much longer to index, so fewer should be done at once.
At the moment, the indexing status is meant to count nodes, but the indexing throttle is meant to apply to translations. That means if you have 10 nodes and you set the throttle to 10, then one cron run will show:
- 10 items processed if they have one translation each
- 5 items processed if they have two translations each (2 * 5 nodes = 10 translations)
- 3 items processed if they have three translations each (because the fourth node would take it over the throttle)
- 0 items processed when it hits a node with >10 translations (because it can never fit within the throttle)
We need to decide one way or the other if we are counting nodes or translations. I think we should count nodes, and just accept that more complicated nodes take longer to index. I think users would expect the limits to apply to the number of nodes.
If we were to count translations properly, then that would include splitting the indexing of a single node across two cron runs, which will slow down indexing.
I'll add a comment to the other issue to see if anyone objects to switching back to a node-based throttle.
Comment #4
jhodgdonI agree that we should count nodes in the throttle, but the UI should probably warn (on that setting) that multi-lingual nodes may take longer to index than single-language nodes. It's also true that some node types will take longer to index than others, though... the throttle is just a guestimate...
Really, it would be even better to do what the Queue system does and throttle based on time. Or better yet, use the queue system, since that is precisely what it is set up for. There's an issue on that somewhere.
Regarding the indexing status, I think we have to report it as number of nodes because as far as I know, the only way to count the number of un-indexed items with a simple db query is by node, right? We don't want to have the counting process be a long and complicated for loop.
Comment #5
ianthomas_ukNo, we can count the number of unindexed translations with an SQL query (doing so accidentally is what caused this bug).
The issue with the queue system is it doesn't handle items being re-queued very well.
That would be pretty easy. We could even have a total time and split it by search page ourselves (e.g. you allocate 10 seconds for indexing, we spend 5 seconds indexing NodeSearch, and 5 seconds indexing FooSearch. If NodeSearch was fully indexed, then we'd spend the whole time on NodeSearch.)
It would make my patch on the cron_limit issue obsolete of course ;)
Comment #6
jhodgdonI agree, using the queue system is more complex. Let's discuss that on the other issue dedicated to the queue question (which will likely not be done anyway).
So... I am OK with any of the options:
a) Be more careful about reporting: report number of nodes, not translations, in indexing status, and keep the throttling based on number of nodes.
b) Throttle based on (approximate) number of node translations, with the understanding that we'd always fully-index all the translations when processing a given node. So we'd stop when the number of translations that have been indexed meets or exceeds the throttle number, at the end of indexing a node.
c) Throttle based on time. Be careful in reporting so that everything is based on nodes.
My thoughts:
1. It seems like the simplest fix for just the observed bug reported here would be (a).
2. It also seems to me that if we are going to change how the throttling is done, the most sensible thing to do would be (c), because the objective of throttling is to make sure cron doesn't time out, so basing it on time is the obvious solution. However, it would I think require a change in the API -- the best way (IMO) to implement it would be for search_cron() to pass in a $max_time parameter to the updateIndex() method on each plugin, wouldn't it?
3. Given all of that, probably we should just do (a), and maybe file a separate issue to change throttling to being time-based?
Comment #7
ianthomas_ukOK, let's settle on node-based counting and implement the fix suggested in #1 (as suggested #6a). We can consider this issue fixed when a patch for that is committed.
The more I think about it the more I think a translation-based count doesn't make sense, especially when you're indexing multiple content types. A node with five translations but two fields would probably index faster than a node with one translation but 50 fields.
I've therefore opened #2180389: Switch from item-based to time-based index throttling to switch to time-based indexing.
Comment #8
jhodgdonSounds like a plan.
Comment #9
jhodgdonAdded issue summary. Also this will take care of one of the tests needed in #2179081: [meta] Inventory search.module tests. File issues to address test holes..
Comment #10
jhodgdonI'm working on this.
Comment #11
jhodgdonHere is a patch. NodeSearch now throttles and calculates status based on number of nodes, not number of translations, and there are tests.
Comment #12
jhodgdonI added a couple more lines to the end of the test (thought of one more thing that should be tested). Otherwise, this is the same patch. Didn't bother with an interdiff as there has not been a review yet.
Comment #13
jhodgdon12: 2178643-12.patch queued for re-testing.
Comment #14
ianthomas_ukDo we need the subquery here? We're want each unindexed nid once, can't we do that with SELECT n.nid, max(d.reindex) ... GROUP BY n.nid?
This one I think we could do as "SELECT count(n.nid) ..."
If we do need the longer queries, they would be easier to read if split onto multiple lines.
The changes to the test look good.
Comment #15
jhodgdonRegarding comment #14:
1) I think we do need a subquery is the most efficient way to do this query and the most database-independent, and I tried a bunch of other methods and found they were not doing what I expected; also using a sub-query makes the join a lot faster. Please feel free to try and make it better if you can, if you can make sure all the tests pass. I think this is the best option.
2) Again, I tried a bunch of variations and this is the only one I found that would make the tests pass. Please feel free to try to improve it if you can, but again I think this is the best option that actually works.
3) And regarding your other comment about splitting up the query into separate lines, I haven't seen that done elsewhere in Core (or at least in the Search code), so I am not inclined to do it here?
Comment #16
ianthomas_ukI was thinking of something along the lines of
I can't test this manually though, because I can't translate my nodes at the moment for some reason. I'll come back to this when I can translate them.
The coding standards (https://drupal.org/node/2497) include examples of different ways that people are formatting multi-line SQL, but I had a quick look over core and couldn't find any examples either. We should probably be using Database\Connection::select().
Comment #17
jhodgdonWell, good luck, I don't think that query in #16 will work, but you can try. We have tests for the desired behavior in this patch, so if you can get them to pass, you can try another patch. I'd be interested to know about the performance implications though -- from my reading, the subquery would be expected to be faster than joining first.
Regarding using select(), sure, we could do that. I left it as one-line queries because that is what it was before. I have heard in the past from Berdir that these are also more efficient than using select(), but I am not sure if that is still the case.
Comment #18
ianthomas_ukThis patch has my join-based queries, which pass tests locally. The queries in HEAD, patch #12 and patch #18 are as followed
Nodes to index, where :type = $this->getPluginId()
HEAD
#12
#18
Number of nodes still needing indexing
HEAD
#12
#18
There are no other changes. I would expect a join to be faster, than a subquery, as the dbms has more chances to optimise the query plan, but we'd need to do some profiling to be sure. Is there a recommended way to do that on a large db? I guess we'd need devel_generate.
IMO the join-based queries are easier to read. Note the additional order by clause in the first query is a new feature, deliberately prioritising content that has never been indexed over outdated content.
Comment #19
ianthomas_ukOne point I forgot to mention is that neither version of the queries copes with the situation where there is one translation of a node indexed and up to date (i.e. in search_dataset with reindex = 0) and one translation of a node has never been indexed (i.e. not in search_dataset). It's not possible to detect this using just these two tables, because we assume that the missing translation just doesn't exist.
I can't find a way that this currently causes a problem, because both adding a translation and updating any translation currently updates search_dataset for all translations. Maybe langcode should just be removed from search_dataset?
Comment #20
jhodgdonRE #19... Right, we do not need to worry about this at all. In Node::postSave(), it calls node_reindex_node_search($nid), and this will be called whenever any change is made to any translation, including adding a new translation. We are always calling search_mark_for_reindex('node_search', $nid), and never making it translation-specific.
Anyway... I am pretty sure your new queries are less efficient than the previous ones, because they are potentially joining a lot more rows, and then doing the grouping after the join -- from what I've read, it is more efficient to do the group first as the queries in #12 do.
I'm also not sure your new queries will run on PostgreSQL -- did you test it?
Also, the Order By clauses seem very weird to me, like "ORDER BY MAX(sd.reindex) is null DESC". The query in #12 has the same effect as it orders by d.reindex ASC, which will put the NULL entries first, and I think it's cleaner. I don't share your enthusiasm for the new queries being easier to read... but whatever. We need to fix this one way or another.
Comment #21
jhodgdonWell, I still think the queries in my patch are more efficient, but in the interest of getting this patch in and not wanting to argue endlessly, I am willing to use the queries in #18. They are working correctly, and I just tested in PostgreSQL too, it's fine.
So, I am good for RTBC on either the patch in #18 or the patch in #12. But since I wrote most of the code in both (except for the changed queries in #18), we need someone else to review the patch(es).
I am going to ask in IRC for someone who is more of a query optimization expert than I am to weigh in on the queries in #12 vs. #18. They give us the same results, as demonstrated by the tests.
Comment #22
jhodgdonI realized I had a D7 database with some search data indexed -- about 2500 entries in search_dataset. I ran the queries in #12 and #18 with EXPLAIN (due to caching, query times are hard to measure). This may not count as a huge data set, but at least it is not insignificant. Nothing else was running on my local DB at the time.
Results:
a) Nodes to index:
#12 explain output:
#18 explain output:
b) Counting nodes:
#12 explain output:
#18 explain output:
Hopefully that will help figure this out?
Comment #23
jhodgdonOK... I guess the information I read about doing this type of join earlier was incorrect or didn't consider indexing. From what I can tell, the queries in #18 look like they are more efficient, given this EXPLAIN output, because they are making better use of indexes (the inner SELECT table isn't indexed).
So, I'm +1 for RTBC on the queries in #18 that ian_thomas_uk supplied.
ianthomas_uk: I wrote the rest of the code in this patch. Would you care to review that and mark the patch RTBC (hopefully) or whatever? Thanks!
Comment #24
ianthomas_ukI'm not sure what to do here.
The rest of the patch looks good, and fixes the standard use case without breaking anything.
However, the queries could still end up with the wrong values if modules call the API functions themselves (as described in #19). The correct fix to that is just to drop the $langcode parameter from all these functions and tables, and ideally refactor into the appropriate classes at the same time. But that's a much bigger change and I think there are more important things to do. There's a related bug where if you call search_reindex(null, null, $langcode) then the langcode will be ignored.
So I'm tempted to say we should go with the current patch (which fixes the observed problems), and file a followup to fix the little-used API.
Comment #25
jhodgdonRE #24, which API functions are you talking about specifically that would malfunction?
Comment #26
jhodgdonSo... Regarding #19/#24, I do not think we need to worry about this, because:
The patch here is about what the NodeSearch plugin does regarding index throttling and status reporting. There is no problem here -- the NodeSearch plugin and node.module will always mark an entire node as needing a reindex if any translation is updated, so it is all self-consistent. And the queries in question for this issue are part of NodeSearch plugin, not generalized, so that is fine.
For another hypothetical plugin that uses the search index, we have several API functions:
a) search_index() causes {search_dataset} tol be updated, for the particular "sid" value and language code that it was called with, to have reindex=0 for that particular type/sid/language.
b) search_reindex() causes {search_dataset} and {search_index} to be cleared for that particular item (type/sid/language) requested.
c) search_mark_for_reindex() -- that function does not have a language code parameter, so all existing languages' reindex values for the given type/sid will be marked to the current request time when this is called. We could add a $langcode parameter, but that's a separate issue.
So, this hypothetical other plugin is free to report whatever it wants in its search indexing status and do whatever it wants in its update index method. I don't see how any of (b-d) are problems in that regard, since it can do its own queries on the {search_dataset} table joined with its particular table. What's the problem you are seeing?
Comment #27
ianthomas_ukThe queries could produce invalid data for a valid database, but I can't see how NodeSearch could create that valid database, short of a fatal error in the loop in indexNode(). The "valid database" could contain:
-1 node
-2 translations
-1 entry in search_dataset with reindex = 0
The queries would tell you that everything was indexed, because we know we've indexed every node (we check by joining the node table) and we know that we've indexed every translation that we know about (because reindex = 0).
You could get the database into that state by calling search_reindex with a $sid and/or $type, plus $langcode, while a node was waiting to be indexed.
Also, a call to search_reindex(null, null, 'en') will delete the index for all languages (separate bug, not a regression).
Finally, it's poorly named - I'd expect to have a freshly populated index after calling search_reindex, but I'd get an empty one (again, a separate issue).
To summarise, it's not a good data structure / api, but this patch doesn't make that any worse.
Comment #28
jhodgdonI feel like you're trying to find problems that don't exist.
search_reindex() is *designed and documented* to clear out the whole search index if you call it with type/sid == NULL -- it ignores language code in that case. That is not a bug -- that is how to clear out the whole search index with an API call. The name of the function is stupid, I'll grant you that, but it's been like this since Drupal 5, and the name of that function is not relevant to whether the API is broken or not, or this issue at all.
In any case... NodeSearch and the Node module never call search_reindex with $langcode, and certainly never calls search_reindex(null, null, $langcode). So I think NodeSearch is using the database/API in a consistent way. If other plugins (with presumably their own indexing type) choose to use the API in an inconsistent way, that is their choice, but not a good choice. I really don't see what the problem is at all with the API or the database or this patch.
Can we just review *this patch* and if you think there are other problems, discuss them elsewhere please please please?
Comment #29
ianthomas_ukSorry for going off-topic above. These are real problems that I spotted while reviewing the change, but they are not regressions caused by the change, so shouldn't block the patch.
I'm also +1 to RTBC #18, but I'd be happier if someone else could have a look too.
Comment #30
ianthomas_ukSummary of patch reviews so far:
- jhodgdon posted a patch on #12. I was happy with it apart from two SQL queries.
- I posted a patch with alternative queries on #18. jhodgdon initially didn't like them but later came round. The queries are equivilent so it's partly personal preference. Performance is not known to be a problem, but #18 does look faster.
- I had concerns about the API, but these are not caused by this patch. I've opened #2211241: Refactor search_reindex() into separate functions as a followup.
Therefore after another look over the patch I'm now happy to RTBC.
Comment #31
alexpott2178643-18.patch no longer applies.
Comment #32
jayeshanandani CreditAttribution: jayeshanandani commentedRerolled #18.
Comment #33
ianthomas_ukExtra whitespace.
Why is language being changed to langcode? It's REALLY, REALLY important when resolving conflicts to changed lines to check if other parts of the line have been changed. The intentional change here was $plugin to $this->plugin.
More whitespace
Comment #36
jayeshanandani CreditAttribution: jayeshanandani commentedLast reroll was chaging some other code. Modified that and uploading a new reroll.
Comment #38
jayeshanandani CreditAttribution: jayeshanandani commentedLast Rerolled missed some modifications. Updated the patch with those and added new modficiations so test that doesnt fail.
Comment #39
ianthomas_uk@jhodgon Did you deliberately combine the tests in #11? I didn't spot it before, but it means we're now testing searching in a function called testIndexingThrottle(). Here's a patch that splits them again, based on #18 (because I started it before #38 was posted).
#38 looks good as a straight reroll of #18.
Comment #40
jhodgdonYest, it was intentional. It is a lot more efficient to have all the tests in one method, because setup doesn't have to be done twice. So let's instead of splitting that up, change the name of the method.
Comment #41
ianthomas_ukThis is a revision of #38 with an alternative name and comment.
Other than this I would have been happy to RTBC #38.
Comment #43
ianthomas_uk41: 2178643-41.patch queued for re-testing.
Comment #44
jhodgdonThis looks good except in the test there is still this comment:
that refers to a now nonexistent method.
Comment #45
ianthomas_ukI've just removed those two lines - the full explaination it was referring to is only a few lines above anyway.
Comment #46
jhodgdonOK, I'm +1 for RTBC again, and this time I think I can mark it since I didn't make any of the changes since the last RTBC (which finally amounted to mostly rerolling).
Comment #47
alexpottThis should be called
assertIndexCounts
Comment #48
jhodgdonGood idea. New patch, changing the test method checkIndexCounts to assertIndexCounts.
Comment #49
ianthomas_ukChanges in #48 are as described, but the patch already needed a reroll.
Comment #51
ianthomas_uk49: 2178643-49.patch queued for re-testing.
Failures in
Drupal\node\Tests\NodeTranslationUITest
Drupal\path\Tests\PathLanguageTest
Comment #52
jhodgdonReroll is good. Only difference (besides reroll) between the patch in #50 and patch in #45 is the requested change to the assert method in the test class. Tests are passing. So back to RTBC.
Comment #54
ianthomas_ukSimple reroll caused by
$this->entityManager->getStorageController('node');
becoming$this->entityManager->getStorage('node');
Comment #56
jhodgdon54: 2178643-54.patch queued for re-testing.
Comment #57
jhodgdonComment #58
alexpottCommitted 3c488c4 and pushed to 8.x. Thanks!