I think the current logic used in node_update_index for node selection prior to indexing is not correct. Indeed, on a site I have, it prevents 80% of all nodes from being indexed.

The problem is in the criterium used to restart the loop. Currently, this is driven by a WHERE clause:

WHERE [...](n.created > %d OR n.changed > %d OR c.last_comment_timestamp > %d)

where %d is filled with the value of variable_get('node_cron_last', 0);.

This will often work. However, on high-volume sites, it is frequent to have nodes with identical timestamps. Indeed on the site I'm mentioning, about 4000 nodes have the same 'changed' value, of which a few hundred even have the same 'created' value, and the site have a script time limit of 20 sec.

In such a case,

  • on the first call to node_update_index during a cron where some nodes match this WHERE, node_update_index starts indexing these nodes
  • after it has processed at least one of them, but while it has not finished indexing all of them, the script execution is cancelled because it takes too long
  • at this point, node_cron_last is set to the value of this shared timestamp, because at least one of the nodes sharing it has been processed, but the loop has not finished processing all of the nodes sharing it and gone on to later nodes.
  • on the following cron call, node_update_index is invoked again, and fetches variable_get('node_cron_last', 0);. However, the value it obtains is the shared timestamp, that can be common to thousands of nodes or more
  • because the WHERE tests for "greater than" instead of "greater or equal", all other nodes sharing the timestamp are ignored this time
  • loop processing can continue, because this largely shared value has now been ignored

In the end, this can result in a very high "unindexed" rate that can easily go unnoticed. One symptom can help pinpoint the problem:

  1. ask a reindex in admin/settings/search.
  2. Refresh admin/settings/search: Indexing progress is a 0%
  3. Invoke cron.php
  4. Refresh admin/settings/search again: indexing must have progressed in relation to the items to index setting
  5. Repeat (3) and (4)
  6. At some point, the indexing progress will be jumping well over the "items to index setting", and the corresponding nodes won't be indexed

This can also be measured directly on a site by issuing:

SELECT count(n.nid) cnt, n.changed FROM `node` n 
GROUP BY n.changed
HAVING cnt > 1

Any line appearing in that query can cause nodes missed from indexing (adapt query if you have comments too), and will certainly result in misses if some line has a very high cnt value and the site limits the running time of scripts.

Until this is patched, a workaround is to falsify the timestamps of nodes to avoid duplicates, which will effectively remove the problem (until it reappears due to high rate publishing causing new timestamp duplicates to appear). The root cause is the too low granularity of 1 second timestamps used as a serialization means.

The "quick fix" idea of changing the ">" to a ">=" does not remove the problem: it can only alleviate it in some cases (if it allows the full set of duplicates to be processed in just one loop while it couldn't be fully processed if the duplicate list appeared at the end of a previous loop), but will usually make it worse: if the full list cannot be processed in one pass, it will restart on the next pass... and fail again.

What is needed is probably an actual way of serial-stamping nodes that will be guaranteed to provide a strictly monotonic suite of values. This could be implementing by serializing revisions: add a "serial" column, the value of which is set by a call to db_next_id. But of course, it makes tha table heavier.

Just an idea: maybe the revision vid could be used in that context ? I don't know how this works, though, so it is just a suggestion.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

fgm’s picture

Until this is fixed a quick hack to make sure the timestamps differ is to execute this code on the site with duplicate timestamps. Use devel.module / execute PHP code feature:

$sq = 'select count(nid) cnt, changed from {node} group by changed having cnt > 1 order by cnt desc';
$q = db_query($sq);
while($o = db_fetch_object($q))
  {
  echo "$o->changed, $o->cnt";
  $sq2 = 'update {node} set changed = changed+rand()*3600 where changed = %d';
  $q2 = db_query($sq2, $o->changed);
  }

Execute repeatedly until no more duplicates remain. Of course this means the 'changed' values are no longer accurate and can cause problems elsewhere.

Cvbge’s picture

So the problem with indexing is like this:

we have a set of rows, that have the same timestamp. We processed some, but we don't know which. So we need to add something, a property, that will make each row unique.

Such property could be a node nid. We need to save nids already index for this timestamp. Then, on next cron run, when we start processing the same timestamp, we just skip nids that were indexed.

In fact we could add sorting by nid too. That means we could only remember last nid, not all already indexed nids.

This, plus '>=' instead of '>', should solve the problem.

merlinofchaos’s picture

Status: Active » Needs review
FileSize
2.57 KB

Here's a patch as Cvbge suggested.

Steven’s picture

I would prefer sorting by nid like cvbge proposed, then we only need to store one variable. Also, to reduce the amount of queries we could make node_cron_last an array with the timestamp and last nid in it. Each call to variable_set is quite heavy.

Cvbge’s picture

Each call to variable_set is quite heavy.

Why don't we add memory_set()/get() ? That would store variable in cookie/url/whatever. This functions could be also used by the rest of Drupal code.

killes@www.drop.org’s picture

Status: Needs review » Needs work

setting to "needs work" as per Steven's comment.

Cvbge’s picture

Marked http://drupal.org/node/44196 as duplicate of this issue.

My comment about memory_set() was not correct of course, we can't store the variable in memory (temporarily).

ccourtne’s picture

Storing just the last nid and ordering the query by nid will not work properly.

First the nid method
Nodes 100-250 all have same timestamp
Index runs. Nodes 100-200 are indexed we store timestamp of 200 and nid of 200.
Node 110 is changed
We run a query asking for all changed nodes above nid 200.
We don't pick up 110 as a changed node this time
110 will never be indexed because we will always be storing a last node index of higher than 110

The existing array solution works in this scenario and correctly indexed content in the order it was changed, but it suffers from a scaleability problem. I have a situation where I import a large number of nodes at the same time. I can have 1000 or more nodes with the same timestamp. In this case the array would have to be saved/stored as a large array which is probably not a huge problem if the variable set is down outside the loop. But a NOT IN clause with a couple hundred or thousand creates a huge sql statement (with a 4 digit nid we add 4 chars per node so 400 bytes for 100 or nearly 4k bytes for 1000). I'd have to look at the PHP/MySQL integration to see if it has a statement size cap.

The real solution is to store last index timestamp along with the node as in my original patch that got marked as a duplicate. In addition adding a timestamp to the nodes allows you to reindex an individual node by settings it's timestamp to 0. In the current implementation it's reindex the entire site or nothing. The one downside is reindexing the whole site would require a mass reset of the index column on the node table. My patch should be refactored and put index timestamp in a separate table, that way you could just truncate that table for a reindex, and you would not have to worry about a node edit changing the indexed timestamp.

Steven’s picture

Err no ccourtne, the nid clause would only be applied to nodes with the last indexed timestamp same as "node_cron_last". The logic would be (simplified):

SELECT * FROM node WHERE (changed > node_cron_last) OR (changed == node_cron_last && nid > node_cron_nid);

e.g., Nodes 100 through 200 all have the same timestamp 1234.

Cron runs, and stops indexing after node 150. Node_cron_last is 1234, node_cron_nid is 150.

Now, node 125 is edited and its timestamp is set to 'now', e.g. 2345. Cron runs again.

Nodes 151 through 200 are indexed, because their nid > node_cron_nid (150) and their timestamp is 1234 (= node_cron_last).

Node 125 is indexed, because its timestamp 2345 is > node_cron_last. Its nid does not factor into things.

ccourtne’s picture

Status: Needs work » Needs review
FileSize
4.09 KB

Of course you are correct. Sorry I didn't see that at first because the original description said >= instead of > which would lead to the scenario I outlined. I'll blame my current head cold ;)

I have attached a patch which implants the SQL logic you outlined. It actually is more heavy weight than my approach to creating a second table. Each call to variable_set generates 3 sql statements (delete, insert, and then delete from cache) and this approach requires 2 calls inside the loop so we end up with 6 round trips to the database per node just to keep track of state. Using a dedicated node_index timestamp table we can reduce that to 1. In addition the variable set causes us to lock the table which is a fairly heavy weight call. Which means on default settings during a cron.php run we will be locking / unlocking the table 200 times in one php file.

Not sure if any of that plays much difference since this is in the cron.php run, but it could reduce server load and allow more indexing per cron call. The site I have the run into this problem are not high traffic just large amounts of data, do high traffic sites slow down during cron indexing right now because of these lock and flushes?

Cvbge’s picture

Version: 4.7.0-beta2 » x.y.z

This can be improved:

I assume that it's not possibile to run two update index functions at the same (makes not sense anyway).

That's why we don't have to lock the table. We also can be sure that the variable is in the table, so we don't need the DELETE/INSERT pair, just UPDATE. We don't have to use the variable_set/get functions.

Since we want to be really correct and save the nid and timestamp after each loop, we could be very extreme and use a special table for keeping the last nid and timestamp. That one-row table would be very fast to update (I say faster than the index-column approach ;))

We could use {variable} for that, but a) I don't know how big this table can get, b) it has only 1 column for the value, so we'd have to have 2 rows there or serialize the data and c) it has an index which we don't need.

Cvbge’s picture

I didn't read your last followup carefully, so if my reply sounds like discovering the wheel again, then I'm sorry. I was basing my reply on your former idea of 'index' column in node table.

Steven’s picture

I don't like all those variable_set() calls either ;). All those 'DELETE FROM cache' queries mean that cache access is sluggish and that each visitor during indexing recreates the variable cache over and over again.

However it is important to keep in mind why they are there: because the indexing will sometimes not complete.

This can happen because:

  1. the PHP time limit is hit.
  2. the PHP memory limit is hit.
  3. PHP crashes (I used to have this a lot locally, I'm not sure if this still happens on the latest PHPs).
  4. A node with PHP code was indexed which aborts the indexing (e.g. by calling drupal_goto()).

If the indexing aborts, then a node will be partially indexed and/or search_total will not be up-to-date. So perhaps a better option is to go further and refine the whole process?

For example, only update node_cron_last/nid at the end of indexing. If the last run did not complete, we could automatically make the 'number of nodes per run' smaller and try again. Some nodes would be unnecessarily re-indexed, but it's better than skipping some. We can do this for free too: there is a check in cron.php which we simply need to pass to hook_cron.

The PHP code node issue is tricker: if exit() is called, the indexer would get stuck in a loop. However, we don't want such nodes in the index at all, because the code would also be executed in the search results page. And generally, you use PHP for dynamic pages, so it doesn't make much sense to index them because it only takes a snapshot. So, we could just skip over all nodes that use a format with the PHP filter in it.

ccourtne’s picture

Here is what I would suggest. Don't use the variable and function to track indexing. We create a new table (node_indexing) with three fields (nid, indexed, attempts). indexed stores the timestamp of the last indexing of that particular node. Attempts counts how many unsuccessful attempts at indexing that node have occurred.

In the indexing loop we select nodes that have been changed since their last indexing and have less than three attempts. During the loop first we update the index attempt count for that node. Next we index the node and then update the indexing timestamp and reset the attempts to 0.

This approach maintains the current behaviors of addressing timeouts and indexing failures. It also adds a few key features. First it reduces the amount of sql calls to keep track of indexing state from 6 per node indexed to two. Second it never affects the cache at all. Third it makes it possible to find out which nodes have not been indexed because of errors. Fourth you can force a reindex of nodes individually if necessary. Fifth it localizes changes to just the indexing routines of the node module.

We could also just add an index status field to the node table itself, but this requires specific handling in the node save/update logic to make sure the index status is reset when the node is updated. This code would be spread out across several non-indexing functions. And would create more update traffic on the node table which is a table that gets accessed allot.

Steven’s picture

Keep in mind that the node indexing recovery was only added as an emergency measure for clueless admins. It is not reliable!

The search_total table is only updated at the very end of a cron run and it is an essential part of the search queries. For example: if the cron started indexing at node 100, and was aborted for some reason at node 150, then any new words in nodes 100-150 will still not be searchable, because they were not added to search_total. The next run would start at 151, while nodes 100-150 would only become searchable if the words in them are repeated in later nodes.

This is why I'd really prefer the "better safe than sorry" approach of re-indexing, and just disallowing PHP-format indexing altogether. It seems overkill to require an additional table, and introduce a row-specific JOINed inequality on the SELECT (rather than the table-wide one we have now).

I don't have time to make a patch and test it though... exams :|.

moshe weitzman’s picture

One of the main benefits of maintaining an index is that we can index the ouput of php format nodes. That is a key advantage over trip_search, and it would be a bit sad to see it go. Not terrible, just slightly sad.

Steven’s picture

Well, on Drupal.org we use PHP nodes to do drupal_goto()s (for example, the patch bingo). I suppose we could hack drupal_goto() not to redirect/exit for cron.php and make a note somewhere that you should never call exit/die from within a PHP node.

walkah’s picture

OK, chx has asked me to look at this ... so here are my initial thoughts:

It sounds to me (and someone correct me if I'm wrong) ... but the big issue here is handling the scenario where cron doesn't exit "cleanly" during the index...

my first guess (off the cuff) is that we might be able to do something with register_shutdown_function() ( http://ca.php.net/manual/en/function.register-shutdown-function.php ), so that we can hold off on setting the last cron variable until the process is done (or if it's killled).

i'll pursue this for a bit as I think this will allow us to safely jump over things like PHP nodes that call exit() (which is the issue with drupal_goto afaict).

anyone else have thoughts?

Dries’s picture

I doubt you can use a shutdown function. It would allow us to buy unlimited time and solve a whole bunch of problems? :) Worth a try though.

walkah’s picture

FileSize
4.99 KB

ah... but it *does* work. attached is the patch... great way to test this - using devel module's generate-content.php as it will create a bunch of same-timestamp content.

for me, after generating content... pre-patched search gets stuck forever at 66% ... patched version will go to completion.

m3avrck’s picture

Can this patch be made to use the new Drupal variable 'cron_last' that is set anytime cron runs (see 2nd to last line of cron.php)? Seems the 'node_cron' is merely a duplicate variable.

Steven’s picture

Status: Needs review » Fixed

m3avrck: Cron_last records the last time cron was run. Node_cron_last records the timestamp of the last node that was indexed. Not all nodes get indexed in one go.

Anyway, the patch works as advertised: at most one node will not be indexed if cron times out or PHP code causes the request to stop prematurely. search_total is always up to date.

I can't break it locally no matter what I throw at it ;). Committed with some minor code cleanups. Thanks everyone.

fgm’s picture

Status: Fixed » Needs review

Sorry to unfix, but I just noticed another aspect of this same problem: when a node author sets authoring date past the current date (say many years from now), node_cron_last can get set to this date, at which point nodes created until that future date are not indexed by search. The timestamp of a node set to 19/01/2038 was preventing any further indexing on the site.

Does the patch fix this condition too ?

walkah’s picture

i suppose that we could set it to be the min($last_change, time()) .. but is there really a practical use case for setting the node modified timestamp to be that far in the future?

moshe weitzman’s picture

well, drupal currently sorts stuiff by timestamp and if you want your stuff pushed tothe top but not sticky, you set a future date. so i think the min() is useful and adds just a few characters.

Steven’s picture

Better solution: don't rely on $node->created at all? $node->changed is not user-editable, so it will never be a time in the future (see node_save() ).

ccourtne’s picture

  // If node has never changed, set $node->changed to $node->created
  // If we set $node->created to time(), then 'changed' and 'created' will be
  // different for new nodes which were previewed before submission
  // The changed timestamp is always updated for bookkeeping purposes (revisions, searching, ...)
  $node->changed = $node->changed ? time() : $node->created;

Using $node->changed does not fix the problem unless we fix the above code from $node->save. I personally think the above code should be removed and $node->changed should always be set to the current time.

Steven’s picture

Well, IIRC that code was added because for admin users, $node->created is initialized when the node/add form is first loaded (and filled into the "Authored on" field), whereas $node->changed is set in node_save(). So there would be a small time difference.

Having $node->changed be later than $node->created was deemed undesirable.

Personally I agree with Cvbge. $node->changed is an internal bookkeeping timestamp, $node->created is a public, user-defined value.

ccourtne’s picture

Also using min(time(), $last_change) whould cause any nodes that are set to future dates to be reindexed every cron run. Probably a minor problem but one to be considered. Including time() in the select criteria would also make node cron indexing logic assume that the system clock on the machine doing the indexing and the machine editing the pages have the same system time. It's possible using NFS or other shared filesystem to run your indexing on another machine. Probably a rarity but one that should be documented. The only way to get away from this is to not use dates but set a boolean "indexed" flag on each node which gets reset every time it is saved or a comment gets posted accompanied by an "index attempt" count field.

I'm still not wholey convinced we aren't being overly elegant with this complicated date storage, compare and register shutdown which will might continue to leave us open to odd bugs like the one this was re-opened for. But the current head tree has the major advantage of only updating state to the DB once per cron run.

Zen’s picture

Priority: Critical » Normal
Status: Needs review » Active

Issue queue clean-up: Primary issue fixed. Secondary issue non-critical.

Thanks,
-K

Taran’s picture

Umm. The documentation of this problem doesn't make sense to me. It doesn't state which version of Drupal the patches are for, when they'll be incorporated... this is a showstopper of me for the last stable release of Drupal, and now I have to fumble through and figure out which patch belongs where, and even if the patches are relevant with the latest releases of Drupal.

I am seeing it in 4.6.6.

Further, the title is very ambiguous. I'd been searching for days on this matter, and only when I tracked the problem to search_index and timestamp did I get here, which means that other people who are not seeing their sites indexing properly may never get here. I notice the thread is marked 'fixed', but it's hard to tell from the discussion that there was a fix which everyone agreed on.

So now I get to figure out what goes where, with what version, etc.

Taran’s picture

Status: Active » Needs work

The patch apparently breaks when used in conjunction with flexinode in Drupal 4.6.6

' Cannot use a scalar value as an array in \...\node.module on line 1763.'

Not nice... Waiting for this to be incorporated in Drupal 4.7 in a more vector manner. Or I'll have to adapt it.

LAsan’s picture

Version: x.y.z » 7.x-dev

Is this still a bug in current version?

Damien Tournoud’s picture

Version: 7.x-dev » 5.x-dev

This is at least solve by the new indexing logic of Drupal 6, but I'm unsure about Drupal 5.x.

davidredshaw’s picture

I've had this kind of issue and may have discovered another issue with this query. The ORDER BY clause is as follows:

ORDER BY GREATEST(n.changed, c.last_comment_timestamp) ASC, n.nid ASC

but in MySQL 5.0.13 and above the GREATEST function returns NULL if *any* of the arguments are NULL (http://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html#functio...). This means that if your nodes have no entries in node_comment_statistics, the ordering will be wrong in the query results and the node_cron_last and node_cron_last_nid will not be updated correctly.

I need to look at this some more but it looks like that's the problem - it may also explain this problem: http://drupal.org/node/232491

I'll keep looking but if anyone has any observations let me know.

Mattias-J’s picture

We hit this bug on our site too. Found the error to come from the GREATEST command just like dredshaw.
The mystifying part is that they take care of the NULL problem in the column selection part but not anywhere else.

If you have no comments on your site the first cron job will work. The next run though will sort the nods after nid, BUT it will filter it by last changed. This means that all nodes lower then the nid with older timestamp will get dropped. The fix is easy. Copy the if statement around c.last_comment_timestamp in the first section of the select query, and make all the other GREATEST functions have the same design. This should give you a SQL query like this:

SELECT GREATEST(IF(c.last_comment_timestamp IS NULL, 0, c.last_comment_timestamp), n.changed) as last_change, n.nid FROM {node} n LEFT JOIN {node_comment_statistics} c ON n.nid = c.nid WHERE n.status = 1 AND ((GREATEST(n.changed, IF(c.last_comment_timestamp IS NULL, 0, c.last_comment_timestamp)) = %d AND n.nid > %d) OR (n.changed > %d OR c.last_comment_timestamp > %d)) ORDER BY GREATEST(n.changed, IF(c.last_comment_timestamp IS NULL, 0, c.last_comment_timestamp)) ASC, n.nid ASC

And it made a drastic change, we run 10 nodes per cron job, and we went from ca 4.8% of the sites indexed to 100%.

I don't have the tools to create a patch, so if someone could whip something together...

pmetras’s picture

I found this thread after (Grrrr!) hitting the bug last week and spending a day analyzing this problem. It still current at least for version 5.x, and my solution is similar to Mattias-J's.

The query for the node_update_index() function becomes:
SELECT GREATEST(COALESCE(c.last_comment_timestamp, 0), n.changed) as last_change, n.nid FROM {node} n LEFT JOIN {node_comment_statistics} c ON n.nid = c.nid WHERE n.status = 1 AND ((GREATEST(n.changed, COALESCE(c.last_comment_timestamp, 0)) = %d AND n.nid > %d) OR (n.changed > %d OR COALESCE(c.last_comment_timestamp, 0) > %d)) ORDER BY last_change ASC, n.nid ASC

And to be consistent, here is the one for node_search():
SELECT COUNT(*) FROM {node} n LEFT JOIN {node_comment_statistics} c ON n.nid = c.nid WHERE n.status = 1 AND ((GREATEST(COALESCE(c.last_comment_timestamp, 0), n.changed) = %d AND n.nid > %d ) OR (n.created > %d OR n.changed > %d OR COALESCE(c.last_comment_timestamp, 0) > %d))

Mattias-J’s picture

Status: Needs work » Needs review
FileSize
2.55 KB

Ok, I finally think I've managed to get a working patch. I liked the look of pmetras queries so i used them instead.

This is my first patch, and I hope I haven't screwed up to much.

This was made on the 5.x-dev 2008-Jun-07 build.

Summit’s picture

SUbscribing, needing a drupal 5.8 version.
Thanks for going into this!
Greetings,
Martijn

Mattias-J’s picture

Version: 5.x-dev » 5.10
FileSize
2.66 KB

Ok. Finally had the time to look into the patch documentations.

Heres a new patch built from the correct path. It is build ageist DRUPAL-5-10, but will work against all drupal 5 releases.

This is a quite simple fix, and it is critical for the search function to work at all...
So i hope this patch will have some more attention then the old one.

webchick’s picture

subscribing.

DamienMcKenna’s picture

I think it'd be better to use the #146466: D6 search refactoring (backport to 5) patch instead?

Mattias-J’s picture

#207639: Backport recent 6.x-dev search engine improvements suggests to me that the d6 search actually isn't wanted for d5. I don't really get why catch made such a mess of everything by changing the d6 case to a backport case...

But sure, as long as this gets fixed sooner rather then later. It stuns me that there hasn't been a fixed version of d5 released for this yet...

Damien Tournoud’s picture

Version: 5.10 » 5.x-dev

Bumping. This would require some quick review and some testing (particularly on PostgreSQL).

gpk’s picture

See also #139537: search indexing gets stuck: node_update_index() - if comment.module is disabled which is (was) looking at doing something very similar.

TR’s picture

Status: Needs review » Closed (won't fix)

This appears to be an issue only with Drupal 5 and not with Drupal 6 or higher (see #34). As Drupal 5 is no longer supported, I'm marking this as won't fix.