By not using the filter cache we are wasting precious CPU cycles.
We can see that each batch job takes around 16 seconds. If we dive a little deeper, we discover the following data:
http://note.io/18cjA0c
3 of the top 10 most expensive functions, measured by exclusive wall time (the time spent only in that specific function), come from the filter functions.
Timing without the patch in Drush:
10848 items successfully processed. 10848 documents successfully sent to Solr. [status]
real 9m50.093s
user 7m29.995s
sys 0m11.233s
Now, if we apply the patch, we get the following xhprof results. We can clearly see that some functions are not called as frequently anymore, but more importantly, the time it takes has almost been cut in half:
http://note.io/18cl1fa
http://note.io/18clz4N
And if we check this with Drush:
10848 items successfully processed. 10848 documents successfully sent to Solr. [status]
real 4m26.749s
user 3m13.977s
sys 0m5.224s
The filter cache documentation states that the cache is infinitely valid:
// Cache the filtered text. This cache is infinitely valid. It becomes
// obsolete when $text changes (which leads to a new $cache_id). It is
// automatically flushed when the text format is updated.
// @see filter_format_save()
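For context, the comment above describes the caching path inside Drupal 7's check_markup(). A simplified sketch of that path, paraphrased from memory rather than copied from core (the hypothetical _run_filters_sketch() helper stands in for the real filter loop):

```php
// Simplified sketch of D7 check_markup() caching, not verbatim core code.
// The cache id is derived from the text itself, so any change to $text
// automatically produces a new cache entry.
function check_markup_sketch($text, $format_id, $langcode = '', $cache = FALSE) {
  $format = filter_format_load($format_id);
  $cache_id = $format->format . ':' . $langcode . ':' . hash('sha256', $text);
  if ($cache && ($cached = cache_get($cache_id, 'cache_filter'))) {
    return $cached->data;
  }
  // Run every enabled filter for this format on $text.
  $processed = _run_filters_sketch($text, $format, $langcode); // hypothetical helper
  if ($cache) {
    cache_set($cache_id, $processed, 'cache_filter', CACHE_PERMANENT);
  }
  return $processed;
}
```

The key point for indexing is the early return: when the cached copy exists, none of the (expensive) filter functions run at all.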
So I think we can safely state that we should push this into the indexing code and significantly speed up the indexing process.
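The attached patch is only 603 bytes, so presumably the change in the indexing code amounts to opting into the filter cache rather than re-running the filters on every pass. A guess at the shape of the change, not the literal patch:

```php
// Hypothetical before/after in the indexing path (illustrative only).
// Before: $cache = FALSE, so the filters run on every index pass.
$text = check_markup($node_text, $format_id, $langcode, FALSE);
// After: the filtered text is fetched from (and stored in) the
// cache_filter bin, so re-indexing unchanged text skips the filters.
$text = check_markup($node_text, $format_id, $langcode, TRUE);
```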
Comment | File | Size | Author
---|---|---|---
#1 | 2093031-1.patch | 603 bytes | Nick_vh
Comments
Comment #1
Nick_vh

Comment #2
Nick_vh

Comment #3
Nick_vh

If we could somehow speed up
$document->content = apachesolr_clean_text($text);
it could give us another performance gain, but I understand that it is not so easy to avoid.
Comment #4
Nick_vh

Committed to 7.x-1.x. We should figure out whether this also applies to 6.x-3.x.
Comment #4.0
Nick_vh

Changing markup