We're running into quite large memory spikes when updating an index. This is likely because one of our node types loads 5 node references during indexing.

We're seeing spikes over 200M, which cause the requests to fail. We can fix this at the server level, of course, but there has to be a better strategy for managing memory during an index run.

At the moment, I'm overriding 'memory_limit' inside the drush index command. That helps with bulk indexing.

The Migrate module does something like this: it recycles its drush processes when memory usage reaches about 80%. That would be really nice to have. See drush_migrate_import() for an example.
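To make the idea concrete, here is a minimal sketch of such a threshold check. It is not Migrate's actual code: the function names are made up for illustration, and the 80% default is only taken from the description above.

```php
<?php

/**
 * Converts a php.ini size string such as "256M" to bytes.
 *
 * Hypothetical helper, not part of Drush, Migrate or Search API.
 */
function example_parse_size(string $value): int {
  $value = trim($value);
  if ($value === '') {
    return 0;
  }
  $number = (float) $value;
  // Multiply by 1024 once per unit step; the cases fall through on purpose.
  switch (strtolower(substr($value, -1))) {
    case 'g':
      $number *= 1024;
    case 'm':
      $number *= 1024;
    case 'k':
      $number *= 1024;
  }
  return (int) $number;
}

/**
 * Reports whether memory usage has reached a fraction of the limit.
 */
function example_memory_exceeded(int $used_bytes, string $limit, float $threshold = 0.8): bool {
  $limit_bytes = example_parse_size($limit);
  // A limit of -1 (or 0) means "unlimited", so the threshold is never hit.
  if ($limit_bytes <= 0) {
    return FALSE;
  }
  return $used_bytes >= $limit_bytes * $threshold;
}
```

Inside a long-running loop this would be called as example_memory_exceeded(memory_get_usage(), ini_get('memory_limit')), and a TRUE result would be the signal to stop and hand the remaining work to a fresh process.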

Comment | File | Size | Author
#8 | search_api_1137734.patch | 2.14 KB | drewish

Comments

drunken monkey’s picture

Category: task » feature

Would this maybe fix your problem: #946624: Use a cron queue for indexing?
When using a cron queue, I think only one item will be indexed at once, probably resulting in lower memory usage.

agentrickard’s picture

Likely, though it wouldn't help with Drush integration. We'd be better off if we could figure out where the memory consumption is coming from -- though I suspect it's simply the process of loading a hundred or so nodes.

drunken monkey’s picture

Yeah, first loading them, then extracting the data and storing all data of all nodes in one giant array. If you index all nodes at once, there is really no avoiding that.
Maybe you should just fix the Drush integration to index in batches (e.g., according to the cron limit), even when all items are indexed.

agentrickard’s picture

The Drush integration does default to use cron-size batches, unless you explicitly override it.

I'll take a look at the migrate code if I get a chance and see if we can use that trick as well.

drunken monkey’s picture

"The Drush integration does default to use cron-size batches, unless you explicitly override it."

Yes, I know. I meant that, even when explicitly specifying a higher limit, or no limit at all, it should index those items by repeatedly taking a chunk of them and indexing those.

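// A negative limit means "index everything that is still unindexed".
if ($limit < 0) {
  $limit = $total_unindexed_items;
}
// Index in chunks of at most 50 items until the limit is used up.
for (; $limit > 0; $limit -= 50) {
  search_api_index_items($index, min(50, $limit));
}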

This plus error handling.
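A rough sketch of what "plus error handling" could look like. To keep it runnable outside Drupal, the indexing call is passed in as a callable; in Search API for D7 it would wrap something like search_api_index_items($index, $size). The function name and the assumption that the callable returns the number of items it indexed are mine.

```php
<?php

/**
 * Indexes up to $limit items in chunks of $chunk_size.
 *
 * $index_chunk is a stand-in for the real indexing call; it receives the
 * chunk size and is assumed to return the number of items it indexed.
 */
function example_index_in_chunks(callable $index_chunk, int $limit, int $chunk_size = 50): int {
  if ($chunk_size < 1) {
    throw new InvalidArgumentException('Chunk size must be at least 1.');
  }
  $indexed = 0;
  while ($limit > 0) {
    $size = min($chunk_size, $limit);
    try {
      $count = (int) $index_chunk($size);
    }
    catch (Exception $e) {
      // Stop at the first failing chunk and keep what was indexed so far.
      error_log('Indexing stopped: ' . $e->getMessage());
      break;
    }
    if ($count < 1) {
      // Nothing was indexed, so looping again would only spin.
      break;
    }
    $indexed += $count;
    $limit -= $size;
  }
  return $indexed;
}
```

Stopping when a chunk indexes nothing also covers the case where $total_unindexed_items was an overestimate.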

drewish’s picture

Here's what I've been using. Basically it follows migrate's trick of spawning new drush instances in child processes. I'd tried doing some initial indexing in the current process then in the children like migrate does but ran into SQL locking issues that disappeared when I switched over to this model.

It's a little brittle in that if you don't give it a limit it'll just call itself with a limit. I'm trying to convince myself that there are no edge cases that result in infinite recursion, but I haven't managed to yet.


/**
 * Index items.
 */
function drush_search_api_index($index_id = NULL, $limit = NULL) {
  if (search_api_drush_static(__FUNCTION__)) {
    return;
  }
  $indexes = search_api_drush_get_index($index_id);
  if (empty($indexes)) {
    return;
  }
  $limit_string = $limit;
  foreach ($indexes as $index) {
    $status = search_api_index_status($index);
    if (empty($status['total'])) {
      drush_print(dt('!index is empty, skipping it.', array('!index' => $index->name)));
    }
    elseif ($status['indexed'] >= $status['total']) {
      drush_print(dt('!index is fully indexed, skipping it.', array('!index' => $index->name)));
    }
    elseif (!is_numeric($limit) || $limit < 1) {
      $remaining = $status['total'] - $status['indexed'];
      $batch_size = $index->options['cron_limit'];
      $runs = ceil($remaining / $batch_size);
      drush_print(dt('Indexing all !remaining remaining items in !index in !runs runs of !batch_size', array('!index' => $index->name, '!remaining' => $remaining, '!runs' => $runs, '!batch_size' => $batch_size)));
      for ($i = 0; $i < $runs; $i++) {
        // Spawn new drush instances to avoid the memory limit.
        $result = drush_backend_invoke('search-api-index', array($index->machine_name, $batch_size));
        // TODO add some error checking...
        drush_print($result["error_status"]);
      }
    }
    else {
      $limit = empty($limit) ? $index->options['cron_limit'] : $limit;
      drush_print(dt('Submitting up to !limit items in the !index index.', array('!index' => $index->name, '!limit' => $limit)));
      search_api_index_items($index, $limit);
    }
  }
}
drewish’s picture

Status: Active » Needs work

I guess I should add that I didn't like the existing behavior of giving no feedback when there was nothing left to index, so I started checking the index status and displaying some output on what it would or would not be doing.

drewish’s picture

Status: Needs work » Needs review
StatusFileSize
new2.14 KB

The messages could be improved but it's been working very well for me.

ethnovode’s picture

Thanks for your patch.
drush_backend_invoke() works only with Drush 4 and should be replaced with drush_invoke_process(), which works for both Drush 4 and Drush 5.

agentrickard’s picture

Status: Needs review » Needs work
domidc’s picture

Status: Needs work » Needs review

#8: search_api_1137734.patch queued for re-testing.

domidc’s picture

Can we also make it concurrent?

Status: Needs review » Needs work

The last submitted patch, search_api_1137734.patch, failed testing.

domidc’s picture

Here is a trick that you can use to solve the memory limit issue and to index concurrently without patching. http://dominiquedecooman.com/blog/drupal-7-tip-concurrent-indexing-searc...

colan’s picture

Issue summary: View changes

Get rid of the trailing "-dr" and that link will actually work. ;)

In other news, I'm going to see if I can re-roll #8.

colan’s picture

I think the latest code is already doing things concurrently as it's using drush_backend_batch_process().

Process a Drupal batch by spawning multiple Drush processes.

This function will include the correct batch engine for the current major version of Drupal, and will make use of the drush_backend_invoke system to spawn multiple worker threads to handle the processing of the current batch, while keeping track of available memory.

The batch system will process as many batch sets as possible until the entire batch has been completed or half of the available memory has been used.

This function is a drop in replacement for the existing batch_process() function of Drupal.

The emphasis there is mine, as that's the limit I've been running into. With drush --verbose --debug I've been seeing a message stating that 50% of memory was hit. So that must have been killing it.

My workaround is to double my memory_limit in drush.ini from 256M to 512M.

But yes, I agree that we should optimize in code as much as possible. So if anyone has any improvement ideas, let's hear them.

donquixote’s picture

This happened for me with Migrate API. After a successful migration of e.g. 1000 entities (using --limit=1000), I got an out-of-memory error with the following stack trace:

#0  SearchApiSolrDocument->addField()
#1  SearchApiSolrService->addIndexField()
#2  SearchApiSolrService->indexItems()
#3  SearchApiServer->indexItems()
#4  SearchApiIndex->index()
#5  search_api_index_specific_items()
#6  _search_api_index_queued_items()
#7  call_user_func_array()
#8  _drupal_shutdown_function()

I think the interesting place to look is SearchApiSolrService::indexItems().
This function tries to process all items at once.
Instead, it should process e.g. 100 items at a time.

EDIT: I am no longer convinced this is a memory issue in search_api in my case. Some debugging showed it was likely caused by something else. I will say more if new information comes up.

ndobromirov’s picture

After some fun in the last 2-3 days and this in the bag, I've landed here.

The indexing process is completely fine by itself; the problem is with what is expected from it when there are MANY things to index...

In my case the Drupal\search_api\Utility\PostRequestIndexing had aggregated the whole list of entities I'd processed in a queue with drush. Once the queue finished, Search API started to index that list (at the end of the Drush process).

Some questions:
1. Should it trigger indexing in that manner in Drush context?
2. Why use a memory queue for maintaining the whole list? At some point it will exceed any memory limit, even if it's not doing the bulk loading and indexing.
3. Is it possible to have a hybrid solution that keeps a small list of items and, once it gets bigger, flushes that to a persistent queue? Every item in the persistent queue would be properly sized, avoid out-of-memory errors, and be correctly processed when needed.
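The hybrid idea from question 3 could be sketched roughly as follows. This is an illustration only, not how PostRequestIndexing works: the class name is hypothetical, and the persistent queue is represented by a callable that receives one chunk of item IDs at a time (in Drupal it might create one queue item per chunk).

```php
<?php

/**
 * Keeps a small in-memory list and flushes it to persistent storage in chunks.
 */
class ExampleBufferedItemQueue {

  /** @var array Item IDs as keys, so duplicates collapse. */
  private $buffer = [];

  /** @var int Maximum number of IDs held in memory. */
  private $maxBuffered;

  /** @var callable Receives an array of item IDs to persist. */
  private $persist;

  public function __construct(callable $persist, int $max_buffered = 100) {
    $this->persist = $persist;
    $this->maxBuffered = max(1, $max_buffered);
  }

  /**
   * Remembers an item and flushes once the buffer is full.
   */
  public function add(string $item_id): void {
    $this->buffer[$item_id] = TRUE;
    if (count($this->buffer) >= $this->maxBuffered) {
      $this->flush();
    }
  }

  /**
   * Hands the buffered IDs to the persistent queue and empties the buffer.
   */
  public function flush(): void {
    if ($this->buffer) {
      // PHP turns numeric string keys into integers, so cast them back.
      ($this->persist)(array_map('strval', array_keys($this->buffer)));
      $this->buffer = [];
    }
  }

}
```

Memory use is then bounded by the buffer size rather than by the total number of entities touched, and a final flush() at the end of the process picks up the remainder.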

Memory (ab)usage also comes from many unexpected static caches in different places. Some examples:
- Content entity cache
- Path aliases cache
- Maybe others that I was not able to find this time.

@donquixote - for me the workaround was to have the indexing happen as part of the migration, so that it doesn't pile up at the end of the request. Calling the following snippet once after every processed item greatly lessened the memory strain on the system.


# ... Do my heavy operation / processing / import / etc...
$entity->save();

# Force index of the entity that was just saved...
\Drupal::service('search_api.post_request_indexing')->destruct();

# Force static cache clear on the entity we have just worked on.
$this->entityStorage->resetCache([$entity->id()]);

Note that this was on D8.

drunken monkey’s picture

@ndobromirov: In D8, you can simply call \Drupal\search_api\IndexInterface::startBatchTracking() before doing lots of entity operations to avoid the indexing at the end of the page request. (Or temporarily disable the index_directly option on the index – that’s possible in both D7 and D8.)