When importing a very large file in a batch, at save time (after the first 50 elements, with the default limit value) the FeedsSource::save() method, at line 202, saves the source implementation separately.

I have not done any profiling yet (I will this afternoon), but I think the source is trying to save either the file buffer or a lot of useless objects, which fatally exceeds the PHP memory limit.

A good thing would be to unset the file buffer from memory before saving, which would force the FeedsFileFetcher to read it again when loading the current state from the database.
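A minimal sketch of that idea (hypothetical: it assumes the buffer is held on the batch object and that the fetcher can re-read the file from its stored path; setRaw() is an assumed setter, not an existing Feeds API):

    // Hypothetical: drop the in-memory file buffer before the batch is
    // serialized; FeedsFileFetcher would re-read the file from its path on
    // the next batch run.
    public function save() {
      if (isset($this->batch)) {
        $this->batch->setRaw(NULL);
      }
      // ... existing drupal_write_record('feeds_source', ...) logic ...
    }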

Another cause may be the current batch items, which represent a large amount of memory (a parsed RSS feed of, in my case, roughly 1,500 items). When saving them to the database, at some point the full object tree is copied in memory and PHP breaks under too many memory allocations. Saving an offset rather than the full items tree could be far less memory hungry, at a minor performance cost: the file would need to be re-parsed at each batch run, with already-parsed items skipped until the stored offset is reached.

I am not sure, but I think the causes described above are likely candidates.
I might be totally wrong; I'll do some code introspection and review to find the real cause, but I think I'm not far from the answer.

What's annoying about this bug is that FeedsNodeProcessor works very well and the batch is limited to a low enough number of elements, so all items are saved without any problem; it's quite irritating that everything goes mad at context save time :)

Comments

pounard’s picture

Looking a bit more at your code, I can assure you that 1,500 feed items stored in a serialized batch table database row is really too much. Item fetching should be incremental.

alex_b’s picture

Version: 6.x-1.0-alpha13 » 6.x-1.x-dev

There's an issue for using batchAPI also on the parsing stage here: #744660: Expand batch support to fetchers and parsers.

Did you do any measurements on how much memory is being consumed by save()?

Looking at the code, I see that save() builds an array and passes it to drupal_write_record() for saving.

    // NOTE: this builds an array; passing it to drupal_write_record() may
    // trigger a copy.
    $object = array(
      'id' => $this->id,
      'feed_nid' => $this->feed_nid,
      'config' => $config,
      'source' => $source,
      'batch' => isset($this->batch) ? $this->batch : FALSE, // This should not cause a copy in memory. Does it?
    );
    // Make sure a source record is present at all time, try to update first,
    // then insert.
    drupal_write_record('feeds_source', $object, array('id', 'feed_nid'));

There may be other points in the code where arguments are passed by value rather than by reference... Setting this issue to HEAD as it is not limited to alpha 13.

pounard’s picture

This is quite hard to profile, but if you want numbers: I'm parsing a 50 MB file containing 1,500 items, and a 512 MB PHP memory limit is not enough at save time (I'm fairly sure less than 30% of it is consumed while parsing the nodes).

Indeed, using references as much as possible could improve memory consumption a lot!

The problem is that the batch stores both the raw data and the items; one of them should be dropped, ideally both. The batch should be able to know its current offset and ask the parser to parse when its item array is empty.
I'm currently trying to write a patch that way, but the current Feeds design makes the task quite difficult because the batch implementation does not know anything about the parser.

pounard’s picture

I made this modification to avoid (at least) one array copy.

    // Using stdClass and not an array here will avoid copying the array when passing it to drupal_write_record().
    $object = new stdClass();
    $object->id = $this->id;
    $object->feed_nid = $this->feed_nid;
    $object->config = &$config;
    $object->source = &$source;
    $object->batch = isset($this->batch) ? $this->batch : FALSE;
    // Make sure a source record is present at all time, try to update first,
    // then insert.
    drupal_write_record('feeds_source', $object, array('id', 'feed_nid'));

This is not enough; here is the PHP trace when it fails:

[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP Fatal error:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 134456643 bytes) in /var/www/test.triage/www/includes/database.mysql.inc on line 321, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP Stack trace:, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   1. {main}() /var/www/test.triage/www/index.php:0, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   2. menu_execute_active_handler() /var/www/test.triage/www/index.php:18, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   3. call_user_func_array() /var/www/test.triage/www/includes/menu.inc:348, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   4. system_batch_page() /var/www/test.triage/www/includes/menu.inc:0, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   5. _batch_page() /var/www/test.triage/www/modules/system/system.admin.inc:1808, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   6. _batch_do() /var/www/test.triage/www/includes/batch.inc:34, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   7. _batch_process() /var/www/test.triage/www/includes/batch.inc:106, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   8. call_user_func_array() /var/www/test.triage/www/includes/batch.inc:189, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP   9. feeds_batch() /var/www/test.triage/www/includes/batch.inc:0, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  10. FeedsSource->import() /var/www/drupal-installation-profiles/testing/modules/contrib/feeds/feeds.module:436, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  11. FeedsSource->save() /var/www/drupal-installation-profiles/testing/modules/contrib/feeds/includes/FeedsSource.inc:143, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  12. drupal_write_record() /var/www/drupal-installation-profiles/testing/modules/contrib/feeds/includes/FeedsSource.inc:202, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  13. db_query() /var/www/test.triage/www/includes/common.inc:3477, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  14. preg_replace_callback() /var/www/test.triage/www/includes/database.mysql-common.inc:41, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  15. _db_query_callback() /var/www/test.triage/www/includes/database.inc:0, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  16. db_escape_string() /var/www/test.triage/www/includes/database.inc:225, referer: http://test.triage.guinevere/batch?op=start&id=17
[Wed Apr 07 16:39:58 2010] [error] [client 127.0.0.1] PHP  17. mysql_real_escape_string() /var/www/test.triage/www/includes/database.mysql.inc:321, referer: http://test.triage.guinevere/batch?op=start&id=17

As you can see, it tries to allocate about 130 MB of memory, which is, I must say, really a lot! Considering that at least 50 MB are consumed by loading the file (the buffer remains in the batch implementation's $raw property), and at least twice that size again by storing the full item list array, with a 50 MB file drupal_write_record() tries to store about 50 + 100 = 150 MB of data in a single row!

This is defective by design: parsers and the batch implementation should really work incrementally, and the batch table should never store any of this data, only the small amount necessary to continue the work (things like the source file path and the current offset).
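To illustrate, the state that actually needs persisting between batch runs could be as small as this (hypothetical field names, not the current Feeds schema):

    // Hypothetical minimal batch state: just enough to resume, nothing more.
    $state = array(
      'file_path' => $this->file_path, // Where to re-fetch / re-parse from.
      'offset'    => $this->offset,    // Number of items already processed.
      'total'     => $this->total,     // Kept only for progress reporting.
    );

Serializing a few scalars like these is cheap regardless of feed size, unlike serializing the raw buffer and the full item array.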

alex_b’s picture

This is quite hard to profile

No luck with memory_get_usage() at the beginning and end of save().

Indeed, using references as much as possible could improve memory consumption a lot!

Have you tried it?

The problem is that the batch stores both the raw data and the items.

The raw data should not be stored in most cases (actually only if you're using pubsub). What fetcher and parser are you using? I could imagine a resetRaw() or similar method on a batch object that a parser could call after parsing is done.
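Such a resetRaw() method does not exist at this point; a minimal sketch of the idea (assuming the raw buffer lives in a $raw property on the batch object):

    // Hypothetical addition to FeedsImportBatch: a parser calls this once all
    // items have been extracted, so the raw buffer is not serialized along
    // with the rest of the batch state.
    public function resetRaw() {
      $this->raw = '';
    }

A parser would then call $batch->resetRaw() at the end of its parse() implementation, once it no longer needs the raw input.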

the batch should be able to know its current offset and ask the parser to parse when its item array is empty

This is clearly in the problem space of #744660: Expand batch support to fetchers and parsers, but let me respond to it as it came up here: I think the way to address large feeds is not to make batch objects aware of fetchers or parsers, but to support batching at the parsing stage:

A Fetch
B Parse, store return value
C Process until processor returns FEEDS_BATCH_COMPLETE
D Parser returned FEEDS_BATCH_COMPLETE in B? NO: go back to B. YES: DONE. 

This would be a modification to FeedsSource::import() and to parser plugins.
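A rough sketch of what such a loop in FeedsSource::import() could look like (hypothetical, not the current implementation; it assumes parsers and processors both signal completion by returning FEEDS_BATCH_COMPLETE as described above):

    // Hypothetical sketch of the A-D loop described above.
    public function import() {
      // A: Fetch.
      $batch = $this->importer->fetcher->fetch($this);
      do {
        // B: Parse the next chunk of items into the batch.
        $parser_state = $this->importer->parser->parse($batch, $this);
        // C: Process until the processor reports the chunk is done.
        do {
          $processor_state = $this->importer->processor->process($batch, $this);
        } while ($processor_state != FEEDS_BATCH_COMPLETE);
        // D: Loop back to B until the parser itself reports completion.
      } while ($parser_state != FEEDS_BATCH_COMPLETE);
      $this->save();
    }

The key property is that only one chunk of parsed items is ever held in memory at a time, so the batch state stays small no matter how large the feed is.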

pounard’s picture

I'm thinking about solutions, and the current design does not lend itself to incremental parsing.

One solution could be to store the items in the FeedsSource implementation rather than in the FeedsImportBatch implementation. This would leave FeedsImportBatch as a simple value object containing only essential information such as the current offset and the total number of items.

Another way would be to set the current parser as a property of the batch and, if the parser supports incremental work (perhaps via an interface or a FeedsParser specialization such as FeedsIncrementalParser), lazily load the items when the processor calls the shiftItem() method.

Something like this:

  protected $offset = 0;
  protected $limit;

  /**
   * Set the parser for incremental parsing.
   */
  public function setParser(FeedsParser $parser) {
    $this->parser = $parser;
    // Property declarations cannot call functions, so set the limit here.
    $this->limit = variable_get('feeds_batch_limit', FEEDS_BATCH_DEFAULT_LIMIT);
  }

  /**
   * @return
   *   Next available item or NULL if there is none. Every returned item is
   *   removed from the internal array.
   */
  public function shiftItem() {
    if (empty($this->items) && isset($this->parser) && $this->parser instanceof FeedsIncrementalParser) {
      // Lazily parse the next range of items.
      $this->parser->parseRange($this, $this->offset, $this->limit);
    }
    // If the incremental parse did not fetch any data, array_shift() on an
    // empty array returns NULL as well.
    $this->offset++;
    return array_shift($this->items);
  }

Adding this to the FeedsImportBatch implementation should do the trick; a "reduced" serialized version could then be stored in the database.

EDIT: This kind of implementation would be the least intrusive and would leave the existing non-incremental implementations working without further modification (except in the FeedsSource::import() method).

pounard’s picture

Status: Active » Closed (duplicate)