I'm hoping someone familiar with Pathauto's code base can help me diagnose a problem. I'm working on a module which imports real estate listings and creates nodes for them, all via Batch API. One of our clients kept exhausting PHP's memory limit while this was happening, even after I put safeguards in place such as holding no more than 250 listings in memory at once (a run can fetch over 8,000 listings). Long story short, after some experimentation, I narrowed it down to the presence of the Pathauto module. After replicating their site and logging in, I can take a database dump, then do a property listing fetch and trigger the out-of-memory error - watching top in a terminal, I can see the RAM usage for the PHP process climb and climb until it dies. However, if I then restore the database from that dump, disable Pathauto, and do a run, memory usage plateaus at around 60 MB and the listings import successfully. Nothing is different besides Pathauto's enabled state.

The fact that this is happening during a Batch API job is itself a bit puzzling. I suppose that means that whatever is leaking memory is either somehow maintaining it across "breaks" in the job, or is doing it so quickly that it's happening before a break can occur.

I'm not sure if Pathauto itself is to blame, or if it's something further down in the Drupal core… I'm hoping someone will have some insight as to how this problem can be avoided, as obviously Pathauto is a pretty handy module, and "just disable Pathauto" is not a solution the client is going to want to hear.

(My own code is calling node_load() with TRUE as the third parameter in order to clear its cache on each run of the node creation loop, so the problem probably doesn't lie there.)


Comments

greggles’s picture

I believe this is related to static caching in a taxonomy function which cannot be cleared externally the way that the node_load cache can.

I think the only solutions are to 1) disable Pathauto before the runs and generate the aliases later, or 2) run smaller batches.

If this memory leak is from taxonomy terms and if you do not use pathauto for terms then you should look to other issues in the queue that want to make it so that pathauto doesn't do work in this case: #572604: If there is no taxonomy pattern, don't do any work.

Garrett Albright’s picture

I don't think it's a taxonomy-related issue. If I disable the Taxonomy module, the problem still occurs the same way.

Hmm.

greggles’s picture

It could also be the token module which has a static cache. I think that Pathauto tries to clear that out, but it might not.

Really, at this point you'll either need to fire up a code profiler, or just disable all non-essential modules and see if disabling one of them fixes it; if not, dig into the remaining modules.

sspraggs’s picture

Hello,

I have been dealing with the exact same issue since yesterday and have narrowed it down somewhat. By commenting out various lines of code and using memory_get_usage(), I was able to determine that the memory leak subsided by about 20 KB per iteration when I commented out this line (line 336 of content.module):
$node->content = array_merge((array) $node->content, $additions);

Recognizing that that line of code is benign and not adding to or setting any static or global variables, I worked backwards using "$node->content = array()" to see if I could pinpoint the spot where unsetting $node->content would have an impact.

I ended up in the function token_get_values of token.module. This function is called by pathauto_get_placeholders in pathauto.inc which is called several times by pathauto.module.

I found that if I cleared out the content array before the scope returned to pathauto_get_placeholders, the memory leak went away, but if I cleared it out just after it returned in pathauto_get_placeholders, the memory leak remained.

Looking closer at token_get_values, I noticed that $object = drupal_clone($object) is called on line 272, and therefore the content array is being set for the local version of $object, not the one passed by reference to token_get_values by pathauto_get_placeholders.

I then tried an 'unset ($object)' at the bottom of token_get_values, but that still left the memory leak. The only way to reduce the memory leak was to set $object->content = array(), or explicitly set it to NULL before returning from token_get_values.

This leads me to theorize that the garbage collection of PHP is not cleaning up the member data of $object when it goes out of scope.

A quick search on Google turned up this article: http://paul-m-jones.com/?p=262, which says:
If you have two objects in circular reference, such as in a parent-child relationship, calling unset() on the parent object will not free the memory used for the parent reference in the child object. (Nor will the memory be freed when the parent object is garbage-collected.)

This is indeed the case with this memory leak. If you do a print_r of $object, you'll find that there is a recursive link back to the node object which is likely preventing the whole object from being cleaned up by PHP's garbage collector.
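The situation described above can be reproduced outside of Drupal entirely. This is a minimal standalone sketch of a parent/child circular reference; on PHP 5.3 and later the cycle collector can reclaim such pairs (which is why gc_collect_cycles() reports work done), but on 5.2 and earlier the memory is simply lost until the process exits:

```php
<?php
// Build a parent/child pair that reference each other, mimicking the
// recursive link back to the node object described above.
$parent = new stdClass();
$child = new stdClass();
$parent->child = $child;
$child->parent = $parent; // circular reference

// Drop our handles. Neither object's refcount reaches zero, because
// the two objects still point at each other, so plain refcounting
// cannot free them.
unset($parent, $child);

// PHP 5.3+ ships a cycle collector that can reclaim such pairs and
// reports how much it freed. On PHP <= 5.2 there is no collector,
// and this memory is leaked.
$collected = gc_collect_cycles();
echo "collected: $collected\n";
```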

Hopefully someone out there can corroborate this and recommend a fix.

Garrett Albright’s picture

sspraggs, wonderful sleuthing. If the node object has CCK fields, each of those fields is going to contain a recursion to the node itself (at $node->content['field_{fieldname}']['field']['#node']), and the nodes I'm creating have plenty of CCK fields. If PHP's garbage collection really isn't cleaning those up, they'll definitely add up fast. I'll do some more investigating, and possibly bump this issue over to Token's queue.

greggles, unfortunately, the decent PHP profiling tools for OS X require OS X 10.5, and I'm still on 10.4 here at work. I could take the project home and test there, but, well… bringing work home…

Garrett Albright’s picture

Title: Memory leak when bulk inserting nodes » Provokes PHP memory leak with recursive objects
Project: Pathauto » Token
Version: 6.x-1.2 » 6.x-1.12
Component: Miscellaneous » Code
Category: support » bug
FileSize
878 bytes

$object = NULL did not stop the leak on my end, but $object->content = array() did. However, there's no way of knowing offhand that $object will be a node (and therefore contain a content member), or that it won't contain recursion elsewhere. I found that a general-purpose solution is to iterate over the object and unset anything that might possibly contain a recursion back to the base object - that is, any arrays and objects.

  // PHP 4 can't iterate over objects; thus, the array typecasting trickery.
  foreach (array_keys((array) $object) as $key) {
    if (is_array($object->{$key}) || is_object($object->{$key})) {
      unset($object->{$key});
    }
  }

This will probably not stop memory leakage if a child object of some level is recurring - for example, if $object->foo is an object, and $object->foo->bar is a reference to $object->foo. But I imagine that doesn't happen very often (possibly Node Reference CCK fields might work this way…?).
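For illustration, the loop above can be wrapped in a small standalone helper and run without Drupal. The function name `_token_unset_recursions()` is hypothetical (the actual patch inlines the loop inside token_get_values()); since PHP 5 passes objects by handle, unsetting properties inside the function affects the caller's object:

```php
<?php
/**
 * Unset any array or object members of $object, since those are the only
 * places a recursive reference back to $object can hide. Scalars survive.
 * Illustrative sketch only; the real patch inlines this loop.
 */
function _token_unset_recursions($object) {
  // PHP 4 can't iterate over objects directly; cast to array for the keys.
  foreach (array_keys((array) $object) as $key) {
    if (is_array($object->{$key}) || is_object($object->{$key})) {
      unset($object->{$key});
    }
  }
}

$node = new stdClass();
$node->nid = 42;
$node->title = 'Example';
$node->content = array('#node' => $node); // circular reference, CCK-style

_token_unset_recursions($node);
var_dump(isset($node->content)); // the cycle-bearing member is gone
var_dump($node->nid);            // scalar members are untouched
```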

Moving this issue to the Token queue and attaching a patch. Thanks again, sspraggs.

sspraggs’s picture

Thanks for the reply and the patch. I'm glad someone else has independently replicated and confirmed that destroying $object->content does indeed free up memory when destroying $object on its own does not. The memory footprint of my import still grows over its lifetime, but significantly more slowly than before. Still, it may become a scalability issue down the road.

There are likely other parts of Drupal core that suffer from this problem. Hopefully D7 developers will take note.

Garrett Albright’s picture

Hopefully PHP's developers will take note… I think its "here, let me just take care of that for you" approach to how it handles objects can cause more problems than it solves - particularly esoteric, difficult-to-find, edge-casey problems like this one. I hate it when computers try to be smarter than me and end up doing something stupid instead. Three jeers for PHP's object handling.

andreiashu’s picture

Wow... I banged my head around this for quite a while. I was batch saving/updating nodes that had CCK nodereference fields.
The patch in #6 seems to do the trick - thanks guys!

greggles’s picture

Status: Active » Needs review

Patches need review.

dhthwy’s picture

I just posted a CCK issue http://drupal.org/node/736440 , which I believe is the same problem described here.

Note that I only use Text and Number CCK fields, so no Node Reference fields. I use a limited set of modules so no Pathauto, or Token yet.

If you run the PHP test script provided at that link (put it in your Drupal root directory), you should be able to see whether Drupal is leaking memory. You might have to change the range of node IDs it iterates through in the for loop.

This issue seems to be impacting some widely used modules, and after grepping through Drupal's core code, it looks as though it's the Drupal way of doing things. It is definitely something that developers and users need to be aware of when batch-processing many nodes. And as I stated in the link I provided above, this makes node_load()'s reset-cache parameter useless on rendered nodes implanted with circular references by popular modules such as CCK, and I'm sure it's happening with many other modules.

Status: Needs review » Needs work

The last submitted patch, recursive-object-leak.patch, failed testing.

Dave Reid’s picture

I think we just need to get rid of this cloning as it really doesn't do any good. We don't modify $object anywhere in token.module and I think it's only there for protection against modification? If that's the case, we shouldn't be babysitting broken code.

D7 core's token implementation doesn't clone and technically token implementations could do whatever they want:

/**
 * Implements hook_tokens().
 */
function example_tokens($type, $tokens, array $data = array(), array $options = array()) {
  // Node tokens.
  if ($type == 'node' && !empty($data['node'])) {
    $node = $data['node'];
    $node->title = 'HAHA I CHANGED IT!';
  }
  return array();
}

$node = node_load(1);
echo "{$node->title}\n";
token_replace('[node:title]', array('node' => $node));
echo "{$node->title}\n";

Results in:

Iaceo Lenis Praesent Praemitto (the actual node title)
HAHA I CHANGED IT!
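The mutation shown above can be reproduced without Drupal at all. Since PHP 5 passes objects by handle, a callee that writes to a property changes the caller's object unless someone clones first; this sketch (hypothetical function names) shows the trade-off the clone in token_get_values() was guarding against:

```php
<?php
function badly_behaved_hook($node) {
  // PHP 5 passes objects by handle, so this clobbers the caller's object.
  $node->title = 'HAHA I CHANGED IT!';
}

function well_protected_call($node) {
  // Cloning first shields the caller, at the cost of duplicating the
  // object (and, pre-PHP 5.3, potentially leaking the copy if it
  // contains circular references).
  badly_behaved_hook(clone $node);
}

$node = new stdClass();
$node->title = 'Original title';

well_protected_call($node);
echo $node->title . "\n"; // still 'Original title'

badly_behaved_hook($node);
echo $node->title . "\n"; // now 'HAHA I CHANGED IT!'
```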
Dave Reid’s picture

Version: 6.x-1.12 » 6.x-1.x-dev
Status: Needs work » Needs review
Dave Reid’s picture

Assigned: Unassigned » Dave Reid

If there's no objection, I'll commit this patch tomorrow.

greggles’s picture

Status: Needs review » Needs work

A little bit of sleuthing with cvs annotate led me to #149222: content_view() breaks hook_nodeapi $op == 'view'.

Is there no other way to unset the object to save on RAM?

Dave Reid’s picture

Should we have CCK fix it then (since CCK tokens are no longer provided by token.module)? And document that the object is passed by reference by default, so if you're doing wacky stuff, clone it?

dhthwy’s picture

I think it should be fixed in CCK. Cloning the object fixed the memory leak for me.

greggles’s picture

If the token hook implementations do the cloning does that actually help reduce RAM usage?

What if we have 2 or 3 modules all doing their own clone? Couldn't that make the problem even worse?

Dave Reid’s picture

If modules like CCK need to clone, they should just be able to do an unset() on the node, at least with PHP 5.2 it handles it correctly.

$node = node_load(100); // Insert a valid node ID here.
echo "BEFORE CLONE:\t" . memory_get_usage() . "\n";
$clone = drupal_clone($node);
echo "AFTER CLONE:\t" . memory_get_usage() . "\n";
unset($clone);
echo "AFTER UNSET:\t" . memory_get_usage() . "\n";

Results in:

BEFORE CLONE:	3696400
AFTER CLONE:	3701832
AFTER UNSET:	3696400
Dave Reid’s picture

Gah: apparently that doesn't actually solve it after all (per a link above that no longer works). Hrm. :/

greggles’s picture

Yeah, if that worked, token could do that. And there is the (annoying) fact that Drupal 6.x has to work well with PHP 5.1 and below...

We could of course just say "known issue on PHP below 5.x: memory leak"

Dave Reid’s picture

I think it's worth removing the clone & documenting the leak, and filing a CCK issue to add their own unsets/memory hack. We're going to run into this same problem with Drupal 7 eventually, so let's just get people doing the right things now instead of attempting to mask it in Drupal 6.

dhthwy’s picture

This happens when you load and render nodes that use CCK fields. A module I recently used ran a batch that had to render each node; I couldn't do so without my PHP process running out of RAM _very quickly_ until I cloned the objects in instances where '#node' = $node.

This is easily fixed by cloning objects, so the current behavior isn't acceptable IMHO.

<?php
require 'includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Load and render a range of nodes, printing memory usage (in MB) as we go.
for ($nid = 1; $nid < 2000; $nid++) {
  if (nl($nid)) {
    print memory_get_usage() / 1024 / 1024 . "\n";
    usleep(50000);
  }
}

function nl($nid) {
  // Reset node_load()'s static cache on every call.
  $node = node_load($nid, NULL, TRUE);

  if ($node) {
    $node = node_build_content($node, FALSE, FALSE);
    $node->body = drupal_render($node->content);
    return 1;
  }

  return 0;
}
?>
dhthwy’s picture

and btw this behavior I'm referring to applies to PHP5, I haven't tested PHP4.

Edit: let me elaborate on this.

When CCKified nodes are rendered in Drupal, CCK creates circular references by setting '#node' = $node in the node object. PHP 5 can't free objects with circular references (it's a well-known problem), so the memory never gets freed and PHP ends up eating RAM until it croaks. It won't happen if you just do a node_load(). If you run the script I provided you should see what I mean: in the script, the memory _should_ be getting freed after the function nl() ends. A known workaround is to clone the object instead of pointing a reference to itself within.

Edit #2: unset($node) won't work either when circular references are present. You have to remove the reference inside the node first.
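Both workarounds above (breaking the inner reference, or cloning before storing the self-reference) can be sketched without Drupal. This is a standalone mock of the CCK render structure, with a hypothetical build_node() helper; the echoed counts show whether the cycle collector (PHP 5.3+) had to step in:

```php
<?php
// Mimic CCK's render structure: the content array holds a handle back
// to the node itself ('#node' => $node), creating a cycle.
function build_node($use_clone = false) {
  $node = new stdClass();
  $node->nid = 1;
  // Cloning *before* the self-reference is stored avoids the cycle:
  // the clone carries no pointer back to $node.
  $self = $use_clone ? clone $node : $node;
  $node->content = array('field_example' => array('field' => array('#node' => $self)));
  return $node;
}

// Cyclic version: unset() alone can't free it by refcounting.
$node = build_node();
unset($node);
echo 'cycles collected: ' . gc_collect_cycles() . "\n"; // non-zero on PHP 5.3+; leaked on <= 5.2

// Clone version: no cycle, so plain refcounting frees everything
// immediately, even on PHP <= 5.2 where there is no cycle collector.
$node = build_node(true);
unset($node);
echo 'cycles collected: ' . gc_collect_cycles() . "\n"; // zero: nothing cyclic was left behind
```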

greggles’s picture

I agree we're likely to run into this more in the future, which is why I think we should fix it once in Token rather than fixing it 20 times in contribs.

dhthwy’s picture

You don't need to have the Token module enabled. It isn't Token specific. Try disabling Token and keep CCK and run my script. If we're talking about the same problem, sounds similar enough :)

crea’s picture

Subscribing

ISPTraderChris’s picture

Subscribing

ayalon’s picture

I had to import several thousand nodes and experienced exactly the same memory-leak problem. Applying the patch in #6 completely solved the memory issue.

wheelercreek’s picture

Thanks for this one. I placed this snippet into my page.tpl.php file and it solved the issue for me:

if (!empty($node)) {
  $node = node_load($node->nid, NULL, TRUE);
}
Dave Reid’s picture

Assigned: Dave Reid » Unassigned
Sylvain_G’s picture

Subscribe

roderik’s picture

@Dave Reid/#17/#24: just my 2 cents:
agreed, CCK should fix the token code. (And then this issue can continue with the state as of #15/#17, and warn people to just keep their hands off the node object.)

Incidentally, yched is right now leaning toward fixing CCK himself, in http://drupal.org/node/736440#comment-3742788

One argument for doing this:
This drupal_clone() call in token_get_values() is a nice stopgap for some/most cases, but it doesn't fix everything.
For example: code in other hook_token_values() implementations. If such an implementation happens to get called after content_token_values(), the $node object is in an unpredictable state.
(One thing suffering from this is notifications_content 4.x, in #736328: render the node more completely for [node-body] to better handle cck fields (and others))

Just wanted to throw that argument into the mix.

msprunck’s picture

Subscribe

roderik’s picture

Status: Needs work » Needs review

Fix has just landed in CCK 6.x-dev.

Back to #13 needs review "we shouldn't be babysitting broken code"?

With the note that this should not hit a stable release before another stable CCK release is out, or issue queues will be impacted even harder than they were by #845250: Duplicate [node-url] token definition with token module.
(And it will probably need a 'conflicts with CCK <= 6.x-2.8' message on the front page)

Dave Reid’s picture

Is there a schedule for the next CCK release? #932680: content_token_values() does not pass along $options to the called hook_token_values() is also another really important CCK token-related patch that must get in.

Dave Reid’s picture

kaakuu’s picture

Subscribe

roderik’s picture

<3 yched. He didn't reply here but did get the message forwarded and was responsive on this & other matters.
(quote) "I wasn't planning for a release soon, but syncing with token / pathauto sounds like a reason in itself."

New CCK release is out since yesterday.

Dave Reid’s picture

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

jwilson3’s picture

This segmentation fault / memory issue related to the Pathauto / Token modules is rearing its head again in Drupal 8, while migrating about 500 taxonomy terms from a CSV file (I used this post as a guide).

If I disable Pathauto before running migration, no seg fault. Similarly, if I do a bulk update of Pathauto alias patterns through the Drupal UI after data import, Batch API gives me an Ajax error and cannot complete more than about 200 aliases before falling over.

Might there already be D8 issue for this? I couldn't find it.