I've been working on a site with a large number of nodes, and I've come across something that doesn't quite make sense - every time node_load() loads a node, the node is stored in a static variable for the rest of the request.

This makes sense when you are working with nodes which are likely to be loaded again within the same request. However, occasionally we run updates where we need to load a very large number of nodes, and we end up storing thousands of nodes in memory for the duration of the request.

I would like to see a parameter added to node_load() which will prevent the node from being added to the static cache when loaded. This way, if we are operating on a node we are unlikely to touch again, we can avoid using the memory to store it.
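As a rough, hypothetical sketch of what I have in mind (not the patch itself): the $cache parameter is the addition, everything else mirrors the existing D6 node_load() signature, and the body is heavily abbreviated:

```php
// Hypothetical: $cache = FALSE loads the node normally but skips the static cache.
function node_load($param = array(), $revision = NULL, $reset = NULL, $cache = TRUE) {
  static $nodes = array();
  // ... existing cache lookup and DB loading logic ...
  if ($cache) {
    $nodes[$node->nid] = is_object($node) ? drupal_clone($node) : $node;
  }
  return $node;
}
```

A batch job could then call node_load($nid, NULL, NULL, FALSE) without evicting anything that other code has legitimately cached.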

I know that you can work around the issue somewhat by setting $reset = TRUE for each call, but I don't like the idea of dumping nodes that may have been legitimately cached by other processes, and which make good sense to cache.

I've added a patch which implements this in D6. However, I am marking the issue for D8 as I assume we would need to implement it in HEAD and backport from there.

Attached file (comment #1): node_load_caching_flag-D6.patch, 1.38 KB, by brianV

Comments

brianV’s picture

D6 patch. I don't understand the DBTNG magic in D7 and D8 well enough yet to patch this there.

brianV’s picture

Issue tags: +Performance, +memory

cweagans’s picture

HEAD is still D7, and your performance patch is still eligible until November 15, IIRC.

I doubt this will go in for D6, but you might take a look at the DBTNG docs here: http://drupal.org/node/310069 --It's pretty easy if you've ever used other frameworks with query builders (CakePHP, Kohana, CodeIgniter, etc.)

dww’s picture

Status: Active » Needs work

@brianV: If you have a patch you want reviewed for consideration into core, you need to set the status of the issue to "needs review"...

Anyway, looking at the patch, I'm not psyched about the two slightly different solutions to the same problem: $reset = TRUE and $cache = FALSE. That seems like a big DX WTF. The $reset flag is exactly for the case you describe: an update that's loading tons of nodes. I'm not convinced that the "I don't like the idea of dumping nodes that may have been legitimately cached by other processes, and which make good sense to cache" concern outweighs the downside of adding a whole new argument for developers to consider when using this API.

Also, definitely no chance of changing the node_load() API in D6. If you're interested in getting this into core, your only chance at this point is D7/HEAD as a performance fix, but I doubt you're going to get much support for this approach. Really, the whole node_load() API (now part of the larger entity_load() API) is pretty set, and the $reset flag is the suggested solution to this problem, and I don't see any major benefit in adding this parallel solution...

axyjo’s picture

Version: 8.x-dev » 7.x-dev

Moving to D7.

brianV’s picture

@dww

First of all, the D6 patch wasn't actually meant for review, nor did I have any thoughts about it actually getting in, although I realize I wasn't particularly clear on that in my initial post. It was there more to illustrate what I wanted to get across about the new feature, in case my explanation didn't hold up well.

Handling the problem with $reset is a very poor solution; a node_load may be called for a given node half a dozen times in a request, and forcing a fresh DB fetch every time because some completely unrelated portion of the codepath is dumping the cache repeatedly is somewhat akin to throwing the baby out with the bathwater.

For that matter, $reset = TRUE doesn't even prevent the current node from being cached. It just dumps all existing nodes in the cache prior to attempting to load the node it was called for, and following that, adds it to the cache.

I don't think $reset and $cache both existing would be a DX WTF. One dumps all the existing nodes in the cache, and the other controls whether the node being loaded should be cached -- two separate and distinct functions. I think using $reset to prevent a node from being cached is a bigger WTF, given the performance issues using it entails.
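To illustrate the distinction, the D6 node_load() behaviour is roughly this (paraphrased from core, not a verbatim copy):

```php
function node_load($param = array(), $revision = NULL, $reset = NULL) {
  static $nodes = array();

  // $reset wipes *every* cached node up front...
  if ($reset) {
    $nodes = array();
  }

  // ... load the node from the DB if it isn't already cached ...

  // ...and the freshly loaded node is still added to the cache afterwards.
  $nodes[$node->nid] = is_object($node) ? drupal_clone($node) : $node;
  return $node;
}
```

So $reset affects every other entry, while the proposed $cache flag would only affect the node currently being loaded.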

Chris Johnson’s picture

I agree with Brian, for what it's worth (I'm used to having just another unpopular opinion in the eyes of many [much more descriptive and colorful metaphor elided to avoid offending some and confusing non-native English speakers]).

I might even go a step further -- $reset is already a DX WTF*: node_load() should be about loading nodes, not dumping cache. It's probably just one of many historic DX WTFs all over Drupal core.

Great coinage, Derek, if that's yours.

*(for our international audience) a "cause of developer confusion and anguish", abbreviated into an American English colloquialism.

tloudon’s picture

my general understanding of things is that:

node_load() can't really work for a large number of nodes -- basically, I think this ends up doing a "SELECT *" for every node in your loop, meaning you are A) transmitting a TON of superfluous data and B) doing N database queries (as opposed to selecting only the needed columns in one query). So my thought is that node_load slowness doesn't really have to do w/ caching.

with that in mind, I think you have a couple of options:
1) use the Views module -- which is basically generating SQL for you
2) write your own SQL query -- especially if you need to "UPDATE"

I'm not that up on OOP or PHP 5 objects, so there might be other very fast ways to do things...with iterators perhaps--but in general, writing my own SQL queries has been the best thing for my performance issues.

side note, I thought cache_clear_all() dumped the cache, whereas $reset meant "don't access the cache, and then cache the new value" (i.e., where does dumping the whole cache come in?)

but maybe I've misunderstood the use case here; could you post a code/pseudo-code sample?

brianV’s picture

@t_l,

I suspect you misunderstood what we are trying to achieve. node_load is necessary in order to call all the proper nodeapi hooks and build the $node object correctly before we perform our own operations on it.

With respect to cache_clear_all() vs. $reset = TRUE, they deal with two completely different systems. cache_clear_all clears a variable cached in one of the cache_* database tables, while $reset=TRUE dumps all nodes saved in the static $nodes array contained in the node_load() function.

As for a code sample, let's assume that we are dealing with data that needs to be merged into Drupal from an external system. On any given run of this function, there could be 10-10,000 nodes to process. This is similar to what we are doing in our module, but very much simplified:


while ($external = external_api_get_next_record()) {
  // Fetch the node id of the node that corresponds to this external record. 
  // Done by joining a few tables and checking a custom field.
  $nid = mymodule_get_nid($external->id);

  // Load the node object. We are counting on some specific nodeapi hooks being called on $op = 'load'.
  $node = node_load($nid);

  // Send to our processing function. $node is passed by reference.
  mymodule_merge_data($node, $external);

  // Save the node. We also depend on some nodeapi save hooks being called here.
  node_save($node);
}


Now, I could call $reset on each node_load, but there are a bunch of nodes loaded in other parts of the request which should be cached, and there is a noticeable performance loss if I dump the $nodes array, and force them to be reloaded from the DB. On the other hand, if I don't dump the $nodes array, on a big loop, we end up with massive memory usage.

The current solution is to utilise a non-caching version of node_load which we call for these instances. This keeps the memory usage down by not caching these nodes we only intend to use once, and it doesn't dump the nodes cached by other parts of the request, avoiding that performance hit as well.
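For anyone curious, a non-caching loader along those lines might look like this in D6. mymodule_node_load_nocache() is a hypothetical name, and the query/hook sequence is a simplified approximation of what node_load() does internally, not our actual code:

```php
function mymodule_node_load_nocache($nid) {
  // Load the base node + current revision row directly, bypassing the static cache.
  $node = db_fetch_object(db_query('SELECT n.*, r.* FROM {node} n INNER JOIN {node_revisions} r ON r.vid = n.vid WHERE n.nid = %d', $nid));
  if ($node && $node->nid) {
    // Fire the same hooks node_load() would, so the $node object is built correctly.
    if ($extra = node_invoke($node, 'load')) {
      foreach ($extra as $key => $value) {
        $node->$key = $value;
      }
    }
    if ($extra = node_invoke_nodeapi($node, 'load')) {
      foreach ($extra as $key => $value) {
        $node->$key = $value;
      }
    }
  }
  return $node;
}
```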

catch’s picture

Just seen this - I'd much rather we did #375494: Restrict the number of nodes held in the node_load() static cache in the entity loader, which would solve this problem generically.
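The generic idea there is to cap the static cache rather than add per-call flags. A minimal sketch of that approach (the constant name and the eviction policy are made up for illustration, not what that issue specifies):

```php
// Inside the loader, before caching a newly loaded node:
define('NODE_LOAD_MAX_CACHED', 64);

if (count($nodes) >= NODE_LOAD_MAX_CACHED) {
  // Evict the oldest entry so memory stays bounded on large batch runs.
  array_shift($nodes);
}
$nodes[$node->nid] = $node;
```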

tloudon’s picture

@brianv

thanks for clearing up my confusion on the RESET parameter.

I'll try to restate my suggestion: it seems to me that you are trying to optimize a costly sequential operation, and in doing so are creating collateral damage in the cache. As an alternative, you might want to increase the "throughput", so to speak. Meaning, would it be possible to use one massive "update" SQL statement instead of 10,000 individual "select"s + 10,000 individual "update"s (or "delete" and "insert" combos)?

I realize you are not asking for alternative solutions, I'm just trying to think of a way you could improve performance without changing the API.
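For instance, something like this hypothetical set-based update ({mymodule_map} is an assumed mapping table, and note that, per the discussion above, this bypasses the nodeapi hooks the original use case needs):

```php
// One set-based UPDATE instead of N node_load()/node_save() round trips.
db_query("UPDATE {node} n
  INNER JOIN {mymodule_map} m ON m.nid = n.nid
  SET n.changed = %d", time());
```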

flaviovs’s picture

Why not move the code that loads the node out of node_load() to, say, node_load_nc() ("no cache" -- couldn't think of a better name for now), and call the latter from the former when needed?

E.g.:


function node_load_nc($param = array(), $revision = NULL) {
  // Handle all module loading stuff, call nodeapi hooks, etc.
  return $node;
}

function node_load($param = array(), $revision = NULL, $reset = NULL) {
  static $nodes = array();

  if ($reset) {
    $nodes = array();
  }

  // Check if the node is cached -- return the cached version if it's there.
  (...)
  if ($cached_node_found) {
    return $cached_node_found;
  }

  // Oops... not in cache -- let's load the node.
  $node = node_load_nc($param, $revision);

  // Now add the node to the cache.
  $nodes[$node->nid] = is_object($node) ? drupal_clone($node) : $node;

  return $node;
}

This way, backward compatibility is completely preserved, and modules wanting to just load (perhaps several) nodes would then call node_load_nc().