Tested on Drupal core 4.7.2 with the pathauto 4.7.0 module.
The node table has a cardinality of approximately 900,000 rows,
the url_alias table approximately 1,000,000,
and the user table approximately 2,000.
A listing query for somesite.com/2006/07/28 takes about 1 minute.
A listing query for somesite.com/2006/07 takes about 5 minutes.
Tested on an Athlon 2500 with 512 MB RAM running CentOS, Apache 2.0.52, PHP 4.3.9 and MySQL 4.1.20.
The machine acts only as a standalone testing web server.

I managed to speed it up a lot with two temporary tables.
Below is a slice from the node_pathauto_page() function:

$tempresult = db_query_temporary("SELECT nid, CONCAT('node/', nid) AS path FROM {node} WHERE status = 1 ORDER BY sticky DESC, created DESC", 'temp_node');
$tempresult = db_query_temporary("SELECT src FROM {url_alias} WHERE dst LIKE '%s%%'", $prefix, 'temp_url_alias');

$query = 'SELECT n.nid FROM {temp_url_alias} a '.
"INNER JOIN {temp_node} n ON a.src = n.path";

There is also an issue with the bulk update of node paths:
a bulk update of node paths for 80,000 nodes takes 33 minutes on the same machine.
The max execution time in php.ini is set to 8 hours and the memory limit to 128 MB.
When tested with 900,000 nodes, execution broke after 8 hours and MySQL was left at 99% CPU usage with locked tables; it stayed like that until a MySQL restart.

Hope these results are helpful to you.
Also, if you find some time, please try to optimize bulk updates.
And if you need more results, I can send you a copy of the "Webserver Stress Tool" report.

steph


Comments

greggles’s picture

This seems like a good idea - I'm curious whether it works under PostgreSQL.

About bulk updates - those are covered in issue: http://drupal.org/node/67665

If you could quantify "speed it up a lot" to a specific amount and also provide a patch, that would be great.

greggles’s picture

Status: Needs work » Active

I'm still interested in this, could you provide an actual patch? I'm not clear on how to implement your ideas.

FiReaNGeL’s picture

If possible, please post the original offending query along with the EXPLAIN output from MySQL. I'm sure there's a way to optimize it. I'll try to do so when I have the query ;)

acidcortex’s picture

OK,
I'm very busy lately, so I can't give you any testing results at the moment.
You can find the function name and module name in my first post, so I figured it would be easy to find the place to patch...
Anyway, here are the details for the patch.

file: pathauto_node.inc

line: 263-269

original code:
//////////////////////////////////////////////////////////////////////////
$query = 'SELECT n.nid FROM {url_alias} a '.
"INNER JOIN {node} n ON a.src = CONCAT('node/', n.nid) ".
"WHERE n.status = 1 AND a.dst LIKE '$prefix%' ".
'ORDER BY n.sticky DESC, n.created DESC';
$result = pager_query(db_rewrite_sql($query, $prefix),
variable_get('default_nodes_main', 10));
//////////////////////////////////////////////////////////////////////////

replace with:
//////////////////////////////////////////////////////////////////////////
$tempresult = db_query_temporary("SELECT nid, CONCAT('node/', nid) AS path FROM {node} WHERE status = 1 ORDER BY sticky DESC, created DESC", 'temp_node');
$tempresult = db_query_temporary("SELECT src FROM {url_alias} WHERE dst LIKE '%s%%'", $prefix, 'temp_url_alias');

$query = 'SELECT n.nid FROM {temp_url_alias} a '.
"INNER JOIN {temp_node} n ON a.src = n.path";

$result = pager_query($query, variable_get('default_nodes_main', 10));
//////////////////////////////////////////////////////////////////////////

The point is that you can filter the tables and put only the rows you need (in the result set or in the WHERE clause) into temporary tables (e.g. temp_url_alias and temp_node);
you can shrink a table by moving a WHERE clause into the query that creates the temporary table.
So when you do the JOIN query, the size of the joined objects is reduced, the DBMS needs less memory for the operation, and it runs faster.
A really good optimization would involve redesigning Drupal's relational model, but I can't do that.

It works $x times faster (3 < $x < 20), and I don't see why it shouldn't work with PostgreSQL, since I only used functions from Drupal's DB abstraction layer.

.::acidcorteX

FiReaNGeL’s picture

I played with this for an hour or two, trying to optimize it by adding indexes, building temporary tables, subqueries, etc., and I came to the conclusion that a database schema change would be needed to speed up this query (and potentially others). The problem lies in the "a.src = CONCAT('node/', n.nid)" part (the join, actually). The CONCAT can't be indexed for obvious reasons (it's computed rather than stored in the table), so when the join occurs on a big table, you get bad performance (> 30 seconds in my case).

Could the node id be added to the url_alias table? That way a simple a.nid = n.nid would do, and if indexed, it would be very fast. If someone can do this, I'm sure the query can be optimized to subsecond performance.

Also, there's no index on (status, sticky, created) or (nid, status, sticky, created) (depending on the implementation of the query), so the ORDER BY is very slow no matter what right now. I created one for testing purposes, and even with the index, the join on the unindexable CONCAT('node/', n.nid) is by far the slowest part.

Greggles, would adding a nid column to url_alias be feasible? And if so, would it help the forthcoming (hopefully) addition of a preferred_alias column to node?
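
Roughly what I have in mind - just a sketch, where the nid column, the backfill statements and the index name are assumptions of mine, not existing pathauto or core code:

// Sketch only: add an indexed nid column to url_alias and backfill it
// from the existing 'node/NID' src values.
db_query("ALTER TABLE {url_alias} ADD nid int unsigned NOT NULL default 0");
db_query("UPDATE {url_alias} SET nid = SUBSTRING(src, 6) WHERE src LIKE 'node/%%'");
db_query("CREATE INDEX url_alias_nid ON {url_alias} (nid)");

// The listing join could then use the indexed integer column instead of CONCAT():
$query = "SELECT n.nid FROM {url_alias} a ".
         "INNER JOIN {node} n ON a.nid = n.nid ".
         "WHERE n.status = 1 AND a.dst LIKE '%s%%' ".
         "ORDER BY n.sticky DESC, n.created DESC";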

greggles’s picture

Well, yeah, we can brainstorm new ideas. As long as we're doing this, I think we should consider the delete case as well. When we go to delete paths from pathauto, the logic to find which paths were created by pathauto is a little weak and can be tricked, leaving old aliases in the table.

We should know which object an alias relates to (e.g. nid, uid, tid, any other ids that we would need to track?)
We should know what kind of alias it is (e.g. base level, related to feeds, related to additional modules like contact, tracker, etc.)
We should know who/what created the alias - e.g. an admin, pathauto, something else

I'm not sure we'll ever get to a happy way of determining why/where the alias was created and whether it can be deleted. These things would be a step in the right direction.

This is a whole new level of functionality and I can't quite get my brain around it yet, but we have time. It can't happen until Drupal 6.x, so we can wait and ponder :)

m3avrck’s picture

Any updates on this? Seems like a great idea for pathauto in the 5 branch? Or 4.7?

greggles’s picture

I don't like bulk updates very much, and bulk updates on sites that have hundreds of thousands of nodes are a corner case, so I'm not going to spend time on this.

If someone else can create a patch and test it on MySQL and PGSQL and benchmark it then I'd be happy to commit it.

FiReaNGeL’s picture

The proposed patch speeds things up by 10x, but we're still looking at 30-second queries, which is obviously not acceptable. As I said in my previous comment, the current url_alias table schema is broken (as in, it does not allow this, among other things; it's also the reason why paths as a whole don't scale).

To me, adding 'id' and 'type' (user, menu, node, other?) columns to url_alias would greatly improve things. Need the path for node 19373? id = 19373, type = node, voila. We could also store the 'created by' provenance of the alias, which would be useful for things like 'delete all pathauto-generated aliases, but not my custom ones', and for 'don't let pathauto replace aliases I created', which is annoying on a large site.
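
For illustration, the kind of lookups that would become possible (the column names here - id, type, creator - are hypothetical, not the current schema):

// Hypothetical: fetch the alias for node 19373 via indexed integer columns.
$dst = db_result(db_query("SELECT dst FROM {url_alias} WHERE type = '%s' AND id = %d", 'node', 19373));

// Hypothetical: delete only pathauto-generated node aliases, keeping custom ones,
// assuming a 'creator' column records the alias provenance.
db_query("DELETE FROM {url_alias} WHERE type = '%s' AND creator = '%s'", 'node', 'pathauto');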

Getting rid of the 'multiple aliases per path' behavior would also help on many fronts.

csevb10’s picture

In concert with schedule module integration for bulk updates, I am planning to create a patch incorporating some of these ideas. Hopefully the temporary table will help improve performance, and a throttling system as discussed at http://drupal.org/node/67665 can be used alongside it to speed up the bulk update system and provide unlimited scalability. If anyone has any ideas or thoughts about this, please voice them; otherwise, I'll try to get a patch out for this in the next week or so.
--Bill

greggles’s picture

Title: slow CROSS JOIN query in pathauto_node.inc and neverending bulk update » improve speed of bulkupdates
Version: 4.7.x-1.x-dev » 5.x-2.x-dev

Ok, well the proposed patch in #4 is related to a piece of code which is now gone. So...I'm changing the title to reflect what I see as the remaining purpose of this issue (and also the version).

greggles’s picture

Status: Active » Postponed (maintainer needs more info)

For what it's worth, I committed a fix to "perform updates in configurable sized chunks", so the priority of this issue just went down, in my opinion. I wasn't working on it anyway, but if everyone else agrees we can close this issue; I'll wait a month for feedback or activity before doing so. I mean, I'd love to see improved performance, but this happens so infrequently, and now that we have a workaround, we should focus our tuning efforts elsewhere, right?

http://drupal.org/node/67665

csevb10’s picture

That was the core of the patch I wrote too, so I definitely agree that, with configurable-sized bulk updates, the speed issue is not that big of a deal. Sorry that I never got the chance to contribute back what I wrote, but I'm glad that you managed to get it done. :-)

jason.fisher’s picture

I believe this query in pathauto_node.inc needs to be revisited. I can update 20,000 categories in a few seconds, but this query completely locks up with 40,000 nodes and 30 content types. That LEFT JOIN ON CONCAT is extremely painful and a non-starter here.

pathauto_node.inc, line 68:

$query = "SELECT nid, type, title, uid, created, src, dst, vid FROM {node} LEFT JOIN {url_alias} ON CONCAT('node/', nid) = src WHERE src IS NULL ". $type_where;

jason.fisher’s picture

Doing some investigating..

Quick benchmark of original query:

SELECT nid, type, title, uid, created, src, dst, vid FROM node LEFT JOIN url_alias ON CONCAT('node/', nid) = src WHERE src IS NULL AND ( type = 'blog' OR  type = 'book' OR  type = 'calculation_center' OR  type = 'decisions_ranking' OR  type = 'decisions_selection' OR  type = 'event' OR  type = 'forum' OR  type = 'helptip' OR  type = 'host' OR  type = 'hostevent' OR  type = 'hostnews' OR  type = 'hostpage' OR  type = 'page' OR  type = 'pcalc_cga_test' OR  type = 'poll' OR  type = 'profile' OR  type = 'resource' OR  type = 'simplenews' OR  type = 'source' OR  type = 'sponsor' OR  type = 'staff' OR  type = 'story' OR  type = 'testimonial' OR  type = 'userlink' OR  type = 'usernode' OR  type = 'weblink') LIMIT 0,50;

...
50 rows in set (5.67 sec)

The LEFT JOIN ON CONCAT is impossibly slow, and we're really only joining to find non-matches. A possible replacement query:

SELECT nid, vid, type, title, uid, created FROM node WHERE concat('node/', nid) NOT IN (SELECT src FROM url_alias) AND (type = 'blog' OR  type = 'book' OR  type = 'calculation_center' OR  type = 'decisions_ranking' OR  type = 'decisions_selection' OR  type = 'event' OR  type = 'forum' OR  type = 'helptip' OR  type = 'host' OR  type = 'hostevent' OR  type = 'hostnews' OR  type = 'hostpage' OR  type = 'page' OR  type = 'pcalc_cga_test' OR  type = 'poll' OR  type = 'profile' OR  type = 'resource' OR  type = 'simplenews' OR  type = 'source' OR  type = 'sponsor' OR  type = 'staff' OR  type = 'story' OR  type = 'testimonial' OR  type = 'userlink' OR  type = 'usernode' OR  type = 'weblink') LIMIT 0,50;

...
50 rows in set (0.75 sec)

Testing now..

greggles’s picture

Priority: Normal » Minor

A 5.6 second query is a "non starter" and "impossibly slow"?

Bulk updates should only be run in very uncommon situations.

If you find ways to improve it, that's great and I will consider including them. Please do, however, provide it as a patch and test it on all supported databases (MySQL 4.x+, PostgreSQL 7.3+) as well.

jason.fisher’s picture

Priority: Minor » Normal
Status: Postponed (maintainer needs more info) » Needs review

OK, we're getting somewhere. With this large a site, I am now able to run node_pathauto_bulkupdate() for 2000 nodes per bulk run (I haven't tried more yet).

The code that generated the list of node types and appended it to the old query was not working properly -- it just ended up pulling all node types. We do not need to pull the profiles before querying nodes, since we already pull the profile for each node. We also only need to pull the nid, because we pass it straight to node_load().

Updated node_pathauto_bulkupdate function for pathauto_node.inc:

function node_pathauto_bulkupdate() {

  $query = "SELECT nid FROM {node} WHERE CONCAT('node/', nid) NOT IN (SELECT src FROM {url_alias}) ";
  $result = db_query_range($query, array(), 0, variable_get('pathauto_max_bulk_update', 50));

  $count = 0;
  $placeholders = array();
  while ($node_ref = db_fetch_object($result)) {
    $node = node_load($node_ref->nid, NULL, TRUE);
    $node->src = NULL;
    $node->dst = NULL;
    if (module_exists('taxonomy')) {
      // Must populate the terms for the node here for the category
      // placeholders to work.
      $node->taxonomy = array_keys(taxonomy_node_get_terms($node->nid));
    }
    $placeholders = pathauto_get_placeholders('node', $node);
    $src = "node/$node->nid";
    if ($alias = pathauto_create_alias('node', 'bulkupdate', $placeholders, $src, $node->type)) {
      $count++;
    }
  }

  drupal_set_message(format_plural($count,
    "Bulk generation of nodes completed, one alias generated.",
    "Bulk generation of nodes completed, @count aliases generated."));
}

I've attached a patch for pathauto_node.inc against 5.x-2.x-dev.

greggles’s picture

Status: Needs review » Needs work

If you look at the commit logs for the file you will see why we need to limit by node type.

Please add that code back or find a better solution to the problem (there are some better solutions to it, but we need at least one).

Also, test on the supported db platforms prior to uploading the patch.

jason.fisher’s picture

This patch keeps the original (type = '%s' OR type = '%s' .. etc) code.

As for testing on other supported databases, I agree.

I have only tested this on MySQL 4.1.20 -- bulk updating nodes is completely unusable without this patch (MySQL runs at 99% CPU until it eats all available memory, then dies and possibly takes the server with it). I invite anyone with a different supported database installed to test this patch and report back here; I will gladly do what I can to help. At a glance, PostgreSQL supports "NOT IN (SELECT)" and should handle this patch fine.

greggles’s picture

Perhaps I misunderstood.

Isn't the offending query the one that took 5.67 seconds in comment #15?

But now you are saying that it completely takes down the server?

I feel like there are some pieces of information missing from your description of the situation.

Does the server only use up all memory and die when trying to alias thousands of nodes in one update? If so...isn't the solution "don't do that".

jason.fisher’s picture

FYI - on a virtual server with 256 MB of RAM I am able to generate aliases for roughly 8000 nodes in 10 minutes, or about 13.3 nodes per second. I am also using CCK and OG heavily, and the OG token hook adds an additional node_load() to the overhead. I have not tracked memory usage yet.

Would it be feasible to calculate ini_set('max_execution_time') as bulkupdate_limit / nodes_per_second before jumping into the bulk update loop?
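
Something like this, perhaps - a minimal sketch, where the nodes-per-second figure is the one measured above and the variable names are just illustrative:

// Sketch: size max_execution_time to the batch before starting the bulk update loop.
$limit = variable_get('pathauto_max_bulk_update', 50);
$nodes_per_second = 13.3; // measured on this particular server
ini_set('max_execution_time', (int) ceil($limit / $nodes_per_second) + 60); // plus some headroom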

greggles’s picture

Note that additional calls to node_load within a single page request are cached - see http://api.drupal.org/api/function/node_load/5

The node_load in og_token_values should add basically zero extra processing time.

Perhaps we can start over with this: what is a reasonable number of nodes to alias in a single bulk update run? To me, that number is 100.

jason.fisher’s picture

Some background:

I am migrating from OpenACS/CCM to Drupal, so the pathauto bulk update is very useful for me -- there are 40k+ nodes (including usernodes), each with CCK fields and terms, and an empty url_alias table; updating 50-100 at a time is not an option.

LEFT JOIN with "ON CONCAT" on the query above is slow when there are many nodes and few url_alias records.

SELECT nid, type, title, uid, created, src, dst, vid FROM node LEFT JOIN url_alias ON CONCAT('node/', nid) = src WHERE src IS NULL AND ( type = 'blog' OR  type = 'book' OR  type = 'calculation_center' OR  type = 'decisions_ranking' OR  type = 'decisions_selection' OR  type = 'event' OR  type = 'forum' OR  type = 'helptip' OR  type = 'host' OR  type = 'hostevent' OR  type = 'hostnews' OR  type = 'hostpage' OR  type = 'page' OR  type = 'pcalc_cga_test' OR  type = 'poll' OR  type = 'profile' OR  type = 'resource' OR  type = 'simplenews' OR  type = 'source' OR  type = 'sponsor' OR  type = 'staff' OR  type = 'story' OR  type = 'testimonial' OR  type = 'userlink' OR  type = 'usernode' OR  type = 'weblink') LIMIT 0,10;

I ran this just now: 10 rows in set (7.59 sec)

Same with LIMIT 0,20: 20 rows in set (13.29 sec)

-- 7.59 seconds with a limit of 10 nodes. A limit of 8000 nodes would take 101 minutes just to select which nodes we should process.

Modifying that one query to be:

SELECT nid FROM node WHERE concat('node/', nid) NOT IN (SELECT src FROM url_alias) AND (type = 'blog' OR  type = 'book' OR  type = 'calculation_center' OR  type = 'decisions_ranking' OR  type = 'decisions_selection' OR  type = 'event' OR  type = 'forum' OR  type = 'helptip' OR  type = 'host' OR  type = 'hostevent' OR  type = 'hostnews' OR  type = 'hostpage' OR  type = 'page' OR  type = 'pcalc_cga_test' OR  type = 'poll' OR  type = 'profile' OR  type = 'resource' OR  type = 'simplenews' OR  type = 'source' OR  type = 'sponsor' OR  type = 'staff' OR  type = 'story' OR  type = 'testimonial' OR  type = 'userlink' OR  type = 'usernode' OR  type = 'weblink') LIMIT 0,50;

After further testing, this replacement query was much faster for me until url_alias had 35,000 entries or so; now it is much slower than the old query. It could be an issue of limited memory here, but regardless, it is not a final solution. Before reaching that many records, I was able to fully process 8000 nodes in 10 minutes with the updated query. The old query would time out with the limit set much beyond 50.

I am doing some more investigating into other alternatives to this query -- which I believe is the biggest bottleneck causing pathauto_node to time out on large sites. I will post any patches and notes I come up with here.

jason.fisher’s picture

In summary -- the bottleneck for me is not processing the aliases, it's selecting which nodes need to be processed. Not having a simple integer relationship between url_alias and node makes things tough.

I believe creating a node.nid <-> url_alias.pid mapping table would be one solution. Another (possibly easier) solution would be to add a 'bulk update from scratch' option that clears everything LIKE 'node/%' from url_alias before running the node update. We would then not have to do any comparison/CONCAT/joins and could get straight to processing nodes.
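
A rough sketch of the first idea (the table and column names are made up here; the second idea is roughly what the workaround function posted later in this thread does):

// Hypothetical mapping table tying node aliases to nodes by integer keys.
db_query("CREATE TABLE {pathauto_node_alias} (
  nid int unsigned NOT NULL,
  pid int unsigned NOT NULL,
  PRIMARY KEY (nid),
  KEY pid (pid)
)");

// Finding nodes that still need an alias then becomes an indexed anti-join
// instead of a join on CONCAT('node/', nid):
$query = "SELECT n.nid FROM {node} n ".
         "LEFT JOIN {pathauto_node_alias} pna ON pna.nid = n.nid ".
         "WHERE pna.nid IS NULL";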

greggles’s picture

Well, your use case is pretty uncommon and for the uncommon case I have no problem with the idea of you modifying the code to fit your precise situation. I certainly hope that you find a solution, though.

jason.fisher’s picture

Status: Needs work » Postponed (maintainer needs more info)

I've reached a workable (for me) temporary solution. It takes much less time to re-process all of my node aliases from scratch than it does to figure out which nodes need an alias.

This modified function resets the aliases for nodes, sets the timeout long enough to process them all, and generates them. url_alias was regenerated in about an hour.

I am not providing a patch for this one, just including the function for reference -- hopefully others in my situation can find this and take something from it.

function node_pathauto_bulkupdate() {

  // From all node types, only attempt to update those with patterns
  $pattern_types = array();
  $type_where = '';
  foreach (node_get_types() as $type => $info) {
    $pattern = variable_get('pathauto_node_'. $type .'_pattern', '');

    // If it's not set, check the default
    if (!trim($pattern)) {
      $pattern = drupal_strtolower(variable_get('pathauto_node_pattern', ''));
    }
    if (trim($pattern)) {
      $pattern_types[] = $type;
      if (!trim($type_where)) {
        $type_where = " AND (type = '%s' ";
      }
      else {
        $type_where .= " OR type = '%s'";
      }
    }
  }
  // Only close the parenthesis if at least one type actually has a pattern.
  if ($type_where) {
    $type_where .= ')';
  }

  ini_set("memory_limit", "244M");
  ini_set("max_execution_time", "3600");

  db_query("DELETE FROM {url_alias} WHERE src LIKE 'node/%'");
  $query = "SELECT nid FROM {node} WHERE 1=1 ". $type_where;
  $result = db_query($query, $pattern_types);

  $count = 0;
  $placeholders = array();
  while ($node_ref = db_fetch_object($result)) {
    // print ". ";ob_flush();flush(); // real-time "progress" hack.
    $node = node_load($node_ref->nid, NULL, TRUE);
    $node->src = NULL;
    $node->dst = NULL;
    if (module_exists('taxonomy')) {
      // Must populate the terms for the node here for the category
      // placeholders to work.
      $node->taxonomy = array_keys(taxonomy_node_get_terms($node->nid));
    }
    $placeholders = pathauto_get_placeholders('node', $node);
    $src = "node/$node->nid";
    if ($alias = pathauto_create_alias('node', 'bulkupdate', $placeholders, $src, $node->type)) {
      $count++;
    }
  }

  drupal_set_message(format_plural($count,
    "Bulk generation of nodes completed, one alias generated.",
    "Bulk generation of nodes completed, @count aliases generated."));
}

greggles’s picture

Status: Postponed (maintainer needs more info) » Closed (won't fix)

I don't see any of those features as good things for us to do on an ongoing basis, so I'm "won't fixing" this.

szy’s picture

Status: Closed (won't fix) » Needs work

fysa, is there a chance you could rewrite your code for the current HEAD version?

It works with the current HEAD, but... it doesn't respect the patterns specified for the different node (or term) types - it generates all the paths using only the default path pattern.

And it does it really fast - about 20 paths per second... I liked it ;)

But the question is: was this speed because of the HEAD version, or because it ignores the specified path patterns?

greggles, thanks for the module :]

Szy.

greggles’s picture

Status: Needs work » Closed (won't fix)

@szy - take a look at http://drupal.org/node/212327 for some specific ideas on drastically improving the speed of bulk updates.

adminfor@inforo.com.ar’s picture

The db_rewrite_sql() query in node_pathauto_page() of pathauto_node.inc gave me some problems. With a large number of nodes of the same kind, it locked other SQL processes for 200 or 300 seconds - imagine 200 other SQL processes locked, waiting for this db_rewrite_sql query to finish.

So, as a workaround, I put a watchdog entry in place to tell me which path aliases were consuming so many resources.

I added two lines, one before and one after the query, for logging purposes:

$t1=time();
$result = pager_query(db_rewrite_sql ....  etc etc
watchdog ('pathauto','TIME '.(time()-$t1));
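
A slightly fuller version of the same hack - a sketch only, assuming the $prefix variable from node_pathauto_page() is in scope so the slow prefixes show up in the log:

$t1 = time();
$result = pager_query(db_rewrite_sql($query, $prefix), variable_get('default_nodes_main', 10));
// Log how long the listing query took and which alias prefix triggered it.
watchdog('pathauto', 'TIME '. (time() - $t1) .' seconds for prefix '. $prefix);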

After analyzing the watchdog "pathauto" entries daily, I replaced the conflicting pathauto alias URLs with views, and deleted the corresponding url_alias records by hand so that the view runs instead of node_pathauto_page().

Not so elegant, but it works.

I'm running Drupal 5.x and pathauto 5.x-1.1 (without the token module).

Gustavo
www.inforo.com.ar