See Investigate use of conf_url_rewrite for context...
The current core support for translating incoming path aliases to the internal Drupal path (drupal_get_path_alias) and substituting aliases when generating links to internal paths (drupal_get_normal_path) does not scale well with many aliases. The ease with which pathauto enables site administrators to generate large numbers of aliases exposes this issue, but it is inherent in the core implementation, because drupal_get_path_map reads in the entire url_alias table at bootstrap time. I'd like to discuss ideas on how to improve the performance...
Most obviously, why not simply query the url_alias table as needed instead of loading the whole table? In the incoming case, only a single simple SELECT is necessary, which will always be more efficient than reading the table. In the outgoing case, there might be some slight performance advantage to caching the table with a small number of aliases, but the disadvantage can become huge as the alias table grows.
One note - I've noticed that the src column in url_alias is not indexed, I think adding an index should significantly help the performance in the outgoing case if we were to do individual SELECTs.
Any other thoughts?
Comment | File | Size | Author |
---|---|---|---|
#41 | alias-fix.patch | 530 bytes | matt westgate |
#32 | alias-query-on-demand2.patch | 1.03 KB | matt westgate |
#31 | alias.patch | 7.97 KB | Dries |
#30 | alias-query-on-demand_0.patch | 7.5 KB | matt westgate |
#28 | alias-query-on-demand.patch | 5.72 KB | matt westgate |
Comments
Comment #1
mikeryanWhat the hell, I went ahead and gave it a shot... On my home system, where I'm testing out 4.6.0, page loads have been taking several seconds, which I attributed to the fact that it's an old computer and I'm multi-tasking like crazy. I implemented my own suggestion, and now pages load in about one second, it makes an incredible difference (FWIW, my url_alias table has over 4000 rows).
Patches attached, go give it a try...
Comment #2
mikeryanOne attachment per note, that's tedious...
Comment #3
mikeryanComment #4
mikeryanComment #5
mikeryanComment #6
mikeryanNote: I just noticed that the database.*sql patches showed an extra diff (not mine) to the location field in locales_source. I did my diffs against a freshly-updated DRUPAL-4-6, and had made the edits against a release from a few days ago... Those locales_source changes should probably be removed from my patches before applying.
Comment #7
mikeryanI upgraded Fenway Views to Drupal 4.6.0 today, incorporating these patches. Performance is very noticeably improved.
Comment #8
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedI've merged the patch into one. Much more convenient. I also changed it to cvs as only there new features will be added.
Upgrade path needs to be added.
Mike, can you run some more tests? We are especially interested in "hard" data ie numbers with possibly error bars. it would be interesting to know how this patch affects sites that have only a few path aliases vs those with a lot of them. Also, how pages with a lot of node links (tracker) are affected.
It might be worthwhile to add static caching in drupal_get_normal_path if it gets calle dmore than once for a link on one page view.
Comment #9
mikeryanThanks for merging the patches, I wasn't aware you could patch multiple files at once. If this is accepted, of course, adding the index on src should be incorporated into updates.inc.
In terms of hard data, I've never profiled PHP code - what tool(s) do you use? I couldn't find anything referenced in the PHP manual...
re: pages with lots of links - on Fenway Views the big test is the calendar, and the performance improvement seems even more dramatic here. I don't know why - the url_alias table is read once per page, so I would expect as you do that preloading the table would tend to look better when you have a hundred internal links on the page. Maybe we're both underestimating exactly how fast a simple MySQL query on an indexed key can be (I'm using MySQL 4.0.20, BTW)...
Comment #10
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedAs we speak, Mathias is doing some tests. The tool of choice is usually apache bench. If you have a lot of url aliases, the generated array will be huge. Part of the improvement could be due to the fact that you need less memory now.
Comment #11
mikeryanGreat, thanks!
Comment #12
Dries CreditAttribution: Dries commentedOn drupal.org it takes 800 ms or more to generate a page. Of these 800 ms, only 4 ms is spent building the path alias map (incl. the SQL query time which takes about 1 ms). That is, building the map takes 0.5% of the total time which is neglible. In our
url_alias
table are only 321 aliases though. How many aliases do you have?What is more, when we build drupal.org's main page, we query this map 215 times. I believe your patch would introduce 215 SQL queries ...
I'm afraid that if we'd apply your patch, we'd pay a serious performance penalty (unless we have many more aliases).
Can you provide some numbers/measurements?
Comment #13
Dries CreditAttribution: Dries commentedNote: my measurements did not take into account the time spent searching the map. On the main page, we search this array 215 times.
Comment #14
puregin CreditAttribution: puregin commentedI believe that it's felt that adding path aliases would improve the Drupal documentation. This may well add 400+ paths to the URL aliases table on Drupal.org
Comment #15
Dries CreditAttribution: Dries commentedI spent some more time investigating this.
The relevant code in bootstrap.inc is this:
Turns out that the time to build the map is neglible (< 5ms, see previous comments), however the total time spent on 'path aliasing' it about 70ms per page! 50ms of these 70ms are spent on
$map = drupal_get_path_map()
. The remaining 20ms is spent on$newpath = array_search($path, $map)
.The first call to drupal_get_path_map() takes 3 to 5 ms, each subsequent call takes about 0.3 ms. Searching the array costs 0.1 ms. However, if you have to do this 130+ times, this adds up to a whopping 70 ms. Remind that we have 321 aliases in our map.
As
drupal_get_path_alias()
is typically called from code>url(), which in turn is typically called froml()
I set out to investigate the time spent inl()
. Looks likel()
easily gets called 120 times/page, and that each call tol()
costs about 0.9 ms. That is about 2 or 3 times the cost of the average SQL query. The total time spent inl()
is 115 ms!Comment #16
adrian CreditAttribution: adrian commentedThis is a complete aside, and probably off-topic, but I would like to mention that I was actually
looking at rewriting url() to be able to handle 'external urls' , because I want to do aliases using the site subdomain, like for instance :
http://category.sitename.com/article/title_goes_here -> node/23
Rewriting url to allow this would also allow us to use l() for external links of any kind, getting rid of a bunch of inline html.
Comment #17
chx CreditAttribution: chx commentedadrian, I have written a patch for l to support external links but Ber said that a module like weblink shall handle those.
Comment #18
mikeryanI've got over 4000 aliases on Fenway Views - another pathauto user reported over 6000.
I don't have a profiling tool to give timing data, but adding a quick counter shows that for the Fenway Views calendar, drupal_get_path_alias() is called 404 times and drupal_get_normal_path is called 227 times. And trust me, this page loads MUCH faster with my patch than it does with the path map.
I think those SQL queries are a lot cheaper than one might expect...
Comment #19
matt westgate CreditAttribution: matt westgate commentedSome benchmarks, with MySQL and Drupal caching disabled.
2 SETS OF TESTS
=======================================
1) 500 nodes and aliases
2) 20,000 nodes and aliases
3 TYPES OF TESTS PER SET
=======================================
1) Baseline unmodified Drupal
2) The following change inside drupal_get_path_map():
3) Pull aliases only when needed rather than loading the entire alias table.
SET 1 - 500 Nodes and 500 Aliases
using: ab -c 10 -n 100 [homepage]
=======================================
Baseline
Time taken for tests: 13.00 seconds
Requests per second: 3.85 [#/sec] (mean)
Transfer rate: 56.35 [Kbytes/sec] received
is_null() VS NULL
Time taken for tests: 12.463 seconds
Requests per second: 4.01 [#/sec] (mean)
Transfer rate: 57.63 [Kbytes/sec] received
Per Instance Alias Lookup (32 additional queries)
Time taken for tests: 11.502 seconds
Requests per second: 4.35 [#/sec] (mean)
Transfer rate: 63.30 [Kbytes/sec] received
SET 2 - 20,000 Nodes and 20,000 Aliases
using: ab -c 5 -n 50 [homepage]
=======================================
Baseline
Time taken for tests: 204.583 seconds
Requests per second: 0.24 [#/sec] (mean)
Transfer rate: 3.34 [Kbytes/sec] received
is_null() VS NULL
Time taken for tests: 127.629 seconds
Requests per second: 0.39 [#/sec] (mean)
Transfer rate: 5.35 [Kbytes/sec] received
Per Instance Alias Lookup (32 additional queries)
Time taken for tests: 28.788 seconds
Requests per second: 1.74 [#/sec] (mean)
Transfer rate: 23.59 [Kbytes/sec] received
ANALYSIS
=======================================
Converting 'is_null()' to '=== NULL' is a no brainer and results in a nice performance gain. And while per instance alias lookups also give a huge boost for site with thousands of aliases, they're benefits are entirely dependent on the number of system URLs and menu items visible per request. As noted above I tested the standard homepage which added 32 additional queries. A request for something like the admin interface would presumably show less benefits. I would've tested this, but I didn't know how to invoke an authenticated page request with 'ab'. A positive of this approach is we no longer would be storing all the aliases in memory.
What other concerns do we have with this last approach? Am I properly testing the strain on the database?
Comment #20
Dries CreditAttribution: Dries commentedCan you try making the following changes?
1. To common.inc:
2. To bootstrap.inc:
That is, replace the string comparison with a boolean comparison. Not sure it is going to make a significant difference but it might be another micro-improvement.
I'll test your patch shortly.
Comment #21
matt westgate CreditAttribution: matt westgate commentedDries. Those changes had no real performance gain.
20,000 Nodes and 20,000 Aliases
using: ab -c 5 -n 50 [homepage]
=======================================
Baseline
Time taken for tests: 204.583 seconds
Requests per second: 0.24 [#/sec] (mean)
Transfer rate: 3.34 [Kbytes/sec] received
String VS Boolean
Time taken for tests: 204.190 seconds
Requests per second: 0.24 [#/sec] (mean)
Transfer rate: 3.34 [Kbytes/sec] received
Comment #22
Dries CreditAttribution: Dries commentedBefore this patch can be committed, we need to do more testing. I'd like to know how this behaves on sites with few or no path aliases, and on sites like drupal.org, with a modest amount of path aliases.
The reason I'm asking is because MySQL is often the bottleneck and not PHP/Apache. This is the case on drupal.org, for example. The proposed patch moves some of the processing costs from PHP to MySQL. On drupal.org, the amount of SQL queries/page would double. Needless to say, this is somewhat scary. ;)
More numbers would be appreciated.
Comment #23
mikeryanI'm curious, why test with MySQL caching disabled? Since much of the issue is the expense of making (potentially many) more queries, this wouldn't seem to reflect the performance gain in practice.
The caching (compared to making the query each time) seems like overkill to me - with MySQL caching enabled, I would think this complicates the code for little (if any) performance gain. Did you do any profiling using my original (cacheless) patch?
Thanks.
Comment #24
matt westgate CreditAttribution: matt westgate commentedIn an ideal world everyone would be running a query cache but I wanted to see how this would hold up under the worst-case scenario.
I tested the original patch which didn't use static variables and the amount of queries doubled. For example, to load the front page, Drupal queried the url_alias table 22 times looking for an alias for 'node'. There's no need to send duplicate queries to the DB since it seems that's where most bottlenecks seem to lie.
Comment #25
mikeryanre: worst-case scenario - I understand that.
re: # queries - I would not want to assume adding to the number of queries is necessarily a significant performance hit (with database caching on, of course). I understand the suspicion - I've been around long enough to remember when you did everything you could to avoid making one single SQL query more than you absolutely had to. But modern databases (including MySQL) are very well optimized for this kind of application (lots and lots of small queries). And, I have seen real-life cases where adding caching of data that's already being cached somewhere upstream actually degrades performance.
As long as we're examining this particular area of performance under a microscope, let's make sure we squeeze every millisecond of savings we can out of it - I'd just like to see the numbers for a cached database/no Drupal caching combination for comparison.
Thanks.
Comment #26
matt westgate CreditAttribution: matt westgate commentedDries asked me to do some tests using the drupal.org database. Once again it appears that plucking aliases out of the database as needed performed better than grabbing the entire alias table at the beginning of the request lifecycle, even if the alias table is small. Here's the numbers.
Notice that the database does work a little harder to retrieve the queries one by one. And it also doesn't make sense to run 40 extra queries on a site with only 50 aliases.
I think at this point we have a couple of options to explore:
1. Continue to try and squeeze out any other optimizations we can in the original path aliasing code. Maybe look at output buffering or a mechnism that'll allow us to start working the database resultset as soon data is fed into the pipe rather than waiting for the request to finish.
2. Try to identify the point (e.g., the number aliases) at which it becomes more effecient to fetch aliases and change patterns on the fly.
As with most optimizations, the end result will probably be a combination of both.
Comment #27
Dries CreditAttribution: Dries commentedLooks like in all cases, performance actually improves. I've run a couple tests myself using the drupal.org database and measured performance improvements up to 15%. Of course, this was done on a machine where the MySQL database wasn't hammered on by 50+ concurrent requests.
It would be appreciated if you could extend your patch with a couple lines of documentation to document the behavior (so people understand why we are doing it this way), the code and the API. Maybe add an entry to CHANGELOG.txt? Do we need an update for
updates.inc
?As for future work, I think we're better off looking at
l()
. As said, one call tol()
takes about 0.9 ms and we easily do 100+ such calls per page. Caching at the level ofl()
is likely to buy us more but I don't know whether it is feasible. Maybe we can think of other optimizations too. If anything, it might be a good idea to add a TODO/NOTE to the PHPdoc ofl()
. Or, if you feel like messing withl()
first, by all means! Count me in -- it's fun. :-)Comment #28
matt westgate CreditAttribution: matt westgate commentedI think this patch is now commit-worthy.
I did some more profiling using xdebug 2.0 and kcachegrind and sure enough, drupal_get_path_map() was a resource hog on sites with many aliases. This should be quite an improvement.
I also found a couple other small optimizations such as replacing array_key_exists() with a NULL check inside the arg() function (before / after) and calling isset() in order to avoid calls to error_handler.
Comment #29
Dries CreditAttribution: Dries commentedThat last patch doens't seem to work. None of my URLs are being substituted by their path alias. Also, mind to rename the
src
action tosource
?Comment #30
matt westgate CreditAttribution: matt westgate commentedDoh! The previous version only worked when the path module was enabled. Here's a new one that works independent of path module. It's also a little faster now.
The gains on drupal_org and sites with under 400 or so aliases is small. Actually those sites will spend more time in the DB and less time churning through PHP code. Further optimizations could be made here by defaulting to the old method of grabbing all aliases and looking for further PHP optimizations. Should we look into this or is that overkill?
Comment #31
Dries CreditAttribution: Dries commentedI upgraded the patch to HEAD (didn't apply) but it still doesn't work ... my URLs are not replaced by their aliases. I attached the updated version of your last patch.
No need to optimize this more so let's evaluate this first. I'll test it on drupal.org, which only has a few aliases. Let's hope other people will test this too.
Comment #32
matt westgate CreditAttribution: matt westgate commentedDatabase schema updates for the applied patch.
Comment #33
Dries CreditAttribution: Dries commentedCommitted to HEAD. Thanks.
Comment #34
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedDries apparently updated drupal.org to use the new path retrieval mechanism.
While looking at the query log, I find that quite a lof of the path alias lookups are apparenlty done from the non cached menu items. Since we already collect those in _menu_append_contextual_items we should investigate if we couldn't retreive them in one db query there.
Comment #35
altano CreditAttribution: altano commentedThis performance issue is SCREAMING temporal locality. Is there no middle ground to grabbing the entire path alias table and grabbing them one by one?
Now, I assume we can all agree that a path being accessed makes other paths created soon before or after much more likely to be accessed (especially for more recent content).
Given this is true, if we could load the 25 paths created before and the 25 after, we would probably see an amazing reduction in SQL queries while still keeping memory usage down and not having to grab the entire path table.
Of course though, the issue is how would we do this? If we have to do a SELECT with an ORDER BY, that immediately negates any benefits we might get. I'm no DBA, but isn't there a way we could keep the table in date-added-sorted-order as we add paths, and then SELECT the extra path's on retrieval without any work? Find and SELECT a row, SELECT everything 25 above and 25 below. Does MySQL support views yet? We could make a view that stores the table in sorted order and update it everytime a path is added.
I know I'm in over my head posting in this thread, but I want to give it a shot... feel free to call me stupid :O
Comment #36
eldarin CreditAttribution: eldarin commentedThe point of associating aliases would be to try and group all of the aliases that would loud in each page. Now, the loading of a page - an unique URL - always uses the l() function to generate the URLs; therefore generating the statistics for grouping URL-aliases (i.e associative id) could e.g be initiated in a hook from that function. The tagging of association would only happen on changes in generating the page (i.e possible new aliases). Examples of this would be when the node was created, when a new comment was added/removed or other block-elements were changed. Optimizing further, one could say that block-content aliases would always be loaded.
In my opinion - it is possible to make some performance improvements based on this.
Finally, one thing that would degrade performance further:
localized URLs and URL aliases
Because of the big search-engines and how they work as well as for the users, it would be very beneficial to have intelligible and meaningful URLs. To me, that's part of good design.
So, being able to localize the l() function and the aliases would be very good. And using hastables, associative grouping columns in the database or similar techniques to improve performance could alleviate the extra table lookups neccessary.
Comment #37
saerdna CreditAttribution: saerdna commentedGuys, I would like a followup on this. What was the final solution?
Comment #38
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedthe patch was committed to CVS, people who'd like to chane the current mechansm should open a new issue. Marking fixed.
Comment #39
(not verified) CreditAttribution: commentedComment #40
dkruglyak CreditAttribution: dkruglyak commentedPath alias map gets corrupted and is inefficient. The issue and fix posted here:
http://drupal.org/node/43712#comment-64826
Comment #41
matt westgate CreditAttribution: matt westgate commentedThis patch fixes the corruption of the path alias cache for HEAD.
Regarding performance, I could not find any speed improvements using this code on HEAD, although it would still make a world of difference on a 4.6 site.
Comment #42
Dries CreditAttribution: Dries commentedCommitted to HEAD. Thanks.
Comment #43
(not verified) CreditAttribution: commented