Let's benchmark the difference between the node.module code and the views implementation.


Comments

dawehner’s picture

FileSize
71.64 KB
80.3 KB
1.3 KB
1.42 KB
1.3 KB
1.43 KB

Set up two identical Drupal installations with the standard profile and used devel_generate to
create 1000 nodes with 20 comments each.

First ran some ab benchmarks using -c1 -n500 (see after.txt, after2.txt, before.txt, before2.txt):

before:
291 +- 22.7
285 +- 6.0
after:
371 +- 37.9
354 +- 8.0

Profiling the actual request with XHProf enabled (again as the anonymous user) shows something like this:

before:
Overall Summary
Total Incl. Wall Time (microsec):	350,401 microsecs
Total Incl. CPU (microsecs):	340,021 microsecs
Total Incl. MemUse (bytes):	10,471,648 bytes
Total Incl. PeakMemUse (bytes):	10,633,648 bytes
Number of Function Calls:	45,986

after:

Overall Summary
Total Incl. Wall Time (microsec):	449,621 microsecs
Total Incl. CPU (microsecs):	436,028 microsecs
Total Incl. MemUse (bytes):	13,040,504 bytes
Total Incl. PeakMemUse (bytes):	13,281,360 bytes
Number of Function Calls:	53,844

More information to come.

xjm’s picture

Are these both on the first request to the page? Or on cached requests? Is page caching enabled? What about caching for the view?

dawehner’s picture

Forgot to mention this:
there is no page cache enabled on the page and no caching in Views.
Neither run is the first request to the page; the ab testing came first, then the XHProf run. The difference between after/after2 and before/before2 is
a new installation from scratch with the same configuration.

xjm’s picture

And is Views UI enabled or disabled? Is Views enabled for the "before"?

dawehner’s picture

Views UI is disabled on both before and after, though the Views module itself is enabled on both. My idea was to isolate the difference
between running the view and not running the view, since there might be a constant overhead from, for example, just loading views.module.

Maybe we could discuss a better benchmark strategy?

xjm’s picture

I think that's a good strategy. We should also do benchmarks for cached content, I think, and profile the queries.

I'm concerned because, so far, that's actually a pretty severe regression.

One last question about the tests above: are they using the VDC sandbox version of the front page view or the old one?

dawehner’s picture

The benchmark compared plain D8 against the VDC sandbox, i.e. the strategy you would expect.

If you look at the pure difference between D8 and VDC, it's node_page_default() (which uses node_load_multiple() and node_view_multiple()) compared to Views, which uses node_load_multiple() (not node_view_multiple() yet, see below) plus the full Views machinery, ending up calling the same code in the end.

Sadly there will be no performance gain in using node_view_multiple, but anyway here is a new issue for that: #1811878: Use node_view_multiple (and possible other view_multiple in comment/taxonomy/user ...) in row plugin

I will do tests with views caching enabled (probably entity-cache would improve the situation as well).

dawehner’s picture

These are aggregated data from 100 runs on d8/vdc to reduce the fluctuations:

Overall Summary
Total Incl. Wall Time (microsec):	264,621 microsecs
Total Incl. CPU (microsecs):	257,496 microsecs
Total Incl. MemUse (bytes):	9,506,953 bytes
Number of Function Calls:	35,361


Overall Summary
Total Incl. Wall Time (microsec):	327,855 microsecs
Total Incl. CPU (microsecs):	317,780 microsecs
Total Incl. MemUse (bytes):	12,095,800 bytes
Number of Function Calls:	43,417

dawehner’s picture

FileSize
59.14 KB
59.73 KB
  • It's sad to see that we lose 5 ms just by finding more classes; I guess #1658720: Use ClassLoader instead of UniversalClassLoader could help here.
  • If you look deeper into the function hierarchy you see the main difference (more cache_get calls; I will try to optimize them).
    One small thing you can see a bit deeper down: file_load_multiple is executed per node (because of #1811878: Use node_view_multiple (and possible other view_multiple in comment/taxonomy/user ...) in row plugin, yeah!)
  • We get 10 cache()->get calls in total because we need the plugin information for all of the Views plugins.
  • 8 more cache gets come from config, due to the processing of a view; see the follow-ups.
  • 2 of these come from DisplayPluginBase::init to cache the unpackOptions() result. Even in D8 this might not be required anymore:
    if the config/export contains all the values (not only the ones which differ from the defaults), unpackOptions() should not be needed anymore.

Some follow-ups: #1811982: Remove ui preview logic viewExecuteable::render() and move it to viewUI, #1811986: Try to minimize the "overhead" of display extenders

xjm’s picture

Project: VDC » Drupal core
Version: » 8.x-dev
Component: Code » views.module

xjm’s picture

@catch suggested disabling contextual.module on both sites and benchmarking them that way as well, since that module apparently adds a lot of overhead.

catch’s picture

So contextual.module will check the access of each contextual link. If there's a contextual link to a view operation, then it does menu_get_item() on that path, which I'd assume will load the view or similar. If you profile with xhprof before/after, checking the number of menu_get_item() calls and their relative expense might help to track it down. Also look for any duplicate view processing (i.e. whether views_arg_load() gets called more than once). How bad it is mainly depends on the views_ui menu access loader function / access callback.

Kars-T’s picture

I wanted to benchmark an isolated view with XHProf, so I wrote a mini module that uses the frontpage view.

The code for the module is attached as a .zip to the next comment. Sorry.

This showed that the biggest problem was the loading of class files. As soon as I switched the class loader to
$conf['autoloader_mode'] = 'apc';

there were a lot fewer strpos() calls in the class loader and performance went up by at least 4%. Otherwise the PDO calls are always the biggest performance hit and everything else is relatively close together, so there is not much room for direct optimization.

Kars-T’s picture

FileSize
1.2 KB

Attached is the code mentioned in the comment above.

chx’s picture

We could use #1658720: Use ClassLoader instead of UniversalClassLoader. Observe $loader = new ApcClassLoader('drupal.' . $GLOBALS['drupal_hash_salt'], $loader); -- we could write a $loader = new StaticClassLoader($loader); thing that just apes ApcClassLoader, and that would be something of a speedup without a lot of drawbacks. Adding a root would need to clear the static, but that's it.
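
For illustration, a sketch of one way such a StaticClassLoader could look (the class, the map file and its location are made up here, nothing like this exists yet): a decorator in the spirit of ApcClassLoader, but backed by a pregenerated class => file map stored in a plain PHP file so it lives in the opcode cache, falling back to the decorated loader on a miss.

  class StaticClassLoader {
    protected $map = array();
    protected $decorated;

    public function __construct($decorated, $map_file) {
      $this->decorated = $decorated;
      if (is_file($map_file)) {
        // The map file simply returns array('Some\Class' => '/path/to/SomeClass.php', ...).
        $this->map = (array) include $map_file;
      }
    }

    public function register($prepend = FALSE) {
      spl_autoload_register(array($this, 'loadClass'), TRUE, $prepend);
    }

    public function loadClass($class) {
      if ($file = $this->findFile($class)) {
        require $file;
        return TRUE;
      }
    }

    public function findFile($class) {
      if (isset($this->map[$class])) {
        return $this->map[$class];
      }
      // Miss: fall back to the slower decorated loader. Adding a new root would
      // mean regenerating the map file ("clearing the static"), but that's it.
      return $this->decorated->findFile($class);
    }
  }

  // Wired up the same way as ApcClassLoader in settings.php (path is illustrative):
  // $loader = new StaticClassLoader($loader, DRUPAL_ROOT . '/sites/default/files/classmap.php');
  // $loader->register();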

Kars-T’s picture

FileSize
48.26 KB
39.87 KB
1.35 KB

I extended my benchmark script so it fits the topic. Please see the attachment. It is not perfect, but it does give us some numbers. If you want to use it, please make sure XHProf is running and that you are using the correct paths.

I let the benchmark run with and without the contextual links module. There was no real difference.

Here are some screenshots of the XHProf results:
full_run.png

Views is on the right, and we can see that uncached it is twice as slow.

With the cache active we get the complete opposite result:
full_load_cache.png

With a 5-minute cache of both the result and the HTML, Views is now twice as fast.

Please keep in mind that the comparisons are totally unfair.

The script that gives us "node" is programmed to do exactly this one thing, while Views is a full subsystem, so it is obvious that the mini script is faster. But it has no configuration. With Views we get a billion features out of the box, for example the cache. It is unfair to compare cached HTML with a script that runs every time, but the script doesn't have any option to provide a cache at all.

Please don't get shocked or frustrated by the results. IMHO Views is the best option and we should use it.

Fabianx’s picture

I can't reproduce the numbers from #16 at all.

Using my find-min-web.sh from https://github.com/LionsAd/xhprof-kit and taking the minimum over 1000 runs, I get the following (a re-run with 1000 runs was not different or better; the difference was in the low microseconds):

=== core-node..core-views-frontpage compared (5090ade75e72a..5090ad974b9b4):

ct  : 211,940|219,504|7,564|3.6%
wt  : 750,068|786,284|36,216|4.8%
cpu : 748,047|784,050|36,003|4.8%
mu  : 15,192,240|18,831,040|3,638,800|24.0%
pmu : 16,991,680|19,772,232|2,780,552|16.4%

---
ct = function calls, wt = wall time, cpu = cpu time used, mu = memory usage, pmu = peak memory usage

This is for 100 random nodes with the APC autoloader off. A 5% regression does not seem too bad to me.

The only explanation I can think of for #16 is that 'ignore_functions' is skewing the results. I used that myself once and it completely screwed up the results, as it made the XHProf version much, much slower (especially if a function is called more from one path than from another)...

XHProf runs can be provided to interested performance experts on demand.

===

Some more information:

Just comparing node_page_default() and views_page() is not fair, because Views already renders via theme(), while node_page_default() prepares a render array that is only rendered in drupal_render_page() via the system default block.

The actual rendering overhead is 1% when comparing Drupal\system\Plugin\views\row\EntityRow::render vs. drupal_render_page, and 2.4% when comparing theme() in both variants.

Drupal\views\Plugin\views\style\StylePluginBase::pre_render has roughly the same overhead as node_view_multiple (0.3%, 0.3ms).

The remaining rendering is 2.4% as already outlined.

The loading overhead is:

entity_load_multiple is called _one time_ more (uh, what?), which is 1.9% overhead or 1.4ms.

To do a little arithmetic:

views_page: 732 ms

node_page_default: - 160 ms
node.tpl.php: - 336ms
template_preprocess_node: - 112ms
some_other_rendering functions implicitly called as part of views_page: -88 ms (around, not exact)

=> 36 ms difference, yup, around the same as seen above

So that is roughly where the overhead is.

===

Why use 100 nodes?

Well, the O(1) parts and the one-time overhead should not be a priority for us; that is micro-optimizing. The O(n) parts (the repeated parts that scale with the number of nodes) are what matter, and here the overhead is around 5%, which is IMHO acceptable.

Note that the above is with caching off and just the raw view as it currently ships in core.

Fabianx’s picture

Issue tags: +Performance

Adding tags, so others can find it.

dawehner’s picture

Great response, and it's good to know that it's not such a big problem.

One small thing I'm wondering about: does it really make sense to use the minimum time and not the average,
given that you want to know what happens most of the time, not just once?

Fabianx’s picture

#1811816-19: Benchmark node_page_default vs. node view: The reason why taking the minimum time works is two-fold:

The reason why PHP is fluctuating in time is related to the operating system: I/O, context switch overhead, etc., but not to the PHP application itself. PHP in itself is pretty deterministic and I guess it could be possible to write a PHP simulator / VM that is profiling PHP CPU time and not real time.

a) The corresponding XHProf results are much nearer to the "truth" than the fluctuating runs. Yes, over time the aggregation is also good, but the numbers are not exact, so if I am trying to micro-optimize something I don't know whether a change is within the standard deviation or a real performance gain, while with the minimum I can be quite sure that it is.

b) The minimum is not exact either, but how far it can deviate from the "real" possible minimum is bounded and shrinks with the number of runs, and the remaining difference is then on the order of 0.1 ms, which even I don't optimize. This is similar to a "Monte Carlo" algorithm.

Of course disturbing factors need to be removed, but I rented a dedicated server just for the purpose of doing precise performance testing. So the server is having no load at all during the benchmarks.

Overall I also benchmarked the same with ab on many scenarios and got similar results in terms of percentage of overhead. Here is ab again for the same benchmark as above (XHProf turned off):

$ ab -n 100 http://127.0.0.1/index-perf.php # index-perf.php just saves the timings internally

Core
----

              min  mean[+/-sd] median   max
Total:        373  378   3.3    378     391

Views
-----

              min  mean[+/-sd] median   max
Total:        390  396   3.8    397     409

=> Overhead: 4.76%

So it's the same; this is also around 4.5-5% overhead.

I tested this again using just "ab -n 1000 http://127.0.0.1/" and got even better results (the index-perf.php above was throwing a notice about XHProf not being available).

Core
----

              min  mean[+/-sd] median   max
Total:        382  387   2.6    388     399

Views
-----

              min  mean[+/-sd] median   max
Total:        390  396   2.7    396     408

This would then be only 2.33% overhead, which still makes sense: XHProf, like Xdebug profiling, is good at setting an upper bound, but some things that are slow while being profiled are not that bad when run live.

TL;DR: Views vs. core comes out at 2.33% overhead for 100 nodes (averaged over 1000 requests on a machine with no other load).

catch’s picture

Why use 100 nodes?

Well, the O(1) parts and the one-time overhead should not be a priority for us; that is micro-optimizing. The O(n) parts (the repeated parts that scale with the number of nodes) are what matter, and here the overhead is around 5%, which is IMHO acceptable.

I'm not sure this is a safe assumption.

A lot of views overhead is when you have a block on every page showing a list of 5 nodes, but the view has 15 displays. Also issues like #1000898: Don't load default view by default in views_get_view() and #1421844: views_fetch_data() cache item can reach over 10mb in size although some of those have been fixed.

Berdir’s picture

Agreed, the O(1)/one-time overhead does matter.

What exactly should be compared here is a tricky question, as others have said before. The default node page is obviously going to be considerably faster than Views; we can't change that, we can only try to make Views as fast as we can.

This is an irrelevant comparison anyway, because how many Drupal sites that have to worry about performance at this level are using the default node frontpage? :) Almost all non-trivial sites replace that page, most of them with Views and others with custom queries. And the sites that do custom queries can still disable Views (or use the minimal profile and never install it in the first place) and do their thing.

What IMHO would be more useful is to simulate a typical real site, by having a bunch of views as blocks that display different things in different ways and then do some profiling to see what's slowing us down. Maybe even create some scripts to generate a few content types with 20-30 fields on each and similar things to try and find issues like #1421844: views_fetch_data() cache item can reach over 10mb in size before releasing 8.0 :)

Kars-T’s picture

FileSize
53.81 KB

Just comparing node_page_default() and views_page() is not fair, because Views already renders via theme(), while node_page_default() prepares a render array that is only rendered in drupal_render_page() via the system default block.

That is true and I tried to keep this in mind. Please take a look at my code.

  $page_array = node_page_default();
  drupal_render($page_array);

vs.

  views_embed_view('frontpage');

I compared the full rendering of each to be as fair as possible. This was done in an isolated manner, uncached, so we only see those code paths. Here is another result without any excluded functions; I didn't know that excluding functions could mess things up.
full_load_comparsion.png
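
For reference, a minimal sketch of how such an isolated comparison can be wired up with the XHProf extension (not the attached script; only xhprof_enable()/xhprof_disable() and the flag constants are the real extension API, the rest is illustrative glue that assumes a fully bootstrapped Drupal):

  function benchmark_node_page_default() {
    $page_array = node_page_default();
    return drupal_render($page_array);
  }

  function benchmark_frontpage_view() {
    return views_embed_view('frontpage');
  }

  function benchmark_profile($callback) {
    // Collect per-function wall time, CPU time and memory for this code path only.
    xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);
    $callback();
    // Returns the raw XHProf data; save it or diff two runs in the XHProf UI.
    return xhprof_disable();
  }

  $node_run  = benchmark_profile('benchmark_node_page_default');
  $views_run = benchmark_profile('benchmark_frontpage_view');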
And if we hammer Drupal with "ab" or "find-min-web.sh", we do this as an anonymous user and all caches are active. Because of that I wouldn't expect the load to be much different whether we use Views or not. I mean, it is pretty clear even without benchmarks that Views, if uncached, will be slower.

So what do we want to achieve in this issue now? The topic is still "Benchmark node_page_default vs. node view" but it now seems to be moving towards "Build a frontpage with so many views, displays and nodes that we get into performance problems" ;)

I mean, we can probably optimize Views a lot later. For now this shouldn't be a showstopper for getting a frontpage that is configurable in nearly every way you could imagine. And we could activate the cache by default.

dawehner’s picture

So one thing we should achieve with this issue is to understand how much slower the actual default frontpage will get,
but at the same time we should work on understanding where exactly it is worth optimizing things so it will be faster.

Just looking at a single view is probably not the best thing to do; we should test a real site environment, as Berdir said before.

Fabianx’s picture

And if we hammer Drupal with "ab" or "find-min-web.sh", we do this as an anonymous user and all caches are active. Because of that I wouldn't expect the load to be much different whether we use Views or not. I mean, it is pretty clear even without benchmarks that Views, if uncached, will be slower.

This is unfortunately not true: all caches are off by default for anonymous users as well. So the overhead of 2.33% is realistic, but only for 100 nodes. If you just have one node or so, sure, Views has more relative overhead. (I will provide benchmarks for this later and try some extrapolation.)

Fabianx’s picture

I'm not sure this is a safe assumption.

A lot of views overhead is when you have a block on every page showing a list of 5 nodes, but the view has 15 displays. Also issues like #1000898: Don't load default view by default in views_get_view() and #1421844: views_fetch_data() cache item can reach over 10mb in size although some of those have been fixed.

Okay, I agree. I'll re-benchmark with different node numbers (1, 10, 50, 100) so that we can hopefully deduce a pattern like:

static_overhead + n * dynamic_overhead

1 Node

=== core-node-1..core-views-frontpage-1 compared (50925fa961936..5092601544a3c):

ct  : 18,806|24,423|5,617|29.9%
wt  : 78,450|101,672|23,222|29.6%
cpu : 76,004|100,007|24,003|31.6%
mu  : 8,969,832|11,569,144|2,599,312|29.0%
pmu : 9,020,088|11,659,864|2,639,776|29.3%

10 Nodes

=== core-node-10..core-views-frontpage-10 compared (509261979fd04..509262a780119):

ct  : 36,820|42,584|5,764|15.7%
wt  : 141,294|165,266|23,972|17.0%
cpu : 140,009|168,011|28,002|20.0%
mu  : 9,840,600|12,453,336|2,612,736|26.6%
pmu : 9,942,464|12,606,624|2,664,160|26.8%

50 Nodes

=== core-node-50..core-views-frontpage-50 compared (50926406b125b..50926462e4176):

ct  : 114,431|120,995|6,564|5.7%
wt  : 409,321|438,938|29,617|7.2%
cpu : 412,026|436,027|24,001|5.8%
mu  : 12,231,584|15,297,112|3,065,528|25.1%
pmu : 12,811,648|15,795,816|2,984,168|23.3%

100 Nodes (from above)

=== core-node..core-views-frontpage compared (5090ade75e72a..5090ad974b9b4):

ct  : 211,940|219,504|7,564|3.6%
wt  : 750,068|786,284|36,216|4.8%
cpu : 748,047|784,050|36,003|4.8%
mu  : 15,192,240|18,831,040|3,638,800|24.0%
pmu : 16,991,680|19,772,232|2,780,552|16.4%

200 Nodes (just for the heck of it)

=== core-node-200..core-views-frontpage-200 compared (5092665232d80..509266f93a356):

ct  : 409,217|418,781|9,564|2.3%
wt  : 1,529,672|1,595,637|65,965|4.3%
cpu : 1,524,095|1,592,100|68,005|4.5%
mu  : 21,213,960|26,035,832|4,821,872|22.7%
pmu : 25,504,744|28,461,520|2,956,776|11.6%

I'll try to get them into Gnuplot sometime later, but can't promise for today ...

dawehner’s picture

FileSize
12.27 KB

Thanks for even more benchmarks!

Here is a short fit and a plot of the data.

fit.png

Node: 1.96 * Nodes + 16.7
Frontpage view: 1.98 * Nodes + 22.28

So, as you can see, the overhead per node is in the range of one percent (the fit is of the function-call counts above, in thousands), but the constant initialization overhead remains.

dawehner’s picture

FileSize
888 bytes

Here is the small Python script that generates the different plots/fits etc.

xjm’s picture

Berdir’s picture

Looking a bit at the differences in XHProf with views_ui and contextual links disabled. The biggest ones are:

- 29 additional calls to ClassLoader::findFile(). This is obviously much better with APC, and given how slow the class loader is, that's going to be almost a must-have anyway.
- 32 additional queries, most of them because of additional cache()->get() calls (some of them because of the additional config() calls). views_cache_get(), for example, currently calls config() on every call, so 60%+ of the time of that function is actually spent in config(). But config needs to be fixed anyway, so that's not Views' problem. There are also a ton of other functions/methods related to this that are called more often, from unserialize() to all the query stuff.
- 36 calls to PluginManager::processDefinition(); there's an issue to move that to a decorator so it can be cached.

The closest comparison right now, I think, is the difference between node_page_default() (15 ms) and views_page() (57 ms).
That might look bad, but I think it's kind of OK, considering the following (I already made some of these arguments before):

- As the plot in #27 shows, most of this is a one-time initialization overhead. See also below for that.
- Most sites out there would install and use views anyway, because they need the configurability it provides. So, as I said before, the goal is not to make them equally fast, it's to know where the differences are so that we can do our best to counter them.
- One way to counter them is that Views gives you the option to enable caching, making your site suddenly 25% faster than with node_page_default().

I also had some fun with the recent comments view: I cloned the block display in the provided view a few times, then cloned the view too, and placed those blocks ~10 times on the page. The page request time is now at 340 ms, compared to ~100 ms without those blocks. I think that's the real-world use case we need to profile, as suggested before.

- We're now executing 650 queries: 320 cache-get-multiple and 270 cache checksum tag queries; 260 of the cache-get queries are caused by 260 config() calls, of which 180 happen in views_get_enabled_display_extenders(), so that's something we might want to optimize even if we improve config too; not sure.
- 650 calls to Drupal\views\Plugin\Type\PluginManager::processDefinition() now, 5% of the page request. Again, that will move into the cache.
- 300 calls to Drupal\views\Plugin\views\PluginBase::setOptionDefaults() (4%), which calls t() 270x
- 180 calls to Drupal\views\Plugin\views\display\DisplayPluginBase::init make up 26% of the page request, mostly spent in config() but still something to check later on. I know from 7.x that the whole init/unpack/options handling of views is very slow.
- With caches enabled on those views, we get down to 250 ms, so not that much better, because we still have to check access, which initializes all those plugins and leads to the same config() calls.

I want to check how this evolves when improving config()/plugin definition processing, but that can happen later.

catch’s picture

I'm surprised there are so many checksum queries. Are we issuing duplicate queries? Or did a bad tag get added somewhere?

Berdir’s picture

dawehner’s picture

Thanks for benchmarking!
In general I think we should attach the raw data, so other people can analyze it as well.

So let's try to create issues for all of the problems:

  • Default options shouldn't call t(): #1849280: Optimize the performance of PluginBase::setOptionDefaults.
  • Too many calls to views_get_enabled_display_extenders():
    There is one issue that tries to reduce the number of actual calls to views_get_enabled_display_extenders():
    #1811986: Try to minimize the "overhead" of display extenders (for every executed view it runs that function for each display, so it's basically N^2).
    There are some more ideas on that issue.

    In general #1817582: Lazy load display plugins would be a huge win in that area!

  • I know from 7.x that the whole init/unpack/options handling of views is very slow.

    We have already reduced that, because we can now be sure that something else handles the translations for us, so the method is already simplified. In the long run
    we could try to convert the defineOptions array to just something like array('key' => 'value', 'subkey' => array('key2' => 'value2', 'key3' => 'value3')) so that it's really easy to apply these values (see the sketch below).
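
A rough illustration of that idea (the "current" shape below is from memory of how Views option definitions look, so treat it as approximate):

  // Today each option carries metadata ('default', 'contains', ...) that
  // unpackOptions() has to walk recursively:
  $definition = array(
    'style' => array(
      'contains' => array(
        'type'    => array('default' => 'default'),
        'options' => array('default' => array()),
      ),
    ),
  );

  // The proposal: keep plain nested key/value defaults instead, so applying them
  // is just a recursive merge with whatever the stored view configuration has.
  $defaults = array(
    'style' => array(
      'type'    => 'default',
      'options' => array(),
    ),
  );
  $from_config = array(
    'style' => array('type' => 'grid'),
  );
  $options = array_replace_recursive($defaults, $from_config);
  // $options['style'] is now array('type' => 'grid', 'options' => array()).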

Berdir’s picture

FileSize
643.59 KB

There were a number of improvements/bugfixes recently, so I thought I'd do another round of testing. I cloned the recent comments block a few times, then the whole view twice, and placed a lot of those blocks on the page, I think 21 blocks.

Notes:
- Views caching is enabled; I specifically wanted to check the overhead of Views and not the actual rendering/querying.
- Using APC for the class loader.

Xhprof file is attached.

- Cache queries went down *a lot* :)
- Almost as many state queries as cache queries now, mostly from menu_get_item() and related functions. Issue for that is #1786490: Add caching to the state system.
- Something new that I didn't notice before is a lot of duplicated calls to module_load_include(), mostly from views_include() (just for 'cache') and views_theme_functions(). Not sure about the theme part, but that cache.inc is just 160 lines of code; it's probably way faster to just move that stuff into the module file instead of making sure it's loaded ~400 times.
- Drupal\views\Plugin\views\PluginBase::setOptionDefaults() is now responsible for ~4% of the wall time; 30% of that is t(). With the lazy-load patch and the issue mentioned in #33 applied, this goes down to 0.9%. Nice.
- _views_fetch_data() is responsible for 65% of the time spent in drupal_static() with just 30% of the calls. Might make sense to use the fast static cache pattern here? (See the sketch after this list.)
- Almost 4k calls to Drupal\Core\Utility\ThemeRegistry::offsetExists, 1% of the time. Not huge, but we should still replace it with a real class and do direct method calls instead of offset*. Not sure why offsetExists() is so much slower than offsetGet(), which had 1k calls too, but those took only a handful of milliseconds... maybe XHProf is not reliable at that level? Hint: I could use a review of #1858616: Extract generic CacheCollector implementation and interface from CacheArray! :)
- It's good that the config cache patch went in. I'm seeing 600 calls to config(); language_url_rewrite_url() and language_negotiation_url_prefixes() each do 250 calls...
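
The "fast static" pattern mentioned above is the advanced drupal_static() pattern from the drupal_static() documentation; applied to _views_fetch_data() it would look roughly like this (the signature is abbreviated, the existing body stays the same):

  function _views_fetch_data($table = NULL) {
    // Resolve the drupal_static() lookup only once per request and keep a
    // reference to it in a plain PHP static, so the hundreds of later calls
    // skip the drupal_static() overhead entirely.
    static $drupal_static_fast;
    if (!isset($drupal_static_fast)) {
      $drupal_static_fast['cache'] = &drupal_static(__FUNCTION__);
    }
    $cache = &$drupal_static_fast['cache'];

    // ... the existing body keeps working on $cache exactly as before ...
  }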

dawehner’s picture

Views caching is enabled; I specifically wanted to check the overhead of Views and not the actual rendering/querying.

Which Views caching did you enable, the render cache or just the query cache? I think we should also take care of the overhead at the actual render level. We already use entity_view_multiple(), though we could improve a lot on field-based views by using the entityDisplay; see the postponed issue #1867518: Leverage entityDisplay to provide fast rendering for fields.

- _views_fetch_data() is responsible for 65% of the time spent in drupal_static() with just 30% of the calls. Might make sense to use the fast static cache pattern here?

I'm wondering whether it makes sense to have a service which provides that information, so that there is no need for a procedural function anymore, and hence no drupal_static().

damiankloip’s picture

I'll work on something for views_fetch_data/_views_fetch_data. We could have a service, but we could also just create a CacheArray-like object that we store in / inject into ViewExecutable?
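
As a purely illustrative sketch of what such an object could look like (the class name, cache bin and cache() wiring are assumptions on my part; the real work is happening in #1867782): table definitions get resolved lazily on first access and are backed by the cache system, and ViewExecutable would just get handed one of these.

  class ViewsDataCache implements \ArrayAccess {
    protected $storage = array();
    protected $bin;

    public function __construct($bin = 'views_info') {
      $this->bin = $bin;
    }

    public function offsetExists($table) {
      return $this->offsetGet($table) !== NULL;
    }

    public function offsetGet($table) {
      if (!isset($this->storage[$table])) {
        if ($cached = cache($this->bin)->get("views_data:$table")) {
          $this->storage[$table] = $cached->data;
        }
        else {
          // Fall back to the expensive hook_views_data() build for this table.
          $data = _views_fetch_data($table);
          cache($this->bin)->set("views_data:$table", $data);
          $this->storage[$table] = $data;
        }
      }
      return $this->storage[$table];
    }

    public function offsetSet($table, $value) {
      $this->storage[$table] = $value;
    }

    public function offsetUnset($table) {
      unset($this->storage[$table]);
    }
  }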

dawehner’s picture

damiankloip’s picture

Yeah, good point. I might work on this following progress in that issue.

damiankloip’s picture

I started #1867782: Replace _views_fetch_data with cache class, which I am using to start work on this; it's all explained over there :)

dawehner’s picture

Status: Active » Fixed

I guess now that /node is in, we don't need this anymore.

Thanks for all the testing

Automatically closed -- issue fixed for 2 weeks with no activity.