Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Cache node_load, clear on node_save. I am not too interested in this patch, I just wanted to get the ball rolling.
The status would be adequate as code needs work but then people would assume I will work on it.
Comment | File | Size | Author |
---|---|---|---|
#196 | cache_node_load.patch | 28.67 KB | catch |
#195 | benchmarks.txt | 6.23 KB | catch |
#191 | cache_node_load.patch | 28.96 KB | catch |
#187 | cache_node_load.patch | 29.48 KB | catch |
#183 | cache_node_load_28.patch | 29.48 KB | Berdir |
Comments
Comment #1
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedsome discussion revealed that:
poll.module needs to be patched to do user specific stuff in hook_view not hook_load. I already discussed this with Steven.
comment.module will need to invalidate the node cache if a comment is added.
This patch is likely to give us gains for complicated node types a la cck. I am not sure how much it duplicates cck's inbuilt cache.
The update path is missing too.
Comment #2
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedFor an earlier attempt to get this into core and related discussion see here:
http://drupal.org/node/8506
Comment #3
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedpatch needed to be fixed wrt cache usage.
Comment #4
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedI have done some testing with simple node types (story, forum, page) and there was no performance improvement. This was expected since there are hardly any extra queries performed when loading these node types.
Comment #5
moshe weitzman CreditAttribution: moshe weitzman commentedi would recommend looking into this in the context of moving field modules into core (i.e. CCK). there isn't much point without those, and CCK has thought about this a lot already.
Comment #6
m3avrck CreditAttribution: m3avrck commentedWould it make sense to move this caching into a *node specific* cache table?
It seems to be a site like d.o that has 100,000+ nodes could fill up this cache really fast. Having it in it's own table might be more efficient.
Comment #7
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedGood morning, Ted:
+ if (!isset($nodes[$param]) && ($data = cache_get($param, 'cache_node'))) {
:)
Comment #8
m3avrck CreditAttribution: m3avrck commentedErr, yes yes! I was looking in the patch for the install file patch to create that table doh :-)
Splendid patch though.
Comment #9
chx CreditAttribution: chx commentedHmm, this is a bigger boost than I imagined. I benchmarked this on NowPublic where we saw 50% even 75% less page load times but note that NP uses memcached.
Comment #10
chx CreditAttribution: chx commentedI upped the wrong file --again!-- and also note that as NP is a D5 , of course not this patch is tested but our variant of it. But the concept is the same.
Comment #11
chx CreditAttribution: chx commentedThe user module changes are feature creep. First, most of them are whitespace trims and a bugfix actually, user_edit cleans the cache but user_delete does not, that's fixed now.
Comment #12
Gábor Hojtsy1. The node schema has one too many empty lines now with this patch.
2. You repeat the is_object()... line twice. I would do:
3. The system update does not look like being up to schema API standards to say the least.
That's it. Otherwise it would be nice to see a simple benchmark, after which I could say: well done :)
Comment #13
moshe weitzman CreditAttribution: moshe weitzman commentedsounds reasonable to me. we need to add code that flushes the cache when a module is enabled/disabled, or does the install system do that already?
Comment #14
chx CreditAttribution: chx commentedReroll.
Comment #15
yched CreditAttribution: yched commentedRerolled with Gabor's comments in #12 fixed.
So all this needs now is a benchmark ?
Comment #16
chx CreditAttribution: chx commentedyes please!
Comment #17
yched CreditAttribution: yched commentedPS : and moshe's comment #13 was adressed by latest chx's patch
I'm not really equipped to benchmark this myself. Anyone ?
Comment #18
chx CreditAttribution: chx commentedSame as yched's patch, a typo in a comment is fixed (thanks pwolanin) and -p added.
Comment #19
chx CreditAttribution: chx commentedA little test module for benching.
Comment #20
pwolanin CreditAttribution: pwolanin commentedquick bench using ab -n100 -c5 http://127.0.0.1/drupal6/
Server Software: Apache/1.3.33, PHP 4.4.7, MySQL 5.0.45 , Mac 10.4.9
hitting the the front page (/node) with 30 nodes of a mix of types (book, forum, page, etc).
without patch:
Requests per second: 5.66 [#/sec] (mean)
Requests per second: 5.62 [#/sec] (mean)
Requests per second: 5.62 [#/sec] (mean)
with patch:
Requests per second: 6.44 [#/sec] (mean)
Requests per second: 6.40 [#/sec] (mean)
Requests per second: 6.25 [#/sec] (mean)
node load/edit/save works fine.
Comment #21
pwolanin CreditAttribution: pwolanin commentednote - these benchmarks are for the 2nd and later times running ab. There is, of course, a pump-priming effect so the first time running it with the patch is about the same or even slower than without the patch.
Comment #22
pwolanin CreditAttribution: pwolanin commentedenabled modules:
core-req + actions, book, color, comment, dblog, forum, help, menu, taxonomy, update, devel, devel_generate, simpletest
Note, with the patch simpletest find a warning when running either the "Page node creation" or "Unauthorized page view" tests:
Comment #23
pwolanin CreditAttribution: pwolanin commentedthis error seems to be generated by calling node load like:
Comment #24
chx CreditAttribution: chx commentedFixed. I used $cachable to decide on caching but that's not enough. I now cache only node_load($nid) which is 99% of the cases anyways. It's not really feasible to database cache node_load(array()) because it can be dependent on user, time, whatever. Thanks for the report and the benchmark.
Comment #25
pwolanin CreditAttribution: pwolanin commentedone more fix: the relevant node cache needs to be cleared in this function:
http://api.drupal.org/api/function/book_outline_form_submit/6
in fact, it should probably call cache_clear_all() as well.
Comment #26
yched CreditAttribution: yched commentedThere is still a TODO in translation.module ?
Comment #27
pwolanin CreditAttribution: pwolanin commentedIn addition to the book module fix-ups, I think some of the function calls are missing a parameter:
cache_clear_all('*', 'cache_node');
should probably be:
cache_clear_all('*', 'cache_node', TRUE);
Comment #28
pwolanin CreditAttribution: pwolanin commentedrevised patch - adds last param to cache_clear_all() and more cache clearning in book.module and menu.inc. Removes spurious change to index.php.
Please review.
Comment #29
Gábor Hojtsy- how are the menu module changes related?
- how are the poll module changes related?
- as yched said, there is still a TODO in translation module
Comment #30
chx CreditAttribution: chx commentedComment #31
Gábor HojtsyGreat, committed! (I think this will be loved by people with lots of "xmas tree hangings" on their nodeapi load hooks).
Comment #32
eaton CreditAttribution: eaton commentedI'm very concerned that this change will have a HEAVY impact on large, high-traffic sites. If I understand it correctly, it will be saving to the cache table *every time a node is loaded*, and clearing the node cache *every time a node is saved*, correct?
Unless I'm missing something, this will have a detrimental effect on high-traffic sites unless the 'only clear the cache of nodes that actually changed' code goes in.
Comment #33
moshe weitzman CreditAttribution: moshe weitzman commented"every time a node is loaded" will be infrequent because of this very patch. we are replacing the node load queries with a cache_get().
we already wipe most caches when any node is saved. not sure why that would be worrisome. please elaborate.
Comment #34
eaton CreditAttribution: eaton commentedI think this was a bit of a false alarm; I just noticed that the coe in node_save() only clears the individual node. This needs to be documented VERY clearly, though, or we'll experience a world of pain as modules convert over.
Repeat: this is HUGE in terms of unpredictable, unexpected behavior. node_load() was previously fresh per page load, and now it's not. Red flashing lights!
Comment #35
Gábor HojtsyYes, that definitely need to be documented in the update docs at least. Excuse me for not setting this to "needs work" (as usual after a core commit when the update docs need updating).
Comment #36
eaton CreditAttribution: eaton commentedMy concern was more serious before I realized that only multi-node operations cleared the ENTIRE node cache. Basically, node_load() has always been a 'read-only' operation. While there may be lots of queries, at least we knew that the flow of data was only going in one direction.
This patch ensures that any fresh node_load() will go in TWO directions. It will pull in data, and write to the cache. That makes node loads even MORE expensive, but the payoff is that the next load for that node will be faster -- unless it gets cleared. This means that the frequency with which the node cache is cleared becomes CRITICAL. If it's cleared too frequently, we never get the benefits of the cached loads, and everything just gets more expensive. In addition, if I understand properly, anyone who's trying to to a master/slace setup has to deal with *loads* writing to the master DB.
None of these are obvious showstoppers, but I worry that we don't understand the impact of the change on large sites with high traffic. One contrib module that flushes this cache relatively frequently could bring a site to a crawl.
Comment #37
pwolanin CreditAttribution: pwolanin commentedI agree with eaton about the frequency of clearing the node cache. So, a related question - is it worth running extra queries to determine exactly which nodes need to be cleared (for example in book module or the translation module) rather than do a blanket clear of the node cache? It seems running an extra query (for example) and then potentially deleting a bunch of single items from the node cache it going to be pretty expensive too.
Also, issue for the menu.inc part of the patch I rolled (which chx sut out as unrelated)- indeed it relates to clearing the block and page cache, not the node cache: http://drupal.org/node/170514
Comment #38
chx CreditAttribution: chx commentedFollowup patch, reported by Gabor. Translations fixing.
Comment #39
Gábor HojtsyGreat. Modified the patch to get it documented. Now we have the translation set caches flushed only if we update a node in the set, not all cached node_loads. (I added some documentation and fixed one indentation problem I found on the way).
Setting it to needs work, so it gets documented.
Comment #40
Gábor HojtsyWe discussed with chx and come up with some documentation for the node_load cache. I also ended up documenting the internal cache, so people know what the effect of both are, how are those invalidated and what to put into the load operations. Committed the attached patch.
It still needs to be documented in the update docs, but I leave this to others :) Also, anyone is welcome, if having more correct documentation.
Comment #41
pwolanin CreditAttribution: pwolanin commentedI'd still like some feedback in terms of book module and whether it makes sense to make the cache clearing there more targeted too.
Comment #42
chx CreditAttribution: chx commentedOK this needs more consideration. Hope next time folks will review the patch BEFORE it's in.
Comment #43
chx CreditAttribution: chx commentedWell, let's set this to review and see what people have on it.
Comment #44
chx CreditAttribution: chx commentedMotivation for rollback: obviously the huge performance benefit is not enough to get people move their dynamic code from load to view.
Comment #45
seanrI'm the type of person who really needs the performance benefit. Shouldn't we just put a big bold item in the update docs saying "Put dynamic content in hook_view"? Don't at all agree this should be rolled back. Which modules are currently suffering as a result of this patch? I'd suggest we help roll patches for them instead of rolling this back.
Comment #46
chx CreditAttribution: chx commentedBack to original -- ie. needs documentation. I was never serious about rolling back.
Comment #47
Morbus IffHrm. Say I had a poll.module of my own (not core, but this example exists merely because everyone knows what a poll is). Said poll has 45 votes on it, and is growing rapidly. Are those 45 (46, 47, 48, etc.) votes "data" about the node that should be loaded in accurately, or merely visual garbage to be added at view time? I argue that they're data that should always be accurate at node_load() time - I would hate to have to sit on a 'view' to find out how many votes have been cast when I may not actually care little for displaying the node. How does this cache handle this consideration?
Comment #48
Gábor HojtsyMorbus, first sentence in the issue: Cache node_load, clear on node_save.
Comment #49
Morbus IffI've also some general concerns about parent/child relationships between different node types. I've built various internal node types that have specific relations to each other - a Card is a child (subclass, etc.) of an Expansion. As such, any time a Card is loaded, I also add to the $card_node the results of an $expansion_node node_load() - in almost every use case, a Card required the Expansion data as well, so I wanted to save duplicate code by loading the Expansion in always. With the above cache, Expansion data could get stale within the cached Cards - I'd have to include a cache clear on Cards for every Expansion save (or else, a node_load then node_save). Not as worrisome as "I consider vote counts to be node_load data, not node_view" (above) certainly, but another scenario that may be valid for the forthcoming documentation.
Comment #50
Morbus IffGábor: irrelevant. A poll vote is user-contributed supplementary data to the poll - it has a similar data "feeling" as a user comment. It doesn't change the bulk of the node itself - it doesn't change the body, or the question, or the options, or the title of the node. It is merely a statistical increment about the node itself. I would hate to spend a whole node_save to increment a poll vote by one (any more than we should see a full node_save to increment, say, the node "Views" statistic by one, or to issue a node_save when one comments).
Comment #51
Morbus IffOur comments crossed. Resetting title back.
Comment #52
Morbus IffGábor: incidentally, voting in core's own poll.module doesn't use node_save either (see poll_vote for the direct updates) - it instead issues a clear against the entire cache. Is that the suggestion for designs like the above (both the poll incrementing and the parent/child relationship)? Fine, if it is, but it should be part of the forthcoming documentation.
Comment #53
Morbus IffWe probably want to change the cache_clear_all() inside poll_vote() to something like cache_clear_all($node->nid, 'cache_node'), maybe? I'm not set up in any way for Drupal 6 development, so can't test or create a patch. It would /seem/ to make sense, however, based on the existing changes this patch added.
Comment #54
drewish CreditAttribution: drewish commentedi think this patch is what broke upload: http://drupal.org/node/168813
nodeapi op 'prepare' is being called after 'view'. i'm not up to speed on this enough to know what the right way to fix that would be.
Comment #55
pwolanin CreditAttribution: pwolanin commented@Morbus Iff : cache_clear_all() is called by node_save(), comment_save() and lots of other places - it clears the page and block cache NOT the node, menu, filter, etc caches. It's called anytime anything changes (pretty much)
Comment #56
Crell CreditAttribution: Crell commentedAs just mentioned on the dev list before I saw this thread...
I'm finishing up a site now where most of my node_load() calls are NOT for displaying the node directly. They're loaded for their data. Data should not be loaded in op view. op view is for viewing. op load is for loading. There is no reason why you will always be viewing after loading. The site I'm on now loads all the time without doing a view op. Saying "well just do your load in the view" is a bad answer, since that's confusing two decidedly different pieces of the system (data vs. rendering).
Comment #57
moshe weitzman@drupal.org CreditAttribution: moshe weitzman@drupal.org commentedThe motivation here is to create a cache layer for often static stuff. We have chosen the 'load' step for that. I think thats reaonable.
Then we said that if you have dynamic info in $node you need to set that later. But the only later opportunity that we offer is 'view' which is not good, as you point out.
So, perhaps we need a new nodeapi op that happens after load. The results of that are not cached. In a way, thats a lot like the D5 menu system which caches the $_menu but gives a chance to change it when it does the !$may-cache operation.
An alternative solution would be to check for the presence of $node->node_nocache property and avoid caching if it is TRUE. Modules with dynamic load info could choose to set that.
Comment #58
chx CreditAttribution: chx commentedThen, I think we really need to roll this back. A new nodeapi op or a new column in node definitely does not fit into the freeze. Of course, we might want to say "hey, the performance boost worths it" and then I will be happy to roll something -- but I smell D7 here. Rollback patch above.
Comment #59
chx CreditAttribution: chx commentedThen I am wrong, says Moshe. One can set a $node->no_load_cache property in hook_load or nodeapi op load and then dynamic stuff won't be saved.
Comment #60
moshe weitzman CreditAttribution: moshe weitzman commentedyes, short and sweet.
Comment #61
Dries CreditAttribution: Dries commentedPersonally, I'd roll back this patch and postpone it to Drupal 7. IMO, it should not have been committed in the first place.
Comment #62
Gábor HojtsyAgreed. This needs more discussion and breathing room to get accepted or massaged into a state where it is applicable.
Rolled back the patch with chx's patch, slightly modified to also roll back the node_load doc block I added about this and reworded the system update removal notice, so it does not suggest this *will* be part of Drupal 7, but only that it is up for discussion. (Actual committed patch attached).
Let me know if there are any remainings which are not rolled back, but should have been.
Comment #63
(not verified) CreditAttribution: commentedComment #64
catchI don't think this should've been closed - was the rollback that was fixed rather than the original issue, reset if I'm wrong.
Comment #65
moshe weitzman CreditAttribution: moshe weitzman commentedWould be terrific is someone resubmits this patch. It has decayed a little but not too bad. I think my suggestion to add a $node->no_load_cache property (see #59) addresses the mentioned concerns.
Comment #66
Anonymous (not verified) CreditAttribution: Anonymous commentedsubscribing.
Comment #67
Anonymous (not verified) CreditAttribution: Anonymous commentedhere's a first cut at reviving this patch for drupal7 (including moshe's suggestion at #59).
tested this lightly on a fresh install (create/update/edit nodes and comments), and it seems to work as advertised. i haven't tested the translation code.
not 100% sure i've updated this faithfully, but i'm happy to keep working on this and see it through until its committed/rejected.
Comment #68
moshe weitzman CreditAttribution: moshe weitzman commentedThanks much for updating this I had a read through the patch and found minor doc issues
- "The cache API based cache stores all nodes throughout requests,". Howe about ch"The cache API based cache persists loaded nodes across page requests"
- I prefer positive names so I would $node->no_load_cache to $node->load_cache and make abort caching if it $node->load_cache === FALSE
- $node->load_cache needs to be added to the doxygen for node_load()
I will try to do a functional review soon.
Comment #69
pwolanin CreditAttribution: pwolanin commentedDo we want to postpone this until we have a sense of how or whether node fileds will be in core? that would seem like it might have substantial impact on whatever caching approach we want to take.
Comment #70
moshe weitzman CreditAttribution: moshe weitzman commentedThis is a small patch that can be changed by the fields patch if that materializes and gets committed. I don't believe in postponing useful patches in favor of non existent, non ready patches.
Comment #71
Anonymous (not verified) CreditAttribution: Anonymous commentedupdated patch as per moshe's comments in #68 attached.
Comment #72
moshe weitzman CreditAttribution: moshe weitzman commentedI created and edited nodes and used books and added comments and could not get this to misbehave. Update path works too.
I only added two cache clears in when submitting or cancelling a poll vote. Code looks good. RTBC.
To summarize, any module can disable the can disable caching of any given node as needed. That addresses the one significant objection so far. benchmarks have shown that this is a small performance boost for a plain core site. It will be much more impressive for a fully loaded site with oodles of Contrib modules who perform various nodeapi load operations.
Comment #73
floretan CreditAttribution: floretan commentedThe doxygen comment for translation_node_get_translations() at the very end of this patch seems to be some unwanted leftover from the many patches/rollbacks, the changed sentence is false. Same patch as above, but with that last change removed.
Comment #74
Anonymous (not verified) CreditAttribution: Anonymous commentedupdated patch to keep up with HEAD, no functional changes.
Comment #75
Anonymous (not verified) CreditAttribution: Anonymous commentedupdated patch to keep up with HEAD, no functional changes.
anyone?
Comment #76
moshe weitzman CreditAttribution: moshe weitzman commentedRerolled for 2 failed hunks and restested. Works fine.
Comment #77
moshe weitzman CreditAttribution: moshe weitzman commentedThe failed the revisions simpletest and the problem is in the patch. I will investigate.
Comment #78
moshe weitzman CreditAttribution: moshe weitzman commentedrerolled with revisions fix and a book outline fix. unit tests are super useful (if you can stand to wait for them to finish)
Comment #79
moshe weitzman CreditAttribution: moshe weitzman commentedtrying to get patch to attach
Comment #80
moshe weitzman CreditAttribution: moshe weitzman commentedI uploaded patch to http://tejasa.com/mw.patch
Comment #81
R.Muilwijk CreditAttribution: R.Muilwijk commentedpatch is not working on your site moshe
Comment #82
moshe weitzman CreditAttribution: moshe weitzman commentedHere is as a local upload.
Comment #83
R.Muilwijk CreditAttribution: R.Muilwijk commentedUnfortunalty this breaks some tests. I ran all the node related modules. Please fix untill then 'code needs work'.
It breaks:
Poll vote
Taxonomy nodeapi
Translation nodeapi
Comment #84
moshe weitzman CreditAttribution: moshe weitzman commentedFixed the test failures. The remaining failures happen before this patch as well.
Comment #85
R.Muilwijk CreditAttribution: R.Muilwijk commentedThe tests are fine now... I do however have some comments about the book part
book/book.module:489 and book/book.pages.inc:216 We just have one node here? Why clear the whole cache?
Comment #86
R.Muilwijk CreditAttribution: R.Muilwijk commentedHmmz I did some more testing and see why you do the cache_clear_all now. You need to clear all the parents and the book too.
Just hope we don't add/delete alot of book pages then because we'd have to build the cache allover again.
The tests are all working and I can't find any strange stuff in the patch. Looks RTBC to me. If noone reviews in a few days I'll set to RTBC.
EDIT: It might be good though to comment the cache_clear_all's in book so everybody understands them and makes your code readable.
Comment #87
catchI also ran through all tests and had a quick look through the code - everything looks good. Probably worth putting an extra comment in book to make it clear what's happening, so RTBC with those - just needs a quick re-roll to put them in.
Comment #88
moshe weitzman CreditAttribution: moshe weitzman commentedAdded comments to book module's cache flush.
Comment #89
catchquick note to confirm this applies cleanly, causes no test failures etc.
Comment #90
R.Muilwijk CreditAttribution: R.Muilwijk commentedApplied fine, tests are good, RTBC
Comment #91
Morbus IffFor what it's worth, the load_cache added here, and the poll submission changes, address all my concerns (#49, #50, #52), so I have no further qualms about the implementation.
Comment #92
moshe weitzman CreditAttribution: moshe weitzman commentedFixed a new failure with tracker.test and reworked poll a bit to fix a menu bug that I had introduced in the last version of this patch.
Comment #93
sunOnce the static flag $node->load_cache is set, a node load will not be cached. Even if it is just one property that cannot be cached. Now consider a contrib module that acts on all nodes and cannot cache its properties. This flag reminds me of the 'no_cache' op for filters, which is equally inflexible (but reasonable there). However, I don't see why we do not want to allow f.e. 90% of a node to be cached, and only the rest (if at all) uncached - by caching op 'load' and loading any non-cacheable data in a subsequent op 'after_load' (or 'load_uncached' or similar).
Comment #94
catch@sun, while that sounds like a good idea - is it not something which could be done in a followup patch?
Comment #95
moshe weitzman CreditAttribution: moshe weitzman commented@Sun - thats out of scope for this patch. We don't do partial caching of objects anywhere I can think of ... The prior status was RTBC.
Comment #96
chx CreditAttribution: chx commentedthis together does not mesh as the function does not substract anything. Looks copy paste error.
poll_allow_votes
should return (besides setting) ->allowvotes so the menu access callback can bereturn user_access($perm) && ($node->type == 'poll') && !$inspect_allowvotes || poll_allow_votes($node));
.Why is this here?
Comment #97
chx CreditAttribution: chx commentedforgot to update status.
Comment #98
catchNow that #324313: Load multiple nodes and terms at once is in I revisited this. Overall it takes much less special casing to add the extra hook (provisionally hook_nodeapi_post_load()) so I went with sun's suggestion from #93 and implemented the hook in poll.module to load user-specific information. Seems like you might want that kind of information when you're not actually viewing a node, so worth loading it earlier on in the process, closer to how it's done at present.
This incorporates #333171: Optimize the field cache as well - meaning the front page will load everything with one cache_get_multiple().
Probably still some tidying up to do, but enough there for review/benchmarking to see if it's in the right direction.
Comment #99
catchBenchmarked, looks like it's about 12% improvement on a bloated front page, and the same for an individual node:
ab -c1 -n500 with 5000 nodes, 5000 taxonomy terms, 30 nodes on the front page.
/node
HEAD:
Requests per second: 5.27 [#/sec] (mean)
Time per request: 189.764 [ms] (mean)
Requests per second: 5.35 [#/sec] (mean)
Time per request: 187.017 [ms] (mean)
Patch:
Requests per second: 5.93 [#/sec] (mean)
Time per request: 168.534 [ms] (mean)
Requests per second: 5.89 [#/sec] (mean)
Time per request: 169.785 [ms] (mean)
/node/10
HEAD:
Requests per second: 13.45 [#/sec] (mean)
Time per request: 74.339 [ms] (mean)
Requests per second: 13.66 [#/sec] (mean)
Time per request: 73.210 [ms] (mean)
Patch:
Requests per second: 15.21 [#/sec] (mean)
Time per request: 65.766 [ms] (mean)
Requests per second: 15.32 [#/sec] (mean)
Time per request: 65.274 [ms] (mean)
Comment #100
David StraussI like this patch. As a note to people about the cache warming question, in most cases, when a user saves a node, they go to the viewing page for that node. This viewing takes care of warming the cache.
It might not be a bad idea to only cache nodes that are published. Unpublished nodes should not be kicking published ones out of a cache that has finite space.
Comment #101
moshe weitzman CreditAttribution: moshe weitzman commentedCode looks good. Note that creating the new cache_node table can be briefer. Here is how block.install does it:
Comment #102
David StraussPer Moshe's post, please replace the cache table creation code with the brief version.
Comment #103
catchMoshe, the full schema definition is just in the update function - looks like cache_block also does that in update_fix_d6_requirements(). If we can load the schema in the update function itself that'd be great though - is this what you meant?
I noticed that cache_get() returns the full row from the cache_foo table, whereas my conversion to multiple only really wants $item->data. We'll have to get the whole lot then pull ->data out I guess, so marking to needs work until I can re-roll for this. If someone can clarify cache/schema updates that'd be great - I guess I could just try it and see if it works though.
Comment #104
catchcrossposted.
Comment #105
catchRe-rolled with brief update function and the cache_get changes mentioned earlier. Also re-ordered some stuff in node_load_multiple so that we can do node_load_multiple(array($nids), array('type' => 'article')) and still grab stuff from the persistent cache (also saves one foreach loop).
I'm getting very, very odd failures with path.test - says that node isn't being created, but manual testing works fine, as does the identical call to createNode() earlier in the test - marking as needs review to run past the test bot.
Comment #107
catchReplaced path.test's home grown createNode function with drupalCreateNode() and it works fine now. Also manually replicated the test and everything's working as you'd expect.
The extra hook is now called hook_nodeapi_load_alter() - 'Act on nodes after they've been loaded from the database' (and the database is now either the cache_node table or direct query) - and poll.module implements it. I think this is fairly clear in what it does and what it should be used for, but needs review obviously.
Also incorporated David Strauss' suggestion to only cache published nodes.
Could do with some independent benchmarks and another close look through.
Comment #108
catchComment #109
swentel CreditAttribution: swentel commentedAt this point, benchmarks make my setup slower, which isn't our goal I guess ..
2 other things:
- shouldn't we clear the cache_node table when we install a new module (especially if it has a nodeapi_load implementation)
- drupal_flush_all_caches doesn't flush cache_node, I'd put that there too
Comment #110
catch- added cache_node to drupal_flush_all caches
- added node_modules_enabled() and node_modules_disabled() - these only clear cache if either hook_load() or hook_nodeapi_load() is implemented.
- fixed a silly bug where I wasn't actually unsetting the $nids for nodes loaded from the persistent cache - so it was going off and grabbing them from the database too.
Comment #111
swentel CreditAttribution: swentel commentedOk, benchmarks now show an increase of performance (even on a very old laptop)!
Done with 10 nodes on front page, comments & url aliases, no caching enabled.
Without
Requests per second: 15.09 [#/sec] (mean)
Time per request: 66.276 [ms] (mean)
Requests per second: 15.37 [#/sec] (mean)
Time per request: 65.063 [ms] (mean)
With
Requests per second: 16.48 [#/sec] (mean)
Time per request: 60.679 [ms] (mean)
Requests per second: 16.38 [#/sec] (mean)
Time per request: 61.046 [ms] (mean)
Very interesting patch for logged-in users also and when having lot's of nodeapi_load hooks surely will make your database much happier.
Comment #112
catchFor those who read through the whole issue, here's a quick summary of stuff that might need clarifying:
The big performance gains shown in initial benchmarks won't be as much now since multiple load saved us quite a lot on the front page - however - it's still 10+% gain on the front page (and likely extra with memcache), and we get around 10% on node/1 too - whereas multiple load didn't buy anything there. Still very good, but important to note that node_load got faster rather than caching getting slower here :)
Eaton's concerns about the extra expense of an uncached node_load() are partially dealt with by multiple load - an uncached node_load() is less expensive than it used to be. Perhaps in a followup patch we could look at multi insert cache_set() as well... not sure how much support there is for that in dbtng yet.
The extra hook added was suggested several times from around #47 onwards but was too late for D6 code freeze, I think Morbus' points about things we might not want to cache persistently in the node object still being node data - votingapi, flag etc. still stand. And hook_nodeapi_load_alter() seems a reasonably named hook for modules to do that sort of thing in.
The current patch still does a cache_clear_all when saving a book outline - pwolanin, do you have ideas on how we could make that less disruptive? Seems like a select query would be a lot less work than re-building the cache.
Comment #114
catchResetting due to bot issues.
Comment #115
catchpwolanin informs me that SELECT nid from book WHERE bid=%d will work fine for getting all nids in a book - so will re-roll for that change, hopefully tomorrow. CNW for now.
Comment #116
catchNow with more intelligent book cache clearing. No other changes - that's the main small thing that needed fixing up from earlier patches, so it's ready for proper review now I think. There's one kitten in there with the test which I'll try to split off later if I get time.
Comment #118
catchPath test cleanup is over at #348742: Path test should use drupalCreateNode() now.
Comment #119
riccardoR CreditAttribution: riccardoR commentedI think there are two typos at lines 231 and 242 of the patch within the two new functions
node_modules_enabled()
andnode_modules_disabled()
should be:
The patch applies with offsets.
Rerolled it against HEAD with the two corrections above.
Comment #120
riccardoR CreditAttribution: riccardoR commentedcache_get_multiple() returns an array of items indexed by cid, whose value can either be cache data or FALSE, as returned by _cache_prepare_item()
node_load_multiple(), however, skips database loading for all items returned by cache_get_multiple() without checking if they contain cache data or not.
BTW, The Doxygen comment for cache_get_multiple() says that it returns an array in the form array('hits' => $cache, 'misses' =>$cids)
HTH, riccardo
Comment #121
catchriccardo, thanks for the review. I've fixed the docs on cache_get_multiple and it now only returns items which aren't FALSE.
Comment #122
riccardoR CreditAttribution: riccardoR commentedyou are welcome, catch. system_update_7016() just got committed. Rerolled.
Comment #124
catchRe-roll.
This has had lots of reviews from various people, the benchmarks are good, anyone want to give it a final look over?
Comment #125
catchComment #127
Damien Tournoud CreditAttribution: Damien Tournoud commented- I know this is not touched by that patch, but there is a dot missing at the end of the garbage collection comment.
- The query should be dynamic (again, not touched by this patch, but you are writing a dynamic one in cache_get_multiple()...
This code comment should be in one line.
You can type-hint $cids as an array.
[...]
You can also type-hint $cache as a StdClass.
I'm not sure this micro-cache expiration approach really worth it. Just truncating the table should probably be a faster operation, especially on big books hierarchy. Has this been thought about?
Oh, here we simply clear the whole cache_node table. Why two different approaches?
[...]
- Those should be drupal_function_exists() calls.
- I get the "_nodeapi_load" part, but why is the "_load" part required?
[...]
- Why only caching published nodes ($node->status > 0) ?
Nice clean-ups!
Comment #128
catchDynamic query - yes it should be one - but don't you have a patch somewhere for that already?
Book cache expiration - this was requested earlier in the issue, so I added it in. I think on sites with only a few book nodes and lots of other content, it's nice to not clear the entire cache unless necessary - but then it's a lot simpler/more elegant if we just do a cache_clear_all. The two different approaches - most likely because I'm an idiot and forgot the other one ;). Would be good to get some more feedback on whether this is worth doing or not, I don't think either option is ideal to be honest.
nodeapi_load() and _load() - afaik it's quite possible for a module to be disabled, but the nodes to persist in the database - say forum topics or something - forum.module doesn't do an automatic node delete for all forum topics, even on hook_uninstall(), and nor should it. afaik poll is the same - but the cache definitely needs to be invalidated in those cases.
Only caching published nodes was David's suggestion - the idea is to conserve space in memcache (or wherever) by not taking up space with nodes which will be loaded rarely. The only people who should ever be viewing un-published nodes are administrators, and we don't really need the extra 5-10% performance (or for that matter extra cache_set()s on stuff which might not get viewed again) just for them.
Path test cleanup has it's own issue I think, would be nice to disentangle that from here if possible.
Comment #129
catchRe-roll.
Path test changes got committed, so those are out.
Type hinted cache_get_multiple() to array.
Did not type hint _cache_prepare_item() because a record returned from the IN() query can be FALSE as well as an stdClass.
Reverted back to a straight cache_clear_all() in book.module - it's risky foreaching like that compared to a straight clear, unless we add an array option to cache_clear_all() and DELETE FROM all at the same time - but then that's adding complexity for not necessarily much benefit.
Fixed some comments.
That query in cache_get() is impossible to convert at the moment - per #344088: cache.inc cannot be fully converted to dbtng - we shouldn't try to fix that here.
I'm getting some test failures, but also seem to be getting those with HEAD (and in suspicious places like aggregator and profile tests with no node interaction, as well as book and translation which are touched by the patch), so uploading and setting back to CNR to see what the bot thinks.
Comment #131
catchsystem_update_N() conflict.
Here's another re-roll - I'm getting four exceptions from the new node submitted test - which manually posts a node and then runs it through theme('node_submitted'). There's nothing in there which should break, but it appears drupalGetNodeByTitle() returns FALSE for the second test function (it's fine for the first one, and elsewhere in node.test). This doesn't break without node_load() caching, or when running the same steps manually, so I'm not sure what's happening there. Marking to CNR only to get the test bot to run it.
Comment #133
chx CreditAttribution: chx commentedUnion fixes in the patch. instead of += now we have
$nodes = $nodes_to_process + $nodes;
and$node_cache = $nodes_to_process + $node_cache;
.Comment #134
catchRemoved some unrelated whitespace which had crept in, along with some variations in phrasing in the book cache_clear_all() inline comments.
Comment #135
chx CreditAttribution: chx commentedThe testing bot likes it. The code is solid. Let's rock.
Comment #136
sunIs there a reason why we use a static query in cache_get() and not a dynamic one like in cache_get_multiple() ?
Missing blank line.
References to functions should use a trailing "()" to make them obvious.
Please try to avoid abbreviations in PHPDoc at all cost.
Instead of the @see, we want to provide actual example code here. Easiest way: Copy directly from poll.module.
Albeit correct, it would be good if we could use "cannot" here, since this comment reads like the opposite when quickly scanning it.
Are we returning nids or nodes?
First PHPDoc summary line must be on one line.
Aside from those minor points, good job, catch!
Definitely RTBC afterwards.
Comment #137
catch@sun Yes, a dynamic query brings the universe crashing in on itself - see #344088: cache.inc cannot be fully converted to dbtng (Damien asked the same question above).
edit: also I left the @see for poll in, in case someone changes the documentation example later. But copied it in for now too.
Otherwise all great changes. Here's a re-roll, passes all tests locally.
Comment #138
webchickCould someone please summarize what this patch does? (And remove this tag after.) I actually did read the entire issue, but it's not clear to me exactly how this patch has progressed since its first iteration that was rolled back and had people freaking out about it.
Comment #139
catchHere we go.
The rollback happened for two reasons:
1. eaton's concerns about cache clearing, although some were a false alarm it's worth going over.
When nodes are saved, we only clear the cache for that individual node. When translations are saved we save them for the translation. When any multi-node operation happens (moving book pages, deleting content types), we clear the whole cache.
In the case of book module, there's the potential for some micro-clearing (see #124), but I removed that for simplicity, and because without cache_clear(array(cid1, cid2)) we'd have to do a DELETE FROM for each record, which on Drupal.org handbook pages could get nasty.
In terms of priming the cache, since node_load() itself is less expensive in many cases, in the worst case we're adding 30 inserts to an operation we've just removed 100+ SELECTs from. Since those 30 will be served from one query (or memcache) the next time they're requested, I think any concerns about re-priming the cache are less compelling than they were - although the performance increase is also a bit less impressive (still around 10% with core modules though and potentially a lot more on real sites).
Two potential followup patches to enable some more fine grained clearing would be:
cache_clear_all(array $cids)
And if it's possible to do multi-update/insert with the new database layer, potentially cache_set_multiple() as well - so a single insert when caching every node on a listings page.
2. Potential for stale data
The original patch which was rolled back left any uncached operations entirely to node_view - Morbus had good reasons in #45 why this kind of information is still 'data' and doesn't belong in hook_nodeapi_view().
Later iterations of the patch added a 'don't cache me' flag to nodes, but this means that one naughty module could effectively disable node caching for the entire site. Adding a new nodeapi $op wasn't an option late in the D6 freeze, hence postponing this rather than a quick re-roll.
So, I added hook_nodeapi_load_alter()
The order of execution now looks like this:
Fetch from the static cache if possible.
Fetch anything remaining from{cache_node}
Fetch anything /still/ remaining from {node}, hook_load() and hook_nodeapi_load().
Store the fully built node objects with cache_set()
Invoke hook_nodeapi_load_alter() to allow for current user/language/vote-type data to be added - which can't be cached persistently, doesn't belong in hook_nodeapi_view(). poll_nodeapi_alter() is the implementation of this hook in core.
Add the fully loaded and altered node to memory so that we don't run poll_nodeapi_load_alter() 13 times on a page request.
hook_nodeapi_alter() will have exactly the same performance as the current hook_node_load() in HEAD - you can load multiple stuff in one query still etc., so if a module needs to use it, they don't lose any performance compared to what we have now, and can't drag down everything else with them like the NO_CACHE flag would have allowed them to.
I think that covers all the new stuff.
Comment #140
webchickYay! Thanks for the useful summary! :)
And yet, from drupal_flush_all_caches():
Why are cache_block, cache_node, and cache_registry not in the PHPDoc? Are they special in some way or is this just an omission?
The various cache_clear_all('*', 'cache_node', TRUE); lines in book.*.* make me very nervous, given that Drupal.org has a very active documentation team whose long-term plans are set to involve storing new handbook pages in a "test" area of sorts and selectively moving these into the "real" handbook hierarchy by volunteers... so this entire cache is probably going to get cleared multiple times per day. I voiced these concerns on IRC, but catch said (paraphrasing) that clearing the cache in a more fine-grained way was a hard problem to solve and best pushed to a follow-up patch.
I'm not sure of the name hook_nodeapi_load_alter(). I'm wondering if hook_nodeapi_postload() or hook_nodeapi_load_uncached() or something makes more sense, given that we also have hook_nodeapi_presave() and hook_nodeapi_alter(). Let's bicker about it on IRC. ;D
"For example" needs a comma. I'd maybe rephrase this to something like "For example, node properties which are dependent on the current user or language." as currently it doesn't really read as a complete sentence.
I don't understand what "The results of this hook are cached during the course of a page request" has to do with "you should not alter fields returned from the {node} table here"... feels like I'm missing a part of the discussion that went on. Could we expand this out so it's a bit more clear?
global $user is indented two too many spaces.
We don't smoosh words together. $node->allow_votes. (Bah. Nevermind. The code that you're copy/pasting from already has it that way. Curses. Follow-up patch. :P)
I feel like we lost some clarity in the comments here. Unless I'm mis-reading, we still don't cache nodes if we're loading a revision.
Also, I don't see any new tests added with the addition of cache_get_multiple() and similar. Seems like it would be a good idea to add some, no?
That's about all I can see. I'll also ping Dries to take a look at this, since he's committed most of the other node performance-related patches, so I'd feel better w/ his okay before committing.
Comment #141
catchWe discussed the naming of the new hook in #drupal a lot a few weeks ago and I was originally in favour of hook_nodeapi_post_load() so I've changed it back to that, and chx +1ed on irc.
Book caching, I've made a change to cache_clear_all() to allow it to take an array as the first parameter - this was already discussed in irc last night and I wanted to defer it, but it actually simplifies things a fair bit. This is useful for taxonomy and translation as well (at least if you had translation in 50 languages or something).
Also added unit tests added for both cache_get_multiple() and the new cache_clear_all() behaviour. cache_clear_all() might do with splitting out into some different functions, since it does a lot of argument checking internally already, but that's definitely not for this issue, and shouldn't postpone it either.
However I then realised that _book_update_outline() is called on every node_save/update, and doesn't call node_save() for child nodes either - which means I was wrong on the frequency of cache clears earlier. This means in the case of book itself, I've moved book_nodeapi_load() to book_nodeapi_post_load() - which isn't cached persistently at all and means no chance of meltdowns on Drupal.org. {menu_links} pretty much acts as a cache anyway since it's all denormalized. Since the actual menu tree shuffling is dealt with outside of hook_nodeapi_save() this is the proper use for that new hook.
Since it's feasible that the array sent to cache_clear_all() could contain a few thousand cids, I've added a safety valve to cache_clear_all() to avoid max_allowed_packet errors. We always send the same data into cache_clear_all(), but if it's over a certain threshold, clear the entire cache instead. I've made this a variable and set it to 1000 to be conservative. Deleting a term is a relatively rare occurrence on most sites, particularly ones which are associated with thousands of nodes, so I don't think the book issues apply to the same extent, and anything up to 1000 nodes by default will be cleared in a single DELETE FROM as opposed to either looping through the array or clearing the whole lot.
Also went through and cleared up the comment issues identified by webchick.
Comment #142
catchComment #144
catchBroken HEAD.
Comment #145
moshe weitzman CreditAttribution: moshe weitzman commentedLooks really strong. Minor nits only:
Comment #146
catch3. I can get behind hook_nodeapi_after_load() - but webchick also mentioned hook_nodeapi_uncached(). I'm a bit worried about using uncached, because it is statically cached for a page request - but that's also its purpose. We should probably hash this out one more time in irc.
6. helper function maybe? I want to do this for more bits of the multiple load functions - but there's enough variation in the objects we're dealing with to make it hard - again I'll try to grab you in irc to see if it can be rationalised a bit.
Will re-roll when these two are resolved.
Comment #147
webchickafter_load() is fine with me. It's a little awkward paired with "pre-", but it's true that "post" is *extremely* ambiguous.
LOL @ Olympic gymnastics medal.
Comment #148
catchAlrighty, this should address all those issues apart from the gymnastics.
Comment #149
moshe weitzman CreditAttribution: moshe weitzman commentedThanks catch. The gymastics is a compliment - no neded to change the code.
Lets move this along - RTBC.
Comment #151
catchsystem_update_7018() was committed. No other changes.
Comment #153
catchRe-rolled for whitespace changes.
Comment #155
catchSince there were no changes since this was RTBC, and the re-roll didn't introduce any errors, bumping this back up.
Comment #156
catchre-rolled for offset/fuzz.
Comment #157
wedge CreditAttribution: wedge commentedMinor nit but is the "not" in the new comment really supposed to be there?
Comment #158
catchWell spotted, re-worded that comment.
Comment #160
chx CreditAttribution: chx commentedWhether duplicating field caching is a problem or not -- I would say, no. Storage is, for all practical purposes, free. As long as there is hook_nodeapi, caching nodes are necessary and bothering with excluding is just that: bothering and unnecessary.
Page caching and filter caching are already duplicating and noone is screaming.
Comment #161
Dries CreditAttribution: Dries commentedI'm trying to figure out where exactly the field API caching happens. Can you point me to the code?
Storage might be free, but checking the cache (and updating the cache) requires an SQL queries.
Comment #162
David StraussRegardless of Field API caching, caching entire loaded nodes is a good idea. We should strive to cache as late in processing as possible, even if there are earlier caches.
Comment #163
catchThe cache_set/get() is in _field_attach_load(). This also checks if the object_type is cachable or not.
http://api.drupal.org/api/function/_field_attach_load/7
So, in node_fieldable_info() there's this
If we uncomment cacheable in node_fieldable_info(), then fields loaded and saved from node objects won't be cached in {cache_field} due to the checks in _field_attach_load() - this means zero duplication between cache_node() and cache_field(). Attached patch does this ;).
Also, note that on a node listing, every cached node in the listing is retrieved in a single query (via cache_get_multiple()), and we avoid hitting the database at all if all nodes are cached and are using memcache as a back end.
Comment #165
catchAll tests pass locally for me.
Comment #166
catchDid some more benchmarks over at #369011: Performance: do check_markup() in text_field_load() - we get about 8% improvement with this, even with only one hook_nodeapi() module enabled and 10 nodes on the front page. That 8% is well worth having, regardless of whether every hook_nodeapi_load() gets converted to Field API.
Comment #168
catchComment #170
Anonymous (not verified) CreditAttribution: Anonymous commentedgetting back to this.
updated patch, conflict with update_N function only, no other changes.
Comment #171
Anonymous (not verified) CreditAttribution: Anonymous commentedtest nitpick:
don't we need another entry or ask for less than the 3 items to be deleted? you're creating three items, then asking that three be deleted, so not sure how that tests the threshold.
also, on the threshold functionality, i'm wondering if we shouldn't just chunk the queries? seems like this is less surprising? something like (not tested):
Comment #172
Anonymous (not verified) CreditAttribution: Anonymous commenteddiscussed the chunking idea with catch in #drupal and he's ok with it, will reroll shortly.
Comment #173
Anonymous (not verified) CreditAttribution: Anonymous commentedupdated patch attached.
Comment #174
catchDiscussed this with Peter and Damien, $cids now passed by reference to cache_get_multiple() so the gymnastics stays in cache.inc
Comment #176
catchThis time with less fail.
Comment #177
Anonymous (not verified) CreditAttribution: Anonymous commentedmy only issues with this patch have been addressed, test bot is happy, i think this is RTBC.
Comment #179
catchre-rolled for hook_node
api. No other changes.Comment #181
catchrickrolled.
someone needs to put this issue out of its misery, either way.
Comment #183
BerdirThe function translation_node_get_translations() has an internal static cache (not sure why, is that called so often?), this function is called multiple times, so when called from the new translation_clear_node_cache() it didn't contain the latest translation because there is already a cached array of translated nodes which was created earlier.
Added a $reset parameter to that function and call it from translation_clear_node_cache(). All tests should now pass again.
Comment #184
BerdirThat form doesn't like me. I'm sure I updated the status...
Comment #185
catchThat change looks fine - we can look at whether translation_node_get_translations() actually needs a static cache in a separate patch.
Taking advantage of not being the last patch author and marking back to RTBC. Would really appreciate a review from Dries here since it's been 7 weeks since his last questions were answered and it seems webchick isn't going to decide whether this gets committed or not.
Comment #187
catchsystem update collision, no other changes
Comment #188
bjaspan CreditAttribution: bjaspan commentedI hate it when people parachute into my issues late in the game, drop a bomb, and then run away very fast. But people do it to me all the time lately, Dries asked me to comment on this patch, and I can't handle reading 187 comments at the moment, so now I'm going to do it, too. I'm sorry, really.
1. Field API caches field loads if asked. Do we *really* need a separate node cache? As more things become fields, it seems like the complexity of the node cache will become less worthwhile, but still stay with us. Also, Field API leads Drupal in a direction in which nodes are not the only things that are content. Why not cache user_load? Or comment_load? Or every_other_fieldable_type_load? Yes, obviously node is the thing we have the most of it and is viewed the most often, but this seems to be adding more special-case logic to Drupal instead of removing it.
2. This really, really disturbs me:
So, now taxonomy_term_delete() is required to know that node module maintains a load cache which taxonomy module is required to clear under certain circumstances. Special cases like this are going to pop up everywhere and make Drupal harder to work with. This is anti-modular programming. It feels a lot like having to know you need to clear node_load()'s static cache after calling node_save() which we *finally* just got away from.
Can you guarantee that there will never a field type that when saved will require the node cache to be cleared? If not, you are creating an impossible solution because Field API will not hard-code logic like this for the field module if I have anything to say about it.
Running away now...
Comment #189
catchGrabbing your shirt before you get away...
#1: We'll always get more benefit caching at the highest level possible because the most work has been done. So if we can cache fieldable objects, we probably should, and the field API allows for this to happen currently.
With database caching, the field cache requires a round trip to the database for every object (+ the node_load and hook_node_load() round trips we already have) - because it retrieves from cache on a per-object basis. With this patch, it's a single query for the entire array of node objects (see cache_get_multiple()).
So as far as I understand it, the body and teaser as fields patch, on a front page with 30 nodes, will actually cause an additional 30 cache_gets() on top of the node_load_multiple() query - that's going to slow things down on a default install. I'll come into your issue and drop some benchmarks soon, don't worry :)
#2: this could be done in node_taxonomy_term_delete() :p I should note that taxonomy module is tied into node module in far worse ways than this. When and if taxonomy vocabularies are fully converted to a field, this special casing will go with all the rest of it. That's still a long way off though, as is the conversion of other core hook_node implementations to fields.
#3: Fields are saved as part of object saving - which clears the cache for the node anyway. I think we're going to end up with a use case for field_attach_load_uncached() (similar to hook_node_post_load()) anyway - and field API shouldn't prevent any fieldable object from being persistently cached should it? If that worked like hook_node_post_load() - we'd get still get one extra query per list of fieldable objects (per uncacheable field) instead of 10-30 with the field cache by default.
Comment #191
catchdbtng conversion conflicts.
Nearly three weeks since the last update here. There are benchmarks in #20, #99 and #111 and there's an issue summary in #139. David Strauss gave much the same answer to Dries regarding the field cache back in February in #162 as I did in my response to Barry three weeks ago, no-one's actually addressed what's been said on that point. If this is won't fix it'd be nice to know instead of having it sit in the RTBC queue for another three months.
Comment #192
Dries CreditAttribution: Dries commentedLike Barry in #188, I agree that things get messy when everyone and his dog need to know when to clear the node cache, something which doesn't seem to be necessary with the field level caching. It seems abstracted better, and eventually, will do the same job -- especially with some of the fields related follow-up patches in the issue queue.
I'm tempted to mark this postponed until we dealt with some of the fields API patches. At this point, it seems better to revisit this patch after we drove those home. I'm worried that this patch adds complexity on top of an already new and complex (albeit properly designed) field API. Combined with the reasons outlined above, I'd prefer to keep things simple for the time being. Can we agree to revisit this patch later?
Also, the hack for "max_allowed_packet" in _cache_prepare_item() is a no go.
Comment #193
catchI just posted benchmarks for the first core field conversion at http://drupal.org/node/372743#comment-1493570
That's a 30% slowdown vs. a 10% improvement here. We already have a valid comparison in terms of performance impact for object level caching vs. the field cache - and that's on vanilla HEAD with only comment and taxonomy modules enabled (and no comments or taxonomy terms attached to the nodes in those benchmarks either).
Deferring this until comment, taxonomy, poll and other hook_node_load() implementations are converted means we'll end up with the same situation we had 170 posts ago - it'll be too close to code freeze and it'll need to be postponed until Drupal 8. And meanwhile we made core xx% slower again.
Obviously the final decision is yours, but unless someone's got another way to claw back our very poor performance in HEAD I'm concerned that keeping things simple is going to mean everyone has to run memcache and materialized view API on their Drupal 7 sites just to keep up with stock Drupal 5.
Comment #194
yched CreditAttribution: yched commentedDo you think node-level caching would increase performance of 'body as field' (or 'X as field') ? From a first look at #438570: Field API conversion leads to lots of expensive function calls, it looks like the PHP overhead is added by the whole field workflow, and probably more the 'display' part (theme(), drupal_render(), check_plain()...). I don't see how node cache vs field cache on node load would make things better or worse here.
Other than that, the road is indeed very long until all current core node additions have been migrated to fields... There's no guarantee we'll get there in D7.
Comment #195
catchYeah I did the profiling because 40% of a request seemed like far too much for 20 cache_gets. I think the extra database access from the field cache as it stands now is responsible for 3/40 of that hit compared to the status quo - going by the results of #333171: Optimize the field cache at least.
However I just benchmarked 10 nodes again, this time with comment and taxonomy modules disabled
HEAD:
11.80 / 11.79 [#/sec]
Patch from #191:
12.82 / 12.78 reqs/sec
That's 8% improvement with neither field_attach_load() nor hook_node_load() being called at all. When nodeapi and/or field API are added on top, we'll see additional improvements from caching either of them - but the baseline of 8% is still going to be there.
Benchmarks attached.
Comment #196
catchIn response to this:
I'm not sure how the field cache differs in this. For simple fields like number and field it's nice and easy to keep those self-contained since field API is the only entry point - this is equally easy for object-level caching.
However, any field which references something that has it's own independent existence isn't going to be as clean and self-contained though.
Here's a very simple example with nodereference:
For example:
1. Create node A
2. Create node B - reference node A
3. Delete node A
The cache_content entry for node B still contains the reference to node A. If you go back to edit node B then save it, you'll get a validation error because the deleted node is still a value displayed in the form (edit, this is actually because the field table doesn't get updated either, which looks like a straight bug).
The only reason that stale data doesn't get displayed when viewing node B is because the actual node titles are queried individually by the formatter on every request.
This takes the nids, which may or may not have been deleted, then issues a query per title to actually display them. Non-existing nids get an empty string. This code hasn't changed between D6 and D7.
If the same model was used for taxonomy terms in core, you're looking at objects * terms queries just to get the term names - a considerably worse performance situation than either D6 or HEAD for taxonomy module where it's been a query per node in D6 and two queries per page in D7. And you'd still have stale cache entries.
Additionally, whereas it's pretty easy to get the nids attached to a term and do cache_clear_all($nids 'cache_node'), for fields, everything is in one big cache table and it's not going to be possible to generate a list of cids from a single query, since the keys aren't just an object type and id. So you'd have to clear the entire field cache potentially every time a node or term is updated if using nodereference and term_field modules - something which this patch has taken care not to require.
So to me it looks like this is currently an unsolved problem in both D6 CCK and field API and I find it odd that claims are being made to the contrary. In fact there's both stale data being left around, and potentially expensive stuff doesn't get cached at all. So it's going to be messy one way or another, it's more a case of whether it's going to be messy and fast or messy and slow.
Attached patch removes the max_allowed_packet workaround from cache_clear_all().
Comment #197
catchComment #198
yched CreditAttribution: yched commentedThe argument could be : if you implement a field type that is a reference to some entity (taxo term, file, node...), then it's up to you to clear the field cache in hook_[entity]_delete(). A field type module requiring a field cache clear is less spaghetti than taxonomy.module requiring a node cache clear.
It also answers the fact the module defining entity X can live in core, and the module defining the 'X-reference' field type lives in contrib - or that both can be different contrib modules.
Refrential integrity is and has always been a complex issue for 'reference' fields. Right now, if taxonomy was a field ('term_reference field'), field API offers no easy way to do the equivalent of taxonomy.module's current 'on term delete, remove all node associations for that term'. Current reference fields (noderef / userref) do their validity checks at display time.
Comment #199
sunI agree with catch. Drupal should cache as much as possible. That's one of main reasons why Drupal can't compete with several other systems.
However, it seems that (somehow? suddenly? after very much work has been done?) there seem to be worries about the invalidation of this additional cache. On an attempt to at least allow for node_load caching, would it be possible to add the bare minimum to core? So people like catch and me (who prefer fast Drupal sites) can use a node_cache contrib module without requiring core hacks?
Basically, what I mean is to turn our object storage process and arbitrarily invoked cache* functions into a real Cache API, (also) allowing modules to implement hooks to act upon creation/manipulation/deletion of objects _and_ cache entries. Definitely a separate issue.
I can feel your pain, catch.
Comment #200
catchLess complexity, but means the entire field cache would be cleared every time any object which has a reference field gets updated or deleted. On some sites that's going to be every second if they have nodereference and lots of node updates - even if they have a million nodes and only five references.
If we don't clear the cache (let alone the referential integrity issues, although that's only for deletes), then we're introducing exactly the same issues which resulted in an early version of this patch being rolled back in 2007 - stale data. The complexity added in taxonomy module in this patch is precisely to avoid stale data and overly aggressive clearing of caches, this would be considerably harder to implement with a field level cache for the reasons given above - a single bin for all fieldable objects and complex to generate cache IDs.
Seems worth mentioning at this point that it was stale data which caused the D6 patch to be rolled back 150-odd followups ago and most of the work since then was to deal with those issues. That extra work is now being turned around as a reason to defer it the patch in favour of an architecture which has even more severe issues on this point. and given how nodereference avoids using the field cache for anything other than nids, considerably worse performance, even if we got all the conversion done in time to use it 'fully'.
@sun - that's somewhat where I've been planning to head with multiple load - although at this stage still a pattern rather than an API because an API takes us into ORM land, hence updating this patch rather than going for a completely new approach. For an abstracted way to do intelligent cache clearing, I've added some notes over at #439186: Cacheable object API.
Comment #201
yched CreditAttribution: yched commentedIf taxonomy and file uploads can be attached to nodes only, then I'm not overly shocked that they need to be aware of a node cache and sometimes have to manually clean it when terms or files are deleted. Similarly, if we move to 'taxo as term_reference field' or 'uploads as file_reference field' (and I think we know we'd like that), I think it's alright that those field types are aware of the field cache and manually clean it when needed. Where the node cache is problematic, though, is that a X_reference field type then also needs to be aware of node cache, or any other entity-level cache, and act on it. This breaks a pattern. As a field type, you shoudn't even know about entity types.
I see the node cache as a (hopefully) temporary optimization while we move towards fields as the primary, cross-entity 'data attach' mechanism - which can take some time. In the end, we should aim at field API and field-level cache being flexible and performant enough not to require fieldable entities to maintain their own object level cache. The discussion here shows, for instance, that we need to define a strategy for cache and referential integrity issues. But it's possible that when D7 freezes, some but not enough of the current node additions have been migrated to fields, and that we're not able to state that the performance outcome that once was identified from node caching is satisfyingly provided by field cache. Then, we might want to ship with both, and the node cache becomes a special case every X_reference field needs to care about.
HEAD being sort of in-between right now, it's a tough call. I guess my feeling would be that we have the patch ready just in case, and that we can focus on moving ahead with 'X as field' tasks, and make sure they are as performant as they can be. The discussion here, the spinoff issues and the benchmarks catch already ran on 'body as field' are precious in this regard.
Comment #203
joshk CreditAttribution: joshk commented@sun & @catch:
++ On this. I apologize if this seems like "parachuting in," but if we want Drupal to perform well, the ability to cache object-level data is critical. We should at the very least be able to do this without
hackingpatching core. Maybe, as Dries suggests, we return to this after the Field stuff is worked out, but I see the field work as being a long road, and am concerned that we won't get all the way there, and thus miss out on the much lower-hanging fruit.@bjaspin:
If you want drupal to be fast and support very large use-cases, then yes, we do. Object-level caching is a best-practice for scalability and high-performance architecture. Field-level caching presents just as much complexity, but with less performance gain because of increased PHP overhead. Why? It's not uncommon for node types have a lot of fields. Is it faster to get and assemble 15 caches or 1? What about all the things that are critical application-level node data, but not quite "content" (e.g. comment statistics, etc).
You bet we should! Advcache module does just these things. Having a general cache api for the important objects in core (which contrib modules can then implement as needed) would be a huge win. Catch's 8% number only hints at the difference this will make with an optimized caching system.
Seems like the thing a well-structured field-API would handle? The glue between a field and object should be easy to use (automagic) for clearing a necessary cache when a delete operation happens.
Comment #204
CorniI CreditAttribution: CorniI commentedsubscribe
Comment #205
yched CreditAttribution: yched commentedre joshk #203:
You got field cache wrong. Field cache caches all fields for a given object in one cache entry.
A taxonomy *field* would know that *field API* maintains a cache and clear it when needed. I precisely think Field API is fairly well structured on that regard : fields do not know about nodes or users specifically, or care about their specific internals.
Comment #206
catchIn this case though, as mentioned already, X_reference field types are going to require clearing the entire field cache every time the thing they reference is saved - or have to rely on formatters to load data with direct database queries and validate it on every request as happens currently - in either case you're spending a lot of time loading things manually or rebuilding the cache over and over - and each of these already has a solution here - dealing with those cases in field API is going to introduce exactly the same level of complexity.
For example, if the taxonomy_field formatter had to go and fetch term names for every term on a page with 10 nodes and 5 distinct terms each (not that many) - in the same way that nodereference currently does - then you're looking at 50 additional queries on that page. Add a nodereference field to a site and you're clearing the field cache every time a node is updated or deleted. Even if the field cache moves to per fieldable object - you've still got find all the object types with X_reference fields attached and clear their caches.
One way to avoid the knowledge of referenced objects required in either case would be to do exactly what joshk says - cache all objects consistently.
That way, the node IDs in a node reference field are cached persistently with the object - but no data about the nodes themselves. Then we load all the referenced nodes in a single query (or from memcache) via cache_get_multiple() for that request - and the formatters then just do $node = node_load($nid) $node->title instead of db_result(db_query('SELECT title')); - which by that point is more or less for free. Even with ten nodes, each with five term, user and node references each - you're looking at 4-5 queries per page, or 1 query + memcache, instead of 150 with the way field formatters currently work.
Comment #207
yched CreditAttribution: yched commentedHere's a propoal for the 'reference field' pattern (noderef, taxo as field, filefield...) which tries to addresss:
- performance issues
- stale cache issues
- object-level caching on some object types with a consistent pattern for 'who knows the internals of who'
Taking 'taxonomy as field' as an example. Replace [taxonomy field, term, tid] with [noderef field, node, nid], [file field, file, fid], etc...
a) We don't clean up existing field values when a term is deleted. Too much potential cost.
(this is how existsing D6 noderef and filefield work)
b) Instead, during field_attach_load(), hook_taxonomy_field_load() massages field values coming from the raw storage:
remove references to tids that don't exist anymore, add in base properties from the term object (term name, etc...).
(this is how D6 filefield works, but not noderef)
Once #362024: Field Attach: Make hook_field_load() be 'multiple' like field_attach_load() lands, this is one additional query per field of type 'taxo field' even on multiple object load.
The results of this get cached in the field cache or in the object cache if any.
c) When a term is updated or deleted, we need to clean field and object cache entries for each object that has $tid as a value in one of the fields of type 'taxo field'
This could be done by a function provided by Field API (function name obviously debatable). Taxonomy doesn't need to know that node objects are cached globally while users are not. Field API, OTOH, knows that already.
As outlined above, the only requirement *if we want to allow object-level cache* is that, in addition to whether an object type maintains its own object-level cache, Field API needs to know the name of the cache table (or assume 'cache_[obj_type]' ?) and the structure of the cids (or assume [id] ?).
[edit: also, note that you don't need to run field_attach_extract_ids() on the results of field_attach_query(), which has a 'return raw ids' mode]
Thoughts ?
Comment #208
catchLooks great to me, and similar to what I was thinking in #439186: Cacheable object API except put much better.
Comment #209
yched CreditAttribution: yched commentedNote that the proposal in #207 critically requires we make some progress on designing #415044: Indexes for field storage.
Comment #210
catchMultiple cache getting is now over at #333171: Optimize the field cache. And a spin off of that #369011: Performance: do check_markup() in text_field_load().
frando just posted #460320: Standarized, pluggable entity loading (nodes, users, taxonomy, files, comments) which would give us a clean way to do this for nodes (and users, and taxonomy terms, and etc.) from contrib. Marking this postponed on that issue.
Comment #211
catchSee how this goes: http://drupal.org/project/entitycache
Comment #212
catchNew issue #597236: Add entity caching to core.
Comment #213
greenc CreditAttribution: greenc commentedsubscribe
Comment #214
XanoWe handle this at the entity level now.