With #324313: Load multiple nodes and terms at once on the horizon, I was thinking about revisiting #111127: Cache node_load (since the multiple load has no effect, or a very small cost, for single loads). However, doing N cache_gets when otherwise you could load the nodes in a single query per hook seems a bit much.
So, here's an initial patch which just copies cache_get() to a new function that accepts and returns an array instead of a single cid, moves some code around, and moves what would otherwise be entirely duplicated code to a shared private function.
Completely untested. I'm going to bask while t.d.o does it for me ;)
Comment | File | Size | Author |
---|---|---|---|
#61 | cache_42.patch | 12 KB | catch |
#59 | cache.patch | 12 KB | catch |
#56 | cache_field.patch | 17.82 KB | catch |
#54 | cache_field.patch | 16.54 KB | catch |
#52 | cache_field.patch | 16.54 KB | catch |
Comments
Comment #1
moshe weitzman CreditAttribution: moshe weitzman commented
nice.
If you are going to loop over each result anyway, then don't do fetchAllAssoc('cid'). Just foreach the resultset.
Comment #2
catch
I wanted the results indexed by cid, although a typo in the foreach didn't make it clear why that was useful. I'm not sure how expensive fetchAllAssoc() is though - if it's doing a lot of work, then we should probably key the results in that same foreach loop.
New patch, changes the return to
return array('cached' => $cache, 'misses' => $cids);
So in the node multiple load implementation it'd look something like this:
This would save doing array_diff_key(array_flip($cids), $cache); whenever you care about the stuff that wasn't fetched from cache, which is presumably all the time.
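The snippet catch posted was lost to the formatting, but a consumer of that 'cached' / 'misses' return shape might look roughly like this (plain-PHP simulation; cache_get_with_misses() and the $bin array are illustrative stand-ins, not the patch's code):

```php
<?php
// Simulates the array('cached' => ..., 'misses' => ...) return shape from #2.
// $bin is a plain array standing in for a cache table, keyed by cid.
function cache_get_with_misses(array $cids, array $bin) {
  $cache = array_intersect_key($bin, array_flip($cids));
  $misses = array_values(array_diff($cids, array_keys($cache)));
  return array('cached' => $cache, 'misses' => $misses);
}

$bin = array('node:1' => 'first', 'node:3' => 'third');
$result = cache_get_with_misses(array('node:1', 'node:2', 'node:3'), $bin);
// The caller gets the hits keyed by cid, plus the cids that still need a
// database query - no array_diff_key(array_flip($cids), $cache) needed.
print implode(',', array_keys($result['cached'])) . "\n"; // node:1,node:3
print implode(',', $result['misses']) . "\n";             // node:2
```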
Comment #3
catch
Removed the fetchAllAssoc() and created a new array rather than unsetting stuff from the result set. Also moved garbage collection into a shared function to remove the remaining code duplication.
Comment #5
catch
Re-roll for the test bot.
Comment #6
yched CreditAttribution: yched commented
looks like you mean:
Rerolled for that. No credits plz :-)
Comment #7
lilou CreditAttribution: lilou commented
Coding issue: 2 spaces
Comment #8
catch
I've rolled this into #111127: Cache node_load - but I don't want to close it out as a duplicate yet since there are potentially other use cases (although this patch is wrong; fixes are in the patch for the other issue). So marking postponed pending what happens with node_load().
Comment #9
drewish CreditAttribution: drewish commented
subscribing
Comment #10
David Strauss
subscribing
Comment #11
catch
Here's the current work from the cache node_load() patch broken out - but this would all happen a lot quicker if more people could review that other patch instead of subscribing to this postponed one. I'm hoping we can mark this as a duplicate when node load goes in.
Comment #12
catch
Postponed for almost six months is long enough.
Here's a refreshed version of the patch from node_load() which tidies up the way the return value from cache_get_multiple is handled and adds cache clearing with an array of cids. Both cache_get_multiple() and cache_clear_all($array_of_cids) have unit tests.
I also modified field_attach_load() to use cache_get_multiple(). Since we don't actually have any multiple loaded objects with fields attached to them in core yet, this'll take a bit of setup to benchmark.
Comment #13
catch
Seems I hadn't actually saved field.attach.inc and that bit doesn't work yet. Have to go out, but here's a snapshot with that bit so far.
Comment #14
catch
Patch works now.
Comment #16
catch
Benchmarked HEAD with http://drupal.org/files/issues/body-as-field-372743-120.patch, 10 nodes on front page.
7.40 reqs/second
Same again with this patch applied:
7.58-7.62 reqs/second
Saves 9 queries and about 3%.
Comment #17
yched CreditAttribution: yched commented
Minor remarks about field_attach_load():
- If an object is not $cacheable, we shouldn't need to add it to the list of queried $cids
- In the second foreach,
"field:$obj_type:$id:$vid"
could be factorized to a $cid variable?
Comment #18
catch
Added a check before we add an item to the $cids array and consolidated the $cid variable in the foreach. Also made the tests pass.
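yched's factorization boils down to building the composite cache id string in one place; a tiny illustrative sketch (the helper name is hypothetical, not from the patch):

```php
<?php
// Build the "field:$obj_type:$id:$vid" composite cache id once and reuse
// it for both the cache lookup and the later cache_set().
function field_cache_cid($obj_type, $id, $vid) {
  return "field:$obj_type:$id:$vid";
}

print field_cache_cid('node', 5, 7) . "\n"; // field:node:5:7
```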
Comment #19
yched CreditAttribution: yched commented
Dries doesn't seem to like the 'cache_clear_threshold' bit.
I'm not sure whether he doesn't like the slicing itself, or the new variable, though.
Other than that, this looks ready to me.
Comment #20
catch
I'm also not sure if it's the slicing or just the new variable which was the problem - I'm hoping just the new variable, since the slicing is awesome (I can say this because it was Justin's update, not mine, which introduced it).
Comment #22
catch
Re-rolled, removed the variable.
Comment #23
moshe weitzman CreditAttribution: moshe weitzman commented
I reviewed each line and found no issues. Time for this to finally land.
Comment #24
catch
For what it's worth, I did some additional benchmarks / devel logging with this over at http://drupal.org/node/369011#comment-1536282 to see how much we can squeeze out of it.
Comment #25
sun
While we're at it, it would be nice to fix the coding style of this @return.
Hm. Obviously, those were never updated. Of course, because new cache tables are defined/added elsewhere. Instead of fixing them, I'd propose to remove the entire part here.
Wrong indentation here.
Why not simply "Prepare a cached item." ?
Hm. I needed to read that 3 times - maybe replacing "have not expired" with "did not expire," helps?
Please avoid abbreviations in comments.
How about "Delete in chunks when large array is passed." ?
Does this value of 1,000 already account for well-known max_allowed_packet issues?
Missing blank line between functions.
Can we rewrite this to:
Comment #26
catch
Thanks for the review. Attached patch addresses the feedback, slightly improves the doc for $table (we're getting data, not storing it, in cache_get()) and also removes an unnecessary line break before a return elsewhere. Didn't change the $exists check because we have an entire cache test using this style and I kept it consistent - not planning to make the patch twice as big by changing the rest of the test.
max_allowed_packet defaults to 1 MB on most systems I think - the MySQL docs say it's 16 MB now. 1,000 ought to be conservative even with long cache IDs.
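The chunking being discussed can be sketched in plain PHP (the helper is illustrative - the real patch issues one DELETE ... WHERE cid IN (...) per chunk through the database layer):

```php
<?php
// Split a large array of cids into slices of 1,000 so that no single
// DELETE query's IN() list can grow past the packet limit.
function cache_clear_cids_chunked(array $cids, $chunk_size = 1000) {
  $chunk_sizes = array();
  foreach (array_chunk($cids, $chunk_size) as $chunk) {
    // The real code would run one DELETE per $chunk; here we just record
    // how many cids each query would carry.
    $chunk_sizes[] = count($chunk);
  }
  return $chunk_sizes;
}

print implode(',', cache_clear_cids_chunked(range(1, 2500))) . "\n"; // 1000,1000,500
```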
Comment #28
catch
test bot meltdown.
Comment #29
David Strauss
The MySQL docs say a 16 MB limit for the client and a 1 MB limit for the server. This technically means the max accepted packet size, not generated.
Comment #30
catch
Gotcha - a straight DELETE FROM with 1,000 cache ids isn't going to hit 1 MB; much more likely to hit it with a straight cache_set() on a large object with lots of text in it.
Comment #31
sun
cid == 255 chars max.
* 1,000
+ remaining query ~= 50 max.
:= 255,050 chars max. < 1 MB?
Sorry, I'm not good at such calcs.
Comment #32
catch
Me neither. But according to http://www.unitconversion.org/data-storage/characters-to-kilobytes-conve... it comes to 259k.
Comment #33
catch
Rerolled for getInfo() static method changes.
Comment #34
catch
Changes in field_attach_load() made this no longer apply. Re-rolled, but I was getting a lot of exceptions / fails - however fields loaded via the CCK UI work fine here, so not sure why. Uploading to see how bad it really is.
Comment #36
yched CreditAttribution: yched commented
catch: you might want to wait until #456488: cache field data only for the current revision is in
Comment #37
catch
And it is :)
Patch does exactly the same thing, but there's enough changes to make it work with the new field_attach_load() that it could do with another look over before it's back to rtbc.
Comment #38
moshe weitzman CreditAttribution: moshe weitzman commented
Looks proper to me. Bot is happy. Adds nice tests. RTBC.
Comment #39
yched CreditAttribution: yched commented
Reworked the field_attach_load() part to be more in line with the recent changes.
Back to CNR so that catch can review and the bot can retest.
Comment #40
catch
Those changes look good. Test bot is happy.
Comment #41
catch
Bumping to add a summary - I wrote up loads of summaries on node load caching, but there's no proper explanation of what the patch does here.
At the moment the field cache does a cache_get() for every object loaded on a page. With database caching this means a query per object - which makes it the exception in terms of how our fieldable objects are loaded.
The field and filter caches are currently the only places in HEAD where we have per-object queries in listings. When combined with #369011: Performance: do check_markup() in text_field_load() we should have exactly the same number of database queries on a page regardless of how many objects are loaded.
So the patch itself adds cache_get_multiple() - which takes an array of cids, and gives you an array of cache records back (and your $cids array reduced to the items you still need for convenience). With node or profile listings, this reduces the field cache lookups from 10 (or 30, or 300) to 1.
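As a self-contained simulation of that call pattern (cache_get_multiple_sim() and the plain $bin array are stand-ins for the real function and cache table; the by-reference reduction of $cids matches the behaviour described above):

```php
<?php
// One multi-get replaces N single gets; $cids is passed by reference and
// comes back reduced to the misses, ready for the fallback database query.
function cache_get_multiple_sim(array &$cids, array $bin) {
  $cache = array_intersect_key($bin, array_flip($cids));
  $cids = array_values(array_diff($cids, array_keys($cache)));
  return $cache;
}

$bin = array('field:node:1' => 'body A', 'field:node:3' => 'body C');
$cids = array('field:node:1', 'field:node:2', 'field:node:3');
$hits = cache_get_multiple_sim($cids, $bin);
print count($hits) . "\n";        // 2
print implode(',', $cids) . "\n"; // field:node:2
```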
With the body as fields patch, this saves us 9 queries and gives us an extra 3% requests per second on the default front page with 10 nodes (see #16). If we ever make taxonomy terms fieldable, it'll save us hundreds of lookups to the field cache on node listings - because we load full term objects there, which would call field_attach_load() - and issue a cache_get() for every term on the page.
Swappable cache systems should only have to do:
And not a lot else if there's no benefit to re-implementing the multiple getting. There might be situations where a pluggable system supports something like this and could use it though - to reduce impact of network latency for example.
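The elided hunk presumably showed that trivial fallback. Roughly, a backend with no native multi-get can loop over its single-item getter (self-contained sketch; the class and method names echo the D7-era cache API shape but are assumptions, not the patch's code):

```php
<?php
// A minimal backend whose getMultiple() is just a loop over get(), and which
// reduces $cids to the misses - the contract cache_get_multiple() documents.
class ArrayCacheSketch {
  protected $bin = array();
  public function set($cid, $data) {
    $this->bin[$cid] = $data;
  }
  public function get($cid) {
    return isset($this->bin[$cid]) ? $this->bin[$cid] : FALSE;
  }
  public function getMultiple(&$cids) {
    $cache = array();
    foreach ($cids as $cid) {
      if (($item = $this->get($cid)) !== FALSE) {
        $cache[$cid] = $item;
      }
    }
    $cids = array_values(array_diff($cids, array_keys($cache)));
    return $cache;
  }
}

$backend = new ArrayCacheSketch();
$backend->set('a', 'hit');
$cids = array('a', 'b');
$hits = $backend->getMultiple($cids);
print implode(',', array_keys($hits)) . ' / ' . implode(',', $cids) . "\n"; // a / b
```

A backend like memcache could instead override getMultiple() with a native multi-key fetch to cut network round trips.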
The patch also adds the facility to clear cache ids using an array of ids in addition to the current cid/wildcard/everything options - this should allow for more intelligent cache clearing when that kind of information is available. This was something necessary for the node_load() caching, but might be considered kitten killing as part of this patch. I think it'll be useful in core when combined with things like nodereference #392494: Fields API needs to allow searching of scalars though so leaving it in until I'm told otherwise. That bit also comes with tests.
Comment #42
yched CreditAttribution: yched commented
Multiple cache clear is definitely needed for stuff like 'clear (node|file|...)-reference field cache entries when a referenced object is deleted', so I think it belongs in there.
Comment #43
catch
Now that body as field is back to CNR, it's possible to benchmark this usefully.
Front page, 30 nodes. Devel output and ab -c1 -n500
HEAD:
Executed 101 queries in 72.98 milliseconds
3.42 [#/sec]
Body as field:
Executed 73 queries in 71.71 milliseconds
2.86 [#/sec]
Body as field + latest patch on this issue:
Executed 43 queries in 58.79 milliseconds.
3.03 [#req/sec]
Comment #45
catch
tagging.
Comment #46
catch
With #413192: Make taxonomy terms fieldable the gains are even more obvious - this is with a completely empty field cache. Once we have thousands of records in there the individual queries will take a bit longer. Combined with body as field, half the front page queries will be from the field cache.
Devel, 30 nodes on front page, each with 5 terms attached, works out as 50-ish unique terms on the page:
HEAD + fieldable terms.
Executed 115 queries in 55.51 milliseconds.
Patch:
cache_get_multiple() + fieldable terms
Executed 72 queries in 50.18 milliseconds.
Comment #47
catch
fieldable terms and body as field are in HEAD, so new benchmarks. Can't upgrade a D6 site to get large amounts of content because that's broken at the moment, so one node with 20-odd freetagging terms created manually. Since we're now getting dozens of database hits from cache_get() on node listings with lots of unique taxonomy terms, bumping to critical.
One node, 20 odd terms.
HEAD:
Executed 62 queries in 44.97 milliseconds.
8.31 [#/sec]
Patch:
Executed 45 queries in 32.31 milliseconds.
8.60 [#/sec]
chx also pointed me to http://us3.php.net/manual/en/function.memcache-get.php, which takes an array of cache keys - so it looks like this would help some alternative drivers rather than make things harder, at least in some cases.
Comment #49
catch
Re-rolled after #482622: Make cache friendlier for alternative implementations.
Not sure about _garbageCollection() and _prepareItem() but it all works.
Comment #50
chx CreditAttribution: chx commented
Instead of function _garbageCollection you can just use
protected function garbageCollection
. Although a private function would work too, there is absolutely no reason to hide the function from another implementation that wants to build on the default.
Comment #51
catch
Thanks chx. Re-rolled with those changes.
Comment #52
catch
Spoke to chx and bjaspan in IRC (and had discussed briefly with yched before) - here's a slight extension of the patch.
We now create a cache table for each fieldable entity - i.e. cache_field_node, cache_field_user etc. This means table size (and hence overhead, key sizes etc.) will be a lot smaller, you'd be able to set up separate cache bins in memcache per entity - good for when you have millions of records, and also for backends like memcache which don't handle wildcard cache clearing very well. It also means we can use the object id as the cache key, which should make lookups a tiny bit faster than using composite strings like field:$obj_type:$id.
Couple of very small kittens - hook_modules_installed() isn't called in install.inc, and hook_modules_uninstalled() is currently called after a module and all its tables have been removed - not very useful if you need to check things like hook_fieldable_info() for the module being uninstalled.
Comment #53
yched CreditAttribution: yched commented
The patch doesn't take care of the cache clears with respect to the new 'bins'.
I'm OK with the idea, but do we really need to do this here? The patch has lingered at RTBC for weeks now; I'm not sure this will help it get committed...
Comment #54
catch
yched - if we do two separate patches, then they'll break each other. I'm fed up with re-rolling this one every couple of weeks and don't want to compound that by maintaining two conflicting patches in tandem.
Updated field_cache_clear(), thought I'd already done that, but apparently not...
Comment #55
yched CreditAttribution: yched commented
Same patch as in #52 ;-)
I was not only thinking of field_cache_clear(), but also the various cache_clear_all() calls - mainly in field_attach_*() functions, I think.
One patch / several patches: OK, your call :-)
Comment #56
catch
Doh. Right patch this time.
Comment #58
chx CreditAttribution: chx commented
Could
$cids = array_keys(array_diff_key(array_flip($cids), $cache));
be changed to
$cids = array_diff($cids, array_keys($cache));
? I am just wondering.
Comment #59
catch
Modules loaded in install is sufficiently broken that the table-per-entity part of this is going to end up postponed on four or five other patches.
So splitting that out to #496344: Create a cache table per entity.
This patch just makes the field_attach_load() changes and adds cache_get_multiple() - and yes chx, that array_diff works fine. Note to self: listen to yched more.
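For the record, the equivalence chx asked about in #58 is easy to check in plain PHP (no Drupal required; $cache only needs its keys to be cids):

```php
<?php
// Both expressions compute the cids that were not found in $cache
// (an array keyed by cid).
$cids = array('a', 'b', 'c', 'd');
$cache = array('b' => 'hit', 'd' => 'hit');

$old = array_keys(array_diff_key(array_flip($cids), $cache));
$new = array_values(array_diff($cids, array_keys($cache)));

print implode(',', $old) . "\n"; // a,c
print implode(',', $new) . "\n"; // a,c
```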
Comment #60
yched CreditAttribution: yched commented
catch: heh, no pb, I'm not feeling frustrated ;-)
The Field bits are OK for me, but I'm not enough familiar with the new cache backends to RTBC myself.
chx ?
Comment #61
catch
nudging test bot.
Comment #64
catch
Comment #65
sun
Comment #66
webchick
I spent about a half hour reading through this issue and the associated code. Looks like a good change that's compatible with similar API improvements we've made in this release, and the performance benefits are demonstrated well. Committed to HEAD!