Updated: Comment #54
Problem/Motivation
When you pass in a $cid that is longer than 255 characters, it currently results in an exception because it's longer than the VARCHAR column.
In addition, the recently added PhpBackend fails for long cache IDs (300 characters in the test).
Proposed resolution
For the database backend, use a solution like the memcache module's, where a cid longer than 255 characters is replaced by a truncated cid plus a hash, keeping it within the length limit while staying partially readable.
For PhpBackend, replace the cid value with a hashed version that is consistently short enough and file-system safe.
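The two normalizations described above can be sketched roughly as follows. This is an illustrative Python sketch of the technique only: the actual implementation is PHP, uses Crypt::hashBase64(), and the function names here are hypothetical.

```python
import base64
import hashlib

CID_MAX_LENGTH = 255  # length of the cid VARCHAR column in the cache tables

def hash_base64(data: str) -> str:
    # URL-safe, unpadded base64 of a sha256 digest: always 43 chars and
    # filesystem-safe, roughly what Drupal's Crypt::hashBase64() produces.
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def normalize_database_cid(cid: str) -> str:
    # Short cids pass through untouched; long ones keep a readable prefix
    # and end with the hash, so the result is exactly 255 chars.
    if len(cid) <= CID_MAX_LENGTH:
        return cid
    suffix = hash_base64(cid)
    return cid[:CID_MAX_LENGTH - len(suffix)] + suffix

def normalize_php_cid(cid: str) -> str:
    # PhpBackend stores items as files, so the cid is always hashed down
    # to a short, filesystem-safe name.
    return hash_base64(cid)
```

A 300-character cid thus becomes a 255-character value whose first 212 characters are still the original, readable prefix.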
Remaining tasks
Back-port the database fix to Drupal 7.
User interface changes
none
API changes
none
follow-up: #2270607: Automatically hash cid's in Cache\DatabaseBackend for better lookup performance
Comment | File | Size | Author |
---|---|---|---|
#89 | 2224847-89.patch | 13.02 KB | hgoto |
#89 | 2224847-89-test_only_should_fail.patch | 9.38 KB | hgoto |
#82 | interdiff-2224847-82.txt | 2.66 KB | damiankloip |
#82 | 2224847-82.patch | 12.65 KB | damiankloip |
#71 | 2224847-71.patch | 12.36 KB | pwolanin |
Comments
Comment #1
Wim Leers: msonnabaum suggested it should just throw an exception, which I agree with.
This implies we support cache keys greater than 255 characters. Do we really want to do that? If we fix this for the database cache back-end, does that mean we also need to apply it for other cache back-ends?
Related: #2224861: Cache SystemMenuBlock and BookNavigationBlock per active trail (currently cached per URL, this breaks on very long URLs).
Comment #2
Berdir: Well, the cache API consumers would then be doing exactly the same thing?
255 is an arbitrary limit that's specific to the database. Another backend might have a different limit? Maybe even lower and then it would need to do this anyway?
And yes, this needs to be documented either way.
The issue was created while discussing CMI file part lengths, which could also result in too-long cache IDs. @alexpott and @catch both agreed with this.
Comment #3
danblack CreditAttribution: danblack commented:
> 255 is an arbitrary limit that's specific to the database
255 isn't the MySQL limit - 65535 is (see #2224295). Postgres has a limit in the 1 GB range, and SQLite doesn't have a limit.
If you're going to hash a cid, put the hash in a new column and index it. Then search by SELECT ... WHERE cid = s AND cidhash = md5(s) to avoid hash collisions.
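A minimal sketch of that composite lookup, following the md5-based scheme suggested above (illustrative Python + SQLite; the schema and helper names are hypothetical, not the actual patch):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a fixed-size hash column leads the composite
# primary key, while the full cid is kept alongside it.
conn.execute("""
    CREATE TABLE cache (
        cidhash TEXT NOT NULL,
        cid     TEXT NOT NULL,
        data    BLOB,
        PRIMARY KEY (cidhash, cid)
    )
""")

def cid_hash(cid: str) -> str:
    return hashlib.md5(cid.encode("utf-8")).hexdigest()

def cache_set(cid: str, data: bytes) -> None:
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                 (cid_hash(cid), cid, data))

def cache_get(cid: str):
    # Matching on both columns makes hash collisions harmless: two
    # colliding cids still differ on the cid equality check.
    row = conn.execute(
        "SELECT data FROM cache WHERE cidhash = ? AND cid = ?",
        (cid_hash(cid), cid)).fetchone()
    return row[0] if row else None
```

The index on the fixed-size hash narrows the search; the extra cid comparison only disambiguates in the (astronomically rare) collision case.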
Comment #4
Berdir: 255 is an arbitrary limit that's specific to the database *cache backend implementation*.
Theoretical limits of text fields are not relevant; the cache tables default to cid = 255, and making it longer would just make the indexes more complicated.
The memcache module in 7.x already does the hash, as a cid for memcache is limited too. Not sure why it has to, though, as the database has basically the same limit: http://drupalcode.org/project/memcache.git/blob/refs/heads/7.x-1.x:/dmem...
Comment #5
Berdir: Started on this, but we have a problem: it breaks getMultiple(), as we lose the original cid. Memcache doesn't have that problem, as it stores the cache object including the cid.
Comment #7
Berdir: Not much of a problem after all.
#2230187: Invalid cache id generation when creating comment field was fixed as a duplicate of this.
Also note that this fixes an edge case where the array of cache IDs passed in has associative array keys containing characters like a . or :, which break database placeholders; we hit that in one of the language override issues.
Comment #8
catch: Memcache already does this, and has for a long time:
http://drupalcontrib.org/api/drupal/contributions!memcache!dmemcache.inc...
The default limit (which can be varied per instance) is 250 there instead of 255, so throwing an exception at 255 wouldn't help in that case.
Another issue that strongly suggests doing this automatically is #2224861: Cache SystemMenuBlock and BookNavigationBlock per active trail (currently cached per URL, this breaks on very long URLs) - cache granularity/contexts mean you can bloat cid length massively, but there's no way to enforce this really.
Comment #9
Berdir: Yep, already linked to that in #4 :)
I did wonder why it does that for a moment, but the thing is that memcache also has a per-site prefix and another prefix for the cache bin, so the actual cid length varies based on those prefixes and differs between different sites, which makes it impossible to handle it in the code that uses the cache.
There's also no question of whether we should hash or not: the current render cache system, with an unlimited set of cache keys and contexts, will sooner or later run into this, as the existing bug reports already show. So the only question is who needs to do it, and this seems much easier...
Comment #10
sun: If we're using sha1, then we may as well use md5, right?
→ Faster + shorter, leaving more chars for the actual/original key?
Comment #11
Berdir: I used sha1 because that's what memcache uses; it says that's a good combination of fast and few collisions - see my link (I think catch's is older). It has been made configurable there, but that seems a bit overkill?
Comment #12
danblack CreditAttribution: danblack commented: I'm working on an alternative like the attached - not quite there yet, however. Feedback welcome.
Comment #13
Wim Leers: The patch in #7 looks *great*. I'm not sure what #12 is trying to achieve?
Comment #14
danblack CreditAttribution: danblack commented:
> The patch in #7 looks *great*. I'm not sure what #12 is trying to achieve?
Replaced #12 with an updated one (it incorporates part of the binary database fields patch).
What I'm achieving with this patch is a composite primary key that has the binary result at the start, to facilitate more rapid determination of an existing key for those DB engines that have a B+tree index (i.e. InnoDB and possibly others).
What I'm also achieving is an implementation that is immune to hash collisions.
On a more philosophical level, databases should use separate columns for different data rather than concatenating it into single fields.
Comment #16
sun: Interesting. I understand @danblack's alternative proposal to essentially
I think I like that approach more, because it makes the storage use the most optimal approach to data, whereas the human-readable cid is only retained for manual cache data table introspection purposes.
(For that matter, the cid column should be moved last in the schema, and we can turn it into VARCHAR(1000) or even TEXT, if we're still hesitant to leverage VARCHAR > 255... despite it being supported by all of the database engines primarily supported by core, as I recently researched in #2181549-31: Provide a StringLong field item with schema type 'text' without a 'format' column)
Comment #17
danblack CreditAttribution: danblack commented: Completed.
The cache-do-not-test.patch is the raw cache patch without the #710940 database binary types; cache-with-binary-datatypes.patch includes them so it will work.
Comment #18
danblack CreditAttribution: danblack commented
Comment #19
danblack CreditAttribution: danblack commented: The test will fail, as #710940 is a dependency.
Comment #22
mikeytown2 CreditAttribution: mikeytown2 commented: In D8 do we do any wildcard cache ops (D7 has cache_clear_all())? If we do, then the cid column will still need to be indexed.
Comment #23
danblack CreditAttribution: danblack commented:
> In D8 do we do any wildcard cache ops (D7 has cache_clear_all())? If we do then the cid column will still need to be indexed.
Not in the interface or API currently.
We'll assess it if the need comes up; however, I'd lean towards a tag-based purge approach (assuming I'm understanding tags correctly, which may not be the case).
Comment #24
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #25
catch: There are no wildcard clears in 8.x, just tags.
Comment #26
pwolanin CreditAttribution: pwolanin commented: I'm not sure about the two-part index - possibly a performance hit, and it's not clear why it's useful.
The patch needs work: sha1 should never be used in new Drupal code. Use SHA-256 or SHA-512. The latter would be a 64-byte column and has a really astronomically small (or possibly zero) chance of hash collision for strings of length < 1000.
Comment #27
damiankloip CreditAttribution: damiankloip commented: What is wrong with sha1 exactly?
Comment #28
danblack CreditAttribution: danblack commented: If you don't know, measure it.
As the lookup attempts to match the entire hash and string, I've indexed as much of the string as well. Whether a partial index here is useful depends on the database engine. You're right in that it's not strictly necessary.
Did you read comments 10 and 14? If you look closely at the patch, you'll see that the lookup is done on the cidhash and the cid. As such, hash collisions are perfectly OK; the hash is there as an optimisation, to get a fixed size and as even a distribution as possible. As sun said, we could use md5.
You'll also see that the cidhash isn't declared unique. As a primary key, the composite of cidhash and cid(255) is unique; however, if objections to a composite primary key persist, a non-unique cidhash key would be my alternative.
Even if it were unique, the matching on cid and cidhash makes it immune to the vulnerabilities of hash collisions. It would just mean you could only cache one of the colliding entries.
Perhaps if we got #710940: Support for BINARY and VARBINARY in Database Schema committed, we could actually make some progress here (if we stop treating hashes like magical crypto gospel and focus on what the problem and proposed solution actually are).
I hope my clarifications alleviate your concerns.
Comment #29
pwolanin CreditAttribution: pwolanin commented: https://drupal.org/writing-secure-code/hash-functions
As of Drupal 7, we removed every use of md5 and sha1 from core. There was a very long debate, but it has been a settled decision for almost four years.
Comment #30
damiankloip CreditAttribution: damiankloip commented: That is not really applicable when hashing a cache key. There are no security implications here.
Comment #31
pwolanin CreditAttribution: pwolanin commented: @damiankloip - please read the Drupal 7 issue; this is settled policy. We should not release any 8.x code with these deprecated functions added. It's as much about setting a good example as anything else, and frankly the average developer can't tell when something matters for security and when it doesn't, so we need a policy that is secure by default.
It's also not worth debating on performance grounds. The performance difference between hash('sha512') and sha1() is not enough to matter - they are both built-ins, so each runs in a couple of microseconds.
Comment #32
pwolanin CreditAttribution: pwolanin commented: To be clear: I think there is a good idea here. We shouldn't have to care about the length of the $cid passed in; we should hash it to something consistent and use the hash as the actual primary key, while storing the user-supplied cid as unindexed text for debugging.
Comment #33
damiankloip CreditAttribution: damiankloip commented: Isn't that basically what the patch in #19 is trying to achieve? Not saying that's what that patch is going for, but storing anything in a cache bin for 'debugging' is *not* a good idea IMO.
I wonder whether that whole approach is a bit overkill. I think I like the initial approach Berdir took in #7 better. That's the approach memcache uses, which works fine. There are also other cases where this has worked, like Views doing it for block IDs that are too long in D7.
I would say using sha512 is not a good idea here, as that will be what, 128 hex chars? That's a big chunk of the key. sha256 could be a good compromise at 64 chars. We want whatever the minimum is that satisfies this ridiculous requirement to not have md5 or sha1 usage in the code base.
Comment #34
catch: We can open a new issue to reverse the md5() policy. There's plenty of md5() in a grep of core, given it's used in vendor libraries anyway.
I'd be fine with the approach in #7; this is the database implementation of the cache backend, which shouldn't be used on high-performance sites anyway, and as long as there's an upgrade path it's also safe to change post-beta/release.
Comment #35
pwolanin CreditAttribution: pwolanin commented: @damiankloip - it depends on how these things are encoded (base16, base64, or binary).
sha512 base64 encoded is like 86 chars.
That's all you need as the index. Nothing else. We can implement that without needing binary column support.
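These lengths are easy to verify: an n-byte digest becomes 4 * ceil(n / 3) base64 characters with padding, and stripping the '=' padding leaves ceil(4n / 3). A quick check:

```python
import base64
import hashlib

def b64_digest_len(algo: str) -> int:
    # Length of an unpadded, URL-safe base64 encoding of a digest.
    digest = hashlib.new(algo, b"any-cid").digest()
    return len(base64.urlsafe_b64encode(digest).rstrip(b"="))

for algo in ("md5", "sha1", "sha256", "sha512"):
    print(algo, b64_digest_len(algo))
# md5 22
# sha1 27
# sha256 43
# sha512 86
```

So a base64-encoded sha512 costs 86 characters of the key, while sha256 costs 43 - the figure the committed patch's hashing ends up using.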
Comment #36
pwolanin CreditAttribution: pwolanin commented: The last patch didn't apply.
Here's an adjusted implementation using a base64-encoded hash, which can work without the binary column definition.
Comment #37
danblack CreditAttribution: danblack commented: One of the factors in why the DB backend can't be used on high-performance sites is its crappy database implementation. Smaller indexes of non-character types, with a more evenly distributed key pattern, would increase its usefulness.
The default UTF-8 character encoding introduces significant overhead in the database when doing comparisons, so I'd rather not use it. It also increases the size by at least three times.
Comment #38
danblack CreditAttribution: danblack commented
Comment #40
pwolanin CreditAttribution: pwolanin commented: Nice test failures - it seems PhpBackend has a similar bug with cid length. So let's fix that too.
We are decreasing the key size from up to 255 down to 43 in this patch, and making it a better key, since it will be well distributed rather than many entries sharing a common prefix. It also fixes the bug with overly long cids. So I think this is a good step forward, and it doesn't depend on a blocked patch.
We can also make a BINARY column for MySQL specifically. Implementing this, I also came across a bug in the schema code: it calls drupal_strtoupper() or drupal_strtolower(), which it seems may not be loaded (I had a fatal with drush) and which is pointless, since the SQL data types are only ASCII, so it can just use strtoupper()/strtolower().
Also, a utf8 VARCHAR column containing single-byte characters will only be as long in bytes as the number of characters, plus the byte or two needed to hold the length. http://dev.mysql.com/doc/refman/5.5/en/storage-requirements.html
So, I don't know that the use of a binary column for mysql here will actually make much difference.
Comment #41
pwolanin CreditAttribution: pwolanin commented: Discussing in IRC, Crell says he doesn't object to the use of mysql_type in the patch if there is a clear need and performance benefit.
Discussed the effect of BINARY vs VARCHAR with @pimvanderwal who is a MySQL expert, and his take was that index performance would be similar for BINARY vs VARCHAR but VARCHAR would be likely to allocate 3x as much memory if using the utf8 charset. So, given that schema API doesn't support applying latin1 or ascii charset per column, we'd need to either apply it to the whole table, or use BINARY for the indexed column. I don't think applying it to the whole table is viable since it's possible the cache tags contain non-ASCII characters, so the solution in the patch is the best option until we have BINARY column support in schema API.
Comment #42
pwolanin CreditAttribution: pwolanin commented: I also noticed that the patch fixes a bug in the DB backend that would cause it to throw an exception when asked to delete or invalidate an empty list of cache IDs.
Here's a small test addition (should show 2 exceptions) plus incorporating that with the prior patch.
Comment #44
damiankloip CreditAttribution: damiankloip commented: Weird - it seems like both catch and I said we liked the approach from #7.
Comment #45
pwolanin CreditAttribution: pwolanin commented: @damiankloip - can you explain why that would be better? Hashing the value gives you a better key in terms of a B-tree index, and a consistent length.
Comment #46
Wim Leers: I favor the simplicity of #7, but pwolanin also makes an interesting case.
I can't help but wonder what the real-world performance win of pwolanin's patch is. OTOH, if a MySQL expert says there's an advantage to using an optimized primary key (which is also used as the index), do we have a good reason not to do that, if it will speed up cache gets, even if the benefit may be marginal for most sites? (Not sure it is; assuming the worst case here.)
Overall, I think we should just get this done (and minimize bikeshedding), no matter whether it's #7 or #42 — and I think it's up to catch to decide.
So, I'd RTBC #42 (to get catch's feedback), except that I found a bunch of nitpicks:
- Incomplete docblock.
- s/cid/cache ID/
- s/Cache ID/cache ID/
- "The hashed version of the original cache ID (after applying Crypt::hashBase64())."
- s/invalidate/invalidateMultiple()/
- Incomplete docblock.
- s/entry/item/
- See above.
- s/delete/deleteMultiple()/
Comment #47
pwolanin CreditAttribution: pwolanin commented
Comment #48
pwolanin CreditAttribution: pwolanin commented: Thanks, Wim. This fixes the comments as suggested, including a couple more similar to what you pointed out.
Comment #49
Wim Leers: From #46:
Comment #50
sun: As also discussed with @danblack in IRC, I don't understand why we're hashing anything at all here.
We can simply turn the 'cid' column into [VAR]BINARY. A $cid is a string, which is a valid binary value. Hashing the cid is unnecessary.
The one and only gain we're after is to eliminate the charset and collation from the column. Essentially following the best practice for storing UUIDs in databases that do not support a UUID type natively.
@danblack already invested quite some work into adding support for *BINARY types to the database drivers. In case that doesn't happen in time, a feasible alternative approach is to use a custom charset + collation for the 'cid' column (e.g., ascii_bin).
Lastly, we're using VARCHAR columns with a length of >255 in core already; we can and should simply increase the column's length. The PhpBackend may need to hash (since its max length is constrained by the filesystem), but that SHOULD NOT affect the other implementations in any way.
Comment #51
catch: Yeah, I'd personally go for a longer VARCHAR, or #7. The extra column is extra complexity for no obvious benefit.
Comment #52
damiankloip CreditAttribution: damiankloip commented: OK, here is a new patch. catch, Peter, and I spoke about this on IRC.
I will not provide an interdiff, as this is a rerolled approach from #7, with stuff from #50 too.
- Uses Crypt::hashBase64() to hash long cids
- Empty getMultiple() coverage
- Port of cid hashing code to PhpBackend
- Use foreach and array_chunk() instead of do/while loops in invalidateMultiple()/deleteMultiple()
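The chunking change in the last bullet can be sketched as follows (illustrative Python; the real code is PHP using array_chunk() around the database's DELETE query, and the names here are hypothetical):

```python
def chunked(items, size):
    # Python analogue of PHP's array_chunk(): fixed-size batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def delete_multiple(cids, run_delete_query, chunk_size=1000):
    # Batching keeps each query under the database's placeholder limit.
    # An empty cid list yields zero chunks, so no query runs at all --
    # which also avoids the empty-list exception the patch fixed.
    for batch in chunked(list(cids), chunk_size):
        run_delete_query(batch)
```

For example, 2500 cids produce three batches of 1000, 1000, and 500, and an empty list produces no query at all.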
Comment #53
pwolanin CreditAttribution: pwolanin commented: Discussed again with damiankloip in IRC - I think the version that always uses a hash for the file name in PHP storage is preferable, since we don't know the file name length limit on different systems. We already use this base64 hash for aggregated CSS and JS file names, so it should be safe.
Also, we need to remove the array_splice() calls that are now a no-op.
So, the PhpBackend is the same as #48, otherwise just 2 lines changed like this:
Comment #54
pwolanin CreditAttribution: pwolanin commented
Comment #55
danblack CreditAttribution: danblack commented: There have been a lot of claims about performance, including by myself, without any proof. Given the amount of bikeshedding so far, I've attached a load test for perusal. Let's judge the suitability of the patches on what really matters: a measurable test that replicates real load.
You'll need to run the script multiple times in parallel to generate some realistic load and combine results. Improvements to the script welcome.
Alternatively, if you want to measure it on a real system, I recommend pulling out http://www.percona.com/doc/percona-toolkit/2.2/pt-index-usage.html and tuning the slow query log [#560228], or using a combination of the binary log (for updates) and a network capture (transformed using http://www.percona.com/doc/percona-toolkit/2.2/pt-query-digest.html) on an operational system to see what the performance is really like.
Any form of hashing/truncation of the key is susceptible to collisions. Just because a collision is hard to generate doesn't mean it's impossible, and future advances in cryptanalysis may make them easier to predict, so please follow the lead of #17 and match on the cid as well as its truncated form. An EXPLAIN query should show that the lookup uses the index followed by the cid; given that the cid is retrieved anyway, and the low probability of collisions, this shouldn't impact performance - but hey, don't take my word for it, TEST IT!
Comment #56
danblack CreditAttribution: danblack commented: Missed the test file. Oops.
Comment #57
pwolanin CreditAttribution: pwolanin commented: @danblack - let's move the performance discussion to the follow-up issue? catch was clear in IRC that this needs to be a back-portable fix, and any schema change makes that nearly impossible due to the way cache tables are created.
Comment #58
danblack CreditAttribution: danblack commented
Comment #59
damiankloip CreditAttribution: damiankloip commented: We don't really need the 'IN' specified; that will get done for us. It depends on whether we want that explicitness, I guess?
We should just assert that an empty value is returned too.
Sorry @danblack, we should discuss performance in another issue. This one is really about fixing the cid length, and this approach lets us do that in an easily backportable way. Also, the database cache backend is not what you want to use if you want performance anyway, so is it really worth optimising this so much?
Comment #60
pwolanin CreditAttribution: pwolanin commented: @damiankloip - I don't think the explicit IN hurts? It was already there, in any case.
Comment #61
pwolanin CreditAttribution: pwolanin commented: 53: 2224847-53.patch queued for re-testing.
Comment #62
damiankloip CreditAttribution: damiankloip commented: Fine - what about the other comment?
Comment #64
damiankloip CreditAttribution: damiankloip commented
Comment #65
pwolanin CreditAttribution: pwolanin commented: Yes, checking that the return value is empty is a good idea.
Comment #66
pwolanin CreditAttribution: pwolanin commented: Re-reading the patch, it feels like ensureCidLength() is not really a clear method name. Something like truncateCid() or normalizeCid() or hortenCid() or ...?
Comment #67
damiankloip CreditAttribution: damiankloip commented: Why is that not a good name? Seems OK to me: it is ensuring the cid is within our length threshold. truncateCid() is not a good name IMO, as the cid may not be truncated at all, and in fact won't be in the majority of cases. (s)hortenCid :) has the same problem. normalizeCid() could be possible. I still think the initial name, ensureCidLength(), is better though...
Comment #68
pwolanin CreditAttribution: pwolanin commented: I have no idea what the return value of an "ensure" method would be. To me, it would be as likely, or more likely, to throw an exception when the cid is too long as to shorten it. And if it did alter the cid, I'd as soon expect it to do so by reference as by return value.
The semantics seem similar to the word "validate", so to me these names don't communicate what's actually happening.
Comment #69
pwolanin CreditAttribution: pwolanin commented: Based on the IRC discussion with damian, maybe my comment wasn't clear. I agree we should NOT call it something like "validate", and to me "ensure" reads the same as "validate".
It seems we are both OK with something like "normalize" as a verb.
Comment #70
damiankloip CreditAttribution: damiankloip commented: Yes, I think I could live with that :)
Comment #71
pwolanin CreditAttribution: pwolanin commented: Re-roll for PSR-4 and the method name change.
Comment #72
catch: Marked #2275905: Clicking the enable language support link for a field settings form results in an SQL Insert error as a duplicate. Since that's a fatal error in core, bumping this to critical.
Comment #73
damiankloip CreditAttribution: damiankloip commented: 71: 2224847-71.patch queued for re-testing.
Comment #74
pwolanin CreditAttribution: pwolanin commented: 71: 2224847-71.patch queued for re-testing.
Comment #77
pwolanin CreditAttribution: pwolanin commented: Re-roll for a trivial conflict in the test code.
Comment #78
kgoel CreditAttribution: kgoel commented: The patch looks good except for some very minor nitpicks.
Get rid of an extra period.
Comment #79
pwolanin CreditAttribution: pwolanin commented: Thanks, fixed.
Comment #80
kgoel CreditAttribution: kgoel commented
Comment #81
catch: Could we add a normalizeCid() method here, the same as on the database backend? I see at least three separate calls to Crypt::hashBase64().
Comment #82
damiankloip CreditAttribution: damiankloip commented
Comment #83
pwolanin CreditAttribution: pwolanin commented: Thanks, @damiankloip - looks good. Better to have a central method in case we ever want to tweak it, I guess.
Comment #84
damiankloip CreditAttribution: damiankloip commented: Abstraction ftw :)
Comment #85
catch: Committed/pushed to 8.x, thanks! Moving to 7.x for backport.
Comment #89
hgoto CreditAttribution: hgoto as a volunteer commented: This issue is old, but I created a patch for the D7 backport.
In my understanding, D7 doesn't have an equivalent of PhpBackend, so we only need to backport DrupalDatabaseCache (the D7 version of DatabaseBackend) and add tests for it.
I believe the test coverage added by this patch is almost the same as the D8 version's and is enough, but I'd like someone to check that point. Thank you.
Comment #90
hgoto CreditAttribution: hgoto as a volunteer commented
Comment #93
alexpott: @hgoto, the new backport policy is to open a new 7.x issue and link it to the 8.x issue.
Comment #94
hgoto CreditAttribution: hgoto as a volunteer commented: @alexpott, thank you! I see - I'll open a new issue.
Comment #95
alexpott: @hgoto, thanks for working on this :)
Comment #96
alexpott
Comment #97
hgoto CreditAttribution: hgoto as a volunteer commented: I opened a new issue for the D7 backport: #2793297: [D7] Automatically shorten cid's in DrupalDatabaseCache. I'd like someone to review it.