Updated: Comment #54
Problem/Motivation
When you pass in a $cid that is longer than 255 characters, it currently results in an exception because it's longer than the VARCHAR column.
In addition, the recently added PhpBackend fails for long cache IDs (300 characters in the test).
Proposed resolution
For the database backend, use a solution like the memcache module's, where a cid longer than 255 characters is replaced by a truncated cid plus a hash, keeping it within the length limit while staying partially readable.
For PhpBackend, replace the cid value with a hashed version that is consistently short enough and file-system safe.
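The two normalizations described above can be sketched roughly as follows. This is an illustrative Python sketch of the technique only: the actual implementation is PHP, uses Crypt::hashBase64(), and the function names here are hypothetical.

```python
import base64
import hashlib

CID_MAX_LENGTH = 255  # length of the cid VARCHAR column in the cache tables

def hash_base64(data: str) -> str:
    # URL-safe, unpadded base64 of a sha256 digest: always 43 chars and
    # filesystem-safe, roughly what Drupal's Crypt::hashBase64() produces.
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def normalize_database_cid(cid: str) -> str:
    # Short cids pass through untouched; long ones keep a readable prefix
    # and end with the hash, so the result is exactly 255 chars.
    if len(cid) <= CID_MAX_LENGTH:
        return cid
    suffix = hash_base64(cid)
    return cid[:CID_MAX_LENGTH - len(suffix)] + suffix

def normalize_php_cid(cid: str) -> str:
    # PhpBackend stores items as files, so the cid is always hashed down
    # to a short, filesystem-safe name.
    return hash_base64(cid)
```

A 300-character cid thus becomes a 255-character value whose first 212 characters are still the original, readable prefix.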
Remaining tasks
Back-port the database fix to Drupal 7.
User interface changes
none
API changes
none
follow-up: #2270607: Automatically hash cid's in Cache\DatabaseBackend for better lookup performance
Comment | File | Size | Author |
---|---|---|---|
#89 | 2224847-89.patch | 13.02 KB | hgoto |
#89 | 2224847-89-test_only_should_fail.patch | 9.38 KB | hgoto |
#82 | interdiff-2224847-82.txt | 2.66 KB | damiankloip |
#82 | 2224847-82.patch | 12.65 KB | damiankloip |
#71 | 2224847-71.patch | 12.36 KB | pwolanin |
Comments
Comment #1
Wim Leers: msonnabaum suggested it should just throw an exception, which I agree with.
This implies we support cache keys greater than 255 characters. Do we really want to do that? If we fix this for the database cache back-end, does that mean we also need to apply it for other cache back-ends?
Related: #2224861: Cache SystemMenuBlock and BookNavigationBlock per active trail (currently cached per URL, this breaks on very long URLs).
Comment #2
Berdir: Well, the cache API consumers would then be doing exactly the same thing?
255 is an arbitrary limit that's specific to the database. Another backend might have a different limit? Maybe even lower and then it would need to do this anyway?
And yes, this needs to be documented either way.
The issue was created while discussing CMI file part lengths, which could also result in too-long cache IDs. @alexpott and @catch both agreed with this.
Comment #3
danblack CreditAttribution: danblack commented:
> 255 is an arbitrary limit that's specific to the database
255 isn't the MySQL limit - 65535 is (see #2224295). Postgres has a limit in the 1 GB range, and SQLite doesn't have a limit.
If you're going to hash a cid, put the hash in a new column and index it. Then search by SELECT ... WHERE cid = s AND cidhash = md5(s) to avoid hash collisions.
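A minimal sketch of that composite lookup, following the md5-based scheme suggested above (illustrative Python + SQLite; the schema and helper names are hypothetical, not the actual patch):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a fixed-size hash column leads the composite
# primary key, while the full cid is kept alongside it.
conn.execute("""
    CREATE TABLE cache (
        cidhash TEXT NOT NULL,
        cid     TEXT NOT NULL,
        data    BLOB,
        PRIMARY KEY (cidhash, cid)
    )
""")

def cid_hash(cid: str) -> str:
    return hashlib.md5(cid.encode("utf-8")).hexdigest()

def cache_set(cid: str, data: bytes) -> None:
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                 (cid_hash(cid), cid, data))

def cache_get(cid: str):
    # Matching on both columns makes hash collisions harmless: two
    # colliding cids still differ on the cid equality check.
    row = conn.execute(
        "SELECT data FROM cache WHERE cidhash = ? AND cid = ?",
        (cid_hash(cid), cid)).fetchone()
    return row[0] if row else None
```

The index on the fixed-size hash narrows the search; the extra cid comparison only disambiguates in the (astronomically rare) collision case.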
Comment #4
Berdir: 255 is an arbitrary limit that's specific to the database *cache backend implementation*.
Theoretical limits of text fields are not relevant; the cache tables default to cid = 255, and making it longer would just make the indexes more complicated.
The memcache module in 7.x already does the hash, as a cid for memcache is limited too. Not sure why it has to, though, as the database has basically the same limit: http://drupalcode.org/project/memcache.git/blob/refs/heads/7.x-1.x:/dmem...
Comment #5
Berdir: Started on this, but we have a problem: it breaks getMultiple(), as we lose the original cid. Memcache doesn't have that problem, as it stores the cache object including the cid.
Comment #7
Berdir: Not much of a problem after all.
#2230187: Invalid cache id generation when creating comment field was fixed as a duplicate of this.
Also note that this fixes an edge case where the array of cache IDs passed in has associative array keys containing characters like a . or :, which break database placeholders; we hit that in one of the language override issues.
Comment #8
catch: Memcache already does this, and has for a long time:
http://drupalcontrib.org/api/drupal/contributions!memcache!dmemcache.inc...
The default limit (which can be varied per instance) is 250 there instead of 255, so throwing an exception at 255 wouldn't help in that case.
Another issue that strongly suggests doing this automatically is #2224861: Cache SystemMenuBlock and BookNavigationBlock per active trail (currently cached per URL, this breaks on very long URLs) - cache granularity/contexts mean you can bloat cid length massively, but there's no way to enforce this really.
Comment #9
Berdir: Yep, already linked to that in #4 :)
I did wonder why it does that for a moment, but the thing is that memcache also has a per-site prefix and another prefix for the cache bin, so the actual cid length varies based on those prefixes and differs between different sites, which makes it impossible to handle it in the code that uses the cache.
There's also no question of whether we should hash or not: the current render cache system, with an unlimited set of cache keys and contexts, will sooner or later run into this, as the existing bug reports already show. So the only question is who needs to do it, and this seems much easier...
Comment #10
sun: If we're using sha1, then we may as well use md5, right?
→ Faster + shorter, leaving more chars for the actual/original key?
Comment #11
Berdir: I used sha1 because that's what memcache uses; it says that's a good combination of fast and few collisions - see my link (I think catch's is older). It has been made configurable there, but that seems a bit overkill?
Comment #12
danblack CreditAttribution: danblack commented: I'm working on an alternative like the attached - not quite there yet, however. Feedback welcome.
Comment #13
Wim Leers: The patch in #7 looks *great*. I'm not sure what #12 is trying to achieve?
Comment #14
danblack CreditAttribution: danblack commented:
> The patch in #7 looks *great*. I'm not sure what #12 is trying to achieve?
Replaced #12 with an updated one (it incorporates part of the binary database fields patch).
What I'm achieving with this patch is a composite primary key that has the binary result at the start, to facilitate more rapid determination of an existing key for those DB engines that have a B+tree index (i.e. InnoDB and possibly others).
What I'm also achieving is an implementation that is immune to hash collisions.
On a more philosophical level, databases should use separate columns for different data rather than concatenating it into single fields.
Comment #16
sun: Interesting. I understand @danblack's alternative proposal to essentially
I think I like that approach more, because it makes the storage use the most optimal approach to data, whereas the human-readable cid is only retained for manual cache data table introspection purposes.
(For that matter, the cid column should be moved last in the schema, and we can turn it into VARCHAR(1000) or even TEXT, if we're still hesitant to leverage VARCHAR > 255... despite it being supported by all of the database engines primarily supported by core, as I recently researched in #2181549-31: Provide a StringLong field item with schema type 'text' without a 'format' column)
Comment #17
danblack CreditAttribution: danblack commented: Completed.
The cache-do-not-test.patch is the raw cache patch without the #710940 database binary types; cache-with-binary-datatypes.patch includes them so it will work.
Comment #18
danblack CreditAttribution: danblack commented
Comment #19
danblack CreditAttribution: danblack commented: The test will fail, as #710940 is a dependency.
Comment #22
mikeytown2 CreditAttribution: mikeytown2 commented: In D8 do we do any wildcard cache ops (D7 has cache_clear_all())? If we do, then the cid column will still need to be indexed.
Comment #23
danblack CreditAttribution: danblack commented:
> In D8 do we do any wildcard cache ops (D7 has cache_clear_all())? If we do then the cid column will still need to be indexed.
Not in the interface or API currently.
We'll assess it if the need comes up; however, I'd lean towards a tag-based purge approach (assuming I'm understanding tags correctly, which may not be the case).
Comment #24
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #25
catch: There are no wildcard clears in 8.x, just tags.
Comment #26
pwolanin CreditAttribution: pwolanin commented: I'm not sure about the two-part index - possibly a performance hit, and it's not clear why it's useful.
The patch needs work: sha1 should never be used in new Drupal code. Use SHA-256 or SHA-512. The latter would be a 64-byte column and has a really astronomically small (or possibly zero) chance of hash collision for strings of length < 1000.
Comment #27
damiankloip CreditAttribution: damiankloip commented: What is wrong with sha1 exactly?
Comment #28
danblack CreditAttribution: danblack commented: If you don't know, measure it.
As the lookup attempts to match the entire hash and string, I've indexed as much of the string as well. Whether a partial index here is useful depends on the database engine. You're right in that it's not strictly necessary.
Did you read comments 10 and 14? If you look closely at the patch, you'll see that the lookup is done on the cidhash and the cid. As such, hash collisions are perfectly OK; the hash is there as an optimisation, to get a fixed size and as even a distribution as possible. As sun said, we could use md5.
You'll also see that the cidhash isn't declared unique. As a primary key, the composite of cidhash and cid(255) is unique; however, if objections to a composite primary key persist, a non-unique cidhash key would be my alternative.
Even if it were unique, the matching on cid and cidhash makes it immune to the vulnerabilities of hash collisions. It would just mean you could only cache one of the colliding entries.
Perhaps if we got #710940: Support for BINARY and VARBINARY in Database Schema committed, we could actually make some progress here (if we stop treating hashes like magical crypto gospel and focus on what the problem and proposed solution actually are).
I hope my clarifications alleviate your concerns.
Comment #29
pwolanin CreditAttribution: pwolanin commented: https://drupal.org/writing-secure-code/hash-functions
As of Drupal 7, we removed every use of md5 and sha1 from core. There was a very long debate, but it has been a settled decision for almost four years.
Comment #30
damiankloip CreditAttribution: damiankloip commented: That is not really applicable when hashing a cache key. There are no security implications here.
Comment #31
pwolanin CreditAttribution: pwolanin commented: @damiankloip - please read the Drupal 7 issue; this is settled policy. We should not release any 8.x code with these deprecated functions added. It's as much about setting a good example as anything else, and frankly the average developer can't tell when something matters for security and when it doesn't, so we need a policy that is secure by default.
It's also not worth debating on performance grounds. The performance difference between hash('sha512') and sha1() is not enough to matter - they are both built-ins, so each runs in a couple of microseconds.
Comment #32
pwolanin CreditAttribution: pwolanin commented: To be clear: I think there is a good idea here. We shouldn't have to care about the length of the $cid passed in; we should hash it to something consistent and use the hash as the actual primary key, while storing the user-supplied cid as unindexed text for debugging.
Comment #33
damiankloip CreditAttribution: damiankloip commented: Isn't that basically what the patch in #19 is trying to achieve? Not saying that's what that patch is going for, but storing anything in a cache bin for 'debugging' is *not* a good idea IMO.
I wonder whether that whole approach is a bit overkill. I think I like the initial approach Berdir took in #7 better. That's the approach memcache uses, which works fine. There are also other cases where this has worked, like Views doing it for block IDs that are too long in D7.
I would say using sha512 is not a good idea here, as that will be what, 128 hex chars? That's a big chunk of the key. sha256 could be a good compromise at 64 chars. We want whatever the minimum is that satisfies this ridiculous requirement to not have md5 or sha1 usage in the code base.
Comment #34
catch: We can open a new issue to reverse the md5() policy. There's plenty of md5() in a grep of core, given it's used in vendor libraries anyway.
I'd be fine with the approach in #7; this is the database implementation of the cache backend, which shouldn't be used on high-performance sites anyway, and as long as there's an upgrade path it's also safe to change post-beta/release.
Comment #35
pwolanin CreditAttribution: pwolanin commented: @damiankloip - it depends on how these things are encoded (base16, base64, or binary).
sha512 base64 encoded is like 86 chars.
That's all you need as the index. Nothing else. We can implement that without needing binary column support.
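These lengths are easy to verify: an n-byte digest becomes 4 * ceil(n / 3) base64 characters with padding, and stripping the '=' padding leaves ceil(4n / 3). A quick check:

```python
import base64
import hashlib

def b64_digest_len(algo: str) -> int:
    # Length of an unpadded, URL-safe base64 encoding of a digest.
    digest = hashlib.new(algo, b"any-cid").digest()
    return len(base64.urlsafe_b64encode(digest).rstrip(b"="))

for algo in ("md5", "sha1", "sha256", "sha512"):
    print(algo, b64_digest_len(algo))
# md5 22
# sha1 27
# sha256 43
# sha512 86
```

So a base64-encoded sha512 costs 86 characters of the key, while sha256 costs 43 - the figure the committed patch's hashing ends up using.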
Comment #36
pwolanin CreditAttribution: pwolanin commented: The last patch didn't apply.
Here's an adjusted implementation using a base64-encoded hash, which can work without the binary column definition.
Comment #37
danblack CreditAttribution: danblack commented: One of the factors in why the DB backend can't be used on high-performance sites is its crappy database implementation. Smaller indexes of non-character types, with a more evenly distributed key pattern, would increase its usefulness.
The default UTF-8 character encoding introduces significant overhead in the database when doing comparisons, so I'd rather not use it. It also increases the size by at least three times.
Comment #38
danblack CreditAttribution: danblack commented
Comment #40
pwolanin CreditAttribution: pwolanin commented: Nice test failures - it seems PhpBackend has a similar bug with cid length. So let's fix that too.
We are decreasing the key size from up to 255 down to 43 in this patch, and making it a better key, since it will be well distributed rather than many entries sharing a common prefix. It also fixes the bug with overly long cids. So I think this is a good step forward, and it doesn't depend on a blocked patch.
We can also make a BINARY column for MySQL specifically. Implementing this, I also came across a bug in the schema code: it calls drupal_strtoupper() or drupal_strtolower(), which it seems may not be loaded (I had a fatal with drush) and which is pointless, since the SQL data types are only ASCII, so it can just use strtoupper()/strtolower().
Also, a utf8 VARCHAR column containing single-byte characters will only be as long in bytes as the number of characters, plus the byte or two needed to hold the length. http://dev.mysql.com/doc/refman/5.5/en/storage-requirements.html
So, I don't know that the use of a binary column for mysql here will actually make much difference.
Comment #41
pwolanin CreditAttribution: pwolanin commented: Discussing in IRC, Crell says he doesn't object to the use of mysql_type in the patch if there is a clear need and performance benefit.
Discussed the effect of BINARY vs VARCHAR with @pimvanderwal who is a MySQL expert, and his take was that index performance would be similar for BINARY vs VARCHAR but VARCHAR would be likely to allocate 3x as much memory if using the utf8 charset. So, given that schema API doesn't support applying latin1 or ascii charset per column, we'd need to either apply it to the whole table, or use BINARY for the indexed column. I don't think applying it to the whole table is viable since it's possible the cache tags contain non-ASCII characters, so the solution in the patch is the best option until we have BINARY column support in schema API.
Comment #42
pwolanin CreditAttribution: pwolanin commented: I also noticed that the patch fixes a bug in the DB backend that would cause it to throw an exception when asked to delete or invalidate an empty list of cache IDs.
Here's a small test addition (should show 2 exceptions) plus incorporating that with the prior patch.
Comment #44
damiankloip CreditAttribution: damiankloip commented: Weird - it seems like both catch and I said we liked the approach from #7.
Comment #45
pwolanin CreditAttribution: pwolanin commented: @damiankloip - can you explain why that would be better? Hashing the value gives you a better key in terms of a B-tree index, and a consistent length.
Comment #46
Wim Leers: I favor the simplicity of #7, but pwolanin also makes an interesting case.
I can't help but wonder what the real-world performance win of pwolanin's patch is. OTOH, if a MySQL expert says there's an advantage to using an optimized primary key (which is also used as the index), do we have a good reason not to do that, if it will speed up cache gets, even if the benefit may be marginal for most sites? (Not sure it is; assuming the worst case here.)
Overall, I think we should just get this done (and minimize bikeshedding), no matter whether it's #7 or #42 — and I think it's up to catch to decide.
So, I'd RTBC #42 (to get catch's feedback), except that I found a bunch of nitpicks:
- Incomplete docblock.
- s/cid/cache ID/
- s/Cache ID/cache ID/
- "The hashed version of the original cache ID (after applying Crypt::hashBase64())."
- s/invalidate/invalidateMultiple()/
- Incomplete docblock.
- s/entry/item/
- See above.
- s/delete/deleteMultiple()/
Comment #47
pwolanin CreditAttribution: pwolanin commented
Comment #48
pwolanin CreditAttribution: pwolanin commented: Thanks, Wim. This fixes the comments as suggested, including a couple more similar to what you pointed out.
Comment #49
Wim Leers: From #46:
Comment #50
sun: As also discussed with @danblack in IRC, I don't understand why we're hashing anything at all here.
We can simply turn the 'cid' column into [VAR]BINARY. A $cid is a string, which is a valid binary value. Hashing the cid is unnecessary.
The one and only gain we're after is to eliminate the charset and collation from the column. Essentially following the best practice for storing UUIDs in databases that do not support a UUID type natively.
@danblack already invested quite some work into adding support for *BINARY types to the database drivers. In case that doesn't happen in time, a feasible alternative approach is to use a custom charset + collation for the 'cid' column (e.g., ascii_bin).
Lastly, we're using VARCHAR columns with a length of >255 in core already; we can and should simply increase the column's length. The PhpBackend may need to hash (since its max length is constrained by the filesystem), but that SHOULD NOT affect the other implementations in any way.
Comment #51
catch: Yeah, I'd personally go for a longer VARCHAR, or #7. The extra column is extra complexity for no obvious benefit.
Comment #52
damiankloip CreditAttribution: damiankloip commented: OK, here is a new patch. catch, Peter, and I spoke about this on IRC.
I will not provide an interdiff, as this is a rerolled approach from #7, with stuff from #50 too.
- Uses Crypt::hashBase64() to hash long cids
- Empty getMultiple() coverage
- Port of cid hashing code to PhpBackend
- Use foreach and array_chunk() instead of do/while loops in invalidateMultiple()/deleteMultiple()
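The chunking change in the last bullet can be sketched as follows (illustrative Python; the real code is PHP using array_chunk() around the database's DELETE query, and the names here are hypothetical):

```python
def chunked(items, size):
    # Python analogue of PHP's array_chunk(): fixed-size batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def delete_multiple(cids, run_delete_query, chunk_size=1000):
    # Batching keeps each query under the database's placeholder limit.
    # An empty cid list yields zero chunks, so no query runs at all --
    # which also avoids the empty-list exception the patch fixed.
    for batch in chunked(list(cids), chunk_size):
        run_delete_query(batch)
```

For example, 2500 cids produce three batches of 1000, 1000, and 500, and an empty list produces no query at all.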
Comment #53
pwolanin CreditAttribution: pwolanin commented: Discussed again with damiankloip in IRC - I think the version that always uses a hash for the file name in PHP storage is preferable, since we don't know the file name length limit on different systems. We already use this base64 hash for aggregated CSS and JS file names, so it should be safe.
Also, we need to remove the array_splice() calls that are now a no-op.
So, the PhpBackend is the same as #48, otherwise just 2 lines changed like this:
Comment #54
pwolanin CreditAttribution: pwolanin commented
Comment #55
danblack CreditAttribution: danblack commented: There have been a lot of claims about performance, including by myself, without any proof. Given the amount of bikeshedding so far, I've attached a load test for perusal. Let's judge the suitability of the patches on what really matters: a measurable test that replicates real load.
You'll need to run the script multiple times in parallel to generate some realistic load and combine results. Improvements to the script welcome.
Alternatively, if you want to measure it on a real system, I recommend pulling out http://www.percona.com/doc/percona-toolkit/2.2/pt-index-usage.html and tuning the slow query log [#560228], or using a combination of the binary log (for updates) and a network capture (transformed using http://www.percona.com/doc/percona-toolkit/2.2/pt-query-digest.html) on an operational system to see what the performance is really like.
Any form of hashing/truncation of the key is susceptible to collisions. Just because a collision is hard to generate doesn't mean it's impossible, and future advances in cryptanalysis may make them easier to predict, so please follow the lead of #17 and match on the cid as well as its truncated form. An EXPLAIN query should show that the lookup uses the index followed by the cid; given that the cid is retrieved anyway, and the low probability of collisions, this shouldn't impact performance - but hey, don't take my word for it, TEST IT!
Comment #56
danblack CreditAttribution: danblack commented: Missed the test file. Oops.
Comment #57
pwolanin CreditAttribution: pwolanin commented: @danblack - let's move the performance discussion to the follow-up issue? catch was clear in IRC that this needs to be a back-portable fix, and any schema change makes that nearly impossible due to the way cache tables are created.
Comment #58
danblack CreditAttribution: danblack commented
Comment #59
damiankloip CreditAttribution: damiankloip commented: We don't really need the 'IN' specified; that will get done for us. It depends on whether we want that explicitness, I guess?
We should just assert that an empty value is returned too.
Sorry @danblack, we should discuss performance in another issue. This one is really about fixing the cid length, and this approach lets us do that in an easily backportable way. Also, the database cache backend is not what you want to use if you want performance anyway, so is it really worth optimising this so much?
Comment #60
pwolanin CreditAttribution: pwolanin commented: @damiankloip - I don't think the explicit IN hurts? It was already there, in any case.
Comment #61
pwolanin CreditAttribution: pwolanin commented: 53: 2224847-53.patch queued for re-testing.
Comment #62
damiankloip CreditAttribution: damiankloip commented: Fine - what about the other comment?
Comment #64
damiankloip CreditAttribution: damiankloip commented
Comment #65
pwolanin CreditAttribution: pwolanin commented: Yes, checking that the return value is empty is a good idea.
Comment #66
pwolanin CreditAttribution: pwolanin commented: Re-reading the patch, it feels like ensureCidLength() is not really a clear method name. Something like truncateCid() or normalizeCid() or hortenCid() or ...?
Comment #67
damiankloip CreditAttribution: damiankloip commented: Why is that not a good name? Seems OK to me: it is ensuring the cid is within our length threshold. truncateCid() is not a good name IMO, as the cid may not be truncated at all, and in fact won't be in the majority of cases. (s)hortenCid :) has the same problem. normalizeCid() could be possible. I still think the initial name, ensureCidLength(), is better though...
Comment #68
pwolanin CreditAttribution: pwolanin commented: I have no idea what the return value of an "ensure" method would be. To me, it would be as likely, or more likely, to throw an exception when the cid is too long as to shorten it. And if it did alter the cid, I'd as soon expect it to do so by reference as by return value.
The semantics seem similar to the word "validate", so to me these names don't communicate what's actually happening.
Comment #69
pwolanin CreditAttribution: pwolanin commented: Based on the IRC discussion with damian, maybe my comment wasn't clear. I agree we should NOT call it something like "validate", and to me "ensure" reads the same as "validate".
It seems we are both OK with something like "normalize" as a verb.
Comment #70
damiankloip CreditAttribution: damiankloip commented: Yes, I think I could live with that :)
Comment #71
pwolanin CreditAttribution: pwolanin commented: Re-roll for PSR-4 and the method name change.
Comment #72
catch: Marked #2275905: Clicking the enable language support link for a field settings form results in an SQL Insert error as a duplicate. Since that's a fatal error in core, bumping this to critical.
Comment #73
damiankloip CreditAttribution: damiankloip commented: 71: 2224847-71.patch queued for re-testing.
Comment #74
pwolanin CreditAttribution: pwolanin commented: 71: 2224847-71.patch queued for re-testing.
Comment #77
pwolanin CreditAttribution: pwolanin commented: Re-roll for a trivial conflict in the test code.
Comment #78
kgoel CreditAttribution: kgoel commented: The patch looks good except for some very minor nitpicks.
Get rid of an extra period.
Comment #79
pwolanin CreditAttribution: pwolanin commented: Thanks, fixed.
Comment #80
kgoel CreditAttribution: kgoel commented
Comment #81
catch: Could we add a normalizeCid() method here, the same as on the database backend? I see at least three separate calls to Crypt::hashBase64().
Comment #82
damiankloip CreditAttribution: damiankloip commented
Comment #83
pwolanin CreditAttribution: pwolanin commented: Thanks, @damiankloip - looks good. Better to have a central method in case we ever want to tweak it, I guess.
Comment #84
damiankloip CreditAttribution: damiankloip commented: Abstraction ftw :)
Comment #85
catch: Committed/pushed to 8.x, thanks! Moving to 7.x for backport.
Comment #89
hgoto CreditAttribution: hgoto as a volunteer commented: This issue is old, but I created a patch for the D7 backport.
In my understanding, D7 doesn't have an equivalent of PhpBackend, so we only need to backport DrupalDatabaseCache (the D7 version of DatabaseBackend) and add tests for it.
I believe the test coverage added by this patch is almost the same as the D8 version's and is enough, but I'd like someone to check that point. Thank you.
Comment #90
hgoto CreditAttribution: hgoto as a volunteer commented
Comment #93
alexpott: @hgoto, the new backport policy is to open a new 7.x issue and link it to the 8.x issue.
Comment #94
hgoto CreditAttribution: hgoto as a volunteer commented: @alexpott, thank you! I see - I'll open a new issue.
Comment #95
alexpott: @hgoto, thanks for working on this :)
Comment #96
alexpott
Comment #97
hgoto CreditAttribution: hgoto as a volunteer commented: I opened a new issue for the D7 backport: #2793297: [D7] Automatically shorten cid's in DrupalDatabaseCache. I'd like someone to review it.