Problem/Motivation
Schema is missing generic BINARY and VARBINARY support (especially mysql); this prevents leveraging those DB-level facilities for cases where efficiency (query lookup time, storage usage) is highly desired.
Hex hashes are widely used, for many kinds of situations; they are just string representations of bytes; storing a hash as a VARCHAR (even if using the binary option) does not provide the same level of speed or reduced storage usage as just leveraging DB-level binary data structures. For a MySQL-specific analysis, see this blog post.
Actual measurements by @mfb:
... storing e.g. a SHA-256 hash in a UTF-8 CHAR(64) column uses six times more bytes than a BINARY(32) column.
... storing a hash in an ASCII VARCHAR(64) column is a big improvement, but still uses more than twice as much space as a BINARY(32) column
Here are some real-world numbers, for 4000 rows of SHA-256 hashes as primary key plus an additional index*.
CREATE TABLE `binhash` ( `hash` binary(32) NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
binary(32) size on disk: 640K
CREATE TABLE `aschash` ( `hash` varchar(64) CHARACTER SET ascii NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
varchar(64) ascii size on disk: 9.0M
Steps to reproduce
N/A, this is for new functionality.
Proposed resolution
- Add schema support for BINARY and VARBINARY columns.
- Augment database abstraction layer so that binary data can be easily used for all types of queries, for all supported DB engines.
Remaining tasks
Add schema support(see #50, #52).- Finish test coverage for schema support.
- TBD: add support for proper conversion to/from binary for all types of queries. See #22
User interface changes
None.
API changes
hook_schema: support for type => binary/varbinary.- Proper helpers (escaping, handling of default value, etc).
Data model changes
None
Release notes snippet
TBD
Comment | File | Size | Author |
---|---|---|---|
#67 | 710940-db-binary-type-67.patch | 5.06 KB | jedihe |
Comments
Comment #1
Crell CreditAttribution: Crell commentedWe're not making any changes to the schema support in Drupal 7. We can come back to this in Drupal 8.
Comment #2
mfbSubscribe. Storing e.g. a SHA-256 hash in a UTF-8 CHAR(64) column uses six times more bytes than a BINARY(32) column.
Comment #3
ebeyrent CreditAttribution: ebeyrent commentedNow that D8 is in alpha 3 release, was this addressed at all?
Comment #4
danblack CreditAttribution: danblack commentedComment #5
mfbThis looks great, thanks!
Comment #6
danblack CreditAttribution: danblack commentedadded test cases.
Also fixed sloppy quoting of DEFAULT strings using the pdo->quote.
Comment #7
sun...and that goes into a separate issue + patch. :-)
No need to use format_string() here, because we know what $table_name contains.
"Table $table_name created."
The patch actually adjusts Postgres + SQLite accordingly, so I'm adjusting the issue title.
Comment #8
danblack CreditAttribution: danblack commentedquoting fixed in patch in issue 2232425 which is now a requirement for this to work
format_string corrected though was copying from same code base.
Comment #10
danblack CreditAttribution: danblack commentedAlso discover a PDO exception with this in postgres while testing the #2224847: Automatically shorten cid's in Cache\DatabaseBackend and PhpBackend
What postgres need to escape bindata in some way. Looks like a job for core/lib/Drupal/Core/Database/Query/Condition.php
Comment #11
danblack CreditAttribution: danblack commentedPostgres needs http://www.php.net/manual/en/function.pg-escape-bytea.php
No obvious place to insert this. https://github.com/doctrine/dbal/pull/452 doesn't seem to include it.
Comment #12
danblack CreditAttribution: danblack commentedDeferred postgresql support in Update/Select arguments to #2238253: Add bindValue to a PDO::PARAM_* type in database query arguments - seems a little tricky to tackle on its own.
Test cases updated. Postgresql seems to handle schema bits associated with default value using standard quote. Still waiting on #2232425: Database Schema field/column default value is not properly quoted via PDO::quote() to be committed.
Dependency of #2222635: Waiting for table metadata lock on cache_field table
Comment #13
Crell CreditAttribution: Crell commentedWe need to have someone verify that this works on Postgres and SQLite before we commit. I'm not opposed, though.
Comment #15
danblack CreditAttribution: danblack commentedI have verified passes the Schema test suite on postgres and sqlite.
I've even DatabaseBackendUnitTest with patch [2222635-14] (with the following change) on postgres and sqlite since postgres doesn't like escaping binary in SELECT statements.
Comment #16
danblack CreditAttribution: danblack commented12: binary.patch queued for re-testing.
Comment #17
danblack CreditAttribution: danblack commented8: binary.patch queued for re-testing.
Comment #18
danblack CreditAttribution: danblack commentedThanks to the commit in #2232425: Database Schema field/column default value is not properly quoted via PDO::quote() this now passes automated tested and is ready for review/committing as is a prerequisite for hash based db caches (#2222635: Waiting for table metadata lock on cache_field table)
Comment #19
pwolanin CreditAttribution: pwolanin commented12: binary.patch queued for re-testing.
Comment #20
pwolanin CreditAttribution: pwolanin commentedSo, I'm a little confused about the posgres docs for bytea: http://www.postgresql.org/docs/9.1/static/datatype-binary.html
It seems in #12 that's deferred? however, we could convert the binary to hex instead of using the standard escape?
Comment #21
danblack CreditAttribution: danblack commentedMaybe. However I'm still stuck with then how do we override the standard escape in the db api layer and somehow have an awareness of which field is the bytea to contain escaping?
So if cidhash is to be escaped how is the following expression written to use hex.
Directly inserting it into the query string is one unelegant way:
Comment #22
sunLooks like we need a new
escapeBinary()
method for all db drivers - essentially following the existingescapeLike()
model?Some drivers might be able to piggy-back onto
quote()
in case PDO implements native binary quoting support for them.Comment #23
pwolanin CreditAttribution: pwolanin commented@sun - so how would that work? postgres would make it hex and mysql would be a no-op?
Comment #24
danblack CreditAttribution: danblack commented@sun, @pwolanin - can we take the issue of postgresql compatibility of bytea to #2238253: Add bindValue to a PDO::PARAM_* type in database query arguments as this patch doesn't introduce the incompatibility of bytea and postgres select, it exists currently as postgres already maps blobs to bytea.
As such this is adding a feature and no regressions even though currently you'll need to be careful how you use it if binary types are used in selects with non-ascii values as parameters.
Comment #25
pwolanin CreditAttribution: pwolanin commentedHmm, well, in the cache table issue we can make binary work for mysql only, so I don't think we should show the schema API as fully supporting it unless we resolve this (small?) postgres issue.
Of course, getting the testbots to run postgres too wold be a big help.
Comment #26
danblack CreditAttribution: danblack commentedIt may be small but its really ugly to fix - progress is been made - look in the appropriate bug report.
Comment #27
BerdirNote that we support the binary => true/false flag and now also use that for the cid column in cache tables by default. If that is set, then the BINARY flag is used for MySQL.
Comment #28
mfb@Berdir these are two unrelated things. The BINARY type is for storing binary data; the BINARY attribute is for storing text with a defined character set but a binary case-sensitive collation. The BINARY attribute automatically chooses an appropriate collation for the column's character set e.g. utf8_bin.
Comment #29
gangaloo CreditAttribution: gangaloo commentedI'm using Drupal 7, no plans to go to 8 yet.
I'm writing up a module with a .install file and using hook_schema() in the .install file.
The install file includes the function for a hook_schema().
Working with encryption and I need to store as type BINARY or VARBINARY.
Looking at the Schema API, BINARY nor VARBINARY is listed.
What I've tried:
Band-aiding a solution to use base64_encode() or bin2hex(), will cost storage space and performance (I believe).
If i apply this patch (Patch #12), would it enable me to store BINARY / VARBINARY for hook_schema()?
thanks in advance,
Comment #30
Crell CreditAttribution: Crell as a volunteer commentedgangaloo: This issue is about Drupal 8. Please open a new issue for Drupal 7, as the patch won't apply anyway due to other changes in the code base.
Comment #31
bzrudi71 CreditAttribution: bzrudi71 commentedComment #32
catchFor cache tables we're now using varchar_ascii for the cache IDs.
Is there any other use case for this?
Given #1031122: postgres changeField() is unable to convert to bytea column type correctly I'd be very wary about adding even more bytea dependencies to core.
Comment #33
danblack CreditAttribution: danblack as a volunteer commentedSeems to be enough desire over the years to add this. Doctrine dbal added it years ago. As referenced in the other issues there are still some limitations around binary/varbinary types and these issues have a body of knowledge around them and aren't impossible to fix and nor are they necessary for a variety of uses.
In the mean time however consider that there is a range of functionality provided by this patch that isn't available currently.
Comment #34
catchThere's desire expressed in this issue but the issue doesn't mention varchar_ascii which can store hashes very efficiently, hence me asking what people want to store in it now.
Once we add a feature we can't control what people use it for, so we should try to fix known issues first.
Comment #35
mfbre: #2 storing a hash in an ASCII VARCHAR(64) column is a big improvement, but still uses more than twice as much space as a BINARY(32) column
Comment #36
pwolanin CreditAttribution: pwolanin as a volunteer and at Acquia commentedmoving to an appropriate version for possible features
Comment #38
twistor CreditAttribution: twistor as a volunteer commentedHow is varchar_ascii more efficient than utf8? I thought the only difference was allowed index size. The size on disk is the same.
The efficiency with varbinary comes from not having to encode binary data (hashes, encryption, files) into text via hex or base64.
Comment #39
mfbYou are correct, using ascii instead of utf8mb4 doesn't help size on disk at all (I just checked to verify this :) So we have some improvement here only for index size.
Here are some real-world numbers, for 4000 rows of SHA-256 hashes as primary key plus an additional index*.
CREATE TABLE `binhash` ( `hash` binary(32) NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
binary(32) size on disk: 640K
CREATE TABLE `aschash` ( `hash` varchar(64) CHARACTER SET ascii NOT NULL, PRIMARY KEY (`hash`), KEY `hashkey` (`hash`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
varchar(64) ascii size on disk: 9.0M
* per MySQL docs:
Comment #47
jedihe CreditAttribution: jedihe commentedTrying a re-roll for 9.1.x. Let's see what the testbot says...
Re-roll was manual, so no interdiff. Summary of important differences:
- fieldSetDefault() changes: dropped due to fieldSetDefault() being deprecated in favor of changeField() (I'm assuming changeField() is already updated via createFieldSql() changes). See change record for fieldSetDefault() deprecation.
- ::_escapeDefaultValue(): already implemented in 9.1.x. (Database\Schema.php).
- Test was ported to new SchemaTest.php, I tried following the pattern in other tests for some calls. Also, commented out tests for now-non-existant fieldSetDefault() et al, with a TODO for using changeField() instead.
Comment #48
jedihe CreditAttribution: jedihe commentedSorry testbot!, I got you yelling all the way through tests due to a silly mistake. It should be fixed now.
quasi-interdiff is just:
Comment #50
jedihe CreditAttribution: jedihe commentedTrying to fix the deprecation...
Comment #51
jedihe CreditAttribution: jedihe commentedNow that we have passes for the three DBMSes, let's try to get some feedback; I'm wondering if testing changeField() is really necessary (see TODO). If not, we can just remove the TODO and continue review + refinements.
Comment #52
jedihe CreditAttribution: jedihe commentedFor anybody interested, #50 applies on 8.8.x. With it, you can define binary fields in hook_schema():
For MySQL queries, HEX()/UNHEX() + db->select() and db->query() (for INSERT) should suffice:
Comment #53
jedihe CreditAttribution: jedihe commentedDid some more research on this today, and I'm now wondering if \PDO::PARAM_LOB can help to have a single way of properly escaping binary data; PDO-style sample from https://stackoverflow.com/a/59360328:
And we should be able to use \PDO::PARAM_LOB via \Drupal\Core\Database\Connection::quote().
Comment #54
daffie CreditAttribution: daffie commentedComment #55
daffie CreditAttribution: daffie commentedComment #56
jedihe CreditAttribution: jedihe commentedUpdating IS.
Comment #57
jedihe CreditAttribution: jedihe as a volunteer commentedComment #58
jedihe CreditAttribution: jedihe as a volunteer commentedGiven #50 provides almost complete support for hook_schema() functionality, I suggest we split adding support for proper conversion to/from binary for all types of queries into a new issue. #52 shows how to leverage existing facilities to handle hex hashes. I haven't checked other uses of binary data, though.
Comment #60
daffie CreditAttribution: daffie commentedI think that it is better to do more in this issue. Lets start using this functionality in at least one place in core. To me that is the best way to see that it works as it should.
Maybe we should first do #2238253: Add bindValue to a PDO::PARAM_* type in database query arguments, before doing this issue.
Comment #64
jedihe CreditAttribution: jedihe at Council on Foreign Relations commentedJust to report we have been using #50 in a site; the patch has been working fine through various changes in the site codebase/infrastructure:
The site handles ~900k unique visits per month. During this time, the site generated ~1M records in the table having the binary field.
Comment #66
jedihe CreditAttribution: jedihe at Council on Foreign Relations commentedLet's try with a quick re-roll.
Comment #67
jedihe CreditAttribution: jedihe at Council on Foreign Relations commentedFixed cspell errors by adding the words to the dictionary.
Comment #68
daffie CreditAttribution: daffie at Finalist commented@jedihe: Thank you for working on this. Let me start by saying that I am not against this issue, only that I am not convinced that the benefits of using binary and varbinary fields outway the effort to make the change and the problems we will get in making the change. The only benefit that I see is that we save a bit of harddisk space. We do not live in 1970 any more and harddisk space is cheap. If it makes Drupal faster in a significant way, than this would something we should do.
Comment #69
mfbOn mysql, historically a binary/varbinary column gave you better performance than a blob column by avoiding slow on-disk temporary tables. Version 8.0.13 improved handling of blob columns: they can now be stored in an in-memory temporary table. I'd guess binary/varbinary might still be slightly faster than blob overall due to how the data is stored in table - could be a fun little benchmarking project..
Comment #70
jedihe CreditAttribution: jedihe as a volunteer commented@daffie: thanks for the detailed feedback. I understand that this patch touches on low-level functionality, so potential inclusion will only occur if a very compelling argument is made for it. For now, I'm happy with just having this patch in a working state for MySQL.
Comment #71
Ghost of Drupal PastConsider https://www.drupal.org/pift-ci-job/2591966 as a use case for (var)binary. I am not even sure how to do this without VARBINARY support, to be fair.
Edit: the module has been released with the appropriate drupalci.yml file to apply a minimal version of this patch.
Comment #72
mfbCan someone sum up what has been tricky about adding support for binary/varbinary, given that blob is already supported? Is there something broken about blob on postgres? (I happen to have a contrib module that stores some binary data in a cache table, so I was assuming this basically worked..)