Problem/Motivation
The Drupal MySQL driver does not currently provide full UTF-8 support. The (confusingly named) MySQL utf8
charset only provides support for the basic character plane (out of 17 in total), which constitutes 6% of possible characters. As of MySQL 5.5.3 (2010), the utf8mb4
charset provides full UTF-8 support, being completely backwards compatible, not requiring more space than utf8
for characters that are in the utf8
set, and using extra byte for characters outside of the utf8
set.
For reference, Wordpress implemented this in February 2015: https://core.trac.wordpress.org/ticket/21212
With the current partial UTF-8 support we lack the 16 other character planes. This would include:
- Emojis!
- Mathematical/scientific symbols
- Some Chinese, Japanese and Korean signs
- Musical notation
- Obscure/ancient languages
- Emoticons, astral symbols, game symbols and other pictographic sets
- (User defined) Font icons (these normally use reserved planes)
See also:
- Wikipedia
- http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
- http://mathiasbynens.be/notes/mysql-utf8mb4
The Drupal PostgreSQL and SQLite drivers do implement full UTF-8 support.
Proposed resolution
For D7: Allow users to use either utf8 or utf8mb4.
For D8:
Prerequisite issues that are already committed:
- #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters
- #2473301: Raise MySQL requirement to 5.5.3
Now that we require a higher MySQL version and use ASCII encoding where possible, we can just set the utf8mb4 character set by default. Additionally, so we can support installs without "innodb_large_prefixes", we work around the InnoDB 191 character limitation on primary keys/indexes by:
- Getting rid of the database-level unique constraint on feed titles. We still have an entity constraint on the application level, as there's no way to distinguish between feeds in the UI other than by title, but it's not a huge deal to have two feeds with the same title so we can safely lose the database-level constraint.
- Getting rid of the database-level unique constraint on block descriptions. We still have uniqueness validation in the form (a patch exists to turn this into an entity constraint), as there's no way to distinguish between blocks in the UI other than by description, but it's not a huge deal to have two blocks with the same description either so we can safely lose the database-level constraint.
- We have a Drupal-level unique constraint on the File URI field instead of a database-level unique constraint, until we fix #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) (at which point we can put back the database constraint).
- Setting the max size of the username field to whatever the username length is (60 characters) instead of 255.
- Setting the filename field on the D6 system table to ASCII, in both the generated script and in dump-database-d6.sh.
- Setting various fields (that only hold ASCII-data) to ASCII on test-only tables. We had skipped these in #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters.
As to regular indexes, we limit those to 191 characters if they refer to utf8mb4-based fields.
User interface changes
N/A
API changes
We have a Drupal-level unique constraint on the File URI field instead of a database-level unique constraint, until we fix #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) (at which point we can put back the database constraint).
Original Issue Summary by mdupont
From MySQL documentation, its "utf8" encoding only supports a maximum of 3 bytes per character.
As such, if we try to save valid UTF-8 content that passes drupal_validate_utf8() but contains 4 bytes characters, it would trigger a buggy behavior (see #1199736: Error on pasting in different language):
- in D6, content will be truncated to the first 4-bytes character ; any content after it will not be saved (data loss)
- in D7 and later, it will trigger an error in PDO: "PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value:"
When using SQLite there are no issues with 4 bytes characters. I hadn't tested with PostgreSQL.
MySQL can handle 4 bytes characters since version 5.5.3 by using "utf8mb4" encoding.
How could we avoid data loss or error when using regular MySQL utf8 encoding?
Comment | File | Size | Author |
---|---|---|---|
#265 | diff-beta11-HEAD-3.txt | 743.12 KB | stefan.r |
#265 | 1314214-265.patch | 253.17 KB | stefan.r |
Comments
Comment #1
mdupontUpdated title
Comment #2
basic CreditAttribution: basic commentedThis looks like it ties in directly to the discussion on #1309278: Make PDO connection options configurable
Comment #3
sunReference manual: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
So...
What we want: Allow to use utf8mb4 instead of utf8.
What we don't want: Allow to use non-utf8 character sets.
Comment #4
catchTagging for backport.
Comment #5
basic CreditAttribution: basic commentedThis patch lets you set the 'names' connection option to utf8mb4, other charsets are defaulted back to utf8. This doesn't feel like the right way to approach this issue though.
In the case that we try to use utf8mb4 on an older version of MySQL we enter a scenario where Drupal cannot connect to the database, and we're brought back to the drupal install.php (which writes an updated settings.php w/o the utf8mb4 option). Not horrible, but not quite adequate either.
How should this kind of issue be handled? Should we...
Comment #6
phayes CreditAttribution: phayes commentedI think detecting wether it is compatible or not makes sense and throwing an error if it's not.
There are two ways to detect compatibility:
- Check the MySQL version (should be above 5.5.3)
- Run the query "show character set;"
The problem is that I think both of these things require a extra query :-/
Two comments about the above patch
1. "names" is a bit of a confusing variable name for the settings, how about "charset"?
2. I understand we want to limit mistakes, but is there a good reason to force only utf8 and utf8mb4 if the user is savy enough to edit their database connection settings? I think defaulting to utf8 might be good enough to limit mistakes and we should allow savy users to set the connection encoding to whatever they want if they so desire.
Finally, thank you *so* much basic for taking this on. Mucho mucho appreciation!
Comment #7
catchMoving to CNW since there's a patch here.
Comment #8
basic CreditAttribution: basic commentedFrom sun:
I am feeling that we should default to utf8mb4 where possible, and allow changing the charset back to utf8 if needed. This would cause mysql to behave the same as sqlite and postgres by default (assuming mysql >= 5.5.3), and use the legacy charset (utf8) if the mysql version doesn't support utf8mb4.
I would like to get some more feedback on this issue as well.
Comment #9
phayes CreditAttribution: phayes commentedAnother angle that this patch should address is the default table mysql character sets for tables. If we don't change the default character set for tables, setting the NAMES at connection-time will do nothing useful.
The table-level character-set is currently defined in includes/database/mysql/schema.inc line 85 and is currently hardcoded. This starts to get tricky as it seems dirty to pull this data from the mysql-*connection* configuration object. Instead, we need something like a variable that can be overridden in settings.php.
Thoughts?
Comment #10
phayes CreditAttribution: phayes commentedI've ran into another issue with using utf8mb4 that we will need to address: index-size.
Using utf8mb4 increases the index size from 3 bytes per character to 4 bytes per character. This isn't really a problem for text-fields, but absolutely plays havoc with varchar fields who's allowed size is 191 characters or larger. The reason for the is that the varchar-index has a maximum size of 765 bytes. Normally this is fine as it is exactly three-times the size of a 3-byte-per-character 255 character varchar (767 / 3 = 255). When we switch to 4-bytes-per-character, the maximum number of characters a 765 byte index can support is 191 (765 / 4 = 191).
So we have two options here
1. Only support varchars up to 191 characters in length for utf8mb4. I *don't* think this is a very good idea as it breaks too many assumptions.
2. Only support utf8mb4 characters in
text
columns. I think this is entirely reasonable.Implementing 2 will require a little more logic in the createField method, but I don't see this being problematic.
Comment #11
phayes CreditAttribution: phayes commentedHere's a snippet you can run to update all your tables and text-field to utf8mb4 for testing:
EDIT: DONT USE THIS! It will also change the charset of varchar fields created after it was run (BAD!). Use the one pasted below...
Comment #12
phayes CreditAttribution: phayes commentedHere's a patch that collects up the above observations.
I've moved the configuration from the connection config to a general $conf setting. This is because, if we are connecting to multiple databases using slaves, we want to make sure that all connections are configured the same. Additionally, using this $conf setting ensures that we only have *one* setting to change in order to enable utf8mb4 and everything "just works".
Tables now always default to utf8 no matter what, and columns only get tagged with utf8mb4 if they are text fields.
Review and thoughts please!
Comment #14
phayes CreditAttribution: phayes commentedTwo more things that should be added to the above patch:
1. Link to http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html in default.settings.php so a user can learn more
2. Check to make sure MySQL is compatible on *install*. I don't think we need to check on every request, because of:
A. We check it on install
B. If a site is migrated from one MySQL server to another that is non-compatible, the SQL import will fail when it encounters the utf8mb4 column charset
Comment #15
phayes CreditAttribution: phayes commentedHuh, I guess the MySQL testbot is not running MySQL 5.5.3+
Comment #16
Crell CreditAttribution: Crell commentedvariable_get() is not allowed inside the DB layer. All configuration must come in through the database configuration array only. That's by design so that the DB system is not dependent on the variable system, which itself has a soft dependency on the database.
The Database layer right now is about 95% Drupal-agnostic. The goal is that by Drupal 8's release t is 100% Drupal agnostic.
Comment #17
phayes CreditAttribution: phayes commentedThanks Crell,
I will re-roll using
global $conf;
Comment #18
Crell CreditAttribution: Crell commentedNo, global $conf is what we should not be using. :-) Such configuration should go into the $databases array, if anywhere.
Comment #19
phayes CreditAttribution: phayes commentedHere's another patch. Please review.
I think we are getting pretty close. The user can configure the charset in the database settings. We store the charset in the Database connection object as a read-only property so other functions can get this data later.
In particular, I would like someone to review the validateDatabaseSettings method. not sure if I'm doing this right...
Comment #20
phayes CreditAttribution: phayes commentedBlarg. Re-rolled above patch - removing danling
global $conf
Comment #22
phayes CreditAttribution: phayes commented{sigh} - re-rolled again (typo).
Comment #24
phayes CreditAttribution: phayes commentedMy apologies for the all the bad patches. Let's try this again shall we.
Comment #26
phayes CreditAttribution: phayes commentedFixing notice that testbot found...
Comment #27
phayes CreditAttribution: phayes commentedComment #28
Crell CreditAttribution: Crell commentedAlso, just to note that this is going to run smack dab into #1321540: Convert DBTNG to namespaces; separate Drupal bits and conflict. I'd rather get reviews on that sooner than later so that we can get it in, and then do other API changes.
Comment #29
Tor Arne Thune CreditAttribution: Tor Arne Thune commentedJust some comment cleanup. The interdiff is between patch in #26 and the attached patch.
Comment #30
phayes CreditAttribution: phayes commentedFor those who need this working against the latest stable in D7 (7.10 right now), here's a patch I cooked up. I hope people find it useful.
Also, once you install this patch and change your charset in your settings.php file, you can run this to upgrade all your text fields:
Comment #31
xjmThanks for the comment cleanup! Here's two other minor docs cleanups to incorporate:
These two docblocks should have one-line summaries. Also, the function's summary should start with a third-person verb. (See http://drupal.org/node/1354#classes).
Comment #32
Crell CreditAttribution: Crell commentedWoah, woah, why? The magic methods are dangerous magic in most cases. This is a hack. Don't do that. If for some reason you need to expose the charset to the outside world, make it a method.
-18 days to next Drupal core point release.
Comment #33
phayes CreditAttribution: phayes commentedHi Crell,
Here's an updated patch with the magic methods removed and replaced with a "charset" method.
Comment #34
phayes CreditAttribution: phayes commentedWhoops,
The above still had some comments pertaining to the __get method, those comments have been corrected in this one.
Comment #35
Crell CreditAttribution: Crell commentedVariable docblocks should follow the same rules as any other docblock. Vis, one line summary, empty line, then long description.
This value actually could be, arguably, multiple different values, couldn't it? Not just the two utf8s? If someone sets the charset connection option, they can set it to whatever they want.
No need to mention the method. In fact, most of this documentation should move to the method. Remember from a API perspective, the method is the only thing that exists. There is no property. That's just an implementation detail that a caller should not be able to know about.
-25 days to next Drupal core point release.
Comment #36
phayes CreditAttribution: phayes commentedCrell,
As per sun's comment above (#3), we only want to support utf8 and utf8mb4 charsets. If you want to override this decision I will make the necessary changes to the patch.
Comment #37
mdupontMarked #1741026: Some unicode code points failed on Drupal as a duplicate of this issue.
Comment #38
laurentnet CreditAttribution: laurentnet commentedDrupal 7.15, with SQlite 3.7.7.1 or PosgreSQL 9.3.1 (apache 2.2.2),
displays correctly code point U+1D538
Postgresql 9.1: one unicode character UTF_8 can be 1,2,3 ou 4 bytes
http://www.postgresql.org/docs/9.1/static/multibyte.html
Conclusion: if you want to display code point in Supplementary Multilingual Plane, use Postgresql or Sqlite
Comment #39
jurgenhaasWanted to cleanup the patch documentation (see comment #35) and realized that the structure on D8 has changed since the patch so I wonder if this still applies to D8?
Comment #40
YesCT CreditAttribution: YesCT commented#34: mysql_utf8mb4_support-1314214-34.patch queued for re-testing.
Comment #42
star-szrUpdating tags.
Comment #43
simolokid CreditAttribution: simolokid commentedWith the help of xjm i rebased it.
Comment #44
ZenDoodles CreditAttribution: ZenDoodles commentedComment #45
Crell CreditAttribution: Crell commentedSee earlier review about how this docblock belongs on the method; this should just be a one line description.
A double-ternary is way too hard to read. :-) Don't do that. Go ahead and spell this out long-form.
Actually, is there a reason to not just set a default for $connection_options['charset'], and then always use that?
See above. The property is irrelevant. The method should contain the real documentation.
0 days to next Drupal core point release.
Comment #46
star-szrThanks @simolokid! Untagging.
Comment #47
kbasarab CreditAttribution: kbasarab commentedComment #48
Crell CreditAttribution: Crell commentedkbasarab: See the first quoted block from comment #45. Some version of that should be moved to the charset() method, which then replaces the "used to fetch readonly property" line (which is meaningless).
The property is not a flag, because it's not a boolean. It should likely be something like "The character set of this connection.\n\nEither utf8 or utf8mb".
The method then has the discussion of seeing default.settings.php and further explanation and so forth.
Comment #49
ergophobe CreditAttribution: ergophobe commentedNew patch that has no code changes - this is just my best attempt to fold in Crell's last documentation requests.
The comment in lines 82 and 259 are a bit repetitive (both point to comments in default.settings.php). Not sure if that's okay.
Comment #50
YesCT CreditAttribution: YesCT commentedFor most functions (see exceptions below), summary lines must start with a third person singular present tense verb, and they must explain what the function does. Example: "Calculates the maximum weight for a list."
http://drupal.org/node/1354#functions
so this should be:
Fetches ...
Also there are a few lines with an extra space at the end: trailing whitespace.
patch coming soon.
Comment #51
YesCT CreditAttribution: YesCT commented(The white space was easy to spot using the dreditor review button http://drupal.org/project/dreditor )
Comment #52
chx CreditAttribution: chx commentedOuch. Please, please, please *no*. Supporting utf8mb4 is great but supporting any non-utf8 charset is a massive pita that I would say we must avoid at all costs. At least search will break with non-utf8 but I can imagine other things too.
Edit: to make sure this is clear: utf8 and utf8mb4 are both fine, but koi8r and all that ancient ....charsets are not.
Comment #53
chx CreditAttribution: chx commentedSolution: for the default connection throw an exception if the charset is set and does not start with utf8.
Comment #53.0
YesCT CreditAttribution: YesCT commentedadded summary of additional scope this issue needs in order to avoid a follow-up critical.
Comment #54
YesCT CreditAttribution: YesCT commentedneeds work to avoid critical follow-up issue.
updated issue summary.
updated issue title.
Comment #55
ergophobe CreditAttribution: ergophobe commentedI wrote the documentation to reflect the code, which is currently very permissive. Upon reflection, it would even accept typos would it not?
If it's the case that it absolutely must be utf8 or utf8mb4 then it should test specifically for those, not just strings that begin with 'utf8'. That way it would catch the case where someone accidentally types 'utf8mb' (e.g. as Crell did in #48).
So it seems that it needs to
- test specifically for utf8 or utf8mb4.
- if it's not one of those, throw an exception and default to utf8
Comment #56
chx CreditAttribution: chx commentedI. am fine with that -- once again, for default. The reason I emphasize default cos the any-charset-goes is an *excellent* opportunity for (careful) migrating from legacy databases.
Comment #56.0
chx CreditAttribution: chx commentedUpdated issue summary, point out need to avoid critical follow-up
Comment #57
damienwhaley CreditAttribution: damienwhaley commentedI've updated the issue summary to make it easy to figure what's left to do.
Comment #58
damienwhaley CreditAttribution: damienwhaley commentedI've started work on checking of the charset setting or MySQL and it's pretty much right I hope. The only concern I had is about throwing the exception. I've left the code commented out where I think the throw could happen, but if you let it throw there, then you end up with a white screen as the object is not returned and the error handler does not like null objects.
Perhaps the throw should be further up the chain (in the cache fetch)? Also the exception I've chosen is probably not quite right.
My thoughts after playing with this is that we should probably log it via watchdog as the connection should always work.
This probably needs a bit more thought. As I'm not 100% familiar with the new config stuff and I'm worried I have not respected @chx desire to only override the default setting.
I'd like some feedback, please.
Comment #58.0
damienwhaley CreditAttribution: damienwhaley commentedUpdated the issue summary
Comment #59
YesCT CreditAttribution: YesCT commentedI think that the interdiff was made a bit ... well, I dont think these were really removed in the patch in comment 81. wait, there is no 81! :) Probably a typo.
if a branch with the patch from 51 was called utf-51, and there was a branch with the patch in 58 called utf-58
then the interdiff could be made by: git diff utf-51 utf-58 > interdiff-51-58.txt (maybe the interdiff above reversed the arguments)
Comment #60
YesCT CreditAttribution: YesCT commentedI think any charset starting with utf8 would be ok. That would be more general than just adding two specific allowed ones.
Comment #61
chx CreditAttribution: chx commentedThanks everyone for moving this forward.
> I think any charset starting with utf8 would be ok.
See #55 for the reasoning -- checking specifically for utf8 / utf8mb4 is a good idea to avoid typos.
Re the new patch ( I attached a real interdiff for easy review): we do not need a new allowedCharset property and method as a disallowed charset should terminate immediately. strcmp() is needlessly complicated, it can be a simply ==. If the error handling is not yet set up, then just print and die, this should be superb rare, you can't hit that error message without misediting settings.php no need to be nice. However, moving the charset method up to the base Connection class is good.
As for default connection, just add
self::$databaseInfo[$key][$target]['key'] = $key
intoDatabase::openConnection
before calling the constructor and then the constructor can easily test for the default key.Comment #62
ergophobe CreditAttribution: ergophobe commentedI'll try to make a patch tomorrow morning unless someone beats me to it.
The final patch needs/should/could...
One comment on item #2 - there are only two charsets that begin with 'utf8' currently available in MySQL. There are many collations and there, if we're going to test for collations, it makes sense to take anything that begins with 'utf8', but I don't think it does with charsets. Consider that utf8mb4 requires special considerations. Let's say MySQL introduces a third utf8 charset - we can't know at this point what special considerations will be required to handle it correctly, so even if it's a valid charset, it should probably be disallowed until the implications are understood. Maybe that's wrong and it's up to the user to test it.
There is, however, an alias for the classic utf8 charset which can now be referred to as utf8mb3. So even though there are only two charsets with uft8 in the name, there are three valid handles for those two charsets. So perhaps utf8mb3 should be allowed too.
src: http://docs.oracle.com/cd/E17952_01/refman-5.5-en/charset-unicode.html
Comment #63
chx CreditAttribution: chx commentedSure, let's go with that too. And yes, if exception handling is broken then print and die -- it's just a failsafe and not an error anyone will ever see.
Comment #64
ergophobe CreditAttribution: ergophobe commentedOne more try. Since this is both my first core patch and I'm just getting started with D8, I hope this is moving things in the right direction. But I'm adding a lot of notes here to hopefully make it easier for people to decipher.
The main principles are the same
- restrictive for default connection
- permissive for all other connections
- added a $connection_options array element 'index' which is the first index of the $databases array. Sometimes this is called 'key' so perhaps that would be a better nomenclature, but I was just following advice from chx on IRC.
- probably other stuff
Exception Handling
The big thing still problematic is agreeing on an exception/error handling strategy. I understand why print and die might make the most sense, but I expect there are a lot of cases where just defaulting to UTF8 and warning the user would save some headaches.
Obviously, the best thing would be a try/throw/catch, but I could not get that to work. If I try to throw an exception (even just using \Exception, I get a fatal error:
This is true even if I use the base \Exception handler or if I try to use one of the custom ones. There's too much I don't know about exception handling in D8!
That said, for now I have created a drupal_set_message() warning with the message
I know that's not the way this should be, but hopefully it moves this patch forward.
Relation to Patch 58
Sorry - I worked off my previous patch mostly and that's what's reflected in the interdiff. Unless I've been misunderstanding all along, I think damienwhaley misconstrued what chx was driving at with respect to the default connection. The point was not to allow a change only for the default connection, it was to be very restrictive on the changes allowed on the default connection and permissive on all other connections.
I did move the charset function and property to base connection class as in #58.
I removed the MySQL-specific comments and refer the user to documentation in default.settings.php. I think it's best for the general documentation for this problem to reside in one place anyway and not be repeated all over. That way if it changes (for example the reference URL) it only needs to be updated once.
I also did not include the allowedCharset() method and deleted the commented code and comments regarding exceptions in /core/lib/Drupal/Core/Database/Database.php.
Other Notes
In my patch, the code in /mysql/Connection.php lines 78-83 is unnecessarily repetitive, but given the long discussion here ( http://drupal.org/node/935284 ) I gather that especially for core it is preferred to keep lines short at the expense of inefficient code (and it does make it more readable, but it feels "wrong" to have to short conditionals that execute the same statement).
I have also added perhaps too much documentation in default.settings.php. I'm not sure how much is too much. Comments welcome.
Comment #65
YesCT CreditAttribution: YesCT commentedchanging to needs review to let the testbot try it.
Comment #66
ergophobe CreditAttribution: ergophobe commentedOops, I forgot to change that. But I was looking over the code and realize it still needs work. As is, the patch has this code in mysql/Install/Tasks.php:
This should be generalized, because for non-default connections we're now allowing any charset.
It also raises another question: should the charset only be validated on install? If so, then the other validation code should be moved here. If not, this check should become part of the code that is validating the connection. Also, though it probably won't stay, the drupal_set_message() call does not wrap a t().
Comment #67
ergophobe CreditAttribution: ergophobe commentedI just realized this can't be used on any column over 191 characters that will be an index because InnoDB only allows 767 bytes on an index. See phayes comments above in #10.
Comment #68
ergophobe CreditAttribution: ergophobe commentedI think there's still a fair bit of work to be done here. Most importantly, any patch needs to be tested to make sure that it actually works with an interactive install and a custom charset/collation in settings.php
I'm not sure any of the patches submitted so far did, because the default collation was getting set to the custom collation and that doesn't work because you get a mismatch on VARCHAR columns. If you change it to use utf8mb4 on VARCHAR cols, then you get lots of crashes because several indexes exceed the 191 character limit (see comment above).
Also there remains the question of how to check that a given charset is in fact supported. This can be done on install, but if someone changes the settings.php file post-install, there's no test in here yet.
So this is probably the best effort I'm going to make without some guidance or someone else to step in and straighten things out. I hope I've made more fixes than problems.
Comment #69
ergophobe CreditAttribution: ergophobe commentedRemove whitespace. Queued for review so it gets tested, but really it's "needs work" - see comments in previous post.
Comment #70
YesCT CreditAttribution: YesCT commentedCaution, this is standards stuff, forgive me for distracting here. No need to worry about getting code perfect before posting. Really. I just started on some white space stuff and got carried away.
Patch attached.
extra whitespace
function one line summaries should begin with third person verbs, like: Fetches
@params have not newlines between them, but @return always has a newline before it. http://drupal.org/node/1354#order
And has a description on the second line.
missing spaces in the Default line, to align it.
missing periods at end of sentences.
I'm not sure of the format for lists inside inline comments.
it's in core in some places, but http://drupal.org/node/1354#inline says to put in-line comments on its own line.
expressions don't use space before the parens (). http://drupal.org/coding-standards#controlstruct
Also, I dont know the order of precedence between && and !=, so maybe an extra set of parens () would help make the meaning clearer. [I did not fix this one.]
use arguments to t() to make it easier on translators.
http://api.drupal.org/api/drupal/core%21includes%21bootstrap.inc/functio...
for example: $text = t("@name's blog", array('@name' => user_format_name($account)));
I'm not sure, but I think st() might be the same as t() in that should use args for variables instead of string cat. http://api.drupal.org/api/drupal/core%21includes%21install.inc/function/...
more than 80 chars.
broke up this sentence into two, and added comma to separate phrase to add clarity.
missing period.
Comment #71
ergophobe CreditAttribution: ergophobe commentedSorry about that and thanks so much for doing all that cleanup. I'll try to be more careful in the future.
>>I dont know the order of precedence between && and !=
the != takes precedence, so this works as expected, but for clarity, perhaps the extra parens would help.
Comment #72
YesCT CreditAttribution: YesCT commentedNext:
need a review from @Crell or someone to make sure previous concerns have been addressed in the right way, and/or maybe from @chx on the more recent bit added.
if the approach is ok, then a little cleaning up of the comments will be needed, there is some debug or other notes in there.
Comment #73
ergophobe CreditAttribution: ergophobe commentedMy feeling is that it needs more than cleaning up comments. The comments are long and have debug type stuff in there because I'm not sure of several things.
I stopped where I did because I had a headache and had reached the point where I felt like I was making things worse, not better. I think it needs a fair bit of massaging, but I can't really get back to it until next week.
It also needs either a very robust test or it needs to be tested by installing from scratch in a variety of scenarios (or both) - a small change here or there in this patch can easily cause a crash on install. If you get a column mismatch of any sort, MySQL will die.
Comment #74
Crell CreditAttribution: Crell commenteddrupal_set_message() inside the Connection class is a no-go. Just throw an exception and let the global error handler deal with it.
I don't fully understand the implications of this comment. Does that mean utf8mb4 + InnoDB + Varchar == kaboom? That would be... very very bad. I hope I'm misunderstanding that.
Have I mentioned how much I hate SQL databases?
Comment #75
Crell CreditAttribution: Crell commentedNote we may want to change the SET NAMES call to something that the driver supports natively:
http://www.php.net/manual/en/ref.pdo-mysql.connection.php
(Yes, it's deeply buried in the docs. I didn't even know about it until today.)
Comment #76
ergophobe CreditAttribution: ergophobe commentedIf the column has more than 191 characters, yes, it goes kaboom. See phayes comments in #10.
Yes, it would be. Maybe it's just because I don't have the chops for this, but the more I look into this, the more utf8mb4 support seems like a morass.
Comment #77
chx CreditAttribution: chx commented> because there are VARCHAR-based indexes of 255 chars.
Then we should ban that and establish sensible indexes. There's no way we need more than, say, 32 chars to index and every db and Drupal totally supports index lengths. Might be a followup.
Comment #78
ergophobe CreditAttribution: ergophobe commentedI was not reading my own comments - I was thinking the column length was the problem, but yes, it's just the index prefix length at issue and certainly 191 characters should be enough for that.
Comment #79
chx CreditAttribution: chx commentedRe #75 it was not buried rather it didn't exist: Prior to PHP 5.3.6, this element was silently ignored. Note that Drupal at this moment is 5.3.5 so using charset in the DSN is not yet possible. Plans are to go 5.3.10 in March.
Comment #80
ergophobe CreditAttribution: ergophobe commentedRoughly speaking
That's the rough scope of it, so it's not hard, though it does end up being a patch that affects a very large number of files. It will also require contrib authors to update their modules. I can try to come up with another patch, but before I go changing all those files, I just want confirmation that approach is sensible.
I don't find any indexes that are specified with a length currently, but as chx says, it's built into Drupal and is a simple matter of changing, for example, the key_value table
'primary key' => array('collection', 'name')
to
'primary key' => array(array('collection', 60), array('name', 128))
and choosing sensible values. In a default install of D8, the largest value for key_value.collection is 13 chars ("system.schema") so it seems to make sense there, for example, to leave name at 128 and then use what's left. There may be other cases where both columns need truncating. I haven't looked that carefully yet and wanted a reaction from others following the thread before going down that road.
PS I don't think just exempting the VARCHAR cols from utf8mb4 support as phayes suggested in #10 really makes sense. Some of these columns are things like
users.signature
and it seems like supporting a character set for user names is rather important.---------
*I arrived at this very roughly by writing a small script that does a SHOW TABLES and then iterates through the tables and grabs all columns that are of type VARCHAR and which have a Key (as per the SHOW COLUMNS) output (and in the latter two cases, that have a length above the given size. I also just grepped the codebase using
^[^\*\n\r]+'(primary key|indexes|unique keys)' =>.*?[;,]
and got a second count that way.Comment #81
Damien Tournoud CreditAttribution: Damien Tournoud commentedDo we have an actual use case for this? I could not see one by quickly skimming through the comments.
Until we have, it seems like the drawbacks clearly outweight the benefits here.
Comment #82
Damien Tournoud CreditAttribution: Damien Tournoud commentedAlso, a primary key cannot use length prefixing. (Or at least I hope that MySQL doesn't allow that, because it doesn't make any sense.)
It's probably worth tidying up our schema (for example: machine name-type keys should probably use a ascii character set so as to reduce the size of the index), but that belongs in another issue.
Comment #83
ergophobe CreditAttribution: ergophobe commentedActually, it can: http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
I tested it and had no problem altering a table to use a 100 char PK based on a 255 char VARCHAR column.
And this is all related to #1852896: Throw an exception if a schema defines a key that would be over 1000 bytes in MySQL - it seems one way or another some sort of length checking on MySQL keys might be needed.
I'm not sure what the OPs use case was nor that I understand the internationalization issues well enough, but I believe the use case goes something like this:
- http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Supplementary_Multiling...
- http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
- http://mathiasbynens.be/notes/mysql-utf8mb4
That's sort of what I was driving at in #76 - this seems like a lot of trouble, but as it stands now (as I understand it), Drupal is not really compatible with 4-byte character sets.
It looks like over in the Wordpress community they're fighting with the same thing.
Comment #84
phayes CreditAttribution: phayes commentedMy use case of this is that we host a lot of mathematical scientific journals, there are a lot of high order characters that are used in this content for properly displaying math. I've been running the patch in #12 for about a year now in production with no problems.
Edit: Is there a reason why we don't just say we only support utf8mb4 on text fields (not varchar) and skip the whole key-length problem. It's not a 100% ideal solution, but it's better than providing no support for high order characters.
Comment #85
Crell CreditAttribution: Crell commentedphayes: Mixed character set databases are a nightmare to manage and lead to all sorts of creative bugs unless you really know what you're doing. You may really know what you're doing, but I wager most people using MySQL don't know what a character set is, much less what happens when you try to mix them. :-)
Comment #86
Damien Tournoud CreditAttribution: Damien Tournoud commentedI don't know what it does exactly, but it's probably not desirable.
I'm going to reclassify this as a feature request. There is no data loss related to this since Drupal 7, so it is not a bug per say.
In addition, I don't think we can support this until we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have poped up all over the place in Drupal 8).
Comment #87
Damien Tournoud CreditAttribution: Damien Tournoud commentedComment #88
sun#1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters
Comment #89
ergophobe CreditAttribution: ergophobe commentedBefore I answer this I want to reemphasize - much of this is over my head in terms of both Drupal 8 internals and in terms of character sets (I only actually use French and English, so these issues don't affect me, I just ended up taking this on as somewhere to try to help with D8).
That said...
Not using any Asian languages myself, I don't know how common the extensions beyond the Basic Multilingual Plane are. Mostly what you see in the SMP are special uses (Math, emoticons, alchemical symbols), dead languages (Lycian, Gothic, Cuneiform) and the unified Han characters.
But it seems to me that if you're supporting these characters in TEXT, they should be supported in VARCHAR for the cases where
Wouldn't it be confusing to a user that say that if you enter text into the body of a post, it's all good, but if you copy and paste it into the title of that post, it will truncate your data as soon as it encounters the first 4-byte character?
BTW - support for these character sets is shaky at best in the world at large. The emoji, for example, are supported in Win8 fully in IE, Firefox and Opera, but not in Chrome or Safari. Support on the Mac seems to be a little less. My Ubuntu is busy upgrading, so I can't tell.
Also, I came across one article using a 4-byte character in the title - it rendered fine in the Google search results, but the link was not clickable.
Comment #90
Pancho#75, #79:
#1800122: Bump minimum version of php required to 5.3.10 has been committed, so we can use DSN now, instead of 'SET NAMES'.
#80:
From what I got the 767 Bytes (= 191 chars) InnoDB restriction isn't about index length but about the length of a (single) prefix.
Otherwise, key length may be up to 1000 B (= 250 chars) as restricted by MyISAM, see #1852896: Throw an exception if a schema defines a key that would be over 1000 bytes in MySQL.
We might need to test this again, unless someone can clarify.
#81:
Another use case are LTR tags allowing English text be correctly embedded in an RTL environment. We're having bugs that can't be correctly solved without this, see #1165476: if t() string has no translation or fallback language, text should have lang attribute.
#86:
Agree that we should avoid length prefixing on PKs that would be opening up one more can of worms. The more we should go and shorten keys.
Plus: While It's certainly true that there is no data loss, which is the criterion for a critical bug though, not generally for a bug. Still I'm recategorizing only as task, because it's not like everybody would expect us to support 4 byte chars. But "feature" this isn't either.
Comment #91
mgifford#70: database-1314214-70.patch queued for re-testing.
Comment #92
Damien Tournoud CreditAttribution: Damien Tournoud commented@Pancho: LTR and RTL are both 3-byte UTF-8 characters. The linked issue describes the use of language tags from
U+E0000
toU+E007F
, which are totally deprecated anyway.Comment #93
ergophobe CreditAttribution: ergophobe commented@Pancho
1. key length versus prefix length.
My mistake - you are correct
Comment #94
Hanno CreditAttribution: Hanno commentedGreat work here. This patch could fix this issue.
It might also need a follow up. I think we should reconsider this as a bug as Drupal isn't fully utf-8 compliant.
This issue is not only theory, but also relevant for
- emoji, these emoticons are supported by Google and Apple on mobile phones and use the 4 bytes utf-8. Some applications even use emoji as part of the username
http://blog.manbolo.com/2011/12/12/supporting-ios-5-new-emoji-encoding
- mathematics used in scientific journals: http://drupal.stackexchange.com/questions/50868/configuring-drupal-to-us...
Some questions
- could we require MYSQL >5.5.28 for Drupal 8?
- There is no workaround if people can't or don't use utf8mb4:
- The actual behavior with normal utf8 is for user content is that your text is lost and no user feedback is given. I think we should reconsider this as a bug and we need to give user feedback.
- when importing data from external sources the import fails. Instead we could skip, escape, or replace these characters when sending to the database. This needs further research. See for example this bug when importing Tweet data: #1824506: When importing Tweets: SQLSTATE[HY000]: General error: 1366 Incorrect string value
- we could clarify in the system report which characterset is used, and if utf8 with mysql is used place a warning that it isnt fully compatible.
- could we test if this bug also exists with other databasesystems like PostgreSQL, SQLite and Oracle?
Comment #95
Damien Tournoud CreditAttribution: Damien Tournoud commentedAs I said in #86:
Increasing the size of all our indexes by 33% is not something we can afford. Let's open a separate issue for the tidying up and postpone this one on it.
This should not happen. If it does, it's a bug elsewhere. The database layer triggers an exception, that's all that it does. Catching it and processing it belongs in the upper layers. If they don't do that correctly, it's a bug there, not in the database layer.
Comment #96
Hanno CreditAttribution: Hanno commentedJust checked with a fresh install and pasting a unicode 4 bytes character in a post gives a crash. Will post the bug in a seperate issue.
EDIT: #2002100: pasting text with 4 byte UTF-8 characters leads to broken screen
Comment #97
Hanno CreditAttribution: Hanno commentedSimilar ticket and discussion in Wordpress:
http://core.trac.wordpress.org/ticket/21212 MySQL tables should use utf8mb4 character set
Is the by Pento mentioned solution in that ticket to use the option innodb_large_prefix something to consider?
Comment #98
pfrenssenRerolled, wanted to try this out. These emoji characters are not going away any time soon it seems.
Comment #99
pfrenssenComment #101
pfrenssenThis doesn't work any more after the database connections have become static in #1953800: Make the database connection serializable.
Comment #102
mgiffordSo is there another way round this?
Comment #102.0
mgiffordUpdated issue summary, took out duplicate and integrated the added problem motivation
Comment #103
morgantocker CreditAttribution: morgantocker commentedFor prior art: the way this problem was handled in mediawiki was to store strings in VARBINARY and let the application handle character-sets. This was a pre MySQL 5.5 decision, and it's important to point out that it means no collation which is the bigger deal.
With collation:
'Montréal' == 'MONTREAL'
+ a consistent sorting order is provided with multi-byte characters.
It may be an option to 'degrade' to VARBINARY for versions older than MySQL 5.5. I'll leave this one for discussion.
RE: Index size limited to 191 characters discussion
There is actually a workaround to this problem via prefix indexes. i.e.
ALTER TABLE my_table ADD INDEX text_col_index (text_col(190));
This isn't suitable for PRIMARY or UNIQUE indexes since it breaks a constraint, but it should work fine for others.
Comment #104
marco CreditAttribution: marco commentedI've updated the patch for Drupal 7 from #30:
- to Drupal 7.28
- applying all the remarks, but limited to utf8 and utf8mb4 (no custom encodings)
To apply:
1) backup the database
2) apply the patch
3) add the settings to your settings.php file
4) run this to upgrade the text fields:
Comment #105
cdeepan CreditAttribution: cdeepan commentedDear All,
Thanks for the patch for changing character set to utf8mb for columns with Text data type. I believe this would allow supplementary characters to be added to the node body. However I believe we will have issues if the user enters these supplementary characters in Menu Title, Node Title, Tags etc. Do we have any solution to take care of these fields as well?
Thanks & Regards,
Deepan Choudhary
Comment #106
chx CreditAttribution: chx commentedWe do not have a solution. See the patch:
Comment #107
Antti J. Salminen CreditAttribution: Antti J. Salminen commentedWhat about the innodb_large_prefix and associated settings available in MySQL >5.5.14 mentioned in #97? Could this just be a manually enabled option and no autodetection or anything?
Comment #108
chx CreditAttribution: chx commentedThat would be swell. MySQL 5.1 was EOL'd on December 31, 2013. Maria will EOL 5.1 on 1 Feb 2015. Are we ready to raise the requirements to 5.5?
Edit: don't forget you need the Barracuda format for that and innodb table per file. Perhaps we need to test creating a table with a long index? Put barracuda in the table options?
Comment #109
Crell CreditAttribution: Crell commentedGiven the number of other legacy versions we're dropping in Drupal 8 (PHP, IE, etc.) I would be OK with moving to MySQL 5.5 as the minimum if it bought us something. Would switching to MySQL 5.5 make this problem effectively go-away? That's what it sounds like but I want to confirm...
Comment #110
chx CreditAttribution: chx commentedWeeeeeell. If your MySQL 5.5 InnoDB is set to use innodb_file_per_table and uses the Barracuda innodb_file_format http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_inn... then your problem goes away. Fun.
This is the default file format in MySQL 5.6 though (and also the default between MariaDB/MySQL 5.5.0 and 5.5.6 but since 5.5.7 has reverted back to Antelope).
Comment #111
morgantocker CreditAttribution: morgantocker commented@Crell: Yes, it would allow you to use utf8mb4 safely, with a couple of other issues to solve. To summarize the now quite long comment thread:
In comment #10 phayes noted that using utf8mb4 creates a secondary issue - it restricts the maximum index size of InnoDB tables to 191 characters. In comment #97 Hanno noted that there is an option innodb_large_prefix to supported larger indexes, but we should not assume all 5.5 users will be able to turn this on (InnoDB does not enable it by default for file-format backward compatibility).
There is a workaround suggested by ergophobe in comments #78 and #80, which is to use prefix indexes (index on the first 191 characters). There are little downsides to doing this: most strings will have enough selectivity in the first 191 characters. It may prevent 'covering index' optimizations, but I am not sure how many of these drupal has.
Also as chx notes in comment #108, upstream has dropped support for MySQL 5.1 (released November 2008). The latest Ubuntu release support MySQL 5.5 (2010) & 5.6 (2013). Red Hat 7.0 is MariaDB 5.5. So I would +1 a suggestion to make 5.5 the minimum.
Comment #112
ergophobe CreditAttribution: ergophobe commented@margantocker - thanks for the concise summary of a super long discussion.
There is the question you mention of whether 191 chars is enough to provide specificity (it's certainly conceivable that, say, some ridiculously long numeric series could have all the significant information at the end).
But in addition to that, I can't figure out from the documents what happens if I have a 191-char prefix and I search for a 200-char string. Does it return if the first 191 match?
http://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
Comment #113
Damien Tournoud CreditAttribution: Damien Tournoud commentedPlease keep in mind #95. Long indexes are bad, they use more memory, storage and are less efficient. The limit in InnoDB is there for a reason.
What needs to happen is that we need to stop using Unicode columns (and indexes) where they are not necessary. This is a comprehensive change that cannot be workaround by simply bumping database version requirements.
Comment #114
morgantocker CreditAttribution: morgantocker commented@ergophobe: I have tested prefix-length on various string types on various data-sets before. If we take IMDB movie title names, the first 15 characters provides around 99% of the cardinality of an index on the full length. An edge-case where this is less-true is URLs, where uniqueness starts after the first 11 characters (http://www.). If the URLs all start with the same domain, this could be the first 50 characters... but I would say that's a less common case.
To answer your question:
In the event that MySQL can not filter rows via the index, it will filter at the row-level, so there is no risk of wrong results. It is simply a risk of following too many pointers from an index to a row, only to eliminate the need for it there (performance).
@Damien: I find mixing and matching character-sets to be a micro optimization. I have done it before, and no longer recommend it. InnoDB internally will use variable length storage for utf8 character-set, so from a storage perspective there is no added cost. In memory-temporary tables can not do variable length, and may take more memory or perhaps convert to disk more frequently.. but in the typical case I've found that drupal queries are well optimized here, and I would be surprised to see measurable difference.
The problem with shortening index prefixes, is that you need to fully examine the dataset to determine the prefix-length. And in many cases, Drupal users will have varying datasets. Long indexes are worst in the case of an index-scan (select avg(x), count(x) etc.), so more attention could be paid to optimization here. In other cases, InnoDB will usually only load the index pages into memory as required, so it largely becomes a storage problem. The worst situation I could see is a micro-optimization to shorten the prefix length, and some installations of Drupal can not use the index because it is not selective.
Comment #115
ergophobe CreditAttribution: ergophobe commentedThanks for the clarification - that's reassuring
Comment #116
mgiffordComment #117
joelpittetContrib related issue, saving tweets with Emoji as mentioned in #98. Adding to related issues.
Comment #118
joelpittetAdding another contrib related issue with emojis
Comment #119
morgantocker CreditAttribution: morgantocker commentedA little bit more MySQL version context on innodb_large_prefix:
- It was created so that MySQL downgrades would be possible (i.e. 5.5 -> 5.1)
- This purpose is now obsolete, since 5.5+ support large prefix (and 5.1 is no longer officially supported).
- We are proposing[1] to enable innodb_large_prefix by default in MySQL 5.7, since it is now a safe change.
So I realize it's a while out for many people, but the story should be simpler with MySQL 5.7.
[1] I have the call out on my blog here:
http://www.tocker.ca/2015/01/05/what-defaults-would-you-like-to-see-chan...
Comment #120
Rajab Natshah CreditAttribution: Rajab Natshah commentedHi
Now we do have a workaround module as a fix for this issue.
Strip 4-byte UTF8
https://www.drupal.org/project/strip_utf8mb4
We do have an interface for it too.
Rewarding time working on this fix :)
Comment #121
YesCT CreditAttribution: YesCT commented.
Comment #122
stefan.r CreditAttribution: stefan.r commentedMaybe relevant: Wordpress managed to fix this 2 months ago: https://core.trac.wordpress.org/ticket/21212#comment:15
Also, the next development release for MySQL 5.7 will be a Release Candidate and will likely have the folllowing changes, supporting large enough indexes on utf8mb4 out of the box:
https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-7.html
Before we can support this though, if I can quote Damien Tournoud:
So anyone correct me if I'm wrong but I think this would have to be postponed on #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters.
Comment #123
stefan.r CreditAttribution: stefan.r commentedFurther updating issue summary
Comment #124
stefan.r CreditAttribution: stefan.r commentedComment #125
stefan.r CreditAttribution: stefan.r commentedWe now have a patch over at #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters so if anyone could chime in it might unblock this issue :)
Comment #126
stefan.r CreditAttribution: stefan.r commentedConsidering how prevalent UTF-8 characters from the additional character planes are these days (emojis, mathematical characters etc) shouldn't this actually be considered a bug?
Now that we have a workable patch in the blocking issue, would something like this work for a patch for this issue:
- If using MySQL, detect in the installer whether MySQL supports utf8m4, innodb_large_prefix=on and engine=Barracuda. If so:
- SET NAMES = utf8mb4 in the connection object and use utf8mb4 instead of utf8 in the schema, as well as row format = DYNAMIC.
- If MySQL has other settings or an override was provided in settings.php, use utf8 instead of utf8mb4.
Comment #127
stefan.r CreditAttribution: stefan.r commentedSo I've discussed this at the Dev Days and @Damien Tournoud and @pwolanin mentioned that after we get the ASCII issue in, this would actually be easily solved by raising the MySQL requirement to 5.5.3 and using 191 characters on the few remaining UTF-8 indexes that are left. We can then just use
utf8mb4
by default in the MySQL driver, without worrying about InnoDB/row settings.Comment #128
yannickooI have created a new issue to increase the minimum required version of MySQL.
Comment #129
yannickooThis patch replaces all occurrences of
utf8
withutf8mb4
when this is related to database encoding.Comment #130
morgantocker CreditAttribution: morgantocker commentedI've clarified in #1923406 where the increase in size will be (it's not a direct 33% increase). I will comment in #2473301 on raising the minimum version - it's quite a good solution imho.
Comment #133
stefan.r CreditAttribution: stefan.r commented#129 seems close to what we want... After we fix:
Comment #134
stefan.r CreditAttribution: stefan.r commented1. This still needs a test to verify we store/display utf8mb4 characters correctly.
2. As soon as #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters goes in it will also need to test for the standard utf8mb4 collation in SchemaTest.
3.
80 cols, s/UTF-8/utf8mb4/
Comment #135
hass CreditAttribution: hass commentedAdded note about Drupal system requirements as todo
Comment #136
stefan.r CreditAttribution: stefan.r commentedThat note can actually be removed, after #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters goes in we can limit the remaining few UTF8 keys (if any) to 190 characters, that way we can also support installs without
innodb_large_prefix
.Comment #137
stefan.r CreditAttribution: stefan.r commentedUpdated issue summary.
Comment #138
stefan.r CreditAttribution: stefan.r commentedComment #139
stefan.r CreditAttribution: stefan.r commentedComment #140
joelpittetThis option looks like a better option for D7. #2382707: UTF8MB4 for MySQL Maybe we can re-open it to get some support there too?
Comment #141
joelpittetReason I say that is because here we are setting the default to utf8mb4, where as there it's default stays utf8 but has the option to change it if a site wants to go through with it.
Comment #142
stefan.r CreditAttribution: stefan.r commentedThis patch adds a test, addresses comments in #134 and limits the following fields with unique constraints to 191 characters:
If we have a problem with the URI field only allowing for 191 characters, we could also get rid of the database-level constraint and add it on the application level instead.
In core we don't use any non-ASCII primary keys anymore, nor do we need to explicitly define length for regular indexes in PHP, as MySQL will take care of that for us anyway; it will cut off regular indexes at 191 characters without
innodb_large_prefixes
and at 255 characters withinnodb_large_prefixes
.Comment #144
stefan.r CreditAttribution: stefan.r commentedThis should fix some failing tests as we had skipped the test-only tables in #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters
Comment #146
stefan.r CreditAttribution: stefan.r commentedThis should further decrease test failures by marking some more fields as ASCII.
Comment #148
stefan.r CreditAttribution: stefan.r commentedComment #150
stefan.r CreditAttribution: stefan.r commentedMissed another test field.
@joelpittet as to the D7 version of this patch, @catch had proposed in #2473301: Raise MySQL requirement to 5.5.3 that we allow people to use either utf8 or utf8mb4 in D7, if @David_Rothstein is OK with that.
Comment #151
stefan.r CreditAttribution: stefan.r commentedComment #153
stefan.r CreditAttribution: stefan.r commentedThis removes the unique keys from feed title and block content info as reducing field size to 191 could break D6/D7 upgrades.
Comment #154
stefan.r CreditAttribution: stefan.r commentedComment #156
stefan.r CreditAttribution: stefan.r commentedLooks like this test run failed because #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters had been reverted.
Whenever that gets back in I'll also remove the length cutoff from the URI field on File entities. Instead we can make it an ASCII field and urlencode it, as both removing the unique constraint and cutting off /all/ URL's at 191 characters seem like even worse options.
After we do that I think the patch is close to where it needs to be, this will just need some feedback from @benjy regarding how this will fit in with migrate.
Comment #157
stefan.r CreditAttribution: stefan.r commentedThis implements urlencoding on the storage layer for non-ASCII URLs, as it got too ugly trying to do this on the entity layer. DownloadTest has coverage for non-ASCII URLs.
This still looks a bit ugly to me. Maybe it would be better to do a query looking for file entities with the same URI in File::preSave() and we just lose the unique constraint?
Comment #158
stefan.r CreditAttribution: stefan.r commentedActually I think it may be better if we add a uri hash field and put a unique constraint on that field instead?
Comment #160
stefan.r CreditAttribution: stefan.r commentedThis adds a hash as well as a test for the hash and the uniqueness constraint on the hash. The hashing still feels like a better solution than the urlencoding as it doesn't change what is stored in the URI field, as well as allowing us to save URLs longer than 255 characters (see #193954: {file}_uri and {file}_filename length limitations)
I think the new hash field would need a beta to beta upgrade hook though.
Comment #162
stefan.r CreditAttribution: stefan.r commentedComment #163
fietserwinIf we go for URI encoding,it should be storage layer, the application layer should not have to be bothered with this kind of technical details. So #157 was an improvement.
However, as URIs can indeed become incredibly long and only differ in some query parameter values after position 191, I think that using a hash might even be better. I suppose that performance wise, computing a hash is not that expensive. So continue with #160.
- Don't bother changing generated files.
- This seems like a D6 table, so it should not be changed anyway.
- The typo should be corrected in core\scripts\migrate-dump-d6.sh,but not as part of this issue.
Comment #165
stefan.r CreditAttribution: stefan.r commented@fietserwin, thanks. The D6 table has to be changed as well, as it is being created as a utf8mb4 table with a primary key on the filename field. If we don't change it, it only allows 191 characters. This is being changed both in the generated script (so tests can pass) as in the generating script, but indeed the typo won't be picked up :)
Not sure why this test run is marked as failed, in the test log it just keeps terminating in the middle of the node tests, but it doesn't say which specific test is breaking. I'll log a ticket in the testbot queue.
Comment #166
stefan.r CreditAttribution: stefan.r commentedA few nitpicks in the mean time
Comment #169
stefan.r CreditAttribution: stefan.r commentedApparently this may be a testbot problem as opposed to a problem in the patch: #2477583: Unexpected timeouts on testbot runs
I'll just keep re-queueing this patch then
Comment #170
stefan.r CreditAttribution: stefan.r commentedAh tests are green now, #166 is not that different to the previous patches so this may have been a testbot problem after all.
Updating issue summary.
Comment #171
stefan.r CreditAttribution: stefan.r commentedUpdating issue summary to clarify why we can lose the DB-level unique constraints on block description and feed title.
The uniqueness validation is just there because there's no way to distinguish between blocks/feeds in the UI other than by description, and we already have entity validators for this, nor is it a big deal if we still get two blocks with the same description in some edge case. All it'd be is an inconvenience, and workarounds exist (such as editing the description).
Just to add to this, the patch is MySQL only, but there are two tests which may affect PostgreSQL / SQLite:
NodeViewTest
andSaveTest
. For now @amateescu has confirmed thatSaveTest
works with SQLite, the other 3 cases are yet to be confirmed.Comment #172
amateescu CreditAttribution: amateescu for Drupal Association commentedJust tested now and both tests are fine on SQLite and Postgres.
Comment #173
stefan.r CreditAttribution: stefan.r commentedThanks @amateescu! That also proves the PostgreSQL and SQLite drivers aren't affected by this issue and already support full UTF-8.
Having done another review I'm happy with this patch now. Tiny coding standards fix attached.
Just waiting for one of the database people to have a look at this and for @benjy to check this will play well with migrate.
Comment #174
stefan.r CreditAttribution: stefan.r commentedOops
Comment #176
benjy CreditAttribution: benjy commentedThis all looks fine to me from a migrate POV.
Comment #177
fietserwinAdded "font icons" to the list of lacking support and in text references to prerequisite issues.
Pass TRUE for the 3rd param (not_null)? (unique fields are not null by default, but for other fields you have to pass that explicitly)
addSharedTableFieldUniqueKey() does not accept a 3rd param (it is always not null).
idem.
warning: Variable 'exception_triggered' might not have been defined.
Comment #178
stefan.r CreditAttribution: stefan.r commentedThanks for the review and the issue summary update @fietserwin.
Actually 2 and 3 may be referring to the same line of code? Or maybe to the code we are removing, as it is currently unnecessarily adding a 3rd parameter in HEAD as well:
- $this->addSharedTableFieldUniqueKey($storage_definition, $schema, TRUE);
As to 4, I have added a variable definition so we don't trigger a notice whenever the assertion fails.
Comment #179
stefan.r CreditAttribution: stefan.r commentedComment #180
stefan.r CreditAttribution: stefan.r commentedHad left an error in #178.
This is ready for further review now :)
Comment #182
fietserwin- I would have assigned FALSE to $constraint_triggered, so that is a boolean and only a boolean, but I'm fine with this patch.
- 2 and 3 were indeed the same, my fault.
RTBC for me, but what about the 'D8 upgrade path' tag, does this patch need a hook_update?
Comment #183
stefan.r CreditAttribution: stefan.r commentedWell
assertTrue
will never considerNULL
a test pass, but indeedFALSE
would have been more readable. If this goes back to "Needs work" at some point, I'll try to remember to fix.As this adds a field on the File entity and changes a few unique keys, indeed this will need an update hook at some point. But as we don't yet support beta to beta upgrades, that's out of scope for this issue :)
Comment #184
andypostJust 2 nits
this is a default, just remove the line
why 100? sessions.sid uses 128 but hash('sha256', $data) is 256 bits so char(64)
Comment #185
stefan.r CreditAttribution: stefan.r commentedAs to 2, it's a base64'd hash which adds another ~33% to to the 64 characters, so I just rounded up to 100.
But I just noticed
hashBase64
uses raw binary data, so actually ~44 chars may have been enough in this case. Let's set it to 128 though just to be consistent with what we use forsessions.sid
and as I think this discussion is out of scope for this patch.Our max_lengths make little sense in general, usually we set it to 255 or 128 for no reason and we define our fixed-length columns as
varchar
, so maybe let's discuss that, along with limiting both columns to ~44 characters in a separate followup issue?Comment #186
stefan.r CreditAttribution: stefan.r commentedComment #188
stovak CreditAttribution: stovak commentedCreated a d7 backport issue. https://www.drupal.org/node/2488180
Comment #189
stovak CreditAttribution: stovak commentedComment #190
fietserwinWith the points raise in #177 and #184 solved or explained, this is OK now.
Comment #192
fietserwinPatch no longer applies, so needs a reroll. In that case: can you also cover my 1st point from #182?
Comment #193
anavarreComment #194
stefan.r CreditAttribution: stefan.r commentedre-rolled
Comment #195
fietserwinComment #196
stefan.r CreditAttribution: stefan.r commentedThanks @fietserwin for the review. Patch is green again.
Comment #197
fietserwinAs per #190, as there are no real changes in this patch compared to #186.
Comment #198
catchI've given this a once over and I think it's OK.
Slightly concerned about the filename base64 encoding for uniqueness, but don't have a better idea (i.e. I think that's probably better than shortening the column). Want to think about that a little bit more before committing though.
Comment #199
stefan.r CreditAttribution: stefan.r commentedThanks. What specifically is a worry about the hashing? Having a column on the database that really doesn't "mean" anything?
Just to clarify for anyone reading this, the hash is a base64-encoded sha256 hash -- not just base64-encoded.
Comment #200
catchYes exactly. We're changing the schema for something that's an InnoDB-specific limitation.
For example we avoided doing similar in #83738: LOWER(name) queries perform badly; add column name_lower and index. and eventually went with #279851: Replace LOWER() with db_select() and LIKE() where possible. An example of where the additional column could go wrong is if someone directly updated a filename in the database - then the hash would no longer match. We don't support manually updating things in the database outside of update handlers, but it's not impossible an update handler would need to update some filenames then they've additionally got to base64 hash those which is easy to forget. Typing this out is increasing my unease - so back to CNR for more opinions I think.
One option would be to not fix this particular column in this issue, and revive efforts to get transliteration for filenames into core again. Or at least to split it out into its own issue.
As well as that, we could use sqlite and postgres test runs before committing this.
Comment #201
stefan.r CreditAttribution: stefan.r commentedSQLite and PostgreSQL test runs were done by @amateescu in #172, the patch is not fundamentally different.
Maybe we could we add a trigger that hashes the field instead of doing this on the PHP level?
"Not fixing" this column would imply we limit the length to 191 characters, which would mean migrations have that as a "known issue" until we get in transliteration. I had also tried implemented URLencoding in an earlier patch (#157) but didn't really like the looks of it -- similar concerns would apply if people bypass core APIs and change the column directly.
Comment #202
morgantocker CreditAttribution: morgantocker commented@stefan.r - I don't recommend using triggers. Prior to MySQL 5.7, there is one trigger per-event per-table (event: after insert, before insert etc.). Many DBAs require triggers to be able to perform schema changes and migrations. If the application also requires triggers, they won't be able to do this.
Triggers in Amazon RDS are also a pain:
https://www.percona.com/blog/2014/07/02/using-mysql-triggers-and-views-i...
Comment #203
stefan.r CreditAttribution: stefan.r commentedThen probably our best bet is to take out the hashing, limit to 191 and reopen the filename transliteration issue so we can increase the limit back to 255.
Alternatively, we take out the unique constraint (temporarily) on the database level. We could still have a Drupal-level unique constraint for filenames.
Comment #204
fietserwin#200:
#203:
I think that, in terms of a solution, both of you are saying the same thing here ... so drop that column and only check on application level.
Comment #205
stefan.r CreditAttribution: stefan.r commentedThis takes out the database-level unique constraint and adds a Symfony constraint on the URI field instead.
The database-level constraint seems useful though, so maybe we should have another look into transliterating filenames so we can bring it back as soon as that's in?
Comment #206
fietserwin3 minors:
I wouldn't pass in the 2nd param and rely on the default (being NULL, not an empty array).
Looking a bit around regarding the naming of constraints, I found:
- UserNameUnique
- UserMailUnique
- UserNameConstraint
- LinkAccess
So, it looks that if the constraint is self explaining, the word constraint is not added to the end. This would suggest FileUriUnique as id and class name.
... this URI %value ... does not sound correct to me, but I'm not a native English speaker. I would either:
- Change 'this' to 'the', or
- (my pereference) Use something like (comparing to usernameunique and usermailunique): 'The file %value already exists. Enter a unique file URI.'.
If you change this, also change the test...
Comment #207
stefan.r CreditAttribution: stefan.r commentedComment #209
stefan.r CreditAttribution: stefan.r commentedBefore RTBCing, can we confirm no new tests should fail in PostgreSQL/SQLite, either through a new test run or a manual code review comparing with the patch that had already been tested by @amateescu (drupal-utf8mb4-1314214-165.patch)
This would also need a follow-up issue where we discuss transliteration in core so we can put back the database-level unique constraint as soon as that's back in.
Comment #210
stefan.r CreditAttribution: stefan.r commentedFollow-up issue created at #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc)
Comment #211
fietserwinI'm not sure about the id used for the FileUriUnique constraint (FileUri). Other constraints around in D8 seem to use the class name as id, only chopping off Constraint (bu not Unique or Access). So that would be a final correction to the patch if the postgres and sqlite tests do not uncover anything else.
Leaving to NR to as the tests may run without that final change and to invite others to give this patch a final review.
Comment #212
stefan.r CreditAttribution: stefan.r commentedLet's be consistent with the rest of core then :)
Comment #215
fietserwinThanks, perfect. The (transliteration) follow-up has also been created, so RTBC for me. @amateescu: are you also OK with this patch?
Comment #218
stefan.r CreditAttribution: stefan.r commentedThis fixes the latest test failure as the generated Migrate files now have an MD5 checksum. Just for the record, benjy had OK'ed this in #176.
Comment #219
stefan.r CreditAttribution: stefan.r commentedUpdating issue summary to reflect latest changes.
Comment #220
stefan.r CreditAttribution: stefan.r commented@bzrudi71 did test runs on PostgreSQL/SQLite this weekend:
The MigrateTableDumpTest failure was fixed in #218
Comment #221
fietserwinComment #222
stefan.r CreditAttribution: stefan.r commentedThanks.
Just for reference, here's an interdiff with the previous RTBC patch that had been reviewed by @catch.
Comment #223
alexpottAfaics @catch's concerns have been addressed. I'm going to assign it to him rather than commit myself given @catch's prior involvement.
Comment #226
stefan.r CreditAttribution: stefan.r commentedback to rtbc
Comment #227
alexpott@stefan.r we you requeue a patch because you think the fail is unrelated it is important to comment on the issue to say what failed and why it is unrelated. This helps us to not introduce new random fails.
Comment #228
stefan.r CreditAttribution: stefan.r commentedThe failure was "Repository checkout: failed to checkout from [git://git.drupal.org/project/drupal.git].", the same error as was appearing on other issues.
Comment #230
catchSo dropping the unique database constraint isn't great, but given #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) is the only viable way to add it back properly I think that's fine.
Committed/pushed to 8.0.x, thanks!
I'm not sure how viable this is to backport to 7.x, but like the varchar_ascii moving it there for discussion.
Comment #231
alexpottThis patch is causing problems for me - I can't install Drupal 8 anymore.
Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[42000]: Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes: CREATE TABLE {router} (
`name` VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'Primary Key: Machine name of this route',
`path` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The path for this URI',
`pattern_outline` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The pattern',
`fit` INT NOT NULL DEFAULT 0 COMMENT 'A numeric representation of how specific the path is.',
`route` LONGBLOB DEFAULT NULL COMMENT 'A serialized Route object',
`number_parts` SMALLINT NOT NULL DEFAULT 0 COMMENT 'Number of parts in this router path.',
PRIMARY KEY (`name`),
INDEX `pattern_outline_fit` (`pattern_outline`, `fit`)
) ENGINE = InnoDB DEFAULT CHARACTER SET utf8mb4 COMMENT 'Maps paths to various callbacks (access, page and title)'; Array
(
)
in db_create_table() (line 435 of /Volumes/devdisk/dev/sites/drupal8alt.dev/core/includes/database.inc).
Comment #232
ArlaI get the same error as @alexpott, with MySQL 5.6.24 (and 5.6.21).
Probably related (from https://news.ycombinator.com/item?id=7317519):
Comment #233
catchReverted.
From irc this is affecting people who use actual Oracle MySQL. MariaDB seems fine.
Comment #235
morgantocker CreditAttribution: morgantocker commented> From irc this is affecting people who use actual Oracle MySQL. MariaDB seems fine.
@catch: This probably has to do with default SQL mode selection. I would be surprised if with sql mode strict, they behave differently.
Comment #236
stefan.r CreditAttribution: stefan.r commentedDo we know what the problem here is yet?
I don't have access to any Oracle MySQL installs right now, maybe anyone could test whether the fist statement gives an error and the second doesn't? Just in case the problem is the index.
Comment #237
alexpottINDEX `pattern_outline_fit` (`pattern_outline`, `fit`)
causes the problem
Comment #238
stefan.r CreditAttribution: stefan.r commented@morgantocker there is no command we can send to Oracle MySQL to make it cut off indexes at 191 characters, like MariaDB does? I guess we'll just have to define index length explicitly?
Comment #239
morgantocker CreditAttribution: morgantocker commented@stefan.r: yes. It is related to my earlier comment about differences in default sql mode. See this test case:
MySQL 5.6 is strict by default in new installations and strict for all in 5.7. This is probably not an issue for a non-unique index with two exceptions:
- Index-only scans are now not possible
- Strings with low selectivity in the left-most 191 characters.
It may be a problem with UNIQUE indexes because a constraint can not be imposed. I am not a fan of the 'loose' behavior.
You could achieve this by disabling strict mode, making the schema change, and then re-enabling strict mode.
Comment #240
stefan.r CreditAttribution: stefan.r commentedSo something like this... Though I'd rather have this taken care of by the database itself if possible (like MariaDB does)
Comment #241
stefan.r CreditAttribution: stefan.r commentedJust a proof of concept, will work on this further over the course of the week... or anyone feel free to refactor / further comment.
Comment #242
morgantocker CreditAttribution: morgantocker commented@stefan.r: To clarify, it's not quite true to say that this is being taken care of by the database. What the database is doing, is providing you with something other than what you asked for. In many cases this is very dangerous, which is why it is now disabled.
A more complete fix is available in 5.7, where the barracuda file format + innodb_large_prefix is enabled by default. This allows indexes up to 3072 bytes.
Comment #243
BerdirYeah, I don't think we should rely on the database here. We have to explicitly define how those indexes should be shortened. Quite often, it's not the last part that should be shortened but the first part of the index I think for example with the router table.
We've been doing that before, we just had higher limits that we could rely on.
@morgantocker: Good to know, so based on our history of which MySQL version we support, we just have to wait 10 years or so until we can require 5.7 and our problem will be solved ;)
Comment #246
stefan.r CreditAttribution: stefan.r commentedThis cuts off all indexes at 191 characters on non-ASCII fields that are MySQL string types (and thus are utf8mb4 encoded).
Probably the "non-last part" shortening should be dealt with in a separate issue as the implicit behavior of the previously commited patch (at least on MariaDB) was to shorten the last part.
Comment #247
stefan.r CreditAttribution: stefan.r commentedNitpicks
Comment #248
alexpottThe current patch makes the router table look like this - and Drupal installs for me.
So looks good. I guess this new code could do with some tests considering that #218 was green and got committed.
Comment #249
stefan.r CreditAttribution: stefan.r commentedAdded tests
Comment #250
stefan.r CreditAttribution: stefan.r commentedJust so we can get this back in, is anyone willing to review this? :)
Comment #251
stefan.r CreditAttribution: stefan.r commentedUpdating issue summary to reflect this latest change
Comment #252
fietserwinOn reviewing, I found 3 minors. The added test looks OK. I did not test the install, but it looks like @alexpott did so in #248.
Data types are required as of D8: https://www.drupal.org/coding-standards/docs#param
I would have phrased this differently (using de Morgan laws), but I guess that is also a matter of preference. So I will still RTBC if you don't change this.
idem as 1st point.
Comment #253
stefan.r CreditAttribution: stefan.r commentedComment #257
fietserwinThe failure is with a unique constraint, which is not in $spec['indexes'] but in $spec['unique keys']. But shortening a unique key index seems a no go to me as it would invalidate the purpose of the index of being a constraint.
Is the test data erroneous, as I can't find any hint to this unique constraint in the code, only in the test data file drupal\core\modules\system\tests\fixtures\update\drupal-8.beta11.bare.standard.php.gz.
Comment #258
stefan.r CreditAttribution: stefan.r commentedThat's most likely where the error comes from then... This removes the unique constraint as a workaround.
Comment #260
stefan.r CreditAttribution: stefan.r commentedJust looked at #2498625: Write tests that ensure hook_update_N is properly run and manually removing the constraints from the beta11 dump is probably not the right way to go :)
Comment #261
stefan.r CreditAttribution: stefan.r commentedDiscussed with @catch/@alexpott on IRC and this updates the dump instead.
Comment #263
stefan.r CreditAttribution: stefan.r commentedThis updates the site name in the dump, which must be "Site-Install" to pass tests.
Comment #265
stefan.r CreditAttribution: stefan.r commentedAnother try
Comment #266
daffie CreditAttribution: daffie commentedFor the testbot.
Comment #267
fietserwinDo I understand correctly that the only actual change is that the unique constraint was removed from the dump?
Comment #268
stefan.r CreditAttribution: stefan.r commentedThe change is that instead of taking the beta11 database dump, it takes the database dump of HEAD+the current patch using
core/scripts/dump-database-d8-mysql.php
.So that includes removal of the unique constraint, as well as all other changes introduced in this patch.
Comment #269
fietserwinThanks, then this is RTBC again, based on #248, #253, your discussion on IRC with @catch/@alexpott (#263), and all tests being green in #265.
Comment #270
daffie CreditAttribution: daffie commentedI have asked bzrudi71 to test the patch from this issue with a PostgreSQL backend to see if there are any problems with PostgreSQL.
Comment #271
Berdir@stefan.r: I think you figured out already, since Core doesn't support an upgrade path yet, updating the dump is fine, as the dump we want to have is the one where an official upgrade path is supported from.
However, it would be great if you could open an related issue in the head2head project and describe the necessary changes (also the actual collation change, since existing sites will want that change too), I'm sure @amateescu will be more than happy to have patches already as well ;)
Comment #272
stefan.r CreditAttribution: stefan.r commentedOK, opened an issue there.
@daffie those test runs are merely nice to haves, they shouldn't block this issue as this had already been previously tested on SQLite/PostgreSQL and compared to the previously tested patch this only limits MySQL indexes, and the test is MySQL-only.
Comment #275
stefan.r CreditAttribution: stefan.r commentedback to RTBC
Comment #277
catchOK. Committed/pushed to 8.0.x, thanks!
We have a week prior to the next beta to flush out any more issues, but bzrudi71 confirmed that a full postgres run was OK on #2157455: [Meta] Make Drupal 8 work with PostgreSQL or remove support from core before release.
Comment #278
David_Rothstein CreditAttribution: David_Rothstein as a volunteer commentedThis is marked for D7 backport (although there's also the spinoff issue at #2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it - I'll comment there too). Moving this one back to Drupal 7 for now.
Comment #279
mikeryanUmm, no change record? I know of at least three surprise breakages due to this patch...
Comment #280
BerdirYes, I also had to update multiple contrib modules because of this.. mostly primary keys and unique indexes that broke. Seems like a D7 backport of a similar patch would be very risky.
Comment #281
stovak CreditAttribution: stovak commented@berdir - can you list your breakages or is this in the thread somewhere?
Comment #282
mikeryanOn my side, there's
Comment #283
stefan.r CreditAttribution: stefan.r commented@Berdir D8 contrib breaking was on purpose but D7 contrib not sure we can deal with, in the other thread I had mentioned maybe on D7 we should only allow utf8mb4 if the database system support large prefixes.
As to the change record, it'll need to mention fields with primary keys or unique keys need to be either drop the constraint, limit to 191 characters or change type to varchar_ascii, @mikeryan is that missing anything?
Comment #284
BerdirDon't remember them all, but I had to fix primary keys in ultimate_cron, file_entity (file metadata) and redirect (the hash column), possibly another one as well.
Many developers just pick 255 as the length for varchar fields and that doesn't work anymore.
Comment #285
mikeryan@stefan.r: Nope, that's good - just the info needed to deal with the change.
Comment #286
stovak CreditAttribution: stovak commented@berdir - I've been working on a table_converter module that basically scans all the tables and tells you which ones you can convert without losing data. Would love feedback.
https://github.com/stovak/table_converter
included in the "supporting" folder is the core patch and the requisite my.cnf changes.
Comment #287
TR CreditAttribution: TR commentedThis patch broke one of my modules - it won't install anymore because the index is a varchar(255) (same error as described in #231).
Is it too much to ask for a change record to be made *before* a change like this is committed? That would save many people a lot of time - it seems this failure was expected in contrib, which wouldn't have been a problem if it had been announced at https://www.drupal.org/list-changes/, but when the first notice is failing tests and when a search of recent changes doesn't indicate anything, then I have to do a git bisect and read 286 comments to figure out how to fix it. And I doubt I'm the only one who is affected.
I would think twice about backporting this to D7, as there will be many contributed modules this will break and many of those aren't aggressively maintained. I think this will bring down a lot of sites.
Comment #288
stovak CreditAttribution: stovak commented@tr - not the way I did it. I simply allowed both $charset and $collation to be set in settings.php. If the settings aren't there, Drupal proceeds as normal. I made the assumption you would specifcally _WANT_ this feature and would need to make changes in order to enact. It doesn't make sense to force this on d7 users who don't specifically need it. At least it doesn't make sense to me.
Comment #289
stefan.r CreditAttribution: stefan.r commentedDraft change notice posted at https://www.drupal.org/node/2510940
Comment #290
alexpott@stefan.r can you add some examples of how to fix stuff to the CR - thanks.
Comment #291
TR CreditAttribution: TR commentedWhat do you think about adding a check in the driver Schema.php so that these varchar(255) keys are caught and handled gracefully (with an informative user message and controlled return) when trying to create the table, rather than letting the database generate a fatal error and WSOD? The patch added normalizeIndexes()/shortenIndex(), so we already are detecting the problem, we're just not handling it very well.
Also, this 191 character limitation really needs to be documented somewhere.
Comment #292
Wim LeersI did not know we committed a D8 DB dump as part of this patch. Updating that as part of other patches is very painful. If there is a way to not have that, that'd be better. Definitely follow-up material though.
EDIT: FYI: #2464427-65: Replace CacheablePluginInterface with CacheableDependencyInterface.
Comment #293
pingers CreditAttribution: pingers commentedThis has broken a bunch of contrib with long indexes... *sigh*
Dealing with repeated issues breaking the previous 333 char limit in <= D7 was bad enough, now 191 chars. Eeek.
/me attempts writing patches.
Comment #294
ansorg CreditAttribution: ansorg commentedCannot install Beta 12 due to
Failed to connect to your database server. The server reports the following message: SQLSTATE[HY000] [2019] Can't initialize character set utf8mb4 (path: /usr/share/mysql/charsets/).
My shared hosting provider does only support utf8 :-(
This is DomainFactory which is pretty big hoster in Germany, guess this will hit many users
Comment #295
stefan.r CreditAttribution: stefan.r commented#292: was not as part of this patch
#293: The D8 contrib breakage was on purpose, refer to the CR for possible workarounds. Remember it's only a problem on unique indexes and primary keys, regular indexes are automatically shortened.
#294: the MySQL requirement has been raised to 5.5.3 per #2473301: Raise MySQL requirement to 5.5.3. However instead of that error, the installer should be saying "The mysql database server must support utf8mb4 character encoding to work with Drupal. Make sure to use a database server that supports utf8mb4 character encoding, such as MySQL/MariaDB/Percona versions 5.5.3 and up."
It doesn't?
Comment #296
ansorg CreditAttribution: ansorg commented#1314214-295: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) MySQL at DomainFactory is 5.6.19 but technical support confirms that currently only utf8 is enabled or "preinstalled" as he said.
Comment #297
stefan.r CreditAttribution: stefan.r commentedThat's not possible in D8 beta12 and up...
Doing a google search on "Can't initialize character set utf8mb4" points to several possible issues, maybe @morgantocker would know what the problem coulud be here? What's the specific command that causes the issue?
In any case this could use a separate setup check so we can give a friendlier error message.
Comment #298
morgantocker CreditAttribution: morgantocker commentedutf8mb4 is enabled by default from 5.5+, so it looks like DomainFactory has explicitly disabled it. I have honestly not heard of this being done before.. :)
It is possible to have a more specific feature detection to cater for this:
If it returns >0 rows, the feature exists.
Comment #299
stefan.r CreditAttribution: stefan.r commentedOK, I'll see about adding a patch that does that.
@ansorg can you confirm the command "
SHOW CHARACTER SET like'utf8mb4';
" gives an empty result for you? Would be good if you could keep pushing your provider to see if they can fix and report back what the actual problem was. It's unclear why they would disable that character set on purpose...Also, did you have this problem when installing Drupal or during an upgrade? Because the installer should already test for
SET NAMES utf8mb4
not giving an errors...Comment #300
ansorg CreditAttribution: ansorg commentedUnfortunately, I can run SQL only through PhpMyadmin and unfortunately there I do get a result for the query
SHOW CHARACTER SET LIKE 'utf8mb4';
The problem happens while installing a fresh download of Beta12 with an empty database.
Initially I tried to update a Beta10 using drush but I'm not sure anymore whether this did not work due to the utf8mb4 issue or something else.
Comment #301
jhedstromI opened #2522098: Shorten index length for unique and primary keys in addition to normal indexes since schemas with unique or primary keys that need shortening do not currently work, and fail.
For instance, this schema:
Fails with:
Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes
Comment #302
stefan.r CreditAttribution: stefan.r commentedRegarding #300: wouldn't that point to a configuration issue on your hosting provider's side if the character set is listed anyway? Doing a google search for the error message you showed earlier did point to that as well.
@jhedstrom #301 was as designed, answering in the new issue
Comment #303
reevo CreditAttribution: reevo commentedI'm seeing the same issue during install as #294, using the following:
CentOS 6.4
MySQL 5.5.44 (mysql55w from Webtatic repo)
PHP 5.5.26
...and seeing this error message when I try to initialise a PDO connection using the same $dsn as D8:
Drupal is unable to connect to a MySQL server which doesn't support utf8mb4 with charset=utf8mb4 in the $dsn.
Edit: I've created an issue with patch to catch this error: #2529188: Provide better error handling for MySQL client and server utf8mb4 incompatibility
Comment #304
stefan.r CreditAttribution: stefan.r commented@reevo mysql55w should have utf8mb4 already compiled in/available, what other packages did you have that have the word mysql in them? The error may have to do with libmysqlclient, you have the latest webtatic version for that?
On this page the error is mentioned as well, where they remove all old mysql packages before instlaling libmysqlclient16, mysql55w, mysql55w-libs, mysql55w-devel and mysql55w-server: https://manageyp.github.io/ruby-on-rails/2015/03/05/aliyun-rds-change-my...
Comment #305
reevo CreditAttribution: reevo commented@stefan.r here's my list of mysql related packages:
I've tried completely uninstalling and reinstalling them all and am seeing the same result.
Comment #306
morgantocker CreditAttribution: morgantocker commentedGood call on identifying it is a client issue. That is a MySQL 5.1 client on a 5.5 install.
Comment #307
ansorg CreditAttribution: ansorg commentedre #306, not sure about the client thing?
At DomainFactory I see "MySQL client version: 5.1.61" -
On a Synology DiskStation I have utf8mb4 working with "Database client version: libmysql - mysqlnd 5.0.11-dev - 20120503 "
Comment #308
adammaloneComment #309
stefan.r CreditAttribution: stefan.r commentedApparently libmysqlclient has supported utf8mb4 since 5.5.3 (just like MySQL server). Mysqlnd uses different version numbers and has supported utf8mb4 since 5.0.9.
So we may have to do a separate mysql_get_client_info() check in addition to a server check.
Comment #310
vijaycs85We may need to test this with various OS/PHP packages before porting this to 7.x. As @stefan.r mentioned in #309, most of the php5.5.3 packages depend on old libmysqlclient version.
Comment #311
jhedstromI opened #2537928: MySQL index normalization only works on table create since the index normalization becomes problematic when adding indexes to existing tables. Thoughts on how to resolve would be greatly appreciated.
Comment #312
ansorg CreditAttribution: ansorg commentedanother big german hoster, Strato.de, seems to have a similar mysql configuration as Domainfactory and that again breaks recent drupal-8 beta12
Failed to connect to your database server. The server reports the following message: SQLSTATE[HY000] [2019] Can't initialize character set utf8mb4 (path: /opt/RZmysql5/share/mysql/charsets/).
Comment #313
stefan.r CreditAttribution: stefan.r commented@ansorg as mentioned in #2521784: cannot install at shared hoster Strato.de due to unsupported utf8mb4 Charset, this is likely due to a client version incompatibility, we are working on better error handling for this in #2529188: Provide better error handling for MySQL client and server utf8mb4 incompatibility
Comment #314
bzrudi71 CreditAttribution: bzrudi71 commentedThis commit causes me more and more headache ;) Don't get me wrong, I'm totally +1 for this, but... We see user reports with problems for at least Domainfactory.de, 1und1.de and strato,de. So we are talking about the biggest players in Germany with a marketing share (together) far above 50% (or even more)! All approaches I see so far are just about adding a better error handling (mysql_client version). TBH does anyone expect these companies with millions and millions of hosted sites to change their setup because Drupal 8 doesn't like it? ;) IMHO this smells like a big Drupal 8 showstopper for German customers to me :(
Again, I like the idea and as a PG user not even affected by this but this feels more and more like a nogo to me, sorry.
Comment #315
stefan.r CreditAttribution: stefan.r commented@bzrudi71 the minimum version requirement is 5.5.3 for both the PHP mysql driver and the server per #2473301: Raise MySQL requirement to 5.5.3 , so even if we revert this patch, if we keep the version requirement Drupal 8 would still not work on those environments.
If it's really too much a problem that some large hosters still use older client libraries on their shared hosting environments, maybe another alternative would be to work on a utf8 fallback so we can relax the client requirement, however that's opening a whole other can of worms. And I wonder if by the time D8 is widely adopted enough time will have passed for them to have upgraded their environment anyway.
For what it's worth, others have easily fixed the issue by switching mysql drivers (from libmysqlclient to mysqlnd). May be worth nudging these hosters in the same direction if upgrading libmysqlclient is problematic for them.
Comment #316
morgantocker CreditAttribution: morgantocker commented@bzrudi71, @stefan.r: my opinion is similar to stefan's here.
Please also consider that upstream (disclosure: where I am employed) no longer supports MySQL 5.1. Although not as common, there are also client vulnerabilities such as BACKRONYM, which will not be addressed. #2473301 took a look at hosting providers/distros and their minimum supported versions. 5.5 was released in 2010, and is not exactly bleeding edge by any stretch.
Comment #317
geerlingguy CreditAttribution: geerlingguy as a volunteer commentedFor Drupal 7, should we consolidate efforts with #2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it? It seems that the only major differences here are we're not changing the default collation, since we have to support older versions of MySQL in D7, and we're adding
charset
configuration (which should first be added to Drupal 8).Comment #318
DamienMcKennaFYI for the Twitter module we've changed to storing the tweet text in a blob (#1910376: SQL error when importing tweet with emoji), which works around the problem and is enough of a fix for our meagre needs.
Comment #319
morgantocker CreditAttribution: morgantocker commented@DamienMcKenna I think it depends on the situation if this workaround is acceptable. Blob will mean no collation (sorting order and comparison). The comparison feature is often useful: depending on the collation it will be case insensitive and accent insensitive.
Comment #320
DamienMcKenna@morgatrocker: Completely understandable.
Comment #321
eranmatis CreditAttribution: eranmatis as a volunteer commentedHi,
Do you know a host company which supports the Drupal 8 requirements ?
I just has a bad experience with MochaHost.
After a week of support ping pong, they said that their MySql doesn't support utf8mb4.
Comment #322
ansorg CreditAttribution: ansorg commented@eranmatis yes, this kind of sucks.
I have checked several well-known hosters in Germany (DomainFactory, Alfahosting, 1&1, Strato) by now and although their MySQL Server versions are fine, the mysql-client libs for PHP are outdated. None of them will allow running Drupal 8.
Comment #323
basvredelingIsn't it just a case of supporting mysql 5.6+? I asked https://www.byte.nl/ webhosting via twitter the other day. They don't anticipate any problems using utf8mb4. So I'm genuinely curious what the ping pong of #321 was all about.
Comment #324
stefan.r CreditAttribution: stefan.r commented@basvredeling no, it's about libmysqlclient - some hosters still run a an ancient version of that (5.1) which is incompatible with Drupal 8.
Upgrading libmysqlclient is not something they can do that easily, but there have been several reports here in the issue queue of people making an easy switch from libmysqlcleint to MySQLnd (the native driver that comes with PHP).
We also have a minimum version requirement for PHP, and the MySQLnd driver that comes with any version that meets the requirement does work with Drupal 8. So it would be good if people pressured these hosts to implement this.
Comment #325
stefan.r CreditAttribution: stefan.r commentedI've been in touch with DomainFactory and they will fix their setup on October 20th:
Comment #326
Rajab Natshah CreditAttribution: Rajab Natshah as a volunteer and at Vardot commentedPlease, we need to have this patch in the next Drupal 7 version.
Still using the workaround module as a fix for this issue.
Strip 4-byte UTF8
https://www.drupal.org/project/strip_utf8mb4
We do have an interface for it too.
We want to jump to have that in Drupal.
Thank you.
Comment #327
bernard.verslype CreditAttribution: bernard.verslype commentedLa réponse de OVH à mes demandes d'explications:
D'après la documentation officiel MySQL je constate que l'UTF8MB4 n'est
valable qu'à partir de base de données en version 5.5.3 hors mysql5-18.perso
et mysql51-93.perso sont en version 5.1, version que vous avez choisi lors de
leur création.
Je vous invite ici à sauvegarder vos bases, à les vider, à les supprimer puis
à les recréer en 5.5 avant de réimporter vos données ce qui devraient vous
permettre d'utiliser l'UTF8MB4.
OVH's response to my requests for explanations:
According to the official MySQL documentation I see that is UTF8MB4
valid only from database Version 5.5.3 out mysql5-18.perso
mysql51-93.perso and are in version 5.1, version you chose when
their creation.
I invite you here to save your bases, emptying, and then to remove
to recreate in 5.5 before reimport your data which you should
allow to use the UTF8MB4.
Comment #328
geerlingguy CreditAttribution: geerlingguy as a volunteer commented@RajabNatshah - To be clear, this issue is about adding support for 4-byte characters, not stripping them from content. But yes, many people are in favor of adding optional support for UTF8mb4 to core in Drupal 7 :)
Comment #329
Rajab Natshah CreditAttribution: Rajab Natshah as a volunteer and at Vardot commented@geerlingguy .. Yes .. that what we want ..
But we keep having this patch ported every time ..
It did not come with the last Drupal 7 Release.
Comment #330
darol100 CreditAttribution: darol100 as a volunteer and commentedSeem to me that this already been committed to D8. What are the things that needs to be done to complete this issue in D7 ?
I know that @geerlingguy provide a work around on how to fix this issue on D7 website (Solving the Emoji problem in Drupal 7), but this requires to patch D7 core (which I'm not going to do).
Comment #331
tim.plunkettMarking this as fixed because the D7 issue is at #2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it
Comment #333
Helgi Andri Jónsson CreditAttribution: Helgi Andri Jónsson commentedHello everyone,
I made a module that uses entity metadata wrappers to loop over the properties of an entity on hook_entity_presave and it converts all unicode characters to html entities for properties of type text and formatted text. I know it would be great have this solved in core and its not really a solution to strip them from input as that would be counter to the expectation of a user experience.
I hope the module solves a middle ground. I also made sure to include configurable settings to allow us to blacklist or whitelist entities to check for unicode. Therefore we can isolate such input to certain entities like comments or messages.
https://www.drupal.org/project/unicode
Best Regards,
Helgi.
Comment #334
jibranCreated #2770319: Testbot MySQL config doesn't use large prefix hiding potential errors for testbot.
Comment #335
David_Rothstein CreditAttribution: David_Rothstein as a volunteer commentedComment #336
geek-merlin