Problem/Motivation

The Drupal MySQL driver does not currently provide full UTF-8 support. The (confusingly named) MySQL utf8 charset only provides support for the basic character plane (out of 17 in total), which constitutes 6% of possible characters. As of MySQL 5.5.3 (2010), the utf8mb4 charset provides full UTF-8 support, being completely backwards compatible, not requiring more space than utf8 for characters that are in the utf8 set, and using extra byte for characters outside of the utf8 set.

For reference, Wordpress implemented this in February 2015: https://core.trac.wordpress.org/ticket/21212

With the current partial UTF-8 support we lack the 16 other character planes. This would include:

  • Emojis!
  • Mathematical/scientific symbols
  • Some Chinese, Japanese and Korean signs
  • Musical notation
  • Obscure/ancient languages
  • Emoticons, astral symbols, game symbols and other pictographic sets
  • (User defined) Font icons (these normally use reserved planes)

See also:

The Drupal PostgreSQL and SQLite drivers do implement full UTF-8 support.

Proposed resolution

For D7: Allow users to use either utf8 or utf8mb4.

For D8:
Prerequisite issues that are already committed:

Now that we require a higher MySQL version and use ASCII encoding where possible, we can just set the utf8mb4 character set by default. Additionally, so we can support installs without "innodb_large_prefixes", we work around the InnoDB 191 character limitation on primary keys/indexes by:

  • Getting rid of the database-level unique constraint on feed titles. We still have an entity constraint on the application level, as there's no way to distinguish between feeds in the UI other than by title, but it's not a huge deal to have two feeds with the same title so we can safely lose the database-level constraint.
  • Getting rid of the database-level unique constraint on block descriptions. We still have uniqueness validation in the form (a patch exists to turn this into an entity constraint), as there's no way to distinguish between blocks in the UI other than by description, but it's not a huge deal to have two blocks with the same description either so we can safely lose the database-level constraint.
  • We have a Drupal-level unique constraint on the File URI field instead of a database-level unique constraint, until we fix #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) (at which point we can put back the database constraint).
  • Setting the max size of the username field to whatever the username length is (60 characters) instead of 255.
  • Setting the filename field on the D6 system table to ASCII, in both the generated script and in dump-database-d6.sh.
  • Setting various fields (that only hold ASCII-data) to ASCII on test-only tables. We had skipped these in #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters.

As to regular indexes, we limit those to 191 characters if they refer to utf8mb4-based fields.

User interface changes

N/A

API changes

We have a Drupal-level unique constraint on the File URI field instead of a database-level unique constraint, until we fix #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) (at which point we can put back the database constraint).

Original Issue Summary by mdupont

From MySQL documentation, its "utf8" encoding only supports a maximum of 3 bytes per character.

As such, if we try to save valid UTF-8 content that passes drupal_validate_utf8() but contains 4 bytes characters, it would trigger a buggy behavior (see #1199736: Error on pasting in different language):

  • in D6, content will be truncated to the first 4-bytes character ; any content after it will not be saved (data loss)
  • in D7 and later, it will trigger an error in PDO: "PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value:"

When using SQLite there are no issues with 4 bytes characters. I hadn't tested with PostgreSQL.

MySQL can handle 4 bytes characters since version 5.5.3 by using "utf8mb4" encoding.

How could we avoid data loss or error when using regular MySQL utf8 encoding?

CommentFileSizeAuthor
#265 diff-beta11-HEAD-3.txt743.12 KBstefan.r
#265 1314214-265.patch253.17 KBstefan.r
#263 1314214-263.patch253.19 KBstefan.r
#263 diff-beta11-HEAD-2.txt743.11 KBstefan.r
#261 diff-beta11-HEAD.txt743.32 KBstefan.r
#261 1314214-261.patch253.18 KBstefan.r
#261 interdiff-258-261.txt111.49 KBstefan.r
#258 1314214-258.patch199.7 KBstefan.r
#258 interdiff-253-258.txt172.7 KBstefan.r
#253 1314214-253.patch27.27 KBstefan.r
#253 interdiff-249-253.txt1.56 KBstefan.r
#249 interdiff-247-249.txt3.37 KBstefan.r
#249 1314214-249.patch26.94 KBstefan.r
#247 1314214-247.patch23.78 KBstefan.r
#247 interdiff-246-247.txt2.28 KBstefan.r
#246 interdiff-218-246.txt3.35 KBstefan.r
#246 1314214-246.patch23.69 KBstefan.r
#222 interdiff-185-218.txt5.93 KBstefan.r
#218 interdiff-212-218.txt342 bytesstefan.r
#218 1314214-218.patch20.4 KBstefan.r
#212 1314214-212.patch20.28 KBstefan.r
#212 interdiff-208-212.txt1.01 KBstefan.r
#209 interdiff-207-208.txt539 bytesstefan.r
#209 1314214-208.patch20.27 KBstefan.r
#207 interdiff-205-207.txt2.83 KBstefan.r
#207 1314214-207.patch20.27 KBstefan.r
#205 1314214-205.patch20.54 KBstefan.r
#205 interdiff-194-205.txt5.62 KBstefan.r
#194 1314214-194.patch21.51 KBstefan.r
#194 interdiff-185-194.txt567 bytesstefan.r
#186 interdiff-180-185.txt1.32 KBstefan.r
#186 1314214-185.patch21.57 KBstefan.r
#180 interdiff-178-180.txt605 bytesstefan.r
#180 1314214-180.patch21.75 KBstefan.r
#178 interdiff-174-178.txt1.85 KBstefan.r
#178 1314214-178.patch21.75 KBstefan.r
#174 interdiff-165-174.txt610 bytesstefan.r
#174 1314214-174.patch21.56 KBstefan.r
#173 1314214-173.patch22.23 KBstefan.r
#173 interdiff-165-173.txt513 bytesstefan.r
#166 drupal-utf8mb4-1314214-165.patch21.55 KBstefan.r
#162 drupal-utf8mb4-1314214-162.patch23 KBstefan.r
#160 interdiff-157-159.txt7.17 KBstefan.r
#160 drupal-utf8mb4-1314214-159.patch22.89 KBstefan.r
#157 drupal-utf8mb4-1314214-157.patch20.72 KBstefan.r
#157 interdiff-153-157.txt3.75 KBstefan.r
#154 drupal-utf8mb4-1314214-153.patch18.77 KBstefan.r
#153 interdiff-150-153.txt5.33 KBstefan.r
#151 drupal-utf8mb4-1314214-150.patch15.72 KBstefan.r
#148 interdiff-142-148.txt4.88 KBstefan.r
#148 drupal-utf8mb4-1314214-148.patch15.44 KBstefan.r
#146 interdiff-142-146.txt4.56 KBstefan.r
#146 drupal-utf8mb4-1314214-146.patch15.12 KBstefan.r
#144 drupal-utf8mb4-1314214-144.patch12.68 KBstefan.r
#144 interdiff-142-144.txt2.45 KBstefan.r
#142 interdiff-129-142.txt6.8 KBstefan.r
#142 drupal-utf8mb4-1314214-142.patch9.66 KBstefan.r
#129 drupal-utf8mb4-1314214-129.patch6.12 KByannickoo
#104 mysql_utf8mb4_support-1314214-104-drupal7.28-backport.patch.txt6.37 KBmarco
#98 database-1314214-98.patch9.68 KBpfrenssen
#70 database-1314214-70.patch9.71 KBYesCT
#70 interdiff-69-70.txt5.92 KBYesCT
#69 database-1314214-69.patch9.62 KBergophobe
#69 interdiff-49-69.txt10.54 KBergophobe
#68 database-1314214-68.patch9.55 KBergophobe
#68 interdiff-49-68.txt10.49 KBergophobe
#64 database-1314214-64.patch7.66 KBergophobe
#64 interdiff-49-64.txt6.5 KBergophobe
#61 interdiff.txt5.79 KBchx
#58 database-1313214-58.patch8.43 KBdamienwhaley
#58 interdiff-51-81.txt6.19 KBdamienwhaley
#51 database-1313214-51.patch5.41 KBYesCT
#51 interdiff-49-51.txt1.09 KBYesCT
#49 database-1313214-49.patch5.41 KBergophobe
#49 interdiff.txt1.84 KBergophobe
#47 database-1314214-47.patch5.24 KBkbasarab
#47 interdiff.txt1.43 KBkbasarab
#43 database-1314214-43.patch5.39 KBsimolokid
#34 mysql_utf8mb4_support-1314214-34.patch5.27 KBphayes
#33 mysql_utf8mb4_support-1314214-33.patch5.32 KBphayes
#30 mysql_utf8mb4_support-1314214-30-drupal7.10-backport.patch.txt4.28 KBphayes
#29 mysql_utf8mb4_support-1314214-29.patch5.39 KBTor Arne Thune
#29 interdiff-26-29.txt4.7 KBTor Arne Thune
#26 mysql_utf8mb4_support-1314214-11.patch5.38 KBphayes
#24 mysql_utf8mb4_support-1314214-10.patch5.32 KBphayes
#22 mysql_utf8mb4_support-1314214-9.patch5.35 KBphayes
#20 mysql_utf8mb4_support-1314214-8.patch5.35 KBphayes
#19 mysql_utf8mb4_support-1314214-7.patch5.37 KBphayes
#12 mysql_utf8mb4_support-1314214-6.patch3.01 KBphayes
#5 mysql_utf8mb4_support-1314214-5.patch1.42 KBbasic
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

mdupont’s picture

Title: MySQL utf8 encoding is not really utf8 - error on node save » UTF-8: 4 bytes characters bug using MySQL backend

Updated title

basic’s picture

This looks like it ties in directly to the discussion on #1309278: Make PDO connection options configurable

sun’s picture

Issue tags: +MySQL, +charset, +collation

Reference manual: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

So...

What we want: Allow to use utf8mb4 instead of utf8.

What we don't want: Allow to use non-utf8 character sets.

catch’s picture

Issue tags: +Needs backport to D7

Tagging for backport.

basic’s picture

This patch lets you set the 'names' connection option to utf8mb4, other charsets are defaulted back to utf8. This doesn't feel like the right way to approach this issue though.

In the case that we try to use utf8mb4 on an older version of MySQL we enter a scenario where Drupal cannot connect to the database, and we're brought back to the drupal install.php (which writes an updated settings.php w/o the utf8mb4 option). Not horrible, but not quite adequate either.

How should this kind of issue be handled? Should we...

  • default to utf8mb4 on versions that support it
  • only allow utf8mb4 on mysql versions that support it (5.5.3+), and fall back to utf8 for all other versions
  • leave this patch as-is
  • other solutions?
phayes’s picture

I think detecting wether it is compatible or not makes sense and throwing an error if it's not.

There are two ways to detect compatibility:

- Check the MySQL version (should be above 5.5.3)
- Run the query "show character set;"

The problem is that I think both of these things require a extra query :-/

Two comments about the above patch

1. "names" is a bit of a confusing variable name for the settings, how about "charset"?

2. I understand we want to limit mistakes, but is there a good reason to force only utf8 and utf8mb4 if the user is savy enough to edit their database connection settings? I think defaulting to utf8 might be good enough to limit mistakes and we should allow savy users to set the connection encoding to whatever they want if they so desire.

Finally, thank you *so* much basic for taking this on. Mucho mucho appreciation!

catch’s picture

Status: Active » Needs work

Moving to CNW since there's a patch here.

basic’s picture

From sun:

What we want: Allow to use utf8mb4 instead of utf8.

What we don't want: Allow to use non-utf8 character sets.

  1. I like the idea of changing 'names' to 'charset' (I agree that this is confusing)
  2. I took @sun's post (#3) into consideration when creating the patch - this is why there are only two available options. I'll leave it up to the dbtng folks to decide what we want to allow. The biggest hurdle I believe is keeping people off of latin1 and on utf8*, by opening this back up we will also have to continue supporting latin1 and other unsupported charsets.

I am feeling that we should default to utf8mb4 where possible, and allow changing the charset back to utf8 if needed. This would cause mysql to behave the same as sqlite and postgres by default (assuming mysql >= 5.5.3), and use the legacy charset (utf8) if the mysql version doesn't support utf8mb4.

I would like to get some more feedback on this issue as well.

phayes’s picture

Another angle that this patch should address is the default table mysql character sets for tables. If we don't change the default character set for tables, setting the NAMES at connection-time will do nothing useful.

The table-level character-set is currently defined in includes/database/mysql/schema.inc line 85 and is currently hardcoded. This starts to get tricky as it seems dirty to pull this data from the mysql-*connection* configuration object. Instead, we need something like a variable that can be overridden in settings.php.

Thoughts?

phayes’s picture

I've ran into another issue with using utf8mb4 that we will need to address: index-size.

Using utf8mb4 increases the index size from 3 bytes per character to 4 bytes per character. This isn't really a problem for text-fields, but absolutely plays havoc with varchar fields who's allowed size is 191 characters or larger. The reason for the is that the varchar-index has a maximum size of 765 bytes. Normally this is fine as it is exactly three-times the size of a 3-byte-per-character 255 character varchar (767 / 3 = 255). When we switch to 4-bytes-per-character, the maximum number of characters a 765 byte index can support is 191 (765 / 4 = 191).

So we have two options here
1. Only support varchars up to 191 characters in length for utf8mb4. I *don't* think this is a very good idea as it breaks too many assumptions.
2. Only support utf8mb4 characters in text columns. I think this is entirely reasonable.

Implementing 2 will require a little more logic in the createField method, but I don't see this being problematic.

phayes’s picture

Status: Needs review » Needs work

Here's a snippet you can run to update all your tables and text-field to utf8mb4 for testing:

EDIT: DONT USE THIS! It will also change the charset of varchar fields created after it was run (BAD!). Use the one pasted below...

  $schema = drupal_get_schema();  
  foreach ($schema as $table_name => $table) {
    db_query('ALTER TABLE '.$table_name.' DEFAULT CHARACTER SET utf8mb4');
    foreach ($table['fields'] as $field_name => $field) {
      if ($field['type'] == 'text') {
        db_change_field($table_name, $field_name, $field_name, $field);
      }
    }
  }
phayes’s picture

Status: Needs work » Needs review
FileSize
3.01 KB

Here's a patch that collects up the above observations.

I've moved the configuration from the connection config to a general $conf setting. This is because, if we are connecting to multiple databases using slaves, we want to make sure that all connections are configured the same. Additionally, using this $conf setting ensures that we only have *one* setting to change in order to enable utf8mb4 and everything "just works".

Tables now always default to utf8 no matter what, and columns only get tagged with utf8mb4 if they are text fields.

Review and thoughts please!

phayes’s picture

Status: Needs work » Needs review

Two more things that should be added to the above patch:

1. Link to http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html in default.settings.php so a user can learn more
2. Check to make sure MySQL is compatible on *install*. I don't think we need to check on every request, because of:
A. We check it on install
B. If a site is migrated from one MySQL server to another that is non-compatible, the SQL import will fail when it encounters the utf8mb4 column charset

phayes’s picture

Status: Needs review » Needs work

Huh, I guess the MySQL testbot is not running MySQL 5.5.3+

Crell’s picture

+++ b/core/includes/database/mysql/database.inc
@@ -52,14 +52,16 @@ class DatabaseConnection_mysql extends DatabaseConnection {
+    $charset = variable_get('mysql_textfield_character_set','utf8') == 'utf8mb4' ? 'utf8mb4' : 'utf8';

variable_get() is not allowed inside the DB layer. All configuration must come in through the database configuration array only. That's by design so that the DB system is not dependent on the variable system, which itself has a soft dependency on the database.

The Database layer right now is about 95% Drupal-agnostic. The goal is that by Drupal 8's release t is 100% Drupal agnostic.

phayes’s picture

Thanks Crell,

I will re-roll using global $conf;

Crell’s picture

No, global $conf is what we should not be using. :-) Such configuration should go into the $databases array, if anywhere.

phayes’s picture

Status: Needs work » Needs review
FileSize
5.37 KB

Here's another patch. Please review.

I think we are getting pretty close. The user can configure the charset in the database settings. We store the charset in the Database connection object as a read-only property so other functions can get this data later.

In particular, I would like someone to review the validateDatabaseSettings method. not sure if I'm doing this right...

phayes’s picture

Blarg. Re-rolled above patch - removing danling global $conf

phayes’s picture

Status: Needs work » Needs review
FileSize
5.35 KB

{sigh} - re-rolled again (typo).

phayes’s picture

Status: Needs work » Needs review
FileSize
5.32 KB

My apologies for the all the bad patches. Let's try this again shall we.

phayes’s picture

Fixing notice that testbot found...

phayes’s picture

Status: Needs work » Needs review
Crell’s picture

Also, just to note that this is going to run smack dab into #1321540: Convert DBTNG to namespaces; separate Drupal bits and conflict. I'd rather get reviews on that sooner than later so that we can get it in, and then do other API changes.

Tor Arne Thune’s picture

Just some comment cleanup. The interdiff is between patch in #26 and the attached patch.

phayes’s picture

Status: Needs work » Needs review
FileSize
4.28 KB

For those who need this working against the latest stable in D7 (7.10 right now), here's a patch I cooked up. I hope people find it useful.

Also, once you install this patch and change your charset in your settings.php file, you can run this to upgrade all your text fields:

  $schema = drupal_get_schema();  
  foreach ($schema as $table_name => $table) {
    foreach ($table['fields'] as $field_name => $field) {
      if ($field['type'] == 'text') {
        db_change_field($table_name, $field_name, $field_name, $field);
      }
    }
  }
xjm’s picture

Thanks for the comment cleanup! Here's two other minor docs cleanups to incorporate:

+++ b/core/includes/database/mysql/database.incundefined
@@ -19,6 +19,28 @@ class DatabaseConnection_mysql extends DatabaseConnection {
+   * Flag to indicate our charset. It can either be utf8 or utf8mb4 (adds
+   * support for 4-byte UTF8 characters). See settings.default.php to learn more
+   * about utf8mb4. This is used in combination with the __get method below to
+   * make charset a read-only property of this object.
+   *
+   * @var string
+   */
+  protected $charset = 'utf8';
...
+  /**
+   * This is used to fetch readonly variables. You can not read the registry
+   * instance reference through here.
+   *
+   * @param string $var
+   * @return string
+   */
+  public function __get($property) {

These two docblocks should have one-line summaries. Also, the function's summary should start with a third-person verb. (See http://drupal.org/node/1354#classes).

Crell’s picture

Status: Needs review » Needs work
+++ b/includes/database/mysql/database.inc
@@ -19,6 +19,28 @@ class DatabaseConnection_mysql extends DatabaseConnection {
+  /**
+   * This is used to fetch readonly variables. You can not read the registry
+   * instance reference through here.
+   *
+   * @param string $var
+   * @return string
+   */
+  public function __get($property) {
+    // Only allow access to the charset property.
+    if ($property == 'charset') return $this->$property;
+  }

Woah, woah, why? The magic methods are dangerous magic in most cases. This is a hack. Don't do that. If for some reason you need to expose the charset to the outside world, make it a method.

-18 days to next Drupal core point release.

phayes’s picture

Status: Needs work » Needs review
FileSize
5.32 KB

Hi Crell,

Here's an updated patch with the magic methods removed and replaced with a "charset" method.

phayes’s picture

Whoops,

The above still had some comments pertaining to the __get method, those comments have been corrected in this one.

Crell’s picture

Status: Needs review » Needs work
+++ b/core/includes/database/mysql/database.inc
@@ -19,6 +19,15 @@ class DatabaseConnection_mysql extends DatabaseConnection {
+  /**
+   * Flag to indicate our charset. It can either be utf8 or utf8mb4 (adds
+   * support for 4-byte UTF8 characters). See settings.default.php to learn more
+   * about utf8mb4. The charset() method can be used to read this property.
+   *
+   * @var string
+   */
+  protected $charset = 'utf8';

Variable docblocks should follow the same rules as any other docblock. Vis, one line summary, empty line, then long description.

This value actually could be, arguably, multiple different values, couldn't it? Not just the two utf8s? If someone sets the charset connection option, they can set it to whatever they want.

No need to mention the method. In fact, most of this documentation should move to the method. Remember from a API perspective, the method is the only thing that exists. There is no property. That's just an implementation detail that a caller should not be able to know about.

-25 days to next Drupal core point release.

phayes’s picture

Crell,

As per sun's comment above (#3), we only want to support utf8 and utf8mb4 charsets. If you want to override this decision I will make the necessary changes to the patch.

mdupont’s picture

Marked #1741026: Some unicode code points failed on Drupal as a duplicate of this issue.

laurentnet’s picture

Drupal 7.15, with SQlite 3.7.7.1 or PosgreSQL 9.3.1 (apache 2.2.2),
displays correctly code point U+1D538

Postgresql 9.1: one unicode character UTF_8 can be 1,2,3 ou 4 bytes

http://www.postgresql.org/docs/9.1/static/multibyte.html

Conclusion: if you want to display code point in Supplementary Multilingual Plane, use Postgresql or Sqlite

jurgenhaas’s picture

Wanted to cleanup the patch documentation (see comment #35) and realized that the structure on D8 has changed since the patch so I wonder if this still applies to D8?

YesCT’s picture

Status: Needs work » Needs review
Issue tags: -MySQL, -charset, -collation, -Needs backport to D7

Status: Needs review » Needs work
Issue tags: +MySQL, +charset, +collation, +Needs backport to D7

The last submitted patch, mysql_utf8mb4_support-1314214-34.patch, failed testing.

star-szr’s picture

Issue tags: -MySQL, -charset, -collation +Needs reroll

Updating tags.

simolokid’s picture

FileSize
5.39 KB

With the help of xjm i rebased it.

ZenDoodles’s picture

Status: Needs work » Needs review
Crell’s picture

Status: Needs review » Needs work
+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.php
@@ -30,6 +30,15 @@ class Connection extends DatabaseConnection {
+  /**
+   * Flag to indicate our charset. It can either be utf8 or utf8mb4 (adds
+   * support for 4-byte UTF8 characters). See settings.default.php to learn more
+   * about utf8mb4. The charset() method can be used to read this property.
+   *
+   * @var string
+   */
+  protected $charset = 'utf8';

See earlier review about how this docblock belongs on the method; this should just be a one line description.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.php
@@ -61,14 +70,16 @@ class Connection extends DatabaseConnection {
+    $this->charset = (isset($connection_options['charset']) ? ($connection_options['charset'] == 'utf8mb4' ? 'utf8mb4' : 'utf8') : 'utf8');

A double-ternary is way too hard to read. :-) Don't do that. Go ahead and spell this out long-form.

Actually, is there a reason to not just set a default for $connection_options['charset'], and then always use that?

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.php
@@ -205,6 +216,15 @@ class Connection extends DatabaseConnection {
+
+  /**
+   * This is used to fetch readonly charset property.
+   *
+   * @return string
+   */
+  public function charset() {
+    return $this->charset;
+  }

See above. The property is irrelevant. The method should contain the real documentation.

0 days to next Drupal core point release.

star-szr’s picture

Issue tags: -Needs reroll

Thanks @simolokid! Untagging.

kbasarab’s picture

Status: Needs work » Needs review
FileSize
1.43 KB
5.24 KB
+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.php
@@ -205,6 +216,15 @@ class Connection extends DatabaseConnection {
+
+  /**
+   * This is used to fetch readonly charset property.
+   *
+   * @return string
+   */
+  public function charset() {
+    return $this->charset;
+  }

See above. The property is irrelevant. The method should contain the real documentation.
<?blockquote>
I don't understand this portion here but I made the other corrections.

Also the patch didn't apply when I tried so I rerolled to HEAD as well.

Crell’s picture

Status: Needs review » Needs work

kbasarab: See the first quoted block from comment #45. Some version of that should be moved to the charset() method, which then replaces the "used to fetch readonly property" line (which is meaningless).

The property is not a flag, because it's not a boolean. It should likely be something like "The character set of this connection.\n\nEither utf8 or utf8mb".

The method then has the discussion of seeing default.settings.php and further explanation and so forth.

ergophobe’s picture

Status: Needs work » Needs review
FileSize
1.84 KB
5.41 KB

New patch that has no code changes - this is just my best attempt to fold in Crell's last documentation requests.

The comment in lines 82 and 259 are a bit repetitive (both point to comments in default.settings.php). Not sure if that's okay.

YesCT’s picture

Status: Needs review » Needs work
+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.phpundefined
@@ -252,8 +252,12 @@ protected function popCommittableTransactions() {
+   * Fetch the current character set for this connection.
+   * ¶

For most functions (see exceptions below), summary lines must start with a third person singular present tense verb, and they must explain what the function does. Example: "Calculates the maximum weight for a list."
http://drupal.org/node/1354#functions

so this should be:
Fetches ...

Also there are a few lines with an extra space at the end: trailing whitespace.

patch coming soon.

YesCT’s picture

Status: Needs work » Needs review
FileSize
1.09 KB
5.41 KB

(The white space was easy to spot using the dreditor review button http://drupal.org/project/dreditor )

chx’s picture

Ouch. Please, please, please *no*. Supporting utf8mb4 is great but supporting any non-utf8 charset is a massive pita that I would say we must avoid at all costs. At least search will break with non-utf8 but I can imagine other things too.

Edit: to make sure this is clear: utf8 and utf8mb4 are both fine, but koi8r and all that ancient ....charsets are not.

chx’s picture

Solution: for the default connection throw an exception if the charset is set and does not start with utf8.

YesCT’s picture

Issue summary: View changes

added summary of additional scope this issue needs in order to avoid a follow-up critical.

YesCT’s picture

Title: UTF-8: 4 bytes characters bug using MySQL backend » UTF-8: fix data loss in 4 bytes characters bug using MySQL backend and limit to charsets that start with utf8
Status: Needs review » Needs work

needs work to avoid critical follow-up issue.
updated issue summary.
updated issue title.

ergophobe’s picture

I wrote the documentation to reflect the code, which is currently very permissive. Upon reflection, it would even accept typos would it not?

If it's the case that it absolutely must be utf8 or utf8mb4 then it should test specifically for those, not just strings that begin with 'utf8'. That way it would catch the case where someone accidentally types 'utf8mb' (e.g. as Crell did in #48).

So it seems that it needs to
- test specifically for utf8 or utf8mb4.
- if it's not one of those, throw an exception and default to utf8

chx’s picture

I. am fine with that -- once again, for default. The reason I emphasize default cos the any-charset-goes is an *excellent* opportunity for (careful) migrating from legacy databases.

chx’s picture

Issue summary: View changes

Updated issue summary, point out need to avoid critical follow-up

damienwhaley’s picture

I've updated the issue summary to make it easy to figure what's left to do.

damienwhaley’s picture

I've started work on checking of the charset setting or MySQL and it's pretty much right I hope. The only concern I had is about throwing the exception. I've left the code commented out where I think the throw could happen, but if you let it throw there, then you end up with a white screen as the object is not returned and the error handler does not like null objects.

Perhaps the throw should be further up the chain (in the cache fetch)? Also the exception I've chosen is probably not quite right.

My thoughts after playing with this is that we should probably log it via watchdog as the connection should always work.

This probably needs a bit more thought. As I'm not 100% familiar with the new config stuff and I'm worried I have not respected @chx desire to only override the default setting.

I'd like some feedback, please.

damienwhaley’s picture

Issue summary: View changes

Updated the issue summary

YesCT’s picture

+++ b/core/lib/Drupal/Core/Database/Connection.phpundefined
@@ -135,20 +135,6 @@
-  /**
-   * The character set of this connection.
-   *
-   * @var string
-   */

I think that the interdiff was made a bit ... well, I dont think these were really removed in the patch in comment 81. wait, there is no 81! :) Probably a typo.
if a branch with the patch from 51 was called utf-51, and there was a branch with the patch in 58 called utf-58

then the interdiff could be made by: git diff utf-51 utf-58 > interdiff-51-58.txt (maybe the interdiff above reversed the arguments)

YesCT’s picture

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.phpundefined
@@ -69,14 +68,29 @@ public function __construct(array $connection_options = array()) {
+      if (strcmp($connection_options['charset'], 'utf8') == 0
+        || strcmp($connection_options['charset'], 'utf8mb4') ==0) {

I think any charset starting with utf8 would be ok. That would be more general than just adding two specific allowed ones.

chx’s picture

FileSize
5.79 KB

Thanks everyone for moving this forward.

> I think any charset starting with utf8 would be ok.

See #55 for the reasoning -- checking specifically for utf8 / utf8mb4 is a good idea to avoid typos.

Re the new patch ( I attached a real interdiff for easy review): we do not need a new allowedCharset property and method as a disallowed charset should terminate immediately. strcmp() is needlessly complicated, it can be a simply ==. If the error handling is not yet set up, then just print and die, this should be superb rare, you can't hit that error message without misediting settings.php no need to be nice. However, moving the charset method up to the base Connection class is good.

As for default connection, just add self::$databaseInfo[$key][$target]['key'] = $key into Database::openConnection before calling the constructor and then the constructor can easily test for the default key.

ergophobe’s picture

I'll try to make a patch tomorrow morning unless someone beats me to it.

The final patch needs/should/could...

  1. Add the $key to the connectionOptions array. I have already done this, just need to merge it into the new patch.
  2. Limit to utf8 or utf8mb4 for the default connection and be more permissive for all other connections. So for the default connection only it needs to test for the charset, and otherwise it's up to the user to get it right (I think this is reasonable givne that anyone who is defining multiple connections is or should be an advanced user and, as chx said, it offers possibilities for connecting to legacy databases that may be non-Drupal DBs. Useful for migration, for pulling data from some other dataset, etc.
  3. Move the mysql-specific comments out of the base Connection class
  4. RE "print and die" solution - I got hung up the other day on exception handling and if that's not truly necessary, then that simplifies things.

One comment on item #2 - there are only two charsets that begin with 'utf8' currently available in MySQL. There are many collations and there, if we're going to test for collations, it makes sense to take anything that begins with 'utf8', but I don't think it does with charsets. Consider that utf8mb4 requires special considerations. Let's say MySQL introduces a third utf8 charset - we can't know at this point what special considerations will be required to handle it correctly, so even if it's a valid charset, it should probably be disallowed until the implications are understood. Maybe that's wrong and it's up to the user to test it.

There is, however, an alias for the classic utf8 charset which can now be referred to as utf8mb3. So even though there are only two charsets with uft8 in the name, there are three valid handles for those two charsets. So perhaps utf8mb3 should be allowed too.

In a future version of MySQL, it is possible that utf8 will become the 4-byte utf8, and that users who want to indicate 3-byte utf8 will have to say utf8mb3. To avoid some future problems which might occur with replication when master and slave servers have different MySQL versions, it is possible as of MySQL 5.5.3 for users to specify utf8mb3 in CHARACTER SET clauses, and utf8mb3_collation_substring in COLLATE clauses, where collation_substring is bin, czech_ci, danish_ci, esperanto_ci, estonian_ci, and so forth

src: http://docs.oracle.com/cd/E17952_01/refman-5.5-en/charset-unicode.html

chx’s picture

Sure, let's go with that too. And yes, if exception handling is broken then print and die -- it's just a failsafe and not an error anyone will ever see.

ergophobe’s picture

One more try. Since this is both my first core patch and I'm just getting started with D8, I hope this is moving things in the right direction. But I'm adding a lot of notes here to hopefully make it easier for people to decipher.

The main principles are the same
- restrictive for default connection
- permissive for all other connections
- added a $connection_options array element 'index' which is the first index of the $databases array. Sometimes this is called 'key' so perhaps that would be a better nomenclature, but I was just following advice from chx on IRC.
- probably other stuff

Exception Handling
The big thing still problematic is agreeing on an exception/error handling strategy. I understand why print and die might make the most sense, but I expect there are a lot of cases where just defaulting to UTF8 and warning the user would save some headaches.

Obviously, the best thing would be a try/throw/catch, but I could not get that to work. If I try to throw an exception (even just using \Exception, I get a fatal error:

( ! ) Fatal error: Call to a member function get() on a non-object in path/to/core/includes/theme.maintenance.inc on line 59

This is true even if I use the base \Exception handler or if I try to use one of the custom ones. There's too much I don't know about exception handling in D8!

That said, for now I have created a drupal_set_message() warning with the message

'The default connnection charset must be "utf8", "utf8mb3" or "utf8mb4". Currently defined as: ' . $connection_options['charset'] . '. Reverting to "utf8". Please check your charset definition in settings.php'

I know that's not the way this should be, but hopefully it moves this patch forward.

Relation to Patch 58

Sorry - I worked off my previous patch mostly and that's what's reflected in the interdiff. Unless I've been misunderstanding all along, I think damienwhaley misconstrued what chx was driving at with respect to the default connection. The point was not to allow a change only for the default connection, it was to be very restrictive on the changes allowed on the default connection and permissive on all other connections.

I did move the charset function and property to base connection class as in #58.

I removed the MySQL-specific comments and refer the user to documentation in default.settings.php. I think it's best for the general documentation for this problem to reside in one place anyway and not be repeated all over. That way if it changes (for example the reference URL) it only needs to be updated once.

I also did not include the allowedCharset() method and deleted the commented code and comments regarding exceptions in /core/lib/Drupal/Core/Database/Database.php.

Other Notes

In my patch, the code in /mysql/Connection.php lines 78-83 is unnecessarily repetitive, but given the long discussion here ( http://drupal.org/node/935284 ) I gather that especially for core it is preferred to keep lines short at the expense of inefficient code (and it does make it more readable, but it feels "wrong" to have to short conditionals that execute the same statement).

I have also added perhaps too much documentation in default.settings.php. I'm not sure how much is too much. Comments welcome.

YesCT’s picture

Status: Needs work » Needs review

changing to needs review to let the testbot try it.

ergophobe’s picture

Status: Needs review » Needs work

Oops, I forgot to change that. But I was looking over the code and realize it still needs work. As is, the patch has this code in mysql/Install/Tasks.php:

    // If we are using utf8mb4 charset, make sure the database supports it.
    if (isset($database['charset']) && $database['charset'] == 'utf8mb4') {
      if (!db_query("SHOW CHARACTER SET WHERE Charset = 'utf8mb4'")->rowCount()) {
        $errors['mysql_charset'] = st('Your database does not support the utf8mb4 character set');
      }
    }

This should be generalized, because for non-default connections we're now allowing any charset.

It also raises another question: should the charset only be validated on install? If so, then the other validation code should be moved here. If not, this check should become part of the code that is validating the connection. Also, though it probably won't stay, the drupal_set_message() call does not wrap a t().

ergophobe’s picture

Status: Needs review » Needs work

I just realized this can't be used on any column over 191 characters that will be an index because InnoDB only allows 767 bytes on an index. See phayes comments above in #10.

ergophobe’s picture

I think there's still a fair bit of work to be done here. Most importantly, any patch needs to be tested to make sure that it actually works with an interactive install and a custom charset/collation in settings.php

I'm not sure any of the patches submitted so far did, because the default collation was getting set to the custom collation and that doesn't work because you get a mismatch on VARCHAR columns. If you change it to use utf8mb4 on VARCHAR cols, then you get lots of crashes because several indexes exceed the 191 character limit (see comment above).

Also there remains the question of how to check that a given charset is in fact supported. This can be done on install, but if someone changes the settings.php file post-install, there's no test in here yet.

So this is probably the best effort I'm going to make without some guidance or someone else to step in and straighten things out. I hope I've made more fixes than problems.

ergophobe’s picture

Status: Needs work » Needs review
FileSize
10.54 KB
9.62 KB

Remove whitespace. Queued for review so it gets tested, but really it's "needs work" - see comments in previous post.

YesCT’s picture

Caution, this is standards stuff, forgive me for distracting here. No need to worry about getting code perfect before posting. Really. I just started on some white space stuff and got carried away.

Patch attached.

+++ b/core/lib/Drupal/Core/Database/Connection.phpundefined
@@ -135,6 +135,14 @@
+  protected $charset = 'utf8';
+
+
   function __construct($dsn, $username, $password, $driver_options = array()) {

@@ -1172,4 +1180,17 @@ public function commit() {
+  }
 }
+
+

+++ b/sites/default/default.settings.phpundefined
@@ -165,6 +165,40 @@
+ * 'utf8mb4' character set on all text columns. More information on 'utf8mb4'
+ *  can be found here:
+ * http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

extra whitespace

+++ b/core/lib/Drupal/Core/Database/Connection.phpundefined
@@ -1172,4 +1180,17 @@ public function commit() {
+   * Fetch the current character set for this connection.

function one line summaries should begin with third person verbs, like: Fetches

+++ b/core/lib/Drupal/Core/Database/Connection.phpundefined
@@ -1172,4 +1180,17 @@ public function commit() {
+   * documentation in sites/default/default.settings.php for more information.
+   * @return string

@params have not newlines between them, but @return always has a newline before it. http://drupal.org/node/1354#order

And has a description on the second line.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.phpundefined
@@ -69,14 +69,35 @@ public function __construct(array $connection_options = array()) {
+// Default to 'utf8', but allow user to change this to
+    //  - any user-defined charset if this is not the default connection
+    //  - any allowed charset if this is the default connection
+    // See sites/default/default.settings.php for full documentation

missing spaces in the Default line, to align it.

missing periods at end of sentences.

I'm not sure of the format for lists inside inline comments.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.phpundefined
@@ -69,14 +69,35 @@ public function __construct(array $connection_options = array()) {
+    $default_connection_charsets = array('utf8', 'utf8mb4', 'utf8mb3'); //utf8mb3 is an alias for utf8

it's in core in some places, but http://drupal.org/node/1354#inline says to put in-line comments on its own line.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.phpundefined
@@ -69,14 +69,35 @@ public function __construct(array $connection_options = array()) {
+    if (isset($connection_options['charset']) && $connection_options['charset'] != 'utf8' ) {

expressions don't use space before the parens (). http://drupal.org/coding-standards#controlstruct

Also, I dont know the order of precedence between && and !=, so maybe an extra set of parens () would help make the meaning clearer. [I did not fix this one.]

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.phpundefined
@@ -69,14 +69,35 @@ public function __construct(array $connection_options = array()) {
+        drupal_set_message(t('The default connnection charset must be "utf8", "utf8mb3" or "utf8mb4". Currently defined as: ' . $connection_options['charset'] . '. Reverting to "utf8". Please check your charset definition in settings.php'), 'warning');

use arguments to t() to make it easier on translators.

http://api.drupal.org/api/drupal/core%21includes%21bootstrap.inc/functio...

for example: $text = t("@name's blog", array('@name' => user_format_name($account)));

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Install/Tasks.phpundefined
@@ -82,4 +82,21 @@ protected function connect() {
+        $errors['mysql_charset'] = st('Your database does not support the '.$database['charset'].' character set');

I'm not sure, but I think st() might be the same as t() in that should use args for variables instead of string cat. http://api.drupal.org/api/drupal/core%21includes%21install.inc/function/...

+++ b/sites/default/default.settings.phpundefined
@@ -165,6 +165,40 @@
+ * Finally, it is possible to use 'utf8mb3' which is currently simply an alias of

more than 80 chars.

+++ b/sites/default/default.settings.phpundefined
@@ -165,6 +165,40 @@
+ * 'utf8', but MySQL reserves the right at a future date to make 'utf8' default
+ * to a 4-byte character set at which point 'utf8mb3' would specifically
+ * indicate the legacy 3-byte version.

broke up this sentence into two, and added comma to separate phrase to add clarity.

+++ b/sites/default/default.settings.phpundefined
@@ -165,6 +165,40 @@
+ * utf8mb3. Other charsets are allowed on other connections at the users risk

missing period.

ergophobe’s picture

Sorry about that and thanks so much for doing all that cleanup. I'll try to be more careful in the future.

>>I dont know the order of precedence between && and !=

the != takes precedence, so this works as expected, but for clarity, perhaps the extra parens would help.

YesCT’s picture

Next:

need a review from @Crell or someone to make sure previous concerns have been addressed in the right way, and/or maybe from @chx on the more recent bit added.

if the approach is ok, then a little cleaning up of the comments will be needed, there is some debug or other notes in there.

ergophobe’s picture

My feeling is that it needs more than cleaning up comments. The comments are long and have debug type stuff in there because I'm not sure of several things.

I stopped where I did because I had a headache and had reached the point where I felt like I was making things worse, not better. I think it needs a fair bit of massaging, but I can't really get back to it until next week.

It also needs either a very robust test or it needs to be tested by installing from scratch in a variety of scenarios (or both) - a small change here or there in this patch can easily cause a crash on install. If you get a column mismatch of any sort, MySQL will die.

Crell’s picture

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.php
@@ -69,14 +69,36 @@ public function __construct(array $connection_options = array()) {
+        drupal_set_message(t('The default connnection charset must be "utf8", "utf8mb3" or "utf8mb4". Currently defined as: @name. Reverting to "utf8". Please check your charset definition in settings.php', array('@name' => $connection_options['charset'])), 'warning');

drupal_set_message() inside the Connection class is a no-go. Just throw an exception and let the global error handler deal with it.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
@@ -156,6 +162,15 @@ protected function createFieldSql($name, $spec) {
+    // If it's a text field, check to see if we should use utf8mb4 (4-byte UTF8)
+    // as the character set.
+    // InnoDB indexes have a max of 767 bytes. This means we can't use 4-byte
+    // charsets on VARCHAR because there are VARCHAR-based indexes of 255 chars.
+    if (in_array($spec['mysql_type'], array('TINYTEXT', 'MEDIUMTEXT', 'LONGTEXT', 'TEXT')) &&  Database::getConnection()->charset() == 'utf8mb4') {
+          //    isset($info['charset']) && $info['charset'] != 'utf8') {
+      $sql .= ' CHARACTER SET '.$info['charset'].' COLLATE ' . $info['collation'];

I don't fully understand the implications of this comment. Does that mean utf8mb4 + InnoDB + Varchar == kaboom? That would be... very very bad. I hope I'm misunderstanding that.

Have I mentioned how much I hate SQL databases?

Crell’s picture

Note we may want to change the SET NAMES call to something that the driver supports natively:

http://www.php.net/manual/en/ref.pdo-mysql.connection.php

(Yes, it's deeply buried in the docs. I didn't even know about it until today.)

ergophobe’s picture

Status: Needs work » Needs review

Does that mean utf8mb4 + InnoDB + Varchar == kaboom?

If the column has more than 191 characters, yes, it goes kaboom. See phayes comments in #10.

That would be... very very bad.

Yes, it would be. Maybe it's just because I don't have the chops for this, but the more I look into this, the more utf8mb4 support seems like a morass.

chx’s picture

> because there are VARCHAR-based indexes of 255 chars.

Then we should ban that and establish sensible indexes. There's no way we need more than, say, 32 chars to index and every db and Drupal totally supports index lengths. Might be a followup.

ergophobe’s picture

I was not reading my own comments - I was thinking the column length was the problem, but yes, it's just the index prefix length at issue and certainly 191 characters should be enough for that.

chx’s picture

Re #75 it was not buried rather it didn't exist: Prior to PHP 5.3.6, this element was silently ignored. Note that Drupal at this moment is 5.3.5 so using charset in the DSN is not yet possible. Plans are to go 5.3.10 in March.

ergophobe’s picture

Roughly speaking

  • 135-140 varchar cols that are indexed (see * below for how I arrived at this)
  • 82 with lengths of 128 or more (so failure in a two-col index with any col over 63)
  • 41 with lengths of 191 or more (so failure all by themselves)
  • 0 currently specify prefix lengths

That's the rough scope of it, so it's not hard, though it does end up being a patch that affects a very large number of files. It will also require contrib authors to update their modules. I can try to come up with another patch, but before I go changing all those files, I just want confirmation that approach is sensible.

I don't find any indexes that are specified with a length currently, but as chx says, it's built into Drupal and is a simple matter of changing, for example, the key_value table

'primary key' => array('collection', 'name')

to

'primary key' => array(array('collection', 60), array('name', 128))

and choosing sensible values. In a default install of D8, the largest value for key_value.collection is 13 chars ("system.schema") so it seems to make sense there, for example, to leave name at 128 and then use what's left. There may be other cases where both columns need truncating. I haven't looked that carefully yet and wanted a reaction from others following the thread before going down that road.

PS I don't think just exempting the VARCHAR cols from utf8mb4 support as phayes suggested in #10 really makes sense. Some of these columns are things like users.signature and it seems like supporting a character set for user names is rather important.
---------

*I arrived at this very roughly by writing a small script that does a SHOW TABLES and then iterates through the tables and grabs all columns that are of type VARCHAR and which have a Key (as per the SHOW COLUMNS) output (and in the latter two cases, that have a length above the given size. I also just grepped the codebase using ^[^\*\n\r]+'(primary key|indexes|unique keys)' =>.*?[;,] and got a second count that way.

Damien Tournoud’s picture

Do we have an actual use case for this? I could not see one by quickly skimming through the comments.

Until we have, it seems like the drawbacks clearly outweight the benefits here.

Damien Tournoud’s picture

Also, a primary key cannot use length prefixing. (Or at least I hope that MySQL doesn't allow that, because it doesn't make any sense.)

It's probably worth tidying up our schema (for example: machine name-type keys should probably use a ascii character set so as to reduce the size of the index), but that belongs in another issue.

ergophobe’s picture

a primary key cannot use length prefixing

Actually, it can: http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
I tested it and had no problem altering a table to use a 100 char PK based on a 255 char VARCHAR column.

And this is all related to #1852896: Throw an exception if a schema defines a key that would be over 1000 bytes in MySQL - it seems one way or another some sort of length checking on MySQL keys might be needed.

Do we have an actual use case for this?

I'm not sure what the OPs use case was nor that I understand the internationalization issues well enough, but I believe the use case goes something like this:

User wants to use a characters not included in the Basic Multilingual Plane but included in the Supplementary Linguistic Plane. This would include some CJK Unified Ideographs, obscure languages (Egyptian Hieroglyphics), and some modern character sets (emoji, emoticons).

- http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Supplementary_Multiling...
- http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
- http://mathiasbynens.be/notes/mysql-utf8mb4

it seems like the drawbacks clearly outweight the benefits here.

That's sort of what I was driving at in #76 - this seems like a lot of trouble, but as it stands now (as I understand it), Drupal is not really compatible with 4-byte character sets.

It looks like over in the Wordpress community they're fighting with the same thing.

phayes’s picture

My use case of this is that we host a lot of mathematical scientific journals, there are a lot of high order characters that are used in this content for properly displaying math. I've been running the patch in #12 for about a year now in production with no problems.

Edit: Is there a reason why we don't just say we only support utf8mb4 on text fields (not varchar) and skip the whole key-length problem. It's not a 100% ideal solution, but it's better than providing no support for high order characters.

Crell’s picture

phayes: Mixed character set databases are a nightmare to manage and lead to all sorts of creative bugs unless you really know what you're doing. You may really know what you're doing, but I wager most people using MySQL don't know what a character set is, much less what happens when you try to mix them. :-)

Damien Tournoud’s picture

Title: UTF-8: fix data loss in 4 bytes characters bug using MySQL backend and limit to charsets that start with utf8 » UTF-8: support 4 bytes characters bug in MySQL
Category: bug » feature
Priority: Major » Normal
Status: Needs review » Needs work

I tested it and had no problem altering a table to use a 100 char PK based on a 255 char VARCHAR column.

I don't know what it does exactly, but it's probably not desirable.

I'm going to reclassify this as a feature request. There is no data loss related to this since Drupal 7, so it is not a bug per say.

In addition, I don't think we can support this until we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have poped up all over the place in Drupal 8).

Damien Tournoud’s picture

Title: UTF-8: support 4 bytes characters bug in MySQL » UTF-8: support 4 bytes characters in MySQL
sun’s picture

ergophobe’s picture

Before I answer this I want to reemphasize - much of this is over my head in terms of both Drupal 8 internals and in terms of character sets (I only actually use French and English, so these issues don't affect me, I just ended up taking this on as somewhere to try to help with D8).
That said...

Edit: Is there a reason why we don't just say we only support utf8mb4 on text fields (not varchar)

Not using any Asian languages myself, I don't know how common the extensions beyond the Basic Multilingual Plane are. Mostly what you see in the SMP are special uses (Math, emoticons, alchemical symbols), dead languages (Lycian, Gothic, Cuneiform) and the unified Han characters.

But it seems to me that if you're supporting these characters in TEXT, they should be supported in VARCHAR for the cases where

  • someone wants to use such a character in a node title (could there be a situation where a mathematical symbol would appear in the title too?)
  • someone wants to use one of the Asian characters in a username (e.g. it's part of their real name)

Wouldn't it be confusing to a user that say that if you enter text into the body of a post, it's all good, but if you copy and paste it into the title of that post, it will truncate your data as soon as it encounters the first 4-byte character?

BTW - support for these character sets is shaky at best in the world at large. The emoji, for example, are supported in Win8 fully in IE, Firefox and Opera, but not in Chrome or Safari. Support on the Mac seems to be a little less. My Ubuntu is busy upgrading, so I can't tell.

Also, I came across one article using a 4-byte character in the title - it rendered fine in the Google search results, but the link was not clickable.

Pancho’s picture

Category: feature » task

#75, #79:
#1800122: Bump minimum version of php required to 5.3.10 has been committed, so we can use DSN now, instead of 'SET NAMES'.

#80:
From what I got the 767 Bytes (= 191 chars) InnoDB restriction isn't about index length but about the length of a (single) prefix.
Otherwise, key length may be up to 1000 B (= 250 chars) as restricted by MyISAM, see #1852896: Throw an exception if a schema defines a key that would be over 1000 bytes in MySQL.
We might need to test this again, unless someone can clarify.

#81:
Another use case are LTR tags allowing English text be correctly embedded in an RTL environment. We're having bugs that can't be correctly solved without this, see #1165476: if t() string has no translation or fallback language, text should have lang attribute.

#86:
Agree that we should avoid length prefixing on PKs that would be opening up one more can of worms. The more we should go and shorten keys.
Plus: While It's certainly true that there is no data loss, which is the criterion for a critical bug though, not generally for a bug. Still I'm recategorizing only as task, because it's not like everybody would expect us to support 4 byte chars. But "feature" this isn't either.

mgifford’s picture

Status: Needs work » Needs review

#70: database-1314214-70.patch queued for re-testing.

Damien Tournoud’s picture

@Pancho: LTR and RTL are both 3-byte UTF-8 characters. The linked issue describes the use of language tags from U+E0000 to U+E007F, which are totally deprecated anyway.

ergophobe’s picture

@Pancho

1. key length versus prefix length.

My mistake - you are correct

  • By default, an index key for a single-column index can be up to 767 bytes. The same length limit applies to any index key prefix.
  • The InnoDB internal maximum key length is 3500 bytes, but MySQL itself restricts this to 3072 bytes. This limit applies to the length of the combined index key in a multi-column index.

-- http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html

Hanno’s picture

Great work here. This patch could fix this issue.

It might also need a follow up. I think we should reconsider this as a bug as Drupal isn't fully utf-8 compliant.

This issue is not only theory, but also relevant for
- emoji, these emoticons are supported by Google and Apple on mobile phones and use the 4 bytes utf-8. Some applications even use emoji as part of the username
http://blog.manbolo.com/2011/12/12/supporting-ios-5-new-emoji-encoding
- mathematics used in scientific journals: http://drupal.stackexchange.com/questions/50868/configuring-drupal-to-us...

Some questions
- could we require MYSQL >5.5.28 for Drupal 8?
- There is no workaround if people can't or don't use utf8mb4:
- The actual behavior with normal utf8 is for user content is that your text is lost and no user feedback is given. I think we should reconsider this as a bug and we need to give user feedback.
- when importing data from external sources the import fails. Instead we could skip, escape, or replace these characters when sending to the database. This needs further research. See for example this bug when importing Tweet data: #1824506: When importing Tweets: SQLSTATE[HY000]: General error: 1366 Incorrect string value
- we could clarify in the system report which characterset is used, and if utf8 with mysql is used place a warning that it isnt fully compatible.
- could we test if this bug also exists with other databasesystems like PostgreSQL, SQLite and Oracle?

Damien Tournoud’s picture

As I said in #86:

In addition, I don't think we can support this until we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have poped up all over the place in Drupal 8).

Increasing the size of all our indexes by 33% is not something we can afford. Let's open a separate issue for the tidying up and postpone this one on it.

- The actual behavior with normal utf8 is for user content is that your text is lost and no user feedback is given. I think we should reconsider this as a bug and we need to give user feedback.

This should not happen. If it does, it's a bug elsewhere. The database layer triggers an exception, that's all that it does. Catching it and processing it belongs in the upper layers. If they don't do that correctly, it's a bug there, not in the database layer.

Hanno’s picture

This should not happen. If it does, it's a bug elsewhere.

Just checked with a fresh install and pasting a unicode 4 bytes character in a post gives a crash. Will post the bug in a seperate issue.
EDIT: #2002100: pasting text with 4 byte UTF-8 characters leads to broken screen

Hanno’s picture

Similar ticket and discussion in Wordpress:
http://core.trac.wordpress.org/ticket/21212 MySQL tables should use utf8mb4 character set
Is the by Pento mentioned solution in that ticket to use the option innodb_large_prefix something to consider?

The 767 byte limit won't be increased any time soon, due to ​http://bugs.mysql.com/bug.php?id=32915

It turns out this is possible, thanks to the innodb_large_prefix option, introduced in 5.5.14:
​http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix

So, the requirements are:

MyISAM tables:
MySQL >= 5.5.3 (Assuming we don't add any indexes larger than 250 characters.)

InnoDB tables:
MySQL >= 5.5.14
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
All tables with ROW_FORMAT=(DYNAMIC|COMPRESSED)

Any other table formats:
*head explode*

We can certainly detect these settings, it's just a question of whether that kind of complex (and edge-case-y) test should be in core.

pfrenssen’s picture

Component: mysql database » database system
FileSize
9.68 KB

Rerolled, wanted to try this out. These emoji characters are not going away any time soon it seems.

pfrenssen’s picture

Component: database system » mysql db driver
pfrenssen’s picture

This doesn't work any more after the database connections have become static in #1953800: Make the database connection serializable.

mgifford’s picture

So is there another way round this?

mgifford’s picture

Issue summary: View changes

Updated issue summary, took out duplicate and integrated the added problem motivation

morgantocker’s picture

For prior art: the way this problem was handled in mediawiki was to store strings in VARBINARY and let the application handle character-sets. This was a pre MySQL 5.5 decision, and it's important to point out that it means no collation which is the bigger deal.

With collation:
'Montréal' == 'MONTREAL'
+ a consistent sorting order is provided with multi-byte characters.

It may be an option to 'degrade' to VARBINARY for versions older than MySQL 5.5. I'll leave this one for discussion.

RE: Index size limited to 191 characters discussion
There is actually a workaround to this problem via prefix indexes. i.e.
ALTER TABLE my_table ADD INDEX text_col_index (text_col(190));

This isn't suitable for PRIMARY or UNIQUE indexes since it breaks a constraint, but it should work fine for others.

marco’s picture

I've updated the patch for Drupal 7 from #30:
- to Drupal 7.28
- applying all the remarks, but limited to utf8 and utf8mb4 (no custom encodings)

To apply:
1) backup the database
2) apply the patch
3) add the settings to your settings.php file
4) run this to upgrade the text fields:

  $schema = drupal_get_schema();
  foreach ($schema as $table_name => $table) {
    $repair_needed = FALSE;
    foreach ($table['fields'] as $field_name => $field) {
      if ($field['type'] == 'text') {
        db_change_field($table_name, $field_name, $field_name, $field);
        $repair_needed = TRUE;
      }
    }
    // According to http://mathiasbynens.be/notes/mysql-utf8mb4 when upgrading
    // it's also necessary to repair and optimize the tables.
    if ($repair_needed) {
      db_query("REPAIR TABLE $table_name");
      db_query("OPTIMIZE TABLE $table_name");
    }
  }
cdeepan’s picture

Dear All,

Thanks for the patch for changing character set to utf8mb for columns with Text data type. I believe this would allow supplementary characters to be added to the node body. However I believe we will have issues if the user enters these supplementary characters in Menu Title, Node Title, Tags etc. Do we have any solution to take care of these fields as well?

Thanks & Regards,
Deepan Choudhary

chx’s picture

We do not have a solution. See the patch:

+    // InnoDB indexes have a max of 767 bytes. This means we can't use 4-byte
+    // charsets on VARCHAR because there are VARCHAR-based indexes of 255 chars.
Antti J. Salminen’s picture

What about the innodb_large_prefix and associated settings available in MySQL >5.5.14 mentioned in #97? Could this just be a manually enabled option and no autodetection or anything?

chx’s picture

That would be swell. MySQL 5.1 was EOL'd on December 31, 2013. Maria will EOL 5.1 on 1 Feb 2015. Are we ready to raise the requirements to 5.5?

Edit: don't forget you need the Barracuda format for that and innodb table per file. Perhaps we need to test creating a table with a long index? Put barracuda in the table options?

Crell’s picture

Given the number of other legacy versions we're dropping in Drupal 8 (PHP, IE, etc.) I would be OK with moving to MySQL 5.5 as the minimum if it bought us something. Would switching to MySQL 5.5 make this problem effectively go-away? That's what it sounds like but I want to confirm...

chx’s picture

Weeeeeell. If your MySQL 5.5 InnoDB is set to use innodb_file_per_table and uses the Barracuda innodb_file_format http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_inn... then your problem goes away. Fun.

This is the default file format in MySQL 5.6 though (and also the default between MariaDB/MySQL 5.5.0 and 5.5.6 but since 5.5.7 has reverted back to Antelope).

morgantocker’s picture

@Crell: Yes, it would allow you to use utf8mb4 safely, with a couple of other issues to solve. To summarize the now quite long comment thread:

In comment #10 phayes noted that using utf8mb4 creates a secondary issue - it restricts the maximum index size of InnoDB tables to 191 characters. In comment #97 Hanno noted that there is an option innodb_large_prefix to supported larger indexes, but we should not assume all 5.5 users will be able to turn this on (InnoDB does not enable it by default for file-format backward compatibility).

There is a workaround suggested by ergophobe in comments #78 and #80, which is to use prefix indexes (index on the first 191 characters). There are little downsides to doing this: most strings will have enough selectivity in the first 191 characters. It may prevent 'covering index' optimizations, but I am not sure how many of these drupal has.

Also as chx notes in comment #108, upstream has dropped support for MySQL 5.1 (released November 2008). The latest Ubuntu release support MySQL 5.5 (2010) & 5.6 (2013). Red Hat 7.0 is MariaDB 5.5. So I would +1 a suggestion to make 5.5 the minimum.

ergophobe’s picture

@margantocker - thanks for the concise summary of a super long discussion.

There is the question you mention of whether 191 chars is enough to provide specificity (it's certainly conceivable that, say, some ridiculously long numeric series could have all the significant information at the end).

But in addition to that, I can't figure out from the documents what happens if I have a 191-char prefix and I search for a 200-char string. Does it return if the first 191 match?
http://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html

Damien Tournoud’s picture

Please keep in mind #95. Long indexes are bad, they use more memory, storage and are less efficient. The limit in InnoDB is there for a reason.

What needs to happen is that we need to stop using Unicode columns (and indexes) where they are not necessary. This is a comprehensive change that cannot be workaround by simply bumping database version requirements.

morgantocker’s picture

@ergophobe: I have tested prefix-length on various string types on various data-sets before. If we take IMDB movie title names, the first 15 characters provides around 99% of the cardinality of an index on the full length. An edge-case where this is less-true is URLs, where uniqueness starts after the first 11 characters (http://www.). If the URLs all start with the same domain, this could be the first 50 characters... but I would say that's a less common case.

To answer your question:
In the event that MySQL can not filter rows via the index, it will filter at the row-level, so there is no risk of wrong results. It is simply a risk of following too many pointers from an index to a row, only to eliminate the need for it there (performance).

@Damien: I find mixing and matching character-sets to be a micro optimization. I have done it before, and no longer recommend it. InnoDB internally will use variable length storage for utf8 character-set, so from a storage perspective there is no added cost. In memory-temporary tables can not do variable length, and may take more memory or perhaps convert to disk more frequently.. but in the typical case I've found that drupal queries are well optimized here, and I would be surprised to see measurable difference.

The problem with shortening index prefixes, is that you need to fully examine the dataset to determine the prefix-length. And in many cases, Drupal users will have varying datasets. Long indexes are worst in the case of an index-scan (select avg(x), count(x) etc.), so more attention could be paid to optimization here. In other cases, InnoDB will usually only load the index pages into memory as required, so it largely becomes a storage problem. The worst situation I could see is a micro-optimization to shorten the prefix length, and some installations of Drupal can not use the index because it is not selective.

ergophobe’s picture

Thanks for the clarification - that's reassuring

mgifford’s picture

joelpittet’s picture

Contrib related issue, saving tweets with Emoji as mentioned in #98. Adding to related issues.

joelpittet’s picture

Adding another contrib related issue with emojis

morgantocker’s picture

A little bit more MySQL version context on innodb_large_prefix:
- It was created so that MySQL downgrades would be possible (i.e. 5.5 -> 5.1)
- This purpose is now obsolete, since 5.5+ support large prefix (and 5.1 is no longer officially supported).
- We are proposing[1] to enable innodb_large_prefix by default in MySQL 5.7, since it is now a safe change.

So I realize it's a while out for many people, but the story should be simpler with MySQL 5.7.

[1] I have the call out on my blog here:
http://www.tocker.ca/2015/01/05/what-defaults-would-you-like-to-see-chan...

Rajab Natshah’s picture

Hi

Now we do have a workaround module as a fix for this issue.

Strip 4-byte UTF8

https://www.drupal.org/project/strip_utf8mb4

We do have an interface for it too.
Strip 4-byte UTF8 back-end configuration page.

Rewarding time working on this fix :)

YesCT’s picture

stefan.r’s picture

Maybe relevant: Wordpress managed to fix this 2 months ago: https://core.trac.wordpress.org/ticket/21212#comment:15

Also, the next development release for MySQL 5.7 will be a Release Candidate and will likely have the folllowing changes, supporting large enough indexes on utf8mb4 out of the box:

Important Change; InnoDB: The following changes were made to InnoDB configuration parameter default values:

  • The innodb_file_format default value was changed to Barracuda. The previous default value was Antelope. This change allows tables to use Compressed or Dynamic row formats.
  • The innodb_large_prefix default value was changed to ON. The previous default was OFF. When innodb_file_format is set to Barracuda, innodb_large_prefix=ON allows index key prefixes longer than 767 bytes (up to 3072 bytes) for tables that use a Compressed or Dynamic row format.

https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-7.html

Before we can support this though, if I can quote Damien Tournoud:

It's probably worth tidying up our schema (for example: machine name-type keys should probably use a ascii character set so as to reduce the size of the index), but that belongs in another issue.

Long indexes are bad, they use more memory, storage and are less efficient. The limit in InnoDB is there for a reason.

What needs to happen is that we need to stop using Unicode columns (and indexes) where they are not necessary. This is a comprehensive change that cannot be workaround by simply bumping database version requirements.

In addition, I don't think we can support this until we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have poped up all over the place in Drupal 8).

Increasing the size of all our indexes by 33% is not something we can afford. Let's open a separate issue for the tidying up and postpone this one on it.

So anyone correct me if I'm wrong but I think this would have to be postponed on #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters.

stefan.r’s picture

Issue summary: View changes

Further updating issue summary

stefan.r’s picture

Issue summary: View changes
stefan.r’s picture

Issue summary: View changes

We now have a patch over at #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters so if anyone could chime in it might unblock this issue :)

stefan.r’s picture

Title: UTF-8: support 4 bytes characters in MySQL » MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols)
Category: Task » Bug report
Issue summary: View changes
Priority: Normal » Major

Considering how prevalent UTF-8 characters from the additional character planes are these days (emojis, mathematical characters etc) shouldn't this actually be considered a bug?

Now that we have a workable patch in the blocking issue, would something like this work for a patch for this issue:

- If using MySQL, detect in the installer whether MySQL supports utf8m4, innodb_large_prefix=on and engine=Barracuda. If so:
- SET NAMES = utf8mb4 in the connection object and use utf8mb4 instead of utf8 in the schema, as well as row format = DYNAMIC.
- If MySQL has other settings or an override was provided in settings.php, use utf8 instead of utf8mb4.

stefan.r’s picture

Issue tags: +drupaldevdays

So I've discussed this at the Dev Days and @Damien Tournoud and @pwolanin mentioned that after we get the ASCII issue in, this would actually be easily solved by raising the MySQL requirement to 5.5.3 and using 191 characters on the few remaining UTF-8 indexes that are left. We can then just use utf8mb4 by default in the MySQL driver, without worrying about InnoDB/row settings.

yannickoo’s picture

I have created a new issue to increase the minimum required version of MySQL.

yannickoo’s picture

Status: Postponed » Needs review
FileSize
6.12 KB

This patch replaces all occurrences of utf8 with utf8mb4 when this is related to database encoding.

morgantocker’s picture

Increasing the size of all our indexes by 33% is not something we can afford. Let's open a separate issue for the tidying up and postpone this one on it.

I've clarified in #1923406 where the increase in size will be (it's not a direct 33% increase). I will comment in #2473301 on raising the minimum version - it's quite a good solution imho.

stefan.r’s picture

#129 seems close to what we want... After we fix:

  1. #2473301: Raise MySQL requirement to 5.5.3
  2. #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters
  3. Reducing index size to 191 characters on remaining non-binary/ascii indexed fields, or remove the index if it wasn't actually necessary.
stefan.r’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

1. This still needs a test to verify we store/display utf8mb4 characters correctly.

2. As soon as #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters goes in it will also need to test for the standard utf8mb4 collation in SchemaTest.

3.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Connection.php
@@ -85,27 +85,27 @@ public static function open(array &$connection_options = array()) {
 
     // Force MySQL to use the UTF-8 character set. Also set the collation, if a
-    // certain one has been set; otherwise, MySQL defaults to 'utf8_general_ci'
+    // certain one has been set; otherwise, MySQL defaults to 'utf8mb4_general_ci'
     // for UTF-8.

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
@@ -101,22 +101,22 @@ protected function createTableSql($name, $table) {
     // By default, MySQL uses the default collation for new tables, which is

80 cols, s/UTF-8/utf8mb4/

hass’s picture

Issue summary: View changes

Added note about Drupal system requirements as todo

stefan.r’s picture

That note can actually be removed, after #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters goes in we can limit the remaining few UTF8 keys (if any) to 190 characters, that way we can also support installs without innodb_large_prefix.

stefan.r’s picture

Issue summary: View changes

Updated issue summary.

stefan.r’s picture

Issue summary: View changes
stefan.r’s picture

Issue summary: View changes
Issue tags: +D8 upgrade path
joelpittet’s picture

This option looks like a better option for D7. #2382707: UTF8MB4 for MySQL Maybe we can re-open it to get some support there too?

joelpittet’s picture

Reason I say that is because here we are setting the default to utf8mb4, where as there it's default stays utf8 but has the option to change it if a site wants to go through with it.

stefan.r’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests
FileSize
9.66 KB
6.8 KB

This patch adds a test, addresses comments in #134 and limits the following fields with unique constraints to 191 characters:

  • title/name of website providing a feed
  • description of block content field (widget is updated automatically)
  • URI of file field
  • username field (which is limited to 60 characters anyway)

If we have a problem with the URI field only allowing for 191 characters, we could also get rid of the database-level constraint and add it on the application level instead.

In core we don't use any non-ASCII primary keys anymore, nor do we need to explicitly define length for regular indexes in PHP, as MySQL will take care of that for us anyway; it will cut off regular indexes at 191 characters without innodb_large_prefixes and at 255 characters with innodb_large_prefixes.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
2.45 KB
12.68 KB

This should fix some failing tests as we had skipped the test-only tables in #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters

stefan.r’s picture

Status: Needs work » Needs review
FileSize
15.12 KB
4.56 KB

This should further decrease test failures by marking some more fields as ASCII.

stefan.r’s picture

stefan.r’s picture

Status: Needs work » Needs review
FileSize
552 bytes
4.88 KB

Missed another test field.

@joelpittet as to the D7 version of this patch, @catch had proposed in #2473301: Raise MySQL requirement to 5.5.3 that we allow people to use either utf8 or utf8mb4 in D7, if @David_Rothstein is OK with that.

stefan.r’s picture

FileSize
15.72 KB
stefan.r’s picture

Status: Needs work » Needs review
FileSize
5.33 KB

This removes the unique keys from feed title and block content info as reducing field size to 191 could break D6/D7 upgrades.

stefan.r’s picture

stefan.r’s picture

Looks like this test run failed because #1923406: Use ASCII character set on alphanumeric fields so we can index all 255 characters had been reverted.

Whenever that gets back in I'll also remove the length cutoff from the URI field on File entities. Instead we can make it an ASCII field and urlencode it, as both removing the unique constraint and cutting off /all/ URL's at 191 characters seem like even worse options.

After we do that I think the patch is close to where it needs to be, this will just need some feedback from @benjy regarding how this will fit in with migrate.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
3.75 KB
20.72 KB

This implements urlencoding on the storage layer for non-ASCII URLs, as it got too ugly trying to do this on the entity layer. DownloadTest has coverage for non-ASCII URLs.

This still looks a bit ugly to me. Maybe it would be better to do a query looking for file entities with the same URI in File::preSave() and we just lose the unique constraint?

stefan.r’s picture

Actually I think it may be better if we add a uri hash field and put a unique constraint on that field instead?

stefan.r’s picture

Status: Needs work » Needs review
FileSize
22.89 KB
7.17 KB

This adds a hash as well as a test for the hash and the uniqueness constraint on the hash. The hashing still feels like a better solution than the urlencoding as it doesn't change what is stored in the URI field, as well as allowing us to save URLs longer than 255 characters (see #193954: {file}_uri and {file}_filename length limitations)

I think the new hash field would need a beta to beta upgrade hook though.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
23 KB
fietserwin’s picture

If we go for URI encoding,it should be storage layer, the application layer should not have to be bothered with this kind of technical details. So #157 was an improvement.

However, as URIs can indeed become incredibly long and only differ in some query parameter values after position 191, I think that using a hash might even be better. I suppose that performance wise, computing a hash is not that expensive. So continue with #160.

+++ b/core/modules/migrate_drupal/src/Tests/Table/d6/System.php
@@ -6,7 +6,7 @@
  *
  * THIS IS A GENERATED FILE. DO NOT EDIT.
  *
- * @see cores/scripts/dump-database-d6.sh
+ * @see core/scripts/dump-database-d6.sh
  * @see https://www.drupal.org/sandbox/benjy/2405029
  */
 
@@ -26,7 +26,7 @@ public function load() {

@@ -26,7 +26,7 @@ public function load() {
       ),
       'fields' => array(
         'filename' => array(
-          'type' => 'varchar',
+          'type' => 'varchar_ascii',
           'not null' => TRUE,
           'length' => '255',
           'default' => '',

- Don't bother changing generated files.
- This seems like a D6 table, so it should not be changed anyway.
- The typo should be corrected in core\scripts\migrate-dump-d6.sh,but not as part of this issue.

stefan.r’s picture

@fietserwin, thanks. The D6 table has to be changed as well, as it is being created as a utf8mb4 table with a primary key on the filename field. If we don't change it, it only allows 191 characters. This is being changed both in the generated script (so tests can pass) as in the generating script, but indeed the typo won't be picked up :)

Not sure why this test run is marked as failed, in the test log it just keeps terminating in the middle of the node tests, but it doesn't say which specific test is breaking. I'll log a ticket in the testbot queue.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
21.55 KB

A few nitpicks in the mean time

Status: Needs work » Needs review
stefan.r’s picture

Apparently this may be a testbot problem as opposed to a problem in the patch: #2477583: Unexpected timeouts on testbot runs

I'll just keep re-queueing this patch then

stefan.r’s picture

Issue summary: View changes

Ah tests are green now, #166 is not that different to the previous patches so this may have been a testbot problem after all.

Updating issue summary.

stefan.r’s picture

Issue summary: View changes

Updating issue summary to clarify why we can lose the DB-level unique constraints on block description and feed title.

The uniqueness validation is just there because there's no way to distinguish between blocks/feeds in the UI other than by description, and we already have entity validators for this, nor is it a big deal if we still get two blocks with the same description in some edge case. All it'd be is an inconvenience, and workarounds exist (such as editing the description).

Just to add to this, the patch is MySQL only, but there are two tests which may affect PostgreSQL / SQLite: NodeViewTest and SaveTest. For now @amateescu has confirmed that SaveTest works with SQLite, the other 3 cases are yet to be confirmed.

amateescu’s picture

Just tested now and both tests are fine on SQLite and Postgres.

stefan.r’s picture

FileSize
513 bytes
22.23 KB

Thanks @amateescu! That also proves the PostgreSQL and SQLite drivers aren't affected by this issue and already support full UTF-8.

Having done another review I'm happy with this patch now. Tiny coding standards fix attached.

Just waiting for one of the database people to have a look at this and for @benjy to check this will play well with migrate.

stefan.r’s picture

FileSize
21.56 KB
610 bytes

Oops

benjy’s picture

This all looks fine to me from a migrate POV.

fietserwin’s picture

Issue summary: View changes
Status: Needs review » Needs work

Added "font icons" to the list of lacking support and in text references to prerequisite issues.

  1. +++ b/core/modules/aggregator/src/FeedStorageSchema.php
    @@ -33,7 +33,7 @@ protected function getSharedTableFieldSchema(FieldStorageDefinitionInterface $st
    -          $this->addSharedTableFieldUniqueKey($storage_definition, $schema);
    +          $this->addSharedTableFieldIndex($storage_definition, $schema);
               break;
    

    Pass TRUE for the 3rd param (not_null)? (unique fields are not null by default, but for other fields you have to pass that explicitly)

  2. +++ b/core/modules/file/src/FileStorageSchema.php
    @@ -30,6 +30,9 @@ protected function getSharedTableFieldSchema(FieldStorageDefinitionInterface $st
    +          $this->addSharedTableFieldIndex($storage_definition, $schema, TRUE);
    +          break;
    +        case 'uri_hash':
               $this->addSharedTableFieldUniqueKey($storage_definition, $schema, TRUE);
               break;
    

    addSharedTableFieldUniqueKey() does not accept a 3rd param (it is always not null).

  3. +++ b/core/modules/file/src/FileStorageSchema.php
    @@ -30,6 +30,9 @@ protected function getSharedTableFieldSchema(FieldStorageDefinitionInterface $st
    +        case 'uri_hash':
               $this->addSharedTableFieldUniqueKey($storage_definition, $schema, TRUE);
               break;
           }
    

    idem.

  4. +++ b/core/modules/file/src/Tests/SaveTest.php
    @@ -67,10 +70,25 @@ function testFileSave() {
    +      $this->assertTrue($exception_triggered, 'SQL uniqueness constraint is triggered');
    +    }
    

    warning: Variable 'exception_triggered' might not have been defined.

stefan.r’s picture

FileSize
21.75 KB
1.85 KB

Thanks for the review and the issue summary update @fietserwin.

Actually 2 and 3 may be referring to the same line of code? Or maybe to the code we are removing, as it is currently unnecessarily adding a 3rd parameter in HEAD as well:

- $this->addSharedTableFieldUniqueKey($storage_definition, $schema, TRUE);

As to 4, I have added a variable definition so we don't trigger a notice whenever the assertion fails.

stefan.r’s picture

Status: Needs work » Needs review
stefan.r’s picture

FileSize
21.75 KB
605 bytes

Had left an error in #178.

This is ready for further review now :)

fietserwin’s picture

- I would have assigned FALSE to $constraint_triggered, so that is a boolean and only a boolean, but I'm fine with this patch.
- 2 and 3 were indeed the same, my fault.

RTBC for me, but what about the 'D8 upgrade path' tag, does this patch need a hook_update?

stefan.r’s picture

Well assertTrue will never consider NULL a test pass, but indeed FALSE would have been more readable. If this goes back to "Needs work" at some point, I'll try to remember to fix.

As this adds a field on the File entity and changes a few unique keys, indeed this will need an update hook at some point. But as we don't yet support beta to beta upgrades, that's out of scope for this issue :)

andypost’s picture

Just 2 nits

  1. +++ b/core/modules/block_content/src/Entity/BlockContent.php
    @@ -22,7 +22,7 @@
    - *     "storage_schema" = "Drupal\block_content\BlockContentStorageSchema",
    + *     "storage_schema" = "Drupal\Core\Entity\Sql\SqlContentEntityStorageSchema",
    

    this is a default, just remove the line

  2. +++ b/core/modules/file/src/Entity/File.php
    @@ -194,7 +195,10 @@ public static function preCreate(EntityStorageInterface $storage, array &$values
    +    // Save the hash of the URI so we can enforce uniqueness.
    +    $this->get('uri_hash')->value = Crypt::hashBase64($uri);
    
    @@ -252,6 +256,15 @@ public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
    +      // As this is a base64-encoded SHA-256 hash, we can limit the length to
    +      // 100 characters.
    +      ->setSetting('max_length', 100)
    

    why 100? sessions.sid uses 128 but hash('sha256', $data) is 256 bits so char(64)

stefan.r’s picture

As to 2, it's a base64'd hash which adds another ~33% to to the 64 characters, so I just rounded up to 100.

strlen(hash('sha256', 'hello')) == 64;
strlen(base64_encode(hash('sha256', 'hello'))) == 88;
strlen(base64_encode(hash('sha256', 'hello', TRUE))) == 44;

But I just noticed hashBase64 uses raw binary data, so actually ~44 chars may have been enough in this case. Let's set it to 128 though just to be consistent with what we use for sessions.sid and as I think this discussion is out of scope for this patch.

Our max_lengths make little sense in general, usually we set it to 255 or 128 for no reason and we define our fixed-length columns as varchar, so maybe let's discuss that, along with limiting both columns to ~44 characters in a separate followup issue?

stefan.r’s picture

FileSize
21.57 KB
1.32 KB

stefan.r queued 186: 1314214-185.patch for re-testing.

stovak’s picture

Created a d7 backport issue. https://www.drupal.org/node/2488180

fietserwin’s picture

Status: Needs review » Reviewed & tested by the community

With the points raise in #177 and #184 solved or explained, this is OK now.

fietserwin’s picture

Patch no longer applies, so needs a reroll. In that case: can you also cover my 1st point from #182?

anavarre’s picture

Issue tags: +Needs reroll
stefan.r’s picture

Issue tags: -Needs reroll
FileSize
567 bytes
21.51 KB

re-rolled

fietserwin’s picture

Status: Needs work » Needs review
stefan.r’s picture

Thanks @fietserwin for the review. Patch is green again.

fietserwin’s picture

Status: Needs review » Reviewed & tested by the community

As per #190, as there are no real changes in this patch compared to #186.

catch’s picture

I've given this a once over and I think it's OK.

Slightly concerned about the filename base64 encoding for uniqueness, but don't have a better idea (i.e. I think that's probably better than shortening the column). Want to think about that a little bit more before committing though.

stefan.r’s picture

Thanks. What specifically is a worry about the hashing? Having a column on the database that really doesn't "mean" anything?

Just to clarify for anyone reading this, the hash is a base64-encoded sha256 hash -- not just base64-encoded.

catch’s picture

Status: Reviewed & tested by the community » Needs review

Thanks. What specifically is a worry about the hashing? Having a column on the database that really doesn't "mean" anything?

Yes exactly. We're changing the schema for something that's an InnoDB-specific limitation.

For example we avoided doing similar in #83738: LOWER(name) queries perform badly; add column name_lower and index. and eventually went with #279851: Replace LOWER() with db_select() and LIKE() where possible. An example of where the additional column could go wrong is if someone directly updated a filename in the database - then the hash would no longer match. We don't support manually updating things in the database outside of update handlers, but it's not impossible an update handler would need to update some filenames then they've additionally got to base64 hash those which is easy to forget. Typing this out is increasing my unease - so back to CNR for more opinions I think.

One option would be to not fix this particular column in this issue, and revive efforts to get transliteration for filenames into core again. Or at least to split it out into its own issue.

As well as that, we could use sqlite and postgres test runs before committing this.

stefan.r’s picture

SQLite and PostgreSQL test runs were done by @amateescu in #172, the patch is not fundamentally different.

Maybe we could we add a trigger that hashes the field instead of doing this on the PHP level?

"Not fixing" this column would imply we limit the length to 191 characters, which would mean migrations have that as a "known issue" until we get in transliteration. I had also tried implemented URLencoding in an earlier patch (#157) but didn't really like the looks of it -- similar concerns would apply if people bypass core APIs and change the column directly.

morgantocker’s picture

@stefan.r - I don't recommend using triggers. Prior to MySQL 5.7, there is one trigger per-event per-table (event: after insert, before insert etc.). Many DBAs require triggers to be able to perform schema changes and migrations. If the application also requires triggers, they won't be able to do this.

Triggers in Amazon RDS are also a pain:
https://www.percona.com/blog/2014/07/02/using-mysql-triggers-and-views-i...

stefan.r’s picture

Then probably our best bet is to take out the hashing, limit to 191 and reopen the filename transliteration issue so we can increase the limit back to 255.

Alternatively, we take out the unique constraint (temporarily) on the database level. We could still have a Drupal-level unique constraint for filenames.

fietserwin’s picture

Status: Needs review » Needs work

#200:

We don't support manually updating things in the database outside of update handlers

#203:

Alternatively, we take out the unique constraint (temporarily) on the database level. We could still have a Drupal-level unique constraint for filenames.

I think that, in terms of a solution, both of you are saying the same thing here ... so drop that column and only check on application level.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
5.62 KB
20.54 KB

This takes out the database-level unique constraint and adds a Symfony constraint on the URI field instead.

The database-level constraint seems useful though, so maybe we should have another look into transliterating filenames so we can bring it back as soon as that's in?

fietserwin’s picture

Status: Needs review » Needs work

3 minors:

  1. +++ b/core/modules/file/src/Entity/File.php
    @@ -252,7 +252,8 @@ public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
    +      ->addConstraint('FileUri', []);
     
    

    I wouldn't pass in the 2nd param and rely on the default (being NULL, not an empty array).

  2. +++ b/core/modules/file/src/Plugin/Validation/Constraint/FileUriConstraint.php
    @@ -0,0 +1,31 @@
    +class FileUriConstraint extends Constraint {
    +
    

    Looking a bit around regarding the naming of constraints, I found:
    - UserNameUnique
    - UserMailUnique
    - UserNameConstraint
    - LinkAccess

    So, it looks that if the constraint is self explaining, the word constraint is not added to the end. This would suggest FileUriUnique as id and class name.

  3. +++ b/core/modules/file/src/Plugin/Validation/Constraint/FileUriConstraint.php
    @@ -0,0 +1,31 @@
    +  public $message = 'A file with this URI %value already exists. Enter a unique file URI.';
    +
    

    ... this URI %value ... does not sound correct to me, but I'm not a native English speaker. I would either:
    - Change 'this' to 'the', or
    - (my pereference) Use something like (comparing to usernameunique and usermailunique): 'The file %value already exists. Enter a unique file URI.'.

    If you change this, also change the test...

stefan.r’s picture

Status: Needs work » Needs review
FileSize
20.27 KB
2.83 KB
stefan.r’s picture

Status: Needs work » Needs review
FileSize
20.27 KB
539 bytes

Before RTBCing, can we confirm no new tests should fail in PostgreSQL/SQLite, either through a new test run or a manual code review comparing with the patch that had already been tested by @amateescu (drupal-utf8mb4-1314214-165.patch)

This would also need a follow-up issue where we discuss transliteration in core so we can put back the database-level unique constraint as soon as that's back in.

stefan.r’s picture

fietserwin’s picture

I'm not sure about the id used for the FileUriUnique constraint (FileUri). Other constraints around in D8 seem to use the class name as id, only chopping off Constraint (bu not Unique or Access). So that would be a final correction to the patch if the postgres and sqlite tests do not uncover anything else.

Leaving to NR to as the tests may run without that final change and to invite others to give this patch a final review.

stefan.r’s picture

FileSize
1.01 KB
20.28 KB

Let's be consistent with the rest of core then :)

Status: Needs work » Needs review

isntall queued 212: 1314214-212.patch for re-testing.

fietserwin’s picture

Thanks, perfect. The (transliteration) follow-up has also been created, so RTBC for me. @amateescu: are you also OK with this patch?

bzrudi71 queued 212: 1314214-212.patch for re-testing.

Status: Needs review » Needs work

The last submitted patch, 212: 1314214-212.patch, failed testing.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
20.4 KB
342 bytes

This fixes the latest test failure as the generated Migrate files now have an MD5 checksum. Just for the record, benjy had OK'ed this in #176.

stefan.r’s picture

Issue summary: View changes

Updating issue summary to reflect latest changes.

stefan.r’s picture

@bzrudi71 did test runs on PostgreSQL/SQLite this weekend:

Hi Stefan,

good news! No problems with the patch for PG and SQLite :) The only fail is the MigrateTableDumpTest that now also fails for current patch on MySQL…

So, RTBC :)

Cheers Rudi

The MigrateTableDumpTest failure was fixed in #218

fietserwin’s picture

Status: Needs review » Reviewed & tested by the community

So, RTBC :)

stefan.r’s picture

FileSize
5.93 KB

Thanks.

Just for reference, here's an interdiff with the previous RTBC patch that had been reviewed by @catch.

alexpott’s picture

Assigned: Unassigned » catch

Afaics @catch's concerns have been addressed. I'm going to assign it to him rather than commit myself given @catch's prior involvement.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 218: 1314214-218.patch, failed testing.

stefan.r queued 218: 1314214-218.patch for re-testing.

stefan.r’s picture

Status: Needs work » Reviewed & tested by the community

back to rtbc

alexpott’s picture

@stefan.r we you requeue a patch because you think the fail is unrelated it is important to comment on the issue to say what failed and why it is unrelated. This helps us to not introduce new random fails.

stefan.r’s picture

The failure was "Repository checkout: failed to checkout from [git://git.drupal.org/project/drupal.git].", the same error as was appearing on other issues.

  • catch committed b105158 on 8.0.x
    Issue #1314214 by stefan.r, phayes, ergophobe, YesCT, damienwhaley, Tor...
catch’s picture

Version: 8.0.x-dev » 7.x-dev
Assigned: catch » Unassigned
Status: Reviewed & tested by the community » Patch (to be ported)

So dropping the unique database constraint isn't great, but given #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) is the only viable way to add it back properly I think that's fine.

Committed/pushed to 8.0.x, thanks!

I'm not sure how viable this is to backport to 7.x, but like the varchar_ascii moving it there for discussion.

alexpott’s picture

This patch is causing problems for me - I can't install Drupal 8 anymore.

Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[42000]: Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes: CREATE TABLE {router} (
`name` VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'Primary Key: Machine name of this route',
`path` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The path for this URI',
`pattern_outline` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The pattern',
`fit` INT NOT NULL DEFAULT 0 COMMENT 'A numeric representation of how specific the path is.',
`route` LONGBLOB DEFAULT NULL COMMENT 'A serialized Route object',
`number_parts` SMALLINT NOT NULL DEFAULT 0 COMMENT 'Number of parts in this router path.',
PRIMARY KEY (`name`),
INDEX `pattern_outline_fit` (`pattern_outline`, `fit`)
) ENGINE = InnoDB DEFAULT CHARACTER SET utf8mb4 COMMENT 'Maps paths to various callbacks (access, page and title)'; Array
(
)
in db_create_table() (line 435 of /Volumes/devdisk/dev/sites/drupal8alt.dev/core/includes/database.inc).

Arla’s picture

I get the same error as @alexpott, with MySQL 5.6.24 (and 5.6.21).

Probably related (from https://news.ycombinator.com/item?id=7317519):

InnoDB limits index columns to 767 bytes. Why is this suddenly an issue? Because changing the charset also changes the number of bytes needed to store a given string. With MySQL’s utf8 charset, each character could use up to 3 bytes. With utf8mb4, that goes up to 4 bytes. If you have an index on a 255 character column, that would be 765 bytes with utf8. Just under the limit. Switching to utf8mb4 increases that index column to 1020 bytes (4 * 255).

catch’s picture

Version: 7.x-dev » 8.0.x-dev
Status: Patch (to be ported) » Needs work

Reverted.

From irc this is affecting people who use actual Oracle MySQL. MariaDB seems fine.

  • catch committed 537da55 on 8.0.x
    Revert "Issue #1314214 by stefan.r, phayes, ergophobe, YesCT,...
morgantocker’s picture

> From irc this is affecting people who use actual Oracle MySQL. MariaDB seems fine.

@catch: This probably has to do with default SQL mode selection. I would be surprised if with sql mode strict, they behave differently.

stefan.r’s picture

Do we know what the problem here is yet?

I don't have access to any Oracle MySQL installs right now, maybe anyone could test whether the fist statement gives an error and the second doesn't? Just in case the problem is the index.

CREATE TABLE test (
`name` VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'Primary Key: Machine name of this route',
`path` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The path for this URI',
`pattern_outline` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The pattern',
`fit` INT NOT NULL DEFAULT 0 COMMENT 'A numeric representation of how specific the path is.',
`route` LONGBLOB DEFAULT NULL COMMENT 'A serialized Route object',
`number_parts` SMALLINT NOT NULL DEFAULT 0 COMMENT 'Number of parts in this router path.',
PRIMARY KEY (`name`),
INDEX `pattern_outline_fit` (`pattern_outline`, `fit`)
) ENGINE = InnoDB DEFAULT CHARACTER SET utf8mb4 COMMENT 'Maps paths to various callbacks (access, page and title)';
CREATE TABLE test2 (
`name` VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT '' COMMENT 'Primary Key: Machine name of this route',
`path` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The path for this URI',
`pattern_outline` VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'The pattern',
`fit` INT NOT NULL DEFAULT 0 COMMENT 'A numeric representation of how specific the path is.',
`route` LONGBLOB DEFAULT NULL COMMENT 'A serialized Route object',
`number_parts` SMALLINT NOT NULL DEFAULT 0 COMMENT 'Number of parts in this router path.',
PRIMARY KEY (`name`)
) ENGINE = InnoDB DEFAULT CHARACTER SET utf8mb4 COMMENT 'Maps paths to various callbacks (access, page and title)';
alexpott’s picture

INDEX `pattern_outline_fit` (`pattern_outline`, `fit`) causes the problem

stefan.r’s picture

@morgantocker there is no command we can send to Oracle MySQL to make it cut off indexes at 191 characters, like MariaDB does? I guess we'll just have to define index length explicitly?

morgantocker’s picture

@stefan.r: yes. It is related to my earlier comment about differences in default sql mode. See this test case:

mysql56> ALTER TABLE test2 ADD INDEX `pattern_outline_fit` (`pattern_outline`, `fit`);
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
mysql56> set sql_mode='';
Query OK, 0 rows affected (0.00 sec)

mysql56> ALTER TABLE test2 ADD INDEX `pattern_outline_fit` (`pattern_outline`, `fit`);
Query OK, 0 rows affected, 1 warning (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql56> show warnings;
+---------+------+---------------------------------------------------------+
| Level   | Code | Message                                                 |
+---------+------+---------------------------------------------------------+
| Warning | 1071 | Specified key was too long; max key length is 767 bytes |
+---------+------+---------------------------------------------------------+
1 row in set (0.00 sec)

mysql56> show create table test2\G
*************************** 1. row ***************************
       Table: test2
Create Table: CREATE TABLE `test2` (
  `name` varchar(255) CHARACTER SET ascii NOT NULL DEFAULT '' COMMENT 'Primary Key: Machine name of this route',
  `path` varchar(255) NOT NULL DEFAULT '' COMMENT 'The path for this URI',
  `pattern_outline` varchar(255) NOT NULL DEFAULT '' COMMENT 'The pattern',
  `fit` int(11) NOT NULL DEFAULT '0' COMMENT 'A numeric representation of how specific the path is.',
  `route` longblob COMMENT 'A serialized Route object',
  `number_parts` smallint(6) NOT NULL DEFAULT '0' COMMENT 'Number of parts in this router path.',
  PRIMARY KEY (`name`),
  KEY `pattern_outline_fit` (`pattern_outline`(191),`fit`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Maps paths to various callbacks (access, page and title)'
1 row in set (0.00 sec)

MySQL 5.6 is strict by default in new installations and strict for all in 5.7. This is probably not an issue for a non-unique index with two exceptions:
- Index-only scans are now not possible
- Strings with low selectivity in the left-most 191 characters.

It may be a problem with UNIQUE indexes because a constraint can not be imposed. I am not a fan of the 'loose' behavior.

@morgantocker there is no command we can send to Oracle MySQL to make it cut off indexes at 191 characters, like MariaDB does? I guess we'll just have to define index length explicitly?

You could achieve this by disabling strict mode, making the schema change, and then re-enabling strict mode.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
0 bytes
16.31 KB

So something like this... Though I'd rather have this taken care of by the database itself if possible (like MariaDB does)

stefan.r’s picture

FileSize
2.21 KB
22.42 KB

Just a proof of concept, will work on this further over the course of the week... or anyone feel free to refactor / further comment.

morgantocker’s picture

@stefan.r: To clarify, it's not quite true to say that this is being taken care of by the database. What the database is doing, is providing you with something other than what you asked for. In many cases this is very dangerous, which is why it is now disabled.

A more complete fix is available in 5.7, where the barracuda file format + innodb_large_prefix is enabled by default. This allows indexes up to 3072 bytes.

Berdir’s picture

Yeah, I don't think we should rely on the database here. We have to explicitly define how those indexes should be shortened. Quite often, it's not the last part that should be shortened but the first part of the index I think for example with the router table.

We've been doing that before, we just had higher limits that we could rely on.

@morgantocker: Good to know, so based on our history of which MySQL version we support, we just have to wait 10 years or so until we can require 5.7 and our problem will be solved ;)

Status: Needs review » Needs work

The last submitted patch, 241: 1314214-240.patch, failed testing.

The last submitted patch, 241: interdiff-218-240.patch, failed testing.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
23.69 KB
3.35 KB

This cuts off all indexes at 191 characters on non-ASCII fields that are MySQL string types (and thus are utf8mb4 encoded).

Probably the "non-last part" shortening should be dealt with in a separate issue as the implicit behavior of the previously commited patch (at least on MariaDB) was to shorten the last part.

stefan.r’s picture

FileSize
2.28 KB
23.78 KB

Nitpicks

alexpott’s picture

Status: Needs review » Needs work

The current patch makes the router table look like this - and Drupal installs for me.

CREATE TABLE `router` (
  `name` varchar(255) CHARACTER SET ascii NOT NULL DEFAULT '' COMMENT 'Primary Key: Machine name of this route',
  `path` varchar(255) NOT NULL DEFAULT '' COMMENT 'The path for this URI',
  `pattern_outline` varchar(255) NOT NULL DEFAULT '' COMMENT 'The pattern',
  `fit` int(11) NOT NULL DEFAULT '0' COMMENT 'A numeric representation of how specific the path is.',
  `route` longblob COMMENT 'A serialized Route object',
  `number_parts` smallint(6) NOT NULL DEFAULT '0' COMMENT 'Number of parts in this router path.',
  PRIMARY KEY (`name`),
  KEY `pattern_outline_fit` (`pattern_outline`(191),`fit`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Maps paths to various callbacks (access, page and title)'

So looks good. I guess this new code could do with some tests considering that #218 was green and got committed.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
26.94 KB
3.37 KB

Added tests

stefan.r’s picture

Just so we can get this back in, is anyone willing to review this? :)

stefan.r’s picture

Issue summary: View changes

Updating issue summary to reflect this latest change

fietserwin’s picture

Status: Needs review » Needs work

On reviewing, I found 3 minors. The added test looks OK. I did not test the install, but it looks like @alexpott did so in #248.

  1. +++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
    @@ -279,6 +293,63 @@ protected function createKeysSql($spec) {
    +   * @param $spec
    +   *   The table specification.
    +   *
    

    Data types are required as of D8: https://www.drupal.org/coding-standards/docs#param

  2. +++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
    @@ -279,6 +293,63 @@ protected function createKeysSql($spec) {
    +            if (!(isset($mysql_field['type']) && $mysql_field['type'] == 'varchar_ascii') && !(isset($mysql_field['length']) && $mysql_field['length'] <= 191)) {
    +              // Limit the index length to 191 characters.
    

    I would have phrased this differently (using de Morgan laws), but I guess that is also a matter of preference. So I will still RTBC if you don't change this.

  3. +++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
    @@ -279,6 +293,63 @@ protected function createKeysSql($spec) {
    +   * @param $index
    +   *   The index array to be used in createKeySql.
    +   *
    

    idem as 1st point.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
1.56 KB
27.27 KB

Status: Needs review » Needs work

The last submitted patch, 253: 1314214-253.patch, failed testing.

stefan.r queued 249: 1314214-249.patch for re-testing.

The last submitted patch, 249: 1314214-249.patch, failed testing.

fietserwin’s picture

The failure is with a unique constraint, which is not in $spec['indexes'] but in $spec['unique keys']. But shortening a unique key index seems a no go to me as it would invalidate the purpose of the index of being a constraint.

Is the test data erroneous, as I can't find any hint to this unique constraint in the code, only in the test data file drupal\core\modules\system\tests\fixtures\update\drupal-8.beta11.bare.standard.php.gz.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
172.7 KB
199.7 KB

That's most likely where the error comes from then... This removes the unique constraint as a workaround.

stefan.r’s picture

Just looked at #2498625: Write tests that ensure hook_update_N is properly run and manually removing the constraints from the beta11 dump is probably not the right way to go :)

stefan.r’s picture

Status: Needs work » Needs review
FileSize
111.49 KB
253.18 KB
743.32 KB

Discussed with @catch/@alexpott on IRC and this updates the dump instead.

stefan.r’s picture

Status: Needs work » Needs review
FileSize
743.11 KB
253.19 KB

This updates the site name in the dump, which must be "Site-Install" to pass tests.

Status: Needs review » Needs work

The last submitted patch, 263: 1314214-263.patch, failed testing.

stefan.r’s picture

FileSize
253.17 KB
743.12 KB

Another try

daffie’s picture

Status: Needs work » Needs review

For the testbot.

fietserwin’s picture

Do I understand correctly that the only actual change is that the unique constraint was removed from the dump?

stefan.r’s picture

The change is that instead of taking the beta11 database dump, it takes the database dump of HEAD+the current patch using core/scripts/dump-database-d8-mysql.php.

So that includes removal of the unique constraint, as well as all other changes introduced in this patch.

fietserwin’s picture

Status: Needs review » Reviewed & tested by the community

Thanks, then this is RTBC again, based on #248, #253, your discussion on IRC with @catch/@alexpott (#263), and all tests being green in #265.

daffie’s picture

I have asked bzrudi71 to test the patch from this issue with a PostgreSQL backend to see if there are any problems with PostgreSQL.

Berdir’s picture

@stefan.r: I think you figured out already, since Core doesn't support an upgrade path yet, updating the dump is fine, as the dump we want to have is the one where an official upgrade path is supported from.

However, it would be great if you could open an related issue in the head2head project and describe the necessary changes (also the actual collation change, since existing sites will want that change too), I'm sure @amateescu will be more than happy to have patches already as well ;)

stefan.r’s picture

OK, opened an issue there.

@daffie those test runs are merely nice to haves, they shouldn't block this issue as this had already been previously tested on SQLite/PostgreSQL and compared to the previously tested patch this only limits MySQL indexes, and the test is MySQL-only.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 265: 1314214-265.patch, failed testing.

daffie queued 265: 1314214-265.patch for re-testing.

stefan.r’s picture

Status: Needs work » Reviewed & tested by the community

back to RTBC

  • catch committed d579f51 on 8.0.x
    Issue #1314214 by stefan.r, phayes, ergophobe, YesCT, damienwhaley,...
catch’s picture

Status: Reviewed & tested by the community » Fixed

OK. Committed/pushed to 8.0.x, thanks!

We have a week prior to the next beta to flush out any more issues, but bzrudi71 confirmed that a full postgres run was OK on #2157455: [Meta] Make Drupal 8 work with PostgreSQL or remove support from core before release.

David_Rothstein’s picture

Version: 8.0.x-dev » 7.x-dev
Status: Fixed » Patch (to be ported)

This is marked for D7 backport (although there's also the spinoff issue at #2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it - I'll comment there too). Moving this one back to Drupal 7 for now.

mikeryan’s picture

Issue tags: +Needs change record

Umm, no change record? I know of at least three surprise breakages due to this patch...

Berdir’s picture

Yes, I also had to update multiple contrib modules because of this.. mostly primary keys and unique indexes that broke. Seems like a D7 backport of a similar patch would be very risky.

stovak’s picture

@berdir - can you list your breakages or is this in the thread somewhere?

mikeryan’s picture

stefan.r’s picture

@Berdir D8 contrib breaking was on purpose but D7 contrib not sure we can deal with, in the other thread I had mentioned maybe on D7 we should only allow utf8mb4 if the database system support large prefixes.

As to the change record, it'll need to mention fields with primary keys or unique keys need to be either drop the constraint, limit to 191 characters or change type to varchar_ascii, @mikeryan is that missing anything?

Berdir’s picture

Don't remember them all, but I had to fix primary keys in ultimate_cron, file_entity (file metadata) and redirect (the hash column), possibly another one as well.

Many developers just pick 255 as the length for varchar fields and that doesn't work anymore.

mikeryan’s picture

@stefan.r: Nope, that's good - just the info needed to deal with the change.

stovak’s picture

@berdir - I've been working on a table_converter module that basically scans all the tables and tells you which ones you can convert without losing data. Would love feedback.

https://github.com/stovak/table_converter

included in the "supporting" folder is the core patch and the requisite my.cnf changes.

TR’s picture

This patch broke one of my modules - it won't install anymore because the index is a varchar(255) (same error as described in #231).

Is it too much to ask for a change record to be made *before* a change like this is committed? That would save many people a lot of time - it seems this failure was expected in contrib, which wouldn't have been a problem if it had been announced at https://www.drupal.org/list-changes/, but when the first notice is failing tests and when a search of recent changes doesn't indicate anything, then I have to do a git bisect and read 286 comments to figure out how to fix it. And I doubt I'm the only one who is affected.

I would think twice about backporting this to D7, as there will be many contributed modules this will break and many of those aren't aggressively maintained. I think this will bring down a lot of sites.

stovak’s picture

@tr - not the way I did it. I simply allowed both $charset and $collation to be set in settings.php. If the settings aren't there, Drupal proceeds as normal. I made the assumption you would specifcally _WANT_ this feature and would need to make changes in order to enact. It doesn't make sense to force this on d7 users who don't specifically need it. At least it doesn't make sense to me.

stefan.r’s picture

Draft change notice posted at https://www.drupal.org/node/2510940

alexpott’s picture

@stefan.r can you add some examples of how to fix stuff to the CR - thanks.

TR’s picture

What do you think about adding a check in the driver Schema.php so that these varchar(255) keys are caught and handled gracefully (with an informative user message and controlled return) when trying to create the table, rather than letting the database generate a fatal error and WSOD? The patch added normalizeIndexes()/shortenIndex(), so we already are detecting the problem, we're just not handling it very well.

Also, this 191 character limitation really needs to be documented somewhere.

Wim Leers’s picture

I did not know we committed a D8 DB dump as part of this patch. Updating that as part of other patches is very painful. If there is a way to not have that, that'd be better. Definitely follow-up material though.

EDIT: FYI: #2464427-65: Replace CacheablePluginInterface with CacheableDependencyInterface.

pingers’s picture

This has broken a bunch of contrib with long indexes... *sigh*

Dealing with repeated issues breaking the previous 333 char limit in <= D7 was bad enough, now 191 chars. Eeek.

/me attempts writing patches.

ansorg’s picture

Cannot install Beta 12 due to
Failed to connect to your database server. The server reports the following message: SQLSTATE[HY000] [2019] Can't initialize character set utf8mb4 (path: /usr/share/mysql/charsets/).

My shared hosting provider does only support utf8 :-(

This is DomainFactory which is pretty big hoster in Germany, guess this will hit many users

stefan.r’s picture

#292: was not as part of this patch
#293: The D8 contrib breakage was on purpose, refer to the CR for possible workarounds. Remember it's only a problem on unique indexes and primary keys, regular indexes are automatically shortened.
#294: the MySQL requirement has been raised to 5.5.3 per #2473301: Raise MySQL requirement to 5.5.3. However instead of that error, the installer should be saying "The mysql database server must support utf8mb4 character encoding to work with Drupal. Make sure to use a database server that supports utf8mb4 character encoding, such as MySQL/MariaDB/Percona versions 5.5.3 and up."

It doesn't?

ansorg’s picture

#1314214-295: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) MySQL at DomainFactory is 5.6.19 but technical support confirms that currently only utf8 is enabled or "preinstalled" as he said.

UTF8MB4 ist kein Zeichensatz der von uns vorinstalliert wurde. Diese Änderung ist in der Tat problematisch - wir werden hierzu weitere Analysen durchführen. Vorab bitten wir Sie jedoch, mit UTF8 weiterzuarbeiten, damit Drupal installiert werden kann.

stefan.r’s picture

Vorab bitten wir Sie jedoch, mit UTF8 weiterzuarbeiten

That's not possible in D8 beta12 and up...

Doing a google search on "Can't initialize character set utf8mb4" points to several possible issues, maybe @morgantocker would know what the problem coulud be here? What's the specific command that causes the issue?

In any case this could use a separate setup check so we can give a friendlier error message.

morgantocker’s picture

utf8mb4 is enabled by default from 5.5+, so it looks like DomainFactory has explicitly disabled it. I have honestly not heard of this being done before.. :)

It is possible to have a more specific feature detection to cater for this:

mysql> SHOW CHARACTER SET like'utf8mb4';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+
1 row in set (0.00 sec)

If it returns >0 rows, the feature exists.

stefan.r’s picture

OK, I'll see about adding a patch that does that.

@ansorg can you confirm the command "SHOW CHARACTER SET like'utf8mb4';" gives an empty result for you? Would be good if you could keep pushing your provider to see if they can fix and report back what the actual problem was. It's unclear why they would disable that character set on purpose...

Also, did you have this problem when installing Drupal or during an upgrade? Because the installer should already test for SET NAMES utf8mb4 not giving an errors...

ansorg’s picture

Unfortunately, I can run SQL only through PhpMyadmin and unfortunately there I do get a result for the query

SHOW CHARACTER SET LIKE 'utf8mb4';

Charset 	Description 	Default collation 	Maxlen
utf8mb4		UTF-8 Unicode	utf8mb4_general_ci	4

The problem happens while installing a fresh download of Beta12 with an empty database.

Initially I tried to update a Beta10 using drush but I'm not sure anymore whether this did not work due to the utf8mb4 issue or something else.

jhedstrom’s picture

I opened #2522098: Shorten index length for unique and primary keys in addition to normal indexes since schemas with unique or primary keys that need shortening do not currently work, and fail.

For instance, this schema:

  'fields' => array(
    'cid' => array(
      'type' => 'varchar',
      'not null' => TRUE,
      'length' => '255',
      'default' => '',
      'binary' => TRUE,
    ),
    'data' => array(
      'type' => 'blob',
      'not null' => FALSE,
      'size' => 'big',
    ),
    'expire' => array(
      'type' => 'int',
      'not null' => TRUE,
      'size' => 'normal',
      'default' => '0',
    ),
    'created' => array(
      'type' => 'numeric',
      'not null' => TRUE,
      'precision' => '14',
      'scale' => '3',
      'default' => '0.000',
    ),
    'serialized' => array(
      'type' => 'int',
      'not null' => TRUE,
      'size' => 'small',
      'default' => '0',
    ),
    'tags' => array(
      'type' => 'text',
      'not null' => FALSE,
      'size' => 'big',
    ),
    'checksum' => array(
      'type' => 'varchar',
      'not null' => TRUE,
      'length' => '255',
    ),
  ),
  'primary key' => array(
    'cid',
  ),
  'indexes' => array(
    'expire' => array(
      'expire',
    ),
  ),

Fails with:
Syntax error or access violation: 1071 Specified key was too long; max key length is 767 bytes

stefan.r’s picture

Regarding #300: wouldn't that point to a configuration issue on your hosting provider's side if the character set is listed anyway? Doing a google search for the error message you showed earlier did point to that as well.

@jhedstrom #301 was as designed, answering in the new issue

reevo’s picture

I'm seeing the same issue during install as #294, using the following:

CentOS 6.4
MySQL 5.5.44 (mysql55w from Webtatic repo)
PHP 5.5.26

...and seeing this error message when I try to initialise a PDO connection using the same $dsn as D8:

Character set 'utf8mb4' is not a compiled character set and is not specified in the '/usr/share/mysql/charsets/Index.xml' file

Drupal is unable to connect to a MySQL server which doesn't support utf8mb4 with charset=utf8mb4 in the $dsn.

Edit: I've created an issue with patch to catch this error: #2529188: Provide better error handling for MySQL client and server utf8mb4 incompatibility

stefan.r’s picture

@reevo mysql55w should have utf8mb4 already compiled in/available, what other packages did you have that have the word mysql in them? The error may have to do with libmysqlclient, you have the latest webtatic version for that?

On this page the error is mentioned as well, where they remove all old mysql packages before instlaling libmysqlclient16, mysql55w, mysql55w-libs, mysql55w-devel and mysql55w-server: https://manageyp.github.io/ruby-on-rails/2015/03/05/aliyun-rds-change-my...

reevo’s picture

@stefan.r here's my list of mysql related packages:

libmysqlclient16.x86_64 5.1.69-1.w6     @webtatic                               
mysql55w.x86_64         5.5.44-1.w6     @webtatic                               
mysql55w-libs.x86_64    5.5.44-1.w6     @webtatic                               
mysql55w-server.x86_64  5.5.44-1.w6     @webtatic                               
php55w-mysql.x86_64     5.5.26-1.w6     @webtatic 

I've tried completely uninstalling and reinstalling them all and am seeing the same result.

morgantocker’s picture

Good call on identifying it is a client issue. That is a MySQL 5.1 client on a 5.5 install.

ansorg’s picture

re #306, not sure about the client thing?
At DomainFactory I see "MySQL client version: 5.1.61" -
On a Synology DiskStation I have utf8mb4 working with "Database client version: libmysql - mysqlnd 5.0.11-dev - 20120503 "

adammalone’s picture

stefan.r’s picture

Apparently libmysqlclient has supported utf8mb4 since 5.5.3 (just like MySQL server). Mysqlnd uses different version numbers and has supported utf8mb4 since 5.0.9.

So we may have to do a separate mysql_get_client_info() check in addition to a server check.

vijaycs85’s picture

We may need to test this with various OS/PHP packages before porting this to 7.x. As @stefan.r mentioned in #309, most of the php5.5.3 packages depend on old libmysqlclient version.

jhedstrom’s picture

I opened #2537928: MySQL index normalization only works on table create since the index normalization becomes problematic when adding indexes to existing tables. Thoughts on how to resolve would be greatly appreciated.

ansorg’s picture

another big german hoster, Strato.de, seems to have a similar mysql configuration as Domainfactory and that again breaks recent drupal-8 beta12
Failed to connect to your database server. The server reports the following message: SQLSTATE[HY000] [2019] Can't initialize character set utf8mb4 (path: /opt/RZmysql5/share/mysql/charsets/).

stefan.r’s picture

@ansorg as mentioned in #2521784: cannot install at shared hoster Strato.de due to unsupported utf8mb4 Charset, this is likely due to a client version incompatibility, we are working on better error handling for this in #2529188: Provide better error handling for MySQL client and server utf8mb4 incompatibility

bzrudi71’s picture

This commit causes me more and more headache ;) Don't get me wrong, I'm totally +1 for this, but... We see user reports with problems for at least Domainfactory.de, 1und1.de and strato,de. So we are talking about the biggest players in Germany with a marketing share (together) far above 50% (or even more)! All approaches I see so far are just about adding a better error handling (mysql_client version). TBH does anyone expect these companies with millions and millions of hosted sites to change their setup because Drupal 8 doesn't like it? ;) IMHO this smells like a big Drupal 8 showstopper for German customers to me :(
Again, I like the idea and as a PG user not even affected by this but this feels more and more like a nogo to me, sorry.

stefan.r’s picture

@bzrudi71 the minimum version requirement is 5.5.3 for both the PHP mysql driver and the server per #2473301: Raise MySQL requirement to 5.5.3 , so even if we revert this patch, if we keep the version requirement Drupal 8 would still not work on those environments.

If it's really too much a problem that some large hosters still use older client libraries on their shared hosting environments, maybe another alternative would be to work on a utf8 fallback so we can relax the client requirement, however that's opening a whole other can of worms. And I wonder if by the time D8 is widely adopted enough time will have passed for them to have upgraded their environment anyway.

For what it's worth, others have easily fixed the issue by switching mysql drivers (from libmysqlclient to mysqlnd). May be worth nudging these hosters in the same direction if upgrading libmysqlclient is problematic for them.

morgantocker’s picture

@bzrudi71, @stefan.r: my opinion is similar to stefan's here.

Please also consider that upstream (disclosure: where I am employed) no longer supports MySQL 5.1. Although not as common, there are also client vulnerabilities such as BACKRONYM, which will not be addressed. #2473301 took a look at hosting providers/distros and their minimum supported versions. 5.5 was released in 2010, and is not exactly bleeding edge by any stretch.

geerlingguy’s picture

For Drupal 7, should we consolidate efforts with #2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it? It seems that the only major differences here are we're not changing the default collation, since we have to support older versions of MySQL in D7, and we're adding charset configuration (which should first be added to Drupal 8).

DamienMcKenna’s picture

FYI for the Twitter module we've changed to storing the tweet text in a blob (#1910376: SQL error when importing tweet with emoji), which works around the problem and is enough of a fix for our meagre needs.

morgantocker’s picture

@DamienMcKenna I think it depends on the situation if this workaround is acceptable. Blob will mean no collation (sorting order and comparison). The comparison feature is often useful: depending on the collation it will be case insensitive and accent insensitive.

DamienMcKenna’s picture

@morgatrocker: Completely understandable.

eranmatis’s picture

Hi,
Do you know a host company which supports the Drupal 8 requirements ?
I just has a bad experience with MochaHost.
After a week of support ping pong, they said that their MySql doesn't support utf8mb4.

ansorg’s picture

@eranmatis yes, this kind of sucks.
I have checked several well-known hosters in Germany (DomainFactory, Alfahosting, 1&1, Strato) by now and although their MySQL Server versions are fine, the mysql-client libs for PHP are outdated. None of them will allow running Drupal 8.

basvredeling’s picture

Isn't it just a case of supporting mysql 5.6+? I asked https://www.byte.nl/ webhosting via twitter the other day. They don't anticipate any problems using utf8mb4. So I'm genuinely curious what the ping pong of #321 was all about.

stefan.r’s picture

@basvredeling no, it's about libmysqlclient - some hosters still run a an ancient version of that (5.1) which is incompatible with Drupal 8.

Upgrading libmysqlclient is not something they can do that easily, but there have been several reports here in the issue queue of people making an easy switch from libmysqlcleint to MySQLnd (the native driver that comes with PHP).

We also have a minimum version requirement for PHP, and the MySQLnd driver that comes with any version that meets the requirement does work with Drupal 8. So it would be good if people pressured these hosts to implement this.

stefan.r’s picture

I've been in touch with DomainFactory and they will fix their setup on October 20th:

From: <technik@df.eu>
Date: 2015-09-24 9:31 GMT+02:00
Subject: [domainfactory.de #5274109] Drupal 8 support

Hi Stefan,

we did our research and can confirm that the setup of drupal 8 works fine when we use pdo_mysql with mysqlnd. I think thats our prefered solution/workaround, we'll try to update our system (October 20, 2015) before your release date.
[...]
We will communicate the solution to our customers so they stop posting in your comment section ;)

Kind regards

redaccted
Senior Systemadministration
Rajab Natshah’s picture

Please, we need to have this patch in the next Drupal 7 version.
Still using the workaround module as a fix for this issue.

Strip 4-byte UTF8

https://www.drupal.org/project/strip_utf8mb4

We do have an interface for it too.
Strip 4-byte UTF8 back-end configuration page.

We want to jump to have that in Drupal.

Thank you.

bernard.verslype’s picture

La réponse de OVH à mes demandes d'explications:
D'après la documentation officiel MySQL je constate que l'UTF8MB4 n'est
valable qu'à partir de base de données en version 5.5.3 hors mysql5-18.perso
et mysql51-93.perso sont en version 5.1, version que vous avez choisi lors de
leur création.
Je vous invite ici à sauvegarder vos bases, à les vider, à les supprimer puis
à les recréer en 5.5 avant de réimporter vos données ce qui devraient vous
permettre d'utiliser l'UTF8MB4.

OVH's response to my requests for explanations:
According to the official MySQL documentation I see that is UTF8MB4
valid only from database Version 5.5.3 out mysql5-18.perso
mysql51-93.perso and are in version 5.1, version you chose when
their creation.
I invite you here to save your bases, emptying, and then to remove
to recreate in 5.5 before reimport your data which you should
allow to use the UTF8MB4.

geerlingguy’s picture

@RajabNatshah - To be clear, this issue is about adding support for 4-byte characters, not stripping them from content. But yes, many people are in favor of adding optional support for UTF8mb4 to core in Drupal 7 :)

Rajab Natshah’s picture

@geerlingguy .. Yes .. that what we want ..
But we keep having this patch ported every time ..
It did not come with the last Drupal 7 Release.

darol100’s picture

Seem to me that this already been committed to D8. What are the things that needs to be done to complete this issue in D7 ?

I know that @geerlingguy provide a work around on how to fix this issue on D7 website (Solving the Emoji problem in Drupal 7), but this requires to patch D7 core (which I'm not going to do).

tim.plunkett’s picture

Version: 7.x-dev » 8.0.x-dev
Status: Patch (to be ported) » Fixed
Issue tags: -Needs backport to D7

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Helgi Andri Jónsson’s picture

Hello everyone,

I made a module that uses entity metadata wrappers to loop over the properties of an entity on hook_entity_presave and it converts all unicode characters to html entities for properties of type text and formatted text. I know it would be great have this solved in core and its not really a solution to strip them from input as that would be counter to the expectation of a user experience.

I hope the module solves a middle ground. I also made sure to include configurable settings to allow us to blacklist or whitelist entities to check for unicode. Therefore we can isolate such input to certain entities like comments or messages.

https://www.drupal.org/project/unicode

Best Regards,
Helgi.

jibran’s picture

Version: 8.0.x-dev » 9.x-dev
David_Rothstein’s picture

Version: 9.x-dev » 8.0.0
geek-merlin’s picture