Problem/Motivation

We don't need to use UTF-8 for all our indexed machine names/UUIDs when they might as well be simple ASCII, especially not if we are looking to implement utf8mb4 in the future, which would further increase index size and could affect query buffer performance. (VAR)CHAR utf8mb4 fields also have a 191 character limitation on their indexes, while ASCII would allow for indexing all 255 characters.

Furthermore, not specifying on the schema level which fields should be UTF-8 and which ones should be simple ASCII adds to our technical debt, as developers may use UTF-8 characters in places where we haven't tested that they will actually work. Using non-ASCII characters in a module name disallows us from using hooks for instance, as PHP only accepts ASCII characters in function names.

Proposed resolution

Originally proposed by @sun & @Damien Tournoud:

#1314214-82: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols):

It's probably worth tidying up our schema (for example: machine name-type keys should probably use a ascii character set so as to reduce the size of the index)

we introduce support for an ascii or binary charset and use it to tidy up our indexes and primary keys (especially machine names and UUIDs that have poped up all over the place in Drupal 8).

What needs to happen is that we need to stop using Unicode columns (and indexes) where they are not necessary. This is a comprehensive change that cannot be workaround by simply bumping database version requirements.

Related issues

Remaining tasks

  • Review patch
  • File followup issue for ASCII support in other database engines than MySQL

User interface changes

N/A

API changes

  • By cleaning up our schema definitions in this issue, we explicitly disallow non ASCII characters for certain machine names on the database level. If we didn't disallow this already, we now do, and stick to supporting UTF-8 for content only.
  • Addition of a new is_ascii setting on the string formatter (in addition to the already existing "length" and "case_sensitive" settings).
  • Addition of a new varchar_ascii on the schema definition.

Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue category Task because this is an API cleanup / performance issue
Issue priority Normal, but affects performance, blocks #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) and reduces fragility
Prioritized changes The main goal of this issue is API clean-up and unblocking full UTF-8 support in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols)
Disruption Not disruptive for core because we don't use non-ASCII characters in funny places anyway
Files: 
CommentFileSizeAuthor
#122 interdiff-114-122.txt1.9 KBstefan.r
#122 1923406-122.patch39.23 KBstefan.r
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 92,250 pass(es).
[ View ]
#114 interdiff-112-114.txt627 bytesstefan.r
#114 1923406-114.patch39.14 KBstefan.r
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,959 pass(es).
[ View ]
#112 1923406-112.patch39.98 KBstefan.r
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,719 pass(es), 97 fail(s), and 145 exception(s).
[ View ]
#112 interdiff-107-112.txt2.16 KBstefan.r
#108 1923406-109.patch39.97 KBstefan.r
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,831 pass(es).
[ View ]
#108 interdiff-107-109.txt453 bytesstefan.r
#107 interdiff-81-107.txt3.47 KBstefan.r
#107 1923406-107.patch39.79 KBstefan.r
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,825 pass(es).
[ View ]
#105 1923406-105.patch39.77 KBstefan.r
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,773 pass(es), 4 fail(s), and 0 exception(s).
[ View ]
#105 interdiff-102-105.txt1.18 KBstefan.r
#102 1923406-102.patch38.44 KBstefan.r
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,765 pass(es).
[ View ]
#102 interdiff-95-102.txt2.57 KBstefan.r
#95 1923406-95.patch39.73 KBstefan.r
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 27,438 pass(es), 959 fail(s), and 959 exception(s).
[ View ]
#95 interdiff-94-95.txt477 bytesstefan.r
#94 1923406-94.patch39.73 KBstefan.r
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 27,440 pass(es), 960 fail(s), and 959 exception(s).
[ View ]
#94 interdiff-81-94.txt3.02 KBstefan.r

Comments

alexpott’s picture

Status:Active» Closed (won't fix)

The default regex on machine names limits the characters in machine names using the following regex [^a-z0-9_]+ so only ASCII characters are used.

amateescu’s picture

Status:Closed (won't fix)» Active

The improvement @sun proposes is for the schema of the table column that stores machine names :)

stefan.r’s picture

Issue summary:View changes

Does this still have a chance of going in? If I understand correctly this is blocking #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) (emojis!).

stefan.r’s picture

Title:Consider ASCII character set for machine name schema columns» Use ASCII character set in machine name / UUID columns for more efficient indexing
Issue summary:View changes
Status:Active» Needs review
Issue tags:+Performance, +database schema
StatusFileSize
new5.95 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,940 pass(es), 1 fail(s), and 3 exception(s).
[ View ]

Here's a start, this would add an 'ascii' key. We'd want to use this on all UUIDs, machine names, non-numeric indexed columns and primary keys where possible, there is no need for all those to be encoded in utf8.

Status:Needs review» Needs work

The last submitted patch, 4: 1923406-ascii-4.patch, failed testing.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new5.27 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,983 pass(es).
[ View ]

Seems MySQL has a problem if we don't declare a length for a varchar field in the schema.

Crell’s picture

I dislike adding a boolean flag on things. That violates Asimov's Law. (2 is the least likely number in the universe.)

Could this be instead done with a charset key that gets set to ascii (and defaults to something reasonable so 99% of cases don't need to specify it)?

stefan.r’s picture

I sort of like the simplicity of only supporting 1 character set though (UTF-8). As an exception, ASCII is a subset of UTF-8 that we can use for identifiers, but then even latin1 for instance I believe is not. Having a "charset" key sort of gives the impression we support custom column level character sets, the question is do we want that... it may lead to issues down the line. Then there is also the question of collations.

The ascii flag is really no much different from the case sensitive flag given what we would be using it for (ie. distinguishing between identifier fields and content fields). Maybe the name is unfortunate and we need to rename it to something more sensible? Or I don't know if creating a new type (ie something other than "string") is still possible at this stage?

Crell’s picture

The idea here is to use a smaller charset for fields that we know are only going to contain machine names ( A-Za-z0-9), correct? What about an "is-machine name" or "is trivial" or something like that? If we have to use a boolean value it should be on something that is conceptually boolean, whereas "Ascii" is not. We can then map that metdata to mean "oh yeah, use ascii encoding for this field, or something else if we find an even better option later on". (That latter part is a big part of why I dislike "ascci" as a boolean.)

I agree that custom charsets per-field is a can of worms even pandora doesn't want opened.

stefan.r’s picture

Correct, that was the idea... I agree something like that makes more sense than "ascii" for a flag name :)

I'll go with is_trivial for now, as it encompasses bundle, machine name, uuid, etc.

amateescu’s picture

How about is_alphanumeric?

stefan.r’s picture

That's less confusing but not really truthful, we'd still allow underscores and possibly dots :)

Aki Tendo’s picture

Wouldn't this be better suited to assert? Or are end users expected to input schemas??

amateescu’s picture

@stefan.r, I'd go with "friendlier" over "truthful but meaningless" any time :)

@Aki Tendo, Yep, schema is user input, even though it's less an less so in Drupal 8.

stefan.r’s picture

StatusFileSize
new22.39 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 68,110 pass(es), 577 fail(s), and 69 exception(s).
[ View ]
stefan.r’s picture

StatusFileSize
new506 bytes
new22.4 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 68,073 pass(es), 575 fail(s), and 71 exception(s).
[ View ]

There may be more columns that could be alphanumeric but this should cover most of them.

Some may require further validation to make sure we don't try to store non-ASCII data in there, and some may need some further conversion. I think one field we tried to store raw a SHA256 hash for instance, that should ideally be base64'd so we can make it an "alphanumeric" field.

Status:Needs review» Needs work

The last submitted patch, 16: 1923406-16.patch, failed testing.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new1.04 KB
new22.4 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 89,256 pass(es), 15 fail(s), and 18 exception(s).
[ View ]

The last submitted patch, 15: 1923406-15.patch, failed testing.

Status:Needs review» Needs work

The last submitted patch, 18: 1923406-18.patch, failed testing.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new22.4 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,229 pass(es).
[ View ]
new411 bytes
stefan.r’s picture

For easier reviewing I'll summarize the columns that have been turned into simple ASCII columns:

UUID fields
* All, through the UuidItem class

Language fields
* All, through the LanguageItem class

Content entity storage
* langcode and bundle field (The field instance bundle to which this row belongs), through SqlContentEntityStorageSchema.

Text long fields
* The format machine name, through the TextLongItem class

Text with summary fields
* The format machine name, through the TextWithSummaryItem class

cache_* bin tables
* cid (Unique cache ID)
* checksum (The tag invalidation checksum when this entry was saved.)

Cache table for tracking cache tag invalidations
* tag (Namespace-prefixed tag string)

Configuration table
* collection (Primary Key: Config object collection.)
* name (Primary Key: Config object name.)

Other key value tables
* key_value.collection
* key_value_expire.collection
* key_value.name
* key_value_expire.name

Menu tree storage
* menu_name (the menu name. All links with the same menu name (such as 'tools') are part of the same menu.)
* id (Unique machine name: the plugin ID.)
* parent (The plugin ID for the parent of this link.)
* route_name (The machine name of a defined Symfony Route this menu item represents.)
* provider (the name of the module that generated this link.)

Feeds
* hash (Calculated hash of the feed data, used for validating cache.)

Ban
* ip (IP address)

Comment
* entity_type (The entity_type of the entity to which this comment is a reply.)
* field_name (The field_name of the field that was used to add this comment.)

Dblog
* type (Type of log message, for example "user" or "page not found.)
* hostname (Hostname of the user who triggered the event.)

Entity reference
* target_id, through EntityReferenceItem class

File
* module (The name of the module that is using the file.)
* type (The name of the object type in which the file is used)
* id (The primary key of the object using the file)
* filemime (File MIME type)

Locale
* context (The context this string applies to)
* version (Version of Drupal where the string was last used (for locales optimization))
* language (Language code. References {language}.langcode)
* type (The location type (file, config, path, etc).)
* version (Version of Drupal where the location was found)
* locale_file.project (A unique short name to identify the project the file belongs to.)
* locale_file.langoode (Language code of this translation. References {language}.langcode.)

MenuLinkContent
* bundle (The content menu link bundle)
* menu_name (The menu name. All links with the same menu name (such as "tools") are part of the same menu.)

Node
* langcode (the {language}.langcode of this node.)
* node_grants.realm

Search
* langcode (The {languages}.langcode of the item variant.)
* type (type of item, e.g. node)

Shortcut
* set_name (The {shortcut_set}.set_name that will be displayed for this user.)

Simpletest
* message_group (the message group this message belongs to. For example: warning, browser, user.)
* class_name

System
* batch.token (A string token generated against the current user's session id and the batch id)
* flood.event (Name of event (e.g. contact).)
* flood.identifier(Identifier of the visitor, such as an IP address or hostname.)
* queue.name (The queue name.)
* router.name (Primary Key: Machine name of this route)
* semaphore.name (Primary Key: Unique name)
* semaphore.value (lock ID, is uniqid)
* sessions.sid (A session ID (hashed). The value is generated by Drupal's session handlers)
* sessions.hostname (The IP address that last used this session ID)
* url_alias.langcode (The language code this alias is for)

User
* users_data.name (The identifier of the data)
* users_data.module (The name of the module declaring the variable)

stefan.r’s picture

StatusFileSize
new23.1 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,434 pass(es).
[ View ]
new1.18 KB
yannickoo’s picture

Assigned:Unassigned» yannickoo

I'm currently reviewing #23. Going to make sure that no field is missing in the patch :)

yannickoo’s picture

StatusFileSize
new40.94 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,429 pass(es), 1 fail(s), and 0 exception(s).
[ View ]
new1.66 KB

Patch looks really good and I have also added is_alphanumeric property to long text's format column (which stores the used text format).

yannickoo’s picture

StatusFileSize
new42.04 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,805 pass(es).
[ View ]

The last submitted patch, 25: drupal-ascii_charset-1923406-25.patch, failed testing.

stefan.r queued 23: 1923406-23.patch for re-testing.

The last submitted patch, 25: drupal-ascii_charset-1923406-25.patch, failed testing.

Crell’s picture

  1. +++ b/core/config/schema/core.data_types.schema.yml
    @@ -438,20 +438,23 @@ field.value.*:
    +    is_alphanumeric:
    +      type: boolean
    +      label: 'US ASCII characters only'

    This is adding to the config schema, not DB schema. I don't think this is needed, is it? (DB schema and Config schema SHOULD be entirely independent.)

  2. +++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
    @@ -137,20 +137,23 @@ protected function createTableSql($name, $table) {
    +      if (!empty($spec['is_alphanumeric'])) {
    +        $sql .= ' CHARACTER SET ascii COLLATE ascii_general_ci';
    +      }

    We should check for a True value specifically, not just mere presence.

  3. +++ b/core/modules/aggregator/src/FeedInterface.php
    @@ -163,21 +163,22 @@ public function setImage($image);
        * @param string $hash
    -   *   A string containing the calculated hash of the feed.
    +   *   A string containing the calculated hash of the feed. Must be base64
    +   *   encoded so it contains US ASCII characters only.

    The comment here is a little confusing. The value that setHash() returns must be base64 encoded? That seems over-reach for an interface definition. Rather, it should just state that the hash must contain only US ASCII characters. base64 encoding is only one of many ways of achieving that.

I didn't look at the specific fields that are being flagged in detail, but I'll accept yanikoo and stefan's work on that. The above issues should be easy enough to resolve so we can move forward here.

The next step would be to include some benchmarks and a beta analysis table to justify this addition in 8.0. I'm OK with it if it passes beta criteria.

stefan.r’s picture

Issue tags:+drupaldevdays

Thanks @Crell.

1. Just to clarify, those are string field storage settings which can be either max_length, case_sensitive or is_alphanumeric and are translated to DB schema settings elsewhere:

<?php
  
public static function schema(FieldStorageDefinitionInterface $field_definition) {
     return array(
      
'columns' => array(
        
'value' => array(
          
'type' => 'varchar',
          
'length' => (int) $field_definition->getSetting('max_length'),
          
'binary' => $field_definition->getSetting('case_sensitive'),
          
'is_alphanumeric' => $field_definition->getSetting('is_alphanumeric'),
         ),
       ),
     );
   }
?>

We can use this setting in the UuidItem/LanguageItem/TextWithSummaryItem/TextLongItem field types and in some of the BaseFieldDefinitions in Feed/File/MenuLinkContent.

If we do the case_sensitive/binary setting in the config schema I guess we can do this one as well.. Maybe we should rename the config setting to something else, just for consistency with the other settings?

2. The binary / unsigned settings actually also do a !empty() check in the MySQL driver, are we sure we want to deviate from those? If needed we could break the previous behavior though and do stricter testing there as well, only accepting TRUE anymore...

We were also planning to do a last manual review of every single field just to make sure there is no way we can somehow insert non-ASCII data in those. We'll also need to document the new hook_schema setting in database.api.php.

stefan.r’s picture

StatusFileSize
new10.31 KB
  • Further reviewed all fields on a complete drupal install (with all modules) and found another few that should be "trivial" fields. I will update #22 to reflect this.
  • Renamed the field storage setting to is_trivial for consistency with other field storage settings and to clarify it's different to the schema setting.
  • Added documentation for new is_alphanumeric setting

For future reference, other than the following fields, all other indexed fields should now be set to ASCII:

- aggregator.title
- aggregator.url
- file_managed.uri
- block_content_field_data.info
- taxonomy_term_field_data.name
- locales.source
- menu_link_content_data.link__uri
- menu_tree.route_param_key
- node_field_data.title
- router.pattern_outline
- search_total.word
- shortcut_field_data.link__uri
- url_alias.source
- url_alias.alias
- users_field_data.name
- users_field_data.mail

stefan.r’s picture

StatusFileSize
new29.71 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 89,478 pass(es), 17 fail(s), and 22 exception(s).
[ View ]
new10.31 KB

Status:Needs review» Needs work

The last submitted patch, 34: drupal-ascii_charset-1923406-32.patch, failed testing.

Crell’s picture

Re the config schema: Yeesh. I didn't realize it was doing that. I defer to someone from the config system team on that front for what the best way to handle it is. It's irrelevant from a DB system perspective IMO. I would prefer to still call it is_numeric, though, as is_trivial is a rather meaningless term.

Re the type of truthy check, this is where PHP is really annoying. :-) The current code would read TRUE, 1, 5, "true", and "Game of Thrones" as "this is alphanumeric" and FALSE, NULL, '', and not-even-set as "not alphanumeric". I generally prefer tighter checks, but I suppose it's reasonable to allow 1 in addition to TRUE. *sigh* PHP... I could be convinced either way here.

One thing we should realize: Technically this means that machine names may no longer have non-latin characters. IE, a user in Tokyo would still have to use ASCII letters for a machine name of a View. Are we OK with the implications of that? (I am, but I defer to Gabor on the i18n UX implications of that.)

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new29.29 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,496 pass(es), 2 fail(s), and 0 exception(s).
[ View ]
new25.85 KB

I have reviewed all occurrences with @klausi and basically we should be fine to turn these fields into ASCII fields. He recommended going with is_ascii for both the setting and the schema setting.

@Crell as you well noted, not all of it is validated or codified to be ascii, we don't protect users against doing dumb things in code such as using japanese machine names. It probably threw an error before, and if it didn't, now it does! We thought this shouldn't be supported anyway -- the UI, PHP and the OS may not be able to deal with non-ASCII machine names, we should probably stick to UTF-8 for content only. For instance, we can't use japanese characters for a module name because then we can't use hooks as PHP functions need to be ASCII :)

Log of the manual review:
https://docs.google.com/spreadsheets/d/1g42xm9ZWUJRoBcIXaw96QfBZaUqsalGh...

stefan.r’s picture

Issue summary:View changes

Updated issue summary, added beta evaluation. If this is green we can proceed to profiling this.

stefan.r’s picture

Issue summary:View changes
stefan.r’s picture

Issue summary:View changes

Also just FYI this should not affect existing installs. New Drupal installs will just have some columns be ASCII instead of UTF-8.

Status:Needs review» Needs work

The last submitted patch, 37: drupal-ascii_charset-1923406-37.patch, failed testing.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new2.65 KB
new31.91 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,547 pass(es).
[ View ]

stefan.r’s picture

StatusFileSize
new1.12 KB
new32.97 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,553 pass(es).
[ View ]

OK, I am happy with how this looks now :)

stefan.r’s picture

How do we best benchmark this? Seems the testbot completes our test suite in 16:02 min which is about the time it usually takes. Do we have any scenarios that simulate a larger database?

yannickoo’s picture

Status:Needs review» Needs work
+++ b/core/modules/system/system.install
@@ -759,6 +759,7 @@
+ 'is_ascii' => TRUE,

@@ -766,6 +767,7 @@
+ 'is_ascii' => TRUE,

@@ -786,6 +788,7 @@
+ 'is_ascii' => TRUE,

@@ -794,6 +797,7 @@
+ 'is_ascii' => TRUE,

Tabs instead of whitespaces D:

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new32.99 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,560 pass(es).
[ View ]
new1.23 KB
stefan.r’s picture

From some initial simple where/order by/like benchmarks on a table with a million rows of random content I can't really seem to get the character set to affect index size or performance (as long as the content is ASCII). Don't know about more complex operations yet but maybe any performance gains are just theoretical/negligible?

Damien Tournoud’s picture

@stefan.r was struggling to prove the performance improvements in this issue and asked me to step in.

Performance is not the point: the point is to reduce the size of the indexes so that it is practical to move to utf8mb4 by default.

Databases are really smart in how they deal with your data, so don't be surprised that even if our schema doesn't make sense (we store ASCII identifiers in a Unicode column), our indexes are still relatively efficient. As long as the size of the indexes fit in memory, it is not going to make a significant difference to the query performance anyway.

This is definitely a huge improvement anyway, +1 from me.

stefan.r’s picture

StatusFileSize
new868 bytes
new33.5 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,394 pass(es).
[ View ]

Added two more fields that didn't really need to be UTF-8.

yannickoo’s picture

Status:Needs review» Reviewed & tested by the community

I think we can set this issue to RTBC now because this is an API clean-up and does not affect existing installations. We have done an extensive review of that together with other guys and Damien, Crell and klausi are also fine with this :)

Another good point is that the patch would unblock #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols).

stefan.r’s picture

Assigned:yannickoo» Unassigned
morgantocker’s picture

Morgan from the MySQL team here.

For the example of machine names, BINARY is better than ascii since its not really text and collations do not apply.

I would clarify that this may possibly be a micro-optimization:
- utf8mb4 indexes and columns store as variable length with InnoDB (so there should be no direct data size saving).
- Any performance loss is going to be in query buffers that are not variable length (during sort, temporary table or sending results to the client). MySQL 5.7 makes more of these buffers variable length, so in the specific case there is a heavy perf regression (I am expecting this to be rare), maybe it will be reasonable to recommend an upgrade?

So maybe to clarify the above: I don't think this feature is required.

stefan.r’s picture

@morgantocker did you really mean 5.7? Drupal 8 raising the minimum version to 5.7 is pretty unlikely.. But we've proposed raising it to 5.5 in #2473301: Raise MySQL requirement to 5.5.3

We did actually discuss using BINARY with our database maintainer @Damien Tournoud, but as mentioned in #48 performance is not the main point here -- this is about working around the maximum index size requirement (191 characters for utf8mb4 unless we use non-standard settings), and cleaning up the API to implicitly say "don't use non-ASCII characters for these machine-name fields"

morgantocker’s picture

@Stefan.r: I did not mean raising the minimum version to 5.7. I had stated that those requiring extreme performance have an acceptable solution. But I apologize that we may have been talking about different things. The physical index size will not change with ASCII, but the meta data size will (maximum theoretical length), and this is your current problem.

Maybe it is easier to clarify with examples:

mysql> CREATE TABLE t1 (a VARCHAR(255) NOT NULL, INDEX (a)) character set utf8mb4;
Query OK, 0 rows affected, 1 warning (0.03 sec)

mysql> show warnings;
+---------+------+---------------------------------------------------------+
| Level   | Code | Message                                                 |
+---------+------+---------------------------------------------------------+
| Warning | 1071 | Specified key was too long; max key length is 767 bytes |
+---------+------+---------------------------------------------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE t2 (a VARBINARY(255) NOT NULL, INDEX (a)) character set utf8mb4;
Query OK, 0 rows affected (0.02 sec)

mysql> CREATE TABLE t3 (a VARCHAR(255) CHARACTER SET latin1 NOT NULL, INDEX (a)) character set utf8mb4;
Query OK, 0 rows affected (0.03 sec)

This issue is suggesting example in t3. I was suggesting t2 may be more elegant.

stefan.r’s picture

Thanks for clarifying, that would explain why I noticed no size difference in my own tests when changing the encoding.

Indeed either BINARY or a different encoding should work in terms of solving the index size limitation for non-utf8mb4 data, although with BINARY a worry would be that we'd have to expand the scope of the current patch and deal with the encoding in Drupal itself, validating that what goes in and comes out of these fields is ASCII encoded data only. Just so as to not to have contrib/custom module developers try saving data in other encodings on those fields :)

stefan.r’s picture

Issue summary:View changes
Crell’s picture

Status:Reviewed & tested by the community» Needs review

Sounds like this is still an open question.

Thanks Morgan for jumping in with more information!

stefan.r’s picture

Well in terms of making sure that data in ASCII fields is actually ASCII and allowing for utf8mb4 encoding by default, I see 3 ways out here:

a) Reducing all indexes to 191 characters max instead of 255, sticking with utf8mb4 encoding everywhere, potentially hurting performance (although I guess smaller indexes could be better in some cases?). We'd also need to make sure on the database driver level that everything going in and out is actually ASCII encoded. This should not be a very expensive regex but it may further hurt performance. We'd also need to save somewhere which fields are ASCII or not as the DBTNG driver does not have access to the schema, which would mean further overhead and complexity. In any case it would need a more complicated patch :)

b) Going with the current patch which will make MySQL throw an error if we try to do a select with utf8 data on an ASCII field, and garble up the content to a question mark if we insert other data (which we've checked that we don't do in core), allowing us to use the full 255 characters for an index. Regardless of the index size we may also get a slight performance improvement vs option a) in terms of the query buffers, which as was well explained above, would apply to a lesser extent in MySQL 5.7.

c) The more "correct/elegant" solution of using (var)binary instead of (var)char on the fields marked with is_ascii as suggested earlier. The worry is this may be confusing in terms of DX as on the schema level we'd mark fields as both varbinary and is_ascii where we also already have a binary key that casts to binary for case sensitive comparisons. This should only be used for fields that are not text. In core we don't do this, but it could lead to further confusion for contrib/custom. Also, the same problem as in a) applies, in that we need to validate the encoding of what comes in and comes out of those fields on the PHP level. Other than being more elegant, I don't know if there's also a performance improvement vs option b), but this may be undone by the extra overhead of doing the ASCII check in the DB driver.

@Crell/Morgan any thoughts on this yourselves?

morgantocker’s picture

I find it confusing to have more than one character set in an application. This is why I like just calling it binary instead of ascii.

I doubt that reducing indexes to 191 bytes will hurt performance, since most strings I have performed analysis on are selective in the first 20-30 characters. The exception may be something like a list of string that all start with a common prefix (i..e http://www.drupal.org/node/ ..). In the pathological case can you recommend that the user explicitly enable innodb_large_prefix.

stefan.r’s picture

Thanks. It would be good if we could get Damien Tournoud to chime in on this as well.

Damien Tournoud’s picture

Conceptually, I actually think that a ASCII / UTF8 distinction makes more sense than a BINARY / UTF8 distinction, as the data that we store in those columns is just *not* binary. From a storage perspective, the end result is the same, but from a semantic perspective it is really not.

pwolanin’s picture

I tend to agree with DamZ - let's use ASCII and get the validation at the DB level.

Looking at the patch, I think the DX would be better if we defined a new schema API data type instead of a flag

i.e. instead of

@@ -75,6 +76,7 @@ public static function schema(FieldStorageDefinitionInterface $field_definition)
         'value' => array(
           'type' => 'varchar',
           'length' => 12,
+          'is_ascii' => TRUE,
         ),
       ),
     );

something like:

@@ -75,6 +76,7 @@ public static function schema(FieldStorageDefinitionInterface $field_definition)
         'value' => array(
-          'type' => 'varchar',
+          'type' => 'varascii',
           'length' => 12,
         ),
       ),
     );

I don't have a strong feeling, but varascii, varchar_ascii, etc some type name that makes it clear that the type is different and ASCII.

stefan.r’s picture

StatusFileSize
new31.33 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Unable to apply patch interdiff-48-63.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
new36.07 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 27,572 pass(es), 948 fail(s), and 2,505 exception(s).
[ View ]

The last submitted patch, 63: interdiff-48-63.patch, failed testing.

Status:Needs review» Needs work

The last submitted patch, 63: 1923406-63.patch, failed testing.

stefan.r’s picture

That patch seems close, let's see what @Crell says before fixing those test failures

Crell’s picture

We already have some type mapping logic, so introducing a custom virtual type is, eh, I guess OK. I'm not crazy about it but can live with it. However, I think the implementation may be more challenging:

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
@@ -144,6 +144,9 @@ protected function createFieldSql($name, $spec) {

@@ -144,6 +144,9 @@ protected function createFieldSql($name, $spec) {
       if (!empty($spec['binary'])) {
         $sql .= ' BINARY';
       }
+      if (isset($spec['type']) && $spec['type'] == 'varchar_ascii') {
+        $sql .= ' CHARACTER SET ascii COLLATE ascii_general_ci';
+      }
     }
     elseif (isset($spec['precision']) && isset($spec['scale'])) {
       $sql .= '(' . $spec['precision'] . ', ' . $spec['scale'] . ')';
@@ -218,6 +221,8 @@ public function getFieldTypeMap() {

@@ -218,6 +221,8 @@ public function getFieldTypeMap() {
     // database types back into schema types.
     // $map does not use drupal_static as its value never changes.
     static $map = array(
+      'varchar_ascii:normal' => 'VARCHAR',
+
       'varchar:normal'  => 'VARCHAR',
       'char:normal'     => 'CHAR',

Would this work? We map varchar_ascii to VARCHAR, then check for varchar_ascii... which shouldn't be there anymore as it's been replaced by varchar, right?

stefan.r’s picture

OK great to hear we're close to a solution here!

It should work, but I'll see if I can document that or make it less confusing somehow. The first test is for the 'type' key, the second test is for the 'mysql_type' key.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new1009 bytes
new36.18 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,668 pass(es), 3 fail(s), and 5 exception(s).
[ View ]
stefan.r’s picture

Title:Use ASCII character set in machine name / UUID columns for more efficient indexing» Use ASCII character set on alphanumeric fields so we can index all 255 characters
Issue summary:View changes

Updated issue summary. Also note SQLite / PostgreSQL support is outside of the scope of this issue as they don't have the utf8mb4 problem.

Status:Needs review» Needs work

The last submitted patch, 69: 1923406-69.patch, failed testing.

stefan.r’s picture

Issue summary:View changes

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new418 bytes
new36.69 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,693 pass(es).
[ View ]

stefan.r’s picture

StatusFileSize
new36.87 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,704 pass(es).
[ View ]
new1.1 KB
catch’s picture

Issue tags:+D8 upgrade path
catch’s picture

Priority:Normal» Major
stefan.r’s picture

I'm assuming any upgrade path would be outside of the scope of this issue? We don't support beta-to-beta upgrades yet do we?

amateescu’s picture

Yes, the upgrade path is not in scope of this issue. Beta-to-beta upgrades are provided by the HEAD to HEAD contrib module and the tag helps to keep track of issues that need a patch over there.

Crell’s picture

  1. +++ b/core/config/schema/core.data_types.schema.yml
    @@ -445,6 +445,9 @@ field.storage_settings.string:
    +    is_ascii:
    +      type: boolean
    +      label: 'Contains US ASCII characters only'

    I don't know the config schema well enough to say, but should we rename this since "is_ascii" is no longer parallel with the DB schema?

  2. +++ b/core/modules/comment/comment.install
    @@ -46,17 +46,18 @@ function comment_schema() {
           'field_name' => array(
    -        'type' => 'varchar',
    +        'type' => 'varchar_ascii',
             'not null' => TRUE,
             'default' => '',
             'length' => FieldStorageConfig::NAME_MAX_LENGTH,
    +        'is_ascii' => TRUE,
             'description' => 'The field_name of the field that was used to add this comment.',

    Shouldn't the is_ascii be gone now?

And... should we be using varchar_ascii or varchar:ascii? We seem to be using colons elsewhere, no? (I've not dug into the schema code in quite some time so I may be missing a context... I'm OK with whatever is most consistent for DX.)

stefan.r’s picture

Renaming isn't necessary as the string settings in cofig are independent of the database schema. I also discussed that with @pwolanin and we settled on keeping that, for instance the case_sensitive string config setting translates to "binary" on the schema level.

The colons would be confusing as they're used for size already, as far as DX is concerned an underscore is better.

Indeed the is_ascii on the hook_schema should be gone, I'll post a new patch.

stefan.r’s picture

StatusFileSize
new457 bytes
new36.72 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,721 pass(es).
[ View ]
Crell’s picture

Status:Needs review» Reviewed & tested by the community

Makes sense to me.

joelpittet’s picture

Silly question: can or should this be backported to D7?

catch’s picture

Status:Reviewed & tested by the community» Needs review

I don't think we can rely on cache IDs being ascii-only, or if we do it'd be an API change from 7.x - have seen URLs with utf8 characters used as cache IDs in 7.x for example.

catch’s picture

Also situations like that make me very tempted to schedule this patch for a couple of days after the next beta, to have some time to flush them out. We should use ascii as much as possible for index size, but only as much as possible.

hass’s picture

+++ b/core/lib/Drupal/Core/Database/Driver/sqlite/Schema.php
@@ -207,6 +207,8 @@ public function getFieldTypeMap() {
+

The empty line in array should be removed.

hass’s picture

+++ b/core/lib/Drupal/Core/Database/Driver/mysql/Schema.php
@@ -218,6 +222,8 @@ public function getFieldTypeMap() {
+      'varchar_ascii:normal' => 'VARCHAR',
+

One more empty line

stefan.r’s picture

@catch the worry is that as we switch to the utf8mb4 charset a primary key (such as the cache ID) can only be 190 characters anymore. I think an ASCII cache ID would be an API break worth having as the point of the patch is also to avoid us using UTF8 identifiers where not desirable... As to the specific D7 example you cite, in D8 we use a hash for that (I think as of #1855260: Page caching broken by accept header-based routing).

There was already a thorough review at the dev days, as far as D8 core is concerned at least it seemed solid, but there is still some uncertainty/doubt so looks like this will still need a final review. Which should be hours, not days, as some of the previous review has already been documented, but if there's no time to do so before the next beta maybe this issue should be set to postponed?

stefan.r’s picture

@hass the empty line was actually on purpose so we don't have to touch the vertically aligned mega array underneath and as it's a "special" data type anyway. The mysql and sqlite drivers use empty lines as well there to separate similar groups, it's probably fine there for now unless anyone still thinks we need to change the code style in those places.

catch’s picture

Status:Needs review» Needs work

I think an ASCII cache ID would be an API break worth having

Well it might be worth having but it's a completely undocumented restriction at the moment. Additionally it's only a restriction if you're using MySQL for the cache backend. So it would be possible to develop some code that works fine with another cache backend which then fails if you switch back to MySQL.

Either the mysql backend should normalize to ASCII, or it needs updating in the cache documentation at a higher level.

afaik we don't use a hash for URLs, we just automatically hash if a cache ID is longer than 256 characters, but don't have the issue to hand.

+++ b/core/modules/locale/locale.install
@@ -103,7 +103,7 @@ function locale_schema() {
       'language' => array(

Double-checked and we do validate language codes in \Drupal\language\Form\LanguageFormBase::validateCommon() so this and the rest of the patch should be OK - just cache IDs to figure out.

stefan.r’s picture

Status:Needs work» Closed (fixed)

Thanks for the final review @catch. Would it help if as a first step we did transliteration in DatabaseBackend::normalizeCid() and then had the discussion about whether to allow non-US ASCII characters at all (which is less pressing) in another issue?

So right now this looks like this:

<?php
 
protected function normalizeCid($cid) {
   
// Nothing to do if the ID length is 255 characters or less.
   
if (strlen($cid) <= 255) {
      return
$cid;
    }
   
// Return a string that uses as much as possible of the original cache ID
    // with the hash appended.
   
$hash = Crypt::hashBase64($cid);
    return
substr($cid, 0, 255 - strlen($hash)) . $hash;
  }
?>

Just to be clear, I'm proposing to add in a mb_check_encoding($cid, 'ASCII') === TRUE check along with the length check, and transliterate the initial part of the cache ID.

stefan.r’s picture

Status:Closed (fixed)» Needs work
stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new3.03 KB
new39.73 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 27,689 pass(es), 967 fail(s), and 1,520 exception(s).
[ View ]
stefan.r’s picture

StatusFileSize
new3.02 KB
new39.73 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 27,440 pass(es), 960 fail(s), and 959 exception(s).
[ View ]

Just to solve the cache ID issue in one go, would something like this work for now?

stefan.r’s picture

StatusFileSize
new477 bytes
new39.73 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 27,438 pass(es), 959 fail(s), and 959 exception(s).
[ View ]

Adding another empty line just to be consistent and address @hass feedback in #86/#87

The last submitted patch, 94: 1923406-94.patch, failed testing.

Status:Needs review» Needs work

The last submitted patch, 95: 1923406-95.patch, failed testing.

The last submitted patch, 93: 1923406-93.patch, failed testing.

pwolanin’s picture

Yeah, not sure we need to convert cache ID to ASCII. The shorter index should be ok - possibly just hash/truncate the cache ID to 190?

Crell’s picture

Having the MySQL cache backend hash at 190 instead of 255 sounds fine to me. Alternatively, I'd also be OK with omitting that field from the conversion here. have no strong preference.

stefan.r’s picture

@pwolanin not sure I follow. Part of the point of this issue was to disallow utf8mb4 where not needed, why disallow it in all machine names except cache IDs? Do we *need* special characters in there?

I think ideally we'd disallow non-ASCII characters on cache IDs altogether (by documenting this on the database cache backend interface) but there seems to be some doubt as to where we might use non-ASCII cache IDs in core, so transliterating them seemed like the most workable solution at least for now. normalizeCid() exists to work around database limitations; we already have a length limitation, which the normalization works around, so if we add an ASCII limitation here, the normalization would just transliterate the same way. At least until we confirm that core doesn't use any non-ASCII cache IDs anymore.

Also limiting utf8mb4 keys to 190 chars would probably have to be done in a followup issue as I don't know that cache ID is the only one.

@Crell: you and @pwolanin are talking about two different things here. It's not "either one or the other", it's either "both of those things" or "neither of them" :) Only utf8mb4 keys have the 190 char limitation, so if we exclude the cid field from conversion here, by definition we'll need to limit the length to 190 as well.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new2.57 KB
new38.44 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,765 pass(es).
[ View ]
pwolanin’s picture

disallowing utf8 in cache IDs is an API change per catch above - let's not do that?

stefan.r’s picture

Fair enough. Is there a concern with the current patch? (which doesn't pose an API change and still allows cache ID's to have any sort of format/length and just normalizes it to database limitations)

stefan.r’s picture

StatusFileSize
new1.18 KB
new39.77 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,773 pass(es), 4 fail(s), and 0 exception(s).
[ View ]

Adding a test

anavarre’s picture

+++ b/core/modules/system/src/Tests/Cache/DatabaseBackendUnitTest.php
@@ -33,4 +33,23 @@ protected function createCacheBackend($bin) {
+    // Set up a cache ID that is not ASCII and longer than 255 characters so we can test cache ID normalization.

Nit: 80 cols.

stefan.r’s picture

StatusFileSize
new39.79 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,825 pass(es).
[ View ]
new3.47 KB

Thanks, test had a problem as well. Interdiff is with the rtbc patch in 81.

stefan.r’s picture

StatusFileSize
new453 bytes
new39.97 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,831 pass(es).
[ View ]

Updating a comment

catch’s picture

Tested #107

Applied this change to node.routing.yml

Added one alias (/node -> /front)

Visited node/addお

diff --git a/core/modules/node/node.routing.yml b/core/modules/node/node.routing.yml
index 9f3af90..bde11ad 100644
--- a/core/modules/node/node.routing.yml
+++ b/core/modules/node/node.routing.yml
@@ -6,7 +6,7 @@ node.multiple_delete_confirm:
     _permission: 'administer nodes'

node.add_page:
-  path: '/node/add'
+  path: '/node/addお'
   defaults:
     _title: 'Add content'
     _controller: '\Drupal\node\Controller\NodeController::addPage'

Then I looked at the path alias preload cache IDs, which are by path:

| preload-paths:admin/config/search/path                                                   |
| preload-paths:node                                                                       |
| preload-paths:admin/content                                                              |
| preload-paths:node/addo2Tc-w7F3bOxdy8T5hzZrjgHxJhqBXIRsqNFJSYqJbas                       |
+------------------------------------------------------------------------------------------+

So in that case we could have just transliterated and not hashed - but the code always hashes if transliteration is needed, not a massive issue.

However I'm not sure we should necessarily transliterate here at all - it's an extra dependency for the database backend. Instead we could just do the ascii check, and hash without any prefix in that case - still ascii-safe that way and for the rare cases where there are non-ascii cache IDs I don't think we're losing much by ditching the prefix there - we only have the prefix to aid debugging/testing. Comes down to whether the transliteration dependency makes a measurable difference to the internal page cache.

Would be good to add test coverage for the database backend for whatever it ends up doing though.

The last submitted patch, 105: 1923406-105.patch, failed testing.

stefan.r’s picture

OK that works for me as well if we think it's worth losing the debugging aid for NON-ascii cache IDs.

As to the page cache ID specifically, I guess we could just urlencode that instead upon setting (instead of in the DB backend)?

The hash was to prevent transliteration collisions. I'm sure there's a way to have two different UTF8 IDs turn into the same ASCII ID if we don't use the hash.

stefan.r’s picture

StatusFileSize
new2.16 KB
new39.98 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,719 pass(es), 97 fail(s), and 145 exception(s).
[ View ]

The attached patch hashes any non ASCII cache IDs so that we can lose the transliteration dependency. There's still test coverage through the tests that were added in #107.

It also urlencodes the page cache ID, so at least there we'd still have the debugging aid. @catch would that be OK?

Status:Needs review» Needs work

The last submitted patch, 112: 1923406-112.patch, failed testing.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new39.14 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 91,959 pass(es).
[ View ]
new627 bytes

OK since that didn't work, maybe let's leave the urlencoding out as it's just going to make the scope of this issue expand, as catch pointed out it's not just page caching that uses URLs in the ID.

stefan.r’s picture

Just to summarize, relative to the previously RTBCed patch, the current patch just implements the cache ID hashing suggested by @catch which was the sole outstanding issue.

Maybe someone can validate this so we can move on to the parent issue and commit this after the beta on Wednesday? :)

catch’s picture

Status:Needs review» Reviewed & tested by the community

Moving back to RTBC, this looks good to me now.

catch’s picture

Version:8.0.x-dev» 7.x-dev
Status:Reviewed & tested by the community» Patch (to be ported)
Issue tags:+Needs change record

Committed/pushed to 8.0.x, thanks!

Opened #2482517: Upgrade path for #1923406.

Back to 7.x for backport discussion.

Also this could use a PSA-style change record for contrib schemas and the API addition.

  • catch committed ac4e35c on 8.0.x
    Issue #1923406 by stefan.r, yannickoo: Use ASCII character set on...
bzrudi71’s picture

FYI this seems to cause SchemaTest fails for PostgreSQL and (maybe) SQLIte too in regards to lastest bot logs :-(
d8pgbot.erwanderbar.de and d8sqlitebot.erwanderbar.de

Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "FULL" LINE 1: SHOW FULL COLUMNS FROM simpletest700187test_table ^: SHOW FULL COLUMNS FROM {test_table}; Array ( ) in Drupal\simpletest\TestBase->run() (line 998 of /opt/local/apache2/htdocs/drupal8/core/modules/simpletest/src/TestBase.php).

  • catch committed 1d2f9e9 on 8.0.x
    Revert "Issue #1923406 by stefan.r, yannickoo: Use ASCII character set...
catch’s picture

Version:7.x-dev» 8.0.x-dev
Status:Patch (to be ported)» Needs work

Ouch.

Reverted for now.

stefan.r’s picture

Status:Needs work» Needs review
StatusFileSize
new39.23 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 92,250 pass(es).
[ View ]
new1.9 KB

Yes, sorry we had missed that. I had already noticed that and included a fix in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) yesterday, this ports the fix into this patch.

@amateescu will do another SchemaTest run on SQLite/PostgreSQL to confirm.

amateescu’s picture

Here are the results on SQLite:

Test summary
------------

Drupal\system\Tests\Database\SchemaTest                      743 passes                                     

Test run duration: 1 sec

And on PostreSQL:

Test summary
------------

Drupal\system\Tests\Database\SchemaTest                      748 passes                                     

Test run duration: 19 sec

It's a bit weird that we have a different number of passes but.. meh :)

stefan.r’s picture

Well as the patch is now green on MySQL as well and the reason this was reverted was because of SchemaTest failing in PostgreSQL / SQLite, and the changes in this patch are isolated to SchemaTest, I guess this can be committed now?

amateescu’s picture

Status:Needs review» Reviewed & tested by the community

Yup, +1 to recommit.

alexpott’s picture

Status:Reviewed & tested by the community» Fixed

File followup issue for ASCII support in other database engines than MySQL

Let's do this to see if the other db engines benefit from this too.

Committed d57ee5f and pushed to 8.0.x. Thanks!

  • alexpott committed d57ee5f on 8.0.x
    Issue #1923406 by stefan.r, yannickoo, catch, Crell, amateescu, pwolanin...
alexpott’s picture

Adding review commit credit.

webchick’s picture

One more.

webchick’s picture

Status:Fixed» Reviewed & tested by the community

Sorry for the noise.

webchick’s picture

Status:Reviewed & tested by the community» Fixed

Hm.

David_Rothstein’s picture

Version:8.0.x-dev» 7.x-dev
Status:Fixed» Patch (to be ported)

Based on #117 there is supposed to be a backport discussion here?

stefan.r’s picture

Actually this patch may not need a backport.

Assuming we're going to be backporting #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) to Drupal 7, we'll need to work around the 191 character identifier length limitation somehow. This would also apply to tables defined by contrib/custom modules. So for every field, people can either:

  • shorten the field length
  • turn it into ASCII
  • turn it into single-byte utf8
  • drop the constraint
  • hash the field and put a unique constraint on the hash

...this is not a decision we can automate in a migration script, we could merely check the whole database schema for unique indexes/primary keys on varchar fields larger than 191 characters, and refuse to migrate otherwise.

With contrib and custom this could all get a bit messy, especially if they don't think this through and make fields with existing utf8 data ASCII and cause data loss during the migration of an existing install. Or if they use core fields in unexpected ways, ie. if they put utf8 characters in machine name fields.

So as far as D7 is concerned I'm leaning towards only allowing utf8mb4 on setups with innodb_large_indexes enabled, which allow us to bypass the 191 character limit on the database level instead. This is a non-standard setting but as of 5.7 it will be a MySQL default. This way we would lose the need for a backport of this patch as far as UTF-8 support is concerned.

Arla’s picture

The URI field schema is now incomplete: #2462155: Field settings schema missing for some field types (8.0.x)