Problem/Motivation

Drupal 8 includes Symfony's mbstring polyfill which means that core's Unicode library never using this the non mbstring code path. This makes the static set in Unicode::check() complete redundant and its default value wrong. This impacts unit and kernel testing as this then uses the non mbstring code path when no real site ever uses it.

Proposed resolution

  • Remove the status static from Unicode
  • Remove the non-mbstring code path from Unicode - it is never used
  • Remove the need to call Unicode::check to set anything
  • Fix bootstrap.php to normalise unit and kernel test environments
  • Fix Unicode::check() so that unicode_requirements() can actually recommend installing the mbstring extension

Remaining tasks

doit

User interface changes

none

API changes

Deprecate Unicode::setStatus() since this existed for testing purposes only and the static it sets is removed.

Data model changes

none

Original issue summary

Problem/Motivation

KernelTestBase does not provide an environment consistent with a booted DrupalKernel since the locale is not set and by default the use of mb string functions is disabled.

This has been manually worked around in some tests like \Drupal\Tests\token\Kernel\FieldTest::testTokenViewMode()

This was causing unexpected test fails for me when writing new core kernel tests that depend on Drupal correctly lowercasing unicode characters.

Proposed resolution

Copy this code from \Drupal\Core\DrupalKernel::bootEnvironment()
and add it to \Drupal\KernelTests\KernelTestBase::bootEnvironment()

    // Set sane locale settings, to ensure consistent string, dates, times and
    // numbers handling.
    setlocale(LC_ALL, 'C');

    // Detect string handling method.
    Unicode::check();
Members fund testing for the Drupal project. Drupal Association Learn more

Comments

pwolanin created an issue. See original summary.

pwolanin’s picture

Status: Active » Needs review
FileSize
1.07 KB
dawehner’s picture

Interesting. I totally agree that we should add that.
Is there any way we could actually test that? I guess for example by removing the workaround in the token test?

alexpott’s picture

We do part of this already in core/tests/bootstrap.php

// Set sane locale settings, to ensure consistent string, dates, times and
// numbers handling.
// @see \Drupal\Core\DrupalKernel::bootEnvironment()
setlocale(LC_ALL, 'C');

So that shouldn't be required. The way \Drupal\Component\Utility\Unicode::check() works is not nice. Why not make it lazy initialize? I.e if the status has not been set - set it correctly in getStatus(). /me looks at the way Symfony\Polyfill\Mbstring works and is jealous.

alexpott’s picture

Oh hilarious... we're loading vendor/symfony/polyfill-mbstring/bootstrap.php cause Symfony uses it - so our Unicode could depend on that and we could remove the static completely.

alexpott’s picture

Like so... no static and no unnecessary calls to Unicode::check() which now just does checking rather than any static initialisation.

alexpott’s picture

Also Unicode::setStatus() is horrid - why that was added for unit tests I'll never know - we shoulda just used reflection. Ughs. Probably my fault :) #1938670: Convert unicode.inc to \Drupal\Component\Utility\Unicode

alexpott’s picture

Oh and just for fun there is not D8 installation in the land where \Drupal\Component\Utility\Unicode::$status = \Drupal\Component\Utility\Unicode::STATUS_SINGLEBYTE apart from in our tests because function_exists('mb_strlen') will always return TRUE because of the polyfill :)

alexpott’s picture

+++ b/core/lib/Drupal/Component/Utility/Unicode.php
@@ -143,38 +143,33 @@ public static function setStatus($status) {
     if (!function_exists('mb_strlen')) {

Even more fun because this will never return FALSE we'll never recommend someone installs the mbstring extension!

alexpott’s picture

Issue summary: View changes
FileSize
479 bytes
14.02 KB

Let's fix #9 and update the issue summary.

alexpott’s picture

Title: Need to set locale and check for multibyte support in KernelTestBase::bootEnvironment() » Fix \Drupal\Component\Utility\Unicode() because of the Symfony mbstring polyfill
FileSize
20.44 KB
6.69 KB

Fixing up UnicodeTest since there's no single byte code path to test anymore.

alexpott’s picture

Missed a couple of things to clean up.

alexpott’s picture

7 files changed, 48 insertions, 263 deletions. - less code yay!

dawehner’s picture

Really nice work alex! I'll review it later properly.

mpdonadio’s picture

+++ b/core/composer.json
@@ -18,6 +18,7 @@
         "symfony/polyfill-iconv": "~1.0",
+        "symfony/polyfill-mbstring": "~1.0",
         "symfony/yaml": "~2.8",
+++ b/core/lib/Drupal/Component/Utility/Unicode.php
@@ -224,17 +219,7 @@ public static function encodingFromBOM($data) {
-    if (function_exists('iconv')) {
-      return @iconv($encoding, 'utf-8', $data);
-    }
-    elseif (function_exists('mb_convert_encoding')) {
-      return @mb_convert_encoding($data, 'utf-8', $encoding);
-    }
-    elseif (function_exists('recode_string')) {
-      return @recode_string($encoding . '..utf-8', $data);
-    }
-    // Cannot convert.
-    return FALSE;
+    return @iconv($encoding, 'utf-8', $data);

If the mb mb_convert_encoding() can do the same thing as iconv(), when why do we still have both libraries? Can we just use symfony/polyfill-mbstring instead of symfony/polyfill-iconv? Looks like this is the only usage on iconv() in core.

alexpott’s picture

@mpdonadio well considering we already have a polyfill for iconv core is always going to be doing that code path. Hence the change I chose to make. And the mbstring string polyfill is dependent on the iconv polyfill cause it uses iconv too...

dawehner’s picture

+++ b/core/lib/Drupal/Component/Utility/Unicode.php
@@ -322,16 +292,7 @@ public static function strtoupper($text) {
    *   The string in lowercase.
    */
   public static function strtolower($text) {
...
+    return mb_strtolower($text);
   }

@@ -399,90 +360,7 @@ public static function ucwords($text) {
    *   The shortened string.
    */
   public static function substr($text, $start, $length = NULL) {

I am wondering whether we should deprecate most of those functions now.

alexpott’s picture

@dawehner we totally could do that. On it.

alexpott’s picture

Made Unicode::mb_substr() simpler too... the NULL check here is pointless - https://secure.php.net/manual/en/function.mb-substr.php

We should file a followup to replace all the calls to the deprecated functions.

alexpott’s picture

dawehner’s picture

+++ b/core/lib/Drupal/Component/Utility/Unicode.php
@@ -358,9 +367,12 @@ public static function ucwords($text) {
   public static function substr($text, $start, $length = NULL) {
-    return $length === NULL ? mb_substr($text, $start) : mb_substr($text, $start, $length);
+    return mb_substr($text, $start, $length);
   }

I'm wondering whether this is some weird performance optimization? I doubt it thought to be honest. Do we have an explicit test to ensure this actually works?

alexpott’s picture

\Drupal\Tests\Component\Utility\UnicodeTest::testSubstr tests passing a NULL. I looked back at the original commit - #26688: Add mbstring support to Drupal no notes as to why - I just can't see it as a perf optimisation.

alexpott’s picture

Fixed a typo and improved the deprecation message.

Munavijayalakshmi’s picture

Rerolled the patch.

dawehner’s picture

Status: Needs review » Needs work
  1. +++ b/core/lib/Drupal/Component/Utility/Unicode.php
    @@ -106,8 +99,15 @@ class Unicode {
    +  public static function getStatus()
    +  {
    +    switch (static::check()) {
    

    I would guess this file is not excluded from the CS checking, right?

  2. +++ b/core/lib/Drupal/Component/Utility/Unicode.php
    @@ -279,15 +268,12 @@ public static function truncateBytes($string, $len) {
    +   * @deprecated in Drupal 8.3.0, will be removed before Drupal 9.0.0. Use
    
    @@ -298,18 +284,12 @@ public static function strlen($text) {
    +   * @deprecated in Drupal 8.3.0, will be removed before Drupal 9.0.0. Use
    

    Nitpick: We need to point to 8.4.x now ...

naveenvalecha’s picture

+++ b/core/composer.json
@@ -17,6 +17,7 @@
+        "symfony/polyfill-mbstring": "~1.0",

we are requiring a new package but not any changes in composer.lock file?

alexpott’s picture

@naveenvalecha that's because other things are already dependent on it. But also core/lib/Drupal/Component/Utility/composer.json should be update to have symfony/polyfill-mbstring too.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

alexpott’s picture

Status: Needs work » Needs review
FileSize
228.91 KB
27.59 KB

So here's a patch with proper deprecation notices for 8.6.0. The real patch with the conversions can be scripted by doing the following:

# Apply the smaller not tested issue patch https://www.drupal.org/project/drupal/issues/2849669

# Replace the occurrances including FQDN. Check core/lib/Drupal/Core/Mail/MailFormatHelper.php.
find ./core -type f -not -path "./core/node_modules/*" -not -path "./core/assets/*" -not -path "./core/misc" -exec sed -i -r 's/\\?(Drupal\\Component\\Utility\\)?Unicode::(substr|strlen|strtoupper|strtolower|strpos)/mb_\2/g;' {} \;

# Revert unnecessary changes
git co core/lib/Drupal/Component/Utility/Unicode.php
git co core/tests/Drupal/Tests/Component/Utility/UnicodeTest.php

# Fix unused uses
composer run-script phpcbf -- --sniffs=Drupal.Classes.UnusedUseStatement,PSR2.Namespaces.NamespaceDeclaration

The last submitted patch, 30: 2849669-script-30.patch, failed testing. View results

alexpott’s picture

Not bad... just a few stragglers.

alexpott’s picture

175 files changed, 401 insertions, 724 deletions. A lot less code and once we move to Drupal 9 less surface area to support too - 6 methods from Unicode are deprecated here.

Mile23’s picture

Just mentioning because maybe someone more knowledgeable than me should close it: #2911497: Make Drupal/Component/Utility/Unicode more readable (cleanup)

alexpott’s picture

Rerolled.