Transliterating a string containing an unknown character (e.g. 0x80) is not valid [#3001997]

Problem/Motivation

This bug was discovered while working on the related issue "Transliteration causes 2 capital letters at the beginning of a word".

If any word contains an unknown character (e.g. 0x80), the result is cropped to the first letter.

For example, the expected value for 'Hel' . chr(0x80) . 'o World' is "Hel?o World", but it returns "H".

Note: once "Transliteration causes 2 capital letters at the beginning of a word" is fixed, this behavior will change slightly.

How to reproduce:

$transliteration = new PhpTransliteration();
$unknown = chr(0x80);

// Unknown character between two "normal" characters. Expected output "Hel?o World"
$str1 = $transliteration->transliterate('Hel' . $unknown . 'o World');

// Unknown character between one space and one "normal" character. Expected output "Hell? World"
$str2 = $transliteration->transliterate('Hell' . $unknown . ' World');

Both calls return "H".

Proposed resolution

In core/lib/Drupal/Component/Transliteration/PhpTransliteration.php we need to improve the following code:

// Split into Unicode characters and transliterate each one.
foreach (preg_split('//u', $match, 0, PREG_SPLIT_NO_EMPTY) as $character) {

Add tests along these lines:

['en', 'Hello ' . chr(0x80) . ' World', 'Hello ? World'],
['en', 'Hel' . chr(0x80) . 'o World', 'Hel?o World'],
['en', 'Hell' . chr(0x80) . ' World', 'Hell? World'],

Remaining tasks

Add a test to check if unknown characters are correctly replaced with a non-default replacement e.g. $unknown_character = '*'.
Determine whether expected result of $string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80); is intended to be '?????' instead of '?'.

User interface changes

None

API changes

None

Data model changes

None

Comment	File	Size	Author
#20	3001997-20.patch	4.99 KB	alexpott
#20	8.6.x: PHP 7 & MySQL 5.5 24,460 pass PHP 7.2 & MySQL 5.5 24,461 pass
#20	16-20-interdiff.txt	3.53 KB	alexpott
#16	core_3001997-16.patch	5.11 KB	Krzysztof Domański
#16	8.6.x: PHP 7 & MySQL 5.5 24,451 pass
#14	core_3001997-14.patch	5.1 KB	Krzysztof Domański
#14	8.6.x: PHP 7 & MySQL 5.5 24,400 pass, 2 fail
#6	drupal-transliterating-unknown-char-3001997-6.patch	2.2 KB	scott_euser
#6	8.6.x: PHP 7 & MySQL 5.5 24,309 pass PHP 7.2 & MySQL 5.5 Patch Failed to Apply
#5	drupal-transliterating-unknown-char-3001997-5.patch	3.49 KB	scott_euser
#5	8.6.x: PHP 7 & MySQL 5.5 CI aborted


Comments

Comment #1

24 September 2018 at 16:55

Krzysztof Domański created an issue. See original summary.

Comment #2

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański as a volunteer commented 24 September 2018 at 17:05

Status:

Active

» Postponed

This issue should be postponed until "Transliteration causes 2 capital letters at the beginning of a word" is fixed. The transliterate() method in the PhpTransliteration class needs to be improved first.

Comment #3

scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commented 24 September 2018 at 19:04

Status:

Postponed

» Needs review

The attached patch solves the issue and produces your expected outcomes, but it should be reviewed by someone who is very confident about UTF-8 character encoding. The code to handle this was inspired by this comment on php.net.

Essentially the issue appears to be that preg_split('//u', $string, 0, PREG_SPLIT_NO_EMPTY); fails to split the string for your examples.
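The failure described above can be reproduced outside Drupal. The following is a minimal stand-alone sketch (the variable names are illustrative, not taken from the patch): with the u modifier, PCRE validates the subject as UTF-8 before matching, so a stray 0x80 byte makes preg_split() return FALSE instead of an array of characters.

```php
<?php
// Stand-alone check, independent of Drupal. Variable names are illustrative.
$invalid = 'Hel' . chr(0x80) . 'o World';

// With the /u modifier the subject must be valid UTF-8.
$parts = preg_split('//u', $invalid, 0, PREG_SPLIT_NO_EMPTY);

// The split fails outright rather than returning a partial array.
var_dump($parts === FALSE);
// The reported PCRE error is the "bad UTF-8" one.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
// mbstring agrees that the input is not valid UTF-8.
var_dump(mb_check_encoding($invalid, 'UTF-8'));
```

Any caller that iterates over the return value without checking for FALSE will therefore see no characters at all for such input.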

This is where I am less confident: as you can see, the patch changes the expected result of an existing test.

I am not sure if the expected result of $string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80); is intended to be '?' instead of '?????' denoting the 5 unknown characters. Not sure if that was an incorrectly set up test or the actual intention (i.e., 5 invalid characters resulting in a single '?').

I have a feeling the result of the single '?' was simply because preg_split was failing to split the string so that was the outcome and the test was trying to ensure that that remained the outcome. In reality, if a user passes a string of 5 unknown characters, getting 5 '?'s back seems more logical, particularly in more normal cases like the issue description where valid characters are mixed in.
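One way to look at the question is to check how many substitute characters mbstring itself produces for that byte sequence. This is only a sketch: it assumes the default mbstring substitute character ('?'), and the count it prints may depend on the PHP and mbstring version, so no particular number is claimed here.

```php
<?php
// The five-byte sequence from the existing test: an invalid lead byte
// followed by four continuation bytes.
$string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);

// Re-encoding UTF-8 to UTF-8 replaces invalid input with the mbstring
// substitute character (assumed to be the default '?').
$converted = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

// Print how many substitute characters came back; the number is not asserted
// because it may vary between PHP versions.
printf("%d substitute character(s): %s\n", strlen($converted), $converted);
```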

So a lot of question marks here both literally and figuratively!

Comment #4

scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commented 24 September 2018 at 18:57

To be honest, I think it does not particularly matter whether this or the related issue goes first: whichever does I am happy to update the other patch - feel free to disagree of course!

Comment #5

scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commented 24 September 2018 at 19:00

File	Size
drupal-transliterating-unknown-char-3001997-5.patch	3.49 KB
8.6.x: PHP 7 & MySQL 5.5 CI aborted

And the patch, sorry!

Comment #6

scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commented 24 September 2018 at 19:02

File	Size
drupal-transliterating-unknown-char-3001997-6.patch	2.2 KB
8.6.x: PHP 7 & MySQL 5.5 24,309 pass PHP 7.2 & MySQL 5.5 Patch Failed to Apply

Fixed patch, removed test code.

Comment #7

24 September 2018 at 19:02

The last submitted patch, 5: drupal-transliterating-unknown-char-3001997-5.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Comment #8

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański as a volunteer and at abventor commented 21 November 2018 at 18:50

Status:

Needs review

» Needs work

Patch #6 Failed to Apply.

Comment #9

scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commented 22 November 2018 at 08:00

Thanks - yeah, I need to update the patch now that this was committed:
https://www.drupal.org/project/drupal/issues/3000630

They affect more or less the same lines of code. Will try to get to this in the next few days.

Comment #10

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott commented 22 November 2018 at 11:55

Nice bug find!

+++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
@@ -107,8 +107,18 @@ public function removeDiacritics($string) {
-    // Split into Unicode characters and transliterate each one.
-    foreach (preg_split('//u', $string, 0, PREG_SPLIT_NO_EMPTY) as $character) {
+    $characters = [];
+
+    // Split string into array handling unknown characters.
+    $strlen = mb_strlen($string);
+    while ($strlen) {
+      $characters[] = mb_substr($string, 0, 1, 'UTF-8');
+      $string = mb_substr($string, 1, $strlen, 'UTF-8');
+      $strlen = mb_strlen($string);
+    }
+
+    // Transliterate each character.
+    foreach ($characters as $character) {

Rather than removing the preg_split() we can enforce UTF-8 by doing

$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

before the preg_split(). This replaces all unknown UTF-8 characters with '?' and the new tests will pass, with fewer function calls.
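A minimal stand-alone sketch of this suggestion, outside the PhpTransliteration class and assuming the default mbstring substitute character ('?'): once the string has been re-encoded, preg_split() with the u modifier succeeds again.

```php
<?php
// Input with one stray byte that is not valid UTF-8.
$string = 'Hel' . chr(0x80) . 'o World';

// Enforce valid UTF-8; the invalid byte becomes '?' (default substitute).
$clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

// The split now returns one array element per character.
$characters = preg_split('//u', $clean, 0, PREG_SPLIT_NO_EMPTY);

echo implode('', $characters), PHP_EOL; // Hel?o World
```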

Comment #11

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 22 November 2018 at 13:13

Rather than removing the preg_split() we can enforce UTF-8 by doing

$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
before the preg_split(). This replaces all unknown UTF-8 characters with '?' and the new tests will pass, with fewer function calls.

This solution works. However, we cannot use it as is, because mb_convert_encoding() always converts unknown characters to '?'.

We should be able to set $unknown_character parameter in the transliterate method.

/**
 * {@inheritdoc}
 */
public function transliterate($string, $langcode = 'en', $unknown_character = '?', $max_length = NULL) {

  (...)
  
  $word = mb_convert_encoding($word, 'UTF-8', 'UTF-8');
  // Split into Unicode characters and transliterate each one.
  foreach (preg_split('//u', $word, 0, PREG_SPLIT_NO_EMPTY) as $character) {
    $code = self::ordUTF8($character);
    if ($code == -1) {
      $to_add = $unknown_character;
    }
    else {
      $to_add = $this->replace($code, $langcode, $unknown_character);
    }

Inside the loop we cannot simply replace the character '?' with $unknown_character, because it may be an original question mark:

$str2 = $transliteration->transliterate('Hell' . $unknown . ' World ?', 'en', '*'); // expect 'Hell* World ?'
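The ambiguity can be shown with a short stand-alone sketch (plain PHP, not the Drupal class; it assumes the default mbstring substitute character). After the conversion there is no way to tell a substituted '?' from one that was already in the input, so a blanket replacement also rewrites the genuine question mark.

```php
<?php
$unknown_character = '*';
$string = 'Hell' . chr(0x80) . ' World ?';

// The invalid byte becomes '?', indistinguishable from the trailing one.
$converted = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
echo $converted, PHP_EOL; // Hell? World ?

// Naive replacement: the original question mark is rewritten as well.
$naive = str_replace('?', $unknown_character, $converted);
echo $naive, PHP_EOL; // Hell* World *
```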

Comment #12

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott commented 22 November 2018 at 14:51

@Krzysztof Domański good point. We should add test coverage of that here then.

Comment #13

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 23 November 2018 at 08:28

Issue summary:	View changes
Related issues:		+#3015684: Protect transliteration so that it does not trim whitespace

1 file was hidden/shown/deleted

File	Size
drupal-transliterating-unknown-char-3001997-5.patch	3.49 KB
8.6.x: PHP 7 & MySQL 5.5 CI aborted

Comment #14

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 26 November 2018 at 17:56

Status:

Needs work

» Needs review

File	Size
core_3001997-14.patch	5.1 KB
8.6.x: PHP 7 & MySQL 5.5 24,400 pass, 2 fail

New patch + new test testTransliterationUnknownCharacter.

This solution works. However, we cannot use it as is, because mb_convert_encoding() always converts unknown characters to '?'.

We can still convert the character encoding if we first replace the original question marks with a unique hash before calling mb_convert_encoding().
Afterwards, we replace each '?' with the replacement character and restore the original question marks.

// Because mb_convert_encoding() converts unknown characters to a question
// mark we need to distinguish the original question mark from the
// replacement.
if ($unknown_character != '?') {
  // Keep the original question marks as unique hashes.
  $hash = 'c809445b6eb4af5e0fa23c3ee7541770';
  $string = str_replace('?', $hash, $string);
}

// Because preg_split() cuts strings that contain unknown characters,
// convert character encoding. Unknown characters will be replaced by a
// question mark.
$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

// Split into Unicode characters and transliterate each one.
foreach (preg_split('//u', $string, 0, PREG_SPLIT_NO_EMPTY) as $character) {

   (...)

}

// If the original question marks were kept as a unique hash, restore them.
if ($unknown_character != '?') {
  // Replace unknown character with replacement.
  $result = str_replace('?', $unknown_character, $result);
  // Restore the original question marks.
  $result = str_replace($hash, '?', $result);
}
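The three steps above can be sketched as one self-contained function. This is not the patch itself: the function name replace_invalid_utf8() is hypothetical, the transliteration loop is left out, and the default mbstring substitute character ('?') is assumed.

```php
<?php
// Hypothetical helper illustrating the placeholder idea; the name is an
// assumption and the per-character transliteration step is omitted.
function replace_invalid_utf8(string $string, string $unknown_character = '?'): string {
  // Fixed placeholder, as in the snippet above.
  $hash = 'c809445b6eb4af5e0fa23c3ee7541770';

  if ($unknown_character !== '?') {
    // Protect the original question marks.
    $string = str_replace('?', $hash, $string);
  }

  // Invalid bytes become '?'.
  $string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

  if ($unknown_character !== '?') {
    // Every remaining '?' was produced by the conversion.
    $string = str_replace('?', $unknown_character, $string);
    // Bring back the original question marks.
    $string = str_replace($hash, '?', $string);
  }
  return $string;
}

echo replace_invalid_utf8('Hell' . chr(0x80) . ' World ?', '*'), PHP_EOL; // Hell* World ?
```

A fixed placeholder keeps the code simple, at the cost of a theoretical clash if the input already contains that exact string.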

Comment #15

26 November 2018 at 18:45

Status:

Needs review

» Needs work

The last submitted patch, 14: core_3001997-14.patch, failed testing. View results

Comment #16

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 26 November 2018 at 19:05

Status:

Needs work

» Needs review

File	Size
core_3001997-16.patch	5.11 KB
8.6.x: PHP 7 & MySQL 5.5 24,451 pass

I forgot to add the default parameters:

-  public function testTransliterationUnknownCharacter($langcode, $original, $expected, $unknown_character, $max_length) {
+  public function testTransliterationUnknownCharacter($langcode, $original, $expected, $unknown_character = '?', $max_length = NULL) {

Comment #17

scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commented 4 December 2018 at 20:31

Status:

Needs review

» Reviewed & tested by the community

Nice, thanks for moving this forward! Good call with the mb_convert_encoding instead of the loop. This looks good and works for me.

Comment #18

4 December 2018 at 21:29

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 16: core_3001997-16.patch, failed testing. View results

Comment #19

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 5 December 2018 at 16:30

Status:

Needs work

» Reviewed & tested by the community

Unrelated test build problem: https://www.drupal.org/pift-ci-job/1138297. Back to RTBC.

Comment #20

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott commented 6 December 2018 at 10:32

Status:

Reviewed & tested by the community

» Needs review

File	Size
16-20-interdiff.txt	3.53 KB
3001997-20.patch	4.99 KB
8.6.x: PHP 7 & MySQL 5.5 24,460 pass PHP 7.2 & MySQL 5.5 24,461 pass

The current approach has problems when max length is used. Also, I don't think it makes much sense when the $unknown_character is a character that should also be transliterated. So what this means is that we should do the replacement to bring back the original question marks before transliterating.

Patch attached also improves the commentary and removes the risk of a hash clash by using a hash based on the provided string. This uses PHP's hash() function directly to keep the transliteration component dependency free.
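A rough sketch of that idea follows; it is not the committed code. The function name sanitize_for_transliteration() and the choice of sha256 are assumptions made for illustration only (the comment above only says that PHP's hash() is used on the provided string), and the default mbstring substitute character ('?') is assumed. The placeholder is derived from the input and the question marks are restored before any per-character transliteration would run.

```php
<?php
// Hypothetical helper; the name and the sha256 algorithm are assumptions.
function sanitize_for_transliteration(string $string, string $unknown_character = '?'): string {
  $placeholder = NULL;

  if ($unknown_character !== '?' && strpos($string, '?') !== FALSE) {
    // A hash of the input itself is very unlikely to occur inside the input.
    $placeholder = hash('sha256', $string);
    $string = str_replace('?', $placeholder, $string);
  }

  // Make the string valid UTF-8; invalid bytes become '?'.
  $string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

  if ($unknown_character !== '?') {
    // Swap in the requested unknown character.
    $string = str_replace('?', $unknown_character, $string);
    if ($placeholder !== NULL) {
      // Restore the original question marks before transliterating.
      $string = str_replace($placeholder, '?', $string);
    }
  }
  return $string;
}

echo sanitize_for_transliteration('Hell' . chr(0x80) . ' World ?', '*'), PHP_EOL; // Hell* World ?
```

Because the string is back to its final form before the character loop runs, a maximum length can be applied to real characters rather than to a long placeholder, and the unknown character itself passes through transliteration like any other character.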

Comment #21

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 12 December 2018 at 18:12

Status:

Needs review

» Reviewed & tested by the community

3 files were hidden/shown/deleted

File	Size
drupal-transliterating-unknown-char-3001997-6.patch	2.2 KB
8.6.x: PHP 7 & MySQL 5.5 24,309 pass PHP 7.2 & MySQL 5.5 Patch Failed to Apply
core_3001997-14.patch	5.1 KB
8.6.x: PHP 7 & MySQL 5.5 24,400 pass, 2 fail
core_3001997-16.patch	5.11 KB
8.6.x: PHP 7 & MySQL 5.5 24,451 pass

I think it is ready for use.

Comment #22

12 December 2018 at 19:10

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 20: 3001997-20.patch, failed testing. View results

Comment #23

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 12 December 2018 at 19:11

Status:

Needs work

» Reviewed & tested by the community

Comment #24

Krzysztof Domański

Poland

CreditAttribution: Krzysztof Domański at abventor commented 12 December 2018 at 19:40

#22 // Build Successful
#2990645: "Build Successful" is treated as a test failure

Comment #25

18 December 2018 at 18:30

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 20: 3001997-20.patch, failed testing. View results

Comment #26

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Acro Commerce, Thunder commented 18 December 2018 at 18:35

Status:

Needs work

» Reviewed & tested by the community

More random-ness.

Comment #27

catch

he/him

English

CreditAttribution: catch at Third and Grove commented 2 January 2019 at 10:29

Status:

Reviewed & tested by the community

» Fixed

Committed and pushed f9e7921bc8 to 8.7.x and 040e6275a0 to 8.6.x. Thanks!

Comment #28

2 January 2019 at 10:29

catch committed f9e7921 on 8.7.x

Issue #3001997 by Krzysztof Domański, scott_euser, alexpott:...

Comment #29

2 January 2019 at 10:29

catch committed 040e627 on 8.6.x

Issue #3001997 by Krzysztof Domański, scott_euser, alexpott:...

Comment #30

16 January 2019 at 10:34

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
