Problem/Motivation

Following the related topic Transliteration causes 2 capital letters at the beginning of a word this bug were discovered.

If any word contains an unknown character (e.g. 0x80), it will be cropped to the first letter.

For example expected value for 'Hel' . char(0x80) . 'o World' is "Hel?o World" but it returns "H".

Notice: after fixed Transliteration causes 2 capital letters at the beginning of a word the this will change slightly.

How to reproduce:

$transliteration = new PhpTransliteration();
$unknown = chr(0x80);

// Unknown character between two "normal" characters. Expected output "Hel?o World"
$str1 = $transliteration->transliterate('Hel' . $unknown . 'o World');

// Unknown character between one space and one "normal" character. Expected output "Hell? World"
$str2 = $transliteration->transliterate('Hell' . $unknown . ' World');

Both cases are returned "H".

Proposed resolution

In core/lib/Drupal/Component/Transliteration/PhpTransliteration.php we need improve following code:

// Split into Unicode characters and transliterate each one.
foreach (preg_split('//u', $match, 0, PREG_SPLIT_NO_EMPTY) as $character) {

Add test something like this:

['en', 'Hello ' . chr(0x80) . ' World', 'Hello ? World'],
['en', 'Hel' . chr(0x80) . 'o World', 'Hel?o World'],
['en', 'Hell' . chr(0x80) .' World', 'Hell? World'],

Remaining tasks

  1. Add a test to check if unknown characters are correctly replaced with a non-default replacement e.g. $unknown_character = '*'.
  2. Determine whether expected result of $string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80); is intended to be '?????' instead of '?'.

User interface changes

None

API changes

None

Data model changes

None

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Krzysztof Domański created an issue. See original summary.

Krzysztof Domański’s picture

Status: Active » Postponed

This issue should be postponed before Transliteration causes 2 capital letters at the beginning of a word will be fixed. Improving the method transliterate() in PhpTransliteration class is required before.

scott_euser’s picture

Status: Postponed » Needs review

This attached patch does solve the issue to provide your expected outcomes but should have someone who is very confident about their UTF8 character encoding review it. Code inspiration to handle this from this comment on php.net.

Essentially the issue appears to be that preg_split('//', $string, 0, PREG_SPLIT_NO_EMPTY); fails to split the string for your examples.

Here is where I am not confident as the patch causes a change to the expected result of an existing test as you can see.

I am not sure if the expected result of $string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80); is intended to be '?' instead of '?????' denoting the 5 unknown characters. Not sure if that was an incorrectly set up test or the actual intension (ie, 5 invalid characters resulting in a single '?').

I have a feeling the result of the single '?' was simply because preg_split was failing to split the string so that was the outcome and the test was trying to ensure that that remained the outcome. In reality, if a user passes a string of 5 unknown characters, getting 5 '?'s back seems more logical, particularly in more normal cases like the issue description where valid characters are mixed in.

So a lot of question marks here both literally and figuratively!

scott_euser’s picture

To be honest, I think it does not particularly matter whether this or the related issue goes first: whichever does I am happy to update the other patch - feel free to disagree of course!

scott_euser’s picture

scott_euser’s picture

The last submitted patch, 5: drupal-transliterating-unknown-char-3001997-5.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Krzysztof Domański’s picture

Status: Needs review » Needs work

Patch #6 Failed to Apply.

scott_euser’s picture

Thanks - yeah, I need to update the patch now that this was committed:
https://www.drupal.org/project/drupal/issues/3000630

They affect more or less the same lines of code. Will try to get to this in the next days.

alexpott’s picture

Nice bug find!

+++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
@@ -107,8 +107,18 @@ public function removeDiacritics($string) {
-    // Split into Unicode characters and transliterate each one.
-    foreach (preg_split('//u', $string, 0, PREG_SPLIT_NO_EMPTY) as $character) {
+    $characters = [];
+
+    // Split string into array handling unknown characters.
+    $strlen = mb_strlen($string);
+    while ($strlen) {
+      $characters[] = mb_substr($string, 0, 1, 'UTF-8');
+      $string = mb_substr($string, 1, $strlen, 'UTF-8');
+      $strlen = mb_strlen($string);
+    }
+
+    // Transliterate each character.
+    foreach ($characters as $character) {

Rather than removing the preg_split() we can enforce UTF-8 by doing

$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

Before the preg_split()... this will replace all unknown UTF chars with the ? and the new tests will pass. Less function calls.

Krzysztof Domański’s picture

Rather than removing the preg_split() we can enforce UTF-8 by doing

$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
Before the preg_split()... this will replace all unknown UTF chars with the ? and the new tests will pass. Less function calls.

This solution works. However, we can not use it, because mb_convert_encoding always converts unknown characters to '?'.

We should be able to set $unknown_character parameter in the transliterate method.

/**
 * {@inheritdoc}
 */
public function transliterate($string, $langcode = 'en', $unknown_character = '?', $max_length = NULL) {

  (...)
  
  $word = mb_convert_encoding($word, 'UTF-8', 'UTF-8');
  // Split into Unicode characters and transliterate each one.
  foreach (preg_split('//u', $word, 0, PREG_SPLIT_NO_EMPTY) as $character) {
    $code = self::ordUTF8($character);
    if ($code == -1) {
      $to_add = $unknown_character;
    }
    else {
      $to_add = $this->replace($code, $langcode, $unknown_character);
    }

In the middle of the loop, we can not replace the character '?' to $unknown_character, because this may be the original question mark:

$str2 = $transliteration->transliterate('Hell' . $unknown . ' World ?', 'en', '*'); // expect 'Hell* World ?'
alexpott’s picture

@Krzysztof Domański good point. We should add test coverage of that here then.

Krzysztof Domański’s picture

Krzysztof Domański’s picture

Status: Needs work » Needs review
FileSize
5.1 KB

New patch + new test testTransliterationUnknownCharacter.

This solution works. However, we can not use it, because mb_convert_encoding always converts unknown characters to '?'.

We can convert character encoding if we keep the original question marks as unique hashs before calling mb_convert_encoding().
After replace '?' with replacement and restore the original question marks.

// Because mb_convert_encoding() converts unknown characters to a question
// mark we need to distinguish the original question mark from the
// replacement.
if ($unknown_character != '?') {
  // Keep the original question marks as unique hashs.
  $hash = 'c809445b6eb4af5e0fa23c3ee7541770';
  $string = str_replace('?', $hash, $string);
}

// Because preg_split() cuts strings that contain unknown characters,
// convert character encoding. Unknown characters will be replaced by a
// question mark.
$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

// Split into Unicode characters and transliterate each one.
foreach (preg_split('//u', $string, 0, PREG_SPLIT_NO_EMPTY) as $character) {

   (...)

}

// If we keep the original question marks as unique hash restore them.
if ($unknown_character != '?') {
  // Replace unknown character with replacement.
  $result = str_replace('?', $unknown_character, $result);
  // Restore the original question marks.
  $result = str_replace($hash, '?', $result);
}

Status: Needs review » Needs work

The last submitted patch, 14: core_3001997-14.patch, failed testing. View results

Krzysztof Domański’s picture

Status: Needs work » Needs review
FileSize
5.11 KB

I forgot to add the default parameters:

-  public function testTransliterationUnknownCharacter($langcode, $original, $expected, $unknown_character, $max_length) {
+  public function testTransliterationUnknownCharacter($langcode, $original, $expected, $unknown_character = '?', $max_length = NULL) {
scott_euser’s picture

Status: Needs review » Reviewed & tested by the community

Nice, thanks for moving this forward! Good call with the mb_convert_encoding instead of the loop. This looks good and works for me.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 16: core_3001997-16.patch, failed testing. View results

Krzysztof Domański’s picture

Status: Needs work » Reviewed & tested by the community

Unrelated test building problem https://www.drupal.org/pift-ci-job/1138297. Back to RTBC.

alexpott’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
3.53 KB
4.99 KB

The current approach has problems when max length is used. Also I don't it makes much sense when the $unknown_character is a character that should also be transliterated. So what this means is that we should do the replacement to bring back the original question marks before transliterating.

Patch attached also improves the commentary and removes the risk of a hash clash by using a hash based on the provided string. This uses PHP's hash() function directly to keep the transliteration component dependency free.

Krzysztof Domański’s picture

I think it is ready for use.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 20: 3001997-20.patch, failed testing. View results

Krzysztof Domański’s picture

Status: Needs work » Reviewed & tested by the community
Krzysztof Domański’s picture

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 20: 3001997-20.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Reviewed & tested by the community

More random-ness.

catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed and pushed f9e7921bc8 to 8.7.x and 040e6275a0 to 8.6.x. Thanks!

  • catch committed f9e7921 on 8.7.x
    Issue #3001997 by Krzysztof Domański, scott_euser, alexpott:...

  • catch committed 040e627 on 8.6.x
    Issue #3001997 by Krzysztof Domański, scott_euser, alexpott:...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.