Problem/Motivation

When one capital letter has more then one char in transliteration map it causes 2 char uppercase

'ш' = 'sh'
"Штрихкод" is becoming 'SHtrikhkod'

Proposed resolution

1. Matches any words and spaces to handle mixed case.
2. Check if the word consists of both uppercase and lowercase letters (mixed case word).
3. If this is a capitalised letter of a mixed case word, only capitalise the first letter and lowercase any subsequent letters.

Remaining tasks

Add test

Comments

APolitsin created an issue. See original summary.

andypost’s picture

Title: 2 capital letters at the beginning of a word » Transliteration causes 2 capital letters at the beginning of a word
Issue summary: View changes
Issue tags: -russian +Needs tests

Now sure how to fix that right, looks it needs special processing to "camelize" uppercased letters

scott_euser’s picture

Status: Active » Needs review

This could use some feedback - I think this is the most appropriate way, but perhaps someone has a better idea:

  • Check that the string is mixed case, otherwise leave as is (eg, all caps remain all caps, lowercase remain all lowercase)
  • If so, if the transliteration of a single character results in multiple characters AND the original character was uppercase, ucfirst it

Patch attached. Tests needed.

scott_euser’s picture

... and the patch

scott_euser’s picture

Issue uploading patch, trying again with file rename.

scott_euser’s picture

Issue tags: +DistributedSprintUK18
scott_euser’s picture

Now with unit test

vijaycs85’s picture

Over all, looks good @scott_euser. Just one minor comment:

  1. +++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
    @@ -107,6 +107,10 @@ public function removeDiacritics($string) {
    +    // String is mixed case if not all uppercase and not all lowercase.
    
    @@ -126,6 +130,12 @@ public function transliterate($string, $langcode = 'en', $unknown_character = '?
    +      // If this is a capitalised letter of a mixed case word, only capitalise
    +      // the first letter and lowercase any subsequent letters.
    
    +++ b/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php
    @@ -185,4 +185,19 @@ public function testSafeInclude() {
    +    // Test with a mixed case word where a single character results in mutliple
    +    // and the single character was originally capitalised. The result of the
    +    // below should be 'Shtrikhkod' not 'SHtrikhkod'.
    

    Great comments. Very easy to understand what's going on.

  2. +++ b/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php
    @@ -185,4 +185,19 @@ public function testSafeInclude() {
    +    $input = 'Штрихкод';
    +    $expected_output = 'Shtrikhkod';
    

    This could use a dataProvider to have multiple dataset.

scott_euser’s picture

Thanks for the feedback! I have switched it to use a data provider matching the style of ::testRemoveDiacritics(). As a result, I have moved the comment explanation of the test above the method.

vijaycs85’s picture

Status: Needs review » Reviewed & tested by the community

Looks great!

longwave’s picture

Status: Reviewed & tested by the community » Needs review
  1. +++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
    @@ -107,6 +107,10 @@ public function removeDiacritics($string) {
    +    $mixed_case = (!ctype_lower($string) && !ctype_upper($string));
    

    Do we have to consider strings that contain characters other than letters? For example $mixed_case is true for the strings "a " and "A1", but these are not strictly mixed case.

  2. +++ b/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php
    @@ -185,4 +185,40 @@ public function testSafeInclude() {
    +  public function testTransliterationMixedCase($original, $expected) {
    

    While the comments are good I am not sure this needs its own test method, I think this can just be folded into providerTestPhpTransliteration().

scott_euser’s picture

Thanks for the thorough look! I have updated the patch to account for single character words and words with numbers mixed in and added that to the tests. Needed to convert to using mb_strtoupper and mb_strtolower comparison to handle the new cases.

I also combined the texts without losing the detailed comments - a similar long comment explanation of 3 and 4 byte characters already existed within the array so I followed that style.

If you have a chance to re-review would be greatly appreciated.

krzysztof domański’s picture

Status: Needs review » Needs work

The code below checks if the whole string is mixed case so "Щастие ЩЩЩ" return "Schastie SchSchSch". IMO this should return "Schastie SCHSCHSCH".

public function transliterate($string, $langcode = 'en', $unknown_character = '?', $max_length = NULL) {
  $result = '';
  $length = 0;

  // String is mixed case if it consists of both uppercase and lowercase
  // letters. To accurately check this, remove any numbers and check that
  // remaining characters are not all uppercase and not all lowercase.
  $alpha_string = preg_replace('/\\d/', '', $string);
  $mixed_case = (strlen($alpha_string) > 1 && mb_strtolower($alpha_string) !== $alpha_string && mb_strtoupper($alpha_string) !== $alpha_string);

  // Split into Unicode characters and transliterate each one.
  foreach (preg_split('//u', $string, 0, PREG_SPLIT_NO_EMPTY) as $character) {
    $code = self::ordUTF8($character);

We should check if single word is mixed case. The test also needs something like this:
['bg', 'Щастие ЩЩЩ', 'Schastie SCHSCHSCH'],

scott_euser’s picture

Status: Needs work » Needs review
StatusFileSize
new4.23 KB
new4.33 KB

Makes sense thank you for reviewing! I have added that in and added coverage for multiple words with different cases, potentially with punctuation.

krzysztof domański’s picture

Status: Needs review » Needs work

The patch #14 does not respect $max_length.

foreach (preg_split('//u', $word, 0, PREG_SPLIT_NO_EMPTY) as $character) {

  // (...)

  // Check if this exceeds the maximum allowed length.
  if (isset($max_length)) {
    $length += strlen($to_add);
    if ($length > $max_length) {
    // There is no more space.
    $results = array_filter($results);
      return implode(' ', $results);
    }
  }

Now the code above is in the inner loop so it checks the length of words.

For example $max_length is equal 20. Our string has 200 letters and contains only words shorter than 20 letters. It will never be trim inside this loop. After joining the words, we will get longer than 20 letters (approximately 200 letters).

scott_euser’s picture

Status: Needs work » Needs review

Thanks for rechecking. Apologies for questioning but have you actually tried that? Length is defined outside the foreach words loop + the spaces between words are also now added to length.

There is also test coverage for max length and it verifies that it is working.

Perhaps you can describe the steps you took to get it to incorrectly handle max length?

krzysztof domański’s picture

@scott_euser You're right. It respects $max_length. Apologies for incorrect comment #15.

krzysztof domański’s picture

Status: Needs review » Needs work

I tested the patch #14 checking $unknown_character.

In some cases unexpected results are returned. When an unknown character is between two spaces, everything is correct, but when an unknown character is next to the "normal" character, the data is truncated.

For example expected value for 'Hel' . $unknown_character . 'o World' is "Hel?o World" but it returns "H World". It probably always cuts to the first letter.

How to reproduce:

$transliteration = new PhpTransliteration();
$unknown_character = chr(0x80);

// Without any unknow character
// "Hello World" - expected
// "Hello World" - return
$hello_1 = 'Hello World';
$test_1 = $transliteration->transliterate($hello_1);

// Unknown character between two spaces
// "Hello ? World" - expected
// "Hello ? World" - return
$hello_2 = 'Hello ' . $unknown_character . ' World';
$test_2 = $transliteration->transliterate($hello_2);

// Unknown character between two "normal" characters
// "Hel?o World" - expected
// "H World" - return
$hello_3 = 'Hel' . $unknown_character . 'o World';
$test_3 = $transliteration->transliterate($hello_3);

// Unknown character between one space and one "normal" character
// "Hell? World" - expected
// "H World" - return
$hello_4 = 'Hell' . $unknown_character . ' World';
$test_4 = $transliteration->transliterate($hello_4);

The test also requires something similar to the following code. I think we also need other cases.

['en', 'Hello ' . $unknown_character . ' World', 'Hello ? World'],
['en', 'Hel' . $unknown_character . 'o World', 'Hel?o World'],
['en', 'Hell' . $unknown_character .' World', 'Hell? World'],
krzysztof domański’s picture

Now I noticed that I use in example the same variable like transliterate() argument.
transliterate($string, $langcode = 'en', $unknown_character = '?', $max_length = NULL)

Please rename this or use chr(0x80).

['en', 'Hello ' . chr(0x80) . ' World', 'Hello ? World'],
['en', 'Hel' . chr(0x80) . 'o World', 'Hel?o World'],
['en', 'Hell' . chr(0x80) .' World', 'Hell? World'],
krzysztof domański’s picture

Status: Needs work » Needs review
StatusFileSize
new1016 bytes

After I reset the code, I ran the test which the following cases.

// Illegal/unknown unicode.
['en', chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), '?'],
['en', 'Hello ' . chr(0x80) . ' World', 'Hello ? World'],
['en', 'Hel' . chr(0x80) . 'o World', 'Hel?o World'],
['en', 'Hell' . chr(0x80) .' World', 'Hell? World'],

It ended with a failure so there was such a problem before. IMO Patch #14 looks good.

3) Drupal\Tests\Component\Transliteration\PhpTransliterationTest::testPhpTransliteration with data set #15 ('en', 'Hell▒ World', 'Hell? World')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'Hell? World'
+'H'

Now I'm not sure. Do we have to create a separate issue (Unknown unicode in transliteration not work correctly) or continue in this?

scott_euser’s picture

Hi Krzysztof Domański,

Thank you for further reviewing and confirming about max length. Yes, lets keep the scope to the single character resulting in 2 characters. If there are other issues with transliteration it should be a new issue since this code isn't causing those issues.

So if you are then happy with it, can you set it back to RTBC?

Thanks!
Scott

krzysztof domański’s picture

Status: Needs review » Reviewed & tested by the community
Related issues: +#3001997: Transliteration a string containing an unknown character (e.g. 0x80) is not valid

Patch #14 looks good!

I added a new issue to a separate problem Transliteration a string containing an unknown character (e.g. 0x80) is not valid

scott_euser’s picture

Sounds good thanks! I'll see if I can take a look at it and see what's going wrong.

imyaro’s picture

Patch works good, but why did you use word "Щастие"? It confuses a little because it supposed to be a Russian word, but written with the mistake (Like "Happinez" instead of "Happiness"). Probably it will be better to change it?

We can use something like "Шина" (Shina) / "Шоссе" (Shosse) instead. And as the last phrase "Шла Саша по ШОССЕ" (Shla Sasha po SHOSSE).

scott_euser’s picture

Hi zvse,

Thanks for the feedback! Actually its the correct spelling in Bulgarian for happiness (and 'bg' is indicated in the test as the original language). Hope that helps clear things up!

Scott

Status: Reviewed & tested by the community » Needs work
scott_euser’s picture

Status: Needs work » Reviewed & tested by the community

Seems re-run of test was triggered and the test failed to complete, stopped partway through. Re-running again now and re-setting status back to where it was.

scott_euser’s picture

catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed and pushed bb7fb6a3dd to 8.7.x and 35c3d18ae0 to 8.6.x. Thanks!

  • catch committed bb7fb6a on 8.7.x
    Issue #3000630 by scott_euser, Krzysztof Domański, APolitsin, vijaycs85...

  • catch committed 35c3d18 on 8.6.x
    Issue #3000630 by scott_euser, Krzysztof Domański, APolitsin, vijaycs85...
scott_euser’s picture

Thanks!

tacituseu’s picture

This might have introduced intermittent test failures (trailing spaces?):
https://www.drupal.org/pift-ci-job/1126573

1) Drupal\Tests\Core\Transliteration\PhpTransliterationTest::testPhpTransliterationWithAlter with data set #1 ('zz', '@Sqnz3\)} ', '@Sqnz3\)} ')
'@Sqnz3\)} ' transliteration to '@Sqnz3\)}' is identical to '@Sqnz3\)} ' for language 'zz' in service instance.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-@Sqnz3\)} 
+@Sqnz3\)}

/var/www/html/core/tests/Drupal/Tests/Core/Transliteration/PhpTransliterationTest.php:56

https://www.drupal.org/pift-ci-job/1127403

1) Drupal\Tests\Component\Transliteration\PhpTransliterationTest::testPhpTransliteration with data set #0 ('en', 'XM'?Gj|P' ', 'XM'?Gj|P' ')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'XM'?Gj|P' '
+'XM'?Gj|P''

/var/www/html/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php:93
krzysztof domański’s picture

krzysztof domański’s picture

-- edited --

krzysztof domański’s picture

-- edited --

  • alexpott committed ce031c3 on 8.7.x
    Revert "Issue #3000630 by scott_euser, Krzysztof Domański, APolitsin,...

  • alexpott committed e8e94d4 on 8.6.x
    Revert "Issue #3000630 by scott_euser, Krzysztof Domański, APolitsin,...
alexpott’s picture

Status: Fixed » Needs work

This change definitely should not cause random test fails - #3015802: Random fail in \Drupal\Tests\Core\Transliteration\PhpTransliterationTest and also I think transliteration should not affecting spacing.

krzysztof domański’s picture

Status: Needs work » Needs review
StatusFileSize
new7.33 KB
new7.66 KB

Thanks for revert! New patch.

krzysztof domański’s picture

Unnecessary code

// String is mixed case if it consists of both uppercase and lowercase
// letters. To accurately check this, remove any numbers and check that
// remaining characters are not all uppercase and not all lowercase.
$alpha_string = preg_replace('/\\d/', '', $value);
$mixed_case = (strlen($alpha_string) > 1 && mb_strtolower($alpha_string) !== $alpha_string && mb_strtoupper($alpha_string) !== $alpha_string);

Less:

// String is mixed case if it consists of both uppercase and lowercase
// letters.
$mixed_case = (strlen($value) > 1 && mb_strtolower($value) !== $value && mb_strtoupper($value) !== $value);
krzysztof domański’s picture

krzysztof domański’s picture

1. Matches any words and spaces to handle mixed case and check if the word consists of both uppercase and lowercase letters.

// Matches any words and spaces to handle mixed case per word and keep
// multiple spaces.
preg_match_all("/[\S]+|[\s]+/", $string, $matches, PREG_PATTERN_ORDER);

foreach ($matches[0] as $str) {
  if (isset($max_length) && strlen($result) >= $max_length) {
    break;
  }
  // String is mixed case if it consists of both uppercase and lowercase
  // letters.
  $mixed_case = (strlen($str) > 1 && mb_strtolower($str) !== $str && mb_strtoupper($str) !== $str);

  // Split into Unicode characters and transliterate each one.
  foreach (preg_split('//u', $str, 0, PREG_SPLIT_NO_EMPTY) as $unicode_character) {
    if (preg_match('/\s/', $unicode_character)) {
      $to_add = $unicode_character;
    }
    else {
      $to_add = $this->transliterateSingleCharacter($unicode_character, $langcode, $unknown_character);
      // If this is a capitalised letter of a mixed case word, only
      // capitalise the first letter and lowercase any subsequent letters.
      // For example Шоссе should be transliterated into Shosse not SHosse.
      if ($mixed_case && strlen($to_add) > 1 && mb_strtoupper($to_add) === $to_add) {
        $to_add = ucfirst(strtolower($to_add));
      }
    }

    // Check if this exceeds the maximum allowed length.
    $length += strlen($to_add);
    if (isset($max_length) && $length > $max_length) {
      break;
    }

    $result .= $to_add;
  }
}
apolitsin’s picture

Do we need add some Name.Lastname mask for nice transliteration for email ?
```diff
+ ['ru', 'Борис.Шпак', 'Boris.Shpak'],
```

krzysztof domański’s picture

Version: 8.6.x-dev » 8.7.x-dev
StatusFileSize
new888 bytes
new4.81 KB

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.9 was released on November 6 and is the final full bugfix release for the Drupal 8.7.x series. Drupal 8.7.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.8.0 on December 4, 2019. (Drupal 8.8.0-beta1 is available for testing.)

Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

krzysztof domański’s picture

Version: 8.9.x-dev » 9.1.x-dev
StatusFileSize
new4.81 KB

Re-rolled

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

krzysztof domański’s picture

andypost’s picture

Status: Needs review » Reviewed & tested by the community

Back to rtbc, timezones may change so separate issue

alexpott’s picture

Status: Reviewed & tested by the community » Needs work
  1. +++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
    @@ -130,31 +130,71 @@ public function transliterate($string, $langcode = 'en', $unknown_character = '?
    +      if (isset($max_length) && strlen($result) >= $max_length) {
    +        break;
           }
    ...
    +        if (isset($max_length) && $length > $max_length) {
    +          break;
             }
    
    +++ b/core/tests/Drupal/Tests/Component/Transliteration/PhpTransliterationTest.php
    @@ -174,6 +174,16 @@ public function providerTestPhpTransliteration() {
    +      ['bg', 'Щастие', 'Schastie'],
    

    The first check can be removed if the second one is break 2;

  2. +++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
    @@ -130,31 +130,71 @@ public function transliterate($string, $langcode = 'en', $unknown_character = '?
    +        if (preg_match('/\s/', $unicode_character)) {
    +          $to_add = $unicode_character;
    +        }
    +        else {
    

    Is this worth it? I don't think it is. Doing a regular expresssion is probably more expensive that doing...

    $code = self::ordUTF8(' ');
    if ($code < 0x80) {
      // Already lower ASCII.
      return chr($code);
    }
    

    Which is what is effectively happening in HEAD.

  3. +++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
    @@ -130,31 +130,71 @@ public function transliterate($string, $langcode = 'en', $unknown_character = '?
    +          $to_add = $this->transliterateSingleCharacter($unicode_character, $langcode, $unknown_character);
    ...
    +  /**
    +   * Transliterates single character from Unicode to US-ASCII.
    +   *
    +   * @param string $character
    +   *   A single character.
    +   * @param string $langcode
    +   *   The language code of the language the character is in.
    +   * @param string $unknown_character
    +   *   The character to substitute for characters without transliterated
    +   *   equivalents.
    +   *
    +   * @return string
    +   *   Non-US-ASCII character transliterated to US-ASCII character, and unknown
    +   *   character replaced with $unknown_character.
    +   */
    +  protected function transliterateSingleCharacter($character, $langcode, $unknown_character) {
    

    I'm not convinced that turning this into a method call is correct. We can make this a method call if we need the functionality somewhere else atm we don't

alexpott’s picture

Also capitalisation rules are harder... German proper nouns for example... what is supposed to happen for шШш the code in #48 gives us shShsh whereas HEAD would give us shSHsh which I think might be closer to the intent. Not sure.

alexpott’s picture

Also re #53 the issue title claims this is about the beginning of words but as #53 shows the current code is also affecting the middle of words too.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.