Problem/Motivation
Following the related topic Transliteration causes 2 capital letters at the beginning of a word this bug were discovered.
If any word contains an unknown character (e.g. 0x80), it will be cropped to the first letter.
For example expected value for 'Hel' . char(0x80) . 'o World'
is "Hel?o World" but it returns "H".
Notice: after fixed Transliteration causes 2 capital letters at the beginning of a word the this will change slightly.
How to reproduce:
$transliteration = new PhpTransliteration();
$unknown = chr(0x80);
// Unknown character between two "normal" characters. Expected output "Hel?o World"
$str1 = $transliteration->transliterate('Hel' . $unknown . 'o World');
// Unknown character between one space and one "normal" character. Expected output "Hell? World"
$str2 = $transliteration->transliterate('Hell' . $unknown . ' World');
Both cases are returned "H".
Proposed resolution
In core/lib/Drupal/Component/Transliteration/PhpTransliteration.php we need improve following code:
// Split into Unicode characters and transliterate each one.
foreach (preg_split('//u', $match, 0, PREG_SPLIT_NO_EMPTY) as $character) {
Add test something like this:
['en', 'Hello ' . chr(0x80) . ' World', 'Hello ? World'],
['en', 'Hel' . chr(0x80) . 'o World', 'Hel?o World'],
['en', 'Hell' . chr(0x80) .' World', 'Hell? World'],
Remaining tasks
- Add a test to check if unknown characters are correctly replaced with a non-default replacement e.g.
$unknown_character = '*'
. - Determine whether expected result of
$string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
is intended to be '?????' instead of '?'.
User interface changes
None
API changes
None
Data model changes
None
Comment | File | Size | Author |
---|---|---|---|
#20 | 3001997-20.patch | 4.99 KB | alexpott |
#20 | 16-20-interdiff.txt | 3.53 KB | alexpott |
Comments
Comment #2
Krzysztof DomańskiThis issue should be postponed before Transliteration causes 2 capital letters at the beginning of a word will be fixed. Improving the method transliterate() in PhpTransliteration class is required before.
Comment #3
scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commentedThis attached patch does solve the issue to provide your expected outcomes but should have someone who is very confident about their UTF8 character encoding review it. Code inspiration to handle this from this comment on php.net.
Essentially the issue appears to be that
preg_split('//', $string, 0, PREG_SPLIT_NO_EMPTY);
fails to split the string for your examples.Here is where I am not confident as the patch causes a change to the expected result of an existing test as you can see.
I am not sure if the expected result of
$string = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
is intended to be'?'
instead of'?????'
denoting the 5 unknown characters. Not sure if that was an incorrectly set up test or the actual intension (ie, 5 invalid characters resulting in a single '?').I have a feeling the result of the single '?' was simply because preg_split was failing to split the string so that was the outcome and the test was trying to ensure that that remained the outcome. In reality, if a user passes a string of 5 unknown characters, getting 5 '?'s back seems more logical, particularly in more normal cases like the issue description where valid characters are mixed in.
So a lot of question marks here both literally and figuratively!
Comment #4
scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commentedTo be honest, I think it does not particularly matter whether this or the related issue goes first: whichever does I am happy to update the other patch - feel free to disagree of course!
Comment #5
scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commentedAnd the patch, sorry!
Comment #6
scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commentedFixed patch, removed test code.
Comment #8
Krzysztof DomańskiPatch #6 Failed to Apply.
Comment #9
scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commentedThanks - yeah, I need to update the patch now that this was committed:
https://www.drupal.org/project/drupal/issues/3000630
They affect more or less the same lines of code. Will try to get to this in the next days.
Comment #10
alexpottNice bug find!
Rather than removing the preg_split() we can enforce UTF-8 by doing
Before the
preg_split()
... this will replace all unknown UTF chars with the ? and the new tests will pass. Less function calls.Comment #11
Krzysztof DomańskiThis solution works. However, we can not use it, because mb_convert_encoding always converts unknown characters to '?'.
We should be able to set
$unknown_character
parameter in the transliterate method.In the middle of the loop, we can not replace the character '?' to
$unknown_character
, because this may be the original question mark:Comment #12
alexpott@Krzysztof Domański good point. We should add test coverage of that here then.
Comment #13
Krzysztof DomańskiComment #14
Krzysztof DomańskiNew patch + new test testTransliterationUnknownCharacter.
We can convert character encoding if we keep the original question marks as unique hashs before calling mb_convert_encoding().
After replace '?' with replacement and restore the original question marks.
Comment #16
Krzysztof DomańskiI forgot to add the default parameters:
Comment #17
scott_euser CreditAttribution: scott_euser as a volunteer and at Soapbox Communications Ltd commentedNice, thanks for moving this forward! Good call with the mb_convert_encoding instead of the loop. This looks good and works for me.
Comment #19
Krzysztof DomańskiUnrelated test building problem https://www.drupal.org/pift-ci-job/1138297. Back to RTBC.
Comment #20
alexpottThe current approach has problems when max length is used. Also I don't it makes much sense when the
$unknown_character
is a character that should also be transliterated. So what this means is that we should do the replacement to bring back the original question marks before transliterating.Patch attached also improves the commentary and removes the risk of a hash clash by using a hash based on the provided string. This uses PHP's
hash()
function directly to keep the transliteration component dependency free.Comment #21
Krzysztof DomańskiI think it is ready for use.
Comment #23
Krzysztof DomańskiComment #24
Krzysztof Domański#22 // Build Successful
#2990645: "Build Successful" is treated as a test failure
Comment #26
alexpottMore random-ness.
Comment #27
catchCommitted and pushed f9e7921bc8 to 8.7.x and 040e6275a0 to 8.6.x. Thanks!