Problem/Motivation
When one capital letter has more then one char in transliteration map it causes 2 char uppercase
'ш' = 'sh'
"Штрихкод" is becoming 'SHtrikhkod'
Proposed resolution
1. Matches any words and spaces to handle mixed case.
2. Check if the word consists of both uppercase and lowercase letters (mixed case word).
3. If this is a capitalised letter of a mixed case word, only capitalise the first letter and lowercase any subsequent letters.
Remaining tasks
Add test
| Comment | File | Size | Author |
|---|---|---|---|
| #48 | 3000630-48.patch | 4.81 KB | krzysztof domański |
Comments
Comment #2
andypostNow sure how to fix that right, looks it needs special processing to "camelize" uppercased letters
Comment #3
scott_euser commentedThis could use some feedback - I think this is the most appropriate way, but perhaps someone has a better idea:
Patch attached. Tests needed.
Comment #4
scott_euser commented... and the patch
Comment #5
scott_euser commentedIssue uploading patch, trying again with file rename.
Comment #6
scott_euser commentedComment #7
scott_euser commentedNow with unit test
Comment #8
vijaycs85Over all, looks good @scott_euser. Just one minor comment:
Great comments. Very easy to understand what's going on.
This could use a dataProvider to have multiple dataset.
Comment #9
scott_euser commentedThanks for the feedback! I have switched it to use a data provider matching the style of ::testRemoveDiacritics(). As a result, I have moved the comment explanation of the test above the method.
Comment #10
vijaycs85Looks great!
Comment #11
longwaveDo we have to consider strings that contain characters other than letters? For example $mixed_case is true for the strings "a " and "A1", but these are not strictly mixed case.
While the comments are good I am not sure this needs its own test method, I think this can just be folded into
providerTestPhpTransliteration().Comment #12
scott_euser commentedThanks for the thorough look! I have updated the patch to account for single character words and words with numbers mixed in and added that to the tests. Needed to convert to using mb_strtoupper and mb_strtolower comparison to handle the new cases.
I also combined the texts without losing the detailed comments - a similar long comment explanation of 3 and 4 byte characters already existed within the array so I followed that style.
If you have a chance to re-review would be greatly appreciated.
Comment #13
krzysztof domańskiThe code below checks if the whole string is mixed case so "Щастие ЩЩЩ" return "Schastie SchSchSch". IMO this should return "Schastie SCHSCHSCH".
We should check if single word is mixed case. The test also needs something like this:
['bg', 'Щастие ЩЩЩ', 'Schastie SCHSCHSCH'],Comment #14
scott_euser commentedMakes sense thank you for reviewing! I have added that in and added coverage for multiple words with different cases, potentially with punctuation.
Comment #15
krzysztof domańskiThe patch #14 does not respect
$max_length.Now the code above is in the inner loop so it checks the length of words.
For example
$max_lengthis equal 20. Our string has 200 letters and contains only words shorter than 20 letters. It will never be trim inside this loop. After joining the words, we will get longer than 20 letters (approximately 200 letters).Comment #16
scott_euser commentedThanks for rechecking. Apologies for questioning but have you actually tried that? Length is defined outside the foreach words loop + the spaces between words are also now added to length.
There is also test coverage for max length and it verifies that it is working.
Perhaps you can describe the steps you took to get it to incorrectly handle max length?
Comment #17
krzysztof domański@scott_euser You're right. It respects
$max_length. Apologies for incorrect comment #15.Comment #18
krzysztof domańskiI tested the patch #14 checking
$unknown_character.In some cases unexpected results are returned. When an unknown character is between two spaces, everything is correct, but when an unknown character is next to the "normal" character, the data is truncated.
For example expected value for
'Hel' . $unknown_character . 'o World'is "Hel?o World" but it returns "H World". It probably always cuts to the first letter.How to reproduce:
The test also requires something similar to the following code. I think we also need other cases.
Comment #19
krzysztof domańskiNow I noticed that I use in example the same variable like transliterate() argument.
transliterate($string, $langcode = 'en', $unknown_character = '?', $max_length = NULL)Please rename this or use chr(0x80).
Comment #20
krzysztof domańskiAfter I reset the code, I ran the test which the following cases.
It ended with a failure so there was such a problem before. IMO Patch #14 looks good.
Now I'm not sure. Do we have to create a separate issue (Unknown unicode in transliteration not work correctly) or continue in this?
Comment #21
scott_euser commentedHi Krzysztof Domański,
Thank you for further reviewing and confirming about max length. Yes, lets keep the scope to the single character resulting in 2 characters. If there are other issues with transliteration it should be a new issue since this code isn't causing those issues.
So if you are then happy with it, can you set it back to RTBC?
Thanks!
Scott
Comment #22
krzysztof domańskiPatch #14 looks good!
I added a new issue to a separate problem Transliteration a string containing an unknown character (e.g. 0x80) is not valid
Comment #23
scott_euser commentedSounds good thanks! I'll see if I can take a look at it and see what's going wrong.
Comment #24
imyaro commentedPatch works good, but why did you use word "Щастие"? It confuses a little because it supposed to be a Russian word, but written with the mistake (Like "Happinez" instead of "Happiness"). Probably it will be better to change it?
We can use something like "Шина" (Shina) / "Шоссе" (Shosse) instead. And as the last phrase "Шла Саша по ШОССЕ" (Shla Sasha po SHOSSE).
Comment #25
scott_euser commentedHi zvse,
Thanks for the feedback! Actually its the correct spelling in Bulgarian for happiness (and 'bg' is indicated in the test as the original language). Hope that helps clear things up!
Scott
Comment #27
scott_euser commentedSeems re-run of test was triggered and the test failed to complete, stopped partway through. Re-running again now and re-setting status back to where it was.
Comment #28
scott_euser commentedComment #29
catchCommitted and pushed bb7fb6a3dd to 8.7.x and 35c3d18ae0 to 8.6.x. Thanks!
Comment #32
scott_euser commentedThanks!
Comment #33
tacituseu commentedThis might have introduced intermittent test failures (trailing spaces?):
https://www.drupal.org/pift-ci-job/1126573
https://www.drupal.org/pift-ci-job/1127403
Comment #34
krzysztof domańskiI added a new issue #3015684: Protect transliteration so that it does not trim whitespace.
Comment #35
krzysztof domański-- edited --
Comment #36
krzysztof domański-- edited --
Comment #39
alexpottThis change definitely should not cause random test fails - #3015802: Random fail in \Drupal\Tests\Core\Transliteration\PhpTransliterationTest and also I think transliteration should not affecting spacing.
Comment #40
krzysztof domańskiThanks for revert! New patch.
Comment #41
krzysztof domańskiUnnecessary code
Less:
Comment #42
krzysztof domańskiNeeds reroll after fix #3015992: Not affecting spacing in PhpTransliterationTest.
Comment #43
krzysztof domański1. Matches any words and spaces to handle mixed case and check if the word consists of both uppercase and lowercase letters.
Comment #44
apolitsin commentedDo we need add some Name.Lastname mask for nice transliteration for email ?
```diff
+ ['ru', 'Борис.Шпак', 'Boris.Shpak'],
```
Comment #45
krzysztof domańskiComment #48
krzysztof domańskiRe-rolled
Comment #50
krzysztof domańskiRetested. https://www.drupal.org/pift-ci-job/1868691 It was a random test failure #2825845: DST-related test failures in FilterDateTimeTest.
Comment #51
andypostBack to rtbc, timezones may change so separate issue
Comment #52
alexpottThe first check can be removed if the second one is
break 2;Is this worth it? I don't think it is. Doing a regular expresssion is probably more expensive that doing...
Which is what is effectively happening in HEAD.
I'm not convinced that turning this into a method call is correct. We can make this a method call if we need the functionality somewhere else atm we don't
Comment #53
alexpottAlso capitalisation rules are harder... German proper nouns for example... what is supposed to happen for
шШшthe code in #48 gives usshShshwhereas HEAD would give usshSHshwhich I think might be closer to the intent. Not sure.Comment #54
alexpottAlso re #53 the issue title claims this is about the beginning of words but as #53 shows the current code is also affecting the middle of words too.