Problem/Motivation
Ukrainian letters are not transliterated properly. In most cases, they're using russian transliteration rules, which is not correct.
For example, "И" is transliterated as I (must be Y).
Steps to reproduce
Transliterate Ukrainian pangram via \Drupal::transliteration()->transliterate('text is here', 'uk');
Source:
На подушечці форми любої є й ґудзик щоб пірʼя геть жовте сховати.
Current result:
Na podushechci formi lyuboi e y gudzik schob pir'ya get zhovte skhovati.
See failed test in #29
Expected result:
Na podushechtsi formy lyuboyi ye y gudzyk shchob pirya het zhovte skhovaty.
Proposed resolution
Add Ukrainian transliteration data into core/lib/Drupal/Component/Transliteration/uk.php
according to Ukrainian law: https://zakon.rada.gov.ua/laws/show/55-2010-%D0%BF#Text
Character | Hex code | Current | Expected |
Г | 0x413 |
G | H |
г | 0x433 |
g | h |
Є | 0x404 |
E | Ye |
є | 0x454 |
e | ye |
И | 0x418 |
I | Y |
и | 0x438 |
i | y |
Ї | 0x407 |
I | Yi |
ї | 0x457 |
i | yi |
Ц | 0x426 |
C | Ts |
ц | 0x446 |
c | ts |
Щ | 0x429 |
SCH | Shch |
щ | 0x449 |
sch | shch |
ʼ | 0x2bc |
' | |
’ | 0x2019 |
' | |
' | 0x27 |
' |
Note: Preferably, some characters should be transliterated differently based on their position. However, we're trying to simplify things and use the most convenient version for both cases. That's why we have some differences between the official law version and our implementation.
Remaining tasks
Review, commit.
Comment | File | Size | Author |
---|---|---|---|
#45 | 3169212-8.9.x.patch | 1.93 KB | alexpott |
Issue fork drupal-3169212
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
- 3169212-improve-ukrainian-transliteration changes, plain diff MR !321
Comments
Comment #2
biblos CreditAttribution: biblos as a volunteer commentedComment #3
biblos CreditAttribution: biblos as a volunteer commentedComment #4
Krzysztof DomańskiInitial patch.
Comment #6
int_ua CreditAttribution: int_ua as a volunteer commentedAdded more of the official transliteration rules from https://zakon.rada.gov.ua/laws/show/55-2010-п#Text
Comment #7
int_ua CreditAttribution: int_ua as a volunteer commentedFixed incorrect file length in the patch file
Comment #8
int_ua CreditAttribution: int_ua as a volunteer commentedTested by applying the patch to a `8.9.13` installation and changing an entity title with Pathauto active. Not sure if I should set RTBC myself, revert it if I wasn't supposed to.
Comment #9
Krzysztof DomańskiNeeds work: coding standard. See https://www.drupal.org/pift-ci-job/1972120
See Issue Status field
Comment #10
anmolgoyal74 CreditAttribution: anmolgoyal74 at OpenSense Labs commentedComment #11
MatroskeenI was surprised to see this kind of issue and really looking forward to seeing this fixed.
Thanks for all the work so far!
Considering we have a bunch of characters, I think we should expand the test.
Also, we have some cases when characters are transliterated differently based on the character position (either beginning or any other position of the word). It's kind of tricky because it's not only about the uppercase/lowercase.
For example:
Їжакевич -> Yizhakevych
їжак -> yizhak
in those cases
ї
becomesyi
Мар'їне -> Marine
Київ -> Kyiv
in such case
ї
becomesi
Is there anything we can do about it?
Comment #12
int_ua CreditAttribution: int_ua as a volunteer commentedI didn't find any support for such behavior but I didn't spend much time looking for it. Anyway, now that I look at it again it seems that beginning-of-the-word version should take precedence and may look acceptable in all situations. Currently I've used it only for uppercase and it needs to be changed I guess. I'll take a look again tomorrow.
Comment #13
int_ua CreditAttribution: int_ua as a volunteer commentedAdded pangram as a test, changed all cases to prefer beginning-of-the-word versions.
Comment #14
int_ua CreditAttribution: int_ua as a volunteer commentedFixed apostrophe symbol
Comment #16
int_ua CreditAttribution: int_ua as a volunteer commented0x2bc => NULL,
does not work for some reason, does anyone know why?Comment #18
MatroskeenI updated the issue summary to reflect what we're doing here and to have better visibility.
I didn't include characters if it already returns the expected value. Feel free to add characters that should be ignored.
By the way, I was using this service to get hexadecimal code: http://www.mauvecloud.net/charsets/CharCodeFinder.html
So the next would be:
1) Make sure the list matches
core/lib/Drupal/Component/Transliteration/uk.php
;2) Fix failed test;
3) Upload test-only patch to verify that we're fixing the problem described in the issue summary (it should fail) + re-upload the patch with test + fix itself (it should pass).
We're getting close 🤓
Comment #19
Krzysztof Domański0x427
0x428
0x429
Uppercase / lowercase should be fixed in a separate issue.
See #3000630: Transliteration causes 2 capital letters at the beginning of a word.
Comment #20
MatroskeenFair enough. Removed such characters from the list.
Comment #21
Matroskeenre #16:
The test fail, because the
NULL
value doesn't pass thisisset
condition in :I was wondering why for a second because the array key exists. But let's take a look into
isset
definition:I think changing
0x2bc => NULL
to0x2bc => ''
should do the trick.Comment #22
int_ua CreditAttribution: int_ua as a volunteer commentedChanged
NULL
s to empty stringsComment #23
int_ua CreditAttribution: int_ua as a volunteer commentedOnly the test
Comment #26
int_ua CreditAttribution: int_ua as a volunteer commentedGuess I have to find out how to rerun tests now
Comment #27
MatroskeenYep, it seems like random test failure, perhaps related to this one #3192260: [random test failure] Random fail in media_library CKEditorIntegrationTest.
Retest can be triggered by following "Add test / retest" link under the patch and selecting "Retest PHP ..."
I just reviewed the patch and I think we can clear-up the list a little bit. Some of the specified characters already provide the expected value, so there is no need to specify this in our overrides.
I also noticed a few inconsistencies:
1) є (
0x454
) should beie
instead ofye
;2) я (
0x44f
) should beia
instead ofya
;According to the table in IS, I think we should have the following array:
We also need to specify in the IS table what characters are behind these codes:
0x42c, 0x44c, 0x2bc, 0x27, 0x2019
I'm sorry to circle it back, but I'd like to maintain the ability to RTBC 😋
Comment #28
int_ua CreditAttribution: int_ua as a volunteer commentedAs I've said before, IMHO defaulting to beginning-of-the-word version before a solution that differentiates them is implemented looks nicer. And I've added some extra symbols that were already matching just to make sure they aren't changed in the future.
They were specified, not sure when they were lost. Anyway, I'm satisfied with the patch for my project for now, please continue without me.
Comment #29
MatroskeenI've reviewed this part again, and I think this is a far point. A new version is a kind of compromise but it's much better than it was before.
Therefore, looks good to me!
It's not necessary, because we already have a test that covers all Ukrainian letters. If someone will break this, we'll see :)
Added back to the table.
I'm also re-uploading patch #23 with some clean-up and minor tweaks.
Comment #30
MatroskeenComment #33
MatroskeenOpened a merge request to simplify the review.
Comment #34
MatroskeenComment #35
init90Thanks for the great work!
The patch looks good and makes transliteration work according to official Ukraine rules ( https://zakon.rada.gov.ua/laws/show/55-2010-%D0%BF#Text).
For test was used next command:
Result without patch:
Result with patch:
The issue has a great summary with the charset comparison table which shows that with the patch we get proper Ukrainian transliteration.
Comment #36
MatroskeenThanks for the review!
We should probably keep the thread unresolved to see what other people think about it (especially core committers).
I slightly disagree about adding extra comments, but I don't mind adding if people think it's helpful 😋
Comment #37
MatroskeenComment #38
MatroskeenUpdated "Steps to reproduce" section.
Comment #39
podarok+1 to RTBC
Comment #40
xjmThanks everyone for your work on this issue, and apologies for the delay in a committer review!
I read over the character substitutions in the issue summary (disclaimer: I speak Russian, but only knew some basics about the differences with the Ukrainian alphabet). Everything looks correct and it's obvious that Russian pronunciations were being used as the basis for transliteration instead of the correct Ukranian ones. The incorrect transliterations of the hushes OTOH were also fixed for Russian in a previous issue, so it makes sense that we need to fix them here as well. (I wonder if there are other Cyrillic transliterations that have the same problem elsewhere?)
Finally, I reviewed the linked official resource from the Ukrainian Ministers' Cabinet and confirmed that (with the below exception) these are the legally correct transliterations.
I guess the only controversial part of this is:
Reviewing the above resource, I see that this is specifically for the soft vowels, which are correctly transliterated with a "Y" at the beginning when in a word-initial position (an actual separate phoneme) versus when they're in a middle or final syllable and should be prefixed with "i" instead.
To me that does sound like something we should try to fix if possible, since using "y" instead inside a word will look like a spelling error, as if the original Ukrainian word had been spelled with a consonant cluster having и or (especially weird) й before the main vowel.
However, I agree that this would be complicated to implement, and I wouldn't hold that up on fixing all the other, more egregious spelling errors. Therefore, I think we should file a followup issue to address that instead.
Finally, I do think an inline comment about the apostrophes should be added (as I commented on the MR).
Marking "Needs work" for:
I've asked other committers to keep an eye out for this issue being RTBC again so we don't have to wait another month while those changes are made.
Thanks everyone!
Comment #41
Matroskeen@xjm thanks so much for the review!
Given it's just a comment change, moving back to RTBC.
Comment #42
alexpottCredited @init90 and @xjm for reviews and @biblos for opening and documenting the issue.
Committed and pushed 9c2fa5f9e5 to 9.2.x and 2cdfb750d1 to 9.1.x. Thanks!
Kicking off a run against Drupal 8.9.x as this is a major bug and therefore eligible for backport.
Comment #45
alexpottThis is a place where patches > MR - can't get a run against 8.9.x without a patch so here's a patch that's the same as https://git.drupalcode.org/project/drupal/-/merge_requests/321.diff
Comment #47
xjmThanks @Matroskeen for the quick reply; clearly I did not read the footnotes. :) But I do think the inline comment helps regardless since there's multiple corrections for the one point about soft signs and apostrophes.
Glad to see this committed already. I went ahead and committed the backport since it passed.
Comment #48
xjmComment #49
MatroskeenThis is awesome! Thank you all so much! 💙💛