Problem/Motivation

Ukrainian letters are not transliterated properly. In most cases, they're using russian transliteration rules, which is not correct.
For example, "И" is transliterated as I (must be Y).

Steps to reproduce

Transliterate Ukrainian pangram via \Drupal::transliteration()->transliterate('text is here', 'uk');

Source:
На подушечці форми любої є й ґудзик щоб пірʼя геть жовте сховати.

Current result:
Na podushechci formi lyuboi e y gudzik schob pir'ya get zhovte skhovati.
See failed test in #29

Expected result:
Na podushechtsi formy lyuboyi ye y gudzyk shchob pirya het zhovte skhovaty.

Proposed resolution

Add Ukrainian transliteration data into core/lib/Drupal/Component/Transliteration/uk.php according to Ukrainian law: https://zakon.rada.gov.ua/laws/show/55-2010-%D0%BF#Text

Character Hex code Current Expected
Г 0x413 G H
г 0x433 g h
Є 0x404 E Ye
є 0x454 e ye
И 0x418 I Y
и 0x438 i y
Ї 0x407 I Yi
ї 0x457 i yi
Ц 0x426 C Ts
ц 0x446 c ts
Щ 0x429 SCH Shch
щ 0x449 sch shch
ʼ 0x2bc '
0x2019 '
' 0x27 '

Note: Preferably, some characters should be transliterated differently based on their position. However, we're trying to simplify things and use the most convenient version for both cases. That's why we have some differences between the official law version and our implementation.

Remaining tasks

Review, commit.

Issue fork drupal-3169212

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

biblos created an issue. See original summary.

biblos’s picture

Issue summary: View changes
biblos’s picture

Title: Not correct transliteration of Ukrainian letters » Incorrect transliteration of Ukrainian letters
Krzysztof Domański’s picture

Version: 9.0.x-dev » 9.1.x-dev
Status: Active » Needs work
FileSize
1.29 KB

Initial patch.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

int_ua’s picture

Status: Needs work » Needs review
FileSize
1.86 KB

Added more of the official transliteration rules from https://zakon.rada.gov.ua/laws/show/55-2010-п#Text

int_ua’s picture

Fixed incorrect file length in the patch file

int_ua’s picture

Status: Needs review » Reviewed & tested by the community

Tested by applying the patch to a `8.9.13` installation and changing an entity title with Pathauto active. Not sure if I should set RTBC myself, revert it if I wasn't supposed to.

Krzysztof Domański’s picture

Status: Reviewed & tested by the community » Needs work

Needs work: coding standard. See https://www.drupal.org/pift-ci-job/1972120

One should not set one's own patch to this status.

See Issue Status field

anmolgoyal74’s picture

Status: Needs work » Needs review
FileSize
1.83 KB
750 bytes
Matroskeen’s picture

Status: Needs review » Needs work

I was surprised to see this kind of issue and really looking forward to seeing this fixed.
Thanks for all the work so far!

Considering we have a bunch of characters, I think we should expand the test.

Also, we have some cases when characters are transliterated differently based on the character position (either beginning or any other position of the word). It's kind of tricky because it's not only about the uppercase/lowercase.

For example:
Їжакевич -> Yizhakevych
їжак -> yizhak
in those cases ї becomes yi

Мар'їне -> Marine
Київ -> Kyiv
in such case ї becomes i

Is there anything we can do about it?

int_ua’s picture

Also, we have some cases when characters are transliterated differently based on the character position ... Is there anything we can do about it?

I didn't find any support for such behavior but I didn't spend much time looking for it. Anyway, now that I look at it again it seems that beginning-of-the-word version should take precedence and may look acceptable in all situations. Currently I've used it only for uppercase and it needs to be changed I guess. I'll take a look again tomorrow.

int_ua’s picture

Added pangram as a test, changed all cases to prefer beginning-of-the-word versions.

int_ua’s picture

Status: Needs review » Needs work

The last submitted patch, 14: ukrainian_transliteration_3169212_14.patch, failed testing. View results

int_ua’s picture

Status: Needs work » Needs review
FileSize
2.01 KB

0x2bc => NULL, does not work for some reason, does anyone know why?

Status: Needs review » Needs work

The last submitted patch, 16: ukrainian_transliteration_3169212_15.patch, failed testing. View results

Matroskeen’s picture

Issue summary: View changes

I updated the issue summary to reflect what we're doing here and to have better visibility.
I didn't include characters if it already returns the expected value. Feel free to add characters that should be ignored.

By the way, I was using this service to get hexadecimal code: http://www.mauvecloud.net/charsets/CharCodeFinder.html

So the next would be:
1) Make sure the list matches core/lib/Drupal/Component/Transliteration/uk.php;
2) Fix failed test;
3) Upload test-only patch to verify that we're fixing the problem described in the issue summary (it should fail) + re-upload the patch with test + fix itself (it should pass).

We're getting close 🤓

Krzysztof Domański’s picture

Character Hex code Current Expected
Ч 0x427 CH Ch
Ш 0x428 SH Sh
Щ 0x429 SCH Shch

Uppercase / lowercase should be fixed in a separate issue.
See #3000630: Transliteration causes 2 capital letters at the beginning of a word.

Matroskeen’s picture

Issue summary: View changes

Fair enough. Removed such characters from the list.

Matroskeen’s picture

re #16:

The test fail, because the NULL value doesn't pass this isset condition in :

if (isset($this->languageOverrides[$langcode][$code])) {
   return $this->languageOverrides[$langcode][$code];
 }

I was wondering why for a second because the array key exists. But let's take a look into isset definition:

Determine if a variable is considered set, this means if a variable is declared and is different than null.

I think changing 0x2bc => NULL to 0x2bc => '' should do the trick.

int_ua’s picture

Status: Needs work » Needs review
FileSize
2.01 KB

Changed NULLs to empty strings

int_ua’s picture

The last submitted patch, 22: ukrainian_transliteration_3169212_22.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 23: ukrainian_transliteration_3169212_test.patch, failed testing. View results

int_ua’s picture

Status: Needs work » Needs review

Drupal\Tests\ckeditor\FunctionalJavascript\CKEditorIntegrationTest::testOffCanvasStyles
Behat\Mink\Exception\ElementNotFoundException: Element matching css ".cke_top" not found.

Guess I have to find out how to rerun tests now

Matroskeen’s picture

Status: Needs review » Needs work

Yep, it seems like random test failure, perhaps related to this one #3192260: [random test failure] Random fail in media_library CKEditorIntegrationTest.

Retest can be triggered by following "Add test / retest" link under the patch and selecting "Retest PHP ..."


I just reviewed the patch and I think we can clear-up the list a little bit. Some of the specified characters already provide the expected value, so there is no need to specify this in our overrides.

I also noticed a few inconsistencies:
1) є (0x454) should be ie instead of ye;
2) я (0x44f) should be ia instead of ya;

According to the table in IS, I think we should have the following array:

0x413 => 'H',
0x433 => 'h',
0x404 => 'Ye',
0x454 => 'ie',
0x418 => 'Y',
0x438 => 'y',
0x407 => 'Yi',
0x439 => 'i',
0x426 => 'Ts',
0x446 => 'ts', 
0x429 => 'Shch',
0x449 => 'shch',
0x44e => 'iu',
0x44f => 'ia',
0x42c => '',
0x44c => '',
0x2bc => '',
0x27 => '',
0x2019 => '',

We also need to specify in the IS table what characters are behind these codes: 0x42c, 0x44c, 0x2bc, 0x27, 0x2019

I'm sorry to circle it back, but I'd like to maintain the ability to RTBC 😋

int_ua’s picture

1) є (0x454) should be ie instead of ye;
2) я (0x44f) should be ia instead of ya;

As I've said before, IMHO defaulting to beginning-of-the-word version before a solution that differentiates them is implemented looks nicer. And I've added some extra symbols that were already matching just to make sure they aren't changed in the future.

We also need to specify in the IS table what characters are behind these codes: 0x42c, 0x44c, 0x2bc, 0x27, 0x2019

They were specified, not sure when they were lost. Anyway, I'm satisfied with the patch for my project for now, please continue without me.

Matroskeen’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
1.06 KB
1.75 KB
2.41 KB

As I've said before, IMHO defaulting to beginning-of-the-word version before a solution that differentiates them is implemented looks nicer.

I've reviewed this part again, and I think this is a far point. A new version is a kind of compromise but it's much better than it was before.
Therefore, looks good to me!

And I've added some extra symbols that were already matching just to make sure they aren't changed in the future.

It's not necessary, because we already have a test that covers all Ukrainian letters. If someone will break this, we'll see :)

They were specified, not sure when they were lost

Added back to the table.


I'm also re-uploading patch #23 with some clean-up and minor tweaks.

Matroskeen’s picture

Title: Incorrect transliteration of Ukrainian letters » Improve transliteration of Ukrainian letters
Issue tags: +Bug Smash Initiative

The last submitted patch, 29: 3169212-29-test_only.patch, failed testing. View results

Matroskeen’s picture

Matroskeen’s picture

init90’s picture

Status: Needs review » Reviewed & tested by the community

Thanks for the great work!

The patch looks good and makes transliteration work according to official Ukraine rules ( https://zakon.rada.gov.ua/laws/show/55-2010-%D0%BF#Text).

For test was used next command:

echo \Drupal::transliteration()->transliterate('На подушечці форми любої є й ґудзик щоб пірʼя геть жовте сховати.', 'uk');

Result without patch:

Na podushechci formi lyuboi e y gudzik schob pir'ya get zhovte skhovati.

Result with patch:

Na podushechtsi formy lyuboyi ye y gudzyk shchob pirya het zhovte skhovaty.

The issue has a great summary with the charset comparison table which shows that with the patch we get proper Ukrainian transliteration.

Matroskeen’s picture

Issue summary: View changes

Thanks for the review!
We should probably keep the thread unresolved to see what other people think about it (especially core committers).
I slightly disagree about adding extra comments, but I don't mind adding if people think it's helpful 😋

Matroskeen’s picture

Issue summary: View changes
Matroskeen’s picture

Issue summary: View changes

Updated "Steps to reproduce" section.

podarok’s picture

Priority: Normal » Major

+1 to RTBC

xjm’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs followup

Thanks everyone for your work on this issue, and apologies for the delay in a committer review!

I read over the character substitutions in the issue summary (disclaimer: I speak Russian, but only knew some basics about the differences with the Ukrainian alphabet). Everything looks correct and it's obvious that Russian pronunciations were being used as the basis for transliteration instead of the correct Ukranian ones. The incorrect transliterations of the hushes OTOH were also fixed for Russian in a previous issue, so it makes sense that we need to fix them here as well. (I wonder if there are other Cyrillic transliterations that have the same problem elsewhere?)

Finally, I reviewed the linked official resource from the Ukrainian Ministers' Cabinet and confirmed that (with the below exception) these are the legally correct transliterations.

I guess the only controversial part of this is:

Note: Preferably, some characters should be transliterated differently based on their position. However, we're trying to simplify things and use the most convenient version for both cases. That's why we have some differences between the official law version and our implementation.

Reviewing the above resource, I see that this is specifically for the soft vowels, which are correctly transliterated with a "Y" at the beginning when in a word-initial position (an actual separate phoneme) versus when they're in a middle or final syllable and should be prefixed with "i" instead.

To me that does sound like something we should try to fix if possible, since using "y" instead inside a word will look like a spelling error, as if the original Ukrainian word had been spelled with a consonant cluster having и or (especially weird) й before the main vowel.

However, I agree that this would be complicated to implement, and I wouldn't hold that up on fixing all the other, more egregious spelling errors. Therefore, I think we should file a followup issue to address that instead.

Finally, I do think an inline comment about the apostrophes should be added (as I commented on the MR).

Marking "Needs work" for:

  1. Filing a followup issue to explore different transliteration for non-word-initial soft vowels.
  2. Adding an inline comment as per the MR.

I've asked other committers to keep an eye out for this issue being RTBC again so we don't have to wait another month while those changes are made.

Thanks everyone!

Matroskeen’s picture

Status: Needs work » Reviewed & tested by the community
Issue tags: -Needs followup

@xjm thanks so much for the review!

Given it's just a comment change, moving back to RTBC.

alexpott’s picture

Version: 9.2.x-dev » 8.9.x-dev

Credited @init90 and @xjm for reviews and @biblos for opening and documenting the issue.

Committed and pushed 9c2fa5f9e5 to 9.2.x and 2cdfb750d1 to 9.1.x. Thanks!

Kicking off a run against Drupal 8.9.x as this is a major bug and therefore eligible for backport.

  • alexpott committed 9c2fa5f on 9.2.x
    Issue #3169212 by int_ua, Matroskeen, anmolgoyal74, Krzysztof Domański,...

  • alexpott committed 2cdfb75 on 9.1.x
    Issue #3169212 by int_ua, Matroskeen, anmolgoyal74, Krzysztof Domański,...
alexpott’s picture

This is a place where patches > MR - can't get a run against 8.9.x without a patch so here's a patch that's the same as https://git.drupalcode.org/project/drupal/-/merge_requests/321.diff

  • xjm committed 3bddf7f on 8.9.x
    Issue #3169212 by int_ua, Matroskeen, anmolgoyal74, Krzysztof Domański,...
xjm’s picture

Thanks @Matroskeen for the quick reply; clearly I did not read the footnotes. :) But I do think the inline comment helps regardless since there's multiple corrections for the one point about soft signs and apostrophes.

Glad to see this committed already. I went ahead and committed the backport since it passed.

xjm’s picture

Status: Reviewed & tested by the community » Fixed
Matroskeen’s picture

This is awesome! Thank you all so much! 💙💛

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.