Problem/Motivation

It's currently possible to search for Éclair recipes by typing in Eclair. This is currently only done for Latin letters and the issue is about expanding this to other scripts.

Proposed resolution

Accented characters among the Unicode characters are precomposed characters. For example, É (official name: Latin Capital Letter E with Acute) decomposes into Latin Capital Letter E and Combining Acute Accent. The proposed resolution is to apply these decomposition rules to Letters and remove the Combining Marks. This is the recommended way to remove accents in the ICU documentation.

The suggested userspace implementation have been created by a script. This script uses intl (the PHP wrapper for ICU) and attempts to remove accents from letters in the first 8192 characters and where this succeeded, the results are recorded into a simple associative array (if it didn't succeed then the letter didn't have a decomposition rule). It's 827 characters. Use this array for a new remove diacritics class. The results with ample doxygen is a less than 10K PHP file. See the ranges up to and including Greek Extended for included scripts.

Provide an alter hook in a core remove diacritics service.

Implement this alter hook in new a hidden for backwards compatibility and install it in an update hook. This is necessary because out of the 288 characters the current implementation removes diacritics from 40 characters which are no longer handled. See this Linguistics StackExchange answer why these characters have no decomposition rules. The current remove diacritics implementation uses the transliteration character database which has been compiled from various sources instead of just using the standard which is useful in the declared goal of machine name creation -- but it's less useful when creating strings users actually need to interact with. Nonetheless, the alter hook is there if a site wants to do something else and if some rules become widespread in a language community , I hope eventually there'll eventually be a contrib module which provides a config entity utilizing this alter hook and localize.drupal.org could provide these entities.

This module probably should be deprecated but adding lifecycle: deprecated to a core module breaks a lot of tests and it is not documented how to do this. So this should be a follow up pending on said documentation.

User interface changes

None.

API changes

TransliterationInterface::removeDiacritics is deprecated.

A new remove diacritics class is added to transliteration.

A new hook_remove_diacritics_map_alter is introduced to allow changing rules.

Data model changes

Release notes snippet

Original report

when removing diacritics in function search_simplify(), it not considering remove Arabic diacritics. How to test: add this text to any article: "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ" then search for "السلام". Original: https://ahmedspace.com/arabic-case-insensitive-in-database-systems-how-t...

CommentFileSizeAuthor
#89 3099668-nr-bot.txt90 bytesneeds-review-queue-bot
#59 interdiff_54-59.txt3.93 KBravi.shankar
#59 3099668-59.patch16.83 KBravi.shankar
#54 3099668-53.patch16.17 KBLendude
#54 interdiff-3099668-45-53.txt634 bytesLendude
#45 interdiff_44-45.txt1.87 KBanmolgoyal74
#45 3099668_45.patch16.17 KBanmolgoyal74
#44 interdiff_43_44.txt812 bytesanmolgoyal74
#44 3099668_44.patch16.56 KBanmolgoyal74
#43 interdiff_33_43.txt3.78 KBanmolgoyal74
#43 3099668_43.patch17.16 KBanmolgoyal74
#35 3099668-35.patch17.42 KBdaffie
#35 interdiff-3099668-33-35.txt1.87 KBdaffie
#33 3099668_33.patch17.84 KBGhost of Drupal Past
#31 3099668_30.patch15.89 KBGhost of Drupal Past
#27 3099668_27.patch26.23 KBGhost of Drupal Past
#24 default_rules.png8.26 KBGhost of Drupal Past
#22 3099668_22.patch17.49 KBGhost of Drupal Past
#22 default_rules.png12.46 KBGhost of Drupal Past
#21 2020-06-13 13_29_25-StrokesPlus.png8.66 KBGhost of Drupal Past
#20 icu_changes.txt1.18 KBGhost of Drupal Past
#18 icu_changes.txt302 bytesGhost of Drupal Past
#14 3099668_14.patch3.56 KBGhost of Drupal Past
arabic-diaritics.patch694 bytesomlx

Issue fork drupal-3099668

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

omlx created an issue. See original summary.

omlx’s picture

How long this bug have to wait to find some love from devs?

longwave’s picture

Version: 8.8.0 » 8.8.x-dev
Status: Active » Needs work

I agree that this is not the correct place to add this change; the call to removeDiacritics() above should surely do this instead.

Ghost of Drupal Past’s picture

Category: Bug report » Support request
Status: Needs work » Closed (won't fix)

Thanks for your contribution but I am afraid we are unable to fix this in core.

  1. The diacritics function explicitly mentions "This only applies to certain letters: Accented Latin characters "
  2. Our dataset is based on intl / icu. See #1823454-56: Verify transliteration data sources and their quality, and potentially eliminate maintenance. Transliteration is always controversial but icu is the best generic solution we have.
  3. If this generic solution is not working for you, please use hook_transliteration_overrides_alter. Looking at https://en.wikipedia.org/wiki/Romanization_of_Arabic#Romanization_standa... I fail to see how core could provide a transliteration which works for everyone.
omlx’s picture

Thanks for reply
We need to ignore diacritics in arabic to enhance searching
This approach is what libreoffice And firfox do
It may bette to use icu to remove diacritics Same as lifficeoffice[1]

[1] https://cgit.freedesktop.org/libreoffice/core/commit/?h=libreoffice-5-3&...

omlx’s picture

@Charlie ChX Negyesi transliteration in drupal is fine and no need to change it. The search is broken.

omlx’s picture

if we use icu for transliterationsome thing like this

$transliterator = Transliterator::createFromRules(':: NFD; :: [:M:] Remove; NFC;', Transliterator::FORWARD);

will it be acceptable?

xjm’s picture

Title: ignoring Arabic diacritics » Improve search behavior for Arabic (and other languages with optional diacritics)
Version: 8.8.x-dev » 9.1.x-dev
Category: Support request » Feature request
Status: Closed (won't fix) » Active

@Charlie ChX Negyesi, I don't think it was appropriate to wontfix this issue, especially without any maintainer feedback on it. I'm reopening it because while the actual implementation or proposed resolution might be different from the above, I don't see a reason not to try to improve this.

So, I'm setting the issue back to active. Thanks!

Ghost of Drupal Past’s picture

Title: Improve search behavior for Arabic (and other languages with optional diacritics) » Use intl for transliterate if available
Component: search.module » base system
Issue summary: View changes

I apologize for being shortsighted. In the past I have worked extensively on search-Unicode issues and the transliteration component and within that framework, this issue is unfixable. But we can, of couse, do more. Let me re-target this issue for a more appropriate solution.

If this gets in, search will need a version which uses intl for picking up codepoints instead of a regex.

Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Status: Active » Needs review
FileSize
3.56 KB
  1. The test expects the characters Ɋ and ɋ to be changed to Q and q but this doesn't happen yet so there's one failure.
  2. removeDiacritics is more than a bit strange because it removes more than just diacritics -- strokes, for example. But it doesn't do a full extended Latin to ASCII either. This made it necessary to lift some of the logic from PhpTransliteration.
  3. Also,I made a small change to the test to make the errors easier to read.

Status: Needs review » Needs work

The last submitted patch, 14: 3099668_14.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Ghost of Drupal Past’s picture

Status: Needs work » Needs review

Yes, this was expected, https://stackoverflow.com/q/62343638/308851 is where we continue, the patch really does need review from interested parties.

Ghost of Drupal Past’s picture

Title: Use intl for transliterate if available » Expand the remove diacritics feature to other scripts
Issue summary: View changes
Status: Needs review » Active
Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
302 bytes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
1.18 KB
Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
8.66 KB
Ghost of Drupal Past’s picture

Issue summary: View changes
Status: Active » Needs review
FileSize
12.46 KB
17.49 KB

I don't quite expect this patch to pass yet but it should give you a very good idea of how things are shaping up. It also includes the other issue because I can't get anything to work without it -- ICU will remove those diacritics before we do anything else.

Status: Needs review » Needs work

The last submitted patch, 22: 3099668_22.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Ghost of Drupal Past’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
8.26 KB

I will work on fixing those but for now -- let's validate the approach before fine tuning.

jhodgdon’s picture

I realize this patch is probably preliminary, but I took a look anyway... Some thoughts:

a) During some operations, like search cron, some of these functions will be called a lot during the same request (or at least, I think so?). I don't know if instantiating the transliterators from the intl is an expensive operation or not, but at least in the Core component it seems to have a number of steps. Would it make sense to store these things, on a per-langcode and per-type (diactrics remove or transliterate) basis, in member variables? This would apply to both the Core and Component intl transliterators.

b) API docs could be improved:
- In the api.php file, an @see link to some documentation of what ICU rules are, and to docs for the intl package, would seem to be helpful? The link to the example is OK but a link to actual docs would also be helpful.
- The alter hook's doc block should start with Alter instead of Provide.
- The Provide hook should have a @see to the Alter hook.
- The Transliteration topic probably needs some expansion to talk about the use of the intl package, with maybe a link to docs of how to install it.
- public function getRemoveDiacriticsRules() on the LanguageInterface needs more explanation. I have no idea what a "rules string for intl" ooks like and there is no link to docs.
- These rules strings seem to be referred to as "Rules string for intl for diacritics removal" and "language-specific ICU additional rules for diacritics removal" and "Characters PhpTransliteration removes diacritics from but ICU doesn't." and "Append these to remove diacritics rules for intl."... Maybe they are all the same thing, and maybe not? But it is confusing.
- Several member variables and functions in the new class are missing docs.

c) In (a) I suggested some member variables, but it looks like in the Core object there is already a member variable that should probably be a local variable:

+  protected function getTransliteration($langcode) {
+    $map = $this->map;
+    if (!isset($this->map)) {
+      $this->map = $this->moduleHandler->invokeAll('remove_diacritics_rules_build');
+    }
+    if (!isset($this->map[$langcode])) {
+      $this->map[$langcode] = self::DEFAULT_REMOVE_DIACRITICS_RULE;
+    }
+    if ($map !== $this->map) {
+      $this->moduleHandler->alter('remove_diacritics_rules', $this->map);
+      $this->transliteration = NULL;
+    }
+    if (!isset($this->transliteration)) {
+      $this->transliteration = new BaseIntlTransliteration($this->map);
+    }
+    return $this->transliteration;
+  }

As far as I can tell, each time this function is called, $this->transliteration is recalculated and it is never used elsewhere in the class. Besides, it is specific to a language... so it seems like this particular variable should be a local variable to this function and not a member variable?

alexpott’s picture

  1. +++ b/core/core.services.yml
    @@ -1372,9 +1372,12 @@ services:
    -  transliteration:
    +  transliteration.php:
    

    This will need a deprecated service so we still support the transliteration service name. Ah I see we set this in a service provider. IN that case we could leave this alone and replace the transliteration service there. No need for transliteration.php and transliteratioon.intl. If we do decide to do it this way both services should be private.

  2. +++ b/core/lib/Drupal/Core/CoreServiceProvider.php
    @@ -60,6 +60,13 @@ public function register(ContainerBuilder $container) {
    +    if (extension_loaded('intl')) {
    +      $container->setAlias('transliteration', 'transliteration.intl');
    +    }
    +    else {
    +      $container->setAlias('transliteration', 'transliteration.php');
    +    }
    

    If https://github.com/symfony/polyfill/pull/192 lands then potentially we can always use the .intl service. Also if that lands then we might not need the .php one anymore. Maybe we should try to help land the polyfill and then we can rid ourselves of quite a bit of code.

Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
26.23 KB

Amended and rewritten much of the docs, smoothed the logic in Core/IntlTransliteration, added a synthetic transliteration service to make it clearer, left comments on the intl PR, fixed at least one test -- the DefaultConfigTest checks whether nothing gets added after install but the rules do get added from the constant and I have no desire to copy out that constant into X default YAML just because of a test. The interdiff is 21k vs the patch 27k, I deemed it not useful and not uploaded.

I must admit now I found and read https://www.unicode.org/reports/tr35/tr35-general.html#Transform_Rules_S... I now understand so much more. The ICU guide at http://userguide.icu-project.org/transforms/general is really, really hard to understand. People interested in these matters should run https://stackoverflow.com/a/17015063/308851 (don't forget to redirect the output, it's a lot) and peek at the results, it basically serves as a lot of examples. I didn't link the actual HTTP link to TR35 because HTTP links have the tendency to rot but I certainly can add this one: it's been there since 2013.

alexpott’s picture

  1. +++ b/core/core.services.yml
    @@ -1342,8 +1342,15 @@ services:
       transliteration:
    +    synthetic: true
    +  transliteration.php:
         class: Drupal\Core\Transliteration\PhpTransliteration
         arguments: [null, '@module_handler']
    +    private: true
    +  transliteration.intl:
    +    class: Drupal\Core\Transliteration\IntlTransliteration
    +    arguments: ['@module_handler']
    +    private: true
    
    +++ b/core/lib/Drupal/Core/CoreServiceProvider.php
    @@ -54,6 +54,8 @@ public function register(ContainerBuilder $container) {
    +    $container->setAlias('transliteration', 'transliteration.' . (extension_loaded('intl') ? 'intl' : 'php'));
    +
    

    I don't think this dance is worthwhile. We can leave the transliteration service alone and then in the CoreServiceProvider change the definition.

  2. +++ b/core/tests/Drupal/KernelTests/AssertConfigTrait.php
    @@ -80,6 +80,13 @@ protected function assertConfigDiff(Diff $result, $config_name, array $skipped_c
    +            if (!empty($skipped_config[$config_name])) {
    +              foreach ($skipped_config[$config_name] as $line) {
    +                if (strpos($closing, $line) !== FALSE) {
    +                  continue 2;
    +                }
    +              }
    +            }
    

    How about adding the new config to the configurable language schema.

  3. +++ b/core/tests/Drupal/KernelTests/Config/DefaultConfigTest.php
    @@ -40,6 +40,9 @@ class DefaultConfigTest extends KernelTestBase {
    +    'language.entity.en' => ['remove_diacritics_rules: '],
    +    'language.entity.und' => ['remove_diacritics_rules: '],
    +    'language.entity.zxx' => ['remove_diacritics_rules: '],
    

    We need to add this to the schema - not skip it.

  4. +++ b/core/tests/Drupal/KernelTests/FileSystemModuleDiscoveryDataProviderTrait.php
    @@ -14,6 +14,7 @@ trait FileSystemModuleDiscoveryDataProviderTrait {
    +    return [['language']];
    

    Debug code.

  5. One thought is that making the rules configurable doesn't feel like it makes much sense. It feels as though correct rules are possible and something that we can build up over time in core.
alexpott’s picture

+++ b/core/modules/language/src/Entity/ConfigurableLanguage.php
@@ -90,6 +92,13 @@ class ConfigurableLanguage extends ConfigEntityBase implements ConfigurableLangu
+  /**
+   * Append these to remove diacritics rules for intl.
+   *
+   * @var string
+   */
+  protected $remove_diacritics_rules = IntlTransliteration::DEFAULT_REMOVE_DIACRITICS_RULE;

Also if we have to make this configuration that don't do this. We can use the default in language_remove_diacritics_rules_build() if the configurable language doesn't have a rule. And then we'd maintain the rule that config should not change on installation.

Status: Needs review » Needs work

The last submitted patch, 27: 3099668_27.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
15.89 KB

Alright, much trimmed version: Alex asked me to drop completely the config entity parts and the build hook and just keep the alter.

Ghost of Drupal Past’s picture

Status: Needs work » Needs review
Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
17.84 KB

This is as far as I have carried this issue. Others will need to continue.

omlx’s picture

Thank you Charlie for your great work.

daffie’s picture

Updated the patch so that the PHPCS error messages should be fixed.

+++ b/core/lib/Drupal/Component/Transliteration/IntlTransliteration.php
@@ -0,0 +1,103 @@
+ * Transliteration implementation using the intl extension.

The question I have is do we have this extension enabled on the testbot. When I run the test Drupal\Tests\block\Functional\BlockTest it returns green on my local machine and it fails on the testbot.

omlx’s picture

Any update on this? Will it be merged ?

jhodgdon’s picture

This issue is at status Needs Review. The next step is for someone in the community to review it, and either set it to Reviewed and tested by the community, or to Needs work.

Once it is at Reviewed and tested by the community status, it can get committed/merged into Drupal core.

omlx’s picture

Thanks @jhodgdon for the informative reply. For the testing is it enough to test whether the original bug (mprove search behavior for Arabic) is fixed or not ? I can do that. thanks

jhodgdon’s picture

Yes, that would be a good test. In order for the issue to be marked RTBC, someone also should review the code in the patch.

But actually, before we do that... I just noticed that the test results from the automated test of the latest patch say "Build successful", which actually means that the tests failed in some way (the message is misleading -- there is an issue in the automated testing project to fix it, but that issue is still open apparently). We need a patch that results in a green message that says something like "PHP 7.4 & MySQL 5.7 26,919 pass", not a message that ends in "Build successful"

I will relaunch the test to see if this was a real problem or something intermittent not related to this patch. But if it comes back "Build successful" again, then we'll need to figure out what is wrong with the patch before we get to the step of reviewing/testing.

jhodgdon’s picture

Status: Needs review » Needs work

It is now saying Patch failed to apply on the test results.

omlx’s picture

Anybody willing to complete the patch?

Thanks

omlx’s picture

hi @xjm please have a look to this bug. currently, the search in Arabic does not yield correct results.
Thanks

anmolgoyal74’s picture

Status: Needs work » Needs review
FileSize
17.16 KB
3.78 KB

Re-rolled the patch.

anmolgoyal74’s picture

anmolgoyal74’s picture

omlx’s picture

thank @anmolgoyal74 for carry on this patch. I tried to apply the patch for 9.1.x branch but I got this error:

error: core/lib/Drupal/Component/Transliteration/IntlTransliteration.php: already exists in working directory
error: core/lib/Drupal/Core/Transliteration/IntlTransliteration.php: already exists in working directory

any clue ?

omlx’s picture

Please ignore previous reply I figure it out. The patch works good.
How I test it.
- create a post contains this text: "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ"
- reindex the post.
-search for "السلام"

could you please backport it to 8.9 branch?

Thanks

omlx’s picture

Status: Needs review » Reviewed & tested by the community

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

catch’s picture

Status: Reviewed & tested by the community » Needs work

'build successful' actually means test failures unfortunately - see the Drupal CI job output: https://dispatcher.drupalci.org/job/drupal_patches/66415/

xjm’s picture

A lot of the test failures look like this:

) Drupal\Tests\block\Functional\BlockLanguageTest::testLanguageBlockVisibility
Exception: Notice: Undefined property: Drupal\Core\Transliteration\IntlTransliteration::$transliteration
Drupal\Core\Transliteration\IntlTransliteration->getTransliteration()() (Line: 50)

So the next step here is to fix that.

xjm’s picture

Category: Feature request » Bug report

This should probably be treated as a bug, because it prevents search from returning the expected results in languages with optional vowel markers.

xjm’s picture

Issue summary: View changes

Also removing red herrings from the summary about Scandinavian, Hungarian, and "international audiences". This is a real bug for certain languages where diacritics are not a required part of the spelling (Hebrew, Arabic, etc.) Added a note to the remaining tasks for dealing with the fact that this behavior is needed for some languages but wrong in others.

Lendude’s picture

Status: Needs work » Needs review
FileSize
634 bytes
16.17 KB

This should fix the notices but not sure why the build fails, so that might still happen.

Also, this doesn't pass for me locally because: ''ext-intl' is missing in composer.json' ¯\_(ツ)_/¯

Status: Needs review » Needs work

The last submitted patch, 54: 3099668-53.patch, failed testing. View results

TR’s picture

Note that the if ($this->transliteration) check on line 40 of Drupal\Core\Transliteration\IntlTransliteration doesn't cause a "Undefined property" notice because the property IS defined on the class. So why should the if (!$this->transliteration) check on line 50 cause that notice?

Using empty() on line 50 like the patch in #54 just hides the problem, which is the unset($this->transliteration) a few lines above that. Using unset() removes the property from the object, which is not the right thing to do. Instead, this line should set the property to NULL - this empties the property value without removing the property from the object. The property probably also should be initialized to NULL, so it always has a definite value ...

quietone’s picture

I reviewed the comments for readability. I don't know this material so I have assumed the comments are correct.

  1. +++ b/core/lib/Drupal/Component/Transliteration/IntlTransliteration.php
    @@ -0,0 +1,98 @@
    +      // We need to handle this manually because Transliterator::createFromRules()
    

    Line > 80 char

  2. +++ b/core/lib/Drupal/Component/Transliteration/IntlTransliteration.php
    @@ -0,0 +1,98 @@
    +   * Return the map of rules for remove diacritics.
    

    s/remove/removing/

  3. +++ b/core/lib/Drupal/Component/Transliteration/TransliterationInterface.php
    @@ -12,18 +12,17 @@ interface TransliterationInterface {
    +   *   (optional) The language code of the language the string is in. Defaults
    

    I think this can be simplified. Maybe 'The language code of the string'.

  4. +++ b/core/lib/Drupal/Core/Language/language.api.php
    @@ -227,6 +227,28 @@ function hook_language_switch_links_alter(array &$links, $type, \Drupal\Core\Url
    + * Diacritics can be removed from a much wider set of characters if the intl
    ...
    + * syntax can be found in the Transform Rules Syntax section of UTS #35:
    + * Unicode LDML.
    

    This paragraph isn't wrapped to 80 chars.

    I don't know what the UTS #35: Unicode LDML is. Can we add an @see to link to that to save the reader some searching.

  5. +++ b/core/lib/Drupal/Core/Transliteration/IntlTransliteration.php
    @@ -0,0 +1,71 @@
    +  /**
    +   * @param $langcode
    

    Needs summary line

ravi.shankar’s picture

Assigned: Unassigned » ravi.shankar
ravi.shankar’s picture

Assigned: ravi.shankar » Unassigned
FileSize
16.83 KB
3.93 KB

Tried to address comment #57

xjm’s picture

#56 also needs to be addressed. Lots of:

1) Drupal\Tests\block\Functional\BlockTest::testBlockVisibility
Exception: Error: Call to a member function transliterate() on null
Drupal\Component\Transliteration\IntlTransliteration->transliterate()() (Line: 84)
Ghost of Drupal Past’s picture

Let me explain why I added the config entity back then -- this is not to contradict Alex, of course, this is just something I have not expanded enough on I am afraid: I do not think individuals will create their rules but I was dreaming about the various communities currently working on translations to develop their own rulesets (possibly multiple per language) and if we have a config entity, these are very easy to distribute. If that's where we are headed, it might benefit from localize.drupal.org features , even if it is just a link initially. For this reason I felt having it in core rather than contrib would be beneficial.

The comment I added in @defgroup transliteration still references to the config entity.

Re #57, The Transform Rules Syntax section of UTS #35: Unicode LDML is at http://www.unicode.org/reports/tr35/tr35-39/tr35-general.html#Transform_... I was hesitant adding such a link because I am unsure about the stability of that link. Though I guess having something that might rot is better than nothing. Despite the scary formal header, the text itself is somewhat understandable.

Let me also note the comment " If no rules are specified for a given language, default to :: NFD; :: [:M:] Remove; :: NFC;" is incorrect in the constructor of IntlTransliteration::__construct. Instead, we default to the DEFAULT_REMOVE_DIACRITICS_RULE constant. I suspect I wrote that comment before I fully understood what's going on. The comment on the constant is much better however cross referencing it to the @defgroup transliteration comment would be beneficial.

I believe the original code in testRemoveDiacritics is now superflous and could be removed as providerTestPhpTransliterationRemoveDiacritics will always return PhpTransliteration.

omlx’s picture

is there any update on this?

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

omlx’s picture

Hi @Charlie ChX Negyesi & @alexpott @xjm

What's the way forward about this bug?

thanks

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

omlx’s picture

As a workaround I installed Search API module & enabled Database Search as backend, then in processors page under index page I enable Ignore characters and add \p{M} to the list.
\p{M} - Unicode property classes matches any diacritic mark.

I hope this helps till this bug fixed

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

andypost’s picture

It will affect coming help topics module which is using search index (core one) to search help pages and its translations

catch’s picture

This needs an issue summary update - there are at least two proposed resolutions.

Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Issue tags: -Needs issue summary update

I rewrote the IS completely.

Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

I have been working on this, researching etc and I believe the best we can do is to ignore the current implementation, snapshot what intl does into a simple mapping: https://gist.github.com/chx/ce1d45398996bbadcaf0bd65a61f5902#file-remove... the gist also contains the generator script.

doesn't need intl in core at all. It's less than ten kilobytes and as fast as it gets. It handles the first 8192 (0x2000) characters https://jrgraphix.net/r/Unicode/ which includes most non-Hiragana/Katakana scripts. We still want to provide an alter hook to change the rules and it's really simple because the rules are just an array at this point.

Contrib can add intl as a service and can convert the results of the alter into intl rules to augment the basic one with ease, single character replacements are valid rules for ICU.

The rules mentioned in the current IS are still removed. If we wanted to, we could provide a module which alters in the current rules for near total backwards compatibility -- the component would have these rules removed but the core class would only add rules in this case. Read here why these do not have decomposition rules in Unicode.

Here at the 827 replacements the current proposal would do:

new rules

Ghost of Drupal Past’s picture

I pushed this solution , reading the most important new class is much more pleasant in the blob viewer the diff viewer breaks the map line and it's a bit long then.

Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Status: Needs work » Needs review
smustgrave’s picture

Status: Needs review » Needs work

For the CC failure.

Could the change record be expanded to include the new service remove_diacritics and how it might be used? Also the addition of a new module that's deprecated?

Ghost of Drupal Past’s picture

Issue summary: View changes
Ghost of Drupal Past’s picture

Issue summary: View changes
Status: Needs work » Needs review

Updated the change notice too and we are green. I will do a rebase.

Minor things like hook documentation, a test (I am guessing the alter hook can be tested, I can't imagine a meaningful test for a class which is literally just an strtr call) and deprecating the existing method is not written yet, but that shouldn't block a review. Most importantly: are we OK with this approach?

Ghost of Drupal Past’s picture

The fun continues in https://www.drupal.org/project/remove_diacritics it at least provides the necessary functionality.

I most certainly won't work on this issue (or any other core issue, I was a fool to do this one but I couldn't bear watching it die, I just wish I had the idea to continue in contrib sooner) and given the amount of contributions to this in the last near three years I guess this should be closed again but I do not have the permissions to do so.

needs-review-queue-bot’s picture

Status: Needs review » Needs work
FileSize
90 bytes

The Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".

This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

kostyashupenko made their first commit to this issue’s fork.

kostyashupenko’s picture

Status: Needs work » Needs review

Rebased

smustgrave’s picture

Will need test coverage for sure.

Also will also need framework manager sign off to add a new module believe.

After all that a change record will be needed.