Problem/Motivation
It's currently possible to search for Éclair recipes by typing in Eclair. This is currently only done for Latin letters and the issue is about expanding this to other scripts.
Proposed resolution
Accented characters among the Unicode characters are precomposed characters. For example, É (official name: Latin Capital Letter E with Acute) decomposes into Latin Capital Letter E and Combining Acute Accent. The proposed resolution is to apply these decomposition rules to Letters and remove the Combining Marks. This is the recommended way to remove accents in the ICU documentation.
The suggested userspace implementation have been created by a script. This script uses intl (the PHP wrapper for ICU) and attempts to remove accents from letters in the first 8192 characters and where this succeeded, the results are recorded into a simple associative array (if it didn't succeed then the letter didn't have a decomposition rule). It's 827 characters. Use this array for a new remove diacritics class. The results with ample doxygen is a less than 10K PHP file. See the ranges up to and including Greek Extended for included scripts.
Provide an alter hook in a core remove diacritics service.
Implement this alter hook in new a hidden for backwards compatibility and install it in an update hook. This is necessary because out of the 288 characters the current implementation removes diacritics from 40 characters which are no longer handled. See this Linguistics StackExchange answer why these characters have no decomposition rules. The current remove diacritics implementation uses the transliteration character database which has been compiled from various sources instead of just using the standard which is useful in the declared goal of machine name creation -- but it's less useful when creating strings users actually need to interact with. Nonetheless, the alter hook is there if a site wants to do something else and if some rules become widespread in a language community , I hope eventually there'll eventually be a contrib module which provides a config entity utilizing this alter hook and localize.drupal.org could provide these entities.
This module probably should be deprecated but adding lifecycle: deprecated
to a core module breaks a lot of tests and it is not documented how to do this. So this should be a follow up pending on said documentation.
User interface changes
None.
API changes
TransliterationInterface::removeDiacritics
is deprecated.
A new remove diacritics class is added to transliteration.
A new hook_remove_diacritics_map_alter
is introduced to allow changing rules.
Data model changes
Release notes snippet
Original report
when removing diacritics in function search_simplify(), it not considering remove Arabic diacritics. How to test: add this text to any article: "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ" then search for "السلام". Original: https://ahmedspace.com/arabic-case-insensitive-in-database-systems-how-t...
Comment | File | Size | Author |
---|---|---|---|
#89 | 3099668-nr-bot.txt | 90 bytes | needs-review-queue-bot |
#59 | interdiff_54-59.txt | 3.93 KB | ravi.shankar |
#59 | 3099668-59.patch | 16.83 KB | ravi.shankar |
#54 | 3099668-53.patch | 16.17 KB | Lendude |
#54 | interdiff-3099668-45-53.txt | 634 bytes | Lendude |
Issue fork drupal-3099668
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
omlx CreditAttribution: omlx as a volunteer commentedHow long this bug have to wait to find some love from devs?
Comment #3
longwaveI agree that this is not the correct place to add this change; the call to removeDiacritics() above should surely do this instead.
Comment #4
Ghost of Drupal PastThanks for your contribution but I am afraid we are unable to fix this in core.
hook_transliteration_overrides_alter
. Looking at https://en.wikipedia.org/wiki/Romanization_of_Arabic#Romanization_standa... I fail to see how core could provide a transliteration which works for everyone.Comment #5
omlx CreditAttribution: omlx commentedThanks for reply
We need to ignore diacritics in arabic to enhance searching
This approach is what libreoffice And firfox do
It may bette to use icu to remove diacritics Same as lifficeoffice[1]
[1] https://cgit.freedesktop.org/libreoffice/core/commit/?h=libreoffice-5-3&...
Comment #6
omlx CreditAttribution: omlx commented@Charlie ChX Negyesi transliteration in drupal is fine and no need to change it. The search is broken.
Comment #7
omlx CreditAttribution: omlx commentedif we use icu for transliterationsome thing like this
will it be acceptable?
Comment #8
xjm@Charlie ChX Negyesi, I don't think it was appropriate to wontfix this issue, especially without any maintainer feedback on it. I'm reopening it because while the actual implementation or proposed resolution might be different from the above, I don't see a reason not to try to improve this.
So, I'm setting the issue back to active. Thanks!
Comment #9
Ghost of Drupal PastI apologize for being shortsighted. In the past I have worked extensively on search-Unicode issues and the transliteration component and within that framework, this issue is unfixable. But we can, of couse, do more. Let me re-target this issue for a more appropriate solution.
If this gets in, search will need a version which uses intl for picking up codepoints instead of a regex.
Comment #10
Ghost of Drupal PastComment #11
Ghost of Drupal PastComment #12
Ghost of Drupal PastComment #13
Ghost of Drupal PastComment #14
Ghost of Drupal PastremoveDiacritics
is more than a bit strange because it removes more than just diacritics -- strokes, for example. But it doesn't do a full extended Latin to ASCII either. This made it necessary to lift some of the logic from PhpTransliteration.Comment #16
Ghost of Drupal PastYes, this was expected, https://stackoverflow.com/q/62343638/308851 is where we continue, the patch really does need review from interested parties.
Comment #17
Ghost of Drupal PastComment #18
Ghost of Drupal PastComment #19
Ghost of Drupal PastComment #20
Ghost of Drupal PastComment #21
Ghost of Drupal PastComment #22
Ghost of Drupal PastI don't quite expect this patch to pass yet but it should give you a very good idea of how things are shaping up. It also includes the other issue because I can't get anything to work without it -- ICU will remove those diacritics before we do anything else.
Comment #24
Ghost of Drupal PastI will work on fixing those but for now -- let's validate the approach before fine tuning.
Comment #25
jhodgdonI realize this patch is probably preliminary, but I took a look anyway... Some thoughts:
a) During some operations, like search cron, some of these functions will be called a lot during the same request (or at least, I think so?). I don't know if instantiating the transliterators from the intl is an expensive operation or not, but at least in the Core component it seems to have a number of steps. Would it make sense to store these things, on a per-langcode and per-type (diactrics remove or transliterate) basis, in member variables? This would apply to both the Core and Component intl transliterators.
b) API docs could be improved:
- In the api.php file, an @see link to some documentation of what ICU rules are, and to docs for the intl package, would seem to be helpful? The link to the example is OK but a link to actual docs would also be helpful.
- The alter hook's doc block should start with Alter instead of Provide.
- The Provide hook should have a @see to the Alter hook.
- The Transliteration topic probably needs some expansion to talk about the use of the intl package, with maybe a link to docs of how to install it.
- public function getRemoveDiacriticsRules() on the LanguageInterface needs more explanation. I have no idea what a "rules string for intl" ooks like and there is no link to docs.
- These rules strings seem to be referred to as "Rules string for intl for diacritics removal" and "language-specific ICU additional rules for diacritics removal" and "Characters PhpTransliteration removes diacritics from but ICU doesn't." and "Append these to remove diacritics rules for intl."... Maybe they are all the same thing, and maybe not? But it is confusing.
- Several member variables and functions in the new class are missing docs.
c) In (a) I suggested some member variables, but it looks like in the Core object there is already a member variable that should probably be a local variable:
As far as I can tell, each time this function is called, $this->transliteration is recalculated and it is never used elsewhere in the class. Besides, it is specific to a language... so it seems like this particular variable should be a local variable to this function and not a member variable?
Comment #26
alexpottThis will need a deprecated service so we still support the transliteration service name. Ah I see we set this in a service provider. IN that case we could leave this alone and replace the transliteration service there. No need for transliteration.php and transliteratioon.intl. If we do decide to do it this way both services should be private.
If https://github.com/symfony/polyfill/pull/192 lands then potentially we can always use the .intl service. Also if that lands then we might not need the .php one anymore. Maybe we should try to help land the polyfill and then we can rid ourselves of quite a bit of code.
Comment #27
Ghost of Drupal PastAmended and rewritten much of the docs, smoothed the logic in Core/IntlTransliteration, added a synthetic transliteration service to make it clearer, left comments on the intl PR, fixed at least one test -- the DefaultConfigTest checks whether nothing gets added after install but the rules do get added from the constant and I have no desire to copy out that constant into X default YAML just because of a test. The interdiff is 21k vs the patch 27k, I deemed it not useful and not uploaded.
I must admit now I found and read https://www.unicode.org/reports/tr35/tr35-general.html#Transform_Rules_S... I now understand so much more. The ICU guide at http://userguide.icu-project.org/transforms/general is really, really hard to understand. People interested in these matters should run https://stackoverflow.com/a/17015063/308851 (don't forget to redirect the output, it's a lot) and peek at the results, it basically serves as a lot of examples. I didn't link the actual HTTP link to TR35 because HTTP links have the tendency to rot but I certainly can add this one: it's been there since 2013.
Comment #28
alexpottI don't think this dance is worthwhile. We can leave the transliteration service alone and then in the CoreServiceProvider change the definition.
How about adding the new config to the configurable language schema.
We need to add this to the schema - not skip it.
Debug code.
Comment #29
alexpottAlso if we have to make this configuration that don't do this. We can use the default in language_remove_diacritics_rules_build() if the configurable language doesn't have a rule. And then we'd maintain the rule that config should not change on installation.
Comment #31
Ghost of Drupal PastAlright, much trimmed version: Alex asked me to drop completely the config entity parts and the build hook and just keep the alter.
Comment #32
Ghost of Drupal PastComment #33
Ghost of Drupal PastThis is as far as I have carried this issue. Others will need to continue.
Comment #34
omlx CreditAttribution: omlx commentedThank you Charlie for your great work.
Comment #35
daffie CreditAttribution: daffie commentedUpdated the patch so that the PHPCS error messages should be fixed.
The question I have is do we have this extension enabled on the testbot. When I run the test Drupal\Tests\block\Functional\BlockTest it returns green on my local machine and it fails on the testbot.
Comment #36
omlx CreditAttribution: omlx commentedAny update on this? Will it be merged ?
Comment #37
jhodgdonThis issue is at status Needs Review. The next step is for someone in the community to review it, and either set it to Reviewed and tested by the community, or to Needs work.
Once it is at Reviewed and tested by the community status, it can get committed/merged into Drupal core.
Comment #38
omlx CreditAttribution: omlx commentedThanks @jhodgdon for the informative reply. For the testing is it enough to test whether the original bug (mprove search behavior for Arabic) is fixed or not ? I can do that. thanks
Comment #39
jhodgdonYes, that would be a good test. In order for the issue to be marked RTBC, someone also should review the code in the patch.
But actually, before we do that... I just noticed that the test results from the automated test of the latest patch say "Build successful", which actually means that the tests failed in some way (the message is misleading -- there is an issue in the automated testing project to fix it, but that issue is still open apparently). We need a patch that results in a green message that says something like "PHP 7.4 & MySQL 5.7 26,919 pass", not a message that ends in "Build successful"
I will relaunch the test to see if this was a real problem or something intermittent not related to this patch. But if it comes back "Build successful" again, then we'll need to figure out what is wrong with the patch before we get to the step of reviewing/testing.
Comment #40
jhodgdonIt is now saying Patch failed to apply on the test results.
Comment #41
omlx CreditAttribution: omlx commentedAnybody willing to complete the patch?
Thanks
Comment #42
omlx CreditAttribution: omlx commentedhi @xjm please have a look to this bug. currently, the search in Arabic does not yield correct results.
Thanks
Comment #43
anmolgoyal74 CreditAttribution: anmolgoyal74 at OpenSense Labs for DrupalFit commentedRe-rolled the patch.
Comment #44
anmolgoyal74 CreditAttribution: anmolgoyal74 at OpenSense Labs for DrupalFit commentedComment #45
anmolgoyal74 CreditAttribution: anmolgoyal74 at OpenSense Labs for DrupalFit commentedFixed CS.
Comment #46
omlx CreditAttribution: omlx commentedthank @anmolgoyal74 for carry on this patch. I tried to apply the patch for 9.1.x branch but I got this error:
any clue ?
Comment #47
omlx CreditAttribution: omlx commentedPlease ignore previous reply I figure it out. The patch works good.
How I test it.
- create a post contains this text: "السُّلَّامُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُهُ"
- reindex the post.
-search for "السلام"
could you please backport it to 8.9 branch?
Thanks
Comment #48
omlx CreditAttribution: omlx as a volunteer commentedComment #50
catch'build successful' actually means test failures unfortunately - see the Drupal CI job output: https://dispatcher.drupalci.org/job/drupal_patches/66415/
Comment #51
xjmA lot of the test failures look like this:
So the next step here is to fix that.
Comment #52
xjmThis should probably be treated as a bug, because it prevents search from returning the expected results in languages with optional vowel markers.
Comment #53
xjmAlso removing red herrings from the summary about Scandinavian, Hungarian, and "international audiences". This is a real bug for certain languages where diacritics are not a required part of the spelling (Hebrew, Arabic, etc.) Added a note to the remaining tasks for dealing with the fact that this behavior is needed for some languages but wrong in others.
Comment #54
LendudeThis should fix the notices but not sure why the build fails, so that might still happen.
Also, this doesn't pass for me locally because: ''ext-intl' is missing in composer.json' ¯\_(ツ)_/¯
Comment #56
TR CreditAttribution: TR commentedNote that the
if ($this->transliteration)
check on line 40 of Drupal\Core\Transliteration\IntlTransliteration doesn't cause a "Undefined property" notice because the property IS defined on the class. So why should theif (!$this->transliteration)
check on line 50 cause that notice?Using
empty()
on line 50 like the patch in #54 just hides the problem, which is theunset($this->transliteration)
a few lines above that. Usingunset()
removes the property from the object, which is not the right thing to do. Instead, this line should set the property to NULL - this empties the property value without removing the property from the object. The property probably also should be initialized to NULL, so it always has a definite value ...Comment #57
quietone CreditAttribution: quietone as a volunteer commentedI reviewed the comments for readability. I don't know this material so I have assumed the comments are correct.
Line > 80 char
s/remove/removing/
I think this can be simplified. Maybe 'The language code of the string'.
This paragraph isn't wrapped to 80 chars.
I don't know what the UTS #35: Unicode LDML is. Can we add an @see to link to that to save the reader some searching.
Needs summary line
Comment #58
ravi.shankar CreditAttribution: ravi.shankar at OpenSense Labs commentedComment #59
ravi.shankar CreditAttribution: ravi.shankar at OpenSense Labs commentedTried to address comment #57
Comment #60
xjm#56 also needs to be addressed. Lots of:
Comment #61
Ghost of Drupal PastLet me explain why I added the config entity back then -- this is not to contradict Alex, of course, this is just something I have not expanded enough on I am afraid: I do not think individuals will create their rules but I was dreaming about the various communities currently working on translations to develop their own rulesets (possibly multiple per language) and if we have a config entity, these are very easy to distribute. If that's where we are headed, it might benefit from localize.drupal.org features , even if it is just a link initially. For this reason I felt having it in core rather than contrib would be beneficial.
The comment I added in @defgroup transliteration still references to the config entity.
Re #57, The Transform Rules Syntax section of UTS #35: Unicode LDML is at http://www.unicode.org/reports/tr35/tr35-39/tr35-general.html#Transform_... I was hesitant adding such a link because I am unsure about the stability of that link. Though I guess having something that might rot is better than nothing. Despite the scary formal header, the text itself is somewhat understandable.
Let me also note the comment " If no rules are specified for a given language, default to :: NFD; :: [:M:] Remove; :: NFC;" is incorrect in the constructor of IntlTransliteration::__construct. Instead, we default to the DEFAULT_REMOVE_DIACRITICS_RULE constant. I suspect I wrote that comment before I fully understood what's going on. The comment on the constant is much better however cross referencing it to the @defgroup transliteration comment would be beneficial.
I believe the original code in testRemoveDiacritics is now superflous and could be removed as providerTestPhpTransliterationRemoveDiacritics will always return PhpTransliteration.
Comment #62
omlx CreditAttribution: omlx as a volunteer commentedis there any update on this?
Comment #64
omlx CreditAttribution: omlx commentedHi @Charlie ChX Negyesi & @alexpott @xjm
What's the way forward about this bug?
thanks
Comment #68
omlx CreditAttribution: omlx commentedAs a workaround I installed Search API module & enabled Database Search as backend, then in processors page under index page I enable Ignore characters and add
\p{M}
to the list.\p{M}
- Unicode property classes matches any diacritic mark.I hope this helps till this bug fixed
Comment #70
andypostIt will affect coming help topics module which is using search index (core one) to search help pages and its translations
Comment #71
catchThis needs an issue summary update - there are at least two proposed resolutions.
Comment #72
Ghost of Drupal PastComment #73
Ghost of Drupal PastI rewrote the IS completely.
Comment #74
Ghost of Drupal PastComment #75
Ghost of Drupal PastComment #76
Ghost of Drupal PastI have been working on this, researching etc and I believe the best we can do is to ignore the current implementation, snapshot what intl does into a simple mapping: https://gist.github.com/chx/ce1d45398996bbadcaf0bd65a61f5902#file-remove... the gist also contains the generator script.
doesn't need intl in core at all. It's less than ten kilobytes and as fast as it gets. It handles the first 8192 (0x2000) characters https://jrgraphix.net/r/Unicode/ which includes most non-Hiragana/Katakana scripts. We still want to provide an alter hook to change the rules and it's really simple because the rules are just an array at this point.
Contrib can add intl as a service and can convert the results of the alter into intl rules to augment the basic one with ease, single character replacements are valid rules for ICU.
The rules mentioned in the current IS are still removed. If we wanted to, we could provide a module which alters in the current rules for near total backwards compatibility -- the component would have these rules removed but the core class would only add rules in this case. Read here why these do not have decomposition rules in Unicode.
Here at the 827 replacements the current proposal would do:
Comment #78
Ghost of Drupal PastI pushed this solution , reading the most important new class is much more pleasant in the blob viewer the diff viewer breaks the map line and it's a bit long then.
Comment #79
Ghost of Drupal PastComment #80
Ghost of Drupal PastComment #81
Ghost of Drupal PastComment #82
Ghost of Drupal PastComment #83
Ghost of Drupal PastComment #84
Ghost of Drupal PastComment #85
smustgrave CreditAttribution: smustgrave at Mobomo commentedFor the CC failure.
Could the change record be expanded to include the new service remove_diacritics and how it might be used? Also the addition of a new module that's deprecated?
Comment #86
Ghost of Drupal PastComment #87
Ghost of Drupal PastUpdated the change notice too and we are green. I will do a rebase.
Minor things like hook documentation, a test (I am guessing the alter hook can be tested, I can't imagine a meaningful test for a class which is literally just an
strtr
call) and deprecating the existing method is not written yet, but that shouldn't block a review. Most importantly: are we OK with this approach?Comment #88
Ghost of Drupal PastThe fun continues in https://www.drupal.org/project/remove_diacritics it at least provides the necessary functionality.
I most certainly won't work on this issue (or any other core issue, I was a fool to do this one but I couldn't bear watching it die, I just wish I had the idea to continue in contrib sooner) and given the amount of contributions to this in the last near three years I guess this should be closed again but I do not have the permissions to do so.
Comment #89
needs-review-queue-bot CreditAttribution: needs-review-queue-bot as a volunteer commentedThe Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".
This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #91
kostyashupenkoRebased
Comment #92
smustgrave CreditAttribution: smustgrave at Mobomo commentedWill need test coverage for sure.
Also will also need framework manager sign off to add a new module believe.
After all that a change record will be needed.