Searches for words with diacritics/accents: word not highlighted in results [#731298]

Problem/Motivation

Search currently indexes words with their diacritics/accents included. Due to database collation, when you search with or without the accents present, you'll normally get the same results, because for instance in MySQL with the usual Drupal database collation, the Latvian word "meklēt" would be equal to "meklet" in a query. This is probably what most people would expect in searching, so that is good... but depending on database collation choices, it may not be true in all sites.

However, if your text has "meklēt" in it, and your search term is "meklet" without the accent on the e, you will not have "meklēt" highlighted in the excerpt in the search results (or vice versa, text with "meklet" and searching for "meklēt"). This is because the highlighting is done with PHP == matching, which knows that e is not equal to ē, while the search is done with MySQL = matching (which depends on the database collation, which normally ignores diacritics and thinks e is the same as ē).

Proposed resolution

The best way to fix this is to strip diacritics/accents before indexing, as part of search_simplify(). This would mean that:

a) The database tables for search would not include diacritics.

b) Searches would match accented/unaccented letters, regardless of database collation in use (currently it works this way with the usual database collation, but not necessarily if someone has changed their database collation).

c) Since highlighting works with search_simplify() processing, the matched terms would be properly highlighted. They aren't being highlighted consistent with search matches currently.

d) We might want to make this feature optional, but on by default. We could also give the user 3 options:
- Remove accents/diacritics (default)
- Do nothing and let the database collation determine matching.
- Full transliteration (see comment #17 for discussion)
The implications of these options would need to be explained on the Search settings page and probably also in search_help(), because most people probably wouldn't even know how to change the database collation or what to change it to. But probably we don't need the option. People do expect search to work this way, and in practice, it normally is working that way anyway due to database collation (see discussion above), just not fully working because highlighting is broken.

Remaining tasks

Make a patch, with tests.

Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue category	Bug because highlighting of search terms should work if search finds matches, and it currently doesn't. Potentially affects all site visitors who use Core Search in languages with diacritics.
Prioritized changes	This is a user-facing bug, so it is a prioritized fix.
Disruption	No APIs will change. The only possible disruption is that the contents of the search database tables will be stripped of diacriticals/accents, but since normal database collation = queries ignore diacriticals anyway, this should not cause any disruption.

User interface changes

Highlighting would work when search matching returns results.

There would possibly also be an option on the Search settings page to turn on/off how the indexing works with accents, so that someone could hypothetically use database collation that doesn't match accents if they wanted to, and turn this off... but probably this is unnecessary, and it might be confusing (see Proposed Resolution).

API changes

The search indexing process would be removing accents (possibly only when the option is set), so the contents of the search database tables would change.

Original report by @matulis

In my case, Drupal 6.15, Latvian language, diacritics widely represented (ā, š, ū, ķ, ž, ē, ģ, č, ī, ļ, ņ).
Search keyword "meklēt", search module finds all nodes with this keyword and wraps keyword "meklēt" in bold tags.
Search keyword "meklet", same nodes found but "meklēt" not in bold anymore.
So, if Drupal search module ignores diacritics in text (transliterates them "ē"->"e" and so on), both search keywords should be wrapped in bold on search results page: with and without diacritic signs.
Just my thoughts.

Comment	File	Size	Author
#51	search-731298.patch	10.62 KB	jhodgdon
#51
#45	searches_for_words_with-731298-45.patch	7.5 KB	Anonymous (not verified)
#45
#45	searches_for_words_with-731298-45.test-fail.patch	3.12 KB	Anonymous (not verified)
#45
#45	interdiff-40-45.txt	1.47 KB	Anonymous (not verified)
#40	searches_for_words_with-731298-40.patch	10.61 KB	Anonymous (not verified)
#40
#40	searches_for_words_with-731298-40.test-fail.patch	3.12 KB	Anonymous (not verified)
#40
#39	searches_for_words_with-731298-39.patch	7.49 KB	Anonymous (not verified)
#39
#39	interdiff-36-39.txt	7.8 KB	Anonymous (not verified)
#36	interdiff-34-36.txt	505 bytes	Anonymous (not verified)
#36	searches_for_words_with-731298-36.patch	7.61 KB	Anonymous (not verified)
#36
#34	searches_for_words_with-731298-34.patch	7.79 KB	Anonymous (not verified)
#34
#34	interdiff-30-34.txt	6.38 KB	Anonymous (not verified)
#30	interdiff-23-30.txt	5.73 KB	Anonymous (not verified)
#30	searches_for_words_with-731298-30.patch	4.86 KB	Anonymous (not verified)
#30
#28	interdiff-23-28.txt	5.73 KB	Anonymous (not verified)
#28	searches_for_words_with-731298-28.patch	5.73 KB	Anonymous (not verified)
#28
#23	searches_for_words_with-731298-23.patch	3.73 KB	Anonymous (not verified)
#23
#13	searches_for_words_with-731298-13.patch	1.02 KB	Anonymous (not verified)
#13

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2010 at 18:45

Version:

6.15

» 7.x-dev

You are correct. This issue needs to be fixed in Drupal 7, and then backported to Drupal 6. There are multiple problems with the search excerpt function, and this is certainly one of them...

Comment #2

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2010 at 19:09

An older issue that reported something similar, which I've just marked as a duplicate of this issue:
#157684: Search module does not highlight non-ASCII search terms

Comment #3

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2010 at 19:11

Title:

Search keyword is not wrapped in bold tags on search results page if diacritics are used

» Search excerpt doesn't highlight words with diacritics

Comment #4

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2010 at 19:21

I changed my mind. That other issue said that search didn't find words with diacritics. This issue says (and it's correct, I just tested) that search will find words with diacritics but not highlight them in the (mostly broken) excerpt function.

Anyway, I've confirmed that this is a current issue in Drupal 7.

To confirm:

1) Put the following text in a node (yes, it's probably mixing languages and no, I don't know what it means really):

Det trodde jag aldrig (eller jo, kanske ändå ;-)). Min första seriösa bloggpostning. Det vill säga en skrivelse som praktiskt taget ingen bryr sig ett dyft om. Det är ett obevisat faktum att 90% av allt som skrivs i blogga.

meklēt

2) Search for meklēt, seriösa etc. and everything works fine.

3) Search for meklet, seriosa, and search will find the node but not highlight the words in the excerpt.

Comment #5

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2010 at 19:23

Title:

Search excerpt doesn't highlight words with diacritics

» Search excerpt doesn't highlight words with diacritics when not entered in search box

Comment #6

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2010 at 19:33

May be related to #493270: search_excerpt() doesn't work well with stemming as well, or it may be possible to fix that other issue at the same time.

The basic problem is that the search excerpt function only looks for exact textual matches between the text of the node (or whatever is passed into the function, generally a node) and the keywords. However, search preprocesses keywords before doing searching, so there are many cases where the keywords will not exactly match what's in the node text, and search_excerpt() therefore won't highlight the keyword even though it found the node.

Comment #7

mcarbone CreditAttribution: mcarbone commented 19 September 2010 at 10:12

Title:

Search excerpt doesn't highlight words with diacritics when not entered in search box

» Words should be indexed without diacritics, rather than relying on database collation

Actually, there is no preprocessing being done to "seriösa" causing "seriosa" to match -- this is happening because of the collation of the search_index table in the database. (To see this, go ahead and inspect the db and you'll see that "seriösa" was indexed as is, yet select * from search_index where word = "seriosa" still returns that row. (See also: http://bugs.mysql.com/bug.php?id=37413)

I think the solution to this problem is to remove diacritics in search_simplify explicitly so that words are indexed without them, which means we'll get diacritic-insensitive matches regardless of the collation of the search_index table and whatever db is being used. Once that's done, the solution to #916086: search_excerpt() doesn't highlight words that are matched via search_simplify() will ensure that words with diacritics are highlighted.

Comment #8

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 October 2010 at 22:00

Status:

Active

» Closed (duplicate)

The issue is the search excerpt function doesn't work well. This is a duplicate of #493270: search_excerpt() doesn't work well with stemming

Comment #9

mcarbone CreditAttribution: mcarbone commented 20 October 2010 at 14:33

Status:

Closed (duplicate)

» Active

Re-opening. See: http://drupal.org/node/493270#comment-3596818

Comment #10

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 20 October 2010 at 16:53

Version:	7.x-dev	» 8.x-dev
Category:	bug	» feature
Priority:	Minor	» Normal

I personally think this is a feature request, and a rather large change in how the search module works. As such, I think we can't really do it at this stage in D7 development, and it should be pushed off to Drupal 8 for consideration. Feel free to make an argument to move it back to D7 if you think I'm wrong...

Comment #11

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 30 October 2014 at 17:09

Title:	Words should be indexed without diacritics, rather than relying on database collation	» Searches for words with diacritics/accents: word not highlighted in results
Category:	Feature request	» Bug report
Issue summary:	View changes

I just tested this issue by making a node with body meklēt

This is still a problem in Drupal 8: searching for either meklēt or meklet finds the node, but in the latter case, the word is not highlighted on the page.

I think this is a bug rather than a feature request, and we should fix it.

Comment #12

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 December 2014 at 20:16

Issue summary:

View changes

Updated summary

Comment #13

Anonymous (not verified) CreditAttribution: Anonymous commented 28 February 2015 at 18:44

Assigned:	Unassigned	»
Status:	Active	» Needs review
Issue tags:		+Needs tests

File	Size
searches_for_words_with-731298-13.patch	1.02 KB

We can use the transliteration service to achieve this. The attached patch demonstrates this approach.

I'll write a test later on.

Comment #14

28 February 2015 at 19:01

Status:

Needs review

» Needs work

The last submitted patch, 13: searches_for_words_with-731298-13.patch, failed testing.

Comment #15

Anonymous (not verified) CreditAttribution: Anonymous commented 28 February 2015 at 19:45

I guess this was not the right approach :) Any suggestions?

Comment #16

Anonymous (not verified) CreditAttribution: Anonymous commented 1 March 2015 at 11:56

Assigned:

» Unassigned

Comment #17

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 March 2015 at 15:18

Issue summary:

View changes

Hi @pjonckiere, thanks for the patch!

Transliteration is quite a bit more than just removing diacritics. For instance, if you transliterate Chinese text, you end up with something like "ni hao" instead of the actual characters, and there are some language-specific settings as well, such as in some languages u-with-umlaut is transliterated into "ue" while in others it is just "u".

So, transliteration is not quite the right approach here. What I think we would want to do instead is to give the user the option (on the search settings page) to do one of the following:
- Use database collation (i.e. do nothing, as it is now)
- Remove accents/diacritics (which should be the default)
- Complete transliteration
We'll need to explain what the implications of these three choices are, on the settings page and probably in search_help(). [I added a note about this to the issue summary, which actually already said it should maybe be optional]

And then we'd also need to figure out how to remove diacritics reliably.

One way I know of is to use the "intl" package, but this is package is not available on many web servers by default, so when we were putting Transliteration in Core, we decided not to use it. See discussion on #567832: Transliteration in core for more on that.

If you do a Google search, you'll also find suggestions to use iconv() http://se2.php.net/manual/en/function.iconv.php -- but this is not OK because it will either transliterate or silently remove characters that are non-ASCII, neither of which is what we want.

But I found what looks like a good solution on this page:
http://stackoverflow.com/questions/3371697/replacing-accented-characters...

So I found this on php.net page for preg_replace function

// replace accented chars

$string = "Zacarías Ferreíra"; // my definition for string variable
$accents = '/&([A-Za-z]{1,2})(grave|acute|circ|cedil|uml|lig);/';
$string_encoded = htmlentities($string,ENT_NOQUOTES,'UTF-8');
$string = preg_replace($accents,'$1',$string_encoded);

Something like this should work... I think we'd want to do:

if (... the setting is to remove diacritics ...) {
   $text = htmlentities($text, ENT_NOQUOTES,'UTF-8');
   $accents = '/&([A-Za-z]{1,2})(grave|acute|circ|cedil|uml|lig);/';
   $text = preg_replace($accents, '$1', $text);
   $text = entity_decode($text);
}

Want to try it out?

Comment #18

Anonymous (not verified) CreditAttribution: Anonymous commented 2 March 2015 at 15:56

Assigned:

Unassigned

Thank you for the extensive feedback.

The html entities are a clever trick here. I guess we'll want to add tilde (ã) and slash (ø) as well.

I'll give it a try.

Comment #19

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 March 2015 at 16:37

Sounds good!

Take a look at these pages, which lists some additional things:
http://symbolcodes.tlt.psu.edu/web/codehtml.html
http://www.danshort.com/HTMLentities/index.php?w=latin
http://www.starr.net/is/type/htmlcodes.html

Looks like we should do:
grave
acute
circ
cedil
uml
lig
tilde
slash
ring
caron

Sigh. But the case in the issue summary would still not work, because there is not a named HTML entity for it... sigh.

So. Given that... I'm wondering if we should try a different approach. We actually already have tables in the Transliteration component for transliterating various characters to ASCII. The accent removal is specifically for some lower-numbered ASCII characters. Maybe the right approach would be to make a new method on the Transiteration interface and \Drupal\Component\Transliteration class that would just do diacritical mark removal, but skip all the other stuff? It would look similar to the transliterate() method, but be limited to certain character ranges and not do any of the language-specific stuff. Thoughts?

One other note: we don't have to worry about capitalization in search simplify... so if we stick with the HTML entities approach, we do not need to do A-Z, just a-z.

Comment #20

Anonymous (not verified) CreditAttribution: Anonymous commented 2 March 2015 at 18:56

Adding a method could indeed be a valid solution, but I think it would be confusing for developers. Furthermore, doesn't that imply an api change which we cannot do due to the beta phase?

Two alternate solutions I can think of:
- Write a "lite" version of a class implementing TransliterationInterface.
- Do something like in #13, but only call transliteration if you are sure it's a low ascii character.

I seem to tend to the last one and thus further expanding on #13, since it will be the least disruptive and it wouldn't be too dirty. However, I'm not sure if it's feasible. I'll have to investigate that.

Comment #21

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 March 2015 at 20:29

If we need it to fix a bug, we can do API changes. I'm one of the maintainers of the Transliteration system, by the way. ;)

I guess the thing to do, to avoid any developer impact from adding a method to an interface, would be to make a second interface in the same namespace, with a removeAccents() method only. Then we can implement both of the interfaces in the base PhpTransliteration component class -- no API problems, no impact, definitely OK to do at this phase. And the method could be useful for people, outside of the context for Search. Also, it requires knowledge of the UTF-8 character set similar to what is needed by the existing Transliterate class/interface, so I think it belongs there.

I'm not sure what you mean about adding a method to the Transliteration class being confusing to developers? Transliteration is a very broad operation, as discussed above. Removing diacritics is a very narrow operation. We'll need to document the methods, but I think they can be documented in such a way that it's not confusing which one does what.

And we absolutely do not want to call the existing transliterate() method from Search by default for any characters -- it does a lot more than just removing accents, and we really only do want to remove accents, not transliterate.

Comment #22

Anonymous (not verified) CreditAttribution: Anonymous commented 2 March 2015 at 21:49

If we need it to fix a bug, we can do API changes. I'm one of the maintainers of the Transliteration system, by the way. ;)

Yes, I've noticed that in the issue you linked in #17. Also, you pushed my first D8 patch. I know I'm in good hands here :D

I guess the thing to do, to avoid any developer impact from adding a method to an interface, would be to make a second interface in the same namespace, with a removeAccents() method only. Then we can implement both of the interfaces in the base PhpTransliteration component class -- no API problems, no impact, definitely OK to do at this phase. And the method could be useful for people, outside of the context for Search. Also, it requires knowledge of the UTF-8 character set similar to what is needed by the existing Transliterate class/interface, so I think it belongs there.

As you explained, removing accents is actually a smaller part of transliteration so - if it's allowed - it makes sense to add it in the existing interface thus following your proposition in #19.

I'm not sure what you mean about adding a method to the Transliteration class being confusing to developers? Transliteration is a very broad operation, as discussed above. Removing diacritics is a very narrow operation. We'll need to document the methods, but I think they can be documented in such a way that it's not confusing which one does what.

I think that reaction might have been due to my limited understanding of transliteration (French and German urls). Getting the bigger picture was important to fully understand what you were saying. Now I can follow you here.

And we absolutely do not want to call the existing transliterate() method from Search by default for any characters -- it does a lot more than just removing accents, and we really only do want to remove accents, not transliterate.

Agreed!

I looked through the code that is involved in this, and I think I can pull this off so I'll have a go at it if I find some more time (probably Wednesday).

If this would work, do we need to alter the beta evaluation with an api addition? We probably want a change record as well?

And thank you for bearing with me!

Comment #23

Anonymous (not verified) CreditAttribution: Anonymous commented 2 March 2015 at 22:55

Status:

Needs work

» Needs review

File	Size
searches_for_words_with-731298-23.patch	3.73 KB

So I got a bit excited and made a first attempt. Let's see what breaks.

Some remarks:
- The code range I used is a rough estimate. I probably will have to further specify that.
- I find it strange to have an unknown character replacement when we can figure out exactly what characters we can handle in this method. I mean to remove it when I fix the remark above.
- I omitted the $max_length variable since it just truncates the string without breaking transliteration. I don't think that we want to add it in the new method.

Comment #24

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 3 March 2015 at 01:33

Wow, fast work! Thanks again for taking this on!

This is really close. A few thoughts:

1. Yes we will need to update the issue summary and make a change notice. Possibly two change notices... We'll need one for the Transliteration interface, saying there is this new method and noting that it's implemented on the base class and should never need to be overridden. I think we would also need to make a change notice for the Search module explaining that there's a new option for removing diacritics, enabled in the default Search module settings.

2. On the new replaceDiacritics method:

a) I think we should call it removeDiacritics, and I think the first-line summary in the docs should be:

Removes diacritics (accents) from certain letters.

(by the way, first lines of function docs need to be **one-line** summaries, not wrapping to the next line and not exceeding 80 characters) (I'm also the API docs maintainer. Lucky you! :) ).

The rest of the docs look good, except the return value. I think it should say something like:

$string, with accented letters replaced by their unaccented equivalents.

And then we should have a paragraph before the @param section that explains in detail which accented characters this works on, something like:

This only applies to certain letters:
- Accented Latin characters like a-with-acute-accent, in the UTF-8 character range of ... to ...
- ...

b) Agreed it doesn't need unknownCharacter. I think it also doesn't need a language input -- this is not a language-dependent method. It should just have one argument, $string.

c) Definitely, we need to figure out which character ranges this applies to. I've done some research on this in the past, so let me know if you want help figuring that out... See #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance, which is where I verified and updated the data tables that the Transliteration system in Drupal is actually using. You may also find http://www.unicode.org/charts/ useful -- check the Latin section. For instance, this Latin 1 page: http://www.unicode.org/charts/PDF/U0080.pdf has a chart of the common accented Latin letters. It shows what the letters look like, and below each one is the Hex code. So, we'd need those, plus some others used in Czech and some other languages. These seem to be in the "supplement" sections.

Although... I have to say that looking at the charts, I am not sure it will be easy to decide which characters to use. It might actually be easy to look at the scripts in #1823454: Verify transliteration data sources and their quality, and potentially eliminate maintenance that use "pecl" and make a new set of data that does a "transform" that just removes diacritical marks, save those as a new data set, and use those instead of the transliteration data set that we have. I'm not sure... any thoughts? I can help with that if you think that would be better.

Also, just a thought, it is probably helpful to use 0x notation for the character codes. So the range for those Latin 1 ones goes from 0xC0 to 0xFF, for instance. May make the code more readable.

3. The call to replace diacritics should only be in search_simplify(). search_index() already calls this (indirectly, as part of search_index_split()).

4. We'll need a test for the actual bug in this issue. The issue summary has a "reproduce" that suggests having a node with meklēt in it, searching for "meklet" without the accents, and verifying that it gets highlighted in the results. Or it could get added to existing tests for excerpt highlighting. If you need help figuring out where to add the tests, let me know and I can take a look.

Comment #25

Anonymous (not verified) CreditAttribution: Anonymous commented 4 March 2015 at 18:23

Status:

Needs review

» Needs work

Thank you for your feedback. I'll come back to the other points raised in #24, but I have a question on 4: wouldn't we also want unit test coverage added to PhpTransliterationTest? And building on that, can't we just use such a test to give us the ascii codes we need?

Comment #26

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2015 at 18:47

Yes, we would want the new transliteration method to be tested too. But we need to have a test in the Search module for the bug that was reported here also.

I'm not sure what your other question means: "And building on that, can't we just use such a test to give us the ascii codes we need?"?

Comment #27

Anonymous (not verified) CreditAttribution: Anonymous commented 4 March 2015 at 19:27

We have a unit test that can check all characters (PhpTransliterationTest), and we have a method that returns ascii codes (PhpTransliteration::ordUTF8()). I will try to use that to my advantage in finding the full list of codes we want to handle.

It was more of a (lazy developers) brain twist than a question I guess. If I'll get it to work, that is. Otherwise I'll do the actual research of course.

Comment #28

Anonymous (not verified) CreditAttribution: Anonymous commented 4 March 2015 at 22:25

Status:

Needs work

» Needs review

File	Size
searches_for_words_with-731298-28.patch	5.73 KB

interdiff-23-28.txt	5.73 KB

I worked on the unit test a bit, but was unsure on how to proceed.

E.g. the string 'àáâäæãåā' (the ones provided by Apples autocomplete when I switch to qwerty) would translate to:

if (($code > 0x00df && $code < 0x00ef) || $code == 0x0101) {

The lonely check being the 'ā' which should be covered as demonstrated by the issue summary.

A quick search learns that in a close range from that (i.e. only separated by their uppercase variants), we would miss 'ă' and the more exotic 'ą'.

Where do we put the line? Or could I just range 0x00bf to 0.017f which would include some undesired(?) characters like ×, ÷ or þ? I would vote the latter since the list would be complete and only has some characters that might be undesired. And these are edge cases really. Furthermore, in the issue you linked, ranges seem to be the way to go (as opposed to checking for single characters).

Fwiw, I'm currently using http://unicode-table.com/ as source since I find it has a good overview of all characters we need in a single and readable table.

The patch is my work in progress.

As for #24 my notes say (I'd write it out, but it's getting a bit late here):
1. todo CRs
2. a. ok, todo document range
b. ok
c. todo find range
3. ok
4. todo test coverage (both unit and web)

I'll definitely revisit that comment later on to verify that I didn't miss anything.

Comment #29

4 March 2015 at 22:27

Status:

Needs review

» Needs work

The last submitted patch, 28: searches_for_words_with-731298-28.patch, failed testing.

Comment #30

Anonymous (not verified) CreditAttribution: Anonymous commented 4 March 2015 at 22:37

Status:

Needs work

» Needs review

File	Size
searches_for_words_with-731298-30.patch	4.86 KB

interdiff-23-30.txt	5.73 KB

I made an error in the patch file in #28. Let's try again.

Comment #31

Anonymous (not verified) CreditAttribution: Anonymous commented 4 March 2015 at 22:36

2 files were hidden/shown/deleted

File	Size
searches_for_words_with-731298-28.patch	5.73 KB

interdiff-23-28.txt	5.73 KB

Comment #32

4 March 2015 at 22:53

Status:

Needs review

» Needs work

The last submitted patch, 30: searches_for_words_with-731298-30.patch, failed testing.

Comment #33

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2015 at 23:12

Hm. A few thoughts:

a) The generic Transliteration class needs to cover both lower-case and upper-case letters. Search doesn't need this, because it lower-cases everything.

b) I kind of think we should do it right/carefully. So maybe we can use ranges with exceptions? So we'd need:
00C0 - 017E, except for:
00D0 [Eth]
0097 [multiply]
00DE [Thorn]
00F0 [eth]
00F7 [divide]
00FE [thorn]
0138 [kra]
014A [Eng]
014b [eng]

Then there is another range at 01CD - 024F except:
01DD [backwards e]
01EE, 01EF [Ezh]
01F6, 01F7
012C, 012D
0220-0223
0241-0242
0245

I also think we should exclude anything that transliterates to more than one character, like the AE ligature. And exclude anything that transliterates to 0 characters or doesn't have a transliteration.

I think that would be good enough?

Looking at the patch, I have another suggestion: Maybe rather than calling replace() and passing in a "not a language" code, it would be better to make a protected method called lookupReplacement(), or something like that, on the PhpTransliteration class that would do the part of replace() starting at:

   // See if there is a generic mapping for this character.
    $bank = $code >> 8;

And then our new replaceDiacritics() method could have logic like:

if ( ... $code is in our ranges and not one of the exceptions) {
  // passing in "xyz" for unknown character so it will fail the test below
 $newcode = $this->lookupReplacement($code, "xyz");
  if (strlen($newcode) == 1) {
    // replace it
  }
}

Comment #34

Anonymous (not verified) CreditAttribution: Anonymous commented 5 March 2015 at 23:22

Status:

Needs work

» Needs review

File	Size
interdiff-30-34.txt	6.38 KB
searches_for_words_with-731298-34.patch	7.79 KB

While evaluating the ranges, I expanded the unit test. I added the two ranges, and excluded the characters you proposed as well as the double byte characters (Æ ß æ Ĳ ĳ Œ œ ŉ Ǣ ǣ).
Fwiw, we definitely cover all Dutch, French and German single byte diacritics I can think of. But judging from the characters we handle now, they don't use the more exotic ones :)

After finishing the ranges, I refactored the methods to use a common lookupReplacement() method (DRY, I like that). I did however leave out the length check, since I don't understand why we would do that. The list of replaced characters is definite, so we should always get an answer. If not, it's a bug and we should reevaluate the range selection.

I tried writing some documentation, but that doesn't come too easy (yet). I could use some suggestions (or a similar example) for that.

Todo:
- work on any feedback that comes
- write better documentation, especially for the method definition in the interface
- expand SearchExcerptTest to test this issue
- write two CRs

Edit: and some clean-up is needed as well

Comment #35

5 March 2015 at 23:37

Status:

Needs review

» Needs work

The last submitted patch, 34: searches_for_words_with-731298-34.patch, failed testing.

Comment #36

Anonymous (not verified) CreditAttribution: Anonymous commented 6 March 2015 at 10:33

Status:

Needs work

» Needs review

File	Size
searches_for_words_with-731298-36.patch	7.61 KB

interdiff-34-36.txt	505 bytes

The test failure is caused by the use statement for Drupal\Core\Language\LanguageInterface; which we shouldn't do in the component namespace. Due to the refactoring in previous patch it's unneeded, so I'm simply removing it.

Comment #37

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 March 2015 at 14:54

Status:

Needs review

» Needs work

My idea on the length check was that *just in case* the data tables change, if we say "Do this whole range, but then make sure the result is exactly 1 character" then if something changes in the data tables, we will at least make sure that we are not doing something odd. And also if you do that, you don't have as many exclusions, because for instance the a-e stuck-together letter looks up to "ae", so you say "NO!" in the length check. And also to make sure that we don't get back "unknown character".

So... Anyway, great work and can I just say again THANK YOU for working on this!

This patch is obviously pretty close! Here are my suggestions for documentation, cleanup, and some minor code changes:

a) When calling the new replaceCharacters() method from removeDiacritics, pass in $unknown_character = 'xyz'.

b) Do the length check -- make sure the replacement is exactly one character; if not, don't do a replacement. If we do that, we don't have to have 00C0 (elided AE) and other characters like that in the list of exclusions, so the routine becomes more efficient.

c) Add a comment to the $exclusions lines in removeDiacritics() with a short explanation of why there are exclusions. Something like:

These two Unicode ranges include the accented US-ASCII letters, with a few characters that aren't accented letters mixed in. So define the ranges and the excluded characters.

d) Coding standards, in PhpTransliteration:

+      // unicode range 0x00bf to 0x017f

All in-code comments should be complete sentences, starting with capital letters and ending with . But... really I don't know that we even need this comment. Comments in code should explain *why*, not tell us *what* the code is doing. So if we include the comment I suggested in (c), we can probably get rid of these two "unicode range" comment lines.

e) Spelling, same class:

+  /**
+   * Lookup the replacement for a certain UTF-8 character code.

Lookup is a noun. The verb is "Look up".

f) In this line, I also think we need to say "Look up the generic replacement", just to clarify that it's not the final answer. Take out the word "certain" too.

g) Coding standards, same class:

+   * @param $code
+   *  The UTF-8 character code.
+   * @param string $unknown_character
+   *  (optional) The character to substitute for characters without transliterated
+   *  equivalents.
+   *
+   * @return string
+   *  US-ASCII replacement character. If it has a mapping, it is returned;
+   *  otherwise, $unknown_character is returned.

There should be 2 spaces of indent under @param and @return.

h) Same section... I think the docs for $unknown_character shouldn't mention "transliterated equivalents", maybe say "without entries in the replacement tables"?

And maybe on the @return we should document that it could be more than one character? The wording sounds like it's always one character. We should fix that on the replace() method too, come to think of it.

i) in the TransliterationInterface:

+   * 01CD to 024F. Replacements that would result in zero or two characters are
+   * excluded as well as characters that would need afull transliteration
+   * (e.g. Eth, Multiply, Divide etc.).

Can we change this last sentence to say:

Replacements that would result in the string changing length are excluded, as well as characters that are not accented US-ASCII letters.

j) PhpTransliterationTest:

+   * @param string $expected
+   *  The expected return from PhpTransliteration::removeDiacritics().

See (g).

k) And a few lines down:

+//    throw new \Exception($result);

Take this line out. :) Nice test by the way, on the transliteration! Very thorough!

l) A few lines down in the test:

+      // unicode range 0x00bf to 0x017f

See (d)

m) We still need a test for the Search issue that this is fixing. But you knew that. :)

n) The change in the Search module is incorrect. It needs to be in search_simplify(), and not in search_index(). See comment #24 item 3. [and also the second change notice, which I think explains the process -- see below]

o) And we need change records... I can help with that -- added:
https://www.drupal.org/node/2447327
https://www.drupal.org/node/2447357

Comment #38

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 March 2015 at 15:01

Issue summary:

View changes

Edited the issue summary slightly.

Comment #39

Anonymous (not verified) CreditAttribution: Anonymous commented 6 March 2015 at 18:04

Status:

Needs work

» Needs review

File	Size
interdiff-36-39.txt	7.8 KB
searches_for_words_with-731298-39.patch	7.49 KB

So... Anyway, great work and can I just say again THANK YOU for working on this!
Well, I'm learning a lot by working on this issue, so thank you for sharing your knowledge :)

a) Ok.
b) Ok. This makes a lot of sense after reading your explanation. I didn't think of that when looking at #33. And this is an incredible finetuning (mind = blown)!
c) Ok.
d) Ok.
e) Ok. TIL
f) Ok.
g) Ok.
h) Ok.
i) Ok. Better.
j) Ok. And not a single space was overlooked that day :D
k) Hah! The best thing I could find when I found out assertSame() and assertEquals() don't spit out why they failed. Shout-out to @geertvd for the suggestion!
l) Ok. I made them full sentences since I think it might come in handy for people looking at this test.
m) Todo
n) Ok. I misread #24.3 completely and didn't test this properly yet. That's what m) is for.
o) Thank you. Thank you. Thank you!

Next up, let's make a test fail! And then, let's make it green again.

Comment #40

Anonymous (not verified) CreditAttribution: Anonymous commented 6 March 2015 at 20:22

File	Size
searches_for_words_with-731298-40.test-fail.patch	3.12 KB

searches_for_words_with-731298-40.patch	10.61 KB

This could do. I looked at the SearchExcerptTest, but thought it was not the best place to put it. Since I couldn't find a better place, I created a new one.

The test-fail is an interdiff 39-40.

Comment #41

6 March 2015 at 20:39

The last submitted patch, 40: searches_for_words_with-731298-40.test-fail.patch, failed testing.

Comment #42

Anonymous (not verified) CreditAttribution: Anonymous commented 8 March 2015 at 21:14

Do we still want to give the users a choice (cf. #17)? I think that would be more of a feature than part of fixing the bug?

Comment #43

Sutharsan CreditAttribution: Sutharsan commented 8 March 2015 at 21:28

Issue tags:

-Needs tests

Removing tag as the test is available and working.

Comment #44

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 9 March 2015 at 15:50

Hi pjonckiere -- this is looking great!

I took a look at the test. It is excellent! It verifies both the desired behavior and that certain characters, like þ, are not transliterated. Very nice work!

The other changes look good too. There are only a couple of very small cosmetic (coding standards) things to fix:

a) In PhpTransliteration::removeDiacritics() - comment lines, including in-code // comments and /** */ docs comments, should never exceed 80 characters in the line. The added comment about the Unicode range is loner than 80 characters. [Code lines are allowed to be longer.]

b) In the docs header for the lookupReplacement() method:

+   *   (optional) The character to substitute for characters without entries in
+   *   the replacement tables

Last line needs to end in .

That is it! Let's just fix these two things and it is ready to go!

Comment #45

Anonymous (not verified) CreditAttribution: Anonymous commented 9 March 2015 at 17:45

File	Size
interdiff-40-45.txt	1.47 KB
searches_for_words_with-731298-45.test-fail.patch	3.12 KB

searches_for_words_with-731298-45.patch	7.5 KB

Feedback from #44.

Comment #46

9 March 2015 at 18:02

The last submitted patch, 45: searches_for_words_with-731298-45.test-fail.patch, failed testing.

Comment #47

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 9 March 2015 at 18:03

Status:

Needs review

» Reviewed & tested by the community

Excellent!

As a co-maintainer of both Search module and Transliteration component, I heartily endorse this patch. Change records and beta evaluation have been done, and the issue summary is current. Let's fix this bug.

The only possible question is whether it should be an option (on by default) or whether it should just be an always-on feature. See issue summary for discussion. This patch makes it an always-on feature, which I think is the right thing to do -- people expect this from search, and I think it would just lead to confusion to provide the option to turn it off. It's also not an expensive operation. But if the branch maintainers disagree, we can definitely make it into a Search setting.

Comment #48

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 11 March 2015 at 13:55

Issue tags:

+D8MI, +language-content

Tagging retroactively.

Comment #49

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 11 March 2015 at 13:58

Issue tags:

+sprint

Comment #50

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 14 March 2015 at 13:37

Status:	Reviewed & tested by the community	» Needs review
Issue tags:		+Needs tests

As far as I can see we don't have test coverage for the fact that when you search for "meklet" nodes that contain "meklēt" are found and "meklēt" is highlighted. Shouldn't we be adding this?

Comment #51

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 March 2015 at 14:08

Status:

Needs review

» Reviewed & tested by the community

File	Size
search-731298.patch	10.62 KB

Yeah, sorry. I didn't notice. These tests were in the patch in #40 and several things got dropped when the patch was remade in #45.

So here is a new patch. This is the patch in #40, with the interdiff 40-45 applied properly. All of this has been previously reviewed, so should be RTBC for real this time.

Comment #52

Anonymous (not verified) CreditAttribution: Anonymous commented 17 March 2015 at 19:23

Hmm, no idea how this happened. Thanks for catching it @alexpott!

And thanks for the merged patch @jhodgdon!

Comment #53

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy at Acquia commented 18 March 2015 at 17:34

Issue tags:

-Needs tests

Comment #54

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 20 March 2015 at 14:45

Status:

Reviewed & tested by the community

» Fixed

This issue is a normal bug fix, and doesn't include any disruptive changes beyond the api addition to the transliteration service and I don't think that overriding this is common at this point, so it is allowed per https://www.drupal.org/core/beta-changes. Committed ba500d4 and pushed to 8.0.x. Thanks!

Comment #55

20 March 2015 at 14:46

alexpott committed ba500d4 on 8.0.x

Issue #731298 by pjonckiere, jhodgdon: Searches for words with...

Comment #56

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy at Acquia commented 20 March 2015 at 15:22

Issue tags:

-sprint

Yay, thanks!

Comment #57

3 April 2015 at 15:24

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Comment #58

ataimist CreditAttribution: ataimist commented 7 March 2017 at 09:56

Sorry to comment on an old issue, but regarding #47 and an option to turn this off - I think this at least is definitely needed. I created a new issue, https://www.drupal.org/node/2858337, to explain why, but basically removing diacritical marks completely changes word meanings in several languages. Some examples: Finnish säde 'ray' -> sade 'rain', Swedish vår 'spring', 'our' -> var 'each', Italian purè 'mashed potatoes' -> pure 'also', Estonian olu 'existence' -> ölu 'beer'. Likely to be an issue in other Scandinavian languages as well (e.g. Danish, Norwegian), possibly also in German.

Comment #59

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 7 March 2017 at 15:16

Understood. The other question is whether people would be sloppy: would they always type in the word with the accent, or would they occasionally search from a keyboard without all the accents, and not type the accent when searching? Also, what does Google do when searching in Swedish or Finnish -- they seem to set the standard for what is expected in searches.

In any case, this issue was about highlighting in results, not search matching -- the core Search module has always found matches irrespective of diacriticals. This issue was just about making the highlighted excerpt consistent with the search algorithm.

If you would like to get either the matching or the highlighting turned into optional, what you should do at this point is to file a new issue. Thanks!

Comment #60

ataimist CreditAttribution: ataimist commented 7 March 2017 at 21:31

Ah, you're correct; search does indeed return results with and without diacritics, diacritics are just removed before hook_search_preprocess sees the search input. I'm working on stemmers, so this is problematic, but it's a separate issue.

Regarding your question, there are a lot of cases in a lot of languages (not just Scandinavian ones, but e.g. German and French too) where you simply can't drop the accent even if you are feeling lazy, because then you'd be searching for something else entirely. For Finnish, changing Ä into A is no more correct than changing it into I (funnily enough: säde 'ray', sade 'rain', side 'bandage'). Getting results for säde and sade - two completely unrelated words - is unexpected. Google doesn't mix the two, either.

Anyway, sorry for the confusion and thanks for the pointers, I'll amend my other issue and probably file another regarding matching in general.