Strip diacritics from indexed content and query strings. [#2083625]

The following searches do not work with the default schema.xml:
- Bełżec Museum: search for "Belzec"
- Congress Hall in Rožnov pod Radhoštěm: search for "Roznov" and "Radhostem"
- House on the Červený Kopec Hill: search for "Cerveny" (Kopec is fine)

However, the addition of

to the text fieldType on both index and query allows these searches to match content containing diacritics.

Comment	File	Size	Author
#7	2083625-7--improved_folding.patch	2.42 KB	drunken monkey
#6	2083625-6--improved_folding.patch	2.5 KB	drunken monkey
#5	2083625-5.patch	2.34 KB	Nick_vh

Comments

Comment #1

cpliakas commented 8 September 2013 at 12:07

To me this makes sense to include in the default schema. Since the default language is English, it is very unlikely that people will add ~~accented characters~~ diacritics but will want to match words like the ones above.

Comment #2

drunken monkey commented 6 September 2013 at 22:03

This is already working perfectly fine for umlauts and accented characters. I don't know why it's different for diacritics, but they should of course be treated the same (I'd say, without being in any way familiar with their use in the different languages). So, I'm all in favor of adding this.

Comment #3

Nick_vh commented 16 October 2013 at 16:54

Solr 4 adds a new file called mapping-FoldToASCII.txt. I'm not sure where it is used, but it seems we could use that file instead of mapping-ISOLatin1Accent.txt to fix this issue. Did anyone look into this, or is this file perhaps required to enable the ASCIIFoldingFilterFactory?

Comment #4

Nick_vh commented 29 October 2013 at 18:37

solr.ISOLatin1AccentFilterFactory

Creates org.apache.lucene.analysis.ISOLatin1AccentFilter.

Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. In Solr 3.x, this filter is deprecated. This filter does not exist at all in 4.x versions. Use ASCIIFoldingFilterFactory instead.

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

So let's replace the ISOLatin1 filter with the ASCIIFolding filter for Solr 3 and 4, since in Solr 4 it does not even exist.
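
As a rough illustration of that replacement, a schema.xml field type could apply the ASCII folding filter on both the index and the query side. This is only a sketch and not the contents of any patch in this issue: the field type name, the tokenizer and the position of the filter in the chain are assumptions.

```xml
<!-- Sketch only: name, tokenizer and filter order are assumptions, not taken from a patch. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds characters outside Basic Latin (for example ł, ž, č, ě) to ASCII where an equivalent exists. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The same folding has to run at query time so that "Belzec" and "Bełżec" produce the same token. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Content indexed before such a change would keep its old tokens until it is re-indexed.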

Comment #5

Nick_vh commented 29 October 2013 at 18:50

File	Size
2083625-5.patch	2.34 KB

I had an interesting conversation on IRC about this:

nick_vh: Question : What is the main difference of the solr.ASCIIFoldingFilterFactory and solr.MappingCharFilterFactory with a provided file such as mapping-ISOLatin1Accent.txt? I know the ASCIIFoldingFactory encodes many more chars, but can I then omit the mappingcharfactory?
[2:43pm] elyograg: the ascii folding filter's mapping can only get updated when Solr does. The other one has a config file that you can change at any time. The best option is the ICUFoldingfilterFactory, which does character folding and lowercasing in one high-performance step, and it is aware of all of Unicode, not just ascii.
[2:44pm] nick_vh: elyograg: and that is available in Solr 3 and 4?
[2:44pm] nick_vh: let me check
[2:44pm] elyograg: it is. it requires adding jars to the classpath, but those jars are included in the solr download.
[2:45pm] nick_vh: elyograg: so, to confirm - this normalizes any character with accents to a "regular" letter right?
[2:45pm] elyograg: icu4j-49.1.jar
[2:45pm] elyograg: lucene-analyzers-icu-4.4.0.jar
[2:45pm] nick_vh: right, that's not too hard to do
[2:45pm] nick_vh: interesting
[2:46pm] elyograg: yes. it's the smartest folding/lowercasing filter available that I know of. There might be commercial solutions available that are better, but the ICU classes are made by IBM, who knows unicode.
[2:48pm] elyograg: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFold...

So if we want to be really smart, we need to tell people to copy and paste something into their SOLR_HOME directory. But I'm afraid that will cause tons of headaches...

Comment #6

drunken monkey commented 30 October 2013 at 06:53

Status: Active » Needs review

File	Size
2083625-6--improved_folding.patch	2.5 KB

Why SOLR_HOME? We just have to include it in the classpath – which we could either do in the command line, or just in solrconfig.xml. The attached patch works perfectly fine for me in Solr 4.x.
We just have to make sure that this is really what we want to use, that folding of characters with diacritics now works properly and that it's worth the change. (After all, due to the changed query preprocessing some people might need to re-index to keep getting the right results.)
Also we'd need to make sure it works for 3.x, too, of course.
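
For readers who have not used the classpath option in solrconfig.xml, the idea might look roughly like the sketch below. It is not the attached patch: the relative paths assume a core that lives inside an unmodified Solr 4.x download, and the regular expressions are illustrative.

```xml
<config>
  <!-- Assumed paths: relative to the core's instance directory in a stock Solr 4.x download. -->
  <!-- Loads the ICU4J library that the ICU analysis classes depend on. -->
  <lib dir="../../contrib/analysis-extras/lib" regex="icu4j-.*\.jar"/>
  <!-- Loads the Lucene ICU analyzers, which provide solr.ICUFoldingFilterFactory. -->
  <lib dir="../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-.*\.jar"/>
</config>
```

With the jars loadable, the analyzer chains in schema.xml could then reference `<filter class="solr.ICUFoldingFilterFactory"/>`. If the contrib directories are not present at those locations, the jars will not be found, so the paths would need adjusting per deployment.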

Comment #7

drunken monkey commented 30 October 2013 at 07:15

File	Size
2083625-7--improved_folding.patch	2.42 KB

It seems you accidentally already committed your patch #5 in 46889db. To apply my patch, either revert that commit or use the patch attached here.

Comment #8

Nick_vh commented 30 October 2013 at 15:38

No, ICU is part of contrib, and if you deploy the WAR file instead of using Jetty, it will not work. Believe me, using this will cause issues. I'll check whether this is also true for Solr 3.x. Perhaps that solution does work for Solr 4.

Perhaps even extraction does not work out of the box for some folks because some of these folders are missing? I'm not even sure whether we should add the extraction include to the default solrconfig.xml.

Comment #9

Nick_vh commented 30 October 2013 at 15:40

And I'll revert the commit. That was indeed an accident.

Comment #10

Nick_vh commented 30 October 2013 at 17:54

commit reverted. Let's continue discussing :)

Comment #10.0

Nick_vh commented 30 October 2013 at 17:54

Updated issue summary.

Comment #11

mkalkbrenner commented 1 November 2013 at 14:44

Let's go back to the beginning of this discussion!

There's no generic solution for that problem that fits all.

First a German example:
No matter which solr filter you use, converting "Küchen" to "Kuchen" is completely wrong because you convert "kitchens" to "cakes".

In my Solr multilingual trainings and sessions I recommend not converting accents or diacritics that occur in the targeted language, but converting those that don't belong to it. Therefore, a configurable filter is required.

The currently used solr.MappingCharFilterFactory is suitable for that but needs to be configured per language. Apache Solr Multilingual does that job.
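
A per-language setup along those lines might look like the sketch below. It is an assumption-laden illustration, not the configuration that Apache Solr Multilingual generates: the field type name `text_de` and the mapping file `mapping-de.txt` are hypothetical.

```xml
<!-- Sketch only: "text_de" and "mapping-de.txt" are hypothetical names. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- The hypothetical mapping file would list only characters foreign to German, one rule per line, -->
    <!-- for example: "ł" => "l" and "č" => "c". It would contain no rules for ä, ö or ü, -->
    <!-- so "Küchen" stays distinct from "Kuchen". -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-de.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the mapping lives in a plain text file, it can be adjusted per language without waiting for a Solr release.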

Comment #12

mkalkbrenner commented 1 November 2013 at 14:52

By using ICUFoldingFilterFactory or ASCIIFoldingFilterFactory or ISOLatin1AccentFilterFactory you increase the number of search results. I think this is the intention of this issue.
But the downside is that you also increase the number of "false positives".
See slide 4 of http://drupalcamp-essen.de/13/dateien/solrmultilingual_dcessen2013.pdf for a completely wrong search result.

Comment #13

mkalkbrenner commented 1 November 2013 at 15:50

cspitzlay just mentioned offline that ICUFoldingFilterFactory or ASCIIFoldingFilterFactory are probably good default settings for an English index.
But for all other languages that contain accents or diacritics we need something different.

So maybe a solution like this does the job:

  <charFilter class="solr.MappingCharFilterFactory" mapping="protect-some-accents.current-language.txt"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="MATCH_PROTECTED_ACCENTS" replacement="ORIGINAL_ACCENT"/>

Comment #14

drunken monkey commented 1 November 2013 at 20:00

I don't think that would be usable for non-technical users. I'm also not sure how those patterns would look – the "protected accent" string could not contain the original accent, so how would you re-create the original with a regex? Hacky byte manipulation, maybe, but since Solr/Java use proper Unicode strings all the way through, I don't think even that would work.

Even though I was sceptical myself at first, I think we should continue to create the config files for the standard English use case, and just document how other languages could be supported, or the default language changed. It wouldn't be easy in any case, so at least we shouldn't inconvenience English users more than necessary. And with English stemming enabled by default, your search terms will be mutilated in some cases in other languages anyway if you don't change that.

Also, while it of course also leads to some stupid results, folding umlauts to ASCII also helps in some portion of cases – especially when the umlaut is only in the plural form. "Erdapfel" should find "Erdäpfel", "Arzt" "Ärzte", etc.

Comment #15

drunken monkey commented 17 December 2013 at 10:17

commit reverted. Let's continue discussing :)

You missed the confFiles directive in the solrconfig.xml files; this change is now in the 4.2 config version. However, you also forgot to make the same changes in solrcore.properties, so as long as that file is present the change should not have any effect.
Still, you should probably remove that bit, too.