The following searches do not work with the default schema.xml:
- Bełżec Museum: search for "Belzec"
- Congress Hall in Rožnov pod Radhoštěm: search for "Roznov" and "Radhostem"
- House on the Červený Kopec Hill: search for "Cerveny" (Kopec is fine)

However, the addition of

<filter class="solr.ASCIIFoldingFilterFactory"/>

to the text fieldType, in both the index and query analyzers, allows these searches to match content containing diacritics.
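
For illustration, a minimal sketch of where that filter line might sit. The fieldType attributes, tokenizer, and lowercase filter are assumptions chosen for the example, not the module's shipped schema.xml; only the ASCIIFoldingFilterFactory line comes from the description above.

```xml
<!-- Sketch only: tokenizer and the other filter are placeholders, not the shipped schema. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index time: characters with diacritics are stored in their ASCII form. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query time: the same folding, so "Bełżec" and "Belzec" produce the same term. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the index-time analysis changes, existing content would need to be re-indexed before the folded terms can match.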

Comments

cpliakas’s picture

To me this makes sense to include in the default schema. Since the default language is English, it is very unlikely that people will type accented characters or diacritics, but they will still want to match words like the ones above.

drunken monkey’s picture

This is already working perfectly fine for umlauts and accented characters. I don't know why it's different for diacritics, but they should of course be treated the same (I'd say, without being in any way familiar with their use in the different languages). So, I'm all in favor of adding this.

Nick_vh’s picture

With Solr 4 there's a new file called mapping-FoldToASCII.txt. I'm not sure where it is used, but it seems we could use that file instead of mapping-ISOLatin1Accent.txt to fix this issue as well. Did anyone look into this, or is this file perhaps required to enable the ASCIIFoldingFilterFactory?
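
To make the two options concrete: mapping-FoldToASCII.txt is a mapping file for solr.MappingCharFilterFactory, the same mechanism that reads mapping-ISOLatin1Accent.txt, whereas ASCIIFoldingFilterFactory is a token filter with its table built in and needs no file. A hedged sketch of the char-filter variant, assuming the file has been copied into the core's conf directory and using a placeholder tokenizer:

```xml
<analyzer>
  <!-- Char filters run before the tokenizer; swapping the mapping file changes what gets folded. -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <!-- Placeholder tokenizer, assumed for the example. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
```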

Nick_vh’s picture

solr.ISOLatin1AccentFilterFactory

Creates org.apache.lucene.analysis.ISOLatin1AccentFilter.

Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. In Solr 3.x, this filter is deprecated. This filter does not exist at all in 4.x versions. Use ASCIIFoldingFilterFactory instead.

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

So let's replace the ISOLatin1 filter with the ASCIIFolding filter for Solr 3 and 4, since the ISOLatin1 filter does not even exist in Solr 4.

Nick_vh’s picture

I had an interesting conversation on IRC about this:

nick_vh: Question : What is the main difference of the solr.ASCIIFoldingFilterFactory and solr.MappingCharFilterFactory with a provided file such as mapping-ISOLatin1Accent.txt? I know the ASCIIFoldingFactory encodes many more chars, but can I then omit the mappingcharfactory?
[2:43pm] elyograg: the ascii folding filter's mapping can only get updated when Solr does. The other one has a config file that you can change at any time. The best option is the ICUFoldingFilterFactory, which does character folding and lowercasing in one high-performance step, and it is aware of all of Unicode, not just ascii.
[2:44pm] nick_vh: elyograg: and that is available in Solr 3 and 4?
[2:44pm] nick_vh: let me check
[2:44pm] elyograg: it is. it requires adding jars to the classpath, but those jars are included in the solr download.
[2:45pm] nick_vh: elyograg: so, to confirm - this normalizes any character with accents to a "regular" letter right?
[2:45pm] elyograg: icu4j-49.1.jar
[2:45pm] elyograg: lucene-analyzers-icu-4.4.0.jar
[2:45pm] nick_vh: right, that's not too hard to do
[2:45pm] nick_vh: interesting
[2:46pm] elyograg: yes. it's the smartest folding/lowercasing filter available that I know of. There might be commercial solutions available that are better, but the ICU classes are made by IBM, who knows unicode.
[2:48pm] elyograg: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFold...

So if we want to be really smart, we need to tell people to copy something into their SOLR_HOME directory. But I'm afraid that will cause tons of headaches...

drunken monkey’s picture

Status: Active » Needs review

Why SOLR_HOME? We just have to include it in the classpath – which we could either do in the command line, or just in solrconfig.xml. The attached patch works perfectly fine for me in Solr 4.x.
We just have to make sure that this is really what we want to use, that folding of characters with diacritics now works properly and that it's worth the change. (After all, due to the changed query preprocessing some people might need to re-index to keep getting the right results.)
Also we'd need to make sure it works for 3.x, too, of course.
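
A sketch of the solrconfig.xml approach, for readers following along. The relative paths are assumptions based on the layout of the stock Solr 4 download, where the ICU jars sit under contrib/analysis-extras; they would need adjusting for other layouts, and this is not necessarily what the attached patch contains.

```xml
<!-- solrconfig.xml: load the ICU jars named in the IRC log. Paths are assumed; adjust to your install. -->
<lib dir="../../contrib/analysis-extras/lib" regex="icu4j-.*\.jar"/>
<lib dir="../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-.*\.jar"/>

<!-- schema.xml: once the jars are on the classpath, the filter can be used inside an analyzer. -->
<filter class="solr.ICUFoldingFilterFactory"/>
```

If the jars cannot be found at those paths, the core will fail to load the filter class, which is the deployment risk raised elsewhere in this thread.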

drunken monkey’s picture

It seems you accidentally already committed your patch #5 in 46889db. To apply my patch, either revert or use the one attached here.

Nick_vh’s picture

No, ICU is part of contrib, and if you deploy the WAR file rather than using the bundled Jetty, it will not work. Believe me, using this will cause issues. I'll take a look at whether this is also true for Solr 3.x. Perhaps that solution does work for Solr 4.

Perhaps even extraction does not work out of the box for some folks because some of these folders are missing? I'm not even sure we should add the extraction include to the default solrconfig.xml.

Nick_vh’s picture

And I'll revert the commit. That was an accident indeed

Nick_vh’s picture

commit reverted. Let's continue discussing :)

Nick_vh’s picture

Updated issue summary.

mkalkbrenner’s picture

Let's go back to the beginning of this discussion!

There's no generic solution for that problem that fits all.

First a German example:
No matter which solr filter you use, converting "Küchen" to "Kuchen" is completely wrong because you convert "kitchens" to "cakes".

In my Solr multilingual trainings and sessions I recommend not converting accents or diacritics that occur in the targeted language, but converting any that don't belong to it. Therefore a configurable filter is required.

The currently used solr.MappingCharFilterFactory is suitable for that but needs to be configured per language. Apache Solr Multilingual does that job.
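
As a rough illustration of that per-language idea for German: the mapping file name below is hypothetical, and the entries in the comment are only examples of the kind of rules such a file might hold, folding accents foreign to German while leaving umlauts alone. This is not taken from Apache Solr Multilingual's actual files.

```xml
<analyzer>
  <!--
    mapping-de.txt (hypothetical file), in MappingCharFilterFactory's rule syntax:
      "é" => "e"
      "è" => "e"
      "ł" => "l"
    There is no rule for "ü", so "Küchen" is not turned into "Kuchen".
  -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-de.txt"/>
  <!-- Placeholder tokenizer, assumed for the example. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
```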

mkalkbrenner’s picture

By using ICUFoldingFilterFactory or ASCIIFoldingFilterFactory or ISOLatin1AccentFilterFactory you increase the number of search results. I think this is the intention of this issue.
But the downside is that you also increase the number of "false positives".
See slide 4 of http://drupalcamp-essen.de/13/dateien/solrmultilingual_dcessen2013.pdf for a completely wrong search result.

mkalkbrenner’s picture

cspitzlay just mentioned offline that ICUFoldingFilterFactory or ASCIIFoldingFilterFactory are probably good default settings for an English index.
But for all other languages that contain accents or diacritics we need something different.

So maybe a solution like this does the job:

  <charFilter class="solr.MappingCharFilterFactory" mapping="protect-some-accents.current-language.txt"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="MATCH_PROTECTED_ACCENTS" replacement="ORIGINAL_ACCENT"/>

drunken monkey’s picture

I don't think that would be usable for non-technical users. I'm also not sure how those patterns would look – the "protected accent" string could not contain the original accent, so how would you re-create the original with a regex? Hacky byte manipulation, maybe, but since Solr/Java use proper Unicode strings all the way through, I don't think even that would work.

Even though I was sceptical myself at first, I think we should continue to create the config files for the standard English use case, and just document how other languages could be supported, or the default language changed. It wouldn't be easy in any case, so at least we shouldn't inconvenience English users more than necessary. And with English stemming enabled by default, your search terms will be mutilated in some cases in other languages anyway if you don't change that.

Also, while it of course leads to some stupid results, folding umlauts to ASCII also helps in some portion of cases – especially when the umlaut is only in the plural form. "Erdapfel" should find "Erdäpfel", "Arzt" should find "Ärzte", etc.

drunken monkey’s picture

Nick_vh wrote: "commit reverted. Let's continue discussing :)"

You missed the confFiles directive in the solrconfig.xml files, so this change is now in the 4.2 config version. However, you forgot to make the same changes in solrcore.properties anyway, so as long as that file is present the change should not have any effect.
Still, you should probably remove that bit, too.