The following searches do not work with the default schema.xml:
- Bełżec Museum: search for "Belzec"
- Congress Hall in Rožnov pod Radhoštěm: search for "Roznov" and "Radhostem"
- House on the Červený Kopec Hill: search for "Cerveny" (Kopec is fine)
However, the addition of
<filter class="solr.ASCIIFoldingFilterFactory"/>
to the text fieldType on both index and query allows these searches to match content containing diacritics.
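For orientation, here is a minimal sketch of where that line would sit. Only the solr.ASCIIFoldingFilterFactory line comes from this report; the tokenizer and lowercase filter are assumptions for illustration, and the real text fieldType in the module's schema.xml has a longer filter chain.

```xml
<!-- Sketch only: a simplified stand-in for the module's "text" fieldType. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumed tokenizer and lowercase step, for illustration. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds characters such as ł, ż, ž, ě and Č to their ASCII equivalents. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same folding at query time, so both sides produce the same tokens. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With the filter on both sides, "Bełżec" is indexed as "belzec" and the query "Belzec" is reduced to the same token. Existing content would have to be re-indexed before the change takes effect.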
Comment | File | Size | Author |
---|---|---|---|
#7 | 2083625-7--improved_folding.patch | 2.42 KB | drunken monkey |
#6 | 2083625-6--improved_folding.patch | 2.5 KB | drunken monkey |
#5 | 2083625-5.patch | 2.34 KB | Nick_vh |
Comments
Comment #1
cpliakas
To me this makes sense to include in the default schema. Since the default language is English, it is very unlikely that people will type diacritics, but they will still want to match words like the ones above.
Comment #2
drunken monkey
This is already working perfectly fine for umlauts and accented characters. I don't know why it's different for diacritics, but they should of course be treated the same (I'd say, without being in any way familiar with their use in the different languages). So, I'm all in favor of adding this.
Comment #3
Nick_vh
With Solr 4 there's a new file called mapping-FoldToASCII.txt. I'm not sure where it is used, but it seems we could use that file instead of mapping-ISOLatin1Accent.txt to fix this issue as well. Did anyone look into this, or is this file perhaps required to enable the ASCIIFoldingFilterFactory?
Comment #4
Nick_vh
So let's replace the ISOLatin1 filter with the ASCIIFolding filter for Solr 3 and 4, since the ISOLatin1 filter does not even exist in Solr 4.
Comment #5
Nick_vh
I had an interesting conversation in IRC about this:
So if we want to be really smart, we need to tell people to copy and paste something into their SOLR_HOME directory. But I'm afraid that will cause tons of headaches...
Comment #6
drunken monkey
Why SOLR_HOME? We just have to include it in the classpath, which we could do either on the command line or just in solrconfig.xml. The attached patch works perfectly fine for me in Solr 4.x.
We just have to make sure that this is really what we want to use, that folding of characters with diacritics now works properly and that it's worth the change. (After all, due to the changed query preprocessing some people might need to re-index to keep getting the right results.)
Also we'd need to make sure it works for 3.x, too, of course.
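A sketch of the solrconfig.xml route, assuming the library in question is the ICU analysis code from Solr's analysis-extras contrib (the thread does not spell this out at this point). The relative paths are assumptions that depend on where the core sits relative to the Solr distribution, and jar locations differ between Solr versions.

```xml
<!-- solrconfig.xml: add the contrib jars to the core's classpath. -->
<!-- Both paths are assumptions; adjust them to the local Solr layout. -->
<lib dir="../../contrib/analysis-extras/lib" />
<lib dir="../../contrib/analysis-extras/lucene-libs" />

<!-- schema.xml: with the jars on the classpath, the ICU filter can be
     referenced inside an analyzer like any other filter. -->
<filter class="solr.ICUFoldingFilterFactory"/>
```

If the directories do not exist in a given deployment, the jars are not loaded and the core will fail to start once the schema references the ICU filter.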
Comment #7
drunken monkey
It seems you accidentally already committed your patch #5 in 46889db. To apply my patch, either revert or use the one attached here.
Comment #8
Nick_vh
No, the ICU is part of contrib, and if you deploy the war file and do not use Jetty it will not work. Believe me, using this will cause issues. I'll check whether this is also true for Solr 3.x. Perhaps that solution does work for Solr 4.
Perhaps even extraction does not work out of the box for some folks because some of these folders are missing? I'm not even sure we should add the extraction include to the default solrconfig.xml.
Comment #9
Nick_vh
And I'll revert the commit. That was an accident indeed.
Comment #10
Nick_vh
Commit reverted. Let's continue discussing :)
Comment #10.0
Nick_vh
Updated issue summary.
Comment #11
mkalkbrenner
Let's go back to the beginning of this discussion!
There's no generic, one-size-fits-all solution to this problem.
First a German example:
No matter which solr filter you use, converting "Küchen" to "Kuchen" is completely wrong because you convert "kitchens" to "cakes".
In my Solr multilingual trainings and sessions I recommend not converting accents or diacritics that occur in the target language, but converting any that don't belong to it. Therefore a configurable filter is required.
The currently used solr.MappingCharFilterFactory is suitable for that but needs to be configured per language. Apache Solr Multilingual does that job.
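As an illustration of the per-language approach, here is a sketch for a German index. The file name mapping-de.txt and its rules are hypothetical, as are the tokenizer and lowercase filter; this is not taken from Apache Solr Multilingual's actual configuration.

```xml
<analyzer>
  <!-- mapping-de.txt is a hypothetical file that leaves ä, ö, ü and ß alone
       and folds only characters that do not occur in German, for example:
         "é" => "e"
         "ł" => "l"
         "ž" => "z"
  -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-de.txt"/>
  <!-- Assumed tokenizer and lowercase step, for illustration. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With a mapping like this, "Küchen" and "Kuchen" stay distinct, while "Rožnov" is still indexed in a form that matches "Roznov".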
Comment #12
mkalkbrenner
By using ICUFoldingFilterFactory or ASCIIFoldingFilterFactory or ISOLatin1AccentFilterFactory you increase the number of search results. I think this is the intention of this issue.
But the downside is that you also increase the number of "false positives".
See slide 4 of http://drupalcamp-essen.de/13/dateien/solrmultilingual_dcessen2013.pdf for a completely wrong search result.
Comment #13
mkalkbrenner
cspitzlay just mentioned offline that ICUFoldingFilterFactory or ASCIIFoldingFilterFactory are probably good default settings for an English index.
But for all other languages that contain accents or diacritics we need something different.
So maybe a solution like this does the job:
Comment #14
drunken monkey
I don't think that would be usable for non-technical users. I'm also not sure how those patterns would look: the "protected accent" string could not contain the original accent, so how would you re-create the original with a regex? Hacky byte manipulation, maybe, but since Solr/Java use proper Unicode strings all the way through, I don't think even that would work.
Even though I was sceptical myself at first, I think we should continue to create the config files for the standard English use case, and just document how other languages could be supported, or the default language changed. It wouldn't be easy in any case, so at least we shouldn't inconvenience English users more than necessary. And with English stemming enabled by default, your search terms will be mutilated in some cases in other languages anyway if you don't change that.
Also, while it of course also leads to some stupid results, folding umlauts to ASCII also helps in some portion of cases – especially when the umlaut is only in the plural form. "Erdapfel" should find "Erdäpfel", "Arzt" "Ärzte", etc.
Comment #15
drunken monkey
You missed the confFiles directive in the solrconfig.xml files; this change is now in the 4.2 config version. However, you forgot to make the same changes in solrcore.properties anyway, so as long as that file is present the change should not have any effect.
Still, you should probably remove that bit, too.