Does this support Arabic? [#1991430]

Comment #1

mkalkbrenner commented 9 May 2013 at 21:50

Version:	7.x-1.0-alpha2	» 7.x-1.x-dev
Category:	support	» feature

According to http://wiki.apache.org/solr/LanguageAnalysis#Arabic it should work from a Solr perspective.
But the current implementation of Apache Solr Multilingual does not support swapping out the stemmer, so I am turning this issue into a feature request.

Comment #2

cspitzlay commented 10 May 2013 at 11:23

That wiki page is marked as "mostly obsolete" (although I expect language support to improve and not to become worse over time).

They suggest looking at the example config file at
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr... instead.

That does not change the fact that the stemmer would need replacing, though.

Comment #3

memoday commented 12 May 2013 at 17:45

Thanks for your replies. When I compare the default Solr schema.xml against the one that comes with the Apache Solr integration module, I see that the default schema file has a section for most languages, including Arabic. I can search for Arabic words, but if a word contains diacritics, the search returns no results. Is there an easy way to force Apache Solr to ignore diacritics in Arabic?

Comment #4

memoday commented 12 May 2013 at 17:56

Thanks cspitzlay. As you can see in the example schema file you provided, there is an Arabic section:

    <!-- Arabic -->
    <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
      <analyzer> 
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- for any non-arabic -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
        <!-- normalizes ﻯ to ﻱ, etc -->
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
      </analyzer>
    </fieldType>

My question is: if I added this to the schema file that comes with the Apache Solr integration module, would Arabic search work properly? I am not sure why languages were removed from the integration module's schema file.

All I need for now is the ability to ignore diacritics. Do you think adding the section above would resolve the issue?

Comment #5

cspitzlay commented 13 May 2013 at 20:26

The file I linked to is an example configuration from the Solr project. If I understand correctly, it is not a suggested default configuration.

It's not like the Apache Solr Multilingual project removed any languages from a standard config.

Apache Solr Multilingual works the other way around.
It extends the schema provided by the Apachesolr Search Integration project which connects Drupal and Apache Solr but which supports only English well.

Apache Solr Multilingual makes it possible to have multiple languages at once and to tune things like stop words on a per-language basis.

So Apache Solr Multilingual actually adds multilingual features to a monolingual and somewhat hard-coded integration.
Your request made it clear that some features are still missing, though. That's why mkalkbrenner accepted your issue as a feature request.

What might work for you:
Configure Arabic via the Apache Solr Multilingual interface (for example, add the stop words you need), then *change* the generated schema to use the filter configuration above.
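
If ignoring diacritics is the only requirement, a smaller analyzer chain may be enough. The following is a minimal sketch, not a tested configuration: the field type name `text_ar_nodiacritics` is hypothetical, and the names in the schema generated by Apache Solr Multilingual will differ. It relies on `solr.ArabicNormalizationFilterFactory`, which removes the harakat (diacritics) and tatweel and unifies some letter variants, and it leaves out the stemmer and the stop word filter.

```xml
<!-- Hypothetical field type name; adapt it to the field types in the generated schema. -->
<fieldType name="text_ar_nodiacritics" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element applies to both indexing and querying,
       so diacritics are stripped from the indexed text and from the search terms. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Removes Arabic diacritics and tatweel, and normalizes some letter variants. -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
```

After a change like this, existing content has to be re-indexed. Otherwise the index still holds the old tokens with their diacritics, and normalized queries will not match them.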