Hi there,

I tried to find out whether this module supports Arabic, but I couldn't find an answer. Can you please let me know whether it works with Arabic as well?

Thanks

Comments

mkalkbrenner’s picture

Version: 7.x-1.0-alpha2 » 7.x-1.x-dev
Category: support » feature

According to http://wiki.apache.org/solr/LanguageAnalysis#Arabic it should work from a Solr perspective.
However, the current implementation of Apache Solr Multilingual does not support swapping the stemmer, so I'm turning this issue into a feature request.

cspitzlay’s picture

That wiki page is marked as "mostly obsolete" (although I expect language support to improve and not to become worse over time).

They suggest looking at the example config file at
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr... instead.

That does not change the fact that the stemmer would need replacing, though.

memoday’s picture

Thanks for your replies. When I compare the default Solr schema.xml against the one that comes with the Apache Solr integration module, I see that the default schema file has a section for most languages, including Arabic. I can search for Arabic words, but if a word contains diacritics, the search returns no results. Is there an easy way to force Apache Solr to ignore diacritics in Arabic?

memoday’s picture

Thanks cspitzlay. As you can see in the example schema file you provided, there is an Arabic section

    <!-- Arabic -->
    <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
      <analyzer> 
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- for any non-arabic -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
        <!-- normalizes ﻯ to ﻱ, etc -->
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
      </analyzer>
    </fieldType>

My question is: if I added this to the schema file that comes with the Apache Solr integration module, would Arabic search work fine? I am not sure why the languages were removed from the integration module's schema file.

All I need for now is the ability to ignore diacritics. Do you think adding the section above would resolve the issue?

cspitzlay’s picture

The file I linked to is an example configuration from the Solr project. If I understand correctly, it's not a suggested default configuration.

It's not like the Apache Solr Multilingual project removed any languages from a standard config.

Apache Solr Multilingual works the other way around.
It extends the schema provided by the Apachesolr Search Integration project, which connects Drupal and Apache Solr but supports only English well.

Apache Solr Multilingual makes it possible to have multiple languages at once and to tune things like stop words on a per-language basis.

So Apache Solr Multilingual actually adds multilingual features to a monolingual and somewhat hard-coded integration.
Your request made it clear that there are still some missing features, though. That's why mkalkbrenner accepted your issue as feature request.

What might work for you:
Configure Arabic via the Apache Solr Multilingual interface (for example, add the stop words you need), then *change* the generated schema so that it uses the filter configuration above.
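
On the diacritics question specifically: as far as I know, the ArabicNormalizationFilterFactory in the snippet above is the filter that drops the short-vowel marks, so a word matches with or without them. Below is a rough Python sketch of that kind of normalization, just to show the idea. It is an approximation written for this thread, not Solr's actual code, and the names normalize_arabic, REMOVED and FOLDED are made up for illustration.

```python
# Illustrative approximation of Arabic normalization as done by a
# normalization filter: strip diacritics and tatweel, fold letter variants.
# This is NOT Solr/Lucene code; all names here are hypothetical.

# Characters removed entirely: tatweel (kashida) and the short-vowel marks.
REMOVED = {
    "\u0640",  # TATWEEL
    "\u064B",  # FATHATAN
    "\u064C",  # DAMMATAN
    "\u064D",  # KASRATAN
    "\u064E",  # FATHA
    "\u064F",  # DAMMA
    "\u0650",  # KASRA
    "\u0651",  # SHADDA
    "\u0652",  # SUKUN
}

# Letter variants folded to one canonical form.
FOLDED = {
    "\u0622": "\u0627",  # ALEF WITH MADDA       -> ALEF
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
    "\u0649": "\u064A",  # ALEF MAKSURA (dotless yeh) -> YEH
    "\u0629": "\u0647",  # TEH MARBUTA -> HEH
}


def normalize_arabic(text: str) -> str:
    """Return text with diacritics removed and letter variants folded."""
    out = []
    for ch in text:
        if ch in REMOVED:
            continue  # drop the mark, keep the base letter before it
        out.append(FOLDED.get(ch, ch))
    return "".join(out)


if __name__ == "__main__":
    with_marks = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"  # "kitab" with vowel marks
    without_marks = "\u0643\u062A\u0627\u0628"                 # "kitab" without marks
    # Both spellings normalize to the same string, so they would match.
    print(normalize_arabic(with_marks) == normalize_arabic(without_marks))
```

For this to help in Solr, the same analysis has to apply both when indexing and when querying (an analyzer element without a type attribute, as in the snippet above, covers both), and existing content needs to be reindexed after the schema change.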

memoday’s picture

Thanks Christian for your reply and sorry for my belated one.

I will be available for testing the Arabic integration once it is available.

Thanks for your help!

mkalkbrenner’s picture

Issue summary: View changes
Status: Active » Closed (duplicate)
Related issues: +#2361393: Stemmers supported in Solr 3.x and updated stopwords