Hello

I am trying to make Solr work with Greek. When I search for a word without its accent, nothing is found, but when I search with the accent it works.

Can you please let me know what I need to change to make this work?

Comment  File        Size     Author
#21      solr.zip    7.39 KB  arekanderu
#7       select.zip  7.08 KB  arekanderu

Comments

arekanderu’s picture

Category: bug » support

I was able to make this thing work a bit better by using the GreekLowerCaseFilterFactory and GreekLowerCaseFilterFactory directly in my schema.xml under text_el field type. Apparently stemming for this language is not supported through SnowballPorterFilterFactory yet, unfortunately.
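
For readers following along, here is a minimal Python sketch of what lowercasing plus accent folding does to a term. It is not Solr code and does not claim to reproduce GreekLowerCaseFilterFactory exactly; the function name fold_term and the sample words are made up for illustration. The point it tries to show is that a search only matches when the same folding is applied both when indexing and when querying.

```python
import unicodedata


def fold_term(term: str) -> str:
    """Lowercase a term and drop combining accents such as the Greek tonos."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term.lower())
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base_only)


# Index time: store the folded form of every term (hypothetical sample terms).
index = {fold_term(t) for t in ["Ξενοδοχείο", "θάλασσα"]}

# Query time: fold the user's input the same way before looking it up.
print(fold_term("ξενοδοχειο") in index)  # accent omitted by the user
print(fold_term("ξενοδοχείο") in index)  # accent typed by the user
```

Both lookups succeed because the query goes through the same folding as the indexed terms. If only one side is folded, the accented and unaccented forms stop matching.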

In Solr admin -> schema browser I can see that the Greek terms are now lowercased and without accents (that's what I wanted), and a Greek word entered in Solr admin -> query returns results. But if I use the same word on my Drupal site with the language set to Greek, apachesolr returns 0 results.

The index and the Solr installation are the same in both cases, yet I do not see the same results. Any thoughts?

arekanderu’s picture

Actually, I am not getting proper results in Solr Admin either. I checked the logs to see what kind of query the Apache Solr Multilingual module produces, and I took only the basic parts of it: the filter query (language:el) and the query I am actually submitting, which is one word in Greek.

Using the Solr Admin -> query -> advanced interface, if I type my query (just one word) in the statement field and add language:el as the filter query, I get 0 results.

However, the word I am using in the query exists in the Schema Browser as a term under title_el.

If I use an English word with the filter query language:el, though, the expected results are returned just fine.

arekanderu’s picture

Some further info: I ran the search with my Greek word WITH the accent (I can see it with an accent as a term under the body field in the schema browser) and language:el as the filter query, and it returned proper results.

In addition, if I query body_el:MY_GREEK_WORD without a filter, I get 0 results again.

It's as if body_el is completely ignored... Any thoughts why?

To clarify again: I am running all these tests through the Tomcat Solr admin interface, but I indexed my data through the Drupal admin interface.
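
The two kinds of request described above (a plain word with the language:el filter query, and a field-qualified body_el: query without a filter) can be sketched in Python as below. This is only an illustration: the base URL, the helper name build_select_url and the sample word are assumptions, not taken from the module or from the attached files. The q, fq, wt and debugQuery names are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host, port and path to your own Tomcat/Solr setup.
SOLR_SELECT_URL = "http://localhost:8080/solr/select"


def build_select_url(word, field=None, language=None, debug=False):
    """Return a /select URL for one word, optionally field-qualified and filtered."""
    query = f"{field}:{word}" if field else word
    params = [("q", query), ("wt", "json")]
    if language:
        params.append(("fq", f"language:{language}"))
    if debug:
        # debugQuery asks Solr to report how it parsed the query.
        params.append(("debugQuery", "true"))
    # urlencode percent-encodes the Greek text as UTF-8 bytes.
    return SOLR_SELECT_URL + "?" + urlencode(params, encoding="utf-8")


word = "ξενοδοχειο"
print(build_select_url(word, language="el"))
print(build_select_url(word, field="body_el", debug=True))
```

Comparing the debug output of both requests shows which fields Solr actually searched, which is the question raised in this comment.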

arekanderu’s picture

Title: Wrong query with non-English language and more » How to ignore accents
Priority: Major » Normal

Here is part of the query explain output:

<str name="parsedquery_toString">
+(tags_h1:ξενοδοχει^5.0 | body:ξενοδοχει^40.0 | title:ξενοδοχει^5.0 | tags_h4_h5_h6:ξενοδοχει^2.0 | tags_inline:ξενοδοχει | name:ξενοδοχει^3.0 | taxonomy_names:ξενοδοχει^2.0 | tags_h2_h3:ξενοδοχει^3.0)~0.01 (body:ξενοδοχει^2.0)~0.01
</str>

As you can see, body is used instead of body_el and title instead of title_el.

I am pretty sure the query submitted by your module looks something like the one above (in a more complex version that includes facets). Please clarify the situation.
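
To make the expectation concrete, here is a small Python sketch of the dismax qf (query fields) value one would expect for Greek, built from the boosts visible in the parsed query above. It is not the module's code. The helper name language_qf is hypothetical, and the rule of appending _LANGUAGE to each field name is an assumption based on the body_el and title_el fields mentioned in this thread. The name field is left out because no language-specific variant of it appears here.

```python
# Boosts copied from the parsed query above; tags_inline has no explicit
# boost there, so the default of 1.0 is assumed.
GENERIC_BOOSTS = {
    "body": 40.0,
    "title": 5.0,
    "tags_h1": 5.0,
    "tags_h2_h3": 3.0,
    "tags_h4_h5_h6": 2.0,
    "tags_inline": 1.0,
    "taxonomy_names": 2.0,
}


def language_qf(boosts, language):
    """Build a qf string that targets the language-specific fields."""
    return " ".join(
        f"{field}_{language}^{boost}" for field, boost in boosts.items()
    )


print(language_qf(GENERIC_BOOSTS, "el"))
```

If the request carried a qf like this, the parsed query would list body_el and title_el instead of body and title.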

mkalkbrenner’s picture

From README.txt included in the module:

B) Multiple Languages and Apache Solr Multilingual Textfile
===========================================================

1. Ensure that all the languages you want to cover with
   multilingual search are available and enabled at
   admin/settings/language

2. Enable all the languages you want to cover with
   multilingual search at admin/settings/apachesolr/multilingual
   and "Save configuration"

3. Adjust all solr text files to your needs at
   admin/settings/apachesolr/multilingual

4. Download apachesolr_multilingual_config.zip at
   admin/settings/apachesolr/schema_generator

5. Extract apachesolr_multilingual_config.zip to your solr
   conf directory and restart solr

6. "Re-index all content" at settings/apachesolr/index.
   It's important that you already have content in every language
   at this point. Otherwise the checkboxes in the next step won't
   exist until you have indexed some content in a specific language

7. Go to admin/settings/apachesolr/query-fields and set "Body" and
   "Title" to "Omit". Enable all language specific bodies and titles
   like body_en or title_de by selecting any value you like but not
   "Omit". And don't forget to "Save configuration".

8. Optional: As described in 7, omit
     "Body text inside links (A tags)",
     "Body text inside H1 tags",
     "Body text inside H2 or H3 tags",
     "Body text inside H4, H5, or H6 tags",
     "Body text in inline tags like EM or STRONG"
   and turn on the language specific fields like
     "tags_a_de",
     "tags_h1_de",
     "tags_h2_h3_de",
     "tags_h4_h5_h6_de",
     "tags_inline_de".

9. Optional: If you installed the module "Taxonomy translation" and
   turned on "Index taxonomy term translations" at
   /admin/settings/apachesolr/multilingual you should omit
   "All taxonomy term names" and enable the language specific equivalent
   like "taxonomy_names_de" instead, as described in 7.

I guess you missed at least point 7.
But ensure that you re-index first. Otherwise the options will not be available (see 6).

Please tell me if that helps.

arekanderu’s picture

Thanks for your reply, mkalkbrenner. Unfortunately I had already done everything the README says, so that didn't help much. Search returns nothing whatever term I try, even with accents.

I also reverted my changes to schema.xml back to your version and re-indexed. Now, if I type a Greek word with accents into the Solr admin query browser, I get some results, but if I type the same word on my site I get 0 results. Both use the same Solr instance. So it's not a matter of stemming at the moment, because this does not work even if I type the word exactly as it is in the index.

I still don't understand how you make Solr look in body_el when my site is switched to Greek and language:el is used as the filter query.

arekanderu’s picture

Attachment: select.zip (7.08 KB)

I have attached the XML results from Solr admin for the Greek word I am using, with and without the accent. As you can see, with the accent I get results (because it looks inside the body field), but without the accent I get 0 results even though the same word exists in the body_el field.

If you look more carefully, you will see that even though I specify the filter query language:el (as your module does), the fields used are body, title, etc. and not body_el, title_el, so I don't really understand where and how you select the language-specific field.

The other thing I do not understand is why, when I query body_el:GREEK_WORD, Solr decides to search in the body field instead of the body_el field.

arekanderu’s picture

Title: How to ignore accents » Non-English language issues

After some long debugging I finally saw the language field that the module adds to documents, and I also saw it in the Solr admin schema browser.

From what I understand, the search is always done within the body field. The problem is that the body field is based on the "text" field type, which has some generic English-based stemming that is not always right for non-English languages (like Greek). Search should look within body_LANGUAGE, title_LANGUAGE, whateverfield_LANGUAGE based on the site's current language selection.

Another thing I noticed is duplicated information. My Greek documents had two fields, body and body_el, which held basically the same information. The same goes for the English ones. This raises a question, though: what's the purpose of these language-specific fields if they are never used?

I hope someone will actually look at this issue, because basically nothing works on my site at the moment regarding search.

arekanderu’s picture

Title: How to ignore accents » Non-English language issues and more
Priority: Normal » Major
arekanderu’s picture

Title: Non-English language issues and more » Wrong query with non-English language and more

Post #8 is the most appropriate description of the problem, I think.

arekanderu’s picture

Title: Non-English language issues » No results with non-English language and more
Priority: Normal » Major

I updated the mapping-ISOLatin1Accent.txt file, adding mappings from the accented Greek letters to their non-accented forms so that terms in the "body" field are case-insensitive and accent-insensitive, and re-indexed everything. If I search in my Solr admin query browser now, I see proper results. If I search on my Drupal site with the same word I used in Solr admin, however, I get no results...

I find the whole situation a mess... Is anyone willing to help me troubleshoot at least?
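
For anyone who wants to try the same mapping-file change, the rules can be generated rather than typed by hand. The Python sketch below prints one rule per accented Greek vowel, assuming the "source" => "target" rule format with \uXXXX escapes that the stock mapping-ISOLatin1Accent.txt uses. The exact mappings added in this comment are not shown in the thread, so the character list and the helper names strip_marks and mapping_line are illustrative assumptions.

```python
import unicodedata

# Greek vowels carrying a tonos and/or dialytika, lowercase and uppercase.
# NFC keeps each one as a single precomposed code point.
ACCENTED_GREEK = unicodedata.normalize("NFC", "άέήίόύώϊϋΐΰΆΈΉΊΌΎΏΪΫ")


def strip_marks(ch: str) -> str:
    """Return the base letter of a precomposed accented letter."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def mapping_line(ch: str) -> str:
    # One rule per line, in the "source" => "target" form with \uXXXX escapes.
    return f'"\\u{ord(ch):04X}" => "\\u{ord(strip_marks(ch)):04X}"'


for ch in ACCENTED_GREEK:
    print(mapping_line(ch))
```

The printed lines would be appended to the mapping file, followed by a Solr restart and a full re-index. Rules like these only affect field types whose analyzer actually loads that mapping file.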

arekanderu’s picture

I have even more findings now. I've set up a clean installation of Drupal 6.22 with the latest apache-solr 6.x-dev and the i18n module for translating content, and added two stories: one in English (the source) and its Greek translation. I ran the indexer, then ran a search, and it worked fine.

I didn't change anything in the schema.xml that the apache-solr module provides, and it lowercased the Greek story and removed its accents.

Then I installed apachesolr-multilingual, used the configuration files that the module provides, restarted Tomcat and re-indexed. Now search couldn't find anything, and the schema browser shows that the Greek terms have accents (but they are lowercased).

The spelling suggestion after the search did suggest searching with the accented word, however.

arekanderu’s picture

Category: bug » support

On my normal Drupal installation the query logged in catalina.out looks like this (I removed some parts of it to make it cleaner):

INFO: [] webapp=/solr path=/select params={spellcheck=true&f.changed.facet.date.start=2010-12-06T18:19:19Z/MONTH&facet=true&f.sm_facetbuilder_search_price.facet.limit=20&f.sm_facetbuilder_price_hotel.facet.limit=20&spellcheck.q=ξενοδοχειο&facet.limit=20&json.nl=map&wt=json&f.changed.facet.date.end=2011-09-23T15:15:10Z%2B1MONTH/MONTH&rows=12&f.im_vid_5.facet.limit=20&f.sm_facetbuilder_rest_type_facet.facet.limit=20&facet.sort=true&start=0&q=ξενοδοχειο

And on my fresh drupal installation:

INFO: [] webapp=/solr path=/select params={spellcheck=true&f.changed.facet.date.start=2011-09-23T17:05:56Z/HOUR&facet=true&facet.mincount=1&spellcheck.q=ξεναδοχείο&facet.limit=20&facet.date=created&facet.date=changed&json.nl=map&hl.fl=body_el&f.changed.facet.date.end=2011-09-23T17:26:02Z%2B1HOUR/HOUR&wt=json&rows=10&f.created.facet.date.gap=%2B1HOUR&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name&f.created.facet.date.start=2011-09-23T17:04:54Z/HOUR&facet.sort=true&start=0&q=ξεναδοχείο&spellcheck.dictionary=spellchecker_el&bf=recip(rord(created),4,2,2)^200.0&f.created.facet.date.end=2011-09-23T17:06:31Z%2B1HOUR/HOUR&f.changed.facet.date.gap=%2B1HOUR&facet.field=uid&facet.field=type&facet.field=language&facet.field=language&fq=language:el} hits=1 status=0 QTime=7 

Do you notice the difference in the q variable? On my fresh installation the q variable holds the query I typed in Greek and it is readable in the log; on my normal Drupal installation it is unreadable.

It seems that the encoding of my query gets mangled somewhere, and that's why I get 0 results on my normal installation.

Note that both Drupal installations are on the same system and use the same Solr instance (I re-index whenever I run a new test).

If we find the reason the query string gets mangled, I can solve the accent problem myself and everything will work fine.
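
As a generic illustration of how a Greek query string can end up unreadable in a log, the Python sketch below percent-encodes a word as UTF-8 and then decodes the same bytes with the wrong charset. This is only one common way such damage happens. It is not a reproduction of the cause that was eventually tracked down for this site, and the sample word is arbitrary.

```python
from urllib.parse import quote, unquote

word = "ξενοδοχείο"

# Correct round trip: encode as UTF-8, decode as UTF-8.
encoded = quote(word, encoding="utf-8")
print(encoded)
print(unquote(encoded, encoding="utf-8") == word)

# Mismatch: the same bytes decoded as Latin-1 turn every Greek letter
# into two unrelated characters, so the term can no longer match the index.
garbled = unquote(encoded, encoding="latin-1")
print(repr(garbled))
print(len(word), len(garbled))
```

When the q value in catalina.out looks like the garbled form, comparing the raw bytes sent by the site with the bytes sent by Solr admin is a reasonable first check.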

arekanderu’s picture

Category: support » bug

Well, after a long, long search I found what is causing it, but not yet why. I have a content type named hotel, and if I remove that content type the query works just fine (the Greek word is readable in catalina.out). If I import the content type again, the query gets mangled again... This is beyond insane.

Also, if I remove any 3 CCK fields (they can be random) out of the 33 in the content type, it works again...

This is completely insane... ANY THOUGHTS?

arekanderu’s picture

Status: Active » Closed (won't fix)

The bug is within the apachesolr module, so... I am marking this as closed. You can even delete this issue, since it is not related to apachesolr_multilingual.

dropbydrop’s picture

+1

dropbydrop’s picture

@arekanderu Could you please share the changes you made to your Solr configuration files so I can check them? I think I am working on the same problem.

Thanks

arekanderu’s picture

dropbydrop, my problem (and solution) was this: http://drupal.org/node/1289400

I hope it helps.

dropbydrop’s picture

@arekanderu, I did not manage to make it work well with Greek. I need it to do Greek stemming and to convert accented Greek letters to unaccented ones, as well as Latin characters to Greek. I will open a support request and you may answer there if you wish.

arekanderu’s picture

Attachment: solr.zip (7.39 KB)

You need to use Solr 3.2.x (or the latest version) and then edit your schema.xml. I have attached mine, in which I added Greek stemming to the text_el field type. Nothing else needs to be changed. I have also attached my Greek stopwords file; you should use that one as well.

Do not forget to re-index.

mkalkbrenner’s picture

Title: No results with non-English language and more » Support for Solr 3.x and it's special filters for Greek
Version: 6.x-2.0-beta1 » 6.x-2.x-dev
Category: bug » feature
Status: Closed (won't fix) » Active
mkalkbrenner’s picture

Version: 6.x-2.x-dev » 7.x-1.x-dev
klonos’s picture

How can I help with Greek here?

mkalkbrenner’s picture

Whenever I start implementing support for Greek, heavy testing will be required. We also have to define useful defaults. That's something I can't do, because I can't even read Greek ;-)

klonos’s picture

Issue tags: +greek

...k. Then just let me know when something is available for testing and when you need help with setting these defaults.

...also adding the "greek" tag in order to be able to track these issues easily.

mkalkbrenner’s picture

Category: feature » task

The remaining task is to offer these filters for Solr 3.1 and above:
http://wiki.apache.org/solr/LanguageAnalysis#Greek

BTW we should provide more language specific stuff as described here:
http://wiki.apache.org/solr/LanguageAnalysis

mkalkbrenner’s picture

Title: Support for Solr 3.x and it's special filters for Greek » Support for Solr 3.x language specific filters (Greek, ...)
mkalkbrenner’s picture

Status: Active » Closed (duplicate)

#2361393: Stemmers supported in Solr 3.x and updated stopwords contains a patch that covers the tasks mentioned in #27