Hi guys,
I've created a website using i18n content translation, which connects nodes in different languages to each other. All nodes are indexed and can be queried, but e.g. English stemming and English stopwords are applied to all of them.
In the `schema_extra_fields.xml` and `schema_extra_types.xml` files I found an example of adding a German-specific field type. The comments in these files make it look simple: just uncomment the example settings. So I used this example to create a Dutch (NL) specific field type, but after re-indexing, the Solr fields in use do not have the Dutch prefix; they are e.g. `tm_title` instead of `tm_nl_title`.
My question...
Am I missing something? Or are these language-dependent custom field types only used when using Entity Translation and Search API Entity Translation?
These past days I have read quite a few websites, README.txt and INSTALL.txt files, the Search API and Search API Solr handbooks (the latter also states one only has to add custom field types to the extra Solr configuration files), and the issues behind the creation of the Search API Entity Translation contrib (#1323168: Add support for translated fields, #1393058: Decide on strategy for language aware search). I now have the feeling Search API Entity Translation is the only way to go without having to use multiple Solr cores and Search API indexes. Is this correct?
Any help is much appreciated, thanks a lot in advance!
Comment | File | Size | Author |
---|---|---|---|
#8 | search_api_solr_custom.zip | 2.45 KB | lmeurs |
#5 | solr-extra-configuration-files.zip | 7.94 KB | lmeurs |
Comments
Comment #1
drunken monkey

I'm very sorry, this is still really complicated right now. When the different languages are stored in different nodes, I don't think you need to use the Search API Entity Translation module, or that it would bring any benefit. That's (as far as I know) only useful when you have field-level translation on a single node, and want to split that node into separate search items for each language (which of course makes sense).
Except for the stemming (and other language-specific configuration, like stopwords), multilingual content with content translation should work pretty well out of the box – as it apparently does, judging from your comment.
The easiest way to deal with the stemming, etc., is to just remove it from the Solr configuration altogether – not really ideal, but at least content won't be stemmed in the wrong language (which is worse than no stemming, in most cases). Editing the `schema.xml` is the easiest way, but can of course lead to trouble when updating, so it is generally discouraged. (Since there probably won't be many changes there from now on, though, it could still be the best way.) Otherwise, there is already the `text_und` type defined for that, and you'd have to make your fulltext fields use that type. For that, you can either change the `prefix` key of Search API's `text` type (see `search_api_solr_hook_search_api_data_type_info()`), or add a new type with the same method, or use the method described in this module's handbook for customizing field types. (The comments in the `schema_extra_types.xml` file only pertain to adding the type for Solr – getting the Search API Solr Search module to actually use it is a lot harder.)

However, if you want to do it completely right and stem the content of each node using the correct language, this gets a lot harder still. In that case, you have to change the indexed Solr documents (using `hook_search_api_solr_documents_alter()`) and move all fulltext content to use a prefix for a dynamic field you have defined for that language. Then, at query time, you have to use `hook_search_api_solr_query_alter()` to query all language-specific versions of a field instead of (just) the default one. You can, of course, also restrict it to a certain subset of languages (or a single one), depending on your use case. (If the user has set the site to Spanish, they probably don't want to find Greek content.)

Note that Solr 3.5+ also has a feature called Language Detection. I haven't personally used it yet, but it might help somewhat with your objective.
There might soon be a Search API version of Apachesolr Multilingual available, which might make all of that a lot easier, but that's yet to come so won't help you right now.
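To illustrate the per-language field routing described above, here is a minimal sketch in plain PHP. The helper names and the `tm_` prefix convention are assumptions for illustration only; in a real module, this logic would live inside `hook_search_api_solr_documents_alter()` (index time) and `hook_search_api_solr_query_alter()` (query time).

```php
<?php
// Sketch only: rewrite a generic fulltext Solr field name into its
// language-specific dynamic field, e.g. tm_body$value -> tm_nl_body$value.
function example_localized_field_name($field, $langcode) {
  return preg_replace('/^tm_/', 'tm_' . $langcode . '_', $field, 1);
}

// At query time, expand one field into all of its language-specific
// variants, so the search covers every language you want to match.
function example_query_fields($field, array $langcodes) {
  $fields = array();
  foreach ($langcodes as $langcode) {
    $fields[] = example_localized_field_name($field, $langcode);
  }
  return $fields;
}

print example_localized_field_name('tm_body$value', 'nl') . "\n";
print implode(' ', example_query_fields('tm_title', array('en', 'nl'))) . "\n";
```

The same rename helper can be reused on both sides, so index-time and query-time field names are guaranteed to stay in sync.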
Comment #2
drunken monkey

Note: I have added a slight rewrite of this post to the handbook. Hopefully it will help others as well. If you find out anything more, please add it there, too (or post it here for me to add, at least).
Also, Solr's Language Detection feature probably won't help in the case of this module – we already know the language of each item, so there's really no use in using Solr to determine it for us. (It could help with routing the properties to the right Solr fields, with the correct prefix and type, but that's probably not really simpler than doing it yourself in custom code (which you'd still need anyways) and certainly more error-prone.) If you still give it a try, though, be sure to report back your findings. As said, I haven't really used it myself yet.
Comment #3
drunken monkey

Just found out there's a Search API ET Solr module, too, which claims to do exactly what you want. I don't know what state it is in, but it probably deserves a look. If you try it out, please report back your findings so I can update the handbook page.
Comment #4
lmeurs CreditAttribution: lmeurs commented

Hi drunken monkey,
Thanks so much for your extensive answers, I really appreciate it! Since I was (and am) new to the Search API and Search API Solr modules and to Solr itself, I had a hard time getting to know Solr, its XML schema, configuration options, filters, etc., and also figuring out which functionality is provided by Solr and which by the Search API (Solr) modules.
Since your first comment I have tried many things and eventually got i18n content translation (as opposed to field translation) working with correct stemming and spellcheck in English (EN) and Dutch (NL), using one Solr core but two Search API indexes.
I will describe in detail how I managed to do this; hopefully others can benefit from it. Also see the attached custom module (which can be used as a base for one's own implementation) and the Solr schema and config files.
Important: see the last paragraph on assumptions, caveats and code modifications!
Stemming and field name mapping
When querying for "service", stemming makes it possible to also find pages containing "services" and "servicing". In order to get stemming to work for different languages, we have to define separate Solr field types and Solr fields with language prefixes.
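As a sketch of what such definitions can look like, here is a Dutch example along the lines described further below. The analyzer chain shown is an assumption based on Solr's standard filters; in practice you should copy the full chain from the `text` type in your own `schema.xml` and only localize the language-specific parts.

```xml
<!-- schema_extra_types.xml: a Dutch text type (sketch; copy the complete
     analyzer chain from the "text" type in your own schema.xml). -->
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_nl.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
</fieldType>

<!-- schema_extra_fields.xml: the matching dynamic field, plus a copyField
     so all Dutch fulltext values also feed the Dutch spellcheck field. -->
<dynamicField name="tm_nl_*" type="text_nl" indexed="true" stored="true"
              multiValued="true"/>
<copyField source="tm_nl_*" dest="spell_nl"/>
```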
The language in the default setup is English and works out of the box, but to make it easier to adjust and maintain settings we define new field types for both NL and EN. We also copy `stopwords.txt`, `synonyms.txt` and `protwords.txt` and paste them as `lang/stopwords_en.txt` etc. If the Dutch files are not present in the `lang` folder, copy them from the English versions.

Field name mapping
The Search API Solr module uses dynamic fields to handle certain Solr fields in the same way, based on their name's prefix. For example: Drupal's body field has a value, a safe value, etc. Search API generates a string identifier like `body:value`, which Search API Solr maps to `tm_body$value`. The prefix `tm_` and the dynamic field settings make Solr treat this field as a fulltext (t), multi-valued (m) field.

The hook `hook_search_api_solr_field_mapping_alter()` allows us to alter the field name mapping, so we can localize `tm_body$value` into e.g. `tm_nl_body$value` to apply Dutch stemming. See the attached module.

Separate Search API indexes per language
Search API Solr maps field names per Search API index and not per node. This means all indexed nodes in a Search API index use the exact same mapping, which makes it impossible to index nodes from different languages in a single Search API index. So we have to create two separate Search API indexes.
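The mapping alteration itself can be sketched as follows in plain PHP, self-contained for illustration. The real hook in the attached module receives the index and a `$fields` array by reference (roughly `hook_search_api_solr_field_mapping_alter($index, array &$fields)`); the field machine names below are hypothetical examples.

```php
<?php
// Sketch: insert a language prefix into every fulltext field mapping,
// e.g. 'body:value' => 'tm_body$value' becomes 'tm_nl_body$value'.
// In a real module this runs inside hook_search_api_solr_field_mapping_alter(),
// with the langcode taken from the index's 'Language control' setting.
function example_localize_field_mapping(array $fields, $langcode) {
  foreach ($fields as $key => $solr_name) {
    // Only rewrite fulltext fields (prefix "tm_"); leave others untouched.
    if (strpos($solr_name, 'tm_') === 0) {
      $fields[$key] = 'tm_' . $langcode . '_' . substr($solr_name, 3);
    }
  }
  return $fields;
}

$mapping = array(
  'body:value' => 'tm_body$value',
  'title'      => 'tm_title',
  'created'    => 'ds_created',
);
print_r(example_localize_field_mapping($mapping, 'nl'));
```

Non-fulltext fields such as `ds_created` pass through unchanged, since only the `tm_` dynamic fields have language-specific analyzers.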
To make `hook_search_api_solr_field_mapping_alter()` aware of which language prefix to apply, we use the 'Language control' data alteration option at the workflow / filters tab of the corresponding Search API index. Enable this option and adjust the callback settings below it. We use 'Language field: Language' and select only one language (and optionally Language undefined). Our implementation of the `hook_search_api_solr_field_mapping_alter()` hook requires that only a single language (and optionally Language undefined) is selected for localized mapping!

schema_extra_types.xml
From `schema.xml` we copy the `text` and `textSpell` field types and paste them into `schema_extra_types.xml`, once for EN and once for NL, adding `_en` and `_nl` suffixes to their names. Further, we localize settings by changing the paths to the imported text files (from `stopwords.txt` to `lang/stopwords_en.txt`) and changing the language attribute of the `SnowballPorterFilterFactory` for stemming.

schema_extra_fields.xml
Here we copy and paste the commented-out German example code for both EN and NL and replace all `_de` occurrences with their localized counterparts.

Before doing this, modify the example code in the following way:
1: The type of field `spell_de` was `text_de`, but should have been `textSpell_de`;
2: Add `copyField` parameters for the dynamic fields, so that all localized fulltext field values will automatically be added to the right localized `spell` field.

Solr spellcheck
When querying for "internt" (mind the missing -e), Solr spellcheck offers a "Did you mean 'internet'" link. The Search API Spellcheck module extends the Search API module for this purpose.
The Search API Solr module's default configuration uses a field (as opposed to e.g. a file) for its spellcheck functionality. When querying a word, spellcheck searches the indexed values of this field for similar words. The default configuration uses the field `spell`, whose values are a concatenation of all fulltext field values of a node.

We created localized `spell` fields in our schema, so we can use separate dictionaries for each language.

solrconfig_extra.xml
To build these dictionaries, we have to define them in `solrconfig_extra.xml`. All we have to do is copy and paste the default dictionary twice (EN + NL), change the names of the new dictionaries from `default` to `spellchecker_en` / `spellchecker_nl`, and append the localized suffixes `_en` and `_nl` to the other string values.

NB: in case of multiple dictionaries, Solr requires a dictionary named `default` (or with no name defined at all), otherwise Solr will not start.

We are almost there!
We defined new Solr field types, fields and dictionaries. From within Drupal we created two Search API indexes and altered the field name mapping. After reloading Solr and indexing the content from both indexes, all data is (hopefully) successfully stored using our custom field mapping.
All that is left to do is some localization at query time using hooks (see the attached module): `hook_search_api_solr_query_alter()` alters 1) the names of fields that have highlighted search keys and 2) the name of the spellcheck dictionary; `hook_search_api_solr_search_results_alter()` creates custom excerpts based on the localized `spell` fields.

Important:
* We create 2 completely separate localized search pages using 1 Solr core, 2 Search API indexes and 2 Views;
* From the Search API index' workflow / filters tab we select 1 language and optionally Language undefined. If you select Language undefined for both indexes, all Language undefined content gets indexed twice (no harm done, but it gives a little extra overhead);
* All Solr fields of type `textSpell` need to be stemmed as well if you use excerpts / (highlighted) snippets; otherwise a query for "service" does find a node containing "services", but Solr does not return any snippets;
* In `schema.xml` I commented out the `copyField` parameters for dynamic fields to avoid both `spell` and `spell_nl` fields from being created (we only want the localized field);
* Is the use of dynamic Solr fields like `tm_` and `tm_nl_` a smart thing? The first applies field type `text`, the second applies field type `text_nl`. Does the latter overwrite the first, or are they both applied? If they are both applied, are both stemming factories used?
* Always reload Solr after editing Solr schema and configuration files. This can be done from the Tomcat control panel at localhost:8080/manager/ (assuming you use Tomcat at port 8080) or via the Tomcat GUI (`TOMCAT_FOLDER/bin/Tomcat7w.exe` on my Windows 7 workstation);
* In `solrconfig_extra.xml` we configure dictionaries to be built when the index is optimized. This is automatically done by cron once a day. When testing, make sure you optimize the Solr core, otherwise spellcheck does not offer any query suggestions. Use the control panel at localhost:8080/solr/NAME_OF_CORE/ or visit localhost:8080/solr/NAME_OF_CORE/update?optimize=true.

Comment #5
lmeurs CreditAttribution: lmeurs commented

Comment #6

lmeurs CreditAttribution: lmeurs commented

Comment #7
drunken monkey

Wow, thanks a lot for the great write-up! I've added a link to it on the handbook page; that's sure to help others facing a similar problem.
To answer two of your questions:
A field can always have at most one dynamic field base, so it will never happen that both definitions are applied. In your case, only `tm_nl_*` will be used (for fields matching it), as it is the longer pattern. See the Solr wiki for an explanation. So, to summarize: no problem there.
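For reference, with two overlapping dynamic field patterns like the ones below, Solr applies only the longest matching pattern to a given field name. The type names follow the naming used earlier in this thread; treat this as a sketch, not a complete schema.

```xml
<!-- Both patterns could match tm_nl_title, but Solr applies only the
     longest matching pattern, so tm_nl_* (type text_nl) wins. -->
<dynamicField name="tm_*"    type="text"    indexed="true" stored="true" multiValued="true"/>
<dynamicField name="tm_nl_*" type="text_nl" indexed="true" stored="true" multiValued="true"/>
```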
Good that you're mentioning it, since I'm currently working on #2099559: Don't issue optimize commands. So you'll have to be careful there in the future. The current plan for that issue includes a variable to switch back to the old behavior, though, so it will still be possible to support multiple dictionaries.
Anyways, since you seem to have solved your problem (except for a few details), I'm setting this to "Fixed". Thanks again for your write-up! I wish all users with support requests were as helpful as you!
Comment #8
lmeurs CreditAttribution: lmeurs commented

Thank you for the clarification. Based on your answers above and at #2147957: Hard coded field key ("spell") in SearchApiSolrService::getExcerpt(), I updated comment #4 and the provided custom module.
Comment #10
koschos CreditAttribution: koschos as a volunteer commented

Thank you a lot!
But I have to mention one huge improvement I found.

You use the language filter as a trigger to map fields to the correct language. Unfortunately, this approach has a performance issue.

When you use filters to filter your nodes for a particular index, you have to keep in mind that these rules are executed ONLY at index time. That means you instruct your index, which is assigned to one particular language, to add ALL the nodes as indexable to the {search_api_item} table.
For instance, say you have 200k nodes divided over 20 languages. You add 20 indexes and choose a concrete language in the language filter for each particular index. In this case you will get 200k * 20 = 4 million rows in the {search_api_item} table.
This occurs because the index's datasource DOES NOT USE the rules from the filters.

The only way I found to avoid this problem is to write your own custom datasource: extend it from `SearchApiEntityDataSourceController` and override the methods `startTracking()` and `trackItemInsert()`. These methods DO instruct your datasource which nodes it must mark as indexable.
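A self-contained sketch of the idea: the class and method names (`SearchApiEntityDataSourceController`, `startTracking()`, `trackItemInsert()`) are taken from the comment above, but the helper below is hypothetical and shows only the language check such an override would perform before tracking an item.

```php
<?php
// Sketch: in a custom datasource controller extending
// SearchApiEntityDataSourceController, overridden startTracking() and
// trackItemInsert() methods would only track items whose language matches
// the index's language. This pure helper shows that check in isolation.
function example_filter_trackable_items(array $items, $index_langcode) {
  $trackable = array();
  foreach ($items as $id => $item) {
    // Track the item only for the index of its own language (plus,
    // optionally, language-neutral items marked 'und').
    if ($item['language'] === $index_langcode || $item['language'] === 'und') {
      $trackable[$id] = $item;
    }
  }
  return $trackable;
}

$items = array(
  1 => array('language' => 'nl'),
  2 => array('language' => 'en'),
  3 => array('language' => 'und'),
);
print implode(',', array_keys(example_filter_trackable_items($items, 'nl'))) . "\n";
```

With this filtering done at tracking time, each index's {search_api_item} table only contains the nodes it will actually index, instead of every node on the site.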
Comment #11
lazzyvn CreditAttribution: lazzyvn commented

Hello lmeurs,
Can you send me your solr/conf folder, please? I did everything from your post, but my Solr doesn't work. I have this error message:
SolrCore Initialization Failures
drupal: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core drupal: Can't load schema /opt/solr/server/solr/drupal/conf/schema.xml: Plugin init failure for [schema.xml] fieldType "text_en": Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'org.apache.lucene.analysis.core.StopFilterFactory'
Comment #12
lmeurs CreditAttribution: lmeurs commented

@lazzyvn: See the files attached to this issue. Best of luck!