Hi guys,
I've created a website using i18n content translation, which connects nodes in different languages to each other. All nodes are indexed and can be queried, but e.g. English stemming and English stopwords are applied to all of them.
In the `schema_extra_fields.xml` and `schema_extra_types.xml` files I found an example of adding a German-specific field type. The comments in these files make it look simple: just uncomment the example settings. So I used this example to create a Dutch (NL) specific field type, but after re-indexing, the Solr fields in use do not have the Dutch prefix; they are e.g. `tm_title` instead of `tm_nl_title`.
My question...
Am I missing something? Or are these language-dependent custom field types only used when using Entity Translation and Search API Entity Translation?
These past days I have read quite a few websites, README.txt and INSTALL.txt files, the Search API and Search API Solr handbooks (the latter also states one only has to add custom field types to the extra Solr configuration files), and the issues behind the creation of the Search API Entity Translation contrib (#1323168: Add support for translated fields, #1393058: Decide on strategy for language aware search). I now have the feeling Search API Entity Translation is the only way to go without having to use multiple Solr cores and Search API indexes. Is this correct?
Any help is much appreciated, thanks a lot in advance!
Comment | File | Size | Author |
---|---|---|---|
#8 | search_api_solr_custom.zip | 2.45 KB | lmeurs |
#5 | solr-extra-configuration-files.zip | 7.94 KB | lmeurs |
Comments
Comment #1
drunken monkey

I'm very sorry, this is still really complicated right now. When the different languages are stored in different nodes, I don't think you need to use the Search API Entity Translation module, or that it would bring any benefit. That's (as far as I know) only useful when you have field-level translation on a single node, and want to split that node into separate search items for each language (which of course makes sense).
Except for the stemming (and other language-specific configuration, like stopwords), multilingual content with content translation should work pretty well out of the box – as it apparently does, judging from your comment.
The easiest way to deal with the stemming, etc., is to just remove it from the Solr configuration altogether – not really ideal, but at least content won't be stemmed in the wrong language (which is worse than no stemming, in most cases). Editing the `schema.xml` is the easiest way, but can of course lead to trouble when updating, so it is generally discouraged. (Since there probably won't be many changes there from now on, though, it could still be the best way.) Otherwise, there is already the `text_und` type defined for that, and you'd have to make your fulltext fields use that type. For that, you can either change the `prefix` key of Search API's `text` type (see `search_api_solr_hook_search_api_data_type_info()`), or add a new type with the same method, or use the method described in this module's handbook for customizing field types. (The comments in the `schema_extra_types.xml` file only pertain to adding the type for Solr – getting the Search API Solr Search module to actually use it is a lot harder.)

However, if you want to do it completely right and stem the content of each node using the correct language, this gets a lot harder still. In that case, you have to change the indexed Solr documents (using `hook_search_api_solr_documents_alter()`) and move all fulltext content to use a prefix for a dynamic field you have defined for that language. Then, at query time, you have to use `hook_search_api_solr_query_alter()` to query all language-specific versions of a field instead of (just) the default one. You can, of course, also restrict it to a certain subset of languages (or a single one), depending on your use case. (If the user has set the site to Spanish, they probably don't want to find Greek content.)

Note that Solr 3.5+ also has a feature called Language Detection. I haven't personally used it yet, but it might help somewhat with your objective.
There might soon be a Search API version of Apachesolr Multilingual available, which might make all of that a lot easier, but that's yet to come so won't help you right now.
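To illustrate the per-language field routing described above, here is a minimal sketch in plain PHP. The helper names and the `tm_` prefix convention are assumptions for illustration only; in a real module, this logic would live inside `hook_search_api_solr_documents_alter()` (index time) and `hook_search_api_solr_query_alter()` (query time).

```php
<?php
// Sketch only: rewrite a generic fulltext Solr field name into its
// language-specific dynamic field, e.g. tm_body$value -> tm_nl_body$value.
function example_localized_field_name($field, $langcode) {
  return preg_replace('/^tm_/', 'tm_' . $langcode . '_', $field, 1);
}

// At query time, expand one field into all of its language-specific
// variants, so the search covers every language you want to match.
function example_query_fields($field, array $langcodes) {
  $fields = array();
  foreach ($langcodes as $langcode) {
    $fields[] = example_localized_field_name($field, $langcode);
  }
  return $fields;
}

print example_localized_field_name('tm_body$value', 'nl') . "\n";
print implode(' ', example_query_fields('tm_title', array('en', 'nl'))) . "\n";
```

The same rename helper can be reused on both sides, so index-time and query-time field names are guaranteed to stay in sync.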
Comment #2
drunken monkey

Note: I have added a slight rewrite of this post to the handbook. Hopefully it will help others as well. If you find out anything more, please add it there, too (or post it here for me to add, at least).
Also, Solr's Language Detection feature probably won't help in the case of this module – we already know the language of each item, so there's really no use in using Solr to determine it for us. (It could help with routing the properties to the right Solr fields, with the correct prefix and type, but that's probably not really simpler than doing it yourself in custom code (which you'd still need anyways) and certainly more error-prone.) If you still give it a try, though, be sure to report back your findings. As said, I haven't really used it myself yet.
Comment #3
drunken monkey

Just found out there's a Search API ET Solr module, too, which claims to do exactly what you want. I don't know what state it is in, but it probably deserves a look. If you try it out, please report back your findings so I can update the handbook page.
Comment #4
lmeurs CreditAttribution: lmeurs commented

Hi drunken monkey,
Thanks so much for your extensive answers, I really appreciate it! Since I was (and am) new to the Search API and Search API Solr modules and to Solr itself, I had a hard time getting to know Solr, its XML schema, configuration options, filters, etc., and also figuring out which functionality is provided by Solr and which by the Search API (Solr) modules.
Since your first comment I have tried many things and eventually got i18n content translation (as opposed to field translation) working with correct stemming and spellcheck in English (EN) and Dutch (NL), using one Solr core but two Search API indexes.
I will describe in detail how I managed to do this; hopefully others can benefit from it. Also see the attached custom module (which can be used as a base for one's own implementation) and the Solr schema and config files.
Important: see the last paragraph on assumptions, caveats and code modifications!
Stemming and field name mapping
When querying for "service", stemming makes it possible to also find pages containing "services" and "servicing". In order to get stemming to work for different languages, we have to define separate Solr field types and Solr fields with language prefixes.
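As a sketch of what such definitions can look like, here is a Dutch example along the lines described further below. The analyzer chain shown is an assumption based on Solr's standard filters; in practice you should copy the full chain from the `text` type in your own `schema.xml` and only localize the language-specific parts.

```xml
<!-- schema_extra_types.xml: a Dutch text type (sketch; copy the complete
     analyzer chain from the "text" type in your own schema.xml). -->
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_nl.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
</fieldType>

<!-- schema_extra_fields.xml: the matching dynamic field, plus a copyField
     so all Dutch fulltext values also feed the Dutch spellcheck field. -->
<dynamicField name="tm_nl_*" type="text_nl" indexed="true" stored="true"
              multiValued="true"/>
<copyField source="tm_nl_*" dest="spell_nl"/>
```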
The language in the default setup is English and works out of the box, but to make it easier to adjust and maintain settings we define new field types for both NL and EN. We also copy `stopwords.txt`, `synonyms.txt` and `protwords.txt` and paste them as `lang/stopwords_en.txt` etc. If the Dutch files are not present in the `lang` folder, copy them from the English versions.

Field name mapping
The Search API Solr module uses dynamic fields to handle certain Solr fields in the same way, based on their name's prefix. For example: Drupal's body field has a value, a safe value, etc. Search API generates a string identifier like `body:value`, which Search API Solr maps to `tm_body$value`. The prefix `tm_` and the dynamic field settings make Solr treat this field as a fulltext (t), multi-valued (m) field.

The hook `hook_search_api_solr_field_mapping_alter()` allows us to alter the field name mapping, so we can localize `tm_body$value` into e.g. `tm_nl_body$value` to apply Dutch stemming. See the attached module.

Separate Search API indexes per language
Search API Solr maps field names per Search API index and not per node. This means all indexed nodes in a Search API index use the exact same mapping, which makes it impossible to index nodes from different languages in a single Search API index. So we have to create two separate Search API indexes.
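The mapping alteration itself can be sketched as follows in plain PHP, self-contained for illustration. The real hook in the attached module receives the index and a `$fields` array by reference (roughly `hook_search_api_solr_field_mapping_alter($index, array &$fields)`); the field machine names below are hypothetical examples.

```php
<?php
// Sketch: insert a language prefix into every fulltext field mapping,
// e.g. 'body:value' => 'tm_body$value' becomes 'tm_nl_body$value'.
// In a real module this runs inside hook_search_api_solr_field_mapping_alter(),
// with the langcode taken from the index's 'Language control' setting.
function example_localize_field_mapping(array $fields, $langcode) {
  foreach ($fields as $key => $solr_name) {
    // Only rewrite fulltext fields (prefix "tm_"); leave others untouched.
    if (strpos($solr_name, 'tm_') === 0) {
      $fields[$key] = 'tm_' . $langcode . '_' . substr($solr_name, 3);
    }
  }
  return $fields;
}

$mapping = array(
  'body:value' => 'tm_body$value',
  'title'      => 'tm_title',
  'created'    => 'ds_created',
);
print_r(example_localize_field_mapping($mapping, 'nl'));
```

Non-fulltext fields such as `ds_created` pass through unchanged, since only the `tm_` dynamic fields have language-specific analyzers.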
To make `hook_search_api_solr_field_mapping_alter()` aware of which language prefix to apply, we use the 'Language control' data alteration option at the workflow / filters tab of the corresponding Search API index. Enable this option and adjust the callback settings below it. We use 'Language field: Language' and select only one language (and optionally Language undefined). Our implementation of the `hook_search_api_solr_field_mapping_alter()` hook requires that only a single language (and optionally Language undefined) is selected for localized mapping!

schema_extra_types.xml
From `schema.xml` we copy the `text` and `textSpell` field types and paste them into `schema_extra_types.xml`, once for EN and once for NL, adding `_en` and `_nl` suffixes to their names. Further, we localize settings by changing the paths to the imported text files (from `stopwords.txt` to `lang/stopwords_en.txt`) and changing the language attribute of the `SnowballPorterFilterFactory` for stemming.

schema_extra_fields.xml
Here we copy and paste the commented-out German example code for both EN and NL and replace all `_de` occurrences with their localized counterparts.

Before doing this, modify the example code in the following way:
1: The type of field `spell_de` was `text_de`, but should have been `textSpell_de`;
2: Add `copyField` parameters for the dynamic fields, so that all localized fulltext field values will automatically be added to the right localized `spell` field.

Solr spellcheck
When querying for "internt" (mind the missing -e), Solr spellcheck offers a "Did you mean 'internet'" link. The Search API Spellcheck module extends the Search API module for this purpose.
The Search API Solr module's default configuration uses a field (as opposed to e.g. a file) for its spellcheck functionality. When querying a word, spellcheck searches the indexed values of this field for similar words. The default configuration uses the field `spell`, whose values are a concatenation of all fulltext field values of a node.

We created localized `spell` fields in our schema, so we can use separate dictionaries for each language.

solrconfig_extra.xml
To build these dictionaries, we have to define them in `solrconfig_extra.xml`. All we have to do is copy and paste the default dictionary twice (EN + NL), change the names of the new dictionaries from `default` to `spellchecker_en` / `spellchecker_nl`, and append the localized suffixes `_en` and `_nl` to the other string values.

NB: in case of multiple dictionaries, Solr requires a dictionary named `default` (or with no name defined at all), otherwise Solr will not start.

We are almost there!
We defined new Solr field types, fields and dictionaries. From within Drupal we created two Search API indexes and altered the field name mapping. After reloading Solr and indexing the content from both indexes, all data is (hopefully) successfully stored using our custom field mapping.
All that is left to do is some localization at query time using hooks (see the attached module): `hook_search_api_solr_query_alter()` alters 1) the names of fields that have highlighted search keys and 2) the name of the spellcheck dictionary; `hook_search_api_solr_search_results_alter()` creates custom excerpts based on the localized `spell` fields.

Important:
* We create 2 completely separate localized search pages using 1 Solr core, 2 Search API indexes and 2 Views;
* From the Search API index' workflow / filters tab we select 1 language and optionally Language undefined. If you select Language undefined for both indexes, all Language undefined content gets indexed twice (no harm done, but it gives a little extra overhead);
* All Solr fields of type `textSpell` need to be stemmed as well if you use excerpts / (highlighted) snippets; otherwise a query for "service" does find a node containing "services", but Solr does not return any snippets;
* In `schema.xml` I commented out the `copyField` parameters for dynamic fields to avoid both `spell` and `spell_nl` fields from being created (we only want the localized field);
* Is the use of dynamic Solr fields like `tm_` and `tm_nl_` a smart thing? The first applies field type `text`, the second applies field type `text_nl`. Does the latter overwrite the first, or are they both applied? If they are both applied, are both stemming factories used?
* Always reload Solr after editing Solr schema and configuration files. This can be done from the Tomcat control panel at localhost:8080/manager/ (assuming you use Tomcat at port 8080) or via the Tomcat GUI (`TOMCAT_FOLDER/bin/Tomcat7w.exe` on my Windows 7 workstation);
* In `solrconfig_extra.xml` we configure dictionaries to be built when the index is optimized. This is automatically done by cron once a day. When testing, make sure you optimize the Solr core, otherwise spellcheck does not offer any query suggestions. Use the control panel at localhost:8080/solr/NAME_OF_CORE/ or visit localhost:8080/solr/NAME_OF_CORE/update?optimize=true.

Comment #5
lmeurs CreditAttribution: lmeurs commented

Comment #6

lmeurs CreditAttribution: lmeurs commented

Comment #7
drunken monkey

Wow, thanks a lot for the great write-up! I've added a link to it on the handbook page; that's sure to help others facing a similar problem.
To answer two of your questions:
A field can always have at most one dynamic field base, so it will never happen that both definitions are applied. In your case, only `tm_nl_*` will be used (for fields matching it), as it is the longer pattern. See the Solr wiki for an explanation. So, to summarize: no problem there.
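For reference, with two overlapping dynamic field patterns like the ones below, Solr applies only the longest matching pattern to a given field name. The type names follow the naming used earlier in this thread; treat this as a sketch, not a complete schema.

```xml
<!-- Both patterns could match tm_nl_title, but Solr applies only the
     longest matching pattern, so tm_nl_* (type text_nl) wins. -->
<dynamicField name="tm_*"    type="text"    indexed="true" stored="true" multiValued="true"/>
<dynamicField name="tm_nl_*" type="text_nl" indexed="true" stored="true" multiValued="true"/>
```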
Good that you're mentioning it, since I'm currently working on #2099559: Don't issue optimize commands. So you'll have to be careful there in the future. The current plan for that issue includes a variable to switch back to the old behavior, though, so it will still be possible to support multiple dictionaries.
Anyways, since you seem to have solved your problem (except for a few details), I'm setting this to "Fixed". Thanks again for your write-up! I wish all users with support requests were as helpful as you!
Comment #8
lmeurs CreditAttribution: lmeurs commented

Thank you for the clarification. Based on your answers above and at #2147957: Hard coded field key ("spell") in SearchApiSolrService::getExcerpt(), I updated comment #4 and the provided custom module.
Comment #10
koschos CreditAttribution: koschos as a volunteer commented

Thank you a lot!
But I have to mention one huge improvement I found.

You use the language filter as a trigger to map fields to the correct language. Unfortunately, this approach has a performance issue.

When you use filters to filter your nodes for a particular index, you have to keep in mind that these rules are executed ONLY at index time. That means you instruct your index, which is assigned to one particular language, to add ALL the nodes as indexable to the {search_api_item} table.
For instance, say you have 200k nodes divided over 20 languages. You add 20 indexes and choose a concrete language in the language filter for each particular index. In this case you will get 200k * 20 = 4 million rows in the {search_api_item} table.
This occurs because the index's datasource DOES NOT USE the rules from the filters.

The only way I found to avoid this problem is to write your own custom datasource: extend it from `SearchApiEntityDataSourceController` and override the methods `startTracking()` and `trackItemInsert()`. These methods DO instruct your datasource which nodes it must mark as indexable.
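A self-contained sketch of the idea: the class and method names (`SearchApiEntityDataSourceController`, `startTracking()`, `trackItemInsert()`) are taken from the comment above, but the helper below is hypothetical and shows only the language check such an override would perform before tracking an item.

```php
<?php
// Sketch: in a custom datasource controller extending
// SearchApiEntityDataSourceController, overridden startTracking() and
// trackItemInsert() methods would only track items whose language matches
// the index's language. This pure helper shows that check in isolation.
function example_filter_trackable_items(array $items, $index_langcode) {
  $trackable = array();
  foreach ($items as $id => $item) {
    // Track the item only for the index of its own language (plus,
    // optionally, language-neutral items marked 'und').
    if ($item['language'] === $index_langcode || $item['language'] === 'und') {
      $trackable[$id] = $item;
    }
  }
  return $trackable;
}

$items = array(
  1 => array('language' => 'nl'),
  2 => array('language' => 'en'),
  3 => array('language' => 'und'),
);
print implode(',', array_keys(example_filter_trackable_items($items, 'nl'))) . "\n";
```

With this filtering done at tracking time, each index's {search_api_item} table only contains the nodes it will actually index, instead of every node on the site.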
Comment #11
lazzyvn CreditAttribution: lazzyvn commented

Hello lmeurs,
Can you send me your solr/conf folder, please? I did everything from your post, but my Solr doesn't work. I have this error message:
SolrCore Initialization Failures
drupal: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core drupal: Can't load schema /opt/solr/server/solr/drupal/conf/schema.xml: Plugin init failure for [schema.xml] fieldType "text_en": Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'org.apache.lucene.analysis.core.StopFilterFactory'
Comment #12
lmeurs CreditAttribution: lmeurs commented

@lazzyvn: See the files attached to this issue. Best of luck!