Add language-specific spell fields [#2735625]

Comment	File	Size	Author
#11	2735625_final.patch	173.67 KB	mkalkbrenner
#10	2735625_WIP.patch	66.23 KB	mkalkbrenner
#5	search_api_solr_multilingual--add-spellcheck-per-language-support--2735625--5.patch	1.04 KB	damontgomery
#3	search_api_solr_multilingual--add-spellcheck-per-language-support--2735625--2.patch	1.01 KB	damontgomery

Comments

Comment #1

28 May 2016 at 08:05

mkalkbrenner created an issue. See original summary.

Comment #2

mkalkbrenner

German

🇩🇪

commented 28 May 2016 at 08:06

Title:

Add langauge-specific spell fields

» Add language-specific spell fields

Comment #3

damontgomery commented 8 June 2017 at 17:29

Status	File	Size
new	search_api_solr_multilingual--add-spellcheck-per-language-support--2735625--2.patch	1.01 KB

I believe I have a start for a solution here. It's definitely not complete, and it can't be merged without an upgrade path, but I wanted to get this moving in case someone else can jump in.

Please see the patch for the change. It updates the Search API backend to add a 'spellcheck.dictionary' parameter to the Solr query, which tells Solr which spellchecker to use. To decide whether the site wants per-language search, I reused the configuration that is already present. A dedicated setting might make sense, but this seems to work. I'm also not clear on how the query languages are intended to work. If multiple are present, we need to pick one for spellcheck, so I select the first. I'm not sure whether there are cases where no language is set, so that might need to be addressed as well.
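As a rough illustration of the selection logic described above (not the patch itself, which lives in the PHP backend), the choice of dictionary could be sketched in Python as follows. The function names are hypothetical, the dictionary names follow the `spellchecker_<langcode>` convention used in the solrconfig_extra.xml in this comment, and the fallback to "default" for a query without a language is an assumption, since that case is still open.

```python
from typing import Dict, List, Optional


def spellcheck_dictionary(languages: List[str], per_language: bool) -> Optional[str]:
    """Pick the Solr spellchecker name for a query (hypothetical helper)."""
    if not per_language:
        # Do not set the parameter; Solr then uses its "default" spellchecker.
        return None
    if not languages:
        # Open question from the comment: a query without any language.
        # Falling back to "default" is an assumption, not the patch's behaviour.
        return "default"
    # Several languages may be present; like the patch, use the first one.
    return f"spellchecker_{languages[0]}"


def spellcheck_params(languages: List[str], per_language: bool = True) -> Dict[str, str]:
    """Build the spellcheck-related query parameters."""
    params = {"spellcheck": "true"}
    dictionary = spellcheck_dictionary(languages, per_language)
    if dictionary is not None:
        params["spellcheck.dictionary"] = dictionary
    return params
```

Calling `spellcheck_params(["en", "ar"])` selects `spellchecker_en`, because only the first query language is considered.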

With the patch, you also need to update a few configuration files.

Create or edit solrconfig_extra.xml

This defines which spellcheckers should exist.

<!-- Spell Check

    The spell check component can return a list of alternative spelling
    suggestions.

    http://wiki.apache.org/solr/SpellCheckComponent
 -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <str name="queryAnalyzerFieldType">textSpell</str>

  <!-- Multiple "Spell Checkers" can be declared and used by this
       component
    -->

  <!-- a spellchecker built from a field of the main index, and
       written to disk
    -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellchecker</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
    <!-- uncomment this to require terms to occur in 1% of the documents in order to be included in the dictionary
      <float name="thresholdTokenFrequency">.01</float>
    -->
  </lst>

  <!-- Arabic -->
  <lst name="spellchecker">
    <str name="name">spellchecker_ar</str>
    <str name="field">spell_ar</str>
    <str name="spellcheckIndexDir">./spellchecker_ar</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
  </lst>

  <!-- English -->
  <lst name="spellchecker">
    <str name="name">spellchecker_en</str>
    <str name="field">spell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
  </lst>

</searchComponent>

Next, you must add an extra field type per language in schema_extra_types.xml. If you reuse the field types of the normal fields, the suggestions will be stemmed. For example, "maternity" is stemmed to "matern", and we don't want to suggest the search term "matern" to someone who misspells "maternity" as "meternite".

Below is my full file for reference.

<?xml version="1.0" encoding="UTF-8" ?>

<types>

  <!--
    Arabic Text Field
    4.5.0
  -->
  <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ar.txt" expand="1" ignoreCase="1"/>
      <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Field that is not stemmed for spelling. -->
  <fieldType name="text_spell_ar" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ar.txt" expand="1" ignoreCase="1"/>
    </analyzer>
  </fieldType>

  <!--
    English Text Field
    4.5.0
  -->
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" expand="1" ignoreCase="1"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Field that is not stemmed for spelling. -->
  <fieldType name="text_spell_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" expand="1" ignoreCase="1"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

</types>
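The reason for the separate unstemmed types can be shown with a small Python sketch. It is only a toy: `toy_stem` is a made-up suffix stripper standing in for the Snowball stemmer, and `difflib` stands in for Solr's spellchecker, so the similarity scores are not what Solr would compute.

```python
import difflib
from typing import List


def toy_stem(term: str) -> str:
    """Crude stand-in for a stemmer: strips one known suffix."""
    for suffix in ("ity", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def suggest(misspelled: str, dictionary: List[str]) -> List[str]:
    """Return the closest dictionary term, if any is similar enough."""
    return difflib.get_close_matches(misspelled, dictionary, n=1, cutoff=0.6)


indexed_terms = ["maternity", "leave"]

# Dictionary built from a stemmed field (like text_en).
stemmed_dictionary = [toy_stem(term) for term in indexed_terms]

# Dictionary built from an unstemmed field (like text_spell_en).
unstemmed_dictionary = list(indexed_terms)

stemmed_suggestion = suggest("meternite", stemmed_dictionary)
unstemmed_suggestion = suggest("meternite", unstemmed_dictionary)
```

With the stemmed dictionary the only available suggestion is the stem "matern", while the unstemmed dictionary can offer the real word "maternity".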

Next, we need to fill these fields dynamically. To do that, update schema_extra_fields.xml. The full file is here for reference.

<?xml version="1.0" encoding="UTF-8" ?>

<fields>
  <!-- Define dynamic fields per language. -->
  <dynamicField name="ts_X3b_ar_*" type="text_ar" stored="true" indexed="true" multiValued="false" termVectors="true"  />
  <dynamicField name="tm_X3b_ar_*" type="text_ar" stored="true" indexed="true" multiValued="true" termVectors="true"  />
  <dynamicField name="ts_X3b_en_*" type="text_en" stored="true" indexed="true" multiValued="false" termVectors="true"  />
  <dynamicField name="tm_X3b_en_*" type="text_en" stored="true" indexed="true" multiValued="true" termVectors="true"  />
  <dynamicField name="ts_X3b_und_*" type="text_und" stored="true" indexed="true" multiValued="false" termVectors="true"  />
  <dynamicField name="tm_X3b_und_*" type="text_und" stored="true" indexed="true" multiValued="true" termVectors="true"  />

  <!-- Define spellcheck fields per language. These types don't stem the words. -->
  <field name="spell_ar" type="text_spell_ar" indexed="true" stored="true" multiValued="true" />
  <field name="spell_en" type="text_spell_en" indexed="true" stored="true" multiValued="true" />

  <!-- Copy the contents of the multilingual fields into the spellcheck fields by language. -->
  <copyField source="ts_X3b_ar_*" dest="spell_ar"/>
  <copyField source="tm_X3b_ar_*" dest="spell_ar"/>
  <copyField source="ts_X3b_en_*" dest="spell_en"/>
  <copyField source="tm_X3b_en_*" dest="spell_en"/>
</fields>

To test this, you'll need to update your schema_extra_fields.xml, schema_extra_types.xml and solrconfig_extra.xml files, reload the Solr configuration for the core, delete all the data on the core, and reindex the content.
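The Solr side of those steps can be sketched as plain HTTP requests. The snippet below only builds the URLs and the request body and sends nothing. The base URL and the core name "drupal" are assumptions, and the spellcheck URL assumes the /select handler has the spellcheck component enabled. Reindexing itself is triggered from Drupal and is not covered here.

```python
import json
from typing import Tuple
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "drupal"  # hypothetical core name


def reload_core_url(core: str) -> str:
    """CoreAdmin request that reloads the configuration of one core."""
    return f"{SOLR_BASE}/admin/cores?" + urlencode({"action": "RELOAD", "core": core})


def delete_all_request(core: str) -> Tuple[str, str]:
    """URL and JSON body of a delete-by-query that removes every document."""
    url = f"{SOLR_BASE}/{core}/update?commit=true"
    body = json.dumps({"delete": {"query": "*:*"}})
    return url, body


def spellcheck_url(core: str, term: str, langcode: str) -> str:
    """Query that asks one language-specific spellchecker for suggestions."""
    params = {
        "q": term,
        "spellcheck": "true",
        "spellcheck.dictionary": f"spellchecker_{langcode}",
    }
    return f"{SOLR_BASE}/{core}/select?" + urlencode(params)
```

After reindexing, a request such as `spellcheck_url(CORE, "meternite", "en")` can be opened in a browser to check which suggestions the English spellchecker returns.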

Comment #4

damontgomery commented 8 June 2017 at 17:30

Category:	Bug report	» Feature request
Status:	Active	» Needs work

Changing the category to Feature request and the status to Needs work.

Comment #5

damontgomery commented 8 June 2017 at 20:16

Status	File	Size
new	search_api_solr_multilingual--add-spellcheck-per-language-support--2735625--5.patch	1.04 KB

1 file was hidden/shown/deleted

Status	File	Size
hidden	search_api_solr_multilingual--add-spellcheck-per-language-support--2735625--2.patch	1.01 KB

My previous patch produced a PHP error about references. This one doesn't. :)

Comment #6

mkalkbrenner

commented 9 June 2017 at 09:30

You're moving in the right direction :-)

But Search API Multilingual Solr Search already has a mechanism that dynamically generates the extra XML files and modifies schema.xml and solrconfig.xml according to the languages active in your Drupal instance.

All this magic happens in Drupal\search_api_solr_multilingual\Controller\SolrFieldTypeListBuilder.

So the missing part is to extend the patch so that it generates the modifications to the XML files you listed in #3.
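To make the idea concrete, here is a minimal Python sketch of generating the per-language XML from #3 out of a list of active language codes. It is not how SolrFieldTypeListBuilder works internally. The function names and templates are hypothetical and simply mirror the hand-written files from #3.

```python
from typing import Iterable

SPELLCHECKER_TEMPLATE = """\
  <lst name="spellchecker">
    <str name="name">spellchecker_{lang}</str>
    <str name="field">spell_{lang}</str>
    <str name="spellcheckIndexDir">./spellchecker_{lang}</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
  </lst>
"""

SPELL_FIELDS_TEMPLATE = """\
  <field name="spell_{lang}" type="text_spell_{lang}" indexed="true" stored="true" multiValued="true" />
  <copyField source="ts_X3b_{lang}_*" dest="spell_{lang}"/>
  <copyField source="tm_X3b_{lang}_*" dest="spell_{lang}"/>
"""


def spellchecker_entries(languages: Iterable[str]) -> str:
    """Spellchecker definitions for solrconfig_extra.xml, one per language."""
    return "\n".join(SPELLCHECKER_TEMPLATE.format(lang=lang) for lang in languages)


def spell_fields(languages: Iterable[str]) -> str:
    """Spell fields and copyField rules for schema_extra_fields.xml."""
    return "\n".join(SPELL_FIELDS_TEMPLATE.format(lang=lang) for lang in languages)
```

Calling `spellchecker_entries(["ar", "en"])` yields the two language-specific `<lst name="spellchecker">` blocks shown in #3, so enabling another language in Drupal would only require regenerating the files.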

Comment #7

mkalkbrenner

commented 11 February 2018 at 09:50

Project:

Search API Multilingual Solr Search

» Search API Solr

The implementation should now follow the same approach as the new Suggester.

Comment #8

mkalkbrenner

commented 11 February 2018 at 09:50

Version:	8.x-1.x-dev	» 8.x-2.x-dev
Component:	Code	» Multilingual Backend

Comment #9

mkalkbrenner

commented 1 January 2019 at 14:09

Version:	8.x-2.x-dev	» 8.x-3.x-dev
Component:	Multilingual Backend	» Code

I'm working on it:

Configure a spell field as part of a SolrFieldType
Extend the config generator
Route the spellchecker accordingly
Use cron to build the catalogs

Comment #10

mkalkbrenner

commented 2 January 2019 at 17:13

Status	File	Size
new	2735625_WIP.patch	66.23 KB

Still WIP, but already a big patch ;-)

Comment #11

mkalkbrenner

commented 3 January 2019 at 10:33

Status:

Needs work

» Needs review

Status	File	Size
new	2735625_final.patch	173.67 KB

2 files were hidden/shown/deleted

Status	File	Size
hidden	search_api_solr_multilingual--add-spellcheck-per-language-support--2735625--5.patch	1.04 KB
hidden	2735625_WIP.patch	66.23 KB

Here's the final patch.

To fill a spellcheck directory, you need to add "Spellcheck" fields to your index, just like the Suggesters in 8.x-2.x.

Comment #12

3 January 2019 at 12:17

mkalkbrenner committed 19bd3e2 on 8.x-3.x

Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

mkalkbrenner committed 2918f6d on 8.x-3.x

Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

mkalkbrenner committed 42a6c8e on 8.x-3.x

Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

mkalkbrenner committed accb2e6 on 8.x-3.x

Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

mkalkbrenner committed fc22940 on 8.x-3.x

Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

Comment #13

3 January 2019 at 15:47

mkalkbrenner committed 1a03ec0 on 8.x-3.x

Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

Comment #14

mpp commented 8 January 2019 at 13:44

+++ b/config/install/search_api_solr.solr_field_type.text_und_6_0_0.yml
@@ -81,8 +81,70 @@ field_type:
+        ignoreCase: true
...
+        class: solr.RemoveDuplicatesTokenFilterFactory

+++ b/search_api_solr.install
@@ -649,3 +649,17 @@ function search_api_solr_update_8301() {
+ * Re-install language-specific filed types to enable the new spellcheckers.

+ * Re-install language-specific field types to enable the new spellcheckers.

Comment #15

mkalkbrenner

commented 17 January 2019 at 20:36

Status:

Needs review

» Fixed

Typo fixed in my repo.

Comment #16

31 January 2019 at 20:39

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
