Comments

mkalkbrenner created an issue. See original summary.

mkalkbrenner’s picture

Title: Add langauge-specific spell fields » Add language-specific spell fields
damontgomery’s picture

I believe I have a start on a solution here. It's definitely not complete, and it can't be merged without an upgrade path for existing sites, but I wanted to get this moving in case someone else can jump in.

Please see the attached patch. It updates the Search API backend to add a 'spellcheck.dictionary' parameter to the Solr query, which tells Solr which spellchecker to use. To decide whether the site wants per-language search, I reused the configuration that is already present. A dedicated setting might make sense, but this seems to work. I'm also not clear on how the query languages are intended to work: if several are present, we need to pick one for spellcheck, so I select the first. I'm not sure whether there are cases where no language is set, so that might need to be addressed as well.
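To make the selection rule concrete, here is a minimal sketch in Python (not the module's PHP, and not the code from the patch). The helper name and its arguments are hypothetical. The dictionary names follow the `spellchecker_<langcode>` pattern and the `default` spellchecker from the solrconfig_extra.xml in this comment.

```python
from typing import Dict, Sequence


def spellcheck_params(languages: Sequence[str], per_language: bool) -> Dict[str, str]:
    """Build the spellcheck part of a Solr query.

    Hypothetical helper for illustration only; the name and arguments
    are assumptions, not taken from the patch.
    """
    params = {"spellcheck": "true"}
    if per_language and languages:
        # Only one dictionary is selected per query, so use the first language.
        params["spellcheck.dictionary"] = f"spellchecker_{languages[0]}"
    else:
        # No language available: fall back to the "default" spellchecker.
        params["spellcheck.dictionary"] = "default"
    return params
```

The fallback branch is one possible answer to the open question of what to do when a query carries no language.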

With the patch, you also need to update a few configuration files.

Create or edit solrconfig_extra.xml.

This defines which spellcheckers should exist.

<!-- Spell Check

    The spell check component can return a list of alternative spelling
    suggestions.

    http://wiki.apache.org/solr/SpellCheckComponent
 -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <str name="queryAnalyzerFieldType">textSpell</str>

  <!-- Multiple "Spell Checkers" can be declared and used by this
       component
    -->

  <!-- a spellchecker built from a field of the main index, and
       written to disk
    -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellchecker</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
    <!-- uncomment this to require terms to occur in 1% of the documents in order to be included in the dictionary
      <float name="thresholdTokenFrequency">.01</float>
    -->
  </lst>

  <!-- Arabic -->
  <lst name="spellchecker">
    <str name="name">spellchecker_ar</str>
    <str name="field">spell_ar</str>
    <str name="spellcheckIndexDir">./spellchecker_ar</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
  </lst>

  <!-- English -->
  <lst name="spellchecker">
    <str name="name">spellchecker_en</str>
    <str name="field">spell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
  </lst>

</searchComponent>

Next, you must add an extra field type per language in schema_extra_types.xml. If the spellcheck fields used the same field types as the normal fields, the suggestions would be stemmed. For example, "maternity" is stemmed to "matern", and we don't want to suggest the search term "matern" to someone who misspells "maternity" as "meternite".

Below is my full file for reference.

<?xml version="1.0" encoding="UTF-8" ?>

<types>

  <!--
    Arabic Text Field
    4.5.0
  -->
  <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ar.txt" expand="1" ignoreCase="1"/>
      <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Field that is not stemmed for spelling. -->
  <fieldType name="text_spell_ar" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_ar.txt" enablePositionIncrements="1"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ar.txt" expand="1" ignoreCase="1"/>
    </analyzer>
  </fieldType>

  <!--
    English Text Field
    4.5.0
  -->
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" expand="1" ignoreCase="1"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Field that is not stemmed for spelling. -->
  <fieldType name="text_spell_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      <filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" protected="protwords_en.txt" splitOnCaseChange="0" generateWordParts="1" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" expand="1" ignoreCase="1"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

</types>

Next, we need to fill these fields dynamically. Update schema_extra_fields.xml so that the per-language text fields are copied into the spellcheck fields. The full file is here for reference.

<?xml version="1.0" encoding="UTF-8" ?>

<fields>
  <!-- Define dynamic fields per language. -->
  <dynamicField name="ts_X3b_ar_*" type="text_ar" stored="true" indexed="true" multiValued="false" termVectors="true"  />
  <dynamicField name="tm_X3b_ar_*" type="text_ar" stored="true" indexed="true" multiValued="true" termVectors="true"  />
  <dynamicField name="ts_X3b_en_*" type="text_en" stored="true" indexed="true" multiValued="false" termVectors="true"  />
  <dynamicField name="tm_X3b_en_*" type="text_en" stored="true" indexed="true" multiValued="true" termVectors="true"  />
  <dynamicField name="ts_X3b_und_*" type="text_und" stored="true" indexed="true" multiValued="false" termVectors="true"  />
  <dynamicField name="tm_X3b_und_*" type="text_und" stored="true" indexed="true" multiValued="true" termVectors="true"  />

  <!-- Define spellcheck fields per language. These types don't stem the words. -->
  <field name="spell_ar" type="text_spell_ar" indexed="true" stored="true" multiValued="true" />
  <field name="spell_en" type="text_spell_en" indexed="true" stored="true" multiValued="true" />

  <!-- Copy the contents of the multilingual fields into the spellcheck fields by language. -->
  <copyField source="ts_X3b_ar_*" dest="spell_ar"/>
  <copyField source="tm_X3b_ar_*" dest="spell_ar"/>
  <copyField source="ts_X3b_en_*" dest="spell_en"/>
  <copyField source="tm_X3b_en_*" dest="spell_en"/>
</fields>

To test this, you'll need to update your schema_extra_fields.xml, schema_extra_types.xml and solrconfig_extra.xml files, reload the Solr configuration for the core, delete all the data on the core, and reindex the content.
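The reload and delete steps can be scripted against Solr's HTTP API. The sketch below only builds the requests and does not send them. The base URL and core name in the usage note are assumptions; adjust them to your setup. Reindexing itself happens on the Drupal side through Search API and is not shown.

```python
import json
from typing import Tuple
from urllib.parse import urlencode


def reload_core_url(base_url: str, core: str) -> str:
    """URL for the CoreAdmin RELOAD action, which re-reads the core's config."""
    query = urlencode({"action": "RELOAD", "core": core})
    return f"{base_url.rstrip('/')}/admin/cores?{query}"


def delete_all_request(base_url: str, core: str) -> Tuple[str, bytes]:
    """URL and JSON body that delete every document in the core and commit."""
    url = f"{base_url.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    return url, body
```

For example, `reload_core_url("http://localhost:8983/solr", "drupal")` gives the reload URL for a core named "drupal". The body from `delete_all_request` needs to be sent as a POST with the `Content-Type: application/json` header.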

damontgomery’s picture

Category: Bug report » Feature request
Status: Active » Needs work

Changing the category to Feature request and the status to Needs work.

damontgomery’s picture

My previous patch produced a PHP error about references. This one doesn't. :)

mkalkbrenner’s picture

You're moving in the right direction :-)

But Search API Multilingual Solr Search already has a mechanism that dynamically generates the extra XML files and modifies schema.xml and solrconfig.xml according to the languages active in your Drupal instance.

All this magic happens in Drupal\search_api_solr_multilingual\Controller\SolrFieldTypeListBuilder.

So the missing part is to extend the patch to generate the modifications to the XML files you listed in #3.
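As a rough illustration of what generating these entries per language could look like, here is a Python sketch. It is not the module's PHP and not how SolrFieldTypeListBuilder works; the function names are hypothetical. It emits one spellchecker definition per language code, mirroring the hand-written entries from #3.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def spellchecker_element(langcode: str) -> ET.Element:
    """Build one <lst name="spellchecker"> entry for a language code."""
    lst = ET.Element("lst", {"name": "spellchecker"})
    settings = (
        ("name", f"spellchecker_{langcode}"),
        ("field", f"spell_{langcode}"),
        ("spellcheckIndexDir", f"./spellchecker_{langcode}"),
        ("buildOnOptimize", "true"),
        ("buildOnCommit", "true"),
    )
    for name, value in settings:
        child = ET.SubElement(lst, "str", {"name": name})
        child.text = value
    return lst


def spellcheckers_xml(langcodes: Iterable[str]) -> str:
    """Serialize one spellchecker entry per language, one per line."""
    return "\n".join(
        ET.tostring(spellchecker_element(code), encoding="unicode")
        for code in langcodes
    )
```

Calling `spellcheckers_xml(["ar", "en"])` produces the Arabic and English entries. The same loop over the active languages could also emit the matching field and copyField definitions.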

mkalkbrenner’s picture

Project: Search API Multilingual Solr Search » Search API Solr

The implementation should now follow the same approach as the new Suggester.

mkalkbrenner’s picture

Version: 8.x-1.x-dev » 8.x-2.x-dev
Component: Code » Multilingual Backend
mkalkbrenner’s picture

Version: 8.x-2.x-dev » 8.x-3.x-dev
Component: Multilingual Backend » Code

I'm working on it:

  • Configure a spell field as part of a SolrFieldType
  • Extend the config generator
  • Route the spellchecker accordingly
  • Use cron to build the catalogs
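For the last item, Solr can rebuild a spellcheck dictionary on request through the `spellcheck.build=true` parameter. The Python sketch below only builds the request URLs a periodic job could call. Everything else is an assumption: the `spellchecker_<langcode>` naming is borrowed from #3 and may differ from what the final patch generates, and the `/select` handler must have the spellcheck component attached.

```python
from typing import Iterable, List
from urllib.parse import urlencode


def build_dictionary_urls(core_url: str, langcodes: Iterable[str],
                          handler: str = "/select") -> List[str]:
    """Return one URL per language that asks Solr to rebuild its dictionary."""
    urls = []
    for code in langcodes:
        query = urlencode({
            "q": "*:*",
            "rows": 0,  # no documents needed, only the build side effect
            "spellcheck": "true",
            "spellcheck.dictionary": f"spellchecker_{code}",
            "spellcheck.build": "true",
        })
        urls.append(f"{core_url.rstrip('/')}{handler}?{query}")
    return urls
```

Building from a scheduled job instead of using buildOnCommit keeps the rebuild cost out of the indexing path.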
mkalkbrenner’s picture

Attached file (status: new, size: 66.23 KB)

Still WIP, but already a big patch ;-)

mkalkbrenner’s picture

Status: Needs work » Needs review
Attached file (status: new, size: 173.67 KB)

Here's the final patch.

To fill a spellcheck directory you need to add "Spellcheck" fields to your index, just like the Suggesters in 8.x-2.x.

  • mkalkbrenner committed 19bd3e2 on 8.x-3.x
    Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...
  • mkalkbrenner committed 2918f6d on 8.x-3.x
    Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...
  • mkalkbrenner committed 42a6c8e on 8.x-3.x
    Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...
  • mkalkbrenner committed accb2e6 on 8.x-3.x
    Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...
  • mkalkbrenner committed fc22940 on 8.x-3.x
    Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...

  • mkalkbrenner committed 1a03ec0 on 8.x-3.x
    Issue #2735625 by damontgomery, mkalkbrenner: Add language-specific...
mpp’s picture

+++ b/config/install/search_api_solr.solr_field_type.text_und_6_0_0.yml
@@ -81,8 +81,70 @@ field_type:
+        ignoreCase: true
...
+        class: solr.RemoveDuplicatesTokenFilterFactory

+++ b/search_api_solr.install
@@ -649,3 +649,17 @@ function search_api_solr_update_8301() {
+ * Re-install language-specific filed types to enable the new spellcheckers.

+ * Re-install language-specific field types to enable the new spellcheckers.

mkalkbrenner’s picture

Status: Needs review » Fixed

Typo fixed in my repo.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.