From the BoF session at DrupalCon.

Comments

janusman’s picture

Version: 5.x-1.0-beta1 » 5.x-1.0-alpha3
Status: Active » Needs review

Did some research and came up with the basic Solr configuration (no Drupal code yet!)

1) Apply this patch to your Solr installation's example/solr/conf/schema.xml:

--- apachesolr-5-head/schema.xml        2008-07-19 14:07:55.000000000 -0500
+++ schema_spellcheck.xml       2008-08-01 14:47:59.490671000 -0500
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="UTF-8" ?>
-<!-- $Id: schema.xml,v 1.1.2.3 2008/07/19 19:07:55 robertDouglass Exp $ -->
+<!-- $Id: schema.xml,v 1.1.2.2 2008/06/14 18:44:29 robertDouglass Exp $ -->
 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
@@ -216,6 +216,22 @@
       </analyzer>
     </fieldType>
 
+  <fieldType name="spellText" class="solr.TextField" positionIncrementGap="100">
+   <analyzer type="index">
+     <tokenizer class="solr.StandardTokenizerFactory"/>
+     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
+     <filter class="solr.StandardFilterFactory"/>
+     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+   </analyzer>
+   <analyzer type="query">
+     <tokenizer class="solr.StandardTokenizerFactory"/>
+     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
+     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
+     <filter class="solr.StandardFilterFactory"/>
+     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+   </analyzer>
+  </fieldType>
+
     <!-- since fields of this type are by default not stored or indexed, any data added to
          them will be ignored outright
      -->
@@ -258,6 +274,7 @@
    <field name="taxonomy_name" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="text" type="text" indexed="true" stored="false"/>
    <field name="language" type="string" indexed="true" stored="true"/>
+   <field name="word" type="spellText" indexed="true" stored="true" />
 
    <!-- Here, default is used to create a "timestamp" field indicating
         When each document was indexed.
@@ -297,6 +314,9 @@
    <dynamicField name="*" type="ignored" />
 
  </fields>
+  
+ <copyField source="text" dest="word" />
+
 
  <!-- Field to use to determine and enforce document uniqueness.
       Unless this field is marked with required="false", it will be a required field

The field name is "word" because that's the way it's configured in solrconfig.xml--so DON'T CHANGE IT =)

2) Reindex... (remove example/solr/data, tell Drupal to reindex, restart solr, run cron.php a lot, etc.) =)

3) After some time, you need to rebuild the spelling dictionary. You do this by calling Solr with a special URL (replace example.com with your hostname):

http://example.com:8983/solr/select?qt=spellchecker&cmd=rebuild

4) Now, you get suggestions for a query term like this:

http://example.com:8983/solr/select?qt=spellchecker&q=[word]

Sorry, it only accepts one word, so q=word1+word2 will not return any suggestions.

You can read a thread on this here: http://markmail.org/message/gluyo6mhuxqltzdh
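
Steps 3 and 4 can be sketched in PHP as a small URL builder. This is only an illustration, assuming the Solr 1.2 "spellchecker" handler described above; the function name example_spellchecker_url() is hypothetical and is not part of apachesolr.module or SolrPhpClient.

```php
<?php
/**
 * Hypothetical helper (not part of apachesolr.module): builds the URL for the
 * Solr 1.2 "spellchecker" request handler described in steps 3 and 4.
 *
 * @param string $base Base Solr URL, e.g. "http://example.com:8983/solr".
 * @param string $word Single word to check, or NULL to build the rebuild URL.
 */
function example_spellchecker_url($base, $word = NULL) {
  $params = array('qt' => 'spellchecker');
  if ($word === NULL) {
    // Step 3: rebuild the spelling dictionary.
    $params['cmd'] = 'rebuild';
  }
  else {
    // Step 4: ask for suggestions; the handler only accepts one word.
    $params['q'] = $word;
  }
  // Pass '&' explicitly so the result does not depend on php.ini settings.
  return rtrim($base, '/') . '/select?' . http_build_query($params, '', '&');
}

echo example_spellchecker_url('http://example.com:8983/solr'), "\n";
echo example_spellchecker_url('http://example.com:8983/solr', 'causi'), "\n";
```

The first call prints the rebuild URL from step 3 and the second prints the suggestion URL from step 4 for the word "causi".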

Now... where should the actual PHP code for using suggestions go? The apachesolr.module... or SolrPhpClient?

janusman’s picture

Ok, got this working in an actual site (perhaps not very elegantly, or maybe not even correctly).

I patched apachesolr_search.module to return a "fake" $results[] element that includes the suggestion. To try it, test out some queries on our test site:

Again, it only works with single words--no phrases. After patching the schema.xml, you need to copy it over to the Solr instance, reindex, etc etc as per the instructions in the previous comment.

Patch is for D5-1-alpha3.

janusman’s picture

Another small snippet to put at the end of apachesolr_search.module ... this one rebuilds the spellchecker automatically. Since it takes a while to do this (20-120 seconds) I set it up so it only does it every 24 hours.

Sorry, not yet pinging Solr beforehand...

/**
 * Implementation of hook_cron().
 */
function apachesolr_search_cron() {
  // Rebuild the spelling dictionary at most once every 24 hours.
  $last = variable_get('apachesolr_search_spellrebuildtime', 0);
  if (time() < $last + 3600 * 24) {
    return;
  }
  $host = variable_get('apachesolr_host', 'localhost');
  $port = variable_get('apachesolr_port', 8983);
  $path = variable_get('apachesolr_path', '/solr');

  // $path already ends in "/solr" by default, so only "/select" is appended:
  // http://example.com:8983/solr/select?qt=spellchecker&cmd=rebuild
  $request = "http://{$host}:{$port}{$path}/select?qt=spellchecker&cmd=rebuild";
  $result = drupal_http_request($request);

  // Mark time of last spellchecker rebuild.
  variable_set('apachesolr_search_spellrebuildtime', time());
}

Sorry for not issuing a complete patch... will work on it soon. =)

By the way I just went live with this on our production site... try it:
http://biblioteca.mty.itesm.mx/pasteur/en/search/apachesolr_search/testt

janusman’s picture

Just a small FYI...

Spellchecker rebuild time benchmark for a Solaris 8 box.

2000 nodes: 20 seconds.

30,000 nodes: 65 seconds.

robertDouglass’s picture

Great stuff. It's on my todo list.

robertDouglass’s picture

Priority: Normal » Critical

Are these issues duplicates? @janusman, can you please combine them and close one? http://drupal.org/node/303937

Plus, I'm raising this to critical, meaning it should be included in the 1.0 release.

janusman’s picture

I have a patch for the DRUPAL-5 branch. Will work on the D6 port.

This only works for single-word queries for now.

robertDouglass’s picture

While testing the patch, I corrected a small error (PHP needs " instead of ' for variables to be interpolated). Now I'm getting the following:

http://localhost:8983/solr/select?qt=spellchecker&q=causi

HTTP ERROR: 400

unknown handler: spellchecker
RequestURI=/solr/select

I'm using the 1.3 nightly.

The word field is showing up in my index:

Apache Solr search index

Field name    Field index type
changed       integer
word          spellText

In the attached patch I fixed the " bug, tweaked the message text, and replaced time() with $_SERVER['REQUEST_TIME'] (a suggestion from Rasmus Lerdorf in Szeged).
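
For anyone following along, the quoting issue is the usual PHP one: variables are only interpolated inside double-quoted strings. A minimal illustration follows. The URL and variable names here are assumptions chosen for the example, not the exact line from the patch.

```php
<?php
$host = 'localhost';
$port = 8983;

// Single quotes: no interpolation, the placeholders stay literal.
$single = 'http://{$host}:{$port}/solr/select';

// Double quotes: variables are interpolated.
$double = "http://{$host}:{$port}/solr/select";

echo $single, "\n"; // http://{$host}:{$port}/solr/select
echo $double, "\n"; // http://localhost:8983/solr/select
```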

janusman’s picture

Sorry about the quotes.

I think the problem is that Solr 1.3 differs from 1.2. Are we aiming to switch to Solr 1.3 before the 1.0 release?

I will double check to see if the patch is correct (at least my code *is* spellchecking, and my Solr instance (1.2) is spellchecking). =)

janusman’s picture

Yup, 1.3 is different regarding spellcheck. See http://markmail.org/message/7tqnlgw6zums45p7 for a thread discussing this.

robertDouglass’s picture

Ok. So we need to make a decision on this, before 1.0. I'm going to start another thread, but I think an OO approach to this is in order, complete with a factory that finds out which Solr we're running on somewhere along the line.
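
A rough sketch of what such a factory could look like, purely to make the idea concrete. Every class and function name below is hypothetical and none of them exist in apachesolr.module. The Solr 1.3 parameters assume a request handler with the SpellCheckComponent attached, and how the running Solr version gets detected is left open.

```php
<?php
// Hypothetical sketch; none of these names exist in apachesolr.module.

interface ExampleSpellcheckParams {
  // Returns query parameters requesting suggestions for $word.
  public function suggest($word);
  // Returns query parameters requesting a dictionary rebuild.
  public function rebuild();
}

// Solr 1.2: the "spellchecker" request handler used earlier in this thread.
class ExampleSolr12Spellcheck implements ExampleSpellcheckParams {
  public function suggest($word) {
    return array('qt' => 'spellchecker', 'q' => $word);
  }
  public function rebuild() {
    return array('qt' => 'spellchecker', 'cmd' => 'rebuild');
  }
}

// Solr 1.3 and later: assumes the SpellCheckComponent is configured.
class ExampleSolr13Spellcheck implements ExampleSpellcheckParams {
  public function suggest($word) {
    return array('q' => $word, 'spellcheck' => 'true');
  }
  public function rebuild() {
    return array('q' => '*:*', 'spellcheck' => 'true', 'spellcheck.build' => 'true');
  }
}

// Picks an implementation from a plain version string such as "1.2" or "1.3".
function example_spellcheck_factory($solr_version) {
  if (version_compare($solr_version, '1.3', '>=')) {
    return new ExampleSolr13Spellcheck();
  }
  return new ExampleSolr12Spellcheck();
}

$params = example_spellcheck_factory('1.2')->suggest('causi');
echo http_build_query($params, '', '&'), "\n";
```

The calling code would then only deal with the interface, and the version-specific query parameters stay in one place.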

pwolanin’s picture

Status: Needs review » Closed (duplicate)

We have a basic version working in 6.x with 1.4 (and 1.3 before that).