The problem I experienced

I deployed my site to a staging environment and, after reindexing the content, every node was being returned for any search query. It seemed that the indexing wasn't working correctly.

My observations

After checking the database, I noticed that the fulltext fields were not being indexed. Since I'm also using "fuzzysearch", I checked the fuzzysearch_default_fuzzysearch_index_* tables and realised that fuzzysearch_default_fuzzysearch_index_search_api_language had data in it, but fuzzysearch_default_fuzzysearch_index_search_api_viewed did not. No matter how many times I added, removed, or changed the fields in the search index, the fulltext fields were not indexed.

I started to debug the issue by scattering var_dump() calls around the code and realised that in includes/index_entity.inc (function preprocessIndexItems) one of the processors was returning "p" as a value every single time, regardless of the input (see example output below).

["title"]=>
    array(3) {
      ["type"]=>
      string(6) "tokens"
      ["value"]=>
      array(1) {
        [0]=>
        array(2) {
          ["value"]=>
          string(1) "p"
          ["score"]=>
          int(1)
        }
      }
      ["original_type"]=>
      string(4) "text"
    }
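In hindsight, that output is consistent with PCRE silently degrading the unsupported \p escape to a literal "p". Here is a minimal Python sketch of that hypothesis (the sample title and the degraded character class are my own assumptions, not the module's actual code):

```python
import re

# Hypothesis: without Unicode property support, PCRE treats the escape
# "\p" as a literal "p", so the class [^\p{L}\p{N}] effectively becomes
# [^pLN{}] -- "treat everything EXCEPT p, L, N, { and } as whitespace".
degraded_class = re.compile(r"[^pLN{}]")

title = "Sample"  # hypothetical node title
tokens = degraded_class.sub(" ", title).split()
print(tokens)  # → ['p']: every character except the "p" was replaced
```

That would explain why the only surviving token for any title containing a "p" is the string "p" itself, exactly as in the dump above.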

The solution

After reading the resources below I realised that PCRE might *not* have been compiled with UTF-8 support, so I changed the Tokenizer "whitespace" setting (/admin/config/search/search_api/index/default_fuzzysearch_index/workflow) to [^[:alnum:]]

...everything worked as normal after that.

So I'm proposing that the default setting [^\p{L}\p{N}] be changed to [^[:alnum:]] to ensure maximum compatibility across systems.
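To illustrate the trade-off: [:alnum:] in the C locale matches only ASCII letters and digits, while \p{L}\p{N} is Unicode-aware. A quick Python sketch (using [^0-9A-Za-z] as a stand-in for the POSIX class and \W as a stand-in for the Unicode default, since Python's re module supports neither [:alnum:] nor \p{...}):

```python
import re

text = "Héllo wörld"

# ASCII-only class, approximating [^[:alnum:]] in the C locale:
ascii_tokens = [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]
print(ascii_tokens)    # → ['H', 'llo', 'w', 'rld']: accented letters split words

# Unicode-aware split, approximating the [^\p{L}\p{N}] default:
unicode_tokens = [t for t in re.split(r"\W+", text) if t]
print(unicode_tokens)  # → ['Héllo', 'wörld']
```

So on systems where PCRE *is* compiled with UTF-8, the Unicode default tokenizes non-ASCII text more accurately; the ASCII class simply fails less catastrophically where it isn't.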

Resources:

http://framework.zend.com/issues/browse/ZF-1641
http://regexkit.sourceforge.net/Documentation/pcre/pcre.html

TL;DR: if your fulltext indexing isn't working, try [^[:alnum:]] in the Tokenizer processor's "whitespace" setting.


Comments

drunken monkey

Status: Active » Needs review

Thanks for reporting this! Such specifics are always hard to find. It seems your solution produces the exact same results, and if it is more portable, then that's of course great!

I also wondered a bit about the additional characters I had included as non-spaces (^') or ignorables (-); they don't make much sense to me, so I changed those as well. Please review the attached patch and see if you find the changes reasonable!

Status: Needs review » Needs work

The last submitted patch, 1606122--tokenizer_regexp_defaults-1.patch, failed testing.

mbelos

Seems good to me, thanks for that!

drunken monkey

Status: Needs work » Fixed

OK, thanks for testing!
Committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.