The problem I experienced

I deployed my site to a staging environment and, after reindexing the content, every node was being returned for any search query. It seemed that the indexing wasn't working correctly.

My observations

After checking the database, I noticed that the fulltext fields were not being indexed. Since I'm also using "fuzzysearch", I checked the fuzzysearch_default_fuzzysearch_index_* tables and realised that fuzzysearch_default_fuzzysearch_index_search_api_language had data in it, but fuzzysearch_default_fuzzysearch_index_search_api_viewed did not. No matter how many times I added, removed, or changed the fields in the search index, the fulltext fields were not indexed.

I started to debug the issue by scattering var_dump() calls around the code and realised that in includes/index_entity.inc (function preprocessIndexItems) one of the processors was returning "p" as a value every single time, regardless of the input (see example output below).

["title"]=>
    array(3) {
      ["type"]=>
      string(6) "tokens"
      ["value"]=>
      array(1) {
        [0]=>
        array(2) {
          ["value"]=>
          string(1) "p"
          ["score"]=>
          int(1)
        }
      }
      ["original_type"]=>
      string(4) "text"
    }
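In hindsight, that output is consistent with PCRE silently degrading the unsupported \p escape to a literal "p". Here is a minimal Python sketch of that hypothesis (the sample title and the degraded character class are my own assumptions, not the module's actual code):

```python
import re

# Hypothesis: without Unicode property support, PCRE treats the escape
# "\p" as a literal "p", so the class [^\p{L}\p{N}] effectively becomes
# [^pLN{}] -- "treat everything EXCEPT p, L, N, { and } as whitespace".
degraded_class = re.compile(r"[^pLN{}]")

title = "Sample"  # hypothetical node title
tokens = degraded_class.sub(" ", title).split()
print(tokens)  # → ['p']: every character except the "p" was replaced
```

That would explain why the only surviving token for any title containing a "p" is the string "p" itself, exactly as in the dump above.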

The solution

After reading the resources below I realised that PCRE might *not* have been compiled with UTF-8 support, so I changed the Tokenizer "whitespace" setting (/admin/config/search/search_api/index/default_fuzzysearch_index/workflow) to [^[:alnum:]]

...everything worked as normal after that.

So I'm proposing that the default setting [^\p{L}\p{N}] be changed to [^[:alnum:]] to ensure maximum compatibility across systems.
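To illustrate the trade-off: [:alnum:] in the C locale matches only ASCII letters and digits, while \p{L}\p{N} is Unicode-aware. A quick Python sketch (using [^0-9A-Za-z] as a stand-in for the POSIX class and \W as a stand-in for the Unicode default, since Python's re module supports neither [:alnum:] nor \p{...}):

```python
import re

text = "Héllo wörld"

# ASCII-only class, approximating [^[:alnum:]] in the C locale:
ascii_tokens = [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]
print(ascii_tokens)    # → ['H', 'llo', 'w', 'rld']: accented letters split words

# Unicode-aware split, approximating the [^\p{L}\p{N}] default:
unicode_tokens = [t for t in re.split(r"\W+", text) if t]
print(unicode_tokens)  # → ['Héllo', 'wörld']
```

So on systems where PCRE *is* compiled with UTF-8, the Unicode default tokenizes non-ASCII text more accurately; the ASCII class simply fails less catastrophically where it isn't.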

Resources:

http://framework.zend.com/issues/browse/ZF-1641
http://regexkit.sourceforge.net/Documentation/pcre/pcre.html

TL;DR: if your fulltext indexing isn't working, try [^[:alnum:]] in the Tokenizer processor's "whitespace" setting.


Comments

drunken monkey

Status: Active » Needs review

Thanks for reporting this! Such specifics are always hard to find. It seems your solution produces the exact same results, and if it is more portable, then that's of course great!

I also wondered a bit about the additional characters I had included as non-spaces (^') or ignorables (-); they don't make much sense to me, so I changed those as well. Please review the attached patch and see if you find the changes reasonable!

Status: Needs review » Needs work

The last submitted patch, 1606122--tokenizer_regexp_defaults-1.patch, failed testing.

mbelos

Seems good to me, thanks for that!

drunken monkey

Status: Needs work » Fixed

OK, thanks for testing!
Committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.