Panopoly Search seems to be indexing the source code of pages, not just the visible content.

To reproduce:

  1. On a fresh install of the latest Panopoly with Panopoly Demo, go to the home page.
  2. Enter "public" in the Search field and press the Search button.
  3. Note the search results include all 4 demo nodes.

This seems to be happening because "public" appears in the source code as part of the URLs of the featured images.

Another way to reproduce:

  1. On a fresh install of the latest Panopoly with Panopoly Demo, go to the home page.
  2. Enter "taxonomy" in the Search field and press the Search button.
  3. Note there are no search results.
  4. Log in as an administrator.
  5. Add a term "bacon" in the Categories vocabulary.
  6. Edit the Lovely Vegetables page.
  7. Check the box to add the term "bacon."
  8. Save the page.
  9. Go to the home page.
  10. Enter "taxonomy" in the Search field and press the Search button.
  11. Note there is now 1 result: Lovely Vegetables.

Comments

dsnopek’s picture

I think this is from #2416505: Allow indexing content from "Full page override" with Search API, which is putting the full HTML in the index. It should probably strip the HTML, replacing any tags with a single space ' '.
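
To illustrate the suggestion above, here is a minimal Python sketch of stripping markup before indexing, with every tag replaced by a single space. It is not the Search API implementation; the `TagStripper` class, the `strip_html` helper, and the sample markup are all hypothetical and only show why attribute values such as image and term URLs stop matching once tags are removed.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes only; tags and their attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Replace the tag with a space so adjacent words do not run together.
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        # Note: text inside <script> or <style> would also land here.
        self.parts.append(data)


def strip_html(markup):
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical markup resembling the rendered output of a demo node.
rendered = (
    '<div class="field"><img src="/sites/default/files/public/veg.jpg" alt="">'
    '<h2>Lovely Vegetables</h2>'
    '<p>Fresh<a href="/taxonomy/term/1">bacon</a></p></div>'
)
text = strip_html(rendered)
print(text)
```

In this sketch "public" and "taxonomy" occur only inside attribute values, so they are present in `rendered` but absent from `text`, while "Fresh" and "bacon" stay separate words because the `<a>` tag became a space.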

cboyden’s picture

Search API provides an HTML filter that should work for this purpose. On the database index settings for the latest Panopoly, these are the fields it can be applied to:

  • Node ID
  • Content type
  • Title
  • URL
  • Status
  • Date created
  • Author
  • Item language
  • URI
  • Panelizer "Full page override" HTML output
  • Panelizer "Full page override" page title
  • Entity HTML output
  • Node access information
  • The main body text » Text
  • The main body text » Summary

As of now, only the Title field is filtered. In order to fix this, the feature also has to filter Entity HTML output and Panelizer "Full page override" HTML output.

While we're at it, we should probably also turn on the Ignore Case filter for the same fields.

Are there other fields that should be added to the filter list?

cboyden’s picture

Status: Active » Needs review
Attached file (status, size): new, 1.06 KB

Here is a patch that strips HTML and ignores case on the Panelizer "Full page override" HTML output and the Entity HTML output. It works in my testing.

cboyden’s picture

This will need an update hook. It doesn't appear to need a feature revert, just a cache clear and queuing things to be re-indexed. I've been clearing all caches, but is there a subset that can be cleared instead?

dsnopek’s picture

Issue tags: +sprint

Adding this issue to the sprint board in case someone wants to hack on it at the sprint today. :-)

The hook_update_N() should definitely target the necessary search_api caches directly - clearing all caches in an update function can be dangerous if another module is adding a new cache table in their update. However, I'm not sure exactly what those caches are and the functions to clear them. We'll need to dig into the search_api code!

Lowell’s picture

Status: Needs review » Needs work

Verified that patch #3 works as expected, but only after clearing all caches and then re-indexing the Search API - Database Node Index.

Lowell’s picture

Status: Needs work » Needs review
Attached files (status, size):

  • new, 1.87 KB
  • new, 833 bytes

Hoping I got this right: I added a hook_update_N() to queue the search indexes for re-indexing.

cboyden’s picture

Attached files (status, size):

  • new, 1.87 KB
  • new, 936 bytes

There was an error in the update function - it was missing a closing curly brace. I added that. I was also getting these watchdog errors during indexing:
An overlong word (more than 50 characters) was encountered while indexing, due to bad tokenizing. Please check your settings for the "Tokenizer" preprocessor to ensure that data is tokenized correctly.
So I added the Tokenizer filter to the two new fields as well.
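
As a rough illustration of the warning above, this Python sketch shows one way an overlong word can arise and how tokenizing avoids it. It is not the Search API Tokenizer; the helper names, the separator rule, and the sample string are assumptions made for the example. Only the 50-character limit comes from the watchdog message.

```python
import re

MAX_WORD_LENGTH = 50  # limit quoted in the watchdog message


def whitespace_words(text):
    """Split on whitespace only, as an untokenized field might be."""
    return text.split()


def tokenize(text):
    """Assumed rule: any run of non-alphanumeric characters separates words."""
    return [word for word in re.split(r"[^0-9A-Za-z]+", text) if word]


def overlong(words):
    """Return the words that exceed the length limit."""
    return [word for word in words if len(word) > MAX_WORD_LENGTH]


# Hypothetical text containing a long path with no whitespace in it.
sample = (
    "See sites/default/files/styles/panopoly_image_featured/public/"
    "lovely-vegetables.jpg today"
)

print(overlong(whitespace_words(sample)))  # the whole path counts as one word
print(overlong(tokenize(sample)))          # no word exceeds the limit
```

Splitting on whitespace alone leaves the path as a single word longer than 50 characters; splitting on every non-alphanumeric run breaks it into short words.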

cboyden’s picture

Attached file (status, size): new, 2.33 KB

dsnopek’s picture

Status: Needs review » Needs work

I haven't tested the upgrade path (someone at the sprint is working on that!) but looking at the patch, this updates the database index, but doesn't update the SOLR index. So, the same configuration changes will need to be done on the SOLR index too.

jfrederick’s picture

The upgrade path for #9 works great. Tested using Drush 7, Drush 8, and update.php.

cboyden’s picture

The Solr index is already running all three filters (ignore case, HTML filter, tokenizer) on the "The main body text » Text" field. This is different from the Database index, which was only running them on the Title field. Is this intentional? Should one of them be changed to match the other? I don't have Solr set up, so I can't test the effects.

dsnopek’s picture

Hrm. I don't think that's intentional! I think it makes sense to run those filters on both title and body.

@jfrederick at the NYCCamp sprint is going to test with SOLR on Pantheon.

cboyden’s picture

OK, if so we should also filter the summary in case it's different from the default. I'll have an updated patch later today.

cboyden’s picture

Attached file (status, size): new, 4.45 KB

OK, here's a patch which adds all required fields to the filters on both DB and Solr indexes. It's updated against recent changes from the sprint.

cboyden’s picture

Status: Needs work » Needs review

jfrederick’s picture

After deploying #9 to Pantheon with Solr enabled, the Solr configuration included in the patch is marked as overridden in the panopoly_search module. This patch is an untested attempt to prevent that. It includes the updates from #15.

jfrederick’s picture

Attached file (status, size): new, 5.21 KB

Here is an updated patch.

It incorporates the Features updates in #15.

In addition, if a search index configuration is stored in the database, it also updates the appropriate configuration in the database. That way, the database and the Feature match, keeping the Feature in its default state.
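
The idea can be sketched in Python as follows. This is not the code from the patch: the dictionary layout, the filter keys, the field names, and both helper functions are hypothetical, and the "overridden" check is reduced to a plain equality test between the default configuration and the stored copy.

```python
import copy

# Hypothetical filter keys and field names, for illustration only.
FILTERS = ("ignore_case", "html_filter", "tokenizer")
NEW_FIELDS = ["entity_html_output", "panelizer_html_output"]


def add_fields_to_filters(config, fields):
    """Return a copy of `config` with `fields` enabled on every filter."""
    updated = copy.deepcopy(config)
    processors = updated.setdefault("processors", {})
    for name in FILTERS:
        enabled = processors.setdefault(name, {"fields": []})["fields"]
        for field in fields:
            if field not in enabled:
                enabled.append(field)
    return updated


def is_overridden(default_config, stored_config):
    """A stored copy that differs from the default counts as overridden."""
    return stored_config is not None and stored_config != default_config


old_default = {"processors": {name: {"fields": ["title"]} for name in FILTERS}}
stored = copy.deepcopy(old_default)  # configuration saved in the database

# The updated Feature ships a new default; the stored copy is now stale.
new_default = add_fields_to_filters(old_default, NEW_FIELDS)
stale = is_overridden(new_default, stored)

# Applying the same change to the stored copy brings the two back in line.
stored = add_fields_to_filters(stored, NEW_FIELDS)
in_sync = not is_overridden(new_default, stored)
print(stale, in_sync)
```

Because the same change is applied to both the shipped default and the stored copy, the two compare equal afterwards, which is the condition this sketch uses for "default state".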

dsnopek’s picture

Status: Needs review » Fixed

Works great in my testing! Thanks @cboyden, @Lowell, and @jfrederick! :-) Committed.

  • dsnopek committed 934d003 on 7.x-1.x
    Update Panopoly Search for Issue #2530866 by cboyden, jfrederick, Lowell...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.