Panopoly Search seems to be indexing the source code of pages, not just the visible content.
To reproduce:
- On a fresh install of the latest Panopoly with Panopoly Demo, go to the home page.
- Enter "public" in the Search field and press the Search button.
- Note the search results include all 4 demo nodes.
This seems to be happening because "public" appears in the source code as part of the URLs of the featured images.
Another way to reproduce:
- On a fresh install of the latest Panopoly with Panopoly Demo, go to the home page.
- Enter "taxonomy" in the Search field and press the Search button.
- Note there are no search results.
- Log in as an administrator.
- Add a term "bacon" in the Categories vocabulary.
- Edit the Lovely Vegetables page.
- Check the box to add the term "bacon."
- Save the page.
- Go to the home page.
- Enter "taxonomy" in the Search field and press the Search button.
- Note there is now 1 result: Lovely Vegetables.
| Comment | File | Size | Author |
|---|---|---|---|
| #19 | search_indexes_source-2530866-19.patch | 5.21 KB | jfrederick |
| #15 | panopoly-search-filter-html-2530866-15.patch | 4.45 KB | cboyden |
Comments
Comment #1
dsnopek
I think this is from #2416505: Allow indexing content from "Full page override" with Search API, which is putting the full HTML in the index. It should probably strip the HTML, replacing any tags with a single space ' '.
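To illustrate the suggestion above, here is a minimal Python sketch of the idea. It is not Search API's actual HTML filter; the `strip_tags` helper, the markup, and the image path are all made up for the example. It shows why "public" matches when raw HTML is indexed, and why each tag is replaced with a space rather than simply removed.

```python
import re

# Matches one HTML tag; good enough for a sketch, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Replace every tag with a single space, then collapse whitespace."""
    text = TAG_RE.sub(" ", markup)
    return " ".join(text.split())


# Hypothetical rendered output of a demo node (illustrative markup only).
rendered = (
    '<div class="field">'
    '<img src="/sites/default/files/styles/featured/public/veg.jpg" alt="" />'
    "<h2>Lovely Vegetables</h2><p>Fresh greens.</p></div>"
)

# Indexing the raw markup: the image URL makes "public" searchable.
raw_hit = "public" in rendered.lower()

# Indexing the stripped text: only the visible words remain.
clean = strip_tags(rendered)
clean_hit = "public" in clean.lower()

# Removing tags without inserting a space would fuse adjacent words.
fused = TAG_RE.sub("", rendered)

print(raw_hit, clean_hit)
print(clean)
print(fused)
```

Substituting a space matters because adjacent elements such as `</h2><p>` would otherwise glue "Vegetables" and "Fresh" into one word.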
Comment #2
cboyden commented
Search API provides an HTML filter that should work for this purpose. On the database index settings for the latest Panopoly, here are the options of fields where it can be applied:
As of now, only the Title field is filtered. In order to fix this, the feature also has to filter Entity HTML output and Panelizer "Full page override" HTML output.
While we're at it, we should probably also turn on the Ignore Case filter for the same fields.
Are there other fields that should be added to the filter list?
Comment #3
cboyden commented
Here is a patch that strips HTML and ignores case on the Panelizer "Full page override" HTML output and the Entity HTML output. It works in my testing.
Comment #4
cboyden commented
This will need an update hook. It doesn't appear to need a feature revert, just a cache clear and queuing things to be re-indexed. I've been clearing all caches, but is there a subset that can be cleared instead?
Comment #5
dsnopek
Adding this issue to the sprint board in case someone wants to hack on it at the sprint today. :-)
The hook_update_N() should definitely target the necessary search_api caches directly - clearing all caches in an update function can be dangerous if another module is adding a new cache table in its update. However, I'm not sure exactly what those caches are or which functions clear them. We'll need to dig into the search_api code!
Comment #6
Lowell commented
Verified patch #3 works as expected, but only after clearing all caches and then re-indexing the Search API - Database Node Index.
Comment #7
Lowell commented
Hoping I got this right: added a hook_update_N() to register the search indexes for re-indexing.
Comment #8
cboyden commented
There was an error in the update function - it was missing a closing curly brace. I added that. I was also getting these watchdog errors during indexing:
An overlong word (more than 50 characters) was encountered while indexing, due to bad tokenizing. Please check your settings for the "Tokenizer" preprocessor to ensure that data is tokenized correctly.
So I added the Tokenizer filter to the two new fields as well.
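For anyone wondering why the Tokenizer matters here, a rough Python sketch follows. The 50-character limit comes from the watchdog message above; the splitting rule, the helper names, and the sample path are assumptions for illustration and are not Search API's actual tokenizer behavior.

```python
import re

MAX_WORD_LEN = 50  # limit quoted in the watchdog message


def tokenize(text):
    """Split on any run of characters that is not a letter or digit.

    Assumed rule for this sketch only; the real Tokenizer processor
    has its own configurable character classes.
    """
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def overlong(words):
    """Return the words that exceed the length limit."""
    return [w for w in words if len(w) > MAX_WORD_LEN]


# Hypothetical value left over in a field: a long URL with no whitespace.
value = "/sites/default/files/styles/featured/public/lovely-vegetables-photo.jpg"

# Without tokenizing, the whole URL counts as a single word.
untokenized_words = value.split()

# With tokenizing, it breaks up at the slashes, hyphens, and dot.
tokenized_words = tokenize(value)

print(overlong(untokenized_words))
print(overlong(tokenized_words))
print(tokenized_words)
```

With whitespace splitting alone, the URL is one 71-character "word" and trips the limit; once it is split on punctuation, every piece is short.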
Comment #9
cboyden commented
Comment #10
dsnopek
I haven't tested the upgrade path (someone at the sprint is working on that!), but looking at the patch, it updates the database index and not the Solr index. So the same configuration changes will need to be made on the Solr index too.
Comment #11
jfrederick commented
The upgrade path for #9 works great. Tested using Drush 7, Drush 8, and update.php.
Comment #12
cboyden commented
The Solr index is already running all three filters (ignore case, HTML filter, tokenizer) on "The main body text » Text" field. This is different from the Database index, which was only running them on the Title field. Is this intentional? Should one of them be changed to match the other? I don't have Solr set up, so I can't test the effects.
Comment #13
dsnopek
Hrm. I don't think that's intentional! I think it makes sense to run those filters on both title and body.
@jfrederick at the NYCCamp sprint is going to test with Solr on Pantheon.
Comment #14
cboyden commented
OK, if so we should also filter the summary in case it's different from the default. I'll have an updated patch later today.
Comment #15
cboyden commented
OK, here's a patch that adds all required fields to the filters on both the DB and Solr indexes. It's updated against recent changes from the sprint.
Comment #16
cboyden commented
Comment #17
jfrederick commented
After deploying #9 to Pantheon with Solr enabled, the Solr configuration included in the patch is marked as overridden in the panopoly_search module. This patch is an untested attempt to prevent that. It includes the updates from #15.
Comment #18
jfrederick commented
Comment #19
jfrederick commented
Here is an updated patch.
It incorporates the Features updates in #15.
In addition, if a search index configuration is stored in the database, it also updates the appropriate configuration in the database. That way, the database and the Feature match, keeping the Feature in its default state.
Comment #20
dsnopek
Works great in my testing! Thanks @cboyden, @Lowell, and @jfrederick! :-) Committed.