Search indexes source code [#2530866]

Panopoly Search seems to be indexing the source code of pages, not just the visible content.

To reproduce:

On a fresh install of the latest Panopoly with Panopoly Demo, go to the home page.
Enter "public" in the Search field and press the Search button.
Note the search results include all 4 demo nodes.

This seems to be happening because "public" appears in the source code as part of the URLs of the featured images.

Another way to reproduce:

On a fresh install of the latest Panopoly with Panopoly Demo, go to the home page.
Enter "taxonomy" in the Search field and press the Search button.
Note there are no search results.
Log in as an administrator.
Add a term "bacon" in the Categories vocabulary.
Edit the Lovely Vegetables page.
Check the box to add the term "bacon."
Save the page.
Go to the home page.
Enter "taxonomy" in the Search field and press the Search button.
Note there is now 1 result: Lovely Vegetables.

Comment	File	Size	Author
#19	search_indexes_source-2530866-19.patch	5.21 KB	jfrederick
#17	panopoly-search-filter-html-2530866-16.patch	4.94 KB	jfrederick
#15	panopoly-search-filter-html-2530866-15.patch	4.45 KB	cboyden
#9	panopoly-search-filter-html-2530866-8.patch	2.33 KB	cboyden
#8	interdiff-7-8.txt	936 bytes	cboyden
#8	panopoly_search-search-reindex-2530866-7.patch	1.87 KB	cboyden
#7	interdiff-2530866-3-7.txt	833 bytes	Lowell
#7	panopoly_search-search-reindex-2530866-7.patch	1.87 KB	Lowell
#3	panopoly-search-filter-html-2530866-3.patch	1.06 KB	cboyden

Comments

Comment #1

dsnopek commented 10 July 2015 at 17:51

I think this is from #2416505: Allow indexing content from "Full page override" with Search API, which is putting the full HTML in the index. It should probably strip the HTML, replacing any tags with a single space ' '.
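To illustrate the suggestion above, here is a minimal Python sketch of replacing tags with a single space before indexing. It is not Search API's code (that module is PHP); the function name `strip_html`, the regular expression, and the sample markup with its image URL are all assumptions made for the example.

```python
import re
from html import unescape

# Simplified tag pattern (assumption): real HTML with comments or ">" inside
# attribute values would need a proper parser.
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(markup: str) -> str:
    """Replace each tag with one space, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", markup)
    text = unescape(text)
    return " ".join(text.split())

# Hypothetical rendered output for a node with a featured image.
raw = (
    '<div class="field">'
    '<img src="/sites/default/files/styles/featured/public/veg.jpg" alt="">'
    "Lovely<b>Vegetables</b></div>"
)
indexed = strip_html(raw)

print("public" in raw)      # the raw markup contains the URL fragment
print("public" in indexed)  # the stripped text does not
print(indexed)
```

Replacing each tag with a space, and not with an empty string, keeps adjacent words such as "Lovely" and "Vegetables" from being fused into one token.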

Comment #2

cboyden commented 10 July 2015 at 22:33

Search API provides an HTML filter that should work for this purpose. In the database index settings for the latest Panopoly, these are the fields it can be applied to:

Node ID
Content type
Title
URL
Status
Date created
Author
Item language
URI
Panelizer "Full page override" HTML output
Panelizer "Full page override" page title
Entity HTML output
Node access information
The main body text » Text
The main body text » Summary

As of now, only the Title field is filtered. In order to fix this, the feature also has to filter Entity HTML output and Panelizer "Full page override" HTML output.

While we're at it, we should probably also turn on the Ignore Case filter for the same fields.

Are there other fields that should be added to the filter list?
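The point that filters only affect the fields they are enabled for can be sketched as follows. This is a rough Python illustration, not the module's implementation: the field keys `title` and `rendered_html`, the helper names, and the sample markup are hypothetical.

```python
import re

def html_filter(value: str) -> str:
    # Simplified: swap each tag for a space, then collapse whitespace.
    return " ".join(re.sub(r"<[^>]+>", " ", value).split())

def ignore_case(value: str) -> str:
    return value.lower()

def preprocess(item: dict, filtered_fields: set) -> dict:
    """Apply both filters, but only to the fields they are enabled for."""
    return {
        name: ignore_case(html_filter(value)) if name in filtered_fields else value
        for name, value in item.items()
    }

# Hypothetical item: field keys and markup are invented for illustration.
item = {
    "title": "Lovely <em>Vegetables</em>",
    "rendered_html": '<a href="/taxonomy/term/1">Bacon</a>',
}

before = preprocess(item, {"title"})                  # only Title filtered
after = preprocess(item, {"title", "rendered_html"})  # HTML output filtered too

print(before["rendered_html"])
print(after["rendered_html"])
```

With only the title filtered, the markup in the HTML output field reaches the index untouched, so words from attributes and URLs remain searchable.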

Comment #3

cboyden commented 10 July 2015 at 22:48

Status:

Active

» Needs review

Status	File	Size
new	panopoly-search-filter-html-2530866-3.patch	1.06 KB

Here is a patch that strips HTML and ignores case on the Panelizer "Full page override" HTML output and the Entity HTML output. It works in my testing.

Comment #4

cboyden commented 10 July 2015 at 23:16

This will need an update hook. It doesn't appear to need a feature revert, just a cache clear and queuing things to be re-indexed. I've been clearing all caches, but is there a subset that can be cleared instead?

Comment #5

dsnopek commented 11 July 2015 at 14:37

Issue tags:

+sprint

Adding this issue to the sprint board in case someone wants to hack on it at the sprint today. :-)

The hook_update_N() should definitely target the necessary search_api caches directly - clearing all caches in an update function can be dangerous if another module is adding a new cache table in their update. However, I'm not sure exactly what those caches are and the functions to clear them. We'll need to dig into the search_api code!

Comment #6

Lowell commented 11 July 2015 at 20:42

Status:

Needs review

» Needs work

Verified that patch #3 works as expected, but only after clearing all caches and then re-indexing the Search API - Database Node Index.

Comment #7

Lowell commented 14 July 2015 at 14:57

Status:

Needs work

» Needs review

Status	File	Size
new	panopoly_search-search-reindex-2530866-7.patch	1.87 KB
new	interdiff-2530866-3-7.txt	833 bytes

Hoping I got this right. Added hook_update_N() to register the search indexes for re-indexing.

Comment #8

cboyden commented 16 July 2015 at 04:08

Status	File	Size
new	panopoly_search-search-reindex-2530866-7.patch	1.87 KB
new	interdiff-7-8.txt	936 bytes

3 files were hidden/shown/deleted

Status	File	Size
hidden	panopoly-search-filter-html-2530866-3.patch	1.06 KB
hidden	panopoly_search-search-reindex-2530866-7.patch	1.87 KB
hidden	interdiff-2530866-3-7.txt	833 bytes

There was an error in the update function - it was missing a closing curly brace. I added that. I was also getting these watchdog errors during indexing:
An overlong word (more than 50 characters) was encountered while indexing, due to bad tokenizing. Please check your settings for the "Tokenizer" preprocessor to ensure that data is tokenized correctly.
So I added the Tokenizer filter to the two new fields as well.
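The warning is about single words longer than 50 characters. The Python sketch below shows one way such a word can arise when text is split on whitespace only, and how splitting on non-alphanumeric characters avoids it. The splitting rule is an assumption for illustration, not the Tokenizer processor's actual configuration, and the sample string is invented.

```python
import re

MAX_WORD_LENGTH = 50  # limit quoted in the watchdog message

def whitespace_split(text: str) -> list:
    return text.split()

def tokenizer_split(text: str) -> list:
    # Assumption: any run of non-alphanumeric characters separates words.
    return [word for word in re.split(r"[^0-9A-Za-z]+", text) if word]

def overlong_words(text: str, splitter) -> list:
    """Return the words that exceed the length limit under a given splitter."""
    return [word for word in splitter(text) if len(word) > MAX_WORD_LENGTH]

# Invented sample containing a long path with no spaces in it.
sample = (
    "See sites/default/files/styles/panopoly_image_featured/public/"
    "lovely-vegetables.jpg for details"
)

print(overlong_words(sample, whitespace_split))
print(overlong_words(sample, tokenizer_split))
```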

Comment #9

cboyden commented 16 July 2015 at 04:08

Status	File	Size
new	panopoly-search-filter-html-2530866-8.patch	2.33 KB

1 file was hidden/shown/deleted

Status	File	Size
hidden	panopoly_search-search-reindex-2530866-7.patch	1.87 KB

Comment #10

dsnopek commented 16 July 2015 at 14:26

Status:

Needs review

» Needs work

I haven't tested the upgrade path (someone at the sprint is working on that!), but looking at the patch, it updates the database index and not the Solr index. So the same configuration changes will need to be made on the Solr index too.

Comment #11

jfrederick commented 16 July 2015 at 15:32

The upgrade path for #9 works great. Tested using Drush 7, Drush 8, and update.php.

Comment #12

cboyden commented 16 July 2015 at 16:45

The Solr index is already running all three filters (ignore case, HTML filter, tokenizer) on "The main body text » Text" field. This is different from the Database index, which was only running them on the Title field. Is this intentional? Should one of them be changed to match the other? I don't have Solr set up, so I can't test the effects.

Comment #13

dsnopek commented 16 July 2015 at 16:49

Hrm. I don't think that's intentional! I think it makes sense to run those filters on both title and body.

@jfrederick at the NYCCamp sprint is going to test with Solr on Pantheon.

Comment #14

cboyden commented 16 July 2015 at 17:36

OK, if so we should also filter the summary in case it's different from the default. I'll have an updated patch later today.

Comment #15

cboyden commented 16 July 2015 at 21:18

Status	File	Size
new	panopoly-search-filter-html-2530866-15.patch	4.45 KB

2 files were hidden/shown/deleted

Status	File	Size
hidden	interdiff-7-8.txt	936 bytes
hidden	panopoly-search-filter-html-2530866-8.patch	2.33 KB

OK, here's a patch which adds all required fields to the filters on both DB and Solr indexes. It's updated against recent changes from the sprint.

Comment #16

cboyden commented 16 July 2015 at 21:18

Status:

Needs work

» Needs review

Comment #17

jfrederick commented 16 July 2015 at 21:45

Status	File	Size
new	panopoly-search-filter-html-2530866-16.patch	4.94 KB

After deploying #9 to Pantheon with Solr enabled, the Solr configuration included in the patch is marked as overridden in the panopoly_search module. This patch is an untested attempt to prevent that. It includes the updates from #15.

Comment #18

jfrederick commented 17 July 2015 at 02:30

1 file was hidden/shown/deleted

Status	File	Size
hidden	panopoly-search-filter-html-2530866-16.patch	4.94 KB

Comment #19

jfrederick commented 17 July 2015 at 03:24

Status	File	Size
new	search_indexes_source-2530866-19.patch	5.21 KB

Here is an updated patch.

It incorporates the Features updates in #15.

In addition, if a search index configuration is stored in the database, it also updates the appropriate configuration in the database. That way, the database and the Feature match, keeping the Feature in its default state.

Comment #20

dsnopek commented 17 July 2015 at 16:28

Status:

Needs review

» Fixed

Works great in my testing! Thanks @cboyden, @Lowell, and @jfrederick! :-) Committed.

Comment #21

17 July 2015 at 16:28

dsnopek committed 934d003 on 7.x-1.x

Update Panopoly Search for Issue #2530866 by cboyden, jfrederick, Lowell...

Comment #22

31 July 2015 at 16:34

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
