Excerpt and field highlighting are broken [#2719573]

Comment	File	Size	Author
#17	excerpt_highlight-2719573-17.patch	945 bytes	garnett2125
#12	2719573_excerpt_highlight.patch	18.5 KB	mkalkbrenner
#7	rendered_html_output-2719573-7.patch	761 bytes	fran seva
#2	Screen Shot 2016-05-09 at 13.03.16.png	36.44 KB	fran seva

Comment #1

5 May 2016 at 15:44

arknoll created an issue. See original summary.

Comment #2

fran seva commented 9 May 2016 at 18:10

Issue summary:

View changes

Status	File	Size
new	Screen Shot 2016-05-09 at 13.03.16.png	36.44 KB

Hi @arknoll, I've been working on this issue and I'm not sure what the expected result is.
I tried to reproduce the error by following these steps:

1. Create a Solr server (I tried with and without the "Return an excerpt for all results" option).
2. Create an index.
3. Add the body field to be indexed.
4. Add a preprocessor to the index (with the default configuration).
5. Create a view to display indexed content, with an exposed full-text search filter and displaying the Excerpt field.
6. Create dummy content, alter the body to include some links, and set the format to Full HTML (then check that the HTML was indexed).
7. Go to the page and search for a word that is part of the link text.

What I got was:
[Image: 2719573-excerpt]

My question is: should the excerpt be displayed as Full HTML?

Comment #3

fran seva commented 9 May 2016 at 18:13

Issue summary:

View changes

Comment #4

arknoll commented 11 May 2016 at 12:30

@fran, you have to change one step in your config to reproduce:

1. Create a Solr server: I tried with and without "Return an excerpt for all results" option
2. Create an index
3. ~~Add body field to be indexed~~
3. Select to index the Rendered HTML output
4. Add a preprocessor to the index (with default configuration)
5. Create a view to display indexed content with a fullText search exposed filter and displaying the Excerpt field.
6. Create dummy content and alter the body with some links and setting the format to full html (then check the HTML was indexed)
7. Go to the page and search by a word that is part of the link text

This functionality is key for pages that are controlled by Panelizer. The full rendered HTML is really the only way to index content for those pages (although the problem is reproducible with a basic content type as well).

Comment #5

fran seva commented 11 May 2016 at 16:21

Thanks @arknoll, I'm able to reproduce the error.

Comment #6

fran seva commented 11 May 2016 at 21:20

Assigned:	Unassigned	» fran seva

Comment #7

fran seva commented 12 May 2016 at 04:20

Assigned:	fran seva	» Unassigned
Status:	Active	» Needs review

Status	File	Size
new	rendered_html_output-2719573-7.patch	761 bytes

Hi, after reviewing the code, @plopesc and I found that it was trying to access an object instead of an array:
$response['highlighting'][$solr_id]

To make the excerpt work, we have to:

1. Apply the patch.
2. On the Solr server, activate the highlight option and the excerpt for all results.
3. Re-index the content.
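
For readers unfamiliar with the response layout, here is a minimal Python sketch of the array-style lookup that the expression above performs. The module itself is PHP; the sample response, the document ID and the field name `spell` are hand-written assumptions for illustration, and `snippets_for` is a hypothetical helper, not a function from the patch. What it relies on is that Solr's `highlighting` section maps each document ID to a map of field names to lists of snippets.

```python
import json

# Hypothetical, hand-written sample of a Solr JSON response. The document ID
# and the field name "spell" are made up for illustration only.
RAW = """
{
  "response": {"numFound": 1, "docs": [{"id": "index_1-entity:node/1:en"}]},
  "highlighting": {
    "index_1-entity:node/1:en": {
      "spell": ["a <strong>link</strong> text", "another <strong>link</strong>"]
    }
  }
}
"""


def snippets_for(response, solr_id):
    """Collect every highlighted snippet for one document, across all fields.

    Uses nested dictionary (array-style) access and returns an empty list
    when the document has no highlighting entry.
    """
    per_field = response.get("highlighting", {}).get(solr_id, {})
    return [snippet for field in per_field.values() for snippet in field]


response = json.loads(RAW)
print(snippets_for(response, "index_1-entity:node/1:en"))
```

Looking up an ID that has no entry simply yields an empty list, so a caller can fall back to an unhighlighted excerpt.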

Comment #8

mkalkbrenner commented 12 May 2016 at 07:22

Status:	Needs review	» Needs work
Issue tags:		+Needs tests

Looks good! Thank you.

But I think we should add a test for it.

And I'm not sure if "rendered HTML" == "excerpt" ;-)

Comment #9

mkalkbrenner commented 20 May 2016 at 17:37

OK, leveraging the "spell" field here is just a temporary workaround, because the content of "spell" is filtered by stop words and length filters, has duplicates removed, and is converted to lower case:

    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LengthFilterFactory" min="4" max="20" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldType>
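
To make the effect of this analyzer chain concrete, here is a rough Python approximation of what it does to the indexed tokens. It is a sketch under stated assumptions: the stop word set is a made-up stand-in for `stopwords.txt`, whitespace splitting plus punctuation stripping is a crude substitute for `StandardTokenizerFactory`, and `text_spell_tokens` is a hypothetical name.

```python
# Hypothetical stand-in for stopwords.txt; the real list ships with the Solr config.
STOPWORDS = {"the", "a", "an", "and", "of", "this", "with"}


def text_spell_tokens(text, stopwords=STOPWORDS, min_len=4, max_len=20):
    """Approximate the tokens the textSpell field type would index."""
    tokens = []
    for raw in text.split():
        # Crude stand-in for StandardTokenizerFactory: strip edge punctuation.
        word = raw.strip(".,;:!?\"'()")
        # StopFilterFactory with ignoreCase="true".
        if word.lower() in stopwords:
            continue
        # LengthFilterFactory min="4" max="20".
        if not (min_len <= len(word) <= max_len):
            continue
        # LowerCaseFilterFactory.
        tokens.append(word.lower())
    # RemoveDuplicatesTokenFilterFactory only drops a token that has the same
    # text AND the same position as the previous one. This sketch emits one
    # token per position, so there is nothing for it to remove.
    return tokens


print(text_spell_tokens("The Solr excerpt is a Highlighted excerpt."))
# → ['solr', 'excerpt', 'highlighted', 'excerpt']
```

Short words such as "is" disappear through the length filter even though they are not stop words, which is one reason the indexed terms differ noticeably from the original text.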

As proposed in #2195465: Returning snippets for partial matches doesn't work, we should use something like "rendered_item" but without HTML tags.
From my point of view, we have to port the concept of the apachesolr 7.x module: store a plain-text (tags stripped) version of the rendered item in a dedicated field (previously "content") and use that one to generate highlighted snippets.

But before we start implementing it we need to decide how to fix #2718209: RenderedItem uses wrong DataType, leads to various issues with Solr backends.

Comment #10

mkalkbrenner commented 25 May 2016 at 08:38

OK, I was confused. Of course, the filters don't apply if the stored value is returned.

Nevertheless, the spell field is not that suitable. Because all full-text fields are copied to spell, it might return snippets coming from hidden fields.
I'm still convinced that we must use a field that only contains the content of the rendered entity. In other words, we must not show snippets that the user no longer sees when they jump to that content.

Comment #11

mkalkbrenner commented 27 May 2016 at 16:23

Title:	Rendered HTML output not showing up in excerpt	» Excerpt and field highlighting are broken
Assigned:	Unassigned	» mkalkbrenner
Priority:	Normal	» Major

Comment #12

mkalkbrenner commented 27 May 2016 at 16:40

Status:	Needs work	» Needs review

Status	File	Size
new	2719573_excerpt_highlight.patch	18.5 KB

OMG, what a mess. After discussing with drunken_monkey what "excerpt" and "highlight" mean in Search API, it turned out that both features are broken.

"Excerpt" is the corresponding feature to apachesolr 7.x highlighted search result snippets. This excerpt consists of multiple snippets of a given size.
"Highlight" means replacing single field values by corresponding highlighted values (without any snippets).

The broken implementation tries to provide both features with the one and only Solr highlighter. I think I fixed it and also wrote tests for it. I also added a simple configuration for most of the parameters of the Solr standard highlighter.
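
As an illustration of the distinction, the two features can be expressed as two different sets of parameters for Solr's standard highlighter. This is only a sketch and not necessarily what the patch does: the function names are hypothetical, and the defaults for `snippets` and `fragsize` and the `<strong>` markers are arbitrary assumptions. The parameter names themselves (`hl`, `hl.fl`, `hl.snippets`, `hl.fragsize`, `hl.simple.pre`, `hl.simple.post`) are standard Solr highlighting parameters, and `hl.fragsize=0` asks for the whole field value without fragmenting.

```python
def excerpt_params(fields, snippets=3, fragsize=70):
    """Excerpt: several short fragments of a given size per field.

    The default values for snippets and fragsize are arbitrary examples.
    """
    return {
        "hl": "true",
        "hl.fl": ",".join(fields),
        "hl.snippets": snippets,
        "hl.fragsize": fragsize,
        "hl.simple.pre": "<strong>",
        "hl.simple.post": "</strong>",
    }


def field_highlight_params(fields):
    """Highlight: return each whole field value with matches marked up."""
    return {
        "hl": "true",
        "hl.fl": ",".join(fields),
        "hl.snippets": 1,
        # A fragment size of 0 disables fragmenting, so the complete field
        # value comes back and can replace the original value.
        "hl.fragsize": 0,
        "hl.simple.pre": "<strong>",
        "hl.simple.post": "</strong>",
    }


print(excerpt_params(["spell"]))
print(field_highlight_params(["title", "body"]))
```

Because the two feature sets want different fragment settings, serving both from a single highlighter request needs care, for example through Solr's per-field overrides of the form `f.<fieldname>.hl.fragsize`.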

I first wanted to fix the current issue. Now we need follow-up issues:

the spell field is not the optimal one
the single vs. multi value issue causes duplicate excerpt snippets
create a UI for advanced highlighter parameters

Comment #13

mkalkbrenner commented 27 May 2016 at 21:47

Tests pass:
https://travis-ci.org/mkalkbrenner/search_api_solr/builds/133428852

It would be good if someone could verify that the upgrade path works.

Comment #14

28 May 2016 at 22:32

mkalkbrenner committed 4952e63 on 8.x-1.x

Issue #2719573 by fran seva, mkalkbrenner: Excerpt and field...

Comment #15

mkalkbrenner commented 28 May 2016 at 22:39

Status:	Needs review	» Fixed

Follow-ups:

Comment #16

11 June 2016 at 22:44

Status:	Fixed	» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Comment #17

garnett2125 commented 28 April 2017 at 13:48

Status	File	Size
new	excerpt_highlight-2719573-17.patch	945 bytes

The code does check whether highlight_data is set to TRUE, but it doesn't change $output at all.

Solr was highlighting my keyword, but the highlighted text wasn't returned from the getExcerpt function.

This patch fixed the issue for me.

Comment #18

mkalkbrenner