1. Select to index the Rendered HTML output
2. Enable Solr to produce an excerpt
3. Have a view which displays the excerpt
4. Search for a term that appears in the full HTML

Expected result:

Result returned and excerpt of full html page displayed

Actual result:

Result returned but no excerpt of full html page displayed

Comments

arknoll created an issue. See original summary.

fran seva’s picture

FileSize: 36.44 KB

Hi @arknoll, I've been working on this issue and I'm not sure what the expected result is.
I tried to reproduce the error by following these steps:

  1. Create a Solr server (I tried with and without the "Return an excerpt for all results" option)
  2. Create an index
  3. Add the body field to be indexed
  4. Add a preprocessor to the index (with default configuration)
  5. Create a view to display indexed content, with an exposed fulltext search filter and the Excerpt field displayed
  6. Create dummy content, add some links to the body and set the text format to Full HTML (then check that the HTML was indexed)
  7. Go to the page and search for a word that is part of the link text

What I got was:
[Screenshot: 2719573-excerpt]

My question is, should the excerpt be displayed as full html?

arknoll’s picture

@fran you have to change one step in your config to reproduce:

1. Create a Solr server (with or without the "Return an excerpt for all results" option)
2. Create an index
3. Add the body field to be indexed
4. Select to index the Rendered HTML output
5. Add a preprocessor to the index (with default configuration)
6. Create a view to display indexed content, with an exposed fulltext search filter and the Excerpt field displayed
7. Create dummy content, add some links to the body and set the text format to Full HTML (then check that the HTML was indexed)
8. Go to the page and search for a word that is part of the link text

This functionality is key for pages that are controlled by Panelizer. The full rendered HTML is really the only way to index content for those pages (although the problem is reproducible with a basic content type as well).

fran seva’s picture

Thanks @arknoll, I'm able to reproduce the error.

fran seva’s picture

Assigned: Unassigned » fran seva
fran seva’s picture

Assigned: fran seva » Unassigned
Status: Active » Needs review
FileSize: 761 bytes

Hi, after reviewing the code, @plopesc and I found that it was trying to access an object instead of an array:
$response['highlighting'][$solr_id]

To make the excerpt work, we have to:

  1. Apply the patch
  2. In the Solr server settings, enable the highlight option and the excerpt for all results
  3. Re-index the content
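
For orientation, here is a rough sketch of the data that `$response['highlighting'][$solr_id]` refers to. It is not the patch and it is written in Python rather than the module's PHP; it only shows the shape of a decoded Solr JSON response, where `highlighting` is keyed by document ID and each entry maps a field name to a list of snippets. The document ID `doc-1`, the field name `content` and the helper `snippets_for` are made up for illustration.

```python
import json

# Hypothetical, hand-written response in the shape Solr returns with wt=json.
SAMPLE_RESPONSE = json.loads("""
{
  "response": {"numFound": 1, "docs": [{"id": "doc-1"}]},
  "highlighting": {
    "doc-1": {"content": ["text with a <em>keyword</em> in it"]}
  }
}
""")


def snippets_for(response, solr_id):
    """Return the field -> snippets mapping for one document, or {} if absent."""
    # The decoded response is a nested mapping, so it is read with keys,
    # not with attribute access.
    return response.get("highlighting", {}).get(solr_id, {})


found = snippets_for(SAMPLE_RESPONSE, "doc-1")
missing = snippets_for(SAMPLE_RESPONSE, "doc-2")
```

Returning an empty mapping for an unknown ID keeps the caller free of existence checks; whether the module handles that case the same way is not stated in this thread.
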
mkalkbrenner’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

Looks good! Thank you.

But I think we should add a test for it.

And I'm not sure if "rendered HTML" == "excerpt" ;-)

mkalkbrenner’s picture

OK, leveraging the "spell" field here is just a temporary workaround, because the content of "spell" is filtered by stop words and length filters, has duplicates removed, and is converted to lower case:

    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LengthFilterFactory" min="4" max="20" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldType>

As proposed in #2195465: "Returning snippets for partial matches doesn't work", we should use something like "rendered_item" but without HTML tags.
From my point of view, we have to port the concept of the apachesolr 7.x module: store a plain-text version of the rendered item (with tags stripped) in a dedicated field (previously "content") and use that field to generate highlighted snippets.
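
To make the "stripped tags version" idea concrete, here is a minimal sketch in Python (not the module's PHP and not the apachesolr implementation). It assumes the rendered item arrives as an HTML string; the names `strip_tags` and `_TextExtractor` are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join text nodes with a space so adjacent block elements do not fuse,
        # then collapse runs of whitespace. Caveat: this also puts a space at
        # inline tag boundaries, and script/style contents are not filtered.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(rendered_html):
    """Return a plain-text version of rendered HTML for a dedicated index field."""
    parser = _TextExtractor()
    parser.feed(rendered_html)
    parser.close()
    return parser.text()


plain = strip_tags('<p>Read the <a href="/docs">full docs</a> now</p>')
```

A snippet generated from such a field contains link text like "full docs" but no markup, which is what the excerpt needs.
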

But before we start implementing it we need to decide how to fix #2718209: RenderedItem uses wrong DataType, leads to various issues with Solr backends.

mkalkbrenner’s picture

OK, I was confused. For sure, the filters don't apply if the stored value is returned.

Nevertheless, the spell field is not that suitable. Because all fulltext fields are copied to spell, it might return snippets coming from hidden fields.
I'm still convinced that we must use a field that only contains the content of the rendered entity. In other words, we must not show snippets that the user can no longer see when they jump to that content.

mkalkbrenner’s picture

Title: Rendered HTML output not showing up in excerpt » Excerpt and field highlighting are broken
Assigned: Unassigned » mkalkbrenner
Priority: Normal » Major
mkalkbrenner’s picture

Status: Needs work » Needs review
FileSize: 18.5 KB

OMG what a mess. After discussing with drunken_monkey what "excerpt" and "highlight" mean in Search API, it turned out that both features are broken.

"Excerpt" is the feature corresponding to the highlighted search result snippets in apachesolr 7.x. This excerpt consists of multiple snippets of a given size.
"Highlight" means replacing single field values with the corresponding highlighted values (without any snippets).
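
The difference between the two features can be sketched roughly as follows. This is an illustration in Python under my own assumptions, not the committed code: the function names, the `" ... "` separator and the field names are made up, and the inputs are plain dictionaries rather than Search API result items.

```python
def build_excerpt(snippets_by_field, separator=" ... "):
    """Excerpt: join several short highlighted snippets into one string."""
    snippets = []
    for field_snippets in snippets_by_field.values():
        snippets.extend(field_snippets)
    return separator.join(snippets)


def highlight_fields(field_values, highlighted_values):
    """Highlight: swap whole field values for their highlighted versions."""
    result = dict(field_values)  # leave the caller's dictionary untouched
    for field, value in highlighted_values.items():
        if field in result:
            result[field] = value
    return result


excerpt = build_excerpt({
    "body": ["a <strong>foo</strong> b", "c <strong>foo</strong> d"],
})
fields = highlight_fields(
    {"title": "All about foo", "author": "someone"},
    {"title": "All about <strong>foo</strong>"},
)
```

The excerpt is a new string assembled from fragments, while highlighting keeps the result's field structure and only changes values that contain a match.
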

The broken implementation tries to provide both features with the one and only Solr highlighter. I think I fixed it and also wrote tests for it. I also added a simple configuration for most of the parameters of the Solr standard highlighter.
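
For readers unfamiliar with the Solr standard highlighter, here is a small sketch of the kind of request parameters such a configuration ends up setting. It is written in Python for illustration only. The parameter names (`hl`, `hl.fl`, `hl.snippets`, `hl.fragsize`, `hl.simple.pre`, `hl.simple.post`) are standard Solr ones, but the helper name, the default values and the choice of the "spell" field here are assumptions, not the module's actual settings.

```python
from urllib.parse import urlencode


def build_highlight_params(keys, field="spell", snippets=3, fragsize=70,
                           prefix="<strong>", suffix="</strong>"):
    """Build query parameters that ask Solr for highlighted snippets."""
    return {
        "q": keys,
        "hl": "true",              # turn highlighting on
        "hl.fl": field,            # field(s) to take snippets from
        "hl.snippets": snippets,   # maximum snippets per field
        "hl.fragsize": fragsize,   # approximate snippet size in characters
        "hl.simple.pre": prefix,   # marker placed before a matched term
        "hl.simple.post": suffix,  # marker placed after a matched term
    }


params = build_highlight_params("drupal")
query_string = urlencode(params)
```

In this sketch an excerpt would use several snippets of limited size, whereas highlighting a whole field value would need a different `hl.fragsize`; check the Solr documentation for the exact semantics on your Solr version.
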

I first wanted to fix the current issue. Now we need follow-up issues:

  • the spell field is not the optimal one
  • the single vs. multi value issue causes duplicate excerpt snippets
  • create a UI for advanced highlighter parameters
mkalkbrenner’s picture

Tests pass:
https://travis-ci.org/mkalkbrenner/search_api_solr/builds/133428852

It would be good if someone could verify that the upgrade path works.

  • mkalkbrenner committed 4952e63 on 8.x-1.x
    Issue #2719573 by fran seva, mkalkbrenner: Excerpt and field...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

garnett2125’s picture

The code does check if highlight_data is set to TRUE but doesn't change the $output at all.

I had solr highlighting my keyword but it wasn't returned from the getExcerpt function.

This patch fixed the issue for me.

mkalkbrenner’s picture

Please don't comment on closed issues. Open a new one instead.

BTW, your patch doesn't seem to be related to your comment.