Observed so far with German and Danish content - text with certain encodings or characters causes Solr to die when trying to highlight the search result:

"500" Status: Internal Server Error. String index out of range: 1854 java.lang.StringIndexOutOfBoundsException: String index out of range: 1854 at java.lang.String.substring(String.java:1935) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:313) at 

Comments

pwolanin’s picture

Status: Active » Needs review
Attachment: new, 1.1 KB

Going to test this approach to see if it helps.

pwolanin’s picture

Status: Needs review » Needs work

While this avoids the 500 error, the characters look wrong.

pwolanin’s picture

Attachment: new, 1.98 KB

Here's a patch for pedantic UTF-8 matching. It does not help, however.

mikl’s picture

As per the thread on the Acquia Search beta forums, I'm affected by this bug as well. Please let me know if I can help. My Java skills are close to nonexistent, but if you need help testing, feel free to hit me up on IRC: mikl on freenode.net

pwolanin’s picture

Attachment: new, 937 bytes

This should be a work-around, ugly though it is.

pwolanin’s picture

Likely source for fixes at the Lucene level: https://issues.apache.org/jira/browse/LUCENE-1500

pwolanin’s picture

Status: Needs work » Needs review

pwolanin’s picture

My patch on the Lucene issue prevents the exception, though obviously it's a bit of a PITA to build the Lucene .jar, put it in Solr, and then build the solr.war.

JacobSingh’s picture

I looked at this a little more today till my head exploded.

I couldn't find a better solution and I don't think we will get one pre-DC.

The patch Peter put together does work to mitigate the issue.

pwolanin’s picture

Status: Needs review » Postponed

Since there is a Lucene patch that mitigates the bug, it seems like we should not commit this hack at the moment. Anyone not able to rebuild Lucene and Solr may want to try it, however.

pwolanin’s picture

Attachment: new, 1.37 KB

http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

"As you can see, if you use CharFilter, Token offsets could be incorrect
because CharFilters may convert 1 char to 2 chars or the other way
around."

https://issues.apache.org/jira/browse/SOLR-822
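
To make the quoted point concrete, here is a minimal sketch in Python (not Lucene or Solr code) of how offsets computed on character-mapped text stop fitting the original text. The `FOLD` table and every function name below are hypothetical and exist only for illustration; `strict_slice` just imitates the way Java's `String.substring` throws instead of clamping.

```python
# Hypothetical folding table, loosely modeled on an ISO Latin-1 accent mapping.
FOLD = {"ß": "ss", "æ": "ae", "ø": "o", "ü": "u"}


def fold(text):
    """Fold text; also record, per folded character, its source index."""
    folded, origin = [], []
    for i, ch in enumerate(text):
        replacement = FOLD.get(ch, ch)
        folded.append(replacement)
        origin.extend([i] * len(replacement))
    return "".join(folded), origin


def strict_slice(text, start, end):
    """Slice like Java's String.substring: raise instead of clamping."""
    if not 0 <= start <= end <= len(text):
        raise IndexError(f"String index out of range: {end}")
    return text[start:end]


def token_offsets(folded, token):
    """Offsets of the first occurrence of token in the folded text."""
    start = folded.index(token)
    return start, start + len(token)


def to_original(origin, start, end):
    """Translate folded-text offsets back to original-text offsets."""
    return origin[start], origin[end - 1] + 1


original = "Die Straße ist groß"
folded, origin = fold(original)

# Uncorrected offsets drift after the first expanded character.
start, end = token_offsets(folded, "ist")
drifted = strict_slice(original, start, end)

# Near the end of the text the same drift runs past the string.
start, end = token_offsets(folded, "gross")
try:
    strict_slice(original, start, end)
    overran = False
except IndexError:
    overran = True

# Mapping the offsets back through `origin` recovers the right span.
corrected = strict_slice(original, *to_original(origin, start, end))
```

The uncorrected offsets first select the wrong characters and then, near the end of the text, overrun the string, which resembles both the misplaced highlighting and the exception reported here. Whether this is the exact mechanism behind the bug in this issue is an assumption.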

pwolanin’s picture

Attachment: new, 1.79 KB

Should indicate the schema change too.

pwolanin’s picture

Status: Postponed » Reviewed & tested by the community

This change to the schema seems to correct the highlighting.

pwolanin’s picture

Status: Reviewed & tested by the community » Active

Committed to 6.x, though it's possible we only need it for the query-time part. Needs further investigation.

jody lynn’s picture

I have everything indexed under the latest 6.x but still have the highlighting offsets caused by UTF-8 characters (&#x characters). I tried the other patches above as well, also without success. (My head also exploded.)

pwolanin’s picture

@Jody - what Solr server are you using for this? You must make sure the schema and solrconfig are in sync for the isomapping functionality to work without shifting the offsets.

jody lynn’s picture

It's the Acquia hosted Solr.

jody lynn’s picture

Decimal numeric character references work fine, by the way. It's just hexadecimal character references that offset the highlighting.

pwolanin’s picture

Ah, this perhaps relates to issue #420290: document preparation runs words together. The idea is that we need to decode entities to handle these sorts of characters.
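
As a rough illustration of why decoding entities before indexing might matter, here is a small Python sketch. Python's standard `html.unescape` stands in for whatever decoding the module would actually perform, and the sample strings are made up.

```python
import html

# The same curly apostrophe written as a decimal and as a hexadecimal
# numeric character reference (sample strings are hypothetical).
decimal_form = "It&#8217;s here"
hex_form = "It&#x2019;s here"

decoded_decimal = html.unescape(decimal_form)
decoded_hex = html.unescape(hex_form)

# The raw hexadecimal reference occupies eight characters but decodes to
# one, so offsets computed on one form do not fit the other.
shift = len(hex_form) - len(decoded_hex)
```

Both forms decode to the same text, so if every reference is decoded before the text is indexed and stored, the highlighter sees one consistent string. That this is what fixes the hexadecimal case here is an assumption.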

jody lynn’s picture

Attachment: new, 7.93 KB

Attached is some of our text that gets the highlighter offset bug. I think it's mostly the single and double quotes that wreak havoc.

pwolanin’s picture

Hopefully this will be fixed when we push the most current schema to all our hosted indexes.

jody lynn’s picture

I tried to test this but now I'm not getting any results for anything that's a book page (which is all the content that had encoding issues). I only get results for other content types. Very strange.

jody lynn’s picture

Status: Active » Fixed

Ok, never mind, I guess we just had to do a whole lot of reindexing to start to see the book pages somehow. The highlighting issue is resolved for me.

genernic’s picture

I'm having the exact same issue, but without Solr: I'm only using Lucene in combination with Tika. I get a lot of errors and, in addition, the documents that do work sometimes highlight the wrong words. What kind of workaround have you found?

pwolanin’s picture

Status: Fixed » Closed (fixed)