Closed (fixed)
Project:
Apache Solr Search
Version:
6.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
24 Feb 2009 at 15:10 UTC
Updated:
12 May 2009 at 18:33 UTC
Comments
Comment #1
pwolanin commented: Going to test this approach, to see if it helps.
Comment #2
pwolanin commented: While this avoids the error 500, the characters look wrong.
Comment #3
pwolanin commented: Here's a patch for pedantic UTF-8 matching; it does not help, however.
Comment #4
mikl commented: As per the thread on the Acquia Search beta forums, I'm affected by this bug as well. Please let me know if I can help – my Java skills are somewhere around non-existent, but if you need help testing, feel free to hit me up on IRC – mikl on freenode.net.
Comment #5
pwolanin commented: This should be a workaround, ugly though it is.
Comment #6
pwolanin commented: Likely source for fixes at the Lucene level: https://issues.apache.org/jira/browse/LUCENE-1500
Comment #7
pwolanin commented
Comment #8
pwolanin commented: My patch on the Lucene issue prevents the exception, though obviously it's a bit of a PITA to build the Lucene .jar, then put it in Solr, and then build the solr.war.
Comment #9
JacobSingh commented: I looked at this a little more today till my head exploded.
I couldn't find a better solution, and I don't think we will get one pre-DC.
The patch Peter put together does work to mitigate the issue.
Comment #10
pwolanin commented: Since there is a Lucene patch that mitigates the bug, it seems like we should not commit this hack at the moment. Anyone able to rebuild Lucene and Solr may want to try it, however.
Comment #11
pwolanin commented:
http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html
https://issues.apache.org/jira/browse/SOLR-822
Comment #12
pwolanin commented: Should indicate the schema change too.
Comment #13
pwolanin commented: This change to the schema seems to correct the highlighting.
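For reference, the kind of schema change discussed here applies a character-mapping filter before tokenization, as introduced by SOLR-822's MappingCharFilterFactory. The following fragment is an illustrative sketch, not the committed module schema; the fieldType name, tokenizer, and mapping file are assumptions:

```xml
<!-- Sketch of a schema.xml analyzer using a char filter to normalize
     problem characters before tokenization. The fieldType name and
     mapping file below are illustrative assumptions. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the char filter runs ahead of the tokenizer, the same analyzer definition must be used at both index and query time, or term offsets (and thus highlighting) can disagree.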
Comment #14
pwolanin commented: Committed to 6.x, though it's possible we only need it for the query-time part. Needs further investigation.
Comment #15
jody lynn commented: I have everything indexed under the latest 6.x but still have the highlighting offsets caused by UTF-8 characters (&#x characters). I tried the other patches above as well, also without success. (My head also exploded.)
Comment #16
pwolanin commented: @Jody - what Solr server are you using for this? You must make sure the schema and solrconfig are in sync for the ISO mapping functionality to work and not give offsets.
Comment #17
jody lynn commented: It's the Acquia hosted Solr.
Comment #18
jody lynn commented: Decimal numeric character references work fine, by the way; it's just hexadecimal character references that offset highlighting.
Comment #19
pwolanin commented: Ah, this perhaps relates to this issue: #420290: document preparation runs words together. The idea being that we need to run an entity decode to handle these sorts of characters.
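As a small illustration of the entity-decoding idea (a sketch using the Python standard library, not the module's actual PHP code): a decoder like `html.unescape` handles both decimal and hexadecimal numeric character references, which would normalize the &#x-style entities before indexing so offsets line up.

```python
# Hedged sketch: decode numeric character references before indexing,
# so the indexed text and the displayed text have matching offsets.
# html.unescape handles decimal (&#8217;) and hex (&#x2019;) forms alike.
from html import unescape

raw = "It&#x2019;s a &#8220;quoted&#8221; phrase"
decoded = unescape(raw)
print(decoded)  # curly quotes replace the numeric references
```

Without this normalization, each multi-byte entity in the source text shifts the stored highlight offsets relative to the rendered text, which matches the symptom described above.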
Comment #20
jody lynn commented: Attached is some of our text that gets the highlighter offset bug. I think it's mostly the single and double quotes that wreak havoc.
Comment #21
pwolanin commented: Hopefully this will be fixed when we push the most current schema to all our hosted indexes.
Comment #22
jody lynn commented: I tried to test this, but now I'm not getting any results for anything that's a book page (which is all the content that had encoding issues). I only get results for other content types. Very strange.
Comment #23
jody lynn commented: Ok, never mind - I guess we just had to do a whole lot of reindexing to start seeing the book pages somehow. The highlighting issue is resolved for me.
Comment #24
genernic commented: I'm having the exact same issue you do, but without Solr, using only Lucene in combination with Tika. I get a lot of errors, and in addition, the documents which do work sometimes highlight the wrong words. What kind of workaround have you found?
Comment #25
pwolanin commented