character encoding issues cause Solr highlighter to fail [#382358]

Comment	File	Size	Author
#21	20090421-tw98it7xt57ch333cq94hwq8e2.jpg	34.72 KB	pwolanin
#21	20090421-m9pwguuuyst18a76xbb5a6mc8y.jpg	38.3 KB	pwolanin
#20	highlighter-text.txt	7.93 KB	jody lynn
#12	char-stream-aware-382358-12.patch	1.79 KB	pwolanin
#11	char-stream-aware-382358-11.patch	1.37 KB	pwolanin
#5	padding-382358-4.patch	937 bytes	pwolanin
#3	utf8-382358-3.patch	1.98 KB	pwolanin
#1	hl500-382358-1.patch	1.1 KB	pwolanin

Comment #1

pwolanin commented 24 February 2009 at 15:12

Status: Active » Needs review

Status	File	Size
new	hl500-382358-1.patch	1.1 KB

Going to test this approach to see if it helps.

Comment #2

pwolanin commented 24 February 2009 at 16:30

Status: Needs review » Needs work

While this avoids the error 500, the characters look wrong.

Comment #3

pwolanin commented 24 February 2009 at 19:41

Status	File	Size
new	utf8-382358-3.patch	1.98 KB

Here's a patch for pedantic UTF-8 matching. It does not help, however.

Comment #4

mikl commented 24 February 2009 at 20:19

As per the thread on the Acquia Search beta forums, I'm affected by this bug as well. Please let me know if I can help. My Java skills are close to nonexistent, but if you need help testing, feel free to hit me up on IRC: mikl on freenode.net

Comment #5

pwolanin commented 24 February 2009 at 21:16

Status	File	Size
new	padding-382358-4.patch	937 bytes

This should be a work-around, ugly though it is.

Comment #6

pwolanin commented 24 February 2009 at 21:40

Likely source for fixes at the Lucene level: https://issues.apache.org/jira/browse/LUCENE-1500

Comment #7

pwolanin commented 24 February 2009 at 23:25

Status: Needs work » Needs review

Comment #8

pwolanin commented 25 February 2009 at 01:37

My patch on the Lucene issue prevents the exception, though obviously it's a bit of a PITA to build the Lucene .jar, put it in Solr, and then build the solr.war.

Comment #9

JacobSingh commented 25 February 2009 at 11:58

I looked at this a little more today till my head exploded.

I couldn't find a better solution and I don't think we will get one pre-DC.

The patch Peter put together does work to mitigate the issue.

Comment #10

pwolanin commented 27 February 2009 at 05:04

Status: Needs review » Postponed

Since there is a Lucene patch that mitigates the bug, it seems like we should not commit this hack at the moment. Anyone not able to rebuild Lucene and Solr may want to try it, however.

Comment #11

pwolanin commented 28 February 2009 at 04:48

Status	File	Size
new	char-stream-aware-382358-11.patch	1.37 KB

http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

"As you can see, if you use CharFilter, Token offsets could be incorrect
because CharFilters may convert 1 char to 2 chars or the other way
around."

https://issues.apache.org/jira/browse/SOLR-822
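
To make the quoted offset problem concrete, here is a minimal sketch in Python. It is not Solr's or Lucene's implementation: the `MAPPING` table and the `map_chars` and `find_token` helpers are hypothetical names made up for illustration. The sketch assumes a filter that can turn one character into two, records which original index produced each output character, and compares the naive token offsets with the corrected ones.

```python
# Hypothetical character mapping, for illustration only.
# One input character may expand to two output characters.
MAPPING = {"ß": "ss", "æ": "ae", "é": "e"}


def map_chars(text):
    """Apply MAPPING and return (mapped_text, offsets).

    offsets[i] is the index in the original text of the character
    that produced mapped_text[i].
    """
    out = []
    offsets = []
    for i, ch in enumerate(text):
        repl = MAPPING.get(ch, ch)
        out.append(repl)
        offsets.extend([i] * len(repl))
    return "".join(out), offsets


def find_token(original, token):
    """Locate token in the mapped text.

    Returns (naive, corrected): naive offsets are measured in the mapped
    text, corrected offsets are translated back to the original text.
    Returns None if the token is not found.
    """
    mapped, offsets = map_chars(original)
    start = mapped.find(token)
    if start < 0:
        return None
    end = start + len(token)
    naive = (start, end)
    corrected = (offsets[start], offsets[end - 1] + 1)
    return naive, corrected


text = "Die Straße ist lang"
naive, corrected = find_token(text, "ist")
print(repr(text[naive[0]:naive[1]]))          # slice using uncorrected offsets
print(repr(text[corrected[0]:corrected[1]]))  # slice using corrected offsets
```

Slicing the original text with the naive offsets lands one character too far to the right, which is the kind of shifted highlight described in this issue. Slicing with the corrected offsets returns the matched word.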

Comment #12

pwolanin commented 28 February 2009 at 13:01

Status	File	Size
new	char-stream-aware-382358-12.patch	1.79 KB

should indicate the schema change too.

Comment #13

pwolanin commented 2 March 2009 at 04:44

Status: Postponed » Reviewed & tested by the community

This change to the schema seems to correct the highlighting.

Comment #14

pwolanin commented 2 March 2009 at 04:48

Status: Reviewed & tested by the community » Active

Committed to 6.x, though it's possible we only need it for the query-time part. Needs further investigation.

Comment #15

jody lynn commented 15 April 2009 at 20:54

I have everything indexed under the latest 6.x but still have the highlighting offsets caused by UTF-8 characters (&#x characters). I tried the other patches above as well, also without success. (My head also exploded.)

Comment #16

pwolanin commented 15 April 2009 at 21:54

@Jody - what Solr server are you using for this? You must make sure the schema and solrconfig are in sync for the isomapping functionality to work and not produce offsets.

Comment #17

jody lynn commented 16 April 2009 at 14:15

It's the Acquia hosted Solr.

Comment #18

jody lynn commented 16 April 2009 at 15:07

Decimal numeric character references work fine, by the way. It's just hexadecimal character references that offset the highlighting.

Comment #19

pwolanin commented 16 April 2009 at 15:37

Ah, this perhaps relates to this issue: #420290: document preparation runs words together. The idea is that we need to decode entities to handle these sorts of characters.
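
As a rough illustration of why decoding entities before indexing matters for offsets, here is a minimal sketch in Python using the standard library's `html.unescape`. It is not the code from #420290 or from the module; the `decode_before_indexing` helper and the sample strings are assumptions made up for this example. It only shows that hexadecimal and decimal references decode to the same character, and that positions in the raw markup differ from positions in the decoded text.

```python
import html


def decode_before_indexing(markup):
    """Decode named and numeric character references into plain characters.

    Hypothetical helper: a stand-in for whatever entity decoding the
    indexing code performs.
    """
    return html.unescape(markup)


# The same right single quotation mark, written two ways.
hex_form = "It&#x2019;s a test"
dec_form = "It&#8217;s a test"

decoded_hex = decode_before_indexing(hex_form)
decoded_dec = decode_before_indexing(dec_form)

# Both forms decode to identical text.
print(decoded_hex == decoded_dec)

# The word "test" starts at a different position in the raw markup
# than in the decoded text, so offsets computed on one do not apply
# to the other.
print(hex_form.find("test"), decoded_hex.find("test"))
```

If the text being highlighted and the text that was analyzed differ in this way, offsets computed on one will point at the wrong characters in the other.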

Comment #20

jody lynn commented 21 April 2009 at 17:37

Status	File	Size
new	highlighter-text.txt	7.93 KB

Attached is some of our text that gets the highlighter offset bug. I think it's mostly the single and double quotes that wreak havoc.

Comment #21

pwolanin commented 21 April 2009 at 18:10

Status	File	Size
new	20090421-m9pwguuuyst18a76xbb5a6mc8y.jpg	38.3 KB
new	20090421-tw98it7xt57ch333cq94hwq8e2.jpg	34.72 KB

Hopefully this will be fixed when we push the most current schema to all our hosted indexes.

Comment #22

jody lynn commented 23 April 2009 at 21:05

I tried to test this but now I'm not getting any results for anything that's a book page (which is all the content that had encoding issues). I only get results for other content types. Very strange.

Comment #23

jody lynn commented 28 April 2009 at 15:29

Status: Active » Fixed

Ok, never mind, I guess we just had to do a whole lot of reindexing to start to see the book pages somehow. The highlighting issue is resolved for me.

Comment #24

genernic commented 29 April 2009 at 08:01

I'm having the exact same issue, but without Solr: I'm only using Lucene in combination with Tika. I get a lot of errors, and the documents that do work sometimes highlight the wrong words. What kind of workaround have you found?

Comment #25

pwolanin commented 12 May 2009 at 18:33

Status: Fixed » Closed (fixed)
