document preparation runs words together [#420290]

Comment	File	Size	Author
#16	apachesolr_htmlentities.patch	3.21 KB	mkalkbrenner
#10	add-space-420290-10.patch	2.16 KB	pwolanin
#9	add-space-420290-9.patch	2.09 KB	pwolanin
#7	apachesolr_clean_text.patch	3.93 KB	mkalkbrenner
#6	apachesolr.hierarchical_facet.patch	10.11 KB	mkalkbrenner
#4	apachesolr_clean_text.patch	1.44 KB	mkalkbrenner
#1	apachesolr_clean_text.patch	1.77 KB	mkalkbrenner
	keep-words-separate.txt	1.03 KB	cspitzlay

Comment #1

mkalkbrenner commented 1 April 2009 at 10:49

Version:	6.x-1.0-beta6	» 6.x-1.x-dev
Assigned:	Unassigned	» mkalkbrenner
Status:	Active	» Needs review

Status	File	Size
new	apachesolr_clean_text.patch	1.77 KB

cspitzlay is right. And I found an additional problem.

For example, the body will be indexed in Solr using a whitespace tokenizer. Two strings separated by &nbsp; therefore won't become two tokens, and neither can be found when searching, because the token looks like this: Abc&nbsp;Def

I attached an improved version of this patch which also deals with html entities and removes some redundancy regarding stripping of control chars.
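
The tokenization problem described above can be illustrated with a small sketch. The function demo_whitespace_tokens() is a hypothetical stand-in for a whitespace tokenizer, not Solr's implementation and not part of the module; it splits on ASCII whitespace only.

```php
<?php
// Hypothetical stand-in for a whitespace tokenizer (not Solr's code):
// splits on ASCII space, tab, CR and LF only.
function demo_whitespace_tokens($text) {
  return preg_split('/[ \t\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// The entity contains no whitespace, so both words end up in one token.
$unsplit = demo_whitespace_tokens('Abc&nbsp;Def');

// Replacing the entity with a plain space before indexing yields two tokens.
$split = demo_whitespace_tokens(str_replace('&nbsp;', ' ', 'Abc&nbsp;Def'));

print_r($unsplit); // one element: "Abc&nbsp;Def"
print_r($split);   // two elements: "Abc" and "Def"
```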

Comment #2

pwolanin commented 1 April 2009 at 13:26

I'm not sure this order of operations is right - we use $text later so we need to strip control characters.

Comment #3

damien tournoud commented 1 April 2009 at 14:03

Any reason not to move to solr.HTMLStripWhitespaceTokenizerFactory?

Comment #4

mkalkbrenner commented 1 April 2009 at 14:22

Status	File	Size
new	apachesolr_clean_text.patch	1.44 KB

@pwolanin: You're right. The HTML tags that get indexed individually should not contain control characters either. I adjusted the patch.

BTW The document that gets posted to solr should not contain any control characters at all. Maybe there's a better place to strip these before sending the document instead of doing so on some fields.
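
A minimal sketch of stripping control characters in one place before a document is sent. The function name demo_strip_ctl_chars() and the exact character range are assumptions made for illustration; they are not taken from the patch or from the Solr PHP client. The range covers the ASCII control characters that XML 1.0 does not allow.

```php
<?php
// Sketch only: remove ASCII control characters that XML 1.0 disallows,
// keeping tab (\x09), line feed (\x0A) and carriage return (\x0D).
function demo_strip_ctl_chars($text) {
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

$dirty = "Title" . "\x00" . " with" . "\x0B" . " control" . "\x1F" . " chars\n";
$clean = demo_strip_ctl_chars($dirty);
// $clean is "Title with control chars\n"; the trailing line feed is kept.
```

Applying such a function to every field value in the code that serializes the document would avoid having to remember it for individual fields.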

Comment #5

pwolanin commented 1 April 2009 at 16:24

@DamZ - answered that in IRC (it breaks the ISO char mapping and also does not strip tags from the stored content, only from the tokens, so search snippets could contain HTML, including unmatched tags).

@mkalkbrenner - likely we could propose this as an upstream enhancement in the Solr PHP class.

Comment #6

mkalkbrenner commented 2 April 2009 at 17:16

Status	File	Size
new	apachesolr.hierarchical_facet.patch	10.11 KB

please ignore this post

Comment #7

mkalkbrenner commented 2 April 2009 at 17:17

Status	File	Size
new	apachesolr_clean_text.patch	3.93 KB

Sorry, I attached the wrong file to the previous post :-(

I created a patch to move stripping of control chars to Service.php:
http://code.google.com/p/solr-php-client/issues/detail?id=5

Attached you'll find an adjusted version of the clean text patch that relies on the patched version of Service.php.

Comment #8

pwolanin commented 2 April 2009 at 20:06

Status:	Needs review	» Needs work

+  $text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
+  return strip_tags(apachesolr_strip_ctl_chars($text));

I don't think we want to decode.

Comment #9

pwolanin commented 3 April 2009 at 13:31

Status:	Needs work	» Needs review

Status	File	Size
new	add-space-420290-9.patch	2.09 KB

Here's a minimal version that uses the more efficient str_replace()

Comment #10

pwolanin commented 3 April 2009 at 13:35

Status	File	Size
new	add-space-420290-10.patch	2.16 KB

Actually, that's not really robust. Doing it like this is safer.

Comment #11

pwolanin commented 3 April 2009 at 13:59

Status:	Needs review	» Active

Committing #10; setting back to active for consideration of the &nbsp; issues.

Comment #12

mkalkbrenner commented 3 April 2009 at 18:11

Status:	Active	» Needs work

What kind of problem do you see with html_entity_decode?

An example:

The quick brown&nbsp;fox jumps over the lazy dog.

Without converting &nbsp; into a blank, a search for "fox" has no result.

Theoretically this should also cause problems, but I haven't verified it yet:

German Umlauts like in H&auml;ngebr&uuml;cke

Without converting &auml; and &uuml; into "ä" and "ü", a search for "Hängebrücke" should have no result.

From my point of view solr should index plain text and not html (we already do strip_tags but that's not enough). If you want to have html entities when displaying the search result you have to add them again at this point.

Anybody who can second this issue?

(cspitzlay told me that he could reproduce the &nbsp; problem but not the one with German umlauts.)

I hope to find some time soon to do some deeper investigations on this. Meanwhile I use my patch in our project.
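
What the decode step does to the two examples in this comment can be checked with plain PHP. One detail worth noting: html_entity_decode() turns &nbsp; into the non-breaking space character U+00A0, not into an ASCII blank. The final str_replace() is an assumption added for this sketch, not something taken from the patch.

```php
<?php
// Decode named entities to UTF-8 characters.
$decoded = html_entity_decode('H&auml;ngebr&uuml;cke', ENT_COMPAT, 'UTF-8');
// $decoded is now the UTF-8 string "Hängebrücke".

// &nbsp; decodes to U+00A0 (bytes C2 A0), which is not an ASCII space.
$nbsp = html_entity_decode('brown&nbsp;fox', ENT_COMPAT, 'UTF-8');

// Assumption for this sketch: map U+00A0 to a plain space so that a
// tokenizer splitting on whitespace sees two words.
$spaced = str_replace("\xC2\xA0", ' ', $nbsp);
```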

Comment #13

pwolanin commented 3 April 2009 at 19:13

Well, we should perhaps decode but then run check_plain() on it or some such. Basically, we don't want &lt; and &gt; to be converted into < and >, because the resulting tags would be stripped by strip_tags().

For example - if the node content includes the output of a code filter.
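
The concern can be reproduced with a short sketch. The input string is a made-up example of code filter output in which a literal tag has been escaped.

```php
<?php
// Hypothetical node content: a code filter has escaped a literal tag.
$html = '<p>Use &lt;strong&gt; for emphasis.</p>';

// Decoding first turns the escaped text into a real tag ...
$decoded = html_entity_decode($html, ENT_COMPAT, 'UTF-8');

// ... which strip_tags() then removes along with the genuine markup.
$indexed = strip_tags($decoded);
// $indexed is 'Use  for emphasis.' and the word "strong" is lost.
```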

Comment #14

mkalkbrenner commented 6 April 2009 at 10:07

"we don't want &lt; &gt; to be converted into tags that would be stripped by strip_tags()"

Good argument. What about simply changing the order of commands:

$text = strip_tags($text);
$text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');

check_plain() could then be used to show the search result.
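
A sketch of the reordered steps on a made-up input. Plain htmlspecialchars() stands in for Drupal's check_plain() here so that the example runs outside Drupal; that substitution is an assumption of the sketch.

```php
<?php
// Hypothetical node content with an escaped literal tag.
$html = '<p>Use &lt;strong&gt; for emphasis.</p>';

// Strip real markup first, then decode what is left.
$indexed = html_entity_decode(strip_tags($html), ENT_COMPAT, 'UTF-8');
// $indexed is 'Use <strong> for emphasis.' (plain text, literal brackets kept).

// Escape again when displaying the result.
$displayed = htmlspecialchars($indexed, ENT_QUOTES, 'UTF-8');
// $displayed is 'Use &lt;strong&gt; for emphasis.'
```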

Comment #15

pwolanin commented 6 April 2009 at 12:26

No, we can't check_plain() the search result since we have tags for highlighting key words generated by Solr.
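
To illustrate why escaping at display time is a problem: the snippet below is made up, and the <em> tags are only an assumed example of highlighting markup. htmlspecialchars() again stands in for check_plain().

```php
<?php
// Hypothetical search snippet with highlighting markup around the match.
$snippet = 'The quick brown <em>fox</em> jumps';

// Escaping the whole snippet also escapes the highlighting tags,
// so they would be shown as text instead of being rendered.
$escaped = htmlspecialchars($snippet, ENT_QUOTES, 'UTF-8');
// $escaped is 'The quick brown &lt;em&gt;fox&lt;/em&gt; jumps'
```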

Comment #16

mkalkbrenner commented 7 April 2009 at 10:25

Status:	Needs work	» Needs review

Status	File	Size
new	apachesolr_htmlentities.patch	3.21 KB

You're right. Simply using check_plain() will break highlighting.

So please have a look at the attached patch. It solves the &nbsp; problem by decoding HTML entities, but should not break highlighting or the representation of text entered using the code filter.

Comment #17

pwolanin commented 8 April 2009 at 14:10

Status:	Needs review	» Needs work

this looks like excessive work - let's just run check_plain() or htmlspecialchars() after the decode
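
A minimal sketch of that suggestion, assuming the escaping happens at index time and that strip_tags() runs before the decode. The function name demo_prepare_text() is hypothetical and this is not the code that was committed.

```php
<?php
// Sketch only: strip markup, decode entities, then re-escape the HTML
// special characters so the indexed text never contains raw < or >.
function demo_prepare_text($html) {
  $text = strip_tags($html);
  $text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$out = demo_prepare_text('<p>H&auml;ngebr&uuml;cke &lt;b&gt; &amp; more</p>');
// Umlauts are decoded to UTF-8, while <, > and & stay escaped:
// 'Hängebrücke &lt;b&gt; &amp; more'
```

With the text escaped before it is sent, the display side does not need to escape the snippet again, which leaves highlighting markup untouched.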

Comment #18

pwolanin commented 28 April 2009 at 20:15

see: #447622: Encoding of CCK facets

Comment #19

pwolanin commented 30 April 2009 at 12:55

Status:	Needs work	» Fixed

Comment #20

pwolanin commented 12 May 2009 at 18:33

Status:	Fixed	» Closed (fixed)
