The use of strip_tags() in apachesolr_node_to_document() and in apachesolr_clean_text() sometimes causes words to run together.
For example, an HTML-valued field might contain the markup "<p>Abc</p><p>Def</p>". This will be indexed as "AbcDef" because that's what strip_tags() returns.

The attached patch keeps the words separate by adding whitespace before the tags are stripped.
In addition, the removal of control characters is done on the body now, too.
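
To illustrate the idea of adding whitespace before the tags are stripped, here is a minimal sketch. The helper name example_strip_tags_spaced() is hypothetical and this is not the code from the attached patch; it only assumes that a space in front of every "<" is enough to keep neighbouring elements apart.

```php
<?php
// Hypothetical helper for illustration only; not the function from the patch.
function example_strip_tags_spaced($text) {
  // Put a space in front of every "<" so adjacent elements cannot fuse.
  $text = str_replace('<', ' <', $text);
  $text = strip_tags($text);
  // Collapse the whitespace runs introduced above and trim the ends.
  return trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
}

$html = '<p>Abc</p><p>Def</p>';
echo strip_tags($html), "\n";                // AbcDef
echo example_strip_tags_spaced($html), "\n"; // Abc Def
```

One trade-off of this simple approach: inline markup inside a word, such as "<b>A</b>bc", would also be split into "A bc".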

Comments

mkalkbrenner’s picture

Version: 6.x-1.0-beta6 » 6.x-1.x-dev
Assigned: Unassigned » mkalkbrenner
Status: Active » Needs review
Attachment (new): 1.77 KB

cspitzlay is right. And I found an additional problem.

For example, the body will be indexed in Solr using a whitespace tokenizer. So two strings separated by &nbsp; won't become two tokens and therefore cannot be found when searching, because the token looks like this: Abc&nbsp;Def

I attached an improved version of this patch which also deals with html entities and removes some redundancy regarding stripping of control chars.
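
The &nbsp; problem can be reproduced outside Solr with a rough sketch like the one below. Both function names are hypothetical, and the tokenizer is only a simplified stand-in that splits on ASCII whitespace. Note that html_entity_decode() turns &nbsp; into U+00A0 (a no-break space), which a tokenizer may not treat as a separator, so the sketch maps it to a plain space explicitly.

```php
<?php
// Simplified stand-in for whitespace tokenization (assumption: ASCII whitespace only).
function example_whitespace_tokens($text) {
  return preg_split('/[ \t\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: decode entities, then turn no-break spaces into blanks.
function example_decode_entities_for_index($text) {
  $text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
  // &nbsp; decodes to U+00A0, encoded in UTF-8 as the bytes 0xC2 0xA0.
  return str_replace("\xC2\xA0", ' ', $text);
}

$raw = 'The quick brown&nbsp;fox jumps';
$before = example_whitespace_tokens($raw);
$after = example_whitespace_tokens(example_decode_entities_for_index($raw));

var_dump(in_array('fox', $before)); // bool(false): the token is "brown&nbsp;fox"
var_dump(in_array('fox', $after));  // bool(true)
```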

pwolanin’s picture

I'm not sure this order of operations is right - we use $text later so we need to strip control characters.

damien tournoud’s picture

Any reason not to move to solr.HTMLStripWhitespaceTokenizerFactory?

mkalkbrenner’s picture

Attachment (new): 1.44 KB

@pwolanin: You're right. The HTML tags that get indexed individually should not contain control characters either. I adjusted the patch.

BTW, the document that gets posted to Solr should not contain any control characters at all. Maybe there's a better place to strip them before sending the document, instead of doing so on individual fields.
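
As a rough sketch of what stripping control characters could look like: the helper below is hypothetical (it is not the module's apachesolr_strip_ctl_chars() and not the patch) and assumes that every ASCII control character except tab, line feed and carriage return should be replaced by a space.

```php
<?php
// Hypothetical helper; the set of characters to replace is an assumption.
function example_strip_ctl_chars($text) {
  // Replace ASCII control characters, except \t (0x09), \n (0x0A) and \r (0x0D),
  // with a space so that the surrounding words stay separated.
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);
}

// The vertical tab (0x0B) is replaced, the ordinary tab is kept.
var_dump(example_strip_ctl_chars("Abc\x0BDef\tGhi") === "Abc Def\tGhi"); // bool(true)
```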

pwolanin’s picture

@DamZ - answered that in IRC (it breaks the ISO char mapping and also does not strip tags from the stored content, only from the tokens, so search snippets could contain HTML, including unmatched tags).

@mkalkbrenner - we could likely propose this as an upstream enhancement to the Solr PHP class.

mkalkbrenner’s picture

Attachment (new): 10.11 KB

please ignore this post

mkalkbrenner’s picture

Attachment (new): 3.93 KB

Sorry, I attached the wrong file in the previous post :-(

I created a patch to move stripping of control chars to Service.php:
http://code.google.com/p/solr-php-client/issues/detail?id=5

Attached you'll find an adjusted version of the clean text patch that relies on the patched version of Service.php.

pwolanin’s picture

Status: Needs review » Needs work
+  $text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
+  return strip_tags(apachesolr_strip_ctl_chars($text));

I don't think we want to decode.

pwolanin’s picture

Status: Needs work » Needs review
Attachment (new): 2.09 KB

Here's a minimal version that uses the more efficient str_replace()

pwolanin’s picture

Attachment (new): 2.16 KB

Actually, that's not really robust. Doing it like this is safer.

pwolanin’s picture

Status: Needs review » Active

Committing #10. Setting back to active for consideration of the &nbsp; issues.

mkalkbrenner’s picture

Status: Active » Needs work

What kind of problem do you see with html_entity_decode?

An example:

The quick brown&nbsp;fox jumps over the lazy dog.

Without converting &nbsp; into a blank, a search for "fox" returns no results.

Theoretically this should also cause problems, but I haven't verified it yet:

German Umlauts like in H&auml;ngebr&uuml;cke

Without converting &auml; and &uuml; into "ä" and "ü", a search for "Hängebrücke" should return no results.

From my point of view, Solr should index plain text and not HTML (we already use strip_tags, but that's not enough). If you want to have HTML entities when displaying the search result, you have to add them again at that point.

Can anybody confirm this issue?

(cspitzlay told me that he could reproduce the &nbsp; problem, but not the one with German umlauts.)

I hope to find some time soon to do some deeper investigations on this. Meanwhile I use my patch in our project.

pwolanin’s picture

Well, we should perhaps decode but then run check_plain() on it or some such. Basically, we don't want &lt; &gt; to be converted into tags that would then be stripped by strip_tags().

For example - if the node content includes the output of a code filter.

mkalkbrenner’s picture

"we don't want < > to be converted into tags that would be stripped by strip_tags()"

Good argument. What about simply changing the order of commands:

$text = strip_tags($text);
$text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');

check_plain() could then be used to show the search result.
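
The effect of the two orderings can be seen in a small sketch. The sample string is made up and stands for node content in which a code filter has escaped a tag as &lt;b&gt;; the variable names are illustrative only.

```php
<?php
// Made-up sample: escaped tags inside a real paragraph tag.
$filtered = '<p>Use &lt;b&gt;bold&lt;/b&gt; tags</p>';

// Decode first: the escaped tags become real tags and are then stripped.
$decode_first = strip_tags(html_entity_decode($filtered, ENT_COMPAT, 'UTF-8'));

// Strip first: only the real <p> tags are removed, the escaped ones survive
// and are decoded afterwards.
$strip_first = html_entity_decode(strip_tags($filtered), ENT_COMPAT, 'UTF-8');

echo $decode_first, "\n"; // Use bold tags
echo $strip_first, "\n";  // Use <b>bold</b> tags
```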

pwolanin’s picture

No, we can't check_plain() the search result since we have tags for highlighting key words generated by Solr.

mkalkbrenner’s picture

Status: Needs work » Needs review
Attachment (new): 3.21 KB

You're right. Simply using check_plain() will break highlighting.

So please have a look at the attached patch. It solves the &nbsp; problem by decoding html entities, but should not break highlighting or the representation of text entered using code filter.

pwolanin’s picture

Status: Needs review » Needs work

This looks like excessive work. Let's just run check_plain() or htmlspecialchars() after the decode.
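
A minimal sketch of that suggestion, assuming the tags have already been stripped from the input. The helper name example_decode_then_escape() is hypothetical, and this is not necessarily what was committed.

```php
<?php
// Hypothetical helper: decode all entities, then re-escape only the HTML
// special characters (&, <, > and double quotes with ENT_COMPAT).
function example_decode_then_escape($text) {
  $text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
  return htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
}

$input = 'H&auml;ngebr&uuml;cke &lt;b&gt; &amp; more';
// The umlauts end up as plain UTF-8 characters, while < > & stay escaped.
echo example_decode_then_escape($input), "\n";
```

With this order, named entities such as &auml; become searchable characters, while escaped markup from a code filter is never turned back into real tags.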

pwolanin’s picture

Status: Needs work » Fixed

pwolanin’s picture

Status: Fixed » Closed (fixed)