The use of strip_tags in apachesolr_node_to_document() and in apachesolr_clean_text() sometimes causes words to be run together.
For example, an HTML-valued field might contain the code: "<p>Abc</p><p>Def</p>". This will be indexed as AbcDef because that's what strip_tags returns.
The attached patch keeps the words separate by adding whitespace before the tags are stripped.
In addition, the removal of control characters is done on the body now, too.
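To illustrate the problem and the general idea of the fix, here is a minimal sketch. The helper name `example_strip_tags_keep_words()` is hypothetical and this is not the code from the attached patch; it only assumes that a space is inserted in front of every tag before `strip_tags()` runs.

```php
<?php
// Hypothetical helper for illustration; not the code from the attached patch.
function example_strip_tags_keep_words($html) {
  // Put a space in front of every tag so adjacent elements do not fuse.
  $spaced = str_replace('<', ' <', $html);
  // Strip the tags, then collapse the runs of whitespace that were added.
  return trim(preg_replace('/[ \t\r\n]+/', ' ', strip_tags($spaced)));
}

$html = '<p>Abc</p><p>Def</p>';
echo strip_tags($html), "\n";                     // AbcDef
echo example_strip_tags_keep_words($html), "\n";  // Abc Def
```

One trade-off of this simple approach: a word that contains an inline tag, such as `<b>A</b>bc`, is also split into two words.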
| Comment | File | Size | Author |
|---|---|---|---|
| #16 | apachesolr_htmlentities.patch | 3.21 KB | mkalkbrenner |
| #10 | add-space-420290-10.patch | 2.16 KB | pwolanin |
| #9 | add-space-420290-9.patch | 2.09 KB | pwolanin |
| #7 | apachesolr_clean_text.patch | 3.93 KB | mkalkbrenner |
| #6 | apachesolr.hierarchical_facet.patch | 10.11 KB | mkalkbrenner |
Comments
Comment #1
mkalkbrenner
cspitzlay is right. And I found an additional problem.
For example, the body will be indexed in Solr using a whitespace tokenizer. So two strings separated by &nbsp; won't become two tokens and therefore could not be found when searching, because the token looks like this: Abc&nbsp;Def
I attached an improved version of this patch which also deals with HTML entities and removes some redundancy regarding the stripping of control chars.
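The entity problem can be reproduced outside Solr with a rough sketch. The helper `example_whitespace_tokens()` is hypothetical and only approximates a whitespace tokenizer by splitting on ASCII whitespace; the decoding step is one possible approach, not necessarily what the attached patch does.

```php
<?php
// Hypothetical helper for illustration; only approximates a whitespace tokenizer.
function example_whitespace_tokens($text) {
  // Split on ASCII whitespace only and drop empty pieces.
  return preg_split('/[ \t\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$raw = 'Abc&nbsp;Def';

// The entity contains no whitespace, so the text stays a single token.
print_r(example_whitespace_tokens($raw));

// Decode entities, then turn the resulting U+00A0 (UTF-8 bytes C2 A0) into a plain space.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');
$decoded = str_replace("\xC2\xA0", ' ', $decoded);
print_r(example_whitespace_tokens($decoded));
```

The explicit `str_replace()` is needed because a decoded non-breaking space is still not an ordinary blank.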
Comment #2
pwolanin commented
I'm not sure this order of operations is right - we use $text later, so we need to strip control characters from it.
Comment #3
damien tournoud commented
Any reason not to move to solr.HTMLStripWhitespaceTokenizerFactory?
Comment #4
mkalkbrenner
@pwolanin: You're right. The HTML tags that get individually indexed should not contain control characters either. I adjusted the patch.
BTW, the document that gets posted to Solr should not contain any control characters at all. Maybe there's a better place to strip them before sending the document, instead of doing so on individual fields.
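As a sketch of what stripping in one place could look like: the helper `example_strip_ctl_chars()` and the exact character range are assumptions made for illustration, not the code of this module or of the Solr PHP client.

```php
<?php
// Hypothetical helper; the character range is an assumption for this sketch.
function example_strip_ctl_chars($text) {
  // Replace ASCII control characters, except tab, LF and CR, with a space.
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);
}

// Hypothetical document fields, cleaned in one pass instead of field by field.
$fields = array(
  'title' => "Abc\x07Def",
  'body' => "Line one\nLine\x0Btwo",
);
$clean = array_map('example_strip_ctl_chars', $fields);
```

Applying the helper to every field just before the document is sent would avoid repeating the same `preg_replace()` per field.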
Comment #5
pwolanin commented
@DamZ - answered that in IRC (it breaks the ISO char mapping and also does not strip tags from the stored content, only from the tokens, so search snippets could contain HTML, including unmatched tags).
@mkalkbrenner - likely we could propose this as an upstream enhancement in the Solr PHP class.
Comment #6
mkalkbrenner
please ignore this post
Comment #7
mkalkbrenner
Sorry, I attached the wrong file in the previous post :-(
I created a patch to move stripping of control chars to Service.php:
http://code.google.com/p/solr-php-client/issues/detail?id=5
Attached you'll find an adjusted version of the clean text patch that relies on the patched version of Service.php.
Comment #8
pwolanin commented
I don't think we want to decode.
Comment #9
pwolanin commented
Here's a minimal version that uses the more efficient str_replace().
Comment #10
pwolanin commented
Actually, that's not really robust; doing it like this is safer.
Comment #11
pwolanin commented
committing #10 - setting back to active for consideration of &nbsp; issues
Comment #12
mkalkbrenner
What kind of problem do you see with html_entity_decode?
An example:
Without converting &nbsp; into a blank, a search for "fox" has no result.
Theoretically this should also cause problems, but I haven't verified it yet:
Without converting &auml; and &uuml; into "ä" and "ü", a search for "Hängebrücke" should have no result.
From my point of view Solr should index plain text and not HTML (we already do strip_tags but that's not enough). If you want to have HTML entities when displaying the search result, you have to add them again at this point.
Anybody who can second this issue?
(cspitzlay told me that he could reproduce the &nbsp; problem but not the one with German umlauts.)
I hope to find some time soon to do some deeper investigation on this. Meanwhile I use my patch in our project.
Comment #13
pwolanin commented
Well, we should perhaps decode but then run check_plain on it or some such. Basically, we don't want &lt; &gt; to be converted into tags that would be stripped by strip_tags().
For example, if the node content includes the output of a code filter.
Comment #14
mkalkbrenner
"we don't want &lt; &gt; to be converted into tags that would be stripped by strip_tags()"
Good argument. What about simply changing the order of commands:
check_plain could then be used to show the search result.
Comment #15
pwolanin commented
No, we can't check_plain() the search result since we have tags for highlighting keywords generated by Solr.
Comment #16
mkalkbrenner
You're right. Simply using check_plain() will break highlighting.
So please have a look at the attached patch. It solves the &nbsp; problem by decoding HTML entities, but should not break highlighting or the representation of text entered using the code filter.
Comment #17
pwolanin commented
This looks like excessive work - let's just run check_plain() or htmlspecialchars() after the decode.
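A rough sketch of that order of operations (strip tags, decode, then re-encode) could look like the following. The helper `example_clean_for_index()` is hypothetical and is not the committed code; it uses plain `htmlspecialchars()` so that it runs without Drupal.

```php
<?php
// Hypothetical helper sketching the order discussed in this thread.
function example_clean_for_index($html) {
  // 1. Remove real markup while escaped markup is still just entities.
  $text = strip_tags(str_replace('<', ' <', $html));
  // 2. Decode entities so &auml; becomes "ä" and &nbsp; becomes U+00A0.
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  // 3. Turn non-breaking spaces into blanks and collapse whitespace.
  $text = str_replace("\xC2\xA0", ' ', $text);
  $text = trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
  // 4. Re-encode special characters so decoded "<" and ">" are not markup.
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo example_clean_for_index('<p>H&auml;ngebr&uuml;cke</p><p>&lt;b&gt;&nbsp;code</p>');
// Prints: Hängebrücke &lt;b&gt; code
```

Because the tags are stripped before decoding, escaped markup such as the output of a code filter survives as text, and the final encoding step keeps it from being interpreted as HTML when displayed.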
Comment #18
pwolanin commented
See: #447622: Encoding of CCK facets
Comment #19
pwolanin commented
Comment #20
pwolanin commented