Unfortunately processors are only settable per Index and not per Field.
Therefor the HTML filter can only be turned on or off for all fields within an index. But it's absolutely valid to index HTML encoded text together with non HTML in one index, for example two different fields of the same entity when one uses HTML while the second one doesn't.

Even if that seems to work most of the time, there edge cases where the HTML filter corrupts the non-HTML texts. One problem could be html special chars that will be erroneously converted or stripped. For example look at https://www.ncbi.nlm.nih.gov/pubmed/27751366

Tokens like 'C>T' are very important in this content but it's obvious that the HTML filter has problems with it when the text is not encoded as HTML.

Therefor I suggest to make the HTML a little bit smarter to only touch the text if it's really HTML.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

mkalkbrenner created an issue. See original summary.

mkalkbrenner’s picture

Status: Active » Needs review
FileSize
3.32 KB

Status: Needs review » Needs work

The last submitted patch, 2: 2873694.patch, failed testing.

mkalkbrenner’s picture

Status: Needs work » Needs review
FileSize
869 bytes
4.29 KB

Status: Needs review » Needs work

The last submitted patch, 4: 2873694-4.patch, failed testing.

mkalkbrenner’s picture

Status: Needs work » Needs review
FileSize
503 bytes
4.6 KB

Status: Needs review » Needs work

The last submitted patch, 6: 2873694-6.patch, failed testing.

mkalkbrenner’s picture

Status: Needs work » Needs review
FileSize
502 bytes
4.6 KB
drunken monkey’s picture

Priority: Major » Normal
Status: Needs review » Postponed (maintainer needs more info)

Unfortunately processors are only settable per Index and not per Field.

That's not true, there is a "Enable this processor on the following fields" for all processors working on field values for just this reason.
#2859683: Processors don't correctly preprocess keywords per field could be a problem in this context, though – is that what you mean? In the case of the HTML filter, though, we can think about whether preprocessing the search query even makes sense. Is it really a use case that users enter HTML in their search keywords and we want to strip that? I think not, and if everyone agrees than that would probably be the solution.

mkalkbrenner’s picture

"Enable this processor on the following fields"
Ok, good to know:)

Nevertheless the issue still exists for us.
Removing HTML from the search keys / phrase is essential if you do similarity calculations. For example we "compare" field values.
For us the attached patch solves all our issues without destroying existing functionality.

mkalkbrenner’s picture

Status: Postponed (maintainer needs more info) » Needs review
FileSize
412 bytes
4.64 KB

As discussed in IRC I extended the isHtml() function by a check for HTML entities.

drunken monkey’s picture

As I also said in IRC, checking this for field values during indexing still doesn't make any sense, in my opinion.
Also, we should have proper test coverage for keywords processing.

Finally, one thing I forgot to actually mention in IRC, is that we could also just make this an option on the processor: "Preprocess search keys" (or something like that). This would just be off for the vast majority of use cases, and you could even just temporarily enable it in code (without saving it) when doing your custom searches. What's your opinion on that?
(I guess, even though we've never done that, we could even make this a "hidden" option – provide configuration for it, but no actual element in the config form. Should cover your use case, too, and not confuse all those users who, for the most part, won't want to use it anyways.)

The last submitted patch, 12: 2873694-12--html_filter_keywords--tests_only.patch, failed testing.

mkalkbrenner’s picture

Status: Needs review » Needs work

Thanks for your support.
But I don't think that just added an automatic or configurable HTML detection to the key processing is sufficient.
I think it's a valid use case to index HTML and non-HTML text in the same index field.
So here some expectations for tokens in this use-case during indexing:

foo<br>bar => "foo bar"
<ul><li>foo</li><li>bar</li></ul> => "foo bar"
foo bar => "foo bar"
foobar => "foobar"
foo > bar => "foo > bar"
foo>bar => "foo>bar"

The patch #11 ensures that, while the patch in #12 only treats the search keys.

mkalkbrenner’s picture

Status: Needs work » Needs review
FileSize
1 KB
2.07 KB

I extended the tests only patch from #12 according to #14.

Status: Needs review » Needs work

The last submitted patch, 15: 2873694-15-failing-test.patch, failed testing.

drunken monkey’s picture

Status: Needs work » Needs review

I think it's a valid use case to index HTML and non-HTML text in the same index field.

Hm, that probably depends on what you mean by that. It's true that, in a "formatted text" field you can pick both a plain text and a HTML-only format, and everything in between, and can't say that the field will only ever contain one or the other. However, at this point in the indexing process, I'd say we should have consistent data – either a field always contains plain text, or always HTML. Formatted text fields should probably always yield HTML, with the input formatted and escaped according to the selected format. Otherwise, input formats like BB code would be completely unsupported by default.

The real problem I see here now, which has nothing to do with the HTML filter per se, is that this doesn't actually seem to be the case: a quick test revealed that indexing, e.g., the nodes' "Body" field results in exactly the entered user input being handed to indexing, regardless of the selected format. This leads to HTML and unescaped ">" characters being potentially present in a single field value – something your solution can't properly handle, either, I think.

So, actually fixing that, and getting the formatted text for indexing, seems to me to be the more pressing problem, and potential bug. Should get a separate issue, though. In any case, I think it might also help with your issue? Because then, a "C>T" entered in a formatted text field would just arrive as "C&gt;T" at the HTML filter processor and leave it as "C>T" again – just like it should, for your use case.

drunken monkey’s picture

mkalkbrenner’s picture

See also what? ;-)

drunken monkey’s picture

drunken monkey’s picture

So, should we commit #12 in the meantime, or switch to a (potentially hidden) setting for whether to preprocess search keys in the HTML filter processor, or do you still have doubts about the whole approach?

mkalkbrenner’s picture

We're still using patch #11 in production. I have to check our edge cases with #12.

borisson_’s picture

We're still using patch #11 in production. I have to check our edge cases with #12.

@mkalkbrenner: Have you had the time to test this patch?

Otherwise I think we can commit #12 and close this issue.

drunken monkey’s picture

Bump.
(Damn, forgot to bug Markus about this during DrupalCon …)

mkalkbrenner’s picture

I really have to verify this again.

Because then, a "C>T" entered in a formatted text field would just arrive as "C&gt;T" at the HTML filter processor and leave it as "C>T" again – just like it should, for your use case.

Unfortunately that isn't true. Even in a standard Drupal installation a user can choose between HTML and plain text format for one field per entity. The choosen format is saved along with the data. So the same field can contain "C&gt;T" or "C>T" in two different entities.

drunken monkey’s picture

Because then, a "C>T" entered in a formatted text field would just arrive as "C&gt;T" at the HTML filter processor and leave it as "C>T" again – just like it should, for your use case.

Unfortunately that isn't true.

It would be true if we fixed the problem I described in the paragraph preceding that statement – that's what I said.

kfritsche’s picture

While the discussion is still in progress, I just re-rolled both patches from #11 and #12 with tests from #15.

No interdiff as its just a re-roll.

The last submitted patch, 27: 2873694-27-11.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 27: 2873694-27-12.patch, failed testing. View results