HTML filter corrupts non-HTML text [#2873694]

Unfortunately processors are only settable per Index and not per Field.
Therefore, the HTML filter can only be turned on or off for all fields within an index. But it's perfectly valid to index HTML-encoded text together with non-HTML text in one index, for example two different fields of the same entity where one uses HTML while the other doesn't.

Even if that seems to work most of the time, there are edge cases where the HTML filter corrupts the non-HTML text. One problem is HTML special characters that are erroneously converted or stripped. For an example, see https://www.ncbi.nlm.nih.gov/pubmed/27751366

Tokens like 'C>T' are very important in this content, but the HTML filter clearly has problems with them when the text is not encoded as HTML.

Therefore, I suggest making the HTML filter a little bit smarter, so that it only touches the text if it's really HTML.

Comment	File	Size	Author
#27	2873694-27-12.patch	3.17 KB	kfritsche
#27	8.x-1.x: PHP 7 & MySQL 5.5, D8.5 345 pass, 1 fail
#27	2873694-27-11.patch	5.5 KB	kfritsche
#27	8.x-1.x: PHP 7 & MySQL 5.5, D8.5 345 pass, 1 fail
#15	2873694-15-failing-test.patch	2.07 KB	mkalkbrenner
#15	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail
#15	12-15-failing-test-interdiff.txt	1 KB	mkalkbrenner
#12	2873694-12--html_filter_keywords.patch	2.33 KB	drunken monkey
#12	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 354 pass
#12	2873694-12--html_filter_keywords--tests_only.patch	1.21 KB	drunken monkey
#12	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail
#12	2873694-12--html_filter_keywords--interdiff.txt	5.38 KB	drunken monkey
#11	2873694-11.patch	4.64 KB	mkalkbrenner
#11	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 349 pass
#11	8-11-interdiff.txt	412 bytes	mkalkbrenner
#8	2873694-8.patch	4.6 KB	mkalkbrenner
#8	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 349 pass
#8	4-8-interdiff.txt	502 bytes	mkalkbrenner
#6	2873694-6.patch	4.6 KB	mkalkbrenner
#6	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 PHPLint Failed
#6	4-6-interdiff.txt	503 bytes	mkalkbrenner
#4	2873694-4.patch	4.29 KB	mkalkbrenner
#4	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail
#4	2-4-interdiff.txt	869 bytes	mkalkbrenner
#2	2873694.patch	3.32 KB	mkalkbrenner
#2	8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail


Comments

Comment #1

28 April 2017 at 14:26

mkalkbrenner created an issue. See original summary.

Comment #2

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 28 April 2017 at 14:31

Status:

Active

» Needs review

File	Size
2873694.patch	3.32 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail

Comment #3

28 April 2017 at 14:40

Status:

Needs review

» Needs work

The last submitted patch, 2: 2873694.patch, failed testing.

Comment #4

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 28 April 2017 at 15:10

Status:

Needs work

» Needs review

File	Size
2-4-interdiff.txt	869 bytes
2873694-4.patch	4.29 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail

Comment #5

28 April 2017 at 15:18

Status:

Needs review

» Needs work

The last submitted patch, 4: 2873694-4.patch, failed testing.

Comment #6

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 28 April 2017 at 15:24

Status:

Needs work

» Needs review

File	Size
4-6-interdiff.txt	503 bytes
2873694-6.patch	4.6 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 PHPLint Failed

3 files were hidden/shown/deleted

File	Size
2873694.patch	3.32 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail
2-4-interdiff.txt	869 bytes
2873694-4.patch	4.29 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail

Comment #7

28 April 2017 at 15:30

Status:

Needs review

» Needs work

The last submitted patch, 6: 2873694-6.patch, failed testing.

Comment #8

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 28 April 2017 at 15:35

Status:

Needs work

» Needs review

File	Size
4-8-interdiff.txt	502 bytes
2873694-8.patch	4.6 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 349 pass

1 file was hidden/shown/deleted

File	Size
2873694-6.patch	4.6 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 PHPLint Failed

Comment #9

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 30 April 2017 at 10:27

Priority:	Major	» Normal
Status:	Needs review	» Postponed (maintainer needs more info)

Unfortunately processors are only settable per Index and not per Field.

That's not true: there is an "Enable this processor on the following fields" setting for all processors working on field values, for just this reason.
#2859683: "Processors don't correctly preprocess keywords per field" could be a problem in this context, though – is that what you mean? In the case of the HTML filter, though, we can think about whether preprocessing the search query even makes sense. Is it really a use case that users enter HTML in their search keywords and we want to strip that? I think not, and if everyone agrees, then that would probably be the solution.

Comment #10

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 30 April 2017 at 11:10

"Enable this processor on the following fields"
Ok, good to know:)

Nevertheless the issue still exists for us.
Removing HTML from the search keys / phrase is essential if you do similarity calculations. For example we "compare" field values.
For us the attached patch solves all our issues without destroying existing functionality.

Comment #11

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 1 May 2017 at 09:13

Status:

Postponed (maintainer needs more info)

» Needs review

File	Size
8-11-interdiff.txt	412 bytes
2873694-11.patch	4.64 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 349 pass

As discussed in IRC, I extended the isHtml() function with a check for HTML entities.

Comment #12

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 1 May 2017 at 10:28

File	Size
2873694-12--html_filter_keywords--interdiff.txt	5.38 KB
2873694-12--html_filter_keywords--tests_only.patch	1.21 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail
2873694-12--html_filter_keywords.patch	2.33 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 354 pass

3 files were hidden/shown/deleted

File	Size
2873694-8.patch	4.6 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 349 pass
8-11-interdiff.txt	412 bytes
2873694-11.patch	4.64 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 349 pass

As I also said in IRC, checking this for field values during indexing still doesn't make any sense, in my opinion.
Also, we should have proper test coverage for keywords processing.

Finally, one thing I forgot to actually mention in IRC, is that we could also just make this an option on the processor: "Preprocess search keys" (or something like that). This would just be off for the vast majority of use cases, and you could even just temporarily enable it in code (without saving it) when doing your custom searches. What's your opinion on that?
(I guess, even though we've never done that, we could even make this a "hidden" option – provide configuration for it, but no actual element in the config form. Should cover your use case, too, and not confuse all those users who, for the most part, won't want to use it anyways.)
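
To make the "hidden option" idea concrete, here is a minimal, language-neutral sketch in Python. It is not the Search API code: the names HtmlFilterConfig, preprocess_search_keys and process_keys are made up for illustration, and the tag stripping is a deliberately crude regex. It only shows an option that is off by default and can be switched on for a single search without touching the saved configuration.

```python
import re
from dataclasses import dataclass, replace

# Crude stand-in for tag stripping; not the real HTML filter.
TAG_RE = re.compile(r"<[^<>]+>")


@dataclass(frozen=True)
class HtmlFilterConfig:
    # Hypothetical option name; off by default, as most sites would want.
    preprocess_search_keys: bool = False


def strip_tags(text: str) -> str:
    """Replace anything tag-shaped with a space and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def process_keys(keys: str, config: HtmlFilterConfig) -> str:
    """Leave search keys untouched unless the option is enabled."""
    return strip_tags(keys) if config.preprocess_search_keys else keys


saved = HtmlFilterConfig()                               # stored configuration
temporary = replace(saved, preprocess_search_keys=True)  # in-memory override only

print(process_keys("<b>foo</b> bar", saved))      # <b>foo</b> bar
print(process_keys("<b>foo</b> bar", temporary))  # foo bar
```

The saved object is never modified, which mirrors the "enable it in code without saving it" suggestion.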

Comment #13

1 May 2017 at 10:31

The last submitted patch, 12: 2873694-12--html_filter_keywords--tests_only.patch, failed testing.

Comment #14

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 2 May 2017 at 08:08

Status:

Needs review

» Needs work

Thanks for your support.
But I don't think that just adding an automatic or configurable HTML detection to the key processing is sufficient.
I think it's a valid use case to index HTML and non-HTML text in the same index field.
So here are some expectations for tokens in this use case during indexing:

foo<br>bar => "foo bar"
<ul><li>foo</li><li>bar</li></ul> => "foo bar"
foo bar => "foo bar"
foobar => "foobar"
foo > bar => "foo > bar"
foo>bar => "foo>bar"

The patch in #11 ensures that, while the patch in #12 only handles the search keys.
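
For readers who want to try these expectations out, the following is a rough sketch in Python of a "filter only if it looks like HTML" heuristic. It is an assumption-laden illustration, not the isHtml() implementation from the patch: the function names and both regular expressions are invented here, and the detection rule (a tag-shaped token or a character reference) is only one possible choice.

```python
import html
import re

# Hypothetical detection rules: something shaped like a real tag,
# or a named/numeric character reference such as &gt; or &#62;.
TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>")
ENTITY_RE = re.compile(r"&(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def looks_like_html(text: str) -> bool:
    """Guess whether the text is HTML at all."""
    return bool(TAG_RE.search(text) or ENTITY_RE.search(text))


def filter_html(text: str) -> str:
    """Strip tags and decode entities, but leave plain text untouched."""
    if not looks_like_html(text):
        return text
    without_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


for value in ["foo<br>bar", "<ul><li>foo</li><li>bar</li></ul>",
              "foobar", "foo > bar", "foo>bar", "C>T"]:
    print(repr(value), "=>", repr(filter_html(value)))
```

A heuristic like this can still misfire: plain text such as "a<b and c>d" is tag-shaped and would be treated as HTML, so it reduces the problem rather than eliminating it.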

Comment #15

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 2 May 2017 at 08:11

Status:

Needs work

» Needs review

File	Size
12-15-failing-test-interdiff.txt	1 KB
2873694-15-failing-test.patch	2.07 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail

I extended the tests-only patch from #12 according to #14.

Comment #16

2 May 2017 at 08:20

Status:

Needs review

» Needs work

The last submitted patch, 15: 2873694-15-failing-test.patch, failed testing.

Comment #17

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 12 May 2017 at 10:07

Status:

Needs work

» Needs review

I think it's a valid use case to index HTML and non-HTML text in the same index field.

Hm, that probably depends on what you mean by that. It's true that, in a "formatted text" field you can pick both a plain text and an HTML-only format, and everything in between, and can't say that the field will only ever contain one or the other. However, at this point in the indexing process, I'd say we should have consistent data – either a field always contains plain text, or always HTML. Formatted text fields should probably always yield HTML, with the input formatted and escaped according to the selected format. Otherwise, input formats like BB code would be completely unsupported by default.

The real problem I see here now, which has nothing to do with the HTML filter per se, is that this doesn't actually seem to be the case: a quick test revealed that indexing, e.g., the nodes' "Body" field results in exactly the entered user input being handed to indexing, regardless of the selected format. This leads to HTML and unescaped ">" characters being potentially present in a single field value – something your solution can't properly handle, either, I think.

So, actually fixing that, and getting the formatted text for indexing, seems to me to be the more pressing problem, and potential bug. Should get a separate issue, though. In any case, I think it might also help with your issue? Because then, a "C>T" entered in a formatted text field would just arrive as "C&gt;T" at the HTML filter processor and leave it as "C>T" again – just like it should, for your use case.
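
The round trip described here can be sketched in a few lines of Python. This is only an illustration under assumptions: render_plain_text and strip_html are hypothetical stand-ins for "render the field through a plain-text format" and "run the HTML filter", and the second input string is an invented example, not taken from the linked abstract.

```python
import html
import re

# Crude stand-in for tag stripping; not the real HTML filter.
TAG_RE = re.compile(r"<[^<>]+>")


def render_plain_text(user_input: str) -> str:
    """Hypothetical plain-text format: escape all HTML special characters."""
    return html.escape(user_input, quote=False)


def strip_html(markup: str) -> str:
    """Hypothetical HTML filter: drop tags, then decode entities."""
    return html.unescape(TAG_RE.sub(" ", markup)).strip()


rendered = render_plain_text("C>T")
print(rendered)              # C&gt;T
print(strip_html(rendered))  # C>T

# Raw input handed straight to the filter can be damaged, while the
# rendered (escaped) form survives the same filter intact.
raw = "A<G or C>T"
print(strip_html(raw))                     # A T
print(strip_html(render_plain_text(raw)))  # A<G or C>T
```

The point of the sketch is that the filter itself does not need to guess: once every value reaching it is consistently escaped HTML, stripping and decoding is lossless for plain-text input.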

Comment #18

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 12 May 2017 at 11:08

Comment #19

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 12 May 2017 at 11:57

Comment #20

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 12 May 2017 at 13:26

Oops, sorry!
See also: #2875048: Allow indexing of rendered (instead of raw) field values.

Comment #21

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 27 May 2017 at 10:48

So, should we commit #12 in the meantime, or switch to a (potentially hidden) setting for whether to preprocess search keys in the HTML filter processor, or do you still have doubts about the whole approach?

Comment #22

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 27 May 2017 at 14:24

We're still using patch #11 in production. I have to check our edge cases with #12.

Comment #23

borisson_

Dutch

Mechelen, 🇧🇪

CreditAttribution: borisson_ as a volunteer and at Dazzle commented 16 September 2017 at 09:18

2 files were hidden/shown/deleted

File	Size
2873694-12--html_filter_keywords--tests_only.patch	1.21 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 322 pass, 1 fail
12-15-failing-test-interdiff.txt	1 KB

We're still using patch #11 in production. I have to check our edge cases with #12.

@mkalkbrenner: Have you had the time to test this patch?

Otherwise I think we can commit #12 and close this issue.

Comment #24

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 1 October 2017 at 10:51

Bump.
(Damn, forgot to bug Markus about this during DrupalCon …)

Comment #25

mkalkbrenner

German

🇩🇪

CreditAttribution: mkalkbrenner at bio.logis Genetic Information Management GmbH commented 2 October 2017 at 08:08

I really have to verify this again.

Because then, a "C>T" entered in a formatted text field would just arrive as "C&gt;T" at the HTML filter processor and leave it as "C>T" again – just like it should, for your use case.

Unfortunately that isn't true. Even in a standard Drupal installation a user can choose between HTML and plain text format for one field per entity. The chosen format is saved along with the data. So the same field can contain "C>T" or "C&gt;T" in two different entities.

Comment #26

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey as a volunteer commented 8 October 2017 at 13:09

Because then, a "C>T" entered in a formatted text field would just arrive as "C&gt;T" at the HTML filter processor and leave it as "C>T" again – just like it should, for your use case.

Unfortunately that isn't true.

It would be true if we fixed the problem I described in the paragraph preceding that statement – that's what I said.

Comment #27

kfritsche

German

🇩🇪🇪🇺

CreditAttribution: kfritsche at bio.logis Genetic Information Management GmbH commented 16 October 2017 at 12:55

File	Size
2873694-27-11.patch	5.5 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.5 345 pass, 1 fail
2873694-27-12.patch	3.17 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.5 345 pass, 1 fail

1 file was hidden/shown/deleted

File	Size
2873694-12--html_filter_keywords.patch	2.33 KB
8.x-1.x: PHP 7 & MySQL 5.5, D8.4 354 pass

While the discussion is still in progress, I just re-rolled both patches from #11 and #12 with tests from #15.

No interdiff, as it's just a re-roll.

Comment #28

16 October 2017 at 13:00

The last submitted patch, 27: 2873694-27-11.patch, failed testing.

Comment #29

16 October 2017 at 13:00

Status:

Needs review

» Needs work

The last submitted patch, 27: 2873694-27-12.patch, failed testing.
