Refactor Drupal HTML corrector (PHP5) [#374441]

Comment	File	Size	Author
#44	374441-refactor-html-corrector_17.patch	12.32 KB	scor

#43	374441-refactor-html-corrector_16.patch	11.81 KB	scor

#41	374441-refactor-html-corrector_15.patch	11.81 KB	scor

#38	374441-refactor-html-corrector_14.patch	12.23 KB	scor

#37	374441-refactor-html-corrector_13.patch	11.57 KB	scor

#35	374441-refactor-html-corrector_12.patch	10.74 KB	scor

#33	374441-refactor-html-corrector.patch	12.03 KB	Damien Tournoud

#31	374441-refactor-html-corrector.patch	11.06 KB	Damien Tournoud

#30	374441-refactor-html-corrector_9.patch	10.1 KB	scor

#28	374441-refactor-html-corrector.patch	9.25 KB	Damien Tournoud

#25	374441-refactor-html-corrector.patch	9.25 KB	Damien Tournoud

#18	374441-refactor-html-corrector.patch	8.54 KB	Damien Tournoud

#16	374441-refactor-html-corrector.patch	7.71 KB	Damien Tournoud

#12	374441-refactor-html-corrector.patch	7.56 KB	Damien Tournoud

#11	374441-refactor-html-corrector.patch	7.43 KB	Damien Tournoud

#8	374441-refactor-html-corrector.patch	7.26 KB	Damien Tournoud

#6	374441-refactor-html-corrector.patch	7.26 KB	Damien Tournoud

#4	374441-refactor-html-corrector.patch	7.1 KB	Damien Tournoud

#1	374441-refactor-html-corrector.patch	3.02 KB	Damien Tournoud

File	Size
374441-refactor-html-corrector.patch	3.02 KB

Comment #2

14 February 2009 at 17:30

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #3

chx CreditAttribution: chx commented 15 February 2009 at 11:38

Also you need to add XML then to Drupal hook_requirements. There are idiot distributions outta there that disable core PHP extensions...

Edit: not disable, just package separately.

Log in or register to post comments

Comment #4

Damien Tournoud CreditAttribution: Damien Tournoud commented 15 February 2009 at 12:31

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	7.1 KB

I'm not sure it's even possible to package the DOM extension separately (seems to be part of PHP core these days), but I added a requirement test anyway.

Also created a proper test case for this.

Log in or register to post comments

Comment #5

15 February 2009 at 13:45

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #6

Damien Tournoud CreditAttribution: Damien Tournoud commented 16 February 2009 at 07:11

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	7.26 KB

There was an issue with the regexp: we need "." to match the newline character (that's the role of the "s" modifier). Extended the test case to check for this.

Log in or register to post comments

Comment #7

16 February 2009 at 07:30

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #8

Damien Tournoud CreditAttribution: Damien Tournoud commented 3 March 2009 at 01:12

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	7.26 KB

Reroll, resubmit, goto 1.

Log in or register to post comments

Comment #10

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 10:28

1. I don't think DOMDocument properly handles encoding in this case. I attempted to run some Georgian text through the filter and it came out mangled (not so without this patch applied). More info: http://www.onphp5.com/article/57.

Ok, apparently we need to add a XML declaration to force the DOM to load as UTF-8 (I guess it still defaults to latin1, silly PHP).

2. I'm wondering if we can use saveXML(DOMNode) instead of the regex to get to the contents of the body tag:
something like (untested and currently returns the body element too, not just contents)

*So* great ;) I wasn't aware that saveXML() can be passed a DOMNode. I was more looking for a saveXML() method on DOMNode itself (which would make a lot more sense).

Log in or register to post comments

Comment #11

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 10:39

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	7.43 KB

Let's try that.

Log in or register to post comments

Comment #12

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 11:35

File	Size
374441-refactor-html-corrector.patch	7.56 KB

Apparently, DOM only read HTML meta tags to detect encoding. Fixed, and added a test for that.

Log in or register to post comments

Comment #13

mr.baileys

Dutch

🇧🇪 (Ghent)

CreditAttribution: mr.baileys commented 13 March 2009 at 11:40

Strange, I was testing the patch in #11 and the encoding problems I experienced with the patch from #8 were gone...

I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?

Log in or register to post comments

Comment #14

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 11:43

I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?

Already done.

      // Encoding is correctly kept.
      '<p>دروبال' => '<p>دروبال</p>'

... which is apparently the literal transcription of Drupal.

Log in or register to post comments

Comment #15

mr.baileys

Dutch

🇧🇪 (Ghent)

CreditAttribution: mr.baileys commented 13 March 2009 at 12:25

Some minor comments after review:

If I apply the patch, the test string is converted from "دروبال" to "Ø¯Ø±ÙˆØ¨Ø§Ù„". I'm not sure if this is my configuration (windows/eclipse), or if this has something to do with the patch or the file's encoding. Ignore this if there is no problem applying this against HEAD
Tested some extra single-tag elements like area, hr, ... (to ensure they're properly closed): works fine.
Should we add a test to verify that non-nesting tags are auto-closed (I personally think this differs enough from "Properly close unclosed tags, and remove useless closing tags." to warant its own test):

// Auto-close improperly nested tags.
'<p>test<p>test</p>\n' => '<p>test</p><p>test</p>\n',

In any case, above test passes succesfully with patch applied.

Going through the old code, I think all 'replaced' logic is properly addressed via the tests, and by passing the tests the new logic proves to be sound.

I think this is good to go...

Log in or register to post comments

Comment #16

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 12:32

File	Size
374441-refactor-html-corrector.patch	7.71 KB

1. If I apply the patch, the test string is converted from "دروبال" to "Ø¯Ø±ÙˆØ¨Ø§Ù„". I'm not sure if this is my configuration (windows/eclipse), or if this has something to do with the patch or the file's encoding. Ignore this if there is no problem applying this against HEAD

I believe this is your editor misbehaving. The patch is clean UTF-8.

2. Tested some extra single-tag elements like area, hr, ... (to ensure they're properly closed): works fine.

Extended the tests to test for

too. I don't believe it's necessary to test all of them (the list is readily available).

3. Should we add a test to verify that non-nesting tags are auto-closed (I personally think this differs enough from "Properly close unclosed tags, and remove useless closing tags." to warant its own test):
      // Auto-close improperly nested tags.
      '<p>test<p>test</p>\n' => '<p>test</p><p>test</p>\n',
      
In any case, above test passes succesfully with patch applied.

Added. Pass fine.

5. Going through the old code, I think all 'replaced' logic is properly addressed via the tests, and by passing the tests the new logic proves to be sound.

There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.

Log in or register to post comments

Comment #17

mr.baileys

Dutch

🇧🇪 (Ghent)

CreditAttribution: mr.baileys commented 13 March 2009 at 12:49

I believe this is your editor misbehaving. The patch is clean UTF-8.

If I open the patch from d.o. in my browser, it's exhibiting the same behavior. I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch. This doesn't have anything to do with the patch itself though, just something to take into account when saving/downloading the patch for review...

There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.

I understand. I just meant to say that whatever the old corrector claims to do, your new version does correctly (and has the test results to prove it). I haven't ran the new tests against the old corrector.

IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.

Log in or register to post comments

Comment #18

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 13:04

File	Size
374441-refactor-html-corrector.patch	8.54 KB

I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch.

Yes, this is a known issue.

IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.

Of course. Using PHP DOM is way better because the HTML parser of the DOM will not only automatically close dangling tags, but make proper valid XHTML. For example, it will automatically close tags that can only contain inline elements when there is a block level element in them:

<p>my test case<div>a block</div>the rest of the test case</p>

... will become:

<p>my test case</p><div>a block</div>the rest of the test case

This is a good thing at two levels: (1) we are guaranteed to produce valid (X)HTML, (2) we will have to be sure that other filters behave correctly, because the HTML corrector became unforgiving.

In this new version:

- I added several new DTD correctness tests.
- I moved from ->assertIdentical() to a fuzzy ->assertHtmlIdentical(), that is more fair to the old HTML corrector in accepting "<br/>" and "<br />" as identical

Log in or register to post comments

Comment #19

mr.baileys

Dutch

🇧🇪 (Ghent)

CreditAttribution: mr.baileys commented 13 March 2009 at 13:21

...and this change would fix the problem outlined in #372454: Filtered HTML filter modifies class attribute...

(verified by using the test '<pre class="brush: bash;"></pre>' => '<pre class="brush: bash;"/>',)

Crossing my fingers that this patch gets applied to HEAD asap...

Log in or register to post comments

Comment #20

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 March 2009 at 14:02

@mr.baileys: you tested this carefully, reviewed the code and suggested modifications. Why not RTBC-ing it now?

Log in or register to post comments

Comment #21

mr.baileys

Dutch

🇧🇪 (Ghent)

CreditAttribution: mr.baileys commented 13 March 2009 at 14:09

Status:

Needs review

» Reviewed & tested by the community

I thought issue etiquette was to wait for 2 positive reviews for a RTBC? Let's try and see what happens :)

Log in or register to post comments

Comment #22

Dries CreditAttribution: Dries commented 13 March 2009 at 14:21

- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...

- Code looks good. Good job on the code size reduction.

- If this improves the quality of the HTML filter, this might be worthy a CHANGELOG.txt item. I'll leave that up to Damien to decide.

Log in or register to post comments

Comment #23

mr.baileys

Dutch

🇧🇪 (Ghent)

CreditAttribution: mr.baileys commented 13 March 2009 at 14:43

Interesting... I ran 4 quick performance tests (100000 iterations):
A. With patch (new code)
B. Without patch (old code)

1. just one div as input
2. the source for Dries's comment as input

and here are the results:
A1: 5.5 seconds
A2: 17.12 seconds

B1: 2.83 seconds (old code beats new)
B2: 30.72 seconds (new code is almost twice as fast as old code)

So it looks like the new code is faster, except when it comes to really small pieces of content. I guess the overhead of initialising the DOMDocument is to blame for this...

Anyway, probably no match for some real performance testing, but gives an indication...

Log in or register to post comments

Comment #24

15 March 2009 at 14:55

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #25

Damien Tournoud CreditAttribution: Damien Tournoud commented 2 May 2009 at 10:59

File	Size
374441-refactor-html-corrector.patch	9.25 KB

Added an entry in CHANGELOG.txt.

Log in or register to post comments

Comment #26

sun

German

Karlsruhe

CreditAttribution: sun commented 3 May 2009 at 03:32

Very nice. I want.

+  // DOM can load HTML soup. But, HTML soup can throw warnings, suppress them.

I understand that.. But we really want less emotions in comments. ;)

+  // The "s" modifier makes "." match newlines too.

Let's not document PCRE basics.

+  @$htmlDom = DOMDocument::loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
...
+  if (preg_match("|^<body[^>]*>(.*)</body>$|s", $htmlDom->saveXML($bodyNode), $matches)) {

I don't see why we need a preg_match() here? We hard-code <body></body>, but we can't rely on it?

+    return "";

Single quotes.

+      'description' => t('Filter each filter individually: Convert line breaks, Correct broken HTML.'),

Lowercase "correct" after comma.

+      // Do not touch HTML comments.
+      '<!-- this is a comment -->' => '<!-- this is a comment -->',
+      'test <!-- comment -->' => 'test <!-- comment -->',

Can we add a test for "improper" comments as well? This:

<!--this is a comment-->

Log in or register to post comments

Comment #27

sun

German

Karlsruhe

CreditAttribution: sun commented 3 May 2009 at 03:42

Damien explained me why we need preg_match() here - the content could potentially contain a BODY tag.

We should test that as well though :)

Log in or register to post comments

Comment #28

Damien Tournoud CreditAttribution: Damien Tournoud commented 6 June 2009 at 23:32

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	9.25 KB

Just refreshing the same patch.

Log in or register to post comments

Comment #29

7 June 2009 at 00:55

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #30

scor CreditAttribution: scor commented 6 July 2009 at 12:55

File	Size
374441-refactor-html-corrector_9.patch	10.1 KB

@damZ: why did you remove the createFormat() and deleteFormat() functions in the previous patches? If these are not used anywhere, they should be removed, but maybe in a different patch.

I've adapted the foreach listing of the tests to the current, more verbose syntax used in core. I do think the foreach approach is more readable and less verbose. any reason why it is not used in filter.test?

I've also moved the silent switch @ to the DOMDocument::loadHTML function.

It still does not passes all tests though.

Log in or register to post comments

Comment #31

Damien Tournoud CreditAttribution: Damien Tournoud commented 6 July 2009 at 13:14

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	11.06 KB

Log in or register to post comments

Comment #32

6 July 2009 at 13:35

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #33

Damien Tournoud CreditAttribution: Damien Tournoud commented 6 July 2009 at 13:57

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector.patch	12.03 KB

Changed the tests that assumed a faulty behavior of the HTML corrector.

Log in or register to post comments

Comment #34

tic2000 CreditAttribution: tic2000 commented 6 July 2009 at 14:42

Seems like sun's review was ignored

Log in or register to post comments

Comment #35

scor CreditAttribution: scor commented 6 July 2009 at 14:59

File	Size
374441-refactor-html-corrector_12.patch	10.74 KB

Drupal strives to produce XHTML code, let's not break the XHTML compliant tests just for this patch to pass the tests. Instead, this patch adds the missing XHTML whitespace for empty elements in the HTML corrector filter and fix the non XHTML tests.

Log in or register to post comments

Comment #36

6 July 2009 at 15:10

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #37

scor CreditAttribution: scor commented 6 July 2009 at 15:54

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector_13.patch	11.57 KB

update the tests for the text_summary() test accordingly to comply with XHTML and the output of the HTML corrector.

Log in or register to post comments

Comment #38

scor CreditAttribution: scor commented 6 July 2009 at 16:28

File	Size
374441-refactor-html-corrector_14.patch	12.23 KB

This patch addresses sun's comments in #26.

in #27:

Damien explained me why we need preg_match() here - the content could potentially contain a BODY tag.

We should test that as well though :)

If I try with <div><body>test</body></div>, the body tags are removed and <div>test</div> is returned. does a test make sense here?

Log in or register to post comments

Comment #39

sun

German

Karlsruhe

CreditAttribution: sun commented 6 July 2009 at 16:48

Works, too, but the more natural case would probably be that $text contains <html><head><title>Foo bar</title></head><body>Some content</body></html> already.

Log in or register to post comments

Comment #40

scor CreditAttribution: scor commented 6 July 2009 at 16:56

@Dries:

- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...

The filtered content is cached, so the impact is minimized and would only be visible if the cache has just been cleared.

The whitespace added via preg_replace('|<([^>]*)/>|i', '<$1 />', $body_content) is to support the compatibility with the old HTML 4 browsers, see http://www.w3.org/TR/xhtml1/#C_2. The code would still be valid XHTML without it but I'm unsure how the older browsers would react. Since Drupal and the old HTML Corrector have been supporting it, we would need to make sure it does not break anywhere before removing it.

Log in or register to post comments

Comment #41

scor CreditAttribution: scor commented 7 July 2009 at 09:32

File	Size
374441-refactor-html-corrector_15.patch	11.81 KB

The (old) HTML Corrector converts the valid XHTML <span class="test" /> to non valid <span class="test" /></span>. The new implementation in this patch fixes that. Adding a test for it.

Log in or register to post comments

Comment #42

14 July 2009 at 14:10

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Log in or register to post comments

Comment #43

scor CreditAttribution: scor commented 14 July 2009 at 14:56

Status:

Needs work

» Needs review

File	Size
374441-refactor-html-corrector_16.patch	11.81 KB

rerolling the patch: Remove t() from getInfo

Log in or register to post comments

Comment #44

scor CreditAttribution: scor commented 26 July 2009 at 09:05

File	Size
374441-refactor-html-corrector_17.patch	12.32 KB

talked with tic2000 which pointed out a couple of related issues on the HTML corrector breaking comments, so I added more tests with complex comments and also removed one which was redundant. Also normalized the spelling of PHP 5 (with space).

Log in or register to post comments

Comment #45

tic2000 CreditAttribution: tic2000 commented 26 July 2009 at 10:36

Status:

Needs review

» Reviewed & tested by the community

The patch is working properly.
What I did notice while reviewing and testing this:
Using Filtered HTML input will remove the comments if there is a clean comment or even worse if you have something like  which will output comment -->
Using Full HTML will not strip the comment, but because of the line brake filter it will output in source view

<p><!-- comment --><br /><!-- comment
<p>comment</p>
<p> --></p>

The second problem is fixed by the patch in #222926: HTML Corrector filter escapes HTML comments the part that changes the _filter_autop function.

But both of those problems are worst without the patch, not in the scope of this patch and should be treated in a follow-up issue so this one is RTBC.

Log in or register to post comments

Comment #46

Dries CreditAttribution: Dries commented 27 July 2009 at 19:10

I assume this doesn't fubar the  tag that we are using for node teasers? At first glance, there don't seem to be any tests for the break-tag elsewhere in the code so this might require some manual testing after applying the patch.

Log in or register to post comments

Comment #47

tic2000 CreditAttribution: tic2000 commented 27 July 2009 at 20:12

@Dries
The html corrector in this patch doesn't touch comments which is a huge improvements over what we have now.
What removes comments is Filtered HTML input, and it does that before the patch too. <!-- is not an allowed tag.
What brakes comments with html tags inside is the xss filter and that also is not touched by this patched and is present in the current state of things.
To fix those problems we need a separate issue to address them. As I said in #45 the second part is already addressed in another issue and it will be an easy commit I think (if this one gets committed #222926: HTML Corrector filter escapes HTML comments becomes a one line patch).