Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Damien Tournoud’s picture

Status: Active » Needs review
FileSize
3.02 KB

And here it is...

 filter.module |   69 ++--------------------------------------------------------
 1 file changed, 3 insertions(+), 66 deletions(-)

Status: Needs review » Needs work

The last submitted patch failed testing.

chx’s picture

Also you need to add XML then to Drupal hook_requirements. There are idiot distributions outta there that disable core PHP extensions...

Edit: not disable, just package separately.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
7.1 KB

I'm not sure it's even possible to package the DOM extension separately (seems to be part of PHP core these days), but I added a requirement test anyway.

Also created a proper test case for this.

Status: Needs review » Needs work

The last submitted patch failed testing.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
7.26 KB

There was an issue with the regexp: we need "." to match the newline character (that's the role of the "s" modifier). Extended the test case to check for this.

Status: Needs review » Needs work

The last submitted patch failed testing.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
7.26 KB

Reroll, resubmit, goto 1.

mr.baileys’s picture

Status: Needs review » Needs work

This would a great way of streamlining and removing a lot of code.

Some issues after review though:

  1. I don't think DOMDocument properly handles encoding in this case. I attempted to run some Georgian text through the filter and it came out mangled (not so without this patch applied). More info: http://www.onphp5.com/article/57.
  2. I'm wondering if we can use saveXML(DOMNode) instead of the regex to get to the contents of the body tag:
    something like (untested and currently returns the body element too, not just contents)
    $bodyNode = $htmlDom->getDocumentElement()->getElementsByTagName('body')->item(0);
    return $htmlDom->saveXML($bodyNode);
    

    instead of

    +  // The result of DOMDocument->saveXML() is a full (X)HTML document. Only
    +  // extract the body of it. The "s" modifier makes "." match newlines too.
    +  if (preg_match("|<body[^>]*>(.*)</body>|s", $htmlDom->saveXML(), $matches)) {
    +    return $matches[1];
       }
    
Damien Tournoud’s picture

1. I don't think DOMDocument properly handles encoding in this case. I attempted to run some Georgian text through the filter and it came out mangled (not so without this patch applied). More info: http://www.onphp5.com/article/57.

Ok, apparently we need to add a XML declaration to force the DOM to load as UTF-8 (I guess it still defaults to latin1, silly PHP).

2. I'm wondering if we can use saveXML(DOMNode) instead of the regex to get to the contents of the body tag:
something like (untested and currently returns the body element too, not just contents)

*So* great ;) I wasn't aware that saveXML() can be passed a DOMNode. I was more looking for a saveXML() method on DOMNode itself (which would make a lot more sense).

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
7.43 KB

Let's try that.

Damien Tournoud’s picture

Apparently, DOM only read HTML meta tags to detect encoding. Fixed, and added a test for that.

mr.baileys’s picture

Strange, I was testing the patch in #11 and the encoding problems I experienced with the patch from #8 were gone...

I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?

Damien Tournoud’s picture

I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?

Already done.

      // Encoding is correctly kept.
      '<p>دروبال' => '<p>دروبال</p>'

... which is apparently the literal transcription of Drupal.

mr.baileys’s picture

Some minor comments after review:

  1. If I apply the patch, the test string is converted from "دروبال" to "دروبال". I'm not sure if this is my configuration (windows/eclipse), or if this has something to do with the patch or the file's encoding. Ignore this if there is no problem applying this against HEAD
  2. Tested some extra single-tag elements like area, hr, ... (to ensure they're properly closed): works fine.
  3. Should we add a test to verify that non-nesting tags are auto-closed (I personally think this differs enough from "Properly close unclosed tags, and remove useless closing tags." to warant its own test):
  4. // Auto-close improperly nested tags.
    '<p>test<p>test</p>\n' => '<p>test</p><p>test</p>\n',
    

    In any case, above test passes succesfully with patch applied.

  5. Going through the old code, I think all 'replaced' logic is properly addressed via the tests, and by passing the tests the new logic proves to be sound.

I think this is good to go...

Damien Tournoud’s picture

1. If I apply the patch, the test string is converted from "دروبال" to "دروبال". I'm not sure if this is my configuration (windows/eclipse), or if this has something to do with the patch or the file's encoding. Ignore this if there is no problem applying this against HEAD

I believe this is your editor misbehaving. The patch is clean UTF-8.

2. Tested some extra single-tag elements like area, hr, ... (to ensure they're properly closed): works fine.

Extended the tests to test for


too. I don't believe it's necessary to test all of them (the list is readily available).

3. Should we add a test to verify that non-nesting tags are auto-closed (I personally think this differs enough from "Properly close unclosed tags, and remove useless closing tags." to warant its own test):

      // Auto-close improperly nested tags.
      '<p>test<p>test</p>\n' => '<p>test</p><p>test</p>\n',
      

In any case, above test passes succesfully with patch applied.

Added. Pass fine.

5. Going through the old code, I think all 'replaced' logic is properly addressed via the tests, and by passing the tests the new logic proves to be sound.

There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.

mr.baileys’s picture

I believe this is your editor misbehaving. The patch is clean UTF-8.

If I open the patch from d.o. in my browser, it's exhibiting the same behavior. I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch. This doesn't have anything to do with the patch itself though, just something to take into account when saving/downloading the patch for review...

There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.

I understand. I just meant to say that whatever the old corrector claims to do, your new version does correctly (and has the test results to prove it). I haven't ran the new tests against the old corrector.

IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.

Damien Tournoud’s picture

I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch.

Yes, this is a known issue.

IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.

Of course. Using PHP DOM is way better because the HTML parser of the DOM will not only automatically close dangling tags, but make proper valid XHTML. For example, it will automatically close tags that can only contain inline elements when there is a block level element in them:

<p>my test case<div>a block</div>the rest of the test case</p>

... will become:

<p>my test case</p><div>a block</div>the rest of the test case

This is a good thing at two levels: (1) we are guaranteed to produce valid (X)HTML, (2) we will have to be sure that other filters behave correctly, because the HTML corrector became unforgiving.

In this new version:

- I added several new DTD correctness tests.
- I moved from ->assertIdentical() to a fuzzy ->assertHtmlIdentical(), that is more fair to the old HTML corrector in accepting "<br/>" and "<br />" as identical

mr.baileys’s picture

...and this change would fix the problem outlined in #372454: Filtered HTML filter modifies class attribute...

(verified by using the test '<pre class="brush: bash;"></pre>' => '<pre class="brush: bash;"/>',)

Crossing my fingers that this patch gets applied to HEAD asap...

Damien Tournoud’s picture

@mr.baileys: you tested this carefully, reviewed the code and suggested modifications. Why not RTBC-ing it now?

mr.baileys’s picture

Status: Needs review » Reviewed & tested by the community

I thought issue etiquette was to wait for 2 positive reviews for a RTBC? Let's try and see what happens :)

Dries’s picture

- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...

- Code looks good. Good job on the code size reduction.

- If this improves the quality of the HTML filter, this might be worthy a CHANGELOG.txt item. I'll leave that up to Damien to decide.

mr.baileys’s picture

Interesting... I ran 4 quick performance tests (100000 iterations):
A. With patch (new code)
B. Without patch (old code)

1. just one div as input
2. the source for Dries's comment as input

and here are the results:
A1: 5.5 seconds
A2: 17.12 seconds

B1: 2.83 seconds (old code beats new)
B2: 30.72 seconds (new code is almost twice as fast as old code)

So it looks like the new code is faster, except when it comes to really small pieces of content. I guess the overhead of initialising the DOMDocument is to blame for this...

Anyway, probably no match for some real performance testing, but gives an indication...

Status: Reviewed & tested by the community » Needs work

The last submitted patch failed testing.

Damien Tournoud’s picture

Added an entry in CHANGELOG.txt.

sun’s picture

Very nice. I want.

+  // DOM can load HTML soup. But, HTML soup can throw warnings, suppress them.

I understand that.. But we really want less emotions in comments. ;)

+  // The "s" modifier makes "." match newlines too.

Let's not document PCRE basics.

+  @$htmlDom = DOMDocument::loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
...
+  if (preg_match("|^<body[^>]*>(.*)</body>$|s", $htmlDom->saveXML($bodyNode), $matches)) {

I don't see why we need a preg_match() here? We hard-code <body></body>, but we can't rely on it?

+    return "";

Single quotes.

+      'description' => t('Filter each filter individually: Convert line breaks, Correct broken HTML.'),

Lowercase "correct" after comma.

+      // Do not touch HTML comments.
+      '<!-- this is a comment -->' => '<!-- this is a comment -->',
+      'test <!-- comment -->' => 'test <!-- comment -->',

Can we add a test for "improper" comments as well? This:

<!--this is a comment-->
sun’s picture

Damien explained me why we need preg_match() here - the content could potentially contain a BODY tag.

We should test that as well though :)

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
9.25 KB

Just refreshing the same patch.

Status: Needs review » Needs work

The last submitted patch failed testing.

scor’s picture

@damZ: why did you remove the createFormat() and deleteFormat() functions in the previous patches? If these are not used anywhere, they should be removed, but maybe in a different patch.

I've adapted the foreach listing of the tests to the current, more verbose syntax used in core. I do think the foreach approach is more readable and less verbose. any reason why it is not used in filter.test?

I've also moved the silent switch @ to the DOMDocument::loadHTML function.

It still does not passes all tests though.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
11.06 KB

Status: Needs review » Needs work

The last submitted patch failed testing.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
12.03 KB

Changed the tests that assumed a faulty behavior of the HTML corrector.

tic2000’s picture

Seems like sun's review was ignored

scor’s picture

Drupal strives to produce XHTML code, let's not break the XHTML compliant tests just for this patch to pass the tests. Instead, this patch adds the missing XHTML whitespace for empty elements in the HTML corrector filter and fix the non XHTML tests.

Status: Needs review » Needs work

The last submitted patch failed testing.

scor’s picture

Status: Needs work » Needs review
FileSize
11.57 KB

update the tests for the text_summary() test accordingly to comply with XHTML and the output of the HTML corrector.

scor’s picture

This patch addresses sun's comments in #26.

in #27:

Damien explained me why we need preg_match() here - the content could potentially contain a BODY tag.

We should test that as well though :)

If I try with <div><body>test</body></div>, the body tags are removed and <div>test</div> is returned. does a test make sense here?

sun’s picture

Works, too, but the more natural case would probably be that $text contains <html><head><title>Foo bar</title></head><body>Some content</body></html> already.

scor’s picture

@Dries:

- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...

The filtered content is cached, so the impact is minimized and would only be visible if the cache has just been cleared.

The whitespace added via preg_replace('|<([^>]*)/>|i', '<$1 />', $body_content) is to support the compatibility with the old HTML 4 browsers, see http://www.w3.org/TR/xhtml1/#C_2. The code would still be valid XHTML without it but I'm unsure how the older browsers would react. Since Drupal and the old HTML Corrector have been supporting it, we would need to make sure it does not break anywhere before removing it.

scor’s picture

The (old) HTML Corrector converts the valid XHTML <span class="test" /> to non valid <span class="test" /></span>. The new implementation in this patch fixes that. Adding a test for it.

Status: Needs review » Needs work

The last submitted patch failed testing.

scor’s picture

Status: Needs work » Needs review
FileSize
11.81 KB

rerolling the patch: Remove t() from getInfo

scor’s picture

talked with tic2000 which pointed out a couple of related issues on the HTML corrector breaking comments, so I added more tests with complex comments and also removed one which was redundant. Also normalized the spelling of PHP 5 (with space).

tic2000’s picture

Status: Needs review » Reviewed & tested by the community

The patch is working properly.
What I did notice while reviewing and testing this:
Using Filtered HTML input will remove the comments if there is a clean comment or even worse if you have something like <!-- comment <p>comment</p> --> which will output comment -->
Using Full HTML will not strip the comment, but because of the line brake filter it will output in source view

<p><!-- comment --><br /><!-- comment
<p>comment</p>
<p> --></p>

The second problem is fixed by the patch in #222926: HTML Corrector filter escapes HTML comments the part that changes the _filter_autop function.

But both of those problems are worst without the patch, not in the scope of this patch and should be treated in a follow-up issue so this one is RTBC.

Dries’s picture

I assume this doesn't fubar the <!--break--> tag that we are using for node teasers? At first glance, there don't seem to be any tests for the break-tag elsewhere in the code so this might require some manual testing after applying the patch.

tic2000’s picture

@Dries
The html corrector in this patch doesn't touch comments which is a huge improvements over what we have now.
What removes comments is Filtered HTML input, and it does that before the patch too. <!-- is not an allowed tag.
What brakes comments with html tags inside is the xss filter and that also is not touched by this patched and is present in the current state of things.
To fix those problems we need a separate issue to address them. As I said in #45 the second part is already addressed in another issue and it will be an easy commit I think (if this one gets committed #222926: HTML Corrector filter escapes HTML comments becomes a one line patch).

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Alright, thanks for explaining that (again). Committed to CVS HEAD. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.