Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Because Drupal 7 requires PHP 5.2, we can make use of PHP XML parsing facilities... ;)
Comment | File | Size | Author |
---|---|---|---|
#44 | 374441-refactor-html-corrector_17.patch | 12.32 KB | scor |
#43 | 374441-refactor-html-corrector_16.patch | 11.81 KB | scor |
#41 | 374441-refactor-html-corrector_15.patch | 11.81 KB | scor |
#38 | 374441-refactor-html-corrector_14.patch | 12.23 KB | scor |
#37 | 374441-refactor-html-corrector_13.patch | 11.57 KB | scor |
Comments
Comment #1
Damien Tournoud CreditAttribution: Damien Tournoud commentedAnd here it is...
Comment #3
chx CreditAttribution: chx commentedAlso you need to add XML then to Drupal hook_requirements. There are idiot distributions outta there that disable core PHP extensions...
Edit: not disable, just package separately.
Comment #4
Damien Tournoud CreditAttribution: Damien Tournoud commentedI'm not sure it's even possible to package the DOM extension separately (seems to be part of PHP core these days), but I added a requirement test anyway.
Also created a proper test case for this.
Comment #6
Damien Tournoud CreditAttribution: Damien Tournoud commentedThere was an issue with the regexp: we need "." to match the newline character (that's the role of the "s" modifier). Extended the test case to check for this.
Comment #8
Damien Tournoud CreditAttribution: Damien Tournoud commentedReroll, resubmit, goto 1.
Comment #9
mr.baileysThis would a great way of streamlining and removing a lot of code.
Some issues after review though:
something like (untested and currently returns the body element too, not just contents)
instead of
Comment #10
Damien Tournoud CreditAttribution: Damien Tournoud commentedOk, apparently we need to add a XML declaration to force the DOM to load as UTF-8 (I guess it still defaults to latin1, silly PHP).
*So* great ;) I wasn't aware that saveXML() can be passed a DOMNode. I was more looking for a saveXML() method on DOMNode itself (which would make a lot more sense).
Comment #11
Damien Tournoud CreditAttribution: Damien Tournoud commentedLet's try that.
Comment #12
Damien Tournoud CreditAttribution: Damien Tournoud commentedApparently, DOM only read HTML meta tags to detect encoding. Fixed, and added a test for that.
Comment #13
mr.baileysStrange, I was testing the patch in #11 and the encoding problems I experienced with the patch from #8 were gone...
I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?
Comment #14
Damien Tournoud CreditAttribution: Damien Tournoud commentedAlready done.
... which is apparently the literal transcription of Drupal.
Comment #15
mr.baileysSome minor comments after review:
In any case, above test passes succesfully with patch applied.
I think this is good to go...
Comment #16
Damien Tournoud CreditAttribution: Damien Tournoud commentedI believe this is your editor misbehaving. The patch is clean UTF-8.
Extended the tests to test for
too. I don't believe it's necessary to test all of them (the list is readily available).
Added. Pass fine.
There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.
Comment #17
mr.baileysIf I open the patch from d.o. in my browser, it's exhibiting the same behavior. I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch. This doesn't have anything to do with the patch itself though, just something to take into account when saving/downloading the patch for review...
I understand. I just meant to say that whatever the old corrector claims to do, your new version does correctly (and has the test results to prove it). I haven't ran the new tests against the old corrector.
IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.
Comment #18
Damien Tournoud CreditAttribution: Damien Tournoud commentedYes, this is a known issue.
Of course. Using PHP DOM is way better because the HTML parser of the DOM will not only automatically close dangling tags, but make proper valid XHTML. For example, it will automatically close tags that can only contain inline elements when there is a block level element in them:
... will become:
This is a good thing at two levels: (1) we are guaranteed to produce valid (X)HTML, (2) we will have to be sure that other filters behave correctly, because the HTML corrector became unforgiving.
In this new version:
- I added several new DTD correctness tests.
- I moved from ->assertIdentical() to a fuzzy ->assertHtmlIdentical(), that is more fair to the old HTML corrector in accepting "
<br/>
" and "<br />
" as identicalComment #19
mr.baileys...and this change would fix the problem outlined in #372454: Filtered HTML filter modifies class attribute...
(verified by using the test
'<pre class="brush: bash;"></pre>' => '<pre class="brush: bash;"/>',
)Crossing my fingers that this patch gets applied to HEAD asap...
Comment #20
Damien Tournoud CreditAttribution: Damien Tournoud commented@mr.baileys: you tested this carefully, reviewed the code and suggested modifications. Why not RTBC-ing it now?
Comment #21
mr.baileysI thought issue etiquette was to wait for 2 positive reviews for a RTBC? Let's try and see what happens :)
Comment #22
Dries CreditAttribution: Dries commented- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...
- Code looks good. Good job on the code size reduction.
- If this improves the quality of the HTML filter, this might be worthy a CHANGELOG.txt item. I'll leave that up to Damien to decide.
Comment #23
mr.baileysInteresting... I ran 4 quick performance tests (100000 iterations):
A. With patch (new code)
B. Without patch (old code)
1. just one div as input
2. the source for Dries's comment as input
and here are the results:
A1: 5.5 seconds
A2: 17.12 seconds
B1: 2.83 seconds (old code beats new)
B2: 30.72 seconds (new code is almost twice as fast as old code)
So it looks like the new code is faster, except when it comes to really small pieces of content. I guess the overhead of initialising the DOMDocument is to blame for this...
Anyway, probably no match for some real performance testing, but gives an indication...
Comment #25
Damien Tournoud CreditAttribution: Damien Tournoud commentedAdded an entry in CHANGELOG.txt.
Comment #26
sunVery nice. I want.
I understand that.. But we really want less emotions in comments. ;)
Let's not document PCRE basics.
I don't see why we need a preg_match() here? We hard-code
<body></body>
, but we can't rely on it?Single quotes.
Lowercase "correct" after comma.
Can we add a test for "improper" comments as well? This:
Comment #27
sunDamien explained me why we need preg_match() here - the content could potentially contain a BODY tag.
We should test that as well though :)
Comment #28
Damien Tournoud CreditAttribution: Damien Tournoud commentedJust refreshing the same patch.
Comment #30
scor CreditAttribution: scor commented@damZ: why did you remove the createFormat() and deleteFormat() functions in the previous patches? If these are not used anywhere, they should be removed, but maybe in a different patch.
I've adapted the foreach listing of the tests to the current, more verbose syntax used in core. I do think the foreach approach is more readable and less verbose. any reason why it is not used in filter.test?
I've also moved the silent switch @ to the DOMDocument::loadHTML function.
It still does not passes all tests though.
Comment #31
Damien Tournoud CreditAttribution: Damien Tournoud commentedComment #33
Damien Tournoud CreditAttribution: Damien Tournoud commentedChanged the tests that assumed a faulty behavior of the HTML corrector.
Comment #34
tic2000 CreditAttribution: tic2000 commentedSeems like sun's review was ignored
Comment #35
scor CreditAttribution: scor commentedDrupal strives to produce XHTML code, let's not break the XHTML compliant tests just for this patch to pass the tests. Instead, this patch adds the missing XHTML whitespace for empty elements in the HTML corrector filter and fix the non XHTML tests.
Comment #37
scor CreditAttribution: scor commentedupdate the tests for the text_summary() test accordingly to comply with XHTML and the output of the HTML corrector.
Comment #38
scor CreditAttribution: scor commentedThis patch addresses sun's comments in #26.
in #27:
If I try with
<div><body>test</body></div>
, the body tags are removed and<div>test</div>
is returned. does a test make sense here?Comment #39
sunWorks, too, but the more natural case would probably be that $text contains
<html><head><title>Foo bar</title></head><body>Some content</body></html>
already.Comment #40
scor CreditAttribution: scor commented@Dries:
The filtered content is cached, so the impact is minimized and would only be visible if the cache has just been cleared.
The whitespace added via
preg_replace('|<([^>]*)/>|i', '<$1 />', $body_content)
is to support the compatibility with the old HTML 4 browsers, see http://www.w3.org/TR/xhtml1/#C_2. The code would still be valid XHTML without it but I'm unsure how the older browsers would react. Since Drupal and the old HTML Corrector have been supporting it, we would need to make sure it does not break anywhere before removing it.Comment #41
scor CreditAttribution: scor commentedThe (old) HTML Corrector converts the valid XHTML
<span class="test" />
to non valid<span class="test" /></span>
. The new implementation in this patch fixes that. Adding a test for it.Comment #43
scor CreditAttribution: scor commentedrerolling the patch: Remove t() from getInfo
Comment #44
scor CreditAttribution: scor commentedtalked with tic2000 which pointed out a couple of related issues on the HTML corrector breaking comments, so I added more tests with complex comments and also removed one which was redundant. Also normalized the spelling of PHP 5 (with space).
Comment #45
tic2000 CreditAttribution: tic2000 commentedThe patch is working properly.
What I did notice while reviewing and testing this:
Using Filtered HTML input will remove the comments if there is a clean comment or even worse if you have something like
<!-- comment <p>comment</p> -->
which will outputcomment -->
Using Full HTML will not strip the comment, but because of the line brake filter it will output in source view
The second problem is fixed by the patch in #222926: HTML Corrector filter escapes HTML comments the part that changes the _filter_autop function.
But both of those problems are worst without the patch, not in the scope of this patch and should be treated in a follow-up issue so this one is RTBC.
Comment #46
Dries CreditAttribution: Dries commentedI assume this doesn't fubar the
<!--break-->
tag that we are using for node teasers? At first glance, there don't seem to be any tests for the break-tag elsewhere in the code so this might require some manual testing after applying the patch.Comment #47
tic2000 CreditAttribution: tic2000 commented@Dries
The html corrector in this patch doesn't touch comments which is a huge improvements over what we have now.
What removes comments is Filtered HTML input, and it does that before the patch too.
<!--
is not an allowed tag.What brakes comments with html tags inside is the xss filter and that also is not touched by this patched and is present in the current state of things.
To fix those problems we need a separate issue to address them. As I said in #45 the second part is already addressed in another issue and it will be an easy commit I think (if this one gets committed #222926: HTML Corrector filter escapes HTML comments becomes a one line patch).
Comment #48
Dries CreditAttribution: Dries commentedAlright, thanks for explaining that (again). Committed to CVS HEAD. Thanks!