The HTML Corrector module replaces the < bracket with a < when it is the start of an HTML comment. There are two cases where this is a problem:

  • Users have fill HTML access and you want to be able to catch missing tags. Users need to be able to copy in content from an older site which contains HTML comments, or the users want to be able to put in comments for their own use.
  • At least one module I maintain (Table of Contents) and possibly others use HTML comments as markup so if the module is disabled nothing is rendered to the user. This worked fine in D5, but if the HTML corrector module is enabled then the markup is rendered.

If this is considered to a valid issue, then let me know and I'll create a patch to fix it. Otherwise, please suggest ways around this!

Thanks,
--Andrew

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

henrrrik’s picture

This affects my XStandard module too.

This tweak to the regular expression in the _filter_htmlcorrector function seems to do the trick:

// Properly entify angles.
$text = preg_replace('!<([^a-zA-Z\!/])!', '&lt;\1', $text);
gpk’s picture

(Duplicated at http://drupal.org/node/221252.)

The regex might be marginally simpler as follows:

$text = preg_replace('|<([^a-zA-Z!/])|', '&lt;\1', $text);

Yes, this needs a patch.

Once the regex is modified to catch HTML comments as well as other tags, how does http://api.drupal.org/api/function/_filter_htmlcorrector/6 cope with the fact that the comment is never closed in the way that most tags should be? i.e. if you have <!-- comment --> does try to close it by inserting </--> or similar later on?

henrrrik’s picture

Yes, it closes with a </!--> tag.

This adjustment to the code that closes the tag fixes that:

// Close remaining tags.
  foreach ($stack as $closing_tag) {
    if ($closing_tag == '!--') {
      $output .= '-->';
    }
    else {
      $output .= '</'. $closing_tag .'>';
    }
  }
gpk’s picture

See also http://drupal.org/node/97182 (same issue but in the context of htmlcorrector as a contrib module for Drupal 5.x). I'd mark one or other duplicate except that these are now separate projects since htmlcorrector has been taken into core in 6.x.

Spaceman-Spiff’s picture

Any idea if this will be fixed in a future release or such fix/module/patch has been released?

Will the htmlcorrector.module.comment_fix.patch posted by jcfiala work on 6.x?

jvong’s picture

Hello,

I have posted http://drupal.org/node/240312 how do I resolve this issue. Can someone please help?

Thanks,

John

walker_643’s picture

Patching modules/filter/filter.module fixed this (very annoying) issue for me. I hereby vote for such a patch in 6.3/7.x

sun’s picture

Version: 6.0 » 6.2
Status: Active » Needs review
FileSize
1.6 KB

Re-rolled my patch of #97182: <!--break --> is transformed into html code with lt and gt against Drupal 6.x. Should fix this issue without much changes.

sun’s picture

Re-rolled patch patch against HEAD. Dunno whether this issue should be moved to 7.x as well.

gpk’s picture

Maybe the patch at http://drupal.org/node/269095 is better?

sun’s picture

Version: 6.2 » 7.x-dev
FileSize
911 bytes

Re-rolled patch from #269095: [Filter] FULL HTML filter with HTML corrector enabled break comments <!-- ..... --> for D7 and marked that issue as duplicate.
I did not test the patch yet, but it looks indeed simpler.

jcnventura’s picture

FileSize
1.32 KB

These patches only deal with:
1. Preventing the replacement of '<' by '&lt;'
2. Preventing the insertion of a dummy closing tag for the comments

However, the contents of the comment are still processed by the HTML corrector module.. The attached patch extends #11 to prevent the processing of the contents of the comment. To test this, create a multi-line comment.

Drupal 6.3 currently produces :

&lt;!-- Testing a comment<br />
--><br />
&lt;!-- Testing a comment2<br />
--><br />
&lt;!-- Testing another comment -->

With the patch the output is:

<!-- Testing a comment
-->
<!-- Testing a comment2
-->
<!-- Testing another comment -->

The attached patch is against Drupal 6.3, and please, please commit it before D6.4. I have tested this patch and it works nicely for me.

João

PS: Updated on 2008-07-22 because the original patch terminated the comment with any '>' and not only with the first '-->'

Matt B’s picture

This works for me - thanks!

jcnventura’s picture

Priority: Normal » Critical

I shouldn't be doing this, but since this bug is interfering with so many modules (including the AdSense module), I am raising this to critical hoping that it will be tested/reviewed and merged before Drupal 6.4 is released.

João

Matt B’s picture

Version: 7.x-dev » 6.4

didn't make it to Drupal 6.4, I had to repatch when I upgraded (patch still works great). This needs to be fixed, Drupal 6 is not usable for me without it!

stefan_seefeld’s picture

I'm reading this discussion as I'm needing some custom markup on user-generated pages, too. I wonder whether it would be useful to allow (or even pre-define) processing instructions (i.e. <?...?>), not only generic xml comment elements, to be preserved such that specific modules can handle them.

Thanks,
Stefan

Matt B’s picture

Stefan - would it not be better for the specific module to implement a filter to do this?

stefan_seefeld’s picture

Certainly, the module in question needs to recognize the instruction.
All I'm asking here is whether it is possible to allow for processing instructions to be preserved, i.e. not being filtered out or escaped. If that's already the case, just ignore my request. :-)

Thanks,
Stefan

PS: as a use-case, consider the 'print' module with its ability to generate pdf. I want to inject page-break markers into my nodes, specifically targetted at the pdf generator, but I have no idea whether drupal would allow processing instructions, i.e. whether those can be passed through.

jcnventura’s picture

@stefan_seefeld: please don't cloud the issue here.. The problem here has nothing to do with inserting input filters like you're suggesting. That's to be left to the modules themselves.

The problem here is that HTML comments are escaped by the HTML corrector, rendering them visible on the normal content..

Joao

PS: As to what we had discussed for the print module, this needs to be fixed because PDF specific-instructions must be placed in the content in a way that is not visible normally and that activates only during PDF generation.. I would prefer to use HTML comments for that as it be less overhead than using an input filter.

stefan_seefeld’s picture

@jcnventura: sorry, I misunderstood what the issue was. I assumed that the escaping of comments (and possibly processing-instructions) would not only display them when not wanted, but also make it impossible for other modules to pick them up.

I'll follow-up on the rest on the print-specific issue tracker.

Thanks,
Stefan

tcblack’s picture

The latest patch appears to work for me.

jcnventura’s picture

Status: Needs review » Reviewed & tested by the community

Well, it seems to work for at least three persons: me, tcblack and Matt B.

Marking it RTBC, in the hopes that it will be committed soon.

João Ventura

Damien Tournoud’s picture

Version: 6.4 » 7.x-dev
Priority: Critical » Normal
Status: Reviewed & tested by the community » Needs work

This is a patch to a rather complicated piece of code. It will require a full-featured unit test case before going in.

jcnventura’s picture

Status: Needs work » Reviewed & tested by the community

@Damien Tournoud : I agree with what you're saying on the part that this is a patch to a rather complicated piece of code. However, I disagree with the need for a fully-featured test case. Why?

1. Because of this bug in the HTML corrector, that feature is almost unusable since it is breaking a lot of contrib modules.
2. The patch is only 3 lines! It's quite easy to see that the only changes are related to processing of HTML comments.
3. The patch has been tested to work on Drupal 6.4 by at least 3 persons.

Since the tests for the HTML corrector filter haven't yet been done (just looked in D7 CVS), this isn't a simple case of waiting for someone to extend those tests to cover this case. You're proposing that this (simple) fix be put on hold until a very complex set of simpletests are written.

From this thread I can tell you that this is breaking the following modules:
- Table of Contents
- XStandard
- AdSense

João

Damien Tournoud’s picture

Status: Reviewed & tested by the community » Needs work

@jcnventura, sorry, but the Html Corrector itself is a rather complicated piece of code. I would like to see what's happen in corner cases like <!-- inside tags and so on. At this time, the whole filter is untested so we really don't know.

I'm also really unsure this could break the modules you listed. For Table of Contents, for example, you just have to be sure to order the filters the right way (the HTML corrector has to be on top). So I would say this is more an annoyance than a breakage.

jcnventura’s picture

Status: Needs work » Needs review

I was going to put it RTBC again, but I give up. You're right that all those modules aren't broken if the HTML corrector filter is the first one to execute..

However, a lot of people monetize their sites using AdSense (and most of them don't use the AdSense module as it's not clear if using it is not a violation of Google's TOS). So they are ordered by Google to type in the following in their pages:

<script type="text/javascript"><!--
google_ad_client = "pub-xxxxxxxxxx";
/* Drupal 728x90 */
google_ad_slot = "xxxxxxxxxx";
google_ad_width = 728;
google_ad_height = 90;
//-->
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>

What do you think happens with that comment in the script? It gets turned into something that is definitively against Google's TOS. In some archaic browsers it will even display the code (the reason why everyone comments the inside of JavaScripts).

The only way to avoid this is to turn off the HTML corrector, which most people don't even know it's there.

However, the patch doesn't need work. I am setting it to needs review. What needs work is the filter module simpletests.

João

Damien Tournoud’s picture

Status: Needs review » Needs work

One last thing: if I understand all this right (we don't have a test case, so I can't be 100% sure), the htmlcorrector with the patch in #12 will probably do some silly things like transforming:

<!-- this is a test <i>test</i> to <!--this is a test <i>test</i></!-->,

while it will not even try to close <!-- this is a new test because there are no matching ">".

This definitely needs *first* a full test case, and *only then* a patch.

deviantintegral’s picture

@Damien Tournoud: Actually, that's not quite true. In the best case scenario, I'd like to tell users to place the HTML corrector before the tableofcontents filter. Then, there should be less things to break when users forget to do things like close heading tags. Most users setting up filters would generally do HTML corrector followed by HTML filter, followed by additional filters, but that's not possible until this patch lands.

Damien Tournoud’s picture

Putting an adsense code in a node content makes no sense at all. The least bad option is to use a block for this and place it in the footer.

sun’s picture

Please stop discussing about tests. This has to be fixed in 7.x first, to be back-ported to 6.x. And we have a clear rule for patches against 7.x - everything needs tests, even if there are none yet.

dman’s picture

Placing adsense code in node content is certainly against best practices. But not unreasonable.

.. but ALLOWING SCRIPT TAGS in HTML-filtered input? And then finding that the comments inside the scripts were a problem?!
HTML filters are just protection against potentially malicious code. If you trust yourself enough to break your own system to this extent - don't use HTML filter! FFS.

.dan.

jcnventura’s picture

Just a comment on #27: there is no correct way for this. A comment can contain HTML inside (and even other comments), so actually the silly thing that it may do is not so silly (I haven't tested that situation). Where's the boundary if you forget to close the comment?

My personal opinion is that it should be when it reaches the </body> tag. That way, when the rest of the content disappears, the user knows that something is amiss.

I am creating a link to this issue in the issue for creating the tests for the filter module (#276597: TestingParty08: filter.module).

João

peterx’s picture

Patch from #12 works in my 6.4 site with wide range of comments including comments around javascript. Some of the comments and javascript arrive in AHAH from other Web sites. Do I use this patch or turn off HTML correction? The incoming HTML needs filtering because the HTML was created by Orks for Internet Explorer 0.5 alpha and that rules out switching off HTML filtering and correction. Thank you for the patch.

Matt B’s picture

again, this hasn't made it to Drupal 6.5, so I had to repatch. Without doing this, Drupal 6 is not usable (for me)! Works fine in drupal 6.5

alexanderpas’s picture

roadmap:
- Create Tests (that'll fail) for 7.x (#276597: TestingParty08: filter.module)
- fix 7.x code
- No tests fail anymore
- backport to 6.x
- backport to 5.x html corrector module.

Matt B’s picture

patch works fine on 6.6

veriKami’s picture

subscribing :-)

grendzy’s picture

subscribing

Todd Nienkerk’s picture

I ran into the same problem in 6.x and created a patch for it: #343236: HTML corrector encodes (entifies) <!--break--> tag. Because the break tag is encoded, it shows up in RSS feeds.

palmstrom’s picture

subscribing

kmillecam’s picture

subscribing

alanBrookland’s picture

subscribing

timatlee’s picture

subscribing

escoles’s picture

Appears to still be an issue for 6.9 -- I've just documented on two 6.9 installs where it will be a problem due to large amounts of legacy content that weren't previously affected by this issue. Hence, many contain HTML comments.

Is this issue re. 6.x or 7.x? "Version" is currently set to 7.x-dev, but most of the discussion is re. 6.x. Can someone help me understand why this issue would be deferred to 7.x? Does this mean that if we need a fix in 6.x, we're always going to have to patch-over the latest HTML Corrector release?

CORRECTION -- I now see the rationale in the "roadmap" here:
http://drupal.org/node/222926#comment-1086392
... so I see that plan seems to be to backport to 6.x after using 7.x efforts to unit test.

(Note that this is a particular problem for people who do a lot of cut & paste blogging. We need the corrector to close un-closed containers that we might miss by not scrutinizing the HTML of what we paste -- or to correct stuff we did in the past. But if we enable it, the comments embedded in the stuff we copy & paste will show.)

Dave Hirschman’s picture

The HTML corrector filter will also substitute the html entity for '<' characters within javascript. For example,
if (x < y) { .... }
produces an error, whereas
if (y > x) {...}
does not.

For now, I guess I'll create a second input filter with HTML corrector turned off and select that for those few nodes where I want to use a bit of javascript or have some html comments.

seanr’s picture

Version: 7.x-dev » 6.x-dev

This needs to be fixed in 6.x. What the hell is taking so long to get a clearly working patch committed? The problem here is that this filter is enabled by default in core and is breaking things. You cannot get any more critical than that. Please commit the fix!

seanr’s picture

Priority: Normal » Critical

Marking as critical since it clearly is.

sun’s picture

Version: 6.x-dev » 7.x-dev
cameronp’s picture

subscribing

jcfiala’s picture

I tried to use the latest patch and it failed because of a drupal-6.4 or something in the patch file - I've adjusted the patch file so it runs w/ -p0

deviantintegral’s picture

FYI there is another related issue against D5 with the HTML filter #103563: HTML filter escaping html comments. If you're having problems with D6 even with the HTML corrector disabled, take a look at the patch over there.

elly’s picture

thanks for the d6 patch, it works like a charm!

jcnventura’s picture

Status: Needs work » Needs review
FileSize
3.13 KB

Hello again,

Here's my latest attempt. I have to acknowledge that Damien Tournoud (among others) was completely right. After applying some tests*, I concluded that the patch was indeed flawed.

This applies to the current 7.x-dev, and includes a couple of tests that verify it. After applying this patch all the tests in #276597: TestingParty08: filter.module still pass.

One final note: as it is now, the patch only handles proper comments (i.e. it requires that the comment is properly closed). I didn't try to handle that case, as it is too complicated to decide where to close it. I think that the issue of correcting a comment that is not terminated should be handled elsewhere.

João

*developed by wrwrwr in #276597: TestingParty08: filter.module

jcnventura’s picture

revisiting #35:
1. Create Tests (that'll fail) for 7.x ------- done
2. fix 7.x code --------- done
3. No tests fail anymore ------- done

4. backport to 6.x ------ TBD
5. backport to 5.x html corrector module ------- TBD

monotaga’s picture

subscribing

grendzy’s picture

Status: Needs review » Reviewed & tested by the community

#54 looks good. The new test passes only after patching filter.module. I also tinkered with different strings in the test case, and couldn't find any issues. One thing I did notice is that self-closing tags inside of comments get corrected, like <br> gets converted to <br />, which seems harmless.
Also this patch passes coder's style checks.

sun’s picture

Status: Reviewed & tested by the community » Needs work

Thanks for working on this!

+  /**
+   * Test the HTML corrector.
+   */
+  function testHtmlCorrector2() {

PHPDoc could use a proper description. Replace "2" with "Comments" in function name.

jcnventura’s picture

Status: Needs work » Needs review
FileSize
3.19 KB

@sun: Changed the description and renamed the function as you asked

@grendzy: Thanks for bringing that to my attention. Since what you described went against the code of the changes I made, I went back and analyzed it better. I had forgotten to include the 's' modifer for the PCRE regular expression, so the HTML corrector was still active in multi-line comments. I have added a single-use tag to the (multi-line) html corrector test to verify that this is now working correctly.

I am setting this back to RTBC because the only change to the filter module was the 's' modifier. Feel free to set it back to 'needs review' if you believe that change to be important enough to downgrade it's status.

jcnventura’s picture

Status: Needs review » Reviewed & tested by the community
Damien Tournoud’s picture

Status: Reviewed & tested by the community » Needs work

Well, you added a single-line constraint on the "don't apply processing inside of those tags" regexp, please confirm in a test that it doesn't fall apart when the tags are on multiple lines. Example:

<script>
  my test script
</script>

or:

<pre>
 test line 1
 test line 2
</pre
>
jcnventura’s picture

I didn't mention it, but the tests in #276597: TestingParty08: filter.module still pass, and one of them already handles the first case (which doesn't get affected at all by the s modifier since the tag is still in the same line).

However, your second case raises a valid point. Previously the tag matching stopped at the end of the line, and now it stops at the closing delimiter (>). Since the code is SUPPOSED to stop at the delimiter, I am pretty sure that simple change actually improved the code.

I will add a new test case to confirm that this doesn't break anything when opening or closing tags are spread over multiple lines.

João

jcnventura’s picture

Status: Needs work » Needs review

I couldn't come up with a new test, and because of that I remembered that the s modifier only applies to the dot character (http://es.php.net/manual/en/reference.pcre.pattern.modifiers.php).

So in simple terms, the s modifier added between #53 and #58 only modifies the handling of multi-line comments and that is already being tested in the test included in #58.

Based on that, as there's really no work needed, I am setting it to needs review. Note that grendzy had set it to RTBC in #56 (for 6 minutes)..

João

Status: Needs review » Needs work

The last submitted patch failed testing.

geerlingguy’s picture

JeremyFrench’s picture

The latest patch seems to change, filter.module and filter.test, should the test not be changed as part of #276597? If it should let me know and I will split the patch in two.

jcnventura’s picture

Status: Needs work » Needs review
FileSize
3.19 KB

Apparently the testing system failed to install Drupal, and the only way to restart it is to re-upload the patch.

EvanDonovan’s picture

Thanks for your work on this! The patch in #66 resolved the issue for me on Drupal 6.11. I had to apply it manually though (one of the hunks failed because the offsets were too different from HEAD).

It also seems to fix #348514: Node body does not handle <!--break--> properly for me.

TBarregren’s picture

Just for information: I have made a D6 module that can be used to overcome this problem. See http://drupal.org/project/htmlcomment.

dixon_’s picture

Re rolled the patch from #66 to apply on a fresh D7 install.

Status: Needs review » Needs work

The last submitted patch failed testing.

stella’s picture

Status: Needs work » Needs review

setting back to needs review, to re-run tests

joshmiller’s picture

bump

ChrisBryant’s picture

I understand this also causes problems for the Paging module as seen here:

#400190: Teaser break tag <!--break--> exposed

Status: Needs review » Needs work

The last submitted patch failed testing.

deekayen’s picture

Status: Needs work » Needs review

trying out a new testing bot that apparently wasn't working

joshmiller’s picture

Status: Needs review » Reviewed & tested by the community

The new testing bot liked it. I applied the patch, the added test applied with some fuzz.

After reading through the issue que, this patch is ready to be committed.

Josh

tic2000’s picture

You should keep an eye on #374441: Refactor Drupal HTML corrector (PHP5) too which solves a lot of the problems here. You can see in #45 there the problems that will still remain after this or that one get committed. My vote goes for #374441: Refactor Drupal HTML corrector (PHP5) which will trim the current patch here to 1 line.

jcnventura’s picture

Can we then apply this to Drupal 6 to fix the damn bug?? And let the refactored Drupal HTML corrector for Drupal 7?

tic2000’s picture

@jcnventura
Not until one of the patches gets committed.

tic2000’s picture

Status: Reviewed & tested by the community » Needs work

#374441: Refactor Drupal HTML corrector (PHP5) got committed.
This patch no longer applies.

To resume what I said in the other issue.

Testing the patch there and the patch here I noticed that:
1. Using Filtered HTML input format comments are removed. I think it shouldn't do this.
2. If the comments have some html tags inside, the result is even worse. <!-- comment <p>comment</p> --> will result in comment -->. If my previous statement is arguable, now for sure something is wrong. It should either remove the comment or (ideally IMO) let it untouched.
3. Finally, using Full HTML will not strip the comment, but because of the line brake filter if you write

<!-- comment -->
<!-- comment <p>comment</p> -->

it will output in source view

<p><!-- comment --><br /><!-- comment
<p>comment</p>
<p> --></p>

.

Point 3 is corrected by the change the patch in this issue does in the _filter_autop() function.
Point 1 and 2 remain to be addressed.

Damien Tournoud’s picture

I'm pretty convinced that the only way to properly solve those issues is to move those filters out of crazy pattern matching and into proper manipulation of the DOM tree.

jcnventura’s picture

In reply to #80:

1. I never tried filtered HTML, but that would probably be a bug in that filter and not in the HTML corrector filter, so it should possibly go into another issue.
2. I will try to take a look at it and add a specific test to make sure of it.
3. Glad to know.

One final thing.. Now that this will NEVER make it to Drupal 7.. Can we move it back to Drupal 6 and commit this patch there? This is an itch that needs scratching for several of us third-party module maintainers. Of course the better choice would be to apply the refactored HTML corrector to D6, but I don't think that will happen.

João

tic2000’s picture

@jcnventura
Yes, I think we can open separate issues for that and let this one for d6.
The re factored html corrector in D6 can be only done in contrib. Among others things that stops it to make it to core is that it requires PHP5.

@Damien Tournoud
I agree. Too much regexp voodoo now.

tic2000’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs work » Needs review
emdalton’s picture

I started to see this when I upgraded from D5 to D6. There were legacy comments on some nodes, and all of a sudden they were displaying when they hadn't before. Then I also started to see stray comment fragments in other places, e.g. when inserting views using filters. I've now turned off the HTMLcorrector filter, but this really needs to be addressed.

markus_petrux’s picture

Subscribing

westbywest’s picture

Subscribing

webchick’s picture

Version: 6.x-dev » 7.x-dev

Fixing version. AFAIK none of the comments has said that this isn't a problem in D7, and bugs need to get fixed there first.

webchick’s picture

Status: Needs review » Needs work
jcnventura’s picture

webchick, from #80 on, it seems this and the refactored HTML corrector are incompatible. I am assuming that including the test from this patch in the tests for the HTML corrector will make sure that this patch is not necessary in D7.

However, it is still highly annoying in D6, and this patch should be applied to that branch.

tic2000’s picture

FileSize
3.19 KB

@webchick
html corrector has no problem with comments now. So either the issue is moved to D6 que or the title is changed to reflect the problems in #80 which comes from autop filter (auto line brake) and html filter (allowed tags).
Even if the title is changed a new issue has to be open for D6 because the solution used in D7 so comments are not escaped can't be used in D6.

Meanwhile a first try to solve the problems in #80.

tic2000’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch failed testing.

tic2000’s picture

hmm. it fails on the spam deterent test. rel="nofollow" is not added in a link inside a comment. But before the patch it only passes because the comment tag is stripped out so the text inside is displayed.

tic2000’s picture

Version: 7.x-dev » 6.x-dev

Back to D6. I'll open a new issue for the problems in #80 to be addressed properly in D7 (#559584: filter_xss() and Line break filter break HTML comments).
This one needs work because the patch in #69 needs a re-roll against D6 branch.

EvanDonovan’s picture

Thanks for championing this, tic2000. It sounds like you are correct - the PHP5-specific changes to the filter in 7.x mean that a different approach is needed here. I will be happy to review once a new patch is rolled.

hgmichna’s picture

Isn't the following also related?

Drupal uses XHTML, which inherits from XML the standard way to differentiate between XML/HTML code and embedded non-XML/HTML code by way of the

                <![CDATA[ … ]]>

syntax, as I certainly don't need to remind the web programmers among us.

For example, if you embed HTML or script code in a document and don't want it to be interpreted as HTML by the browser, you have to embed it in these entities, if it contains XML-breaking code. Example:

<script>//<![CDATA[
    if (1 < 2) alert("OK");
//]]></script>

One thing that's urgently needed is that Drupal keeps away from <![CDATA[ … ]]> blocks, including this tag itself, and under no circumstances fiddles with them. If anybody wants to embed HTML, script, or other code and doesn't want it to be interpreted as XML or HTML, he wraps it in this tag, no browser will touch it, and neither should Drupal.

By the way, as to #45 by Dave.Hirschman, embedding script in a HTML <!-- … --> comment tag makes no sense, as all browsers today understand the script tag. Programmers did that in the very early days of web programming, but no longer. However, we still need the <![CDATA[ … ]]> construct today, because otherwise literally intended "<" characters could break XML syntax rules.

I've tried to alert everybody to the problem in issue #556648: <![CDATA[ escaped, but am not sure whether it should be dealt with here, because the problem seems to be related.

grendzy’s picture

Issue tags: +FilterSystemRevamp

tagging

jenlampton’s picture

Status: Needs work » Needs review
FileSize
2.37 KB

So we're now on D 6.14 and still waiting for a patch to be re-rolled against D6? Let's try this one :-)

mcarbone’s picture

Status: Needs review » Reviewed & tested by the community

Patch applied and seems to be working.

deviantintegral’s picture

Status: Reviewed & tested by the community » Needs work

The patch in #99 needs to be taken from the drupal root, not the root of the filter module. As well, when testing a comment, I end up with something like:

<!--test comment-->text</!--test comment-->

Where the "closing" tag is at the end of the document. A closing comment doesn't need to be inserted since it's not an actual tag.

smk-ka’s picture

Status: Needs work » Needs review
FileSize
6.38 KB

Rerolled patch: passing through comments unchanged in both htmlcorrector and autop functions. HTML filter treats comments correctly using code from #91.

Needless to say it would be nice to have this fixed before Nibiru.

mlncn’s picture

Status: Needs review » Reviewed & tested by the community

And we have a winner! This (#102) keeps HTML Corrector from incorrectly escaping comment tags and keeps the everything else happy too. Tested and also using on production.

benjamin, agaric

hgmichna’s picture

Keeps everbody else happy too? Also #97?

smk-ka’s picture

Status: Reviewed & tested by the community » Needs work

No, I surely missed that comment.

Poieo’s picture

Subscribing...

neochief’s picture

+1

maestrojed’s picture

Patch in #102 applied cleanly but did not fix the issue for me. (drupal 6.14

Subscribing...

smk-ka’s picture

but did not fix the issue for me

Can you give more details/an example of what didn't work?

jhedstrom’s picture

This patch (#102) works for me.

EvanDonovan’s picture

Patch #102 works for me on a 6.15 install. Can we get this patch applied to Drupal 6.x core first and then deal with #556648: <![CDATA[ escaped separately? That way, we don't have to keep patching this every time that Drupal core is upgraded.

neochief’s picture

Status: Needs work » Reviewed & tested by the community

+1 for #111

JGO’s picture

indeed, please add it to core ! :)

----------------------------
JGO | http://www.e2s.be
----------------------------

alb’s picture

for who use fckeditor with botton break and pagebreak;
1) if in a node there are one < !--break-- > and some < !--pagebreak-- > the node teaser is ok, but when the node is opened appear the string < !--break-- >
2)
but in a similar node, with only the break, is all ok when the node is opened.

Regards problem in the point 1), if in the Input format I remove the filter Html corrector,
the problems not appear more; so the problem is this filter;

if I not delete, but move this filter after Paging filter the problem in the point 1 there isn't, but others problem appears (appear strings of < !--pagebreak-- > and < !--page_filter-- >);

in the patch, is possible to insert also a solution for this problem?

Marc Bijl’s picture

Subscribing...

In the meantime: any workarounds?

UPDATE
=====
Applied the patch (Drupal 6.15) and seems to work.
However, if the opening tag of a HTML comment is on its own line (the comment below it),
I need to type a space after the opening tag; otherwise it results in a paragraph like this <p>&lt;!--</p>

geerlingguy’s picture

I third #111. I have a lot of legacy content, and many nodes have pasted-from-word comments, and cleaning them up manually will take a ton of time. This patch will let me forget about having to do some dirty regex work just to remove a bunch of comments...

RichieB’s picture

Subscribing. I'm in the same boat as #116. We have some content that is pasted from MS-Word that leaves visible comments in Drupal 6.15.

torgosPizza’s picture

Patch in #102 worked for me on Pressflow 6.15.

Gábor Hojtsy’s picture

#559584: filter_xss() and Line break filter break HTML comments was marked duplicate of this issue without the D7 issues being fixed. That sounds like a recipe for regressions.

gpk’s picture

I reopened #559584: filter_xss() and Line break filter break HTML comments.

@119: I infer you would be unwilling to commit #102, even though the problem is partially fixed in 7.x via #374441: Refactor Drupal HTML corrector (PHP5) ?

geerlingguy’s picture

Just FYI, for those who need a quick fix (especially if you get a lot of Word copy/pastes on your site), I stuck this into a custom module on my site to set up an input filter, which I enabled and set to run first on my input formats page:

/**
 * Implementation hook_filter().
 */
function custom_filter($op, $delta = 0, $format = -1, $text = '', $cache_id = 0) {
  switch ($op) {
    case 'list':
      return array(0 => 'HTML Comment Removal Filter');
    case 'description':
      return t('Removes HTML comments, like those inserted by Microsoft Word or other nefarious applications.');
    case 'process':
      // Remove HTML Comments beginning with <!-- and ending with -->
      $text = str_replace("\n", 'placeholder text string', $text);
      $text = preg_replace('/<!--.*?-->/m', '', $text);
      $text = str_replace('placeholder text string', "\n", $text);
      return $text;
    default:
      return $text;
  }
}
hgmichna’s picture

…
$text = preg_replace('/<!--.*?-->/m', '', $text);
…

Are you sure the m should not be a g? m can only be wrong.

mcarbone’s picture

hgmichna’s picture

The two most common elements in the universe are hydrogen and stupidity.

Harlan Jay Ellison (American author, born 1934-05-27)

robertjd’s picture

subscribe

RichieB’s picture

I am running Drupal 6.15 with the patch from #102, and it works fine for normal node views, but HTML comments are still shown when I use this code:

node_view(node_load(array('nid' => $node->nid)), 1)

I use this to build an overview page of all nodes posted of a certain content type. How do I get rid of the HTML comments when using node_view()?

rfay’s picture

Subscribing

jenlampton’s picture

@RichieB, why not use a view?

@smk-ka, there's another issue open for #97, can we commit this one and deal with that one over there? I think getting this fix in now would be better than waiting for that one to get fixed too.

What do others think?
Jen

EvanDonovan’s picture

@jenlampton: I agree. As I said at #111, I think we should commit this and deal with #97 separately. I don't like having to patch core every time I upgrade...

btully’s picture

Patch in #102 worked for me on Pressflow 6.15 as well. Great job and many thanks!

lizhenry’s picture

I'm having the same issue as @RichieB. The pages I've got here are built with views, but they contain custom fields with php code that calls node_view, and the site has quite a lot of users who cut and paste from Word and I think Outlook . @jenlampton you suggest using views to fix the problem. I'm not sure how to use views to filter out the Microsoft style information that appears as an HTML comment.

geerlingguy’s picture

Bump, please - Patch in #102 applied cleanly to three different Drupal 6.15 installs (haven't tested on 6.16 yet), and this is absolutely an essential bug to fix, for all the 100+ reasons stated above :-)

I can't get teaser breaks to work on my site without this patch...

rfay’s picture

I have this deployed just fine on 6.16.

This is a key bug in Drupal, and I strongly agree that it should be committed.

pgacv2’s picture

Applied patch in #102 to a 6.16 install and works fine, not escaping comments in a block.

kmonty’s picture

+1

Kafu’s picture

Subscribing.

geerlingguy’s picture

Just tested in 6.16 on http://www.archstl.org/, and the patch applies cleanly (offsets are a little different, but both hunks applied). Hopefully we can push this in during the code sprint at DrupalCon SF! http://hojtsy.hu/blog/2010-apr-13/making-drupal-6-even-more-awesome-code...

quicksketch’s picture

Another vote in support of putting in #102. As has been noted several times already, Drupal 7 requires a different approach to fixing this issue, since the htmlcorrector has changed significantly and exhibits a different problem than Drupal 6 (see #559584: filter_xss() and Line break filter break HTML comments).

The CDATA issue raised in #97 has an issue over at #556648: <![CDATA[ escaped.

We're running patch #102 on (soon to be finally upgraded) lullabot.com. This patch is ready, any other related tasks have appropriate issues, let's get this fixed for Drupal 6.17.

q0rban’s picture

subscribe

EvanDonovan’s picture

Thanks, quicksketch, for the vote of confidence! I'm glad to know that lullabot.com is also waiting on this core patch :)

john.kenney’s picture

applied #102 patch on 6.16 and it resolved nagging problem of <!--paging_filter--> printing in my views due to use of paging module.

chantra’s picture

#102 seems to work fine against 6.16 for me too. Hoping to get this included in 6.x core at some stage.

Why is the test status postponed?

berenddeboer’s picture

Applied #102 and works fine. Please can we get this in core? It's just really annoying to see

in RSS feeds displaying full text.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Great, thanks for all the testing and support, committed.

Heine’s picture

Priority: Critical » Normal
Status: Fixed » Needs work

I nearly got a stroke when I saw #80, but I've since confirmed that HTML comments are removed by the HTML filter.

Comments are equivalent to script, object and embed tags in terms of danger:

<!--[if expression]> HTML <![endif]-->

Next, comment handling is still a bit odd;

  filter_xss("<!--\nThis is a multiline comment <br />\nand can span through as many as lines you like.\n-->");

results in

and can span through as many as lines you like. --&gt;

#222926-101: HTML Corrector filter escapes HTML comments also describes an issue, likely in combination with the html corrector, or the html corrector alone.

(edit; example)

Pomax’s picture

Still broken for the "<" symbol inside script blocks with full html selected:

Some random text.
<script type="text/javascript">
  var moo = "bull";
  var num = 256;
  if(4>5) { moo = "cow1"; }
  else if(4<5) { moo = "cow2"; }
  if(moo=="bull") { num << 4;}
</script>
Some more random text.

Placing this on a page in Drupal 6.19 turns the comparator "<" and the bit shift operator "<<" into "&lt;" and "&lt;&lt" respectively. Oddly, the character ">" is fine.

It does, however, massively break pages with custom on-page javascript as is =(

hgmichna’s picture

You should try the (old-fashioned) HTML comment tags that are otherwise nowhere needed any more:

<script type="text/javascript">//<!--
  JavaScript code goes here.
//--></script>

Same for <style …> tags.

By the way, it is normal and expected that the right angle bracket is accepted. This is, for example, part of the XML spec.

In theory, the <![CDATA[ … ]]> entities should also work, but I think they still don't. Without these we may be breaking XML and XHTML.

Pomax’s picture

Ah! A good idea, but sadly that doesn't help prevent the problem. Drupal 6.19, at least, still turns the "<" and "<<" inside the commented block into &lt; entities.

I changed the article I have on http://projects.nihongoresources.com/utilities/utf_to_hex_converter to use your suggestion, but as can be seen from the source code that page generates, Drupal still filters the text (when editing the page, these characters are shown as normal "<" and "<<").

hgmichna’s picture

Isn't it the whole purpose of this thread to correct this defect? Perhaps the correction hasn't made it into the production versions of Drupal yet.

Pomax’s picture

Priority: Normal » Critical

changing status to critical, because while left unfixed it gets in prevents any "full html" pages from executing on-page javascript with "<" evaluations, which are frequent. Hopefully someone higher up will quickly act on this bug.

berenddeboer’s picture

Pomax, < is invalid inside xhtml, so you really need to write < or something must translate your < into &lt;

hgmichna’s picture

That is not quite correct. < is permitted inside <![CDATA[ … ]]> entities. Hence the correct form for script and style elements in XHTML is, for example:

<script type="text/javascript">//<![CDATA[
  Some JavaScript
//]]></script>

Here JavaScript can contain all these characters.

This is why Drupal must not encode any characters inside CDATA entities, but it erroneously does.

I cannot tell you how often I see software whose author has not understood the essence of transparency in protocols.

Pomax’s picture

hgmichan is correct, "<" is permitted in CDATA blocks (normally it's PCDATA, or Parsed Character data; CDATA tells the parser to turn off parsing, skip ahead to the end of the CDATA block, and then resume parsing). However, with explicit CDATA block indicators the "<" and "<<" inside the block are still erroneoously substituted with &lt; and &lt;&lt;

peterconnolly’s picture

Subscribing.

This is a major issue preventing upgrade of client websites past Drupal 6.16

EvanDonovan’s picture

Workaround until this gets resolved: turn the htmlcorrector filter off for the the input formats in which you have JS that you need to execute.

If you need something to clean up HTML in those input formats, you could use the HTML Purifier contrib module instead.

Pomax’s picture

Unless you mean a different html corrector, setting input format set to "Full HTML" does not bypass the problem.

EvanDonovan’s picture

@Pomax: HTML Purifier is a module available for download from drupal.org. It is not the same thing as core's Full HTML format, which contains the HTML Corrector filter by default (although it can be disabled).

adamdicarlo’s picture

Ugh. I'm getting *page breaking* output like this in a teaser view (Drupal 6.19, Views 3.0-alpha3):

<li class="training-event left">
        
      <h3>  
    
      <a href="{node-link}">{Title}</a>
      </h3>
  
      <div>
    
      <!-- <p><strong>{blah} </strong>{yadda yadda}.
      </div>

    </li>

because the content starts with an HTML comment. I believe this bug is to blame. This really sucks. (This is in a view where the Node: Teaser field is set to abbreviate to x characters, and "Correct HTML" is turned on.)

adamdicarlo’s picture

Another detail: That's the output I get when logged in. When anonymous, the HTML comment tag gets escaped as &lt!--.

magpie5212’s picture

I think this is the same bug which is duplicating code I am inserting into pages for IE. (The code is there to fix IE differences to all other browsers with object tags - I change them back to iframes. The following is just dummy text used for debug.)

I traced this to the _filter_autop

$text="<p>For all browsers.</p> <!--[if lte IE 8]> <p>Internet Explorer is Here!</p><![endif]-->";
$text=_filter_autop($text); 
var_dump($text);

The result is:

string(153) "<p>For all browsers.</p>
<!--[if lte IE 8]> <p>Internet Explorer is Here!</p><![endif]--><!--[if lte IE 8]> <p>Internet Explorer is Here!</p><![endif]-->"
magpie5212’s picture

My problem is cured by removing line 910 of filter.module: $output .= $chunk; Not sure about a test case to find out if this causes any other problems.

bradspry’s picture

subscribing

adraskoy’s picture

subscribe

hansamurai’s picture

Subscribing...

Edit: this was fixed for me by updating to Drupal 6.20, I was a bit behind.

Danny Englander’s picture

Subscribing, having the same issue.

ndeschildre’s picture

I confirm magpie5212's bug on Drupal 6.20:

When using the "convert line breaks into HTML" filter, it duplicates HTML comments. And commenting the following line (filter.module):

  foreach ($chunks as $i => $chunk) {
    if ($i % 2) {
      // Passthrough comments.
      if (substr($chunk, 1, 3) == '!--') {
        $output .= $chunk;  // <====== Comment this line
      }

it works.

dpearcefl’s picture

Status: Needs work » Postponed (maintainer needs more info)

Is this issue fixed in the latest D6? is there any interest in pursuing this issue?

sun’s picture

Status: Postponed (maintainer needs more info) » Needs work

@dpearceMN: Please stop doing this.

dpearcefl’s picture

I respectfully asked the questions. Perhaps is some action was shown on this issue, I wouldn't have to ask.

rfay’s picture

@dpearceMN, we appreciate you checking into these poor orphaned issues. Just don't change the status unnecessarily, because it makes them disappear off other people's radar. And please read the issue before asking whether something has been committed. Even though it's detective work, it can be done. I do wish it was easier.

And yes, this is a critical issue for D6 which I think should have gone in a long time ago. If you can spend some time with it, get a refreshed patch which deals with whatever reason it was set to "needs work" maybe it can get going.

EvanDonovan’s picture

@dpearceMN: So it looks from a read-through of the issue that it is fixed for regular comments, but not for angle brackets inside Javascript, or IE conditional comments. A patch has never been written for those things.

I would recommend that if you need HTML corrector to handle those correctly you use the HTML Purifier contributed module instead. The 6.x HTML corrector, being regex-based, will never be 100% accurate. An alternative would be for someone to take the 7.x HTML corrector, which uses the SimpleXML parser, and make it into a contrib module.

I would suspect that at this point this issue will never be fixed for 6.x, since it is already fully fixed for 7.x and thus there is no possibility of a patch to backport.

dpearcefl’s picture

Status: Needs work » Closed (won't fix)

Then "won't fix" it shall be.

sun’s picture

Title: htmlcorrector filter escapes HTML comments » HTML Corrector filter escapes HTML comments
Priority: Critical » Normal
Status: Closed (won't fix) » Needs work

There are still 2-3 years in which someone might step up and fix this in a non-breaking/backwards-compatible way for D6.

EvanDonovan’s picture

@dpearceMN: I didn't mean to suggest the issue should be "won't fix". I was simply suggesting a few alternatives that you could use to resolve the problem since probably the people who would know how to fix it won't have the time.

jenlampton’s picture

This other issue may be a duplicate
#881006: Regression: 'break' tag doesn't work with Filtered HTML
or maybe we can at least steal some of that solution for 6.x as well :)

Status: Needs work » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.