Problem/Motivation

Faulty HTML inputted by content creators can cause the module to error. Enabling the 'Correct Faulty HTML' filter does not prevent this.

Where there is an unclosed tag, for example <h3>Some title<h3>, processing of $dom in TocBuilder.php fails with messages like 'cannot call hasAttributes() on null', or 'DOMNode no longer exists.'

Steps to reproduce

Run toc_filter module on a node with unclosed tags.

Proposed resolution

I spent a lot of time on this and it is not enough to test for existence of $dom_node or $dom_node->hasAttributes(). These conditions can pass but the processing still fails. See the accepted answer at https://stackoverflow.com/questions/42471684/php-pthread-couldnt-fetch-d...

My conclusion is that any function which depends on the Drupal Html service (a wrapper for PHP's DOMDocument) is always going to be fragile.

Possible solutions are to pass the html through the PHP tidy extension before parsing them.

if (extension_loaded('tidy')) {
      $tidy = new  \tidy();
      $html = $tidy->repairString($toc->getContent());
      $dom = Html::load($html);
    } else {
      $dom = Html::load($toc->getContent());
    }

If this is accepted, it would be possible to use the tidy library via the service provided by https://www.drupal.org/project/tidy_html and make than an optional, recommended dependency.

At present, after a lot of work, I cannot see another way of handling the problem. Either we 'tidy' the HTML before parsing it, or change the module so that it no longer relies on the DOMDocument class.

Remaining tasks

Decide on a fix!

Comments

John_B created an issue. See original summary.

john_b’s picture

Issue summary: View changes
john_b’s picture

Out of interest I tried the Nullsafe operator, available in PHP 8.0:

if($dom_node && $dom_node?->hasAttributes()) {

still errors with Error: Couldn't fetch DOMElement in DOMNode->hasAttributes() (line 62 of modules/contrib/toc_api/src/TocBuilder.php).

john_b’s picture

StatusFileSize
new3.39 KB

This patch adds some checks to deal with invalid HTML in TocBuilder.php including using the Tidy library to repair HTML as suggested in the description, and including the related issue #3223929: errors and broken cron if $dom_node in TocBuilder is null. It also adds a note to README.md recommending installing the PHP Tidy library.

Could add a test for handling of HTML with open tags. Could also add a line to the Drupal status report noting whether the Tidy extension is installed.

john_b’s picture

Status: Active » Needs review
smulvih2’s picture

Status: Needs review » Needs work

@john_b I was able to reproduce this error, but only by going into source and modifying a tag, then saving the node while still in source mode. I think this might be an issue for my site since I have a role defined that can edit source and HTML, so would be nice to fix. Are you able to add a line item on the status report to indicate if the tidy extension is loaded or not? Not able to get your patch to work probably because I don't have Tidy installed. Once you add this I can test again. Thanks!

john_b’s picture

Status: Needs work » Needs review
StatusFileSize
new4.49 KB

The attached patch makes no change other than to add a toc_api.install file which generates a line in the Status report recommending PHP Tidy extension if it is missing, or confirming if it is present.

We are not using CKEditor at all. All our content editors work with HTML source. When migrating D7 to D9, it seemed best to keep the site as far as possible the same as the old site. If we want to starting use a WYSIWYG that will only come after the migration is complete.

john_b’s picture

StatusFileSize
new4.49 KB

Same patch with typo fixed.

smulvih2’s picture

I just ran into this error while comparing revisions with drupal/diff on a node with a TOC.

john_b’s picture

@smulvih2 Was the error caused by a version with unclosed tags? Does my patch prevent it?

I spent a long time on this, tracing stuff in PHPStorm, and cannot think of a solution other than my patch, short of rewriting completely with something closer to the D7 approach, which would be a pain. I am using the DOMDocument class in a custom filter module without problems.

hanoii’s picture

StatusFileSize
new1.06 KB

I also had some issues with malformed HTML as well. Not particularly with unclosed HTML but different things like:

- opening h3 but closing h4
- headers as children of other headers

I created a different solution, which I am attaching for reference as well and to use it locally. Basically I had a bunch of different issues but more importantly with the replace as it was really replacing itself so preventing that to happen.

On the issue of malformed HTML, I wonder why you are not using Drupal's own HTML fixing filter which should take care of that, at least that's what I am using.

I agree this is not a simple thing to fix, I don't think that requiring a PHP extension is the best solution specially as core already comes with an HTML fixing.

john_b’s picture

The Drupal HTML fixing filter was not working for me. Probably an error in my code. I never worked out why.

john_b’s picture

Status: Needs review » Reviewed & tested by the community

I have tested hanoii's patch in #11. It is as good a solution as we are likely to get. It would be good to see it committed.

  • VladimirAus committed fea9b8f on 8.x-1.x authored by hanoii
    Issue #3224559 by John_B, hanoii, smulvih2, VladimirAus: Invalid HTML...
vladimiraus’s picture

Status: Reviewed & tested by the community » Fixed

Thanks everyone. Committed.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.