Problem/Motivation
Faulty HTML inputted by content creators can cause the module to error. Enabling the 'Correct Faulty HTML' filter does not prevent this.
Where there is an unclosed tag, for example <h3>Some title<h3>, processing of $dom in TocBuilder.php fails with messages like 'cannot call hasAttributes() on null', or 'DOMNode no longer exists.'
Steps to reproduce
Run toc_filter module on a node with unclosed tags.
Proposed resolution
I spent a lot of time on this and it is not enough to test for existence of $dom_node or $dom_node->hasAttributes(). These conditions can pass but the processing still fails. See the accepted answer at https://stackoverflow.com/questions/42471684/php-pthread-couldnt-fetch-d...
My conclusion is that any function which depends on the Drupal Html service (a wrapper for PHP's DOMDocument) is always going to be fragile.
Possible solutions are to pass the html through the PHP tidy extension before parsing them.
if (extension_loaded('tidy')) {
$tidy = new \tidy();
$html = $tidy->repairString($toc->getContent());
$dom = Html::load($html);
} else {
$dom = Html::load($toc->getContent());
}
If this is accepted, it would be possible to use the tidy library via the service provided by https://www.drupal.org/project/tidy_html and make than an optional, recommended dependency.
At present, after a lot of work, I cannot see another way of handling the problem. Either we 'tidy' the HTML before parsing it, or change the module so that it no longer relies on the DOMDocument class.
Remaining tasks
Decide on a fix!
| Comment | File | Size | Author |
|---|---|---|---|
| #11 | 3224559-11.patch | 1.06 KB | hanoii |
| #8 | toc_api-3224559-invalid-html-error_8.patch | 4.49 KB | john_b |
| #7 | toc_api-3224559-invalid-html-error_7.patch | 4.49 KB | john_b |
| #4 | toc_api-3224559-invalid-html-error_4.patch | 3.39 KB | john_b |
Comments
Comment #2
john_b commentedComment #3
john_b commentedOut of interest I tried the Nullsafe operator, available in PHP 8.0:
if($dom_node && $dom_node?->hasAttributes()) {still errors with Error: Couldn't fetch DOMElement in DOMNode->hasAttributes() (line 62 of modules/contrib/toc_api/src/TocBuilder.php).
Comment #4
john_b commentedThis patch adds some checks to deal with invalid HTML in TocBuilder.php including using the Tidy library to repair HTML as suggested in the description, and including the related issue #3223929: errors and broken cron if $dom_node in TocBuilder is null. It also adds a note to README.md recommending installing the PHP Tidy library.
Could add a test for handling of HTML with open tags. Could also add a line to the Drupal status report noting whether the Tidy extension is installed.
Comment #5
john_b commentedComment #6
smulvih2@john_b I was able to reproduce this error, but only by going into source and modifying a tag, then saving the node while still in source mode. I think this might be an issue for my site since I have a role defined that can edit source and HTML, so would be nice to fix. Are you able to add a line item on the status report to indicate if the tidy extension is loaded or not? Not able to get your patch to work probably because I don't have Tidy installed. Once you add this I can test again. Thanks!
Comment #7
john_b commentedThe attached patch makes no change other than to add a toc_api.install file which generates a line in the Status report recommending PHP Tidy extension if it is missing, or confirming if it is present.
We are not using CKEditor at all. All our content editors work with HTML source. When migrating D7 to D9, it seemed best to keep the site as far as possible the same as the old site. If we want to starting use a WYSIWYG that will only come after the migration is complete.
Comment #8
john_b commentedSame patch with typo fixed.
Comment #9
smulvih2I just ran into this error while comparing revisions with drupal/diff on a node with a TOC.
Comment #10
john_b commented@smulvih2 Was the error caused by a version with unclosed tags? Does my patch prevent it?
I spent a long time on this, tracing stuff in PHPStorm, and cannot think of a solution other than my patch, short of rewriting completely with something closer to the D7 approach, which would be a pain. I am using the DOMDocument class in a custom filter module without problems.
Comment #11
hanoiiI also had some issues with malformed HTML as well. Not particularly with unclosed HTML but different things like:
- opening h3 but closing h4
- headers as children of other headers
I created a different solution, which I am attaching for reference as well and to use it locally. Basically I had a bunch of different issues but more importantly with the replace as it was really replacing itself so preventing that to happen.
On the issue of malformed HTML, I wonder why you are not using Drupal's own HTML fixing filter which should take care of that, at least that's what I am using.
I agree this is not a simple thing to fix, I don't think that requiring a PHP extension is the best solution specially as core already comes with an HTML fixing.
Comment #12
john_b commentedThe Drupal HTML fixing filter was not working for me. Probably an error in my code. I never worked out why.
Comment #13
john_b commentedI have tested hanoii's patch in #11. It is as good a solution as we are likely to get. It would be good to see it committed.
Comment #15
vladimirausThanks everyone. Committed.