Problem/Motivation

When heading ID is generated, ToC API only takes into account IDs it generates, but not those already present in the
source. There is a chance that user generated content would have a duplicating ID. The chances are higher for the IDs
being generated based on the heading title.

Steps to reproduce

Generate ToC for the following HTML:

<p><a id="documents"></a></p>

<h3>Documents</h3>

Proposed resolution

Take into account IDs from the source for ToC.

Remaining tasks

User interface changes

API changes

Data model changes

Comments

RoSk0 created an issue. See original summary.

rosk0’s picture

Status: Active » Needs review
StatusFileSize
new2.97 KB

Here is a patch which fixes and issue. Together with test. Test passes locally , but it can't be run in CI as all the other tests in the module are outdated and broken.

ericgsmith’s picture

Status: Needs review » Reviewed & tested by the community

Patch looks good - fixes the error with duplicate IDs.

Test is good too - as noted by you it could be a unit test, I think given currently the tests are not running that this shouldn't prevent it being RTBC and a follow up issue can be created to get the existingtests running and passing.

+++ b/tests/src/Kernel/TocBuilderTest.php
@@ -0,0 +1,68 @@
+   * @todo Convert to unit test and include in TocTest, once that is fixed.
ericgsmith’s picture

Status: Reviewed & tested by the community » Needs work

Sorry, was a bit premature on setting RTBC.

On closer look the test and code assumes that the ID precedes the generated ID, e.g

<h3>Documents</h3>

<p><a id="documents"></a></p>

Will still produce duplicate IDs.

rosk0’s picture

Status: Needs work » Needs review
StatusFileSize
new5.14 KB
new4.05 KB

Thanks for the review Eric and picking that issue up!

Also spotted this.

When ToC API faces this type of HTML:

<p><a name="documents"></a></p>
<h2>Documents</h2>

with now deprecated, name attribute.

It produces this:

<h2 id="documents">Documents<a href="#documents" class="anchor">#</a></h2>
<h2 id="documents">Documents<a href="#documents" class="anchor">#</a></h2>

and throws this warning:

Warning: Uninitialized string offset 1 in Drupal\toc_api\TocBuilder->renderContent() (line 75 of /app/web/modules/contrib/toc_api/src/TocBuilder.php) 

So I added a workaround and test for that as well.

ericgsmith’s picture

Status: Needs review » Reviewed & tested by the community

Nice work - looks good to me.

Ran test locally on D10 / PHP 8.1 and all green.

rosk0’s picture

Nice , thanks!

I was testing on D9.5, PHP 8.0 so we have a good test coverage!

rosk0’s picture

Status: Reviewed & tested by the community » Needs review
StatusFileSize
new6.33 KB
new3.38 KB

We've identified a huge performance hit with introduced ID deduplication code on larger HTML documents. For the context, the documents we were testing this are around 5 Mb of plain HTML and are very complex , with lots of tables , headings and ID attributes.

Rendering the page in development environment was taking around 15 min to render and failing at times.

With this new approach based on XPath filtering , rather than whole DOM processing, it is now way more performance and we saw results of under 60 seconds to render on a cold cache.

Based on the XPath spec:

The resulting node sequence is returned in document order.

So we shouldn't loose heading order here. Tests to prove this would be nice to have surely, I don't have capacity for that at the moment.

ericgsmith’s picture

Status: Needs review » Reviewed & tested by the community

Awesome work! Significant improvement when dealing with large content.

rosk0’s picture

We have this patch running in production since 2023-06-27 with no negative effects found to date.

vladimiraus’s picture

Status: Reviewed & tested by the community » Fixed

Thank you! Commited and released! 🍻

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.