Currently, our HTML processing and sanitization logic is tightly coupled within the document_loader_html_processor module. While this made sense when the processor was exclusively used by the Document Loader, it creates an architectural bottleneck moving forward.
Keeping this logic bundled together has two consequences:
- Developers cannot reuse these robust HTML sanitization and manipulation services for other sources (e.g., API ingest, WYSIWYG processing, or custom imports) without also pulling in the entire document_loader dependency tree.
- Site Managers are forced to enable heavier, intertwined modules even if they only need standalone HTML processing capabilities, making the site architecture harder to maintain and audit.
Proposed Resolution
Migrate the core HTML manipulation, extraction, and sanitization services out of document_loader_html_processor and into a new, standalone html_processor module.
This ensures the HTML Processor acts as a pure utility that does one thing exceptionally well, adhering to the Single Responsibility Principle.
Developer & Site Manager Impact
- Site Managers: A cleaner module ecosystem. You will be able to enable HTML processing utilities across the site without inheriting Document Loader behaviors.
- Developers: Access to a decoupled, easily injectable set of services (e.g., HtmlProcessorInterface, HtmlSanitizerConfigBuilderInterface). The codebase becomes easier to unit test, and the split removes the risk of circular dependencies and bloated service wiring.
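To illustrate the developer impact, here is a minimal sketch of how an unrelated module might consume the decoupled service once the split lands. Everything named here is an assumption for illustration: the consumer module my_importer, its ApiIngest class, and the service ID html_processor.processor do not come from this issue, and the final service IDs may differ.

```yaml
# my_importer.services.yml (hypothetical consumer module, not part of this issue)
services:
  my_importer.api_ingest:
    # Hypothetical class whose constructor type-hints HtmlProcessorInterface.
    class: Drupal\my_importer\ApiIngest
    arguments:
      # Assumed service ID for the primary HTML processor service.
      - '@html_processor.processor'
```

In this sketch the consumer's .info.yml would only need to list html_processor under dependencies, with no reference to document_loader.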
Remaining Tasks
- Create the new html_processor module namespace and .info.yml (ensuring it has no dependency on document_loader).
- Migrate the core services, interfaces, and classes (e.g., HtmlProcessor, AdRemovalHelper, RelativeUrlRewriter) to the new namespace.
- Update services.yml to reflect strict dependency injection (private services by default, with only the primary interface exposed publicly if required for procedural hooks).
- Update and migrate all relevant PHPUnit tests to the new module.
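The first and third tasks could look roughly like the two fragments below. This is a sketch under stated assumptions, not the committed configuration: the human-readable name, description, core version constraint, service IDs, and class namespaces are all guesses, and constructor arguments are left out because the issue does not describe them.

```yaml
# html_processor.info.yml (sketch; all values except the module name are assumptions)
name: 'HTML Processor'
type: module
description: 'Standalone HTML manipulation, extraction, and sanitization services.'
core_version_requirement: ^10 || ^11
# Deliberately no "dependencies:" key, so nothing from document_loader is pulled in.
```

```yaml
# html_processor.services.yml (sketch; service IDs and namespaces are assumptions)
services:
  # Primary entry point, left public so procedural hooks can fetch it
  # from the container.
  html_processor.processor:
    class: Drupal\html_processor\HtmlProcessor
    public: true

  # Internal helpers, marked private so they are only reachable via injection.
  html_processor.ad_removal_helper:
    class: Drupal\html_processor\AdRemovalHelper
    public: false

  html_processor.relative_url_rewriter:
    class: Drupal\html_processor\RelativeUrlRewriter
    public: false
```

Marking the helpers private explicitly keeps the public surface limited to the one service that procedural code may need, which matches the "private by default" goal in the task list.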
Issue fork html_processor-3606910