Doc to HTML — Usage and Configuration Guide (2.x)

The Doc to HTML module allows editors to upload documents directly from
a node edit form, convert them to HTML with LibreOffice, review the
generated markup in CKEditor 5, and save it into a
text_long or text_with_summary field.

The primary supported formats are DOC and DOCX.
Site builders can also enable additional LibreOffice-supported formats such as
ODT, RTF, and PPTX from the module
configuration.

Requirements

The module requires LibreOffice to be installed in the same environment
where Drupal's PHP runs, such as the web server, PHP-FPM container, or DDEV web container.
Drupal executes LibreOffice from the command line to perform the document-to-HTML
conversion.

  • Install LibreOffice in the same environment that runs your Drupal PHP
    runtime.
  • Configure the LibreOffice executable directory in the Base path for LibreOffice
    setting, for example /usr/bin or
    /Applications/LibreOffice.app/Contents/MacOS.
  • Configure only the executable name in the Command setting, for example
    soffice, libreoffice, or soffice.exe.
    Do not include path separators in this field.
  • Verify the installation from the command line: a simple test such as
    libreoffice --version or soffice --version should run without errors.
  • PHP 8.1 or later is required. The proc_open function is recommended so the
    module can enforce the configured conversion timeout.
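
The base path and command rules above can be sketched as a small check. This is an illustration in Python, not the module's own PHP code; the function names resolve_executable and is_runnable are hypothetical.

```python
import os


def resolve_executable(base_path: str, command: str) -> str:
    """Join the base path and the command name into one executable path.

    Hypothetical helper: it mirrors the documented rule that the Command
    setting holds only an executable name, never a path.
    """
    if not command or "/" in command or "\\" in command:
        raise ValueError("Command must be a bare executable name, e.g. 'soffice'")
    return os.path.join(base_path, command)


def is_runnable(executable: str) -> bool:
    """Return True if the resolved file exists and is executable."""
    return os.path.isfile(executable) and os.access(executable, os.X_OK)
```

For example, resolve_executable("/usr/bin", "soffice") joins the two settings, while a command such as "bin/soffice" is rejected because it contains a path separator.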

Example DDEV configuration (.ddev/config.yaml)

On local environments that use DDEV, you can install LibreOffice inside the
web container by adding an extra package to your project's .ddev/config.yaml:

  name: doc-to-html
  type: drupal11
  docroot: web
  php_version: "8.3"
  webserver_type: nginx-fpm

  webimage_extra_packages:
    - libreoffice

With this configuration, DDEV installs the libreoffice package inside the web
container the next time the project is restarted (ddev restart), so that the
Doc to HTML module can execute LibreOffice from Drupal.

Global Module Configuration

Before using the widget on content types, configure the global settings used by the
Doc to HTML module. These settings provide the default behavior for conversion,
file handling, and HTML cleanup.

Typical global configuration steps:

  1. Configure LibreOffice settings
    Go to /admin/config/content/doc_to_html/libreoffice-settings and configure:
    • The base path where the LibreOffice executable is located.
    • The executable command name, such as soffice or libreoffice.
    • The conversion timeout in seconds. The default is 60 seconds. Use 0 to disable the timeout.
  2. Configure basic conversion settings
    Go to /admin/config/content/doc_to_html/basic-settings and configure:
    • The public files subfolder used for temporary converted HTML files.
    • The enabled upload formats: DOC, DOCX, ODT, RTF, and/or PPTX.
    • Whether converted output should be normalized to UTF-8.
    • The optional body extraction regex used to extract content from the generated HTML.
    • The body regex match index, where 0 means the full match and 1 or higher selects a capture group.
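
The body regex and match index settings can be illustrated with a short Python sketch. It uses Python regex syntax, which may differ from the pattern syntax the module expects in PHP. The function name extract_body and the fallback of returning the input unchanged when nothing matches are assumptions of this sketch.

```python
import re


def extract_body(html: str, pattern: str, match_index: int = 0) -> str:
    """Return the configured match from the converted HTML.

    match_index 0 returns the full match; 1 or higher returns that capture
    group. Returning the input unchanged when the pattern is empty, does not
    match, or the index is out of range is an assumption of this sketch.
    """
    if not pattern or match_index < 0:
        return html
    match = re.search(pattern, html, flags=re.DOTALL | re.IGNORECASE)
    if match is None or match_index > match.re.groups:
        return html
    value = match.group(match_index)
    return html if value is None else value


sample = "<html><body><p>Hello</p></body></html>"
body_pattern = r"<body[^>]*>(.*?)</body>"
print(extract_body(sample, body_pattern, 0))  # full match, body tags included
print(extract_body(sample, body_pattern, 1))  # first capture group only
```

With index 0 the surrounding body tags are kept; with index 1 only the content between them is returned.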

Testing the Conversion with TestWizard

The module provides a TestWizard tool to validate your LibreOffice
configuration and preview the conversion pipeline before enabling the widget on content
fields.

Using the TestWizard typically involves:

  1. Navigate to /admin/config/content/doc_to_html/test-wizard.
  2. Upload a sample document using one of the file types enabled in Basic Settings.
  3. Run the conversion to review:
    • The raw HTML generated by LibreOffice.
    • The extracted <body> segment after applying the body regex.
    • The final preview after DOM regex cleanup is applied.
  4. Save the tested regex settings if the output matches the expected result.

Once the test behaves as expected, you can enable the widget on content fields.

Enabling the Widget on Content Fields

After configuring the global settings and validating them with the TestWizard,
enable the Doc to HTML widget on specific fields in your content types.

The widget is designed to work with:

  • text_long fields
  • text_with_summary fields

To enable the widget for a given field:

  1. Go to Structure → Content types and select the content type
    you want to configure.
  2. Open the Manage form display tab.
  3. Locate the text_long or text_with_summary field that should
    receive the converted HTML.
  4. In the widget selector for that field, choose DOC to HTML.
  5. Configure the widget settings and save the form display.

Widget Settings

Each widget instance can override or extend the global behavior.

  • Apply body extraction regex: applies the body extraction regex configured
    in Basic Settings to the conversion output.
  • DOM regex override: overrides the global DOM regex for this widget instance.
  • Maximum file size: limits the uploaded document size. Use 0 for no widget-level limit.
  • Source file field: optionally saves the original uploaded document into
    a file or image field on the same entity.
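
How a widget instance combines its own settings with the global ones can be sketched as follows. The class and field names are hypothetical. The unit of the size limit (bytes here) and the fallback to the global DOM regex when the override is empty are assumptions, since the guide does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WidgetSettings:
    """Hypothetical per-widget settings; field names are illustrative."""
    apply_body_regex: bool = True
    dom_regex_override: Optional[str] = None
    max_file_size: int = 0  # 0 means no widget-level limit; bytes assumed


def effective_dom_regex(global_dom_regex: str, widget: WidgetSettings) -> str:
    """Use the widget override when set, otherwise the global DOM regex."""
    return widget.dom_regex_override or global_dom_regex


def size_allowed(size_bytes: int, widget: WidgetSettings) -> bool:
    """Check an upload against the widget limit; 0 disables the check."""
    return widget.max_file_size == 0 or size_bytes <= widget.max_file_size
```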

How the Doc to HTML Widget Works

Once the Doc to HTML widget is enabled for a field, the node edit form includes an
additional DOC to HTML file section.

  • The widget adds a virtual managed upload element used to upload the source document.
    This upload element is part of the widget UI and is not itself a separate content field.
  • The editor uploads a document and clicks Convert.
  • The module:
    1. Receives the uploaded document through the widget upload element.
    2. Calls LibreOffice using the configured base path, command, and timeout.
    3. Converts the document to HTML.
    4. Optionally extracts the configured <body> match.
    5. Optionally applies DOM regex cleanup or custom post-processing.
    6. Injects the final HTML into the target text_long or text_with_summary field.
  • The converted HTML is injected into the form state and editor field. The entity is
    not saved automatically when Convert is clicked.
  • The editor can review or edit the generated HTML in CKEditor, then click
    Save to persist the content.
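
The LibreOffice call in step 2 can be approximated with the sketch below. It uses LibreOffice's --headless, --convert-to, and --outdir options. The exact arguments the module passes are not documented in this guide, so treat the command line and the helper names as assumptions. The timeout handling follows the documented rule that 0 disables the timeout.

```python
import subprocess
from typing import List, Optional


def build_command(executable: str, source: str, outdir: str) -> List[str]:
    """Build a headless LibreOffice command that writes HTML into outdir."""
    return [executable, "--headless", "--convert-to", "html",
            "--outdir", outdir, source]


def timeout_or_none(seconds: int) -> Optional[int]:
    """Map the '0 disables the timeout' rule to subprocess semantics."""
    return None if seconds == 0 else seconds


def convert(executable: str, source: str, outdir: str, timeout: int = 60) -> bool:
    """Run the conversion.

    Returns False on timeout, on a missing or unusable executable, or on a
    non-zero exit code; returns True otherwise.
    """
    try:
        result = subprocess.run(
            build_command(executable, source, outdir),
            capture_output=True,
            timeout=timeout_or_none(timeout),
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.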

Saving the Source Document

By default, the uploaded source document is treated as an intermediate conversion input.
If the widget is configured with a Source file field, the original document
is promoted to a permanent file and saved into that file or image field when the entity
is saved.

This is useful when the site needs to keep the original DOC/DOCX file available for
download or archival purposes. The selected source field should be a normal file or
image field on the same content type.

Extensibility

Other modules can customize the conversion workflow with hooks or Symfony events.

  • hook_doc_to_html_pre_convert() can modify conversion options or cancel the conversion.
  • hook_doc_to_html_post_convert() can inspect or replace the final HTML.
  • Symfony event subscribers can listen to doc_to_html.pre_convert and
    doc_to_html.post_convert.
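
The pre-convert and post-convert extension points follow a common pattern, sketched here in Python rather than as Drupal hook implementations. The real hook signatures and event classes are not shown in this guide, so every name below is hypothetical, and cancelling by returning False is an assumption.

```python
from typing import Callable, Dict, List, Optional

Options = Dict[str, object]


def run_conversion(
    options: Options,
    convert: Callable[[Options], str],
    pre_handlers: List[Callable[[Options], bool]],
    post_handlers: List[Callable[[str], str]],
) -> Optional[str]:
    """Run pre handlers, convert, then run post handlers.

    A pre handler may mutate options and returns False to cancel (an
    assumption of this sketch). A post handler returns the HTML to keep.
    """
    for pre in pre_handlers:
        if pre(options) is False:
            return None  # cancelled before the converter runs
    html = convert(options)
    for post in post_handlers:
        html = post(html)
    return html
```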

Drush Commands

The module also provides Drush commands for debugging and maintenance.

  • drush dth:version: show the detected LibreOffice version.
  • drush dth:clean: clean the conversion temporary folder.
  • drush dth:convert /path/to/document.docx: convert a document and print the generated HTML.

Typical Workflow Summary

  1. Install LibreOffice on the server or container where Drupal PHP runs.
  2. Configure the LibreOffice base path, command name, and timeout.
  3. Configure supported file types, UTF-8 handling, and optional body extraction settings.
  4. Use the TestWizard to validate LibreOffice and preview the generated HTML.
  5. Enable the DOC to HTML widget on a text_long or text_with_summary field.
  6. Optionally configure a source file field if the original document should be kept.
  7. Editors upload a document, click Convert, review the generated HTML in CKEditor, and click Save.

With this setup, the Doc to HTML module provides a configurable workflow for converting
office documents into clean Drupal text field content, using LibreOffice and the Drupal
Form API.
