HTML import logo
Department of the Prime Minister and Cabinet - Annual Report 2013-14 imported to govCMS using the HTML Import module. Creative Commons licence

Introduction

This module divides a single large HTML document into a structured Drupal book that respects the document's heading hierarchy. It works with HTML exported from Word, HTML converted from PDF, and HTML exported from Adobe InDesign. The purpose of this module is to provide a way for legacy documents to meet WCAG accessibility requirements. Converting the documents into HTML also makes full-text search easier.

What's new in 2.x?

We will be releasing the 2.x version of this module very soon! Those who want to try out the 2.x version before its formal release are more than welcome to download the development branch.

New features of the 2.x version include:
- Thoroughly tested under GovCMS 7.x-2.0-beta3
- An example sub-module that automatically creates book parent (publication) and book (publication section) pages so you don't have to set them up yourself.
- Entity reference support
- Workbench support
- Content administrators can specify the language of imported pages if the Locale module is enabled
- Anchor reference links are re-established using aliased URLs when they are available
- Some other minor bug fixes and improvements
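
To picture what re-establishing anchor links with aliased URLs involves, here is a minimal Python sketch. It is not the module's code (the module itself is written for Drupal). The function name `relink_anchors`, the regular expressions and the sample aliases are all assumptions made for illustration. The idea: once a document is split, a link such as `href="#fig1"` may point at an anchor that now lives on another page, so it is rewritten to `alias#fig1`.

```python
import re

# Crude patterns for a sketch; a real importer would use an HTML parser.
ID_RE = re.compile(r"""(?<![\w-])(?:id|name)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
HREF_RE = re.compile(r"""href\s*=\s*(["'])#([^"']+)\1""", re.IGNORECASE)


def relink_anchors(pages):
    """Rewrite in-document anchor links that now cross page boundaries.

    `pages` maps a (hypothetical) URL alias to that page's HTML body.
    Returns a new dict with the same keys and rewritten bodies.
    """
    # Record which page each anchor ended up on (first occurrence wins).
    anchor_home = {}
    for alias, body in pages.items():
        for anchor in ID_RE.findall(body):
            anchor_home.setdefault(anchor, alias)

    def rewrite(alias, body):
        def repl(match):
            quote, anchor = match.group(1), match.group(2)
            target = anchor_home.get(anchor)
            # Leave unknown anchors and same-page anchors untouched.
            if target is None or target == alias:
                return match.group(0)
            return f"href={quote}{target}#{anchor}{quote}"

        return HREF_RE.sub(repl, body)

    return {alias: rewrite(alias, body) for alias, body in pages.items()}


pages = {
    "/report/intro": '<p id="top">See <a href="#fig1">Figure 1</a> and <a href="#top">top</a>.</p>',
    "/report/results": '<a name="fig1"></a><p>Figure 1</p>',
}
relinked = relink_anchors(pages)
```

In this sketch only the link to `fig1` changes, because its target sits on a different page; the link to `top` stays a same-page anchor.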

Workbench Integration

We are pleased to announce that we have commenced work to integrate HTML Import with Workbench to provide full content workflow support. Functions such as scheduled publishing/unpublishing and content review will soon be supported.

govCMS

This module is compatible with the Australian Government's govCMS distribution and can be used to easily import reports for agencies.

GovCMS users are advised to use the 7.x-2.x-dev branch of the HTML Import module for best compatibility. This module has been successfully tested against GovCMS 7.x-2.0-beta3.

We have used this module to produce HTML versions of reports for Australian government agencies. Some examples of our past projects are:

Sponsorship

This project is sponsored by XiNG Digital

Main features

The main features of this module are:

  • Allows the user to specify the heading levels at which the HTML document is divided and imported. For example, if H3 is selected under "Heading level depth", each H1, H2 and H3 will become a separate page in the imported book.
  • Imports images. The module scans and fixes the paths of the images referenced by imported pages.
  • Respects the heading level hierarchy of the source document by reconstructing the same book hierarchy.
  • Scans and re-creates reference links. Reference links in the source HTML may end up on different book pages after import. The module scans and re-links them to maintain the integrity of the document.
  • Scans and moves footnotes/endnotes. If footnotes/endnotes are well-formatted, the module moves them to the sections where they are referenced for easy reading.
  • Meets WCAG accessibility requirements. The module preserves accessibility properties of the source HTML such as ALT text.
  • Cleans up undesirable Word characters, such as smart quotes, in titles for clean URL creation.
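
The "Heading level depth" behaviour described in the first feature can be illustrated with a short Python sketch. This is not the module's implementation; `split_by_heading` and the regular expression are assumptions for illustration, and a regex is used only to keep the example small. With a depth of 3, every H1, H2 and H3 starts a new section, deeper headings stay inside the body of their section, and each section records its parent so the book hierarchy can be rebuilt.

```python
import re

# Matches <h1>..</h1> through <h6>..</h6>; good enough for a sketch,
# not a substitute for a real HTML parser.
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def split_by_heading(html, depth):
    """Split `html` into sections at every heading of level <= depth.

    Returns a list of dicts with 'level', 'title', 'body' and 'parent'
    (index of the parent section, or None for a top-level section).
    Content before the first qualifying heading is ignored in this sketch.
    """
    matches = [m for m in HEADING_RE.finditer(html) if int(m.group(1)) <= depth]
    sections = []
    last_by_level = {}  # heading level -> index of the latest section at that level
    for i, m in enumerate(matches):
        level = int(m.group(1))
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        title = re.sub(r"<[^>]+>", "", m.group(2)).strip()

        # The parent is the most recent section with a shallower heading level.
        parent = None
        for shallower in range(level - 1, 0, -1):
            if shallower in last_by_level:
                parent = last_by_level[shallower]
                break

        # Forget same-or-deeper levels so they cannot adopt later sections.
        for stale in [lvl for lvl in last_by_level if lvl >= level]:
            del last_by_level[stale]
        last_by_level[level] = len(sections)

        sections.append({
            "level": level,
            "title": title,
            "body": html[m.end():end].strip(),
            "parent": parent,
        })
    return sections


sample = (
    "<h1>Report</h1><p>Intro</p>"
    "<h2>Part A</h2><p>a</p>"
    "<h3>Detail</h3><p>d</p><h4>Sub</h4><p>s</p>"
    "<h2>Part B</h2><p>b</p>"
)
sections = split_by_heading(sample, 3)
```

Here the H4 stays inside the "Detail" page because it is deeper than the chosen depth, and "Part B" is attached to "Report" rather than to "Detail".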

Demonstration

An example of the output, converted from PDF, is here: http://doccloud.com.au/docs/food-processing-industry-strategy-group-fina....

XiNG Digital has also developed techniques to convert almost any PDF document to WCAG 2.0 compliant HTML that can be used by our document importer. Please contact the maintainers if you would like to learn more.

Installation and configuration

  1. Download and enable this module as well as its dependencies
  2. Create or modify the content type to which the imported book pages will be attached. Be sure this content type has a file field with the machine name "field_images". This field should be single-value and accept only "zip" files. It does not need to be mandatory.
  3. Create or modify the content type for imported pages. Be sure this content type has the following fields:
    1. Footnotes. A "Long text" field that stores the footnotes/endnotes a section references.
    2. Imported images. A file field with the machine name "field_html_import_images" that allows an unlimited number of values and accepts only the image formats you want to allow. Because this module may import a large number of image files, it is desirable to keep those images in directories that correspond to their imported pages. The File (field) path module lets you assign a path such as "documents/[node:nid]/images" to keep the file system neat and tidy.
    3. Publication parent. A single value "Node reference" field that only allows references to the type of content specified in Step 2 above. This field will allow us to assign hierarchical URL aliases to imported pages.
  4. Go to Structure > Feeds importers > Add importers and follow these steps:
    1. Basic settings > Make the content type in Step 2 above the "Attach to content type". Be sure "Periodic import" is "Off"
    2. Fetcher > Change > Choose "HTML Import Fetcher"
    3. Fetcher > Settings > Make sure "Allowed file extensions" only allows HTML extensions such as "html"
    4. Parser > Change > Choose "HTML Import Parser"
    5. Processor > Change > Choose "HTML Import Processor"
    6. Processor > Settings > Be sure the content type in Step 3 above is selected in the bundle field. Be sure "Update existing nodes" is selected. Be sure "Full HTML" or equivalent is selected under "Text format".
    7. Processor > Mapping > Title maps to Title, Body maps to Body, Footnotes maps to Footnotes and Book ID maps to Publication parent (Node reference by node ID).
  5. Content > Books > Settings > Make sure "Content types allowed in book outlines" and "Content type for child pages" reflect the correct content type in Steps 2 and 3 above.
  6. Create new content using the type in Step 2. Upload a zipped directory of images to the file field created in Step 2. Note that all image files need to be stored in a single directory (named, for example, "images") and the source HTML needs to reference images in this directory. Upload the source HTML to the File field under the Feed field group. Choose your desired heading level depth. Follow the on-screen instructions for the remaining fields if you wish.
  7. Save and import.
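
Before uploading, it can help to check that the zipped image directory and the source HTML agree with each other, as Step 6 requires. The Python sketch below is a local pre-flight helper, not part of the module. The function names are hypothetical, and it assumes the zip entries keep the directory name as a prefix (for example `images/fig1.png`) and that the HTML uses the same relative paths.

```python
import re
import zipfile
from pathlib import Path

# Pulls the src value out of <img> tags; a sketch, not a full HTML parser.
IMG_SRC_RE = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def zip_image_dir(image_dir, zip_path):
    """Zip `image_dir` so entries keep the directory name, e.g. images/fig1.png."""
    image_dir = Path(image_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(image_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the "images/" prefix survives.
                zf.write(path, path.relative_to(image_dir.parent).as_posix())
    return zip_path


def unresolved_image_refs(html, zip_path):
    """Return the <img src> values in `html` that have no matching zip entry."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    return [src for src in IMG_SRC_RE.findall(html) if src not in names]
```

A typical use would be to call `zip_image_dir("images", "images.zip")` next to the exported HTML file and then confirm that `unresolved_image_refs(html_text, "images.zip")` returns an empty list before importing.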

Working with Microsoft Word

A short tutorial on converting a Word document and preparing it for import

Example

An example report you can use to test this module is provided. This example, the Australian Haemovigilance Report, was recently processed by us from a Word document for the National Blood Authority. It is licensed under the Creative Commons Attribution 4.0 licence. The final report is also available on the National Blood Authority's website.

Additional notes

Note 1: Some changes to the display of the imported pages such as the table of contents menu, book navigation and footnotes may need to be configured. The best place to start is to create your own instance of the template file book-navigation.tpl.php under your theme/module.

Note 2: The HTML Import module is memory bound. If you experience problems while importing HTML files, please first check your PHP memory limit. We have tested this module with a 256MB limit without seeing issues when importing HTML files converted from PDFs of about 200 to 300 pages. It is, however, advisable to increase your PHP memory limit to 512MB in a production environment to ensure there are sufficient resources for other critical production web functions.

Note 3: During the development phase of this module we have imported more than 100,000 pages converted from PDF, Word and InDesign across a number of websites. The largest single document it has successfully imported has more than 1,000 pages.

Note 4: The HTML Import module ships with a small utility that consolidates imported files and saves the HTML content to a "full-text" text field of the book parent page. This field allows full-text search to match both the parent page and any of its children if they contain any of the search words. The Apache Solr Views module could be used to provide a very useful full-text search function.
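
As a rough illustration of the consolidation idea in Note 4, the Python sketch below reduces each child page's HTML to plain text and joins the results into one searchable string for the parent. It is only a sketch under stated assumptions: the names `html_to_text` and `consolidate_full_text` are hypothetical, and the module's own utility may store or normalise the content differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of `html` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so adjacent blocks such as <h2> and <p> do not run together.
    return " ".join(" ".join(parser.chunks).split())


def consolidate_full_text(page_bodies):
    """Concatenate the text of all child pages into one full-text string."""
    texts = (html_to_text(body) for body in page_bodies)
    return " ".join(text for text in texts if text)


full_text = consolidate_full_text([
    "<h2>Blood &amp; plasma</h2><p>Adverse  events</p>",
    "<p>Second <em>page</em></p>",
    "",
])
```

Joining text nodes with a space favours search matching over exact fidelity: a word split by inline markup would be broken into two tokens, which is a deliberate simplification in this sketch.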
