HTML and other files as sources

File sources can be used to migrate legacy HTML sources, as described here. For more on using these sources to import existing files into the Drupal file system, see the MigrateFileUri section of MigrateDestinationFile.

It's the worst-case scenario of content migration: importing directly from raw HTML files (hello, it's 1995 calling, they want their browser back). Migrate can't help you much with the hard part, which is parsing the particular templates (or lack thereof) in your HTML content (you may want to check out QueryPath for that). But there are classes to help you iterate over your HTML files and keep track of what's been imported, using the list pattern. You need to create instances of a list class (which is responsible for traversing directory structures to find files to import) and an item class (responsible for pulling the raw content from a single file).

First, you need an instance of MigrateListFiles:

$directories = array(
  '/var/html_source/en/news',
  '/var/html_source/fr/news',
  '/var/html_source/zh/news',
);
$base_dir = '/var/html_source';
$file_mask = '/(.*\.htm$|.*\.html$)/i';
$list = new MigrateListFiles($directories, $base_dir, $file_mask);

The first argument is an array of directories to search for source files. The second is the prefix to strip from the full file specifications - what remains after this prefix is stripped will serve as the unique ID for each file. In this case, the file /var/html_source/en/news/todays_big_story.html will be tracked by Migrate using the unique identifier '/en/news/todays_big_story.html'. The third argument is a regular expression used to filter the file list - in this case, only .htm and .html files will be included in the list.
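To make the ID and filter rules concrete, here is a minimal standalone sketch of how a path could be turned into a source ID. The helper example_source_id() is hypothetical and not part of the Migrate API. It assumes the mask is tested against the file name only, which is how Drupal 7's file_scan_directory() applies its mask, and that the base directory is stripped as a simple prefix.

```php
<?php
// Hypothetical helper for illustration only; not part of the Migrate API.
// Returns the source ID for a file, or NULL if the mask filters it out.
function example_source_id($path, $base_dir, $file_mask) {
  // Assumption: like Drupal 7's file_scan_directory(), test the mask against
  // the file name only, not the full path.
  if (!preg_match($file_mask, basename($path))) {
    return NULL;
  }
  // Strip the base directory prefix; what remains is the unique ID.
  if (strpos($path, $base_dir) === 0) {
    return substr($path, strlen($base_dir));
  }
  return $path;
}

$base_dir = '/var/html_source';
$file_mask = '/(.*\.htm$|.*\.html$)/i';

$paths = array(
  '/var/html_source/en/news/todays_big_story.html',
  '/var/html_source/fr/news/index.HTM',
  '/var/html_source/en/news/logo.png',
);
foreach ($paths as $path) {
  $id = example_source_id($path, $base_dir, $file_mask);
  echo $path, ' => ', ($id === NULL ? '(skipped)' : $id), "\n";
}
```

If the mask really is applied to the file name alone, a pattern that tries to exclude a directory (for example one starting with ^(?!/images/)) never sees the directory part. In that case, directory-based filtering probably has to be done through the $directories array instead.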

There is also an optional fourth parameter, an array of options to be passed to file_scan_directory, and an optional fifth parameter, an instance of a MigrateContentParser class (explained below).

Next you need to create the item object:

$item = new MigrateItemFile($base_dir);

We pass the same prefix as we did to the list class, so the full filespec can be recreated.

We then use these two objects, plus an array documenting the generated source fields, to create the source object:

$fields = array('title' => t('Title'), 'body' => t('Body'));
$this->source = new MigrateSourceList($list, $item, $fields);
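The title and body fields declared here still have to be extracted from each file's raw HTML yourself. As a rough sketch, the standalone function below uses PHP's DOMDocument for that. The function name example_parse_html() is hypothetical, and it assumes the title lives in the title element and the content in the body element, which your legacy templates may not honor. In a migration you would probably call something like it from prepareRow(), on whatever property holds the raw file content (assumed here to be a plain string, often $row->filedata; check this against your Migrate version).

```php
<?php
// Hypothetical helper for illustration only; not part of the Migrate API.
// Extracts a title and body from a raw HTML string.
function example_parse_html($html) {
  $doc = new DOMDocument();
  // Legacy HTML is rarely valid; collect parser warnings instead of emitting them.
  $previous = libxml_use_internal_errors(TRUE);
  $doc->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  // Assumption: the page title is the text of the first <title> element.
  $title = '';
  $titles = $doc->getElementsByTagName('title');
  if ($titles->length > 0) {
    $title = trim($titles->item(0)->textContent);
  }

  // Assumption: the content is everything inside the first <body> element.
  $body = '';
  $bodies = $doc->getElementsByTagName('body');
  if ($bodies->length > 0) {
    foreach ($bodies->item(0)->childNodes as $child) {
      $body .= $doc->saveHTML($child);
    }
  }

  return array('title' => $title, 'body' => trim($body));
}

$html = '<html><head><title> Big Story </title></head><body><p>Hello</p></body></html>';
$parsed = example_parse_html($html);
echo $parsed['title'], "\n", $parsed['body'], "\n";
```

Real pages usually need template-specific selectors (a particular div or table cell rather than the whole body), which is where a library such as QueryPath may be more convenient than raw DOM calls.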

See http://fourkitchens.com/blog/2012/05/04/migrating-old-html-files-drupal for more hints on importing HTML files using Migrate.

Also see https://gist.github.com/marktheunissen/2596787 for a complete example for migrating HTML content.

MigrateContentParser

TBD...

Comments

problem with $file_mask

Bhanuji commented 26 November 2013 at 10:37

For some regular expressions, the file mask does not work as expected...

$file_mask = '~^(?!/images/)[a-zA-Z0-9/-]+(?!_ss\d|\d)\.html$~ ';

$arr = array(
'/magazines/sample.html',
'/test/index.html',
'/test/format_ss1.html',
'/test/folder/newstyle_1.html',
'/images/two.html'
);
foreach($arr as $str) {
    if (preg_match('~^(?!/images/)[a-zA-Z0-9/-]+(?!_ss\d|\d)\.html$~', $str))
        echo $str,"\n";
}

Output:
/magazines/sample.html
/test/index.html

This works fine in a standalone PHP script (result above), but not in Drupal. The result in Drupal is below; note that "/images/two.html" is included.

/magazines/sample.html
/test/index.html
/images/two.html
