Migration Tools

CONTENTS OF THIS FILE
* Introduction
* Features
* Drush Commands
* Installation
* Configuration
* Requirements
* Migration Development
* Advanced Methods
* Bonus Features
* Maintainers


Introduction

Migration Tools does nothing by itself. It simply adds classes and methods that
developers can use within their custom migration classes to make migrations
easier and more reliable. It also contains example classes to illustrate how
the tools can be utilized.

Features

  • CheckFor: Common checks that can be implemented in prepareRow to evaluate
    and report on the results.
  • Html: Methods for sanitizing or cleaning up HTML pages or body
    content using QueryPath.
  • Message: A messaging class to handle outputting useful information
    to the terminal when running migrations or logging in Watchdog.
  • NodeTools: Methods for processing nodes.
  • String: Methods for cleaning up aspects of string content.
  • TaxonomyTools: Methods for processing vocabularies and terms.
  • Url: Methods for handling URLs and processing redirects.
  • SourceParsers: A variety of parsers that can be used directly, or as an
    example of a parser.
  • Obtainers: A collection of methods for processing HTML files to extract
    titles, dates, id numbers and other items from HTML pages with inconsistent
    structure. Use of Obtainers requires the QueryPath library, either through
    the QueryPath module or installed separately as a library from
    https://github.com/technosophos/querypath
  • Examples: Migration classes meant to demonstrate the use of Migration Tools
    for a migration that you can copy and modify in your custom migration module.

Drush Commands

mt-generate-redirects-list [filename]

Point it at a file of URLs (one per line) and it will scan each URL for
server-side and HTML-based redirects. If it finds a redirect, it adds it
to that URL's row in the file and saves the file. The file must reside in your
custom migration module at custom_migration_module/redirect_source/[filename]

Copying unmanaged images

incomplete documentation

A drush command recursively copies image files from the migration source
directory, $conf["migration_tools_base_dir"], to the Drupal files destination,
public://, for any given organization:

drush mt-migrate-images

Generating and Importing Menus

incomplete documentation

There are two drush commands to generate a menu and one to import a menu.
A menu can be generated in two ways:
* Generating a menu from a CSV file.
* Generating a menu by crawling the menu HTML on the legacy site.

The generating command looks like this:

drush mt-generate-menu-import-file menu_name --css-selector="#navbar" --local-base-uri="subpath" --menu-location-uri="subpath"

The only parameter required is the abbreviation of the migration, in this
case menu_name.

Other configuration is optional:

  • css-selector should be a css selector pointing to the outer-most ul of the
    menu in the live site
  • local-base-uri should be the path to where the content is locally after a
    migration has been run (ex. ag or usao-nm)
  • menu-location-uri should be a page in the live site containing the menu that
    we want to generate

The migration must be run first so that the content is present locally; this
is a prerequisite of menu generation. Once the menu has been generated, it can
be imported.

The import command is:

drush mt-import-group-menu menu_name.txt subpath

The command takes the file where the menu is (the script assumes this file is
inside the sources directory) and the abbreviation of the migration.

incomplete documentation

Discovery and Planning

Various utilities are available to assist in the process of discovering content
in your migration source, which may be helpful in planning and facilitating
discussion around your migrations.

  • drush mt-list-html-files [directory_name]
    Scans a specified directory and its subdirectories and lists all htm(l) files.
    Particularly useful if you are migrating from a site that did not use a CMS and
    need to generate a list of pages for content audit/weeding.

  • drush mt-list-html-directories [directory_name]
    Scans a directory and lists all directories containing htm(l) files.

  • drush mt-list-file-directories [directory_name]
    Scans a directory and lists all directories containing binary files that are
    not image files. (doc, pdf, ppt…)

Debugging and Iterations

Two settings, stored as Drupal variables, can aid in building and debugging a
migration. The default for each is FALSE, but it can be overridden in your
settings.local.php

  • migration_tools_drush_debug
    Prints output to the terminal on a file-by-file basis, showing which
    elements are being found by the obtainers and migrated. Default is FALSE.
    Enable debug output: drush vset migration_tools_drush_debug TRUE
    More verbose output: drush vset migration_tools_drush_debug 2

  • migration_tools_drush_stop_on_error
    When migration_tools_drush_debug is TRUE, stops the migration when the
    messaging system throws a warning of level WATCHDOG_ERROR,
    WATCHDOG_CRITICAL, WATCHDOG_ALERT, or WATCHDOG_EMERGENCY.
    Default is FALSE.
    Enable stop on error: drush vset migration_tools_drush_stop_on_error TRUE
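
As a sketch of the settings.local.php override mentioned above, the same two
variables can be set through the standard Drupal 7 $conf array. This assumes
your settings.php includes settings.local.php; the values shown are only an
example.

```php
<?php
// settings.local.php (sketch): override the Migration Tools debug variables.
// In Drupal 7, values set in $conf take precedence over values stored in the
// database with "drush vset".

// TRUE enables debug output; 2 enables the more verbose output.
$conf['migration_tools_drush_debug'] = TRUE;

// Stop on messages of level WATCHDOG_ERROR or more severe.
$conf['migration_tools_drush_stop_on_error'] = TRUE;
```

Keeping these in settings.local.php keeps debug settings confined to your own
machine.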

Installation

  • Enable this module

Configuration

  • Set $conf["migration_tools_base_dir"] correctly for your local machine. This
    is the location for your migration source files. (parallel to docroot)
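
A minimal sketch of that setting in settings.local.php. The path below is
hypothetical, and whether the module expects an absolute path or a trailing
slash is an assumption you should check against your own setup.

```php
<?php
// settings.local.php (sketch): point Migration Tools at the source files.
// '/var/www/myproject/migration-source' is a hypothetical location that sits
// parallel to docroot; replace it with the real path on your machine.
$conf['migration_tools_base_dir'] = '/var/www/myproject/migration-source';
```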

Requirements

  • Migrate
  • Pathauto
  • Redirect

Migration Development

incomplete documentation

Look through the source files to see if there is garbage to exclude.

There is no real science here. Just poke around and look for things like /old or /original or header.html or footer.html or test.html. Anything under a path like /old can be removed from the array of source directories in the migration class.

// Define source directories.
$source_dirs = array(
  0 => 'subdirectory/corruption/mango',
  1 => 'subdirectory/corruption',
  2 => 'subdirectory/thing/corruption/then',
  3 => 'subdirectory/thing/corruption/now',
  4 => 'subdirectory/thing/old',  // Just delete this from this array if it should not be migrated.
);

Excluding pages

Sometimes there are pages that should not be migrated.
They can be skipped in prepareRow within the migration class like this:

  /**
   * {@inheritdoc}
   */
  public function prepareRow($row) {
    $skip_these = array(
      '/subdirectory/must/must_index.html',
      '/subdirectory/ski/ski_index.html',
    );

    if (mt_migration_skip_file($row->fileId, $skip_these) || (parent::prepareRow($row) === FALSE)) {
      return FALSE;
    }
  }
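
The skip check above can be pictured with the following stand-alone sketch.
This README does not show the internals of mt_migration_skip_file(), so
example_should_skip() is a hypothetical stand-in that assumes an exact match
between the file id and an entry in the skip list.

```php
<?php
/**
 * Hypothetical stand-in for a skip check (not the module's implementation).
 *
 * Returns TRUE if $file_id exactly matches an entry in $skip_these.
 */
function example_should_skip($file_id, array $skip_these) {
  // Strict comparison avoids accidental matches between loosely equal values.
  return in_array($file_id, $skip_these, TRUE);
}

$skip_these = array(
  '/subdirectory/must/must_index.html',
  '/subdirectory/ski/ski_index.html',
);

// A listed file is skipped; an unlisted file in the same directory is not.
var_dump(example_should_skip('/subdirectory/ski/ski_index.html', $skip_these));
var_dump(example_should_skip('/subdirectory/ski/other.html', $skip_these));
```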

Additionally, you can make the migration stop when a title or date is missing by setting this variable:

drush vset migration_tools_drush_stop_on_error TRUE

Migrate the images

Images are pretty straightforward and require little effort. The path argument is for the original location of the images in the source.

drush mt-migrate-images subdirectory

Copy and paste the report from your terminal into a comment on the ticket.

If this is an organization, then you need to open .htaccess and add the
group abbreviation to the rewrite rule on [this line].

Migrate Content Type: Page

The first time you run a migration you will want to migrate only a few pages to
see that the Obtainers are looking in the right spot to get the title and the
body. To run just a few you can limit the migration like this:

 drush migrate-import [PageMigrationName] --limit='2 items'

The debug output in the terminal should show you what the title looks like.
If the title does not match the title of the original page, create/adjust the
plucker methods being used. Inspect the source to see what you should be
targeting with a plucker. Finders and Pluckers run in the Obtainer. Finders
try to find items but do not remove them from the Document. Pluckers will
remove what they find from the Document if what they find passes the validation
within the Obtainer.
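
The difference between a finder and a plucker can be sketched as follows. This
is an illustration only: it uses PHP's built-in DOM extension instead of
QueryPath, and example_find_text() and example_pluck_text() are hypothetical
functions, not methods of the module's Obtainers. The "validation" here is
simply that the text is not empty.

```php
<?php
/**
 * Hypothetical "finder": returns the text of the first match.
 *
 * The document is left untouched.
 */
function example_find_text(DOMDocument $doc, $xpath_query) {
  $xpath = new DOMXPath($doc);
  $nodes = $xpath->query($xpath_query);
  if ($nodes === FALSE || $nodes->length === 0) {
    return '';
  }
  return trim($nodes->item(0)->textContent);
}

/**
 * Hypothetical "plucker": returns the text and removes the node if valid.
 */
function example_pluck_text(DOMDocument $doc, $xpath_query) {
  $xpath = new DOMXPath($doc);
  $nodes = $xpath->query($xpath_query);
  if ($nodes === FALSE || $nodes->length === 0) {
    return '';
  }
  $node = $nodes->item(0);
  $text = trim($node->textContent);
  // Stand-in validation: only a non-empty string counts as usable.
  if ($text !== '') {
    $node->parentNode->removeChild($node);
  }
  return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><h1>Press Room</h1><p>Body text.</p></body></html>');

// Finding leaves the h1 in place; plucking removes it from the document.
$found = example_find_text($doc, '//h1');
$count_after_find = (new DOMXPath($doc))->query('//h1')->length;
$plucked = example_pluck_text($doc, '//h1');
$count_after_pluck = (new DOMXPath($doc))->query('//h1')->length;
```

Plucking the title is what keeps it from appearing a second time inside the
migrated body.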

Once the correct titles are coming in, view the imported nodes and check that
the body looks correct. There may be extraneous markup that needs to be
removed.

Obtainer->finder and plucker Method Tuning

You can tune the stack of finder and plucker methods in the migration class or
in the source parser. Anything done in the migration class itself will override
the defaults set in the source parser, so in general it is easiest to set them
in the migration class. Setting them in the source parser is a better option
only if you are going to reuse the source parser on another migration (which is
unlikely).

Overriding them in the migration class looks like this:

$title = new ObtainerJob("title", 'ObtainTitle');
$title->addSearch('pluckSelector', array("h1", 2));
$title->addSearch('findH1Any');
$arguments['obtainers_info'][] = $title;

Overriding them in the source parser looks like this:

protected function setDefaultObtainersInfo() {
  parent::setDefaultObtainersInfo();
  $ct = new ObtainerInfo('content_type');
  $ct->addMethod("findPRClassBreadcrumbPressRelease");
  $ct->addMethod("findPRImmediateRelease");
  $this->addObtainerInfo($ct);

  $date = new ObtainerInfo('date');
  $date->addMethod("pluckDivPAlignLeft");
  $date->addMethod('pluckSelector', array(".newsRight", 1));
  $date->addMethod('pluckXpath', array("/html/body/p[6]/span[1]/span/text()[1]", 1));
  $date->addMethod("findDivClassContentSubDiv3");
  $date->addMethod("pluckProbableDate");
  $this->addObtainerInfo($date);
}

Iterate through the migration a few at a time checking them as you go. You will need to rollback as you iterate. Continue tuning the Obtainer finder method stack and excluding pages if they should not be migrated.

drush migrate-rollback [PageMigrationName]

After you have tuned it enough that several pages import correctly, run the migration without the --limit option.

Examine the report at the end of the migration to see if pages came in with duplicate titles or missing titles. Investigate the reason. Sometimes titles are actually repeated in the source files, sometimes you end up grabbing a page element that is not the title. Fix if needed, then roll back, migrate, and repeat.

When satisfied with the migration, copy the command and report and put it in as a comment in the ticket.

Specific pages can be iterated (rather than migrating and rolling back the entire migration) by targeting each page's file id.

drush migrate-import [PageMigrationName] --idlist="/musto_index.html"
# And
drush migrate-rollback [PageMigrationName] --idlist="/musto_index.html"

Migrate Menu

When you have all the content migrated on your machine, run one of these commands to build the menu.

drush mt-generate-menu-import-file subdirectory --menu-location-uri='subdirectory/index.htm' --local-base-uri='subdirectory' --css-selector='ul#navbar' --recurse='FALSE'

The options:
* menu-location-uri is the original path to the page where the menu exists.
* local-base-uri is the new location of the menu.
* css-selector is the CSS selector to use to find the menu on the page.
* recurse is whether or not to load and crawl the primary pages to obtain menu
  items from those pages.

This will crawl the old menu and generate a text file for importing the menu.
Scan the menu for odd URLs that may have gotten caught in the crawler. If a
full URL appears, it is an indication that the page has not been migrated yet,
so it links to the live site and will redirect accordingly when the page is
eventually migrated. An http:// makes Drupal treat it like an external link.
Example: http://www.oldsite.com/section/subsection/somepage.html

The following command will import the menu.

drush mt-import-group-menu my-menu.txt subdirectory

Advanced Methods

Sometimes you need some unusual methods:
* Restricting by Content Type
* Restricting by a subpath
* Restricting by Date

Restricting by Content Type

There are times when multiple content types reside in the same directory and
need to be limited so that a press release migration brings in only press
releases and skips everything else.
Here are the basic steps:

  1. Add the Obtainer for content_type to your migration.

    // Put this in your migration __construct().
    $type = new ObtainerInfo('content_type', "ObtainContentType");
    // Add any finders you need. Each one will take a shot at identifying the content type.
    $type->addMethod('findPRImmediateRelease');

    $arguments["obtainers_info"][] = $type;

  2. Add an "isType" check within prepareRow that will cause it to skip the document if it is not a "press_release". Using isType will cause it to give proper drush feedback.

    if (mt_migration_skip_file($row->fileId, $skip_these) || parent::prepareRow($row) === FALSE || !self::isType('press_release', $row)) {
      return FALSE;
    }

  3. If you need a new finder (a way to identify the kind of document you are looking for), add it to mt_migration/includes/obtainers/ObtainContentType.php. The only difference from a typical finder is that each one returns the machine name of the content type it thinks it found, or "" if it turned out to be something else. So any logic that checks for the type occurs within the finder rather than in validate, as is done in typical finders.
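
The return convention for a content type finder can be sketched as below. This
is a hypothetical free function working on a raw HTML string, not a method
from ObtainContentType.php, and the "FOR IMMEDIATE RELEASE" marker is an
assumed example of what a press release might contain.

```php
<?php
/**
 * Hypothetical content type finder (illustration only).
 *
 * Returns the machine name of the content type it recognizes, or '' if the
 * document looks like something else. The type check lives in the finder.
 */
function example_find_pr_immediate_release($html) {
  // Work on the visible text so markup does not interfere with the match.
  $text = strip_tags($html);
  if (stripos($text, 'FOR IMMEDIATE RELEASE') !== FALSE) {
    return 'press_release';
  }
  return '';
}

$press = '<p><b>For Immediate Release</b></p><p>Details follow.</p>';
$other = '<h1>Staff directory</h1>';

var_dump(example_find_pr_immediate_release($press));
var_dump(example_find_pr_immediate_release($other));
```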

Restricting by a subpath

It might be necessary (due to the recursive method of file discovery) to exclude
certain subpaths within the file system. This can be done by creating an array
of subpaths to exclude, then placing a call to isInPath() within prepareRow.

isInPath(array $paths, $row)

Returns true if the file being processed is within one of the paths specified.
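
A stand-alone sketch of a subpath check of this kind. example_is_in_path() is
hypothetical: it takes the file id as a string rather than a $row, and its
prefix matching is an assumption about how such a check could work, not the
module's isInPath() implementation.

```php
<?php
/**
 * Hypothetical subpath check (not the module's isInPath()).
 *
 * Returns TRUE if $file_id sits under any of the given $paths.
 */
function example_is_in_path(array $paths, $file_id) {
  $file_id = '/' . ltrim($file_id, '/');
  foreach ($paths as $path) {
    // Normalize to "/path/" so "thing/old" does not match "thing/older".
    $prefix = '/' . trim($path, '/') . '/';
    if (strpos($file_id, $prefix) === 0) {
      return TRUE;
    }
  }
  return FALSE;
}

$exclude = array('subdirectory/thing/old');

// In prepareRow, a TRUE result would be the cue to return FALSE and skip.
var_dump(example_is_in_path($exclude, '/subdirectory/thing/old/page.html'));
var_dump(example_is_in_path($exclude, '/subdirectory/thing/older/page.html'));
```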

Restricting by Date

If you are faced with press releases from multiple years all mixed in one
directory and need to restrict them to only those after 2012, use
mt_migration_date_after($date, $date_cutoff, $default = TRUE) in prepareRow.
Example:

   // Skip any press releases prior to 2013.
   if (!mt_migration_date_after($row->field_pr_date, '12/31/2012', FALSE)) {
     $message = '@fileid ------> Dated prior to 2013. Skipped: intentionally.';
     \MigrationTools\Message::make($message, array('@fileid' => $row->fileId), WATCHDOG_WARNING);
     return FALSE;
   }
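
The comparison behind a date cutoff can be sketched as follows.
example_date_after() is a hypothetical function, not the module's
mt_migration_date_after(). In particular, it assumes that $default is what
gets returned when a date cannot be parsed.

```php
<?php
// Fix the timezone so strtotime() behaves the same on every machine.
date_default_timezone_set('UTC');

/**
 * Hypothetical date cutoff check (not the module's implementation).
 *
 * Returns TRUE if $date is later than $date_cutoff. If either value cannot
 * be parsed, returns $default (an assumed meaning for that parameter).
 */
function example_date_after($date, $date_cutoff, $default = TRUE) {
  $date_ts = strtotime($date);
  $cutoff_ts = strtotime($date_cutoff);
  if ($date_ts === FALSE || $cutoff_ts === FALSE) {
    return $default;
  }
  return $date_ts > $cutoff_ts;
}

// Dated in 2013, so it passes the 12/31/2012 cutoff.
var_dump(example_date_after('01/15/2013', '12/31/2012', FALSE));
// Dated in 2012, so it does not.
var_dump(example_date_after('06/01/2012', '12/31/2012', FALSE));
// No usable date: falls back to the default passed in.
var_dump(example_date_after('', '12/31/2012', FALSE));
```

Passing FALSE as the default, as in the example above, means a document with
no usable date is skipped rather than migrated.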

BONUS

The following modules are not required, but if you have them enabled they will
improve the experience:

  • Markdown - When the Markdown filter is enabled, display of the module help
    will be rendered with markdown.

MAINTAINERS


This repo is present on https://www.drupal.org/project/migration_tools as well
as https://github.com/swirtSJW/migration_tools

Other documentation: http://web-dev.wirt.us/modules/migration-tools

Supporting organizations: 
Funded the original work, some of which now appears in this project.
Support the continued work on Migration Tools
