A reader is a class that fetches content from a resource (XML, CSV, SQL, ...). A basic reader should extend the FeedImportReader class, implement its three abstract methods and optionally override one more. FeedImportReader extends FeedImportConfigurable, which provides the setOptions() method that receives all user settings. These settings are saved in the protected $options array.

Let's see those abstract methods:

abstract public function init();

Here you should init your resources (open file handle, connect to database, ...).

abstract public function get();

Here you should return the next available item, or NULL/FALSE if there are no more items.

abstract public function map(&$obj, &$path);

Here you must return the value for the given path. $obj is usually the item returned by get(). If you want $path to be in another format (the default is a string), override the formatPath($path) method:

public function formatPath($path) {
  // Return an array of strings (split by slash).
  return explode('/', $path);
}

The formatPath() method is called only once for each path.
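
To make the three methods concrete, here is a minimal sketch of a reader that serves items from an in-memory array. ExampleArrayReader and its data are hypothetical, and the stub FeedImportReader is only a stand-in so the sketch runs outside Drupal; it mirrors just the methods described above and is not the module's real class.

```php
<?php
// Stand-in base class (an assumption, NOT the module's code). It is only
// declared when Feed Import is not loaded.
if (!class_exists('FeedImportReader')) {
  abstract class FeedImportReader {
    protected $options = array();
    abstract public function init();
    abstract public function get();
    abstract public function map(&$obj, &$path);
    public function formatPath($path) {
      return $path;
    }
  }
}

/**
 * Hypothetical reader that returns rows from an in-memory array.
 */
class ExampleArrayReader extends FeedImportReader {

  // Rows waiting to be returned by get().
  protected $rows = array();

  public function init() {
    // A real reader would open a file or a database connection here.
    $this->rows = array(
      array('title' => 'First', 'body' => 'Lorem'),
      array('title' => 'Second', 'body' => 'Ipsum'),
    );
    return TRUE;
  }

  public function get() {
    // array_shift() returns NULL once the array is empty.
    return array_shift($this->rows);
  }

  public function map(&$obj, &$path) {
    // formatPath() is not overridden, so $path is still a plain string.
    return isset($obj[$path]) ? $obj[$path] : NULL;
  }
}

$reader = new ExampleArrayReader();
$reader->init();
$path = $reader->formatPath('title');
$titles = array();
while ($item = $reader->get()) {
  $titles[] = $reader->map($item, $path);
}
// $titles now holds array('First', 'Second').
```

The loop at the end only imitates how a consumer might drive the reader: init() once, formatPath() once per path, then get() and map() for each item.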

Feed Import provides more classes that extend FeedImportReader:

  • FeedImportSimpleXPathReader - map() applies an xpath on a SimpleXMLElement object.
  • FeedImportDomXPathReader - map() applies an xpath on a DOMNode object using a DOMXPath object.
  • FeedImportVectorReader - map() returns values from a nested array (or object) using a given array of paths, each path being an array of strings that specifies the path to a value (like the path to a file in a filesystem). formatPath() is already implemented. This is used by the JSON reader provided by the Feed Import module.
  • FeedImportUniVectorReader - like FeedImportVectorReader, except that it doesn't support nested values. It's ideal for CSV, SQL result sets or similar. formatPath() is already implemented. This is used by the CSV/SQL readers provided by the Feed Import module.
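
The difference between the nested and the flat lookup can be sketched with two standalone functions. Both function names and the sample data are hypothetical; this only illustrates the idea and is not how the module's classes are actually written.

```php
<?php
/**
 * Nested lookup: $path is an array of keys, walked one level at a time.
 */
function example_nested_lookup($item, array $path) {
  foreach ($path as $key) {
    if (is_array($item) && array_key_exists($key, $item)) {
      $item = $item[$key];
    }
    elseif (is_object($item) && isset($item->$key)) {
      $item = $item->$key;
    }
    else {
      // The path does not exist in this item.
      return NULL;
    }
  }
  return $item;
}

/**
 * Flat lookup: $key addresses a column directly, there is no nesting.
 */
function example_flat_lookup(array $row, $key) {
  return array_key_exists($key, $row) ? $row[$key] : NULL;
}

// A JSON item has nested values, so it needs the nested lookup.
$json_item = json_decode('{"author": {"name": "Ann", "tags": ["a", "b"]}}', TRUE);
$name = example_nested_lookup($json_item, array('author', 'name'));
$tag = example_nested_lookup($json_item, array('author', 'tags', '1'));

// A CSV row is flat, so a single column index is enough.
$csv_row = array('42', 'Hello world');
$title = example_flat_lookup($csv_row, 1);
// $name is 'Ann', $tag is 'b' and $title is 'Hello world'.
```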

For more info please check the source code.


OK, now that we know the basics we can start creating our reader. For this example the reader will be a crawler, using Behat Mink. The crawler should navigate pages using a next link.

Mink accepts two major traversal methods: CSS selectors and xpaths. From an HTML element we can get its html content, its text content and its attributes. Let's define the following pattern for our paths:

text|html|@attr-name=css|xpath(mycss|myxpath)

Some examples:

  • text=css(#content) - gets the text content of element(s) matching specified css path
  • html=xpath(//div[@class="item"]) - gets the html content of element(s) matching specified xpath
  • @href=css(a.next-page) - gets the href attribute of element(s) matching specified css path

Besides html, text and attributes we also define element, which extracts nothing but is used for the parent path and the next link. Because Mink has a specific method for each action (getText, getHtml, getAttribute), we need a special formatPath():

  /**
   * {@inheritdoc}
   */
  public function formatPath($path) {
    /**
     * This will return an array like: array(
     *  'type' => 'css',
     *  'path' => 'div.content',
     *  'func' => 'getText',
     *  'arg'  => NULL,
     * )
     *
     * Path format can be:
     * html=css(mycss)
     * html=xpath(myxpath)
     * text=css(mycss)
     * text=xpath(myxpath)
     * @attribute-name=css(mycss)
     * @attribute-name=xpath(myxpath)
     */
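    // The path part is greedy and anchored, so xpaths may contain parentheses.
    if (preg_match('/^(?P<func>text|element|html|@[a-z0-9_-]+)=(?P<type>css|xpath)\((?P<path>.+)\)\s*$/iu', $path, $m)) {
      $path = array(
        'type' => strtolower($m['type']),
        'path' => $m['path'],
        // Stays NULL for "element", which only selects nodes.
        'func' => NULL,
        'arg'  => NULL,
      );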
      if ($m['func'][0] == '@') {
        // It is an attribute.
        $path['func'] = 'getAttribute';
        $path['arg'] = substr($m['func'], 1);
      }
      else {
        switch (strtolower($m['func'])) {
          case 'text':
            $path['func'] = 'getText';
            break;
          case 'html':
            $path['func'] = 'getHtml';
            break;
        }
      }
      return $path;
    }
    // Invalid path.
    throw new Exception('Invalid path given: ' . $path);
  }
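
To see what the formatted paths look like, here is a condensed standalone version of the same parsing idea. example_parse_path() is a hypothetical helper written for this illustration, not part of the module, and it can be run without Mink or Drupal.

```php
<?php
/**
 * Hypothetical helper that parses "what=how(selector)" paths.
 */
function example_parse_path($path) {
  $pattern = '/^(text|html|element|@[a-z0-9_-]+)=(css|xpath)\((.+)\)$/i';
  if (!preg_match($pattern, $path, $m)) {
    throw new InvalidArgumentException('Invalid path given: ' . $path);
  }
  // Method to call on the matched element; "element" extracts nothing.
  $methods = array('text' => 'getText', 'html' => 'getHtml', 'element' => NULL);
  $what = strtolower($m[1]);
  $is_attr = ($what[0] == '@');
  return array(
    'type' => strtolower($m[2]),
    'path' => $m[3],
    'func' => $is_attr ? 'getAttribute' : $methods[$what],
    'arg'  => $is_attr ? substr($m[1], 1) : NULL,
  );
}

$text = example_parse_path('text=css(#content)');
// array('type' => 'css', 'path' => '#content', 'func' => 'getText', 'arg' => NULL)

$href = example_parse_path('@href=css(a.next-page)');
// array('type' => 'css', 'path' => 'a.next-page', 'func' => 'getAttribute', 'arg' => 'href')

$root = example_parse_path('element=xpath(//div[@class="item"])');
// array('type' => 'xpath', 'path' => '//div[@class="item"]', 'func' => NULL, 'arg' => NULL)
```

Doing this work once in formatPath() means map() never has to parse a string: it only reads 'type', 'path', 'func' and 'arg' from the array.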

Great, now we only need to map those paths:

  /**
   * {@inheritdoc}
   */
  public function map(&$item, &$path) {
    if (!($values = $item->findAll($path['type'], $path['path']))) {
      // No results, give null.
      return NULL;
    }
    elseif (($count = count($values)) == 1) {
      // There is only one result, return the value
      return $values[0]->{$path['func']}($path['arg']);
    }
    // We have multiple results
    for ($i = 0; $i < $count; $i++) {
      $values[$i] = $values[$i]->{$path['func']}($path['arg']);
    }
    return $values;
  }

Done! This was the easy part. There are two more methods to do. Reading the Mink documentation, we can see that it supports multiple drivers (Goutte, Selenium, ...). Our reader should be an abstract class too, because each driver comes with different settings. So, let's create that class:

abstract class BehatReader extends FeedImportReader {

  // Mink session
  protected $session;
  // Path for next page link
  protected $next;
  // Page object
  protected $page;
  // Path for context
  protected $root;
  // Current page.
  protected $currentPage;
  // Max number of pages
  protected $maxPages;
  // Items found on the current page
  protected $items = array();

  /**
   * Gets the driver
   */
  abstract protected function getDriver();
}

We have some useful properties (like the session and the parent/next paths) and an abstract method getDriver(). Moving on to the init() method (in BehatReader):

  /**
   * {@inheritdoc}
   */
  public function init() {
    // Load Mink classes. This is only an example; you should have mink.phar in the same path as this file.
    require_once 'mink.phar';
    // Create a new session for the driver supplied by the subclass
    $this->session = new Behat\Mink\Session($this->getDriver());
    // Visit start page
    $this->session->visit($this->options['start']);
    // Get page object, we will need this for clicking next page link
    $this->page = $this->session->getPage();
    // First page
    $this->currentPage = 0;
    // Max pages
    $this->maxPages = $this->options['max'];
    // Format the next page path
    $this->next = $this->formatPath($this->options['next']);
    // Format the context path
    $this->root = $this->formatPath($this->options['parent']);
    // Get all items on start page in an array
    $this->items = $this->page->findAll($this->root['type'], $this->root['path']);
    // All good
    return TRUE;
  }

The items found are saved in the $items property, so get() should look like this:

  /**
   * {@inheritdoc}
   */
  public function get() {
    // If we have items or pages to navigate then return the next available item
    return ($this->items || $this->nextPage()) ? array_shift($this->items) : NULL;
  }

nextPage() will generate items for the next page:

  /**
   * Gets next page
   */
  protected function nextPage() {
    // Nothing to do if the session was already stopped
    if (!$this->page) {
      return FALSE;
    }
    // Check the page limit (0 means no limit) and the next page link
    if ((!$this->maxPages || ++$this->currentPage < $this->maxPages) &&
      ($link = $this->page->find($this->next['type'], $this->next['path']))) {
      // Navigate
      $link->click();
      // Get context items if any
      if ($this->items = $this->page->findAll($this->root['type'], $this->root['path'])) {
        // Ok, we have new items
        return TRUE;
      }
    }
    // No more items, that was the last page
    // Stop the session, and clean variables
    $this->session->stop();
    $this->page = NULL;
    return FALSE;
  }
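
The interplay between get() and nextPage() can be tried without a browser session. The sketch below replaces Mink with a plain array of "pages"; ExamplePagedQueue is a hypothetical class written for this illustration, and it assumes that a page limit of 0 means "no limit".

```php
<?php
/**
 * Hypothetical stand-in for a paged source: each inner array is one page.
 */
class ExamplePagedQueue {
  protected $pages;
  protected $items = array();
  protected $currentPage = 0;
  protected $maxPages;

  public function __construct(array $pages, $max_pages = 0) {
    $this->pages = $pages;
    $this->maxPages = (int) $max_pages;
    // Load the first page, as init() does.
    $this->items = $this->pages ? $this->pages[0] : array();
  }

  public function get() {
    // Refill from the next page only when the current one is drained.
    return ($this->items || $this->nextPage()) ? array_shift($this->items) : NULL;
  }

  protected function nextPage() {
    $next = $this->currentPage + 1;
    // 0 means "no limit" in this sketch.
    $allowed = !$this->maxPages || $next < $this->maxPages;
    if ($allowed && isset($this->pages[$next])) {
      $this->currentPage = $next;
      $this->items = $this->pages[$next];
      // An empty page ends the crawl, like an empty findAll() result.
      return (bool) $this->items;
    }
    return FALSE;
  }
}

$pages = array(array('a', 'b'), array('c'), array('d'));

$all = array();
$queue = new ExamplePagedQueue($pages);
while (($item = $queue->get()) !== NULL) {
  $all[] = $item;
}
// $all is array('a', 'b', 'c', 'd').

$limited = array();
$queue = new ExamplePagedQueue($pages, 2);
while (($item = $queue->get()) !== NULL) {
  $limited[] = $item;
}
// $limited is array('a', 'b', 'c'): only two pages were read.
```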

Great, now we have our abstract class, and the Goutte crawler is as simple as:

class GoutteCrawler extends BehatReader {

  /**
   * {@inheritdoc}
   */
  public function getDriver() {
    $client = new Behat\Mink\Driver\Goutte\Client();
    $driver = new Behat\Mink\Driver\GoutteDriver($client);
    if (isset($this->options['user_agent'])) {
      $driver->setRequestHeader('User-Agent', $this->options['user_agent']);
    }
    return $driver;
  }

}

We can already use this class, but the UI doesn't know of its existence. Let's implement the hook:

/**
 * Implements hook_feed_import_reader_info().
 */
function feed_import_crawler_feed_import_reader_info() {
  $items = array();

  // This is used only to inherit options. Not shown as option.
  $items['behat_crawler'] = array(
    'hidden' => TRUE,
    'name' => t('Behat crawler'),
    'description' => t('Path format can be html|text|@attr-name|element=css|xpath(mycss|myxpath).'),
    'inherit_options' => FALSE,
    'class' => 'BehatReader',
    'options' => array(
      'start' => array(
        '#type' => 'textfield',
        '#title' => t('Start URL'),
        '#description' => t('This is where the crawling process starts'),
        '#required' => TRUE,
        '#default_value' => '',
      ),
      'parent' => array(
        '#type' => 'textfield',
        '#title' => t('Parent path'),
        '#description' => t('Context for all items'),
        '#required' => TRUE,
        '#default_value' => '',
      ),
      'next' => array(
        '#type' => 'textfield',
        '#title' => t('Path for next page'),
        '#description' => t('Path to a link that navigates to next page.'),
        '#required' => TRUE,
        '#default_value' => '',
      ),
      'max' => array(
        '#type' => 'textfield',
        '#title' => t('Max pages'),
        '#description' => t('Max pages to crawl.') . ' ' .
                          t('Use 0 to ignore this setting.'),
        '#required' => TRUE,
        '#default_value' => 0,
      ),
    ),
  );

  // This is the Goutte reader.
  $items['goutte_behat_crawler'] = array(
    'name' => t('Goutte crawler'),
    'description' => t('Imports content using Behat - Goutte as crawler.') . '<br>' .
                      $items['behat_crawler']['description'],
    'class' => 'GoutteCrawler',
    'inherit_options' => 'behat_crawler',
    'options' => array(
      'user_agent' => array(
        '#type' => 'textfield',
        '#title' => t('User-Agent'),
        '#default_value' => '',
      ),
    ),
  );
  return $items;
}

We are done! You can download the whole module: https://drupal.org/files/feed_import_crawler.tar_.gz


Enable the module and import the following feed configuration to test it: https://drupal.org/files/feed_import_crawler_stackex.txt

This example imports questions from drupal.stackexchange.com into article nodes.