I'm looking at doing a D6->D7 migration with a web service component for export so we can move changes across continuously until the final switchover. The number of ids gets pretty crazy (there are 330,000+ nodes and 540,000+ users), so even just getting a listing of ids turns into a lot of data. What I'm probably going to do is subclass MigrateListJSON and override getIdList() to use a page parameter and keep making requests until it has all the ids.

It occurred to me that it might be more elegant to modify MigrateList's interface so that getIdList() could return an iterable object of ids, rather than just an array. Since an array is iterable, it would be backwards compatible. This would allow the iterator to fetch a page of values at a time rather than all at once. Going a page at a time, even if the pages are large (say 10,000 ids), would still use less memory than fetching the entire list. I think the only real change would need to be in MigrateSourceList::next() where it does:
while ($this->id = array_shift($this->idList))
It could go to something like

foreach ($this->idList as $id) {
  $this->id = $id;
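
To make the paging idea concrete, here's a rough sketch of what such an iterable could look like. It is not Migrate code: paged_ids() and the $fetch_page callback are made-up names, and it assumes PHP 5.5+ generators (a hand-written Iterator class would be needed on older PHP). The callback stands in for whatever actually requests one page of ids from the web service.

```php
<?php
// Hypothetical helper, not part of the Migrate API.
// $fetch_page is an assumed callback: it takes a zero-based page number and
// returns an array of ids, or an empty array once there are no more pages.
function paged_ids($fetch_page) {
  for ($page = 0; ; $page++) {
    $ids = call_user_func($fetch_page, $page);
    if (empty($ids)) {
      // An empty page is treated as the end of the list.
      return;
    }
    foreach ($ids as $id) {
      // Only the current page is held in memory at any one time.
      yield $id;
    }
  }
}

// Usage with an in-memory stand-in for the web service.
$example_pages = array(array(1, 2, 3), array(4, 5));
$example_fetch = function ($page) use ($example_pages) {
  return isset($example_pages[$page]) ? $example_pages[$page] : array();
};
foreach (paged_ids($example_fetch) as $id) {
  // Each id arrives one at a time, exactly as it would from a plain array.
}
```

Since a generator is an Iterator, code that only loops over the result shouldn't need to care whether it was handed this or an array.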

Thoughts?

Comments

drewish’s picture

Ah, just started thinking about the need to be resumable. I think we could probably just test the result of the getIdList() call and, if it's an array, wrap it with new ArrayIterator($this->idList); then in the loop use the Iterator interface.
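
Roughly what I mean, as a standalone sketch rather than the actual patch. The two function names are made up for illustration; only ArrayIterator and the Iterator methods are real PHP. Because the position lives in the iterator itself, each call picks up where the previous one stopped, which a foreach restarted from the top would not do.

```php
<?php
// Hypothetical helper: accept either an array or an Iterator of ids and
// always hand back an Iterator.
function example_id_iterator($id_list) {
  if (is_array($id_list)) {
    return new ArrayIterator($id_list);
  }
  if ($id_list instanceof Iterator) {
    return $id_list;
  }
  throw new InvalidArgumentException('Expected an array or an Iterator of ids.');
}

// Hypothetical helper: return the current id and advance the iterator, or
// NULL once the iterator is exhausted.
function example_next_id(Iterator $ids) {
  if (!$ids->valid()) {
    return NULL;
  }
  $id = $ids->current();
  $ids->next();
  return $id;
}

$ids = example_id_iterator(array(10, 20, 30));
$first = example_next_id($ids);
// A later call resumes from the iterator's saved position.
$second = example_next_id($ids);
```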

mikeryan’s picture

Category: support » feature

Sounds like a good idea. Too late to get into 2.2, which I plan on cutting very soon, but let's put it on the list for 2.3.

drewish’s picture

Status: Active » Needs review
FileSize
2.37 KB

Here's kind of what I'm thinking. It doesn't seem like a good idea to expose getIdList() since no one's calling it, so I killed that off.

One downside is that the iterator class couldn't make use of other parts of the MigrateList. In my case I was trying to extend MigrateListJSON and found I needed to be able to call getIDsFromJSON(). So I ended up having my list implement Iterator so I could do it in one place.

It's also kind of odd because we don't care about keys, so you end up implementing key() and ignoring it.

I'm wondering if I should just be implementing MigrateSource myself... I wonder if there's a way we can make that process easier.

Here's my class:



abstract class DSListJSON extends MigrateListJSON implements Iterator {
  // Subclasses need to populate this with a URL where we can get a count.
  protected $countUrl;
  // Which page are we on?
  protected $pageNumber;
  // Array of remaining values on this page.
  protected $pageValues;
  // Current value.
  protected $current;
  // Have we reached the end of the list?
  protected $eof;

  public function __construct($list_url, $http_options = array()) {
    parent::__construct($list_url, $http_options);
    $this->listUrl = $list_url;
    $this->httpOptions = $http_options;
    $this->rewind();
  }

  public function computeCount() {
    $count = NULL;

    migrate_instrument_start(__METHOD__);

    migrate_instrument_start("Retrieve $this->countUrl");
    $json = file_get_contents($this->countUrl);
    migrate_instrument_stop("Retrieve $this->countUrl");

    if ($json) {
      $data = drupal_json_decode($json);
      $count = $data;
    }

    migrate_instrument_stop(__METHOD__);

    return $count;
  }

  public function getIdList() {
    return $this;
  }

  /**
   * Try to fetch and parse the page specified by pageNumber, populates
   * pageValues with the ids and fills current with the first value.
   */
  protected function fetchPage() {
    // Assume there's nothing left so we can be proved wrong.
    $this->pageValues = array();
    $this->eof = TRUE;
    $this->current = FALSE;

    $url = str_replace(':page', $this->pageNumber, $this->listUrl);
    $cid = __CLASS__ . ':' . $url;

    $cache = cache_get($cid);
    if ($cache !== FALSE && isset($cache->data)) {
      $ids = $cache->data;
    }
    else {
      migrate_instrument_start("Retrieve $this->listUrl");
      $jsonString = file_get_contents($url);
      migrate_instrument_stop("Retrieve $this->listUrl");
      if ($jsonString === FALSE) {
        return NULL;
      }
      $jsonArray = drupal_json_decode($jsonString);
      if ($jsonArray === NULL) {
        return NULL;
      }
      $ids = $this->getIDsFromJSON($jsonArray);
      if (!$ids) {
        return NULL;
      }
      cache_set($cid, $ids);
    }

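    $this->pageValues = $ids;
    $this->eof = empty($this->pageValues);
    // Take the first id off the page so pageValues holds only the remaining
    // ones; otherwise next() would hand back the first id a second time.
    $this->current = array_shift($this->pageValues);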
  }

  public function rewind() {
    $this->pageNumber = 0;
    $this->fetchPage();
  }

  public function current() {
    return $this->current;
  }

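  public function key() {
    // Keys aren't used by the caller. Combine the page number with the count
    // of ids left on the page so each position still gets a distinct key.
    return $this->pageNumber . ':' . count($this->pageValues);
  }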

  public function next() {
    $this->current = array_shift($this->pageValues);
    if ($this->current === NULL) {
      // This page is used up; move on to the next one unless we've hit the end.
      if (!$this->eof) {
        $this->pageNumber += 1;
        $this->fetchPage();
      }
      else {
        $this->current = FALSE;
      }
    }
    return $this->current;
  }

  public function valid() {
    return !empty($this->pageValues) || !$this->eof;
  }
}
drewish’s picture

Title: Let MigrateList::getIdList() return iterateable » Let MigrateItems and MigrateList::getIdList() return iterateable
FileSize
4.82 KB

Added support to MigrateItems for working with an Iterator.

drewish’s picture

FileSize
4.97 KB

Had some mixed capitalization of idList and idlist.

mikeryan’s picture

Status: Needs review » Fixed

Committed for D6 and D7, thanks!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.