The XML source plugin does not scale well. It uses SimpleXML, which reads and parses the entire XML file and builds an in-memory structure representing it. If you need to deal with, say, a 353MB XML file, even if you're able to bump up PHP's memory_limit high enough to not die completely, it is excruciatingly slow.

So, we need to look at alternative XML parsing solutions, in particular XMLReader and XML Parser. We cannot simply plug in an alternative library under the existing implementation, however, because these do not support xpath, which is a critical part of the current API. We need separate XML plugins - well, at least for MigrateItemsXML, which is the use case where you're most likely to deal with huge files. The tradeoffs between the two will be

  • SimpleXML-based API - for simplicity (use of xpath in mappings) when the XML files are not too big
  • Alternative API - for performance with large XML files

The challenge with the alternative API will be figuring out how the client migration can specify what elements and/or attributes are to be mapped - perhaps by supporting a limited subset of the xpath syntax.

Files: 
CommentFileSizeAuthor
#6 MigrateSourceXML-1336534-5.patch15.76 KBmikeryan

Comments

mikeryan’s picture

Ah! Found a couple of examples on the web, once we've got an element representing a source object, we can feed it into DOM/SimpleXML and use xpath as we normally do to pull values from it. So, it's really only the $item_xpath argument to the MigrateItemsXML constructor where we don't have built-in xpath capability and need to come up with a custom solution.

drewish’s picture

I'd started using XMLReader for a project that needed to load a 1GB XML file and that's what prompted part of my work on #1268070: Let MigrateItems and MigrateList::getIdList() return iterateable. Once you can return an iterator to and control the order you can just work your way down the file. I'd like to keep working on simplifying the amount of code that goes into a Source so that they're easier to write and you don't need to bother with the whole iterator interface.

mikeryan’s picture

Exactly - at this moment (which might change in another 10 minutes;) I'm looking at doing

class MigrateXMLReader extends XMLReader implements Iterator {
  public function __construct($xml_url, $item_xpath='item') {
  } 
}

Iterating over each match to the $item_xpath. The MigrateItems implementation would encapsulate this. Or, the MigrateItems implementation would implement Iterator and encapsulate an XMLReader instance...

drewish’s picture

The issue I had with XMLReader was that you had to call read() to descend down into the tree before next() would work. I was having a hard time figuring out how to generalize the class so that it didn't need to know the exact schema of the doc... Even worse you had to parse the entire document before you knew what the counts were.

mikeryan’s picture

Status: Active » Needs review

For your review... I did implement MigrateXMLReader much as suggested above, wrapped inside a new MigrateSource named MigrateSourceXML. You pass a URL, element query, and ID query much like the existing MigrateItemsXML, but the element query is more restrictive - it does not support the full xpath language, you need to give an explicit path to the element(s) you want (i.e., '/producers/producer' rather than '//producers'). You may also filter on attribute value (e.g., '[@attribute="value"]'). Another feature I found a need for at the last minute is that you can derive your own class from MigrateXMLReader and substitute it.

@drewish: As for the time to count a big document, that's a big enough problem for my current project that I plan on implementing a skip_count feature on sources. Issue to be opened momentarily...

mikeryan’s picture

It helps to attach the patch...

mikeryan’s picture

drewish’s picture

Did you benchmark that method of creating SimpleXML nodes? I seem to remember serializing to text and back to be faster but that might be a mis-recollection.

mikeryan’s picture

Well, let me give that a try... I have my 353MB XML file, containing something like 70K elements to be turned into nodes and 1M comments, and a migration for 1291 elements going into a video content type, about half skipped by prepareRow(). I added instrumentation around the DOM handling stuff to try

$ drush mi video --instrument=timer
Processed 1291 (617 created, 0 updated, 0 failed, 674 ignored) in 82.6 sec (448/min) - done with 'Video'
...
 MigrateXMLReader::next                             35.493     1292   27.471 
...
 MigrateXMLReader::next inner                       1.429      1291   1.107 
...

I'm thinking there's not a lot of optimization to be done there, but I'll give it a try to give it a shot tomorrow.

mikeryan’s picture

Replacing the DOM stuff with

  $this->currentElement = simplexml_load_string($this->readOuterXML());

and running

drush mr video; drush mi video --instrument=timer

three times, the inner section takes on average 8.0% of the MigrateXMLReader::next time (a little over 2ms per call). The DOM method averages 4.2%, just a bit over 1ms, so despite how the code may look, it really is faster.

mikeryan’s picture

Status: Needs review » Fixed

Committed.

mikeryan’s picture

Assigned: Unassigned » mikeryan
Status: Fixed » Needs work

Oops. Upgrading to PHP 5.3 generates strict warnings:

Strict warning: Declaration of MigrateXMLReader::next() should be compatible with that of XMLReader::next() in require_once() (line 582 of /Users/mryan/Sites/m7/sites/all/modules/migrate/plugins/sources/xml.inc).

Problems - we're extending XMLReader and implementing Iterator, each of which declares their own next() method. As it is, things work for our purposes - we've implemented next() for Iterator, which is the only next() we really need - but it's incorrect, and if someone were to try using a MigrateXMLReader as an XMLReader they would find themselves in trouble calling next().

So, what I need to do is encapsulate an XMLReader object in MigrateXMLReader, rather than extend the class.

mikeryan’s picture

Status: Needs work » Fixed

All right, MigrateXMLReader is no longer an XMLReader itself, it contains member $reader.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

killtheliterate’s picture

Where is the documentation for how to implement this?

mikgreen’s picture

XMl migration is very, very slow. If you need to import more then 1000 records, I suggest moving XML data to database and do SQL migration instead.

mccrodp’s picture

I was getting very slow imports, about 65/min there using MigrateSourceMultiItems which seems to use SimpleXML, as discussed above. I swapped to using mikeryan's MigrateXMLReader method which if I'm correct (I haven't checked), is invoked by using MigrateSourceXML instead. i.e. Use MigrateSourceXML instead of MigrateSourceMultiItems or MigrateSourceList

$this->source = new MigrateSourceMultiItems($items_class, $fields);

$this->source = new MigrateSourceXML($items_url, $item_xpath, $item_ID_xpath, $fields);

I am now getting speeds seen below. An example is seen at the top of #1806920: Migration does not complete properly from XML source

Processed 2566 (2566 created, 0 updated, 0 failed, 0 ignored) in 78 sec (1973/min)

XML is no longer slow! :)

tinflute’s picture

Title: XML source plugin does not scale » Refreshingly delicious, switch now.

I too switched from MigrateSourceMultiItems to MigrateSourceXML and the results speak for themselves.
I went from ~5/min to 363/min.
Good with mustard too.

mikeryan’s picture

Title: Refreshingly delicious, switch now. » XML source plugin does not scale
tinflute’s picture

woops. accidental title change. my bad

seanB’s picture

Thank you for letting me know about the difference between MigrateSourceMultiItems() and MigrateSourceXML()!! I was trying to figure out what was wrong with my import for 2 days, and now it finally runs with decent performance.