The XML source plugin does not scale well. It uses SimpleXML, which reads and parses the entire XML file and builds an in-memory structure representing it. If you need to deal with, say, a 353MB XML file, even if you're able to bump up PHP's memory_limit high enough to not die completely, it is excruciatingly slow.
So, we need to look at alternative XML parsing solutions, in particular XMLReader and XML Parser. We cannot simply plug in an alternative library under the existing implementation, however, because these do not support xpath, which is a critical part of the current API. We need separate XML plugins - well, at least for MigrateItemsXML, which is the use case where you're most likely to deal with huge files. The tradeoffs between the two will be
- SimpleXML-based API - for simplicity (use of xpath in mappings) when the XML files are not too big
- Alternative API - for performance with large XML files
The challenge with the alternative API will be figuring out how the client migration can specify what elements and/or attributes are to be mapped - perhaps by supporting a limited subset of the xpath syntax.
| Comment | File | Size | Author |
|---|---|---|---|
| #6 | MigrateSourceXML-1336534-5.patch | 15.76 KB | mikeryan |
Comments
Comment #1
mikeryanAh! Found a couple of examples on the web, once we've got an element representing a source object, we can feed it into DOM/SimpleXML and use xpath as we normally do to pull values from it. So, it's really only the $item_xpath argument to the MigrateItemsXML constructor where we don't have built-in xpath capability and need to come up with a custom solution.
Comment #2
drewish commentedI'd started using XMLReader for a project that needed to load a 1GB XML file and that's what prompted part of my work on #1268070: Let MigrateItems and MigrateList::getIdList() return iterateable. Once you can return an iterator to and control the order you can just work your way down the file. I'd like to keep working on simplifying the amount of code that goes into a Source so that they're easier to write and you don't need to bother with the whole iterator interface.
Comment #3
mikeryanExactly - at this moment (which might change in another 10 minutes;) I'm looking at doing
Iterating over each match to the $item_xpath. The MigrateItems implementation would encapsulate this. Or, the MigrateItems implementation would implement Iterator and encapsulate an XMLReader instance...
Comment #4
drewish commentedThe issue I had with XMLReader was that you had to call read() to descend down into the tree before next() would work. I was having a hard time figuring out how to generalize the class so that it didn't need to know the exact schema of the doc... Even worse you had to parse the entire document before you knew what the counts were.
Comment #5
mikeryanFor your review... I did implement MigrateXMLReader much as suggested above, wrapped inside a new MigrateSource named MigrateSourceXML. You pass a URL, element query, and ID query much like the existing MigrateItemsXML, but the element query is more restrictive - it does not support the full xpath language, you need to give an explicit path to the element(s) you want (i.e., '/producers/producer' rather than '//producers'). You may also filter on attribute value (e.g., '[@attribute="value"]'). Another feature I found a need for at the last minute is that you can derive your own class from MigrateXMLReader and substitute it.
@drewish: As for the time to count a big document, that's a big enough problem for my current project that I plan on implementing a skip_count feature on sources. Issue to be opened momentarily...
Comment #6
mikeryanIt helps to attach the patch...
Comment #7
mikeryan#1341776: Option to skip counting
Comment #8
drewish commentedDid you benchmark that method of creating SimpleXML nodes? I seem to remember serializing to text and back to be faster but that might be a mis-recollection.
Comment #9
mikeryanWell, let me give that a try... I have my 353MB XML file, containing something like 70K elements to be turned into nodes and 1M comments, and a migration for 1291 elements going into a video content type, about half skipped by prepareRow(). I added instrumentation around the DOM handling stuff to try
I'm thinking there's not a lot of optimization to be done there, but I'll give it a try to give it a shot tomorrow.
Comment #10
mikeryanReplacing the DOM stuff with
and running
three times, the inner section takes on average 8.0% of the MigrateXMLReader::next time (a little over 2ms per call). The DOM method averages 4.2%, just a bit over 1ms, so despite how the code may look, it really is faster.
Comment #11
mikeryanCommitted.
Comment #12
mikeryanOops. Upgrading to PHP 5.3 generates strict warnings:
Strict warning: Declaration of MigrateXMLReader::next() should be compatible with that of XMLReader::next() in require_once() (line 582 of /Users/mryan/Sites/m7/sites/all/modules/migrate/plugins/sources/xml.inc).
Problems - we're extending XMLReader and implementing Iterator, each of which declares their own next() method. As it is, things work for our purposes - we've implemented next() for Iterator, which is the only next() we really need - but it's incorrect, and if someone were to try using a MigrateXMLReader as an XMLReader they would find themselves in trouble calling next().
So, what I need to do is encapsulate an XMLReader object in MigrateXMLReader, rather than extend the class.
Comment #13
mikeryanAll right, MigrateXMLReader is no longer an XMLReader itself, it contains member $reader.
Comment #15
killtheliterate commentedWhere is the documentation for how to implement this?
Comment #16
mikgreen commentedXMl migration is very, very slow. If you need to import more then 1000 records, I suggest moving XML data to database and do SQL migration instead.
Comment #17
mccrodp commentedI was getting very slow imports, about 65/min there using
MigrateSourceMultiItemswhich seems to use SimpleXML, as discussed above. I swapped to using mikeryan's MigrateXMLReader method which if I'm correct (I haven't checked), is invoked by usingMigrateSourceXMLinstead. i.e. Use MigrateSourceXML instead of MigrateSourceMultiItems or MigrateSourceList$this->source = new MigrateSourceMultiItems($items_class, $fields);$this->source = new MigrateSourceXML($items_url, $item_xpath, $item_ID_xpath, $fields);I am now getting speeds seen below. An example is seen at the top of #1806920: Migration does not complete properly from XML source
XML is no longer slow! :)
Comment #18
tinflute commentedI too switched from MigrateSourceMultiItems to MigrateSourceXML and the results speak for themselves.
I went from ~5/min to 363/min.
Good with mustard too.
Comment #19
mikeryanComment #20
tinflute commentedwoops. accidental title change. my bad
Comment #21
seanbThank you for letting me know about the difference between MigrateSourceMultiItems() and MigrateSourceXML()!! I was trying to figure out what was wrong with my import for 2 days, and now it finally runs with decent performance.