I have implemented a migration using a custom source class extending the default MigrateListJSON and MigrateItemJSON classes. It uses highwater marks to update content, and stubs for an entityreference field that points back to the same entity. The migration creates commerce products, each containing several field collections and files. Performance is awful in general: it starts off at about 10 products/min, but after migrating about 700 products it drops to 1 product per 2500 seconds.
What could be causing this? Is it a combination of highwater marks and stubs (needs_update flags)?
How can I check whether the highwater marks are stored correctly in the database?
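For reference, in Migrate 7.x-2.x the current highwater mark is persisted per migration in the migrate_status table, so one way to inspect it is a direct query, e.g. run via drush php-eval (a sketch, assuming the 7.x-2.x schema):

```php
// Sketch: list the stored highwater mark for every registered migration.
// Assumes Migrate 7.x-2.x, where {migrate_status} holds one row per
// migration with its machine_name and current highwater value.
$result = db_query('SELECT machine_name, highwater FROM {migrate_status}');
foreach ($result as $row) {
  print $row->machine_name . ': ' . var_export($row->highwater, TRUE) . "\n";
}
```

An empty string or stale timestamp here would explain rows being needlessly re-fetched on every run.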
I also noticed that there are a lot of calls to the MigrateItemJSON URLs, even when only one product is being migrated. Attached is a screenshot of an XHProf recording.
| Comment | File | Size | Author |
|---|---|---|---|
| #4 | Screen Shot 2013-03-12 at 11.35.50 AM.png | 457.75 KB | Lukas von Blarer |
| | Screen Shot 2013-02-07 at 9.38.13 PM.png | 267.69 KB | Lukas von Blarer |
Comments
Comment #1
mikeryan
Ah, so you did do XHProf. Here's a question - are you only attempting the import through the UI? Have you tried drush? If the problem is UI-only, please read http://drupal.org/node/1806824.
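For reference, a drush run might look like this (the migration name is an assumption; --instrument=timer is the option used later in this thread):

```shell
# Hypothetical migration name. --limit caps the batch size and
# --instrument=timer reports timing data (Migrate 7.x-2.x drush commands).
drush migrate-import MyProductMigration --limit="100 items" --instrument=timer
drush migrate-status
```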
Comment #2
Lukas von Blarer
Yes, I figured that out as well. I am now running the migration using drush, and performance is now constant at 7/min, so I fixed the slowdown that way. But performance is still very bad.
What is the `fread` Excl. Wall Time telling me? Could it be that the big JSON list is being loaded for every product? Or what else could be causing this?
Comment #3
Lukas von Blarer
As you suggested, here is the information returned with the option --instrument=timer:
I should mention that I have huge product entities containing references to entities provided by field_collection, and remote files that are saved via remote_stream_wrapper. How can I improve performance? What is the bottleneck? Is it MySQL? Is it PHP? Or is it slow requests to the JSON URLs?
Comment #4
Lukas von Blarer
Here is an updated XHProf profile of an import of 10 products:
So one big problem is running 31000 SQL queries to import 10 products. The most expensive ones are DELETE queries, which make up 30% of the total time. But I am not deleting anything - I am importing. So why would DELETE queries run at all? How should I proceed?
Comment #5
mikeryan
I don't know where your DELETE queries are coming from, but I will note the 18 seconds spent opening your JSON feed. For large JSON feeds, MigrateSourceJSON will perform much better.
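A sketch of what switching to MigrateSourceJSON might look like in a migration constructor (the URL and field names here are assumptions):

```php
// Sketch, assuming Migrate 7.x-2.x: MigrateSourceJSON parses a single JSON
// file containing all source rows, avoiding one HTTP request per item as
// happens with MigrateSourceList + MigrateListJSON/MigrateItemJSON.
$fields = array(
  'id' => t('Unique product ID'),
  'changed' => t('Last-changed timestamp'),
);
$this->source = new MigrateSourceJSON('http://example.com/products.json', 'id', $fields);
```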
Comment #6
Lukas von Blarer
I discovered that one of my big performance problems was remote images that had to be saved; maybe I will write some kind of cache for that. The second issue is the nested field collections causing these insane numbers of MySQL queries, but I don't think I will be able to solve that one. I posted an issue for it:
#1955338: Nested field collection cause migration to run slow
But now I am using highwater marks to update existing entries, and even when there is not a single entry to update, the migration takes 27 hours! I tracked that down to the slow JSON API: it serves the JSON entry for each row very slowly. But the highwater field is available in the JSON list as well, so there is actually no need to make the 14000 calls to the API at all. In MigrateSource::next(), $this->getNextRow() is called for every row, and only afterwards is the highwater field checked. Is there a way to work around this?
Comment #7
Lukas von Blarer
I tried to solve the highwater issue like this in my class extending MigrateListJSON:
Is there anything wrong in doing this?
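The snippet itself is not included in the thread, but a hypothetical version of the approach (class, key, and field names are all assumptions, not the poster's code) could filter the ID list against the stored highwater mark so unchanged items never hit their item URLs:

```php
// Hypothetical sketch: extend MigrateListJSON and drop IDs whose
// list-level timestamp is at or below the current highwater mark.
class MyProductListJSON extends MigrateListJSON {
  protected function getIDsFromJSON(array $data) {
    // Assumption: the stored highwater mark is readable here; in stock
    // Migrate 7.x-2.x this may require exposing it from MigrationBase.
    $highwater = Migration::currentMigration()->getHighwater();
    $ids = array();
    foreach ($data as $item) {
      // Assumed list structure: each entry carries 'id' and 'changed'.
      if (empty($highwater) || $item['changed'] > $highwater) {
        $ids[] = $item['id'];
      }
    }
    return $ids;
  }
}
```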
Comment #8
mikeryan
I think that'll throw off your source count, so the numbers reported in the dashboard/drush migrate-status will look funky, but it should work for your purposes.
Comment #9
mikeryan
No further information provided.
Comment #10
Lukas von Blarer
I think we could improve this by providing an option to use a highwater field inside the list. This improves performance a lot.
Comment #11
mikeryan
I don't understand what "use a highwater field inside the list" means - can you describe what you're visualizing?
Comment #12
Lukas von Blarer
I am using the MigrateListJSON class as the base for my JSON list. It would be nice to be able to use the timestamps coming from there instead of having to query each MigrateItemJSON URL.
Comment #13
mikeryan
Ah, so the timestamps are part of the list data? I'm not sure what a general approach to this would look like; extending the list class as you're doing seems like the best way. Of course, patches are welcome if someone comes up with an answer...
Comment #14
Lukas von Blarer
I achieved it by passing the highwater value from the getIDsFromJSON() method in my list class:
and then assigning it to each item inside getItem() in my item class:
That roughly tripled the performance of my migration.
How could we implement this to make it available in migrate?
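The two snippets themselves are not attached to the thread; a hypothetical reconstruction of the approach described above (class, property, and field names are assumptions) could hand the list-level timestamps to the item class through a shared map:

```php
// Hypothetical sketch of the two-part approach: the list class captures
// each item's timestamp from the list feed, and the item class attaches
// it to the fetched row so it can serve as the highwater field.
class MyProductListJSON extends MigrateListJSON {
  // Timestamps captured from the list feed, keyed by item ID.
  public static $timestamps = array();

  protected function getIDsFromJSON(array $data) {
    $ids = array();
    foreach ($data as $item) {
      $ids[] = $item['id'];
      self::$timestamps[$item['id']] = $item['changed'];
    }
    return $ids;
  }
}

class MyProductItemJSON extends MigrateItemJSON {
  public function getItem($id) {
    $item = parent::getItem($id);
    if ($item !== NULL && isset(MyProductListJSON::$timestamps[$id])) {
      // Reuse the list-level timestamp instead of relying on the
      // per-item payload to carry it.
      $item->changed = MyProductListJSON::$timestamps[$id];
    }
    return $item;
  }
}
```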
Comment #15
pifagor