I have implemented a migration using custom source classes extending the default MigrateListJSON and MigrateItemJSON classes. It uses highwater marks to update content. I am also using stubs for an entityreference field that points back to the same entity type. The migration creates commerce products with several field collections and files inside them. The performance is awful in general: it starts off at 10/min, but after migrating about 700 products it drops to 1 product per 2500 seconds.

What could be causing this? Is it a combination of highwater and stubs (needs_update flags)?

How can I check whether the highwater marks are stored correctly in the database?

I also realized that there are a lot of calls to the MigrateItemJSON URLs, even if only one product is being migrated. Attached is a screenshot of an xhprof recording.


Comments

mikeryan’s picture

Status: Active » Postponed (maintainer needs more info)

Ah, so you did do xhprof. Here's a question - are you only attempting the import through the UI? Have you tried drush? If the problem is UI-only, please read http://drupal.org/node/1806824.

Lukas von Blarer’s picture

Yes, I figured that out as well. I am now running the migration using drush, and the rate stays constant at 7/min, so that fixed the progressive slowdown. But the performance is still very bad.

What is the fread Excl. Wall Time telling me? Could it be that the big JSON list is being loaded for every product? Or what else could be causing this?

Lukas von Blarer’s picture

Status: Postponed (maintainer needs more info) » Active

As you suggested, here is the information returned with the option --instrument=timer:

Processed 51 (0 created, 51 updated, 0 failed, 0 ignored) in 1047.9 sec (3/min) - stopped 'CommerceProductMigration'     [warning]
 Timer                                                          Cum (sec)  Count  Avg (msec) 
 page                                                           1048.319   1      1048319.02 
 destination import                                             982.889    51     19272.333  
 entity_save                                                    49.908     51     978.585    
 MigrateSourceList getNextRow                                   46.343     51     908.69     
 MigrateSourceList performRewind                                18.051     1      18050.62   
 Retrieve https://example.com/list.json                         17.974     1      17973.96   
 drupal_http_request                                            8.948      2382   3.756      
 MigrateFieldsEntityHandler->prepare                            0.184      51     3.615      
 MigrateDestinationEntity->prepareFields                        0.183      51     3.593      
 saveIDMapping                                                  0.132      78     1.689      
 MigrateFieldsEntityHandler->complete                           0.108      51     2.112      
 MigrateDestinationEntity->completeFields                       0.107      51     2.09       
 mapRowBySource                                                 0.041      51     0.797      
 MigrateTextFieldHandler->prepare                               0.032      1377   0.023      
 lookupDestinationID                                            0.019      31     0.609      
 MigrateCommercePriceFieldHandler->prepare                      0.014      153    0.091      
 AtlasantibodiesBackendCommerceProductMigration prepareRow      0.009      51     0.183      
 MigrateValueFieldHandler->prepare                              0.003      153    0.021      
 MigratePathEntityHandler->prepare                              0.002      51     0.043      
 MigrateCountryFieldHandler->prepare                            0.001      51     0.022      
 MigrateEntityReferenceFieldHandler->prepare                    0.001      51     0.021  

I should mention that I have huge product entities containing references to entities provided by field_collection. Also, I have remote files that are being saved with remote_stream_wrapper. How can I improve performance? What is the bottleneck? Is it MySQL? Is it PHP? Or is it slow requests for the JSON URLs?

Lukas von Blarer’s picture

Here is an updated XHProf profile of an import of 10 products:

Screen Shot 2013-03-12 at 11.35.50 AM.png

So one big problem is running 31000 SQL queries to import 10 products. The most expensive are DELETE queries, which make up 30% of the total time. But I am not deleting anything, I am importing. So why would DELETE queries be running? How should I proceed?

mikeryan’s picture

I don't know where your DELETE queries are coming from, but I will note the 18 seconds opening your JSON feed. For large JSON feeds, MigrateSourceJSON will perform much better.
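For illustration, a sketch of what the suggested switch might look like. If I recall the migrate 7.x-2.x API correctly, MigrateSourceJSON takes the feed URL, the name of the ID field, and a field list, and streams one large JSON file instead of fetching a separate URL per item. The class name, URL, and field names below are assumptions for this example, not a drop-in replacement:

```php
<?php
// Hypothetical migration constructor using MigrateSourceJSON instead of
// MigrateSourceList (MigrateListJSON + MigrateItemJSON). The URL and
// field names are placeholders for your own feed.
class ExampleProductMigration extends Migration {
  public function __construct($arguments) {
    parent::__construct($arguments);
    $fields = array(
      'product_number' => 'Unique product identifier',
      'timestamp' => 'Last-changed timestamp',
    );
    // MigrateSourceJSON parses the single feed with a streaming reader,
    // keyed by the ID field ('product_number' here), so the whole list
    // is not re-fetched for every row.
    $this->source = new MigrateSourceJSON(
      'https://example.com/products.json',
      'product_number',
      $fields
    );
    // ... map, destination, and field mappings as before ...
  }
}
?>
```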

Lukas von Blarer’s picture

I discovered that one of my big performance problems was remote images that had to be saved. Maybe I will write some kind of cache for that. The second issue is the nested field collections causing this insane number of MySQL queries, but I don't think I will be able to solve that one. I posted an issue for it:

#1955338: Nested field collection cause migration to run slow
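As a sketch of the "cache for remote images" idea mentioned above: keep a local copy keyed by the source URL and only download when no cached copy exists yet. The helper name and cache directory are made up for this example; only the Drupal 7 file and HTTP APIs are real:

```php
<?php
// Hypothetical local cache for remote files so that re-runs of the
// migration do not download unchanged images again.
function example_cache_remote_file($url) {
  $cache_dir = 'public://remote_cache';
  file_prepare_directory($cache_dir, FILE_CREATE_DIRECTORY);
  // Key the cached copy by a hash of the source URL.
  $local = $cache_dir . '/' . md5($url) . '-' . drupal_basename($url);
  // Only download when we have no cached copy yet.
  if (!file_exists($local)) {
    $result = drupal_http_request($url);
    if ($result->code != 200) {
      return FALSE;
    }
    file_unmanaged_save_data($result->data, $local, FILE_EXISTS_REPLACE);
  }
  return $local;
}
?>
```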

But now I am using highwater to update existing entries, and even if there is not a single entry to update, the migration takes 27 hours! I tracked that down to the slow JSON API: it serves the JSON entry for each row very slowly. But I have the highwater field available in the JSON list as well, so there is actually no need to make the 14000 calls to the API at all. In MigrateSource::next(), $this->getNextRow() is called for every row, and only after that is the highwater field checked. Is there a way to work around this?

Lukas von Blarer’s picture

I tried to solve the highwater issue like this in my class extending MigrateListJSON:

<?php
  protected function getIDsFromJSON(array $data) {
    $migration = Migration::currentMigration();
    $highwater = strtotime($migration->getHighwater());
    $ids = array();
    foreach ($data['data'] as $item) {
      // Skip items that have not changed since the last highwater mark.
      if (strtotime($item['timestamp']) > $highwater) {
        $ids[] = $item['product_number'];
      }
    }
    return $ids;
  }
?>

Is there anything wrong in doing this?

mikeryan’s picture

Status: Active » Postponed (maintainer needs more info)

I think that'll throw off your source count, so the numbers reported in the dashboard/drush migrate-status will look funky, but it should work for your purposes.
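One way the "funky" count could be kept consistent with the filtered ID list is to apply the same highwater filter when the list class computes its count. This is a sketch assuming the same JSON structure ('data', 'timestamp', 'product_number') as the snippet above; if your version of MigrateListJSON has no countFromJSON() helper, overriding count() directly would work the same way:

```php
<?php
// Sketch: make the source count match the filtered ID list so
// dashboard/drush migrate-status numbers stay meaningful.
protected function countFromJSON(array $data) {
  $migration = Migration::currentMigration();
  $highwater = strtotime($migration->getHighwater());
  $count = 0;
  foreach ($data['data'] as $item) {
    // Count only items newer than the stored highwater mark.
    if (strtotime($item['timestamp']) > $highwater) {
      $count++;
    }
  }
  return $count;
}
?>
```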

mikeryan’s picture

Status: Postponed (maintainer needs more info) » Closed (works as designed)

No further information provided.

Lukas von Blarer’s picture

Status: Closed (works as designed) » Active

I think we could improve this by providing an option to use a highwater field inside the list. This improves performance a lot.

mikeryan’s picture

Category: support » feature
Status: Active » Postponed (maintainer needs more info)

I don't understand what "use a highwater field inside the list" means - can you describe what you're visualizing?

Lukas von Blarer’s picture

Status: Postponed (maintainer needs more info) » Active

I am using the MigrateListJSON class as the base for my JSON list. It would be nice to be able to use the timestamps coming from there instead of having to query each MigrateItemJSON URL.

mikeryan’s picture

Title: Migrating 14000 entries with JSON as a source » Obtain timestamps in JSON list class

Ah, so the timestamps are part of the list data? I'm not sure what a general approach to this would look like; extending the list class as you're doing seems like the best way. Of course, patches are welcome if someone comes up with an answer...

Lukas von Blarer’s picture

I achieved it by passing the highwater value along from the getIDsFromJSON() method in my list class:

<?php
  protected function getIDsFromJSON(array $data) {
    $migration = Migration::currentMigration();
    $highwater = strtotime($migration->getHighwater());
    $ids = array();
    foreach ($data['data'] as $item) {
      $time = strtotime($item['timestamp']);
      if ($time > $highwater) {
        // Pass the item's timestamp along with its ID so the item class
        // can set it without an extra request.
        $ids[] = array(
          'id' => $item['product_number'],
          'highwater' => $item['timestamp'],
        );
      }
    }
    return $ids;
  }
?>

and then assigning it to each item inside getItem() in my item class:

<?php
  public function getItem($id) {
    $item_id = $id['id'];
    // We cache the parsed JSON at $this->data.
    if (empty($this->data[$item_id])) {
      $item_url = $this->constructItemUrl($id['id']);
      $data = $this->loadJSONUrl($item_url);
      if ($data) {
        // Index the array by the ID for easy retrieval.
        foreach ($data as $item) {
          if (isset($item->product_number)) {
            // Attach the timestamp that came from the list data.
            $item->timestamp = $id['highwater'];
            $this->data[$item->product_number] = $item;
          }
        }
      }
      else {
        // Error handling here...
      }
    }
    // Return the requested item.
    if (isset($this->data[$item_id])) {
      return $this->data[$item_id];
    }
    return NULL;
  }
?>

That roughly tripled the performance of my migration.

How could we implement this to make it available in migrate?
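One possible shape for such a contribution: a reusable list class that takes the name of a highwater field present in the list data and filters IDs without hitting each item URL. The class name, constructor signature, and option names below are made up for illustration; the parent constructor arguments in particular would need checking against the actual MigrateListJSON code:

```php
<?php
// Hypothetical generalization of the approach above for migrate itself:
// a list class configured with the name of a highwater field that is
// available directly in the JSON list.
class MigrateListJSONHighwater extends MigrateListJSON {
  protected $highwaterField;

  public function __construct($list_url, $highwater_field, array $http_options = array()) {
    parent::__construct($list_url, $http_options);
    $this->highwaterField = $highwater_field;
  }

  protected function getIDsFromJSON(array $data) {
    $migration = Migration::currentMigration();
    $highwater = strtotime($migration->getHighwater());
    $ids = array();
    foreach ($data['data'] as $item) {
      // Compare the list-provided timestamp against the stored mark,
      // so no per-item URL needs to be fetched for unchanged rows.
      if (strtotime($item[$this->highwaterField]) > $highwater) {
        $ids[] = $item['product_number'];
      }
    }
    return $ids;
  }
}
?>
```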

pifagor’s picture

Issue summary: View changes
Status: Active » Closed (outdated)