I have spent the better part of a week trying out different scenarios with my file migrations. Initially we were getting a lot of failed file imports, but as we have narrowed the list down it has become apparent that some of the failing files do exist at the source yet still fail for no apparent reason.

Looking into this further, I noticed that migrate uses @copy() to do the transfer. The @ operator suppresses the warning that copy() would otherwise raise, so the underlying error message is never available to report. It would be nice if, rather than doing this, we could specify an error handler and use what it captures in the error message.
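A minimal sketch of one way to surface the suppressed message, assuming plain PHP 5.5 or later outside of migrate: a temporary handler installed with set_error_handler() records the warning that copy() raises, and the caller gets it back alongside the result. The function name copy_with_error() is hypothetical and is not part of the migrate module.

```php
<?php
/**
 * Copies a file and returns the PHP warning text on failure.
 *
 * Hypothetical helper for illustration; not part of the migrate module.
 *
 * @return array
 *   array($success, $message), where $message is NULL if no warning was raised.
 */
function copy_with_error($source, $destination) {
  $message = NULL;
  // Temporarily capture warnings instead of hiding them with @.
  set_error_handler(function ($errno, $errstr) use (&$message) {
    $message = $errstr;
    // Returning TRUE tells PHP the warning has been handled.
    return TRUE;
  }, E_WARNING);
  try {
    $success = copy($source, $destination);
  }
  finally {
    // Always put the previous handler back, even if copy() throws.
    restore_error_handler();
  }
  return array($success, $message);
}
```

A caller could then append the captured text to its own error message, for example: list($ok, $why) = copy_with_error($this->sourcePath, $destination);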

While working through this I put a curl request to get the headers into the "if (!@copy($this->sourcePath, $destination))" block and appended the result to the error messages. Surprisingly some were 200 OK, while some were legitimately 404. Interesting.

When I bring that same file (that showed 200 OK in curl) into /devel/php running php copy() I get a 404 the first time, and a 200 OK the second time. Obviously we have a local problem!

It would be nice if we defined a custom error handler rather than simply hiding the errors. There is a D8 issue for doing this across the board here: https://drupal.org/node/1247666.

Yes, I'm aware that it is better to use a local file source than a remote one, but I'm dealing with *many* dev/staging/production environments in this case. For us that will only work in the production environment... so we have an override for that case, but ideally dev and staging would work too.

Comments

13rac1’s picture

IDK if this is really a bug; it seems more like a feature request. Anyway...

Here is a reduced-to-the-minimum version of how I handled checking external files in an import of 95,000+ images.

  public function prepareRow($row) {
    if (parent::prepareRow($row) === FALSE) {
      watchdog('migrate_example', 'prepare row failure');
      return FALSE;
    }

    // If both path and filename are specified, change webpath to a full URL on
    // the webserver.
    if (!empty($row->filename) && !empty($row->path)) {
      $image = 'http://example.com/Source/' .
          $row->path . $row->filename . '.jpg';
      // Set the HTTP request to request only the headers and not the response body.
      stream_context_set_default(
        array(
          'http' => array(
            'method' => 'HEAD'
          )
        )
      );
      // See: http://www.php.net/manual/en/function.get-headers.php
      $headers = @get_headers($image);
      // Set the HTTP request back to GET so the system will correctly download images.
      stream_context_set_default(
        array(
          'http' => array(
            'method' => 'GET'
          )
        )
      );
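      // A status below 400 (200, 301, etc.) means the file was found and the
      // URL is valid. get_headers() returns FALSE when the request itself
      // fails, so check for that before reading the status line.
      if ($headers !== FALSE && (int) substr($headers[0], 9, 3) < 400) {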
        $row->image = $image;
        $row->image_available = 1; // A flag for Solr sorts/filters.
      }
      // URL isn't valid, so mark missing file.
      else {
        $row->image = NULL;
        $row->image_available = 0;
      }
    }
    // Tell migrate to go ahead and process this row.
    return TRUE;
  }
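
One variation on the check above, sketched under the assumption of PHP 7.1 or later, where get_headers() accepts a stream context as its third argument: passing a HEAD context for just this call avoids changing the default context and restoring it afterwards. The function names parse_status_code() and url_status_code() are hypothetical.

```php
<?php
/**
 * Extracts the numeric status code from a status line such as
 * "HTTP/1.1 200 OK". Returns 0 if the line does not look like one.
 */
function parse_status_code($status_line) {
  if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $status_line, $matches)) {
    return (int) $matches[1];
  }
  return 0;
}

/**
 * Returns the status code of a HEAD request, or 0 if the request failed.
 */
function url_status_code($url) {
  // Context used for this call only; the default context stays untouched.
  $context = stream_context_create(array(
    'http' => array('method' => 'HEAD'),
  ));
  $headers = @get_headers($url, FALSE, $context);
  if ($headers === FALSE) {
    return 0;
  }
  // As in the code above, only the first status line is inspected.
  return parse_status_code($headers[0]);
}
```

Because 0 signals a failed request, a caller would test for something like $code > 0 && $code < 400 rather than $code < 400 alone.
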
Anonymous’s picture

Category: Bug report » Feature request

Good point, this should be a feature request. Thanks for posting your code!

mikeryan’s picture

Patches welcome!

Side note:

"When I bring that same file (that showed 200 OK in curl) into /devel/php running php copy() I get a 404 the first time, and a 200 OK the second time. Obviously we have a local problem!"

I saw this last week. While working on D8 file migration from my D6 site, every time I ran it the first file would get a 404 while the rest worked fine (and yes, it did exist at the designated URL). When I changed the order, the new first file got the 404. I stepped through in the debugger, and whenever I took more than a few seconds (roughly 3-5) between copy() calls, the next file would get a 404. I shared this with some of my colleagues here, and no one could come up with a better explanation than "PHP bug". It seems to me like maybe the first call times out before the HTTP connection is made, and the connection persists as long as it isn't idle for more than X seconds...?
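
If the first-request failure really is transient, as the behaviour described above suggests, one workaround is to retry the copy before reporting an error. This is only a sketch: copy_with_retry() is a hypothetical name, the attempt count and delay are arbitrary, and it does nothing about whatever the underlying cause is.

```php
<?php
/**
 * Attempts a copy up to $attempts times, pausing between tries.
 *
 * Hypothetical helper. $copier exists only so the retry logic can be
 * exercised without a network; it defaults to PHP's copy().
 *
 * @return int|false
 *   The attempt number that succeeded, or FALSE if every attempt failed.
 */
function copy_with_retry($source, $destination, $attempts = 3, $delay_ms = 250, $copier = 'copy') {
  for ($try = 1; $try <= $attempts; $try++) {
    if (call_user_func($copier, $source, $destination)) {
      return $try;
    }
    // Pause before the next attempt, but not after the last one.
    if ($try < $attempts) {
      usleep($delay_ms * 1000);
    }
  }
  return FALSE;
}
```

Returning the attempt number makes it possible to log how often the first try fails, which would help confirm or rule out the idle-connection theory.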

pifagor’s picture

Status: Active » Closed (outdated)