Currently the filefield mapper performs minimal validation on remote URLs. Specifically, it runs a valid_url check to make sure it is correct. My suggestion is to enhance the module to perform extra validation. This validation would involve the following:

1) For URLs, check the remote URL with valid_url as it does today
2) Perform an HTTP HEAD on the remote file to make sure we get an OK or other positive response from the server
3) Filter out content types that are not supported by filefield/imagefield

#3 is the big one for me. I'm getting feeds from third parties that are mistakenly inserting a link to a web page and not an image. So we are expecting some type of image but we get text/html instead. The mapper ignores this and continues on to happily create the field, but it's broken when you view the node since it now has an embedded URL instead of an image.

My code for #3 is minimal, with the thought that we can either expand on it or perform closer linkage to filefield/imagefield to only allow for content to be mapped that is supported by those modules. The HEAD solution does create more overhead since you now have to make that web query before saving the node, but it greatly improves the end result as the files are downloaded.

Patch enclosed. Thoughts and comments are welcome.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

rjbrown99’s picture

I still maintain this is a valid enhancement, and I am finding a need to make further improvements. I have a feed that doesn't give me a normal image URL such as:

http://theirsite.com/file.jpg

Instead, it gives me this:

http://theirsite.com/image.ms?product=1234&width=768&height=1024

The script on their side outputs straight image/jpeg as a content type (not application/octet-stream like we currently expect.) We also don't cleanly handle creating a filename. We create something called "FeedsEnclosure-image.ms?product=1234_width=768_height=1024" which doesn't work very well.

I'm going to enhance two things from my original patch.

1) My URL checking function that does the HEAD, filefield_feeds_valid_url, is now going to return a content type instead of TRUE. If we determine a type from the HEAD request, we pass that back. Otherwise we pass back application/octet stream as a default. This will ensure that FeedsEnclosure is called with the correct content type.

2) In the event we don't get a filename, I'm going to figure out a way to account for that and create one.

rjbrown99’s picture

FileSize
4.47 KB
14.85 KB

OK, I'm done with something that seems to work. I can roll a patch, but it's probably easier to just drop in the function. There were other commits since my first patch.

FeedsParser.inc looks like the following. We accept a passed variable $mimetype, which if it is available comes directly from the HTTP HEAD information on the remote file we are downloading. This allows us to accurately set the mimetype on the file.

Basically, this function is different from the one in the tree now as follows:
1) Accepts passed $mimetype variable
2) Removes/replaces whitespace in filenames with underscores
3) Decodes HTML entities in the basename/filename
4) Determines mimetype extensions if available using a new library called mime.inc.
5) Searches the proposed filename for URL query strings, which indicate we might not have a file. If it finds query strings, we manually create the filename based on the date, some random characters, and the mimetype extension from #4 above.

  /**
   * @return
   *   The file path to the downloaded resource referenced by the enclosure.
   *   Downloads resource if not downloaded yet.
   *
   * @todo This is not concurrency safe.
   */
  public function getFile($mimetype) {
    if(!isset($this->file)) {
      // First set the default filename. Replace spaces and decode all HTML entities.
      $replace = array("%20");
      $filename = str_replace($replace, "_", htmlspecialchars_decode(html_entity_decode(basename($this->getValue()))));
      // Grab the normal file extensions if we have a mimetype
      if($mimetype) {
        feeds_include_library('mime.inc', 'mime');
        $mimetypes = array_flip(feeds_drupal_mime_types());
        if (array_key_exists($mimetype, $mimetypes)) {
          $extensions = explode("|", $mimetypes[$mimetype]);
        }
      }
      // If we see query strings, assume we don't have a good name and create our own
      $qstrings = array("?", "=", "&", ";");
      foreach($qstrings as $qstring) {
        $isqstring[] = strpos($filename, $qstring);
      }
      if(!empty($isqstring)) {
        if($extensions) {
          $filename = date("YmdHis") . '_' . mt_rand() . '.' . $extensions[0];
        } else {
          $filename = date("YmdHis") . '_' . mt_rand();
        }
      }

      $dest = file_destination(file_directory_temp() .'/'. get_class($this) .'-'. $filename, FILE_EXISTS_RENAME);
      if (ini_get('allow_url_fopen')) {
        $this->file = copy($this->getValue(), $dest) ? $dest : 0;
      }
      else {
        $this->file = file_save_data($this->getContent(), $dest);
      }
      if ($this->file === 0) {
        watchdog('content', 'Cannot write content to %dest', array('%dest' => $dest));
      }
    }
    return $this->file;
  }

I also attached my modified filefield module, as described in comment #1 above. This includes the HTTP HEAD checks and the returned content types, which dovetail into the enhancements in this post.

I attached the new mime.inc, which is a slightly changed version of the same include file from the fileframework module.

I'd love some feedback on this.

alex_b’s picture

3 questions:

1) In some cases a text/html may be expected (think of when filefield is not used as image field) - are these cases handled yet?
2) I am struggling to understand why we can't just make FeedsEnclosure::getMIMEType() work and then only set the target in filefield_feeds_set_target() if the mime type is an accepted image mime type.
3) Shouldn't getFile() throw an exception if URL can't be downloaded? Couldn't we catch this exception in filefield_feeds_set_target() and bail?

rjbrown99’s picture

Thanks for the reply Alex.

1) Good thought and I just did a bit more research. Within the filefield module, the widget for the field stores allowed file extensions in the field definition. The actual extensions seem to be stored in the database within the content_node_field_instance table, CCK definition for the filefield, widget_settings field. Speaking strictly of the filefield module, upon upload the function in filefield.module called filefield_widget_upload_validators() runs. This checks the uploaded file against the maximum filesize and the allowed extensions.

My initial thought was perhaps we can just call that function to perform the same validation on files that are created from a feed import. The difference in the case of feeds vs. a normal upload is we can't enforce the same level of consistency at the remote end. With filefield, we can just deny a user from uploading a file with the wrong extension. In the case of feeds we may run into a remote site that links us to a file with an extension OR does what I described above and instead sends us a URL mapper such as theirsite.com/getmyfile.php?id=1234&width=500&height=500 with a mimetype of image/jpeg. It's not a 'real' file but is more of a web service serving us a file and we have to deal with the aftermath of trying to check it and save it. They could also send back various HTTP statuses of 403/404 or even return a 'themed' 404 page that just looks like a regular webpage.

This was the rationale behind trying to do a HEAD on the remote 'file' to determine what we are dealing with. Perhaps taking this a step further, we could do an HTTP HEAD, figure out the mimetype of the remote file, compare it against the allowed filetypes from the field definition, and then proceed if we succeeded with those checks. Or we could do this, save the file locally, check the actual mimetype based on php fileinfo, etc.

What do you think?

2) How would this work in the above scenario? To accurately determine mime type, would we would have to fetch it from the remote end first? Or, would we just try to grab whatever is at the remote end, save it somewhere, and then post-process it. There would still have to be logic involved to work around 403/404 and other errors that might come back as text/html. I'm working under the belief that to use FeedsEnclosure::getMIMEType(), we would have to have the file first from getFile().

3) Yes, I think we should have some kind of error handling ability. Some of this will probably depend on the answers to the above as it's somewhat inter-related.

I'm pretty motivated to write whatever is needed to implement this assuming we come to a final approach. I'm dealing with very large files/imports and all of them use filefield.

alex_b’s picture

Status: Needs review » Needs work

1)

Filefield uses the mimedetect modules to map file extensions to mime types. We could leverage that.

if (module_exists('mimedetect')) {
        $type = mimedetect_mime($file);
        if ($type != $file->filemime) {
          $errors[] = t('The file contents (@type) do not match its extension (@extension).', array('@type' => $type, '@extension' => $extension));
        }
      }

2)

I simply suggest to move the detection of mime types that is in filefield_feeds_valid_url() into getMIMEType() - thus this functionality would be more portable. Would that work? Or would we just create way too many FeedsEnclosure objects just to find out their MIME type? Further, what parser are you using? You see how FeedsSimplePie parser simply accepts what simplepie.inc determines as MIME type.

Sorry for the late reply.

rjbrown99’s picture

I'm still working on this code to perform better detection of the remote file via an HTTP HEAD request. I ran into an interesting situation on one of my feeds. I am feeding in an XML file with lots of image links that are to be turned into Drupal filefield/imagefields upon import.

One of the feeds I am using is hosting images at Amazon. Apparently they are using some kind of Amazon service where they only allow for 1 inbound request per second from a single IP. It's not a strict enforcement of that rule and a lot of items work, then you run into an image that instead gets this:

 HTTP/1.1 403 Forbidden
Server: squid
Content-Type: text/html
Content-Length: 1307
X-Squid-Error: ERR_ACCESS_DENIED 0
X-Cache-Lookup: NONE from cdn-images.amazon.com:10080
Cneonction: close
Expires: Sun, 18 Jul 2010 23:52:09 GMT
Date: Sun, 18 Jul 2010 23:52:09 GMT
Connection: close

If you request the same image a few seconds to a minute later, it works fine. So basically you get throttled down with 403's when you ask for too many images in a short period of time. The solution presented in the original issue intentionally makes two requests - one HEAD and one GET. This is part of what's creating the intermittent failures. The end result is you get an import with some working images, and some broken images.

I'm reworking it to support a delay if you get one of those 403s. That seems to have cured most of my problems. I'm also thinking of forward proxy support to reduce the possible load on the remote server. I'm just putting it out there as an FYI in case others are running in to the same thing.

meatbag’s picture

subscribing

pwaterz’s picture

I too am having a similar issue. Filenames need to be url decoded before they are saved and all parameters need to be removed. In the case of http://www.pnas.org/misc/Dawson%20--%20FINAL.mp3?rss=1 the file is being stored as Dawson%20--%20FINAL.mp3?rss=1 which isn't a valid path in drupal.

pwaterz’s picture

FileSize
702 bytes

Here is the patch for #8 fixed my issue.

gnucifer’s picture

How is this issue progressing? I'm willing to work on this functionality, but does the code in this issue reflect the latest changes?

gnucifer’s picture

Status: Needs review » Needs work
FileSize
6.34 KB

Don't know if this is the appropriate issue for this patch. But I needed better mime-detection for files created through the feedapi filefield-mapper. I guess field-specific validation could be added quite easily, but this patch contains no such feature. I also refactored the code for the FeedsEncosure class. The code probably needs some more testing (works fine in my user-case though).

gnucifer’s picture

Status: Needs work » Needs review
vinmassaro’s picture

Version: 6.x-1.x-dev » 7.x-2.x-dev
Status: Needs work » Needs review

I hope this can be revived for D7. I am seeing an issue when importing images from a remote source that isn't providing a file extension. An example would be a filename like imgSrv?objectId=102067&size=large. When I try to load the original version in Colorbox, it appears as a long string of characters instead.

vinmassaro’s picture

Turns out this was due to a misconfiguration on my end and these images paths were actually 404ing.

Bumping this hoping for some traction, or someone else running into this issue. In 7.x, when importing an image that does not contain a file extension (mapped to an image field) I get these notices. Thanks in advance.

Notice: Undefined variable: file in FeedsEnclosure->getFile() (line 381 of /sandbox/d7/sites/all/modules/contrib/feeds/plugins/FeedsParser.inc).

Notice: Undefined variable: file in file_feeds_set_target() (line 84 of /sandbox/d7/sites/all/modules/contrib/feeds/mappers/file.inc).

vinmassaro’s picture

Status: Needs work » Needs review

This issue has been dead, but I came up with a workaround for our purposes by implementing hook_feeds_presave() so hopefully it will help someone. Feeds was not saving images with extensions for filenames like imgSrv?objectId=102067&size=large and not setting the MIME type correctly. The example below targets images without extensions and basically renames the file to add the extension, then updates the file object. Be sure to rename field_image with the machine name of your image field.

/**
 * Implementation of hook_feeds_presave().
 *
 * Add file extensions to files that don't have them.
 */
function mymodule_feeds_presave(FeedsSource $source, $entity, $item) {
  if (!empty($entity->field_image)) {
    foreach ($entity->field_image[LANGUAGE_NONE] as $image) {

      // Get the file extension for each image.
      $extension = pathinfo($image['uri'], PATHINFO_EXTENSION);

      // If an image doesn't have an extension, we need to add it.
      if (empty($extension)) {

        // Load the file object.
        $file = file_load($image['fid']);

        // Get the file extension and mime type.
        $image_type = exif_imagetype($file->uri);
        $extension = image_type_to_extension($image_type, TRUE);
        $mime = image_type_to_mime_type($image_type);

        // Define short extension replacements.
        $extension_replacements = array(
          'jpeg' => 'jpg',
          'tiff' => 'tif',
        );

        // Replace to the short extension.
        $extension = strtr($extension, $extension_replacements);

        // Move the file so the uri gets the proper file extension.
        file_move($file, $file->uri . $extension);

        // Reload the updated file object.
        $file = file_load($file->fid);

        // We may have originally gotten the wrong MIME type, so set it correctly if they don't match.
        if (!empty($file->filemime) && $file->filemime !== $mime) {
          $file->filemime = $mime;
        }

        // Save the file.
        file_save($file);
      }
    }
  }
}

Status: Needs review » Needs work

The last submitted patch, feeds_filefield-mapper-mime-706908-11.patch, failed testing.

Status: Needs review » Needs work
twistor’s picture

Issue summary: View changes
Status: Needs work » Closed (duplicate)

We've added lots more file validation in various issues.