Enhance filefield mapper to perform more validation on remote URLs [#706908]

Currently the filefield mapper performs minimal validation on remote URLs. Specifically, it runs a valid_url check to make sure it is correct. My suggestion is to enhance the module to perform extra validation. This validation would involve the following:

1) For URLs, check the remote URL with valid_url as it does today
2) Perform an HTTP HEAD on the remote file to make sure we get an OK or other positive response from the server
3) Filter out content types that are not supported by filefield/imagefield

#3 is the big one for me. I'm getting feeds from third parties that are mistakenly inserting a link to a web page and not an image. So we are expecting some type of image but we get text/html instead. The mapper ignores this and continues on to happily create the field, but it's broken when you view the node since it now has an embedded URL instead of an image.

My code for #3 is minimal, with the thought that we can either expand on it or perform closer linkage to filefield/imagefield to only allow for content to be mapped that is supported by those modules. The HEAD solution does create more overhead since you now have to make that web query before saving the node, but it greatly improves the end result as the files are downloaded.

Patch enclosed. Thoughts and comments are welcome.

Comment	File	Size	Author
#11	feeds_filefield-mapper-mime-706908-11.patch	6.34 KB	gnucifer
#9	feeds_filename_fix.patch	702 bytes	pwaterz
#2	mime.inc_.txt	14.85 KB	rjbrown99
#2	filefield.inc_.txt	4.47 KB	rjbrown99
	filefield.patch	2.3 KB	rjbrown99

Comments

Comment #1

rjbrown99 commented 22 February 2010 at 04:29

I still maintain this is a valid enhancement, and I am finding a need to make further improvements. I have a feed that doesn't give me a normal image URL such as:

http://theirsite.com/file.jpg

Instead, it gives me this:

http://theirsite.com/image.ms?product=1234&width=768&height=1024

The script on their side outputs straight image/jpeg as a content type (not application/octet-stream like we currently expect.) We also don't cleanly handle creating a filename. We create something called "FeedsEnclosure-image.ms?product=1234_width=768_height=1024" which doesn't work very well.

I'm going to enhance two things from my original patch.

1) My URL checking function that does the HEAD, filefield_feeds_valid_url, is now going to return a content type instead of TRUE. If we determine a type from the HEAD request, we pass that back. Otherwise we pass back application/octet stream as a default. This will ensure that FeedsEnclosure is called with the correct content type.

2) In the event we don't get a filename, I'm going to figure out a way to account for that and create one.

Comment #2

rjbrown99 commented 22 February 2010 at 07:21

Status	File	Size
new	filefield.inc_.txt	4.47 KB
new	mime.inc_.txt	14.85 KB

OK, I'm done with something that seems to work. I can roll a patch, but it's probably easier to just drop in the function. There were other commits since my first patch.

FeedsParser.inc looks like the following. We accept a passed variable $mimetype, which if it is available comes directly from the HTTP HEAD information on the remote file we are downloading. This allows us to accurately set the mimetype on the file.

Basically, this function is different from the one in the tree now as follows:
1) Accepts passed $mimetype variable
2) Removes/replaces whitespace in filenames with underscores
3) Decodes HTML entities in the basename/filename
4) Determines mimetype extensions if available using a new library called mime.inc.
5) Searches the proposed filename for URL query strings, which indicate we might not have a file. If it finds query strings, we manually create the filename based on the date, some random characters, and the mimetype extension from #4 above.

  /**
   * @return
   *   The file path to the downloaded resource referenced by the enclosure.
   *   Downloads resource if not downloaded yet.
   *
   * @todo This is not concurrency safe.
   */
  public function getFile($mimetype) {
    if(!isset($this->file)) {
      // First set the default filename. Replace spaces and decode all HTML entities.
      $replace = array("%20");
      $filename = str_replace($replace, "_", htmlspecialchars_decode(html_entity_decode(basename($this->getValue()))));
      // Grab the normal file extensions if we have a mimetype
      if($mimetype) {
        feeds_include_library('mime.inc', 'mime');
        $mimetypes = array_flip(feeds_drupal_mime_types());
        if (array_key_exists($mimetype, $mimetypes)) {
          $extensions = explode("|", $mimetypes[$mimetype]);
        }
      }
      // If we see query strings, assume we don't have a good name and create our own
      $qstrings = array("?", "=", "&", ";");
      foreach($qstrings as $qstring) {
        $isqstring[] = strpos($filename, $qstring);
      }
      if(!empty($isqstring)) {
        if($extensions) {
          $filename = date("YmdHis") . '_' . mt_rand() . '.' . $extensions[0];
        } else {
          $filename = date("YmdHis") . '_' . mt_rand();
        }
      }

      $dest = file_destination(file_directory_temp() .'/'. get_class($this) .'-'. $filename, FILE_EXISTS_RENAME);
      if (ini_get('allow_url_fopen')) {
        $this->file = copy($this->getValue(), $dest) ? $dest : 0;
      }
      else {
        $this->file = file_save_data($this->getContent(), $dest);
      }
      if ($this->file === 0) {
        watchdog('content', 'Cannot write content to %dest', array('%dest' => $dest));
      }
    }
    return $this->file;
  }

I also attached my modified filefield module, as described in comment #1 above. This includes the HTTP HEAD checks and the returned content types, which dovetail into the enhancements in this post.

I attached the new mime.inc, which is a slightly changed version of the same include file from the fileframework module.

I'd love some feedback on this.

Comment #3

alex_b commented 23 February 2010 at 21:34

3 questions:

1) In some cases a text/html may be expected (think of when filefield is not used as image field) - are these cases handled yet?
2) I am struggling to understand why we can't just make FeedsEnclosure::getMIMEType() work and then only set the target in filefield_feeds_set_target() if the mime type is an accepted image mime type.
3) Shouldn't getFile() throw an exception if URL can't be downloaded? Couldn't we catch this exception in filefield_feeds_set_target() and bail?

Comment #4

rjbrown99 commented 24 February 2010 at 00:25

Thanks for the reply Alex.

1) Good thought and I just did a bit more research. Within the filefield module, the widget for the field stores allowed file extensions in the field definition. The actual extensions seem to be stored in the database within the content_node_field_instance table, CCK definition for the filefield, widget_settings field. Speaking strictly of the filefield module, upon upload the function in filefield.module called filefield_widget_upload_validators() runs. This checks the uploaded file against the maximum filesize and the allowed extensions.

My initial thought was perhaps we can just call that function to perform the same validation on files that are created from a feed import. The difference in the case of feeds vs. a normal upload is we can't enforce the same level of consistency at the remote end. With filefield, we can just deny a user from uploading a file with the wrong extension. In the case of feeds we may run into a remote site that links us to a file with an extension OR does what I described above and instead sends us a URL mapper such as theirsite.com/getmyfile.php?id=1234&width=500&height=500 with a mimetype of image/jpeg. It's not a 'real' file but is more of a web service serving us a file and we have to deal with the aftermath of trying to check it and save it. They could also send back various HTTP statuses of 403/404 or even return a 'themed' 404 page that just looks like a regular webpage.

This was the rationale behind trying to do a HEAD on the remote 'file' to determine what we are dealing with. Perhaps taking this a step further, we could do an HTTP HEAD, figure out the mimetype of the remote file, compare it against the allowed filetypes from the field definition, and then proceed if we succeeded with those checks. Or we could do this, save the file locally, check the actual mimetype based on php fileinfo, etc.

What do you think?

2) How would this work in the above scenario? To accurately determine mime type, would we would have to fetch it from the remote end first? Or, would we just try to grab whatever is at the remote end, save it somewhere, and then post-process it. There would still have to be logic involved to work around 403/404 and other errors that might come back as text/html. I'm working under the belief that to use FeedsEnclosure::getMIMEType(), we would have to have the file first from getFile().

3) Yes, I think we should have some kind of error handling ability. Some of this will probably depend on the answers to the above as it's somewhat inter-related.

I'm pretty motivated to write whatever is needed to implement this assuming we come to a final approach. I'm dealing with very large files/imports and all of them use filefield.

Comment #5

alex_b commented 28 April 2010 at 18:47

Status:

Needs review

» Needs work

Filefield uses the mimedetect modules to map file extensions to mime types. We could leverage that.

if (module_exists('mimedetect')) {
        $type = mimedetect_mime($file);
        if ($type != $file->filemime) {
          $errors[] = t('The file contents (@type) do not match its extension (@extension).', array('@type' => $type, '@extension' => $extension));
        }
      }

I simply suggest to move the detection of mime types that is in filefield_feeds_valid_url() into getMIMEType() - thus this functionality would be more portable. Would that work? Or would we just create way too many FeedsEnclosure objects just to find out their MIME type? Further, what parser are you using? You see how FeedsSimplePie parser simply accepts what simplepie.inc determines as MIME type.

Sorry for the late reply.

Comment #6

rjbrown99 commented 19 July 2010 at 00:00

I'm still working on this code to perform better detection of the remote file via an HTTP HEAD request. I ran into an interesting situation on one of my feeds. I am feeding in an XML file with lots of image links that are to be turned into Drupal filefield/imagefields upon import.

One of the feeds I am using is hosting images at Amazon. Apparently they are using some kind of Amazon service where they only allow for 1 inbound request per second from a single IP. It's not a strict enforcement of that rule and a lot of items work, then you run into an image that instead gets this:

 HTTP/1.1 403 Forbidden
Server: squid
Content-Type: text/html
Content-Length: 1307
X-Squid-Error: ERR_ACCESS_DENIED 0
X-Cache-Lookup: NONE from cdn-images.amazon.com:10080
Cneonction: close
Expires: Sun, 18 Jul 2010 23:52:09 GMT
Date: Sun, 18 Jul 2010 23:52:09 GMT
Connection: close

If you request the same image a few seconds to a minute later, it works fine. So basically you get throttled down with 403's when you ask for too many images in a short period of time. The solution presented in the original issue intentionally makes two requests - one HEAD and one GET. This is part of what's creating the intermittent failures. The end result is you get an import with some working images, and some broken images.

I'm reworking it to support a delay if you get one of those 403s. That seems to have cured most of my problems. I'm also thinking of forward proxy support to reduce the possible load on the remote server. I'm just putting it out there as an FYI in case others are running in to the same thing.

Comment #7

meatbag commented 19 July 2010 at 09:42

subscribing

Comment #8

pwaterz commented 25 November 2011 at 23:26

I too am having a similar issue. Filenames need to be url decoded before they are saved and all parameters need to be removed. In the case of http://www.pnas.org/misc/Dawson%20--%20FINAL.mp3?rss=1 the file is being stored as Dawson%20--%20FINAL.mp3?rss=1 which isn't a valid path in drupal.

Comment #9

pwaterz commented 25 November 2011 at 23:45

Status	File	Size
new	feeds_filename_fix.patch	702 bytes

Here is the patch for #8 fixed my issue.

Comment #10

gnucifer commented 22 December 2011 at 18:20

How is this issue progressing? I'm willing to work on this functionality, but does the code in this issue reflect the latest changes?

Comment #11

gnucifer commented 30 December 2011 at 13:20

Status:

Needs review

» Needs work

Status	File	Size
new	feeds_filefield-mapper-mime-706908-11.patch	6.34 KB

Don't know if this is the appropriate issue for this patch. But I needed better mime-detection for files created through the feedapi filefield-mapper. I guess field-specific validation could be added quite easily, but this patch contains no such feature. I also refactored the code for the FeedsEncosure class. The code probably needs some more testing (works fine in my user-case though).

Comment #12

gnucifer commented 28 December 2011 at 13:55

Status:

Needs work

» Needs review

Comment #13

vinmassaro commented 11 October 2012 at 00:01

Version:	6.x-1.x-dev	» 7.x-2.x-dev
Status:	Needs work	» Needs review

I hope this can be revived for D7. I am seeing an issue when importing images from a remote source that isn't providing a file extension. An example would be a filename like imgSrv?objectId=102067&size=large. When I try to load the original version in Colorbox, it appears as a long string of characters instead.

Comment #14

vinmassaro commented 19 February 2013 at 15:38

Turns out this was due to a misconfiguration on my end and these images paths were actually 404ing.

Bumping this hoping for some traction, or someone else running into this issue. In 7.x, when importing an image that does not contain a file extension (mapped to an image field) I get these notices. Thanks in advance.

Notice: Undefined variable: file in FeedsEnclosure->getFile() (line 381 of /sandbox/d7/sites/all/modules/contrib/feeds/plugins/FeedsParser.inc).

Notice: Undefined variable: file in file_feeds_set_target() (line 84 of /sandbox/d7/sites/all/modules/contrib/feeds/mappers/file.inc).

Comment #15

vinmassaro commented 18 March 2013 at 16:58

Status:

Needs work

» Needs review

This issue has been dead, but I came up with a workaround for our purposes by implementing hook_feeds_presave() so hopefully it will help someone. Feeds was not saving images with extensions for filenames like imgSrv?objectId=102067&size=large and not setting the MIME type correctly. The example below targets images without extensions and basically renames the file to add the extension, then updates the file object. Be sure to rename field_image with the machine name of your image field.

/**
 * Implementation of hook_feeds_presave().
 *
 * Add file extensions to files that don't have them.
 */
function mymodule_feeds_presave(FeedsSource $source, $entity, $item) {
  if (!empty($entity->field_image)) {
    foreach ($entity->field_image[LANGUAGE_NONE] as $image) {

      // Get the file extension for each image.
      $extension = pathinfo($image['uri'], PATHINFO_EXTENSION);

      // If an image doesn't have an extension, we need to add it.
      if (empty($extension)) {

        // Load the file object.
        $file = file_load($image['fid']);

        // Get the file extension and mime type.
        $image_type = exif_imagetype($file->uri);
        $extension = image_type_to_extension($image_type, TRUE);
        $mime = image_type_to_mime_type($image_type);

        // Define short extension replacements.
        $extension_replacements = array(
          'jpeg' => 'jpg',
          'tiff' => 'tif',
        );

        // Replace to the short extension.
        $extension = strtr($extension, $extension_replacements);

        // Move the file so the uri gets the proper file extension.
        file_move($file, $file->uri . $extension);

        // Reload the updated file object.
        $file = file_load($file->fid);

        // We may have originally gotten the wrong MIME type, so set it correctly if they don't match.
        if (!empty($file->filemime) && $file->filemime !== $mime) {
          $file->filemime = $mime;
        }

        // Save the file.
        file_save($file);
      }
    }
  }
}

Comment #16

15 March 2013 at 21:08

Status:

Needs review

» Needs work

The last submitted patch, feeds_filefield-mapper-mime-706908-11.patch, failed testing.

Comment #17

2 January 2014 at 19:06

Status:

Needs review

» Needs work

Correcting issue status, see #2125263: Issue status not taken from last comment during D7 migration.

Comment #18

twistor commented 20 July 2015 at 06:30

Issue summary:	View changes
Status:	Needs work	» Closed (duplicate)

We've added lots more file validation in various issues.

Enhance filefield mapper to perform more validation on remote URLs

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Referenced by

News items

Our community

Documentation

Drupal code base

Governance of community