While trying to import a 100 MB CSV feed, I ran into PHP memory trouble. The cause is that the http_request_get() function tries to cache the full result: cache_set('feeds_http_download_' . md5($url), $result);

So I changed it into:

  // Cache only the response headers; skip the (potentially huge) body.
  $headers_only = new stdClass();
  $headers_only->headers = $result->headers;
  cache_set('feeds_http_download_' . md5($url), $headers_only);

I understand the benefit of caching the full result was the HTTP 304 (Not Modified) case, which my change removes.

Maybe this caching logic could be made optional? In any case, I override the http_request call in a custom Fetcher.inc, so it doesn't affect me.


Comments

Anonymous’s picture

I just noticed that all feeds enclosures are also stored in full in the cache table. For me that is not desirable: the enclosures are usually unique and will only be downloaded once.

It would make much more sense to check whether the local file already exists and, if so, skip the re-download. That would be a performance improvement, and you wouldn't even need to store everything in the cache table.
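A minimal sketch of that idea, assuming a Drupal 7 context; the function name and destination handling here are hypothetical, not Feeds code:

  function example_fetch_enclosure($url, $destination) {
    // If the enclosure was downloaded before, reuse the local copy.
    if (file_exists($destination)) {
      return $destination;
    }
    $result = drupal_http_request($url);
    if ($result->code == 200) {
      // Save to disk instead of the cache table.
      return file_unmanaged_save_data($result->data, $destination, FILE_EXISTS_REPLACE);
    }
    return FALSE;
  }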

5n00py’s picture

Yes! In my case the cache table is 2 GB in size! There is no reason to store full files, only a hash or the headers, etc.!
Subscribing!

Anonymous’s picture

Version: 7.x-2.0-alpha4 » 7.x-2.x-dev

I solved it by implementing a custom http_request.inc and simply removing/rewriting the cache_set()/cache_get() part. My solution is to save only the headers and to check the file's date on disk. I keep my feeds for 7 days before re-downloading them.

I'd post my http_request.inc file here, but it's too customized now. I might post a patch here if I standardize the changes.
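For what it's worth, a rough sketch of the approach described above (cache only the headers, trust the file on disk for 7 days) might look like this; the local path and the surrounding plumbing are assumptions, not the poster's actual http_request.inc:

  $cache_lifetime = 7 * 24 * 60 * 60; // Keep downloads for 7 days.
  $local_file = 'private://feeds/' . md5($url); // Hypothetical location.

  if (file_exists($local_file) && REQUEST_TIME - filemtime($local_file) < $cache_lifetime) {
    // The file on disk is fresh enough: skip the download.
    $data = file_get_contents($local_file);
  }
  else {
    $result = drupal_http_request($url);
    file_unmanaged_save_data($result->data, $local_file, FILE_EXISTS_REPLACE);
    // Cache only the headers, as in the original report.
    $headers_only = new stdClass();
    $headers_only->headers = $result->headers;
    cache_set('feeds_http_download_' . md5($url), $headers_only);
    $data = $result->data;
  }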

5n00py’s picture

Status: Active » Closed (fixed)

I don't know why, but: I cleared all my caches (including the http_request caches) and reimported my database. Now, after the import, Feeds recreates all the caches but without the full data; the table is now very small (~660 bytes/row instead of ~600 KB/row).
So as far as I can see, this issue is fixed.
Version: 7.x-2.0-alpha5

5n00py’s picture

Status: Closed (fixed) » Active

Hm, no, I made a mistake when testing this issue! Feeds actually does save the full fetched data (in my case, image files) in the cache table!

Sorry for the wrong info in #4.

leendertdb’s picture

Hi all, I have taken the liberty of creating a patch that makes the feed cache optional.

Our use case was somewhat similar. We run a feeds import every minute in order to sync two different systems. The file Feeds downloads is about 20 MB in size, which caused issues with our memcache setup, as memcache only allows objects up to 1 MB in size by default. In that case it tries to split the object into 20 different chunks, which causes additional delay. In the end we don't need caching at all, since the file being downloaded is dynamically generated and will thus always be different anyway.

I have created a patch for this issue, which adds an extra option to the feed fetcher configuration: "Cache HTTP result of request" (see screenshot). This option is enabled by default in order to maintain backwards compatibility.

With this patch you will be able to disable the cache on a per-feed basis on the fly. It maintains full backwards compatibility with existing feeds.
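For context, an option like that plugs into the Feeds fetcher plugin API roughly as follows. This is a sketch of the shape of the change, and the 'cache_http_result' key name is an assumption, not necessarily what the patch uses:

  class ExampleHTTPFetcher extends FeedsHTTPFetcher {

    public function configDefaults() {
      // Enabled by default for backwards compatibility.
      return array('cache_http_result' => TRUE) + parent::configDefaults();
    }

    public function configForm(&$form_state) {
      $form = parent::configForm($form_state);
      $form['cache_http_result'] = array(
        '#type' => 'checkbox',
        '#title' => t('Cache HTTP result of request'),
        '#default_value' => $this->config['cache_http_result'],
      );
      return $form;
    }
  }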

leendertdb’s picture

Status: Active » Needs review
FileSize
371.47 KB

Attaching a screenshot of what the option looks like in the backend, and updating the status to "needs review".

MegaChriz’s picture

@leendertdb
This is cool! I've made this a child issue of #2032333: [Meta] Store remote files in the file system instead of in the database, so that it becomes a target for Feeds 7.x-2.0-beta4.

Note, though, that your patch doesn't fix the whole issue. Image files, for example (when mapping to an image field), are also downloaded via http_request_get(), but not via the HTTP fetcher, so these would still end up in the cache. See FeedsEnclosure::getContent() in feeds/plugins/FeedsParser.inc.

It would be great if you could give #2829096: Cache result of HTTP source on file system a review too. The goal of that issue is to store the cached source on the file system and the metadata of the HTTP response in the database.

MegaChriz’s picture

As I don't like the way options are added to http_request_get(), I opened #2877125: Refactor http_request_get().

MegaChriz’s picture

Here is a reroll and an automated test.

Since recent changes mean the downloaded HTTP result is now cached on the file system by default, I changed the description of the "Cache HTTP result of request" option to this:

Disabling this cache means that the downloaded source will not be cached (for example: on the file system, in the database or on memcache), but will be redownloaded on every feeds import attempt. This can be helpful if the source to download is dynamically generated and will always be different, or if it is very large (50 MB+) which could cause PHP memory timeout issues.

If you are having issues with the feeds importer not processing changes to the source file, or are experiencing caching issues, you can disable the caching of this feeds content.

MegaChriz’s picture

I've done an experiment to optimize downloading data via HTTP; see #2880129: Optimize downloading data via HTTP. With the help of the "Display memory usage" setting from the Devel module, and by executing cron manually using the link on the Drupal status report page (admin/reports/status), I came to the conclusion that the real memory hog is feeds_http_request() (formerly http_request_get()); memory usage barely increases after that (even with the optimization). Therefore, claiming that disabling the cache could fix PHP memory timeout issues is wrong.
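The same kind of check can be reproduced without Devel by bracketing the call; a sketch, assuming feeds_http_request() takes the URL as its first argument, as its predecessor did:

  $before = memory_get_peak_usage(TRUE);
  $result = feeds_http_request($url);
  $after = memory_get_peak_usage(TRUE);
  // Nearly all of the growth shows up here, not in later processing.
  drupal_set_message(t('Download used @mb MB of memory.', array('@mb' => round(($after - $before) / 1048576, 2))));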

In the new patch, the description is now this:

Disabling this cache means that the downloaded source will not be cached (for example: on the file system or on memcache), but will be redownloaded on every feeds import attempt. This can be helpful if the source to download is dynamically generated and will always be different, or if it is very large (50 MB+) which would cost additional webspace.

If you are having issues with the feeds importer not processing changes to the source file, or are experiencing caching issues, you can disable the caching of this feeds content.

  • MegaChriz committed c787299 on 7.x-2.x
    Issue #1360910 by MegaChriz, leendertdb: Added an option for the HTTP...
MegaChriz’s picture

Status: Needs review » Fixed

Committed #11 with small changes to the wording. The last bit now reads:

If you're having issues with Feeds not processing changes from the source file, or if you are experiencing caching issues, you can disable the caching of this feeds content.

leendertdb’s picture

@MegaChriz,

Thanks for committing the patch. One note, though, about the change in the form description: I see that you changed my wording of "1-4 MB" to "50 MB+". The reason I added the "1-4 MB" part is a limitation in memcache.

By default, memcache only supports cache items up to 1 MB in size. When you try to cache, for example, a 10 MB variable, it will be split up into 10 different 1 MB chunks. This can cause a lot of overhead in memcache sets/gets and also (depending on whether your memcache instance is on the local machine or on a different node in the network) a lot of network bandwidth usage.

The same reasoning applies to the database: when the database instance is on another machine in the network, the cache_get()/cache_set() traffic can add up quickly, to the point where a 1 Gbit/s internal network connection becomes a bottleneck. At least, that's what was happening on our end with a multi-server setup on a high-traffic Drupal platform.
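For reference, Drupal 7's pluggable cache backends let you route a single heavy bin away from memcache entirely; a settings.php sketch, assuming the memcache module's usual setup:

  // Let memcache back the caches in general...
  $conf['cache_backends'][] = 'sites/all/modules/memcache/memcache.inc';
  $conf['cache_default_class'] = 'MemCacheDrupal';
  // ...but keep the large Feeds HTTP bin in the database to avoid
  // memcached's 1 MB item limit and the chunking overhead.
  $conf['cache_class_cache_feeds_http'] = 'DrupalDatabaseCache';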

Just an FYI about the reasoning behind my originally proposed description.

MegaChriz’s picture

@leendertdb
I changed that text because in the dev version I have introduced a cache class for the cache_feeds_http bin. The new default is that only the headers are cached in the database, while the raw file data is cached on the file system. I've made this so that even when you set a different cache class for cache_feeds_http, the file contents are still cached on the file system: you can only explicitly override this behavior with a cache class that extends FeedsHTTPCache. This way the file contents should not end up in memcache when using memcache for the cache_feeds_http bin. I've backed up this behavior with an automated test; see FeedsFileHTTPTestCase::testHTTPCacheOverride().
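So an explicit opt-out would look something like this sketch, where MyFeedsHTTPCache is a hypothetical class name:

  // Only a class extending FeedsHTTPCache may change where the file
  // contents are stored; anything else keeps them on the file system.
  class MyFeedsHTTPCache extends FeedsHTTPCache {
    // Override the storage behavior here.
  }

  // settings.php:
  $conf['cache_class_cache_feeds_http'] = 'MyFeedsHTTPCache';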

leendertdb’s picture

@MegaChriz,

That sounds like a good change; it would have solved our issue at the time as well. Thanks for clarifying. +1 :-)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.