Problem/Motivation

When fetching content via http_request_get(), the raw source is statically cached. This can result in a PHP memory exhaustion error if many files are fetched in one request.

Proposed resolution

Do not statically cache the raw source in http_request_get(). To prevent requesting the same URL more than once in the same request, only note with a statically cached boolean value that the content has already been fetched. If content for the same URL is requested again during the same request, return the content directly from the cache.
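
A minimal sketch of the idea, assuming the existing cache_feeds_http bin and core's cache API (the cache ID, the simplified structure and the use of drupal_http_request() are illustrative; the real http_request_get() takes more parameters and builds the request differently):

  /**
   * Illustrative sketch: statically remember only *that* a URL has been
   * fetched, not the raw response body; repeat calls within the same request
   * are served from the cache_feeds_http bin.
   */
  function http_request_get($url) {
    $fetched = &drupal_static(__FUNCTION__, array());
    $cid = 'feeds_http_download_' . md5($url);

    if (!empty($fetched[$url])) {
      // Second call in the same request: return the cached copy directly,
      // without asking the source URL whether the data has changed.
      if ($cached = cache_get($cid, 'cache_feeds_http')) {
        return $cached->data;
      }
    }

    // First call in this request: download, cache, and flag the URL as fetched.
    $result = drupal_http_request($url);
    cache_set($cid, $result, 'cache_feeds_http');
    $fetched[$url] = TRUE;

    return $result;
  }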

Remaining tasks

  • Implement the proposed resolution.
  • Write a test that ensures that the content isn't refetched via HTTP when requesting the URL a second time during the same request.

User interface changes

None.

API changes

None.

Data model changes

None.

Original report by 75th Trombone

This seems very, very similar to many issues I've found, but none of them seem exactly right, nor do their solutions fix my problem.

I'm updating around 140 nodes with a new file that goes in a certain file field on the nodes. I'm importing a CSV, and the CSV itself is not too big at all (and a large CSV is the source of some of the similar problems I've found reported). But the imported files themselves are large — the smallest are around 4 MB and the largest are around 12 MB.

When I run the import, it runs for a bit, but it fails with a PHP memory exhaustion error. It doesn't always get exactly the same number of nodes into the import before failing, but the farthest it's gotten is a combined 110 MB worth of files, and my PHP memory limit is 128 MB.

Clearly, these files' contents, and not just their file names, are being stored in memory. I found #2032333-8: [Meta] Store remote files in the file system instead of in the database, but the patch didn't solve my problem. I've tried the import both with and without "Process in background" turned on, and I get a memory error either way, just delivered differently (one shows an AJAX error page, the other shows a white screen).

Is there anything I can do to fix this? Can the work that's been done in that issue be applied to imported files, not just the import feed itself?


Comments

75th Trombone created an issue. See original summary.

MegaChriz’s picture

#2032333: [Meta] Store remote files in the file system instead of in the database is about caching a source that is fetched via the HTTP Fetcher on the local file system instead of in the database. When the next import is run, the source provider is asked whether the contents have changed. If not, the cached file is used.

If I look at the code in the class FeedsEnclosure in plugins/FeedsParser.inc, I see that in the method getFile() the complete contents of the file are loaded into memory. This has to be done if the file is fetched via HTTP; I think there is no way around it.
If I understand the code from FeedsEnclosure::getFile() correctly, the file is not loaded into memory when it is already on the local file system. If the files to import are already on your local file system, you could try to specify a path in your CSV file instead of a URL.
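
For example, with made-up column names and paths, the difference would be between rows like these:

  title,file
  "Some node","http://example.com/files/report.pdf"
  "Some node","sites/default/files/import/report.pdf"

The first row is fetched over HTTP and loaded into memory by FeedsEnclosure::getFile(); the second points to a file that is already on the local file system.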

So far I haven't seen yet how the contents of the file to save is being kept in memory. The file contents are passed to file_save_data() which should save the file and then I would expect that PHP automatically frees the memory.

dwillems’s picture

I started running into OOM issues yesterday with a feed that has some high-resolution photo attachments. I too tried the patch in #2032333, which solved the OOM errors in certain cases, but it turns out the root of my problem is in http_request.inc.

If you look at http_request_get(), you will notice it calls http_request_get_cache() and http_request_set_cache() to cache enclosures in the same cache_feeds_http bucket that the above-mentioned patch moves to a file cache. This seems to be working fine. However, the function also uses drupal_static() to cache file data in static (per-request) variables. I don't really understand why both would be necessary; it makes no sense to me. I confirmed that commenting out the drupal_static() usage puts an end to my errors. My feed imports now finish properly in Drush every time.

MegaChriz’s picture

@dwillems
http_request_get() caches the received data in a static variable to ensure that the data isn't retrieved from the same URL twice during a request. Even when the data is cached in the cache_feeds_http bin, the source URL is still consulted to find out whether the data has changed in the meantime. http_request_get() asks this by sending the HTTP header "If-Modified-Since" along. Only if the source URL reports HTTP status code 304 is the data from cache_feeds_http used. It's possible that the source URL doesn't support returning HTTP status code 304. Statically caching the retrieved data prevents the source from being requested over and over again in that case.
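
Put into code, the behaviour described above is roughly the following (a simplified sketch; the cache ID, variable names and the use of drupal_http_request() are illustrative, not the literal code in http_request.inc):

  function http_request_get($url) {
    // Static (per-request) cache of the raw response: avoids consulting the
    // source URL twice in one request, but keeps every downloaded file in
    // memory for the rest of the request.
    $downloads = &drupal_static(__FUNCTION__, array());
    if (isset($downloads[$url])) {
      return $downloads[$url];
    }

    $cid = 'feeds_http_download_' . md5($url);
    $cached = cache_get($cid, 'cache_feeds_http');

    // Even with a cached copy, ask the source URL whether it has changed.
    $headers = array();
    if ($cached) {
      $headers['If-Modified-Since'] = gmdate('D, d M Y H:i:s', $cached->created) . ' GMT';
    }
    $response = drupal_http_request($url, array('headers' => $headers));

    if ($cached && $response->code == 304) {
      // "Not Modified": reuse the data from the cache_feeds_http bin.
      $result = $cached->data;
    }
    else {
      // New or changed data, or a source that never returns 304.
      $result = $response;
      cache_set($cid, $result, 'cache_feeds_http');
    }

    // The memory problem lives here: the complete response stays in memory
    // for the rest of the request.
    $downloads[$url] = $result;
    return $result;
  }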

I can understand the memory issues. Maybe http_request_get() should only note that data from a certain URL has been retrieved, and when the function is called a second time for the same URL, just return the data that is in the cache_feeds_http bin (without consulting the source URL to check whether the data has changed)?

MegaChriz’s picture

I've updated the issue summary, and plan to fix this as part of #2032333: [Meta] Store remote files in the file system instead of in the database.

MegaChriz’s picture

This patch removes the raw source from the static cache in http_request_get().

MegaChriz’s picture

I've tested the patch myself by importing about 50 images varying in size from 3.3 MB to 4.6 MB, with the PHP memory limit set to 128 MB.
Without the patch applied, PHP would run out of memory. With the patch applied, memory peaked at 15 MB, so the patch certainly proves to be an improvement.

When importing a CSV file of about 65 MB, however, PHP would still run out of memory. There is not much I can do about that with the current architecture, I'm afraid, as the downloaded resource still needs to be passed around, and that costs memory: twice or maybe three times the size of the file. This could perhaps be fixed if files were downloaded as a stream. That's food for another issue, though; it can be handled in #1793806: Add support for downloading from stream wrappers.
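
For reference, streaming a download straight to disk could look roughly like the sketch below; this only illustrates the idea behind #1793806 (the destination and cURL options are chosen for the example), it is not part of this patch:

  // Write the response body directly to a temporary file, so it never has to
  // fit in PHP memory as one large string.
  $destination = drupal_tempnam('temporary://', 'feeds_http_');
  // Use the real path, as cURL needs a regular file handle.
  $fp = fopen(drupal_realpath($destination), 'wb');

  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_FILE, $fp);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
  curl_exec($ch);
  curl_close($ch);
  fclose($fp);

  // $destination now holds the downloaded data on disk; pass this URI around
  // instead of the raw contents.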

As a more theoretical improvement, I assigned the downloaded resource in http_request_get() directly to $result->data. Otherwise there would be two variables in that function holding the downloaded resource: $data and $result->data. Manual testing didn't show a memory improvement, however, probably because the downloaded resource isn't handled in the most efficient way once it leaves http_request_get().
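
Conceptually (not the literal diff), the change is:

  // Before: the downloaded body is referenced by two variables in the function.
  $data = curl_exec($ch);   // assumes CURLOPT_RETURNTRANSFER is set
  // ...
  $result->data = $data;

  // After: assign the body straight to the result object, so only
  // $result->data references it.
  $result->data = curl_exec($ch);

Because PHP copies strings on write, the extra variable by itself doesn't necessarily duplicate the data in memory, which may be part of why no measurable difference showed up.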

If this passes tests, then it's ready for commit.

  • MegaChriz committed 2c52b18 on 7.x-2.x
    Issue #2662892 by MegaChriz: Do not cache downloaded source statically...
MegaChriz’s picture

Status: Needs review » Fixed

Committed #7.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.