Problem/Motivation

When an import does not complete in one run and the source is fetched via HTTP, the raw source is saved in the database, in the feeds_source table. This is a problem when the raw source is very big: it can result in errors when the content is bigger than the MySQL max allowed packet. For example, if the content is 200 MB and max_allowed_packet = 128M, the import fails with a PDOException with the message "MySQL server has gone away".

It is worth noting that the content is also saved on the file system (in public://feeds) and that when an import continues, the content is read from this file. Saving it to the database as well is therefore redundant, as that copy is never used!

Upon completion of the import, the content is deleted from both the feeds_source table and the file system.

How to reproduce

  1. Create an importer using the HTTP fetcher and the CSV parser. Turn off the "Import on submission" setting. Optionally, set the importer to import as often as possible and enable the "Skip hash check" setting to reproduce the issue again on subsequent imports.
  2. Make sure that the import cannot be completed in one run:
    1. Set the variable "feeds_process_limit" to a low value, for example "1". Each cycle will then import one item, if the parser supports this setting (the CSV parser does, the Common syndication parser does not).
    2. In a custom module, alter the maximum run time of the queue "feeds_source_import". When an import is not completed in one cycle, the job is put on a queue so that multiple cycles can be run during a single cron run. That is why setting only the variable "feeds_process_limit" is not sufficient.
      Code example:
      /**
       * Implements hook_cron_queue_info_alter().
       *
       * Sets runtime limit for feeds_source_import queue to 3 seconds.
       */
      function mymodule_cron_queue_info_alter(&$queues) {
        $queues['feeds_source_import']['time'] = 3;
      }
      
    3. Make sure that processing a single item takes some time to complete, for example at least a second:
      /**
       * Implements hook_feeds_after_save().
       *
       * Ensures that processing a single item takes at least a second to complete.
       */
      function mymodule_feeds_after_save() {
        sleep(1);
      }
      
    4. Have a CSV file with enough items. With the example values used above, you will need at least four items.
  3. Initiate the import.
  4. Run cron.

Proposed resolution

When the fetcher result is saved in the feeds_source table, unset $fetcher_result->raw before serializing the object. This can be done by implementing the magic __sleep() method on it.
Possible caveat for this solution: private members on the fetcher result may not be serialized. FeedsFetcherResult itself has no private members, but classes that extend it could have them, so this could potentially introduce a BC break.
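The idea can be sketched in plain PHP. The class below is a simplified stand-in, not the real FeedsFetcherResult (which has more members and writes the raw source to a file):

```php
<?php

// Simplified stand-in for FeedsFetcherResult, only to illustrate __sleep().
class ExampleFetcherResult {
  protected $raw;
  protected $file_path;

  public function __construct($raw) {
    $this->raw = $raw;
    $this->file_path = 'public://feeds/in_progress/example';
  }

  /**
   * Implements the magic __sleep() method.
   *
   * Excludes $raw so the (potentially huge) source is not serialized into
   * the database. Caveat: private members of subclasses cannot be listed
   * here by their plain names.
   */
  public function __sleep() {
    return array('file_path');
  }
}

$result = new ExampleFetcherResult(str_repeat('x', 1000));
$serialized = serialize($result);

// The raw content is not part of the serialized payload.
var_dump(strpos($serialized, 'xxx') === FALSE); // bool(true)
```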

I also plan to refactor FeedsFetcherResult a bit: the file to import will be saved in $feedsdir/in_progress; the name of the file should contain the importer ID and the feed NID, and the name may be hashed.

Remaining tasks

  1. Write a test that ensures that the content is not saved in the feeds_source table when the import does not complete in one run. (start: #2032333-20: [Meta] Store remote files in the file system instead of in the database)
  2. Write a test that ensures that the import continues where it left off (check this by altering the cache).
  3. Review!

User interface changes

None.

API changes

  • The class FeedsFetcherResult may get a __sleep() method, which could potentially break classes that extend FeedsFetcherResult and have private members.
  • The class FeedsFetcherResult may be refactored a bit, small chance of a BC-break.

Data model changes

  • The raw source is no longer saved in the feeds_source table.
  • The file saved on the file system will probably be renamed to $feedsdir/in_progress/[importer_id]-[feed_nid] (file name may be hashed).
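A minimal sketch of how such a file path could be built. The directory and naming scheme here are assumptions based on the summary above; only the variable 'feeds_in_progress_dir' is mentioned elsewhere in this issue:

```php
<?php

// Hypothetical construction of the in_progress file path
// (importer ID + feed NID, hashed), per the proposed data model.
$importer_id = 'my_importer';
$feed_nid = 42;
$in_progress_dir = 'public://feeds/in_progress';

// Hashing keeps the file name short and uniform.
$path = $in_progress_dir . '/' . md5($importer_id . '-' . $feed_nid);
```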

Comments

MegaChriz created an issue. See original summary.

megachriz’s picture

Status: Active » Needs review

Now that #2829096: Cache result of HTTP source on file system is done, this issue can get worked on.

Starting with the tests taken from #2032333-20: [Meta] Store remote files in the file system instead of in the database. Tests should fail.

Status: Needs review » Needs work

The last submitted patch, 2: feeds-fetcher-raw-2829097-2.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

Oops, feeds_tests_cron_queue_info_alter() did already exist.

Status: Needs review » Needs work

The last submitted patch, 4: feeds-fetcher-raw-2829097-4.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

This is a start for a fix, which changes quite a lot of code. I had taken ideas from the patch in #2032333-20: [Meta] Store remote files in the file system instead of in the database and also checked the code from the D8 version of Feeds. I haven't tested yet if it passes the tests that I added in #4. Just want to see if I'm on the right track.

Anyway, this calls for an automated test that specifically tests FeedsFetcherResult.

Status: Needs review » Needs work

The last submitted patch, 6: feeds-fetcher-raw-2829097-6.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

FeedsFetcherResult::sanitizeRawOptimized() may not be called by the file fetcher, as only variables may be passed by reference. Also, the file fetcher passed an empty string to its parent, causing the $raw property to be set and overwriting things. Let's replace isset($this->raw) with !empty($this->raw).
There may be an endless loop hidden somewhere, but it could also have been caused by the file fetcher...

Status: Needs review » Needs work

The last submitted patch, 8: feeds-fetcher-raw-2829097-8.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

Now with a test that specifically tests the FeedsFetcherResult class.

And with that, I spotted the following mistake in the previous patch (this is from the interdiff):

+++ b/plugins/FeedsFetcher.inc
@@ -54,7 +54,7 @@ class FeedsFetcherResult extends FeedsResult {
-    if (!isset($this->raw)) {
+    if (!empty($this->raw)) {
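The difference between the two checks matters here because the file fetcher passed an empty string: isset() is TRUE for '', while empty() treats '' as empty. A standalone illustration:

```php
<?php

// An empty string is "set" but also "empty", which is exactly the case
// the file fetcher triggered by passing '' to the parent constructor.
$raw = '';

$is_set = isset($raw);    // TRUE: the variable exists.
$has_raw = !empty($raw);  // FALSE: '' counts as empty.

var_dump($is_set);   // bool(true)
var_dump($has_raw);  // bool(false)
```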
megachriz’s picture


I looked at the code of a few contrib fetchers that are listed on https://www.drupal.org/node/856644 and saw that most fetchers call the parent constructor with an empty string, as the file fetcher did before (in the patches from this issue I had changed that to passing NULL). So it is better to keep it an empty string to better match what contrib does.

Other than that, I saw no code in these contrib fetchers that worried me in terms of backwards compatibility. So I'm confident that the attached patch is ready, given that it still passes the tests.

I'll wait for a week and maybe a bit longer for feedback before committing it.

megachriz’s picture

During testing I found three issues with this patch:

  1. Instead of one, two temporary files got saved in the in_progress directory. This is because both FeedsFetcher::getFilePath() and FeedsFetcher::__sleep() can call FeedsFetcher::saveRawToFile(). Both methods do this when using the CSV parser and multiple cron runs are needed to complete the import.
  2. When the temporary file in the in_progress directory got removed before the import was complete, the source got refetched instead of the import being aborted. This is a problem because the source may have changed in the meantime, possibly resulting in missing records.
  3. The variable 'feeds_in_progress_dir' was not documented. It was also not removed upon uninstall.

Two patches attached: one with additional tests only, which should fail, so I'm sure that the problems I found are properly covered; the other with the fixes, which should pass.

The last submitted patch, 12: feeds-fetcher-raw-2829097-12-new-tests.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 12: feeds-fetcher-raw-2829097-12.patch, failed testing.

megachriz’s picture

Some tests failed because of failing branch tests (see #2857935: Fix failing test FeedsMapperMultilingualFieldsTestCase::testChangedLanguageImportForExistingNode()). Back to "Needs review".

megachriz’s picture

Status: Needs work » Needs review

The last submitted patch, 12: feeds-fetcher-raw-2829097-12-new-tests.patch, failed testing.

  • MegaChriz committed 314c210 on 7.x-2.x
    Issue #2829097 by MegaChriz: fixed don't store raw source in...
megachriz’s picture

Status: Needs review » Fixed

I did not spot any new issues. So let's call this issue done. Committed #12!

Oof. This must have been my most challenging Feeds issue so far. 1.5 years of work! Well, for #2032333: [Meta] Store remote files in the file system instead of in the database to be completed, there are still some (small) optimizations to be implemented. But the big chunk of it is done now. :)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.