Problem/Motivation

When an import does not complete in one run and the source is fetched via HTTP, the raw source is saved in the database, in the feeds_source table. This is a problem when the raw source is very big: it can result in errors when the content is bigger than the MySQL max allowed packet. For example, if the content is 200 MB and max_allowed_packet = 128M, the import fails with a PDOException with the message "MySQL server has gone away".

It is worth noting that the content is also saved on the file system (in public://feeds) and that when an import continues, the content is read from this file. Saving it to the database as well is therefore redundant, as that copy is never used!

Upon completion of the import, the content is deleted from both the feeds_source table and the file system.

How to reproduce

  1. Create an importer using the HTTP fetcher and the CSV parser. Turn off the "Import on submission" setting. Optionally, set the importer to import as often as possible and enable the "Skip hash check" setting to reproduce the issue again on subsequent imports.
  2. Make sure that the import cannot be completed in one run:
    1. Set the variable "feeds_process_limit" to a low value, for example "1". Each cycle will then import one item, if the parser supports this setting (the CSV parser does, the Common syndication parser does not).
    2. In a custom module, alter the maximum run time of the queue "feeds_source_import". When an import is not completed in one cycle, the job is put on a queue so that multiple cycles can be run during a single cron run. That is why setting only the variable "feeds_process_limit" is not sufficient.
      Code example:
      /**
       * Implements hook_cron_queue_info_alter().
       *
       * Sets runtime limit for feeds_source_import queue to 3 seconds.
       */
      function mymodule_cron_queue_info_alter(&$queues) {
        $queues['feeds_source_import']['time'] = 3;
      }
      
    3. Make sure that processing a single item takes some time to complete, for example at least a second:
      /**
       * Implements hook_feeds_after_save().
       *
       * Ensures that processing a single item takes at least a second to complete.
       */
      function mymodule_feeds_after_save() {
        sleep(1);
      }
      
    4. Have a CSV file with enough items. With the example values used above, you will need at least four items.
  3. Initiate the import.
  4. Run cron.

Proposed resolution

When the fetcher result is saved in the feeds_source table, unset $fetcher_result->raw before serializing the object. This can be done by implementing the magic __sleep() method on it.
Possible caveat for this solution: private members on the fetcher result may not be serialized. FeedsFetcherResult itself has no private members, but classes that extend it could have them, so this could potentially introduce a BC break.
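The idea can be sketched in plain PHP. The class below is a simplified stand-in, not the real FeedsFetcherResult (which has more members and writes the raw source to a file):

```php
<?php

// Simplified stand-in for FeedsFetcherResult, only to illustrate __sleep().
class ExampleFetcherResult {
  protected $raw;
  protected $file_path;

  public function __construct($raw) {
    $this->raw = $raw;
    $this->file_path = 'public://feeds/in_progress/example';
  }

  /**
   * Implements the magic __sleep() method.
   *
   * Excludes $raw so the (potentially huge) source is not serialized into
   * the database. Caveat: private members of subclasses cannot be listed
   * here by their plain names.
   */
  public function __sleep() {
    return array('file_path');
  }
}

$result = new ExampleFetcherResult(str_repeat('x', 1000));
$serialized = serialize($result);

// The raw content is not part of the serialized payload.
var_dump(strpos($serialized, 'xxx') === FALSE); // bool(true)
```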

I also plan to refactor FeedsFetcherResult a bit: the file to import will be saved in $feedsdir/in_progress; the name of the file should contain the importer ID and the feed NID, and the name may be hashed.

Remaining tasks

  1. Write a test that ensures that the content is not saved in the feeds_source table when the import does not complete in one run. (start: #2032333-20: [Meta] Store remote files in the file system instead of in the database)
  2. Write a test that ensures that the import continues where it left off (check this by altering the cache).
  3. Review!

User interface changes

None.

API changes

  • The class FeedsFetcherResult may get a __sleep() method, which could potentially break classes that extend FeedsFetcherResult and have private members.
  • The class FeedsFetcherResult may be refactored a bit, small chance of a BC-break.

Data model changes

  • The raw source is no longer saved in the feeds_source table.
  • The file saved on the file system will probably be renamed to $feedsdir/in_progress/[importer_id]-[feed_nid] (file name may be hashed).
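A minimal sketch of how such a file path could be built. The directory and naming scheme here are assumptions based on the summary above; only the variable 'feeds_in_progress_dir' is mentioned elsewhere in this issue:

```php
<?php

// Hypothetical construction of the in_progress file path
// (importer ID + feed NID, hashed), per the proposed data model.
$importer_id = 'my_importer';
$feed_nid = 42;
$in_progress_dir = 'public://feeds/in_progress';

// Hashing keeps the file name short and uniform.
$path = $in_progress_dir . '/' . md5($importer_id . '-' . $feed_nid);
```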

Comments

MegaChriz created an issue. See original summary.

megachriz’s picture

Status: Active » Needs review

Now that #2829096: Cache result of HTTP source on file system is done, this issue can get worked on.

Starting with the tests taken from #2032333-20: [Meta] Store remote files in the file system instead of in the database. Tests should fail.

Status: Needs review » Needs work

The last submitted patch, 2: feeds-fetcher-raw-2829097-2.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

Oops, feeds_tests_cron_queue_info_alter() did already exist.

Status: Needs review » Needs work

The last submitted patch, 4: feeds-fetcher-raw-2829097-4.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

This is a start for a fix, which changes quite a lot of code. I had taken ideas from the patch in #2032333-20: [Meta] Store remote files in the file system instead of in the database and also checked the code from the D8 version of Feeds. I haven't tested yet if it passes the tests that I added in #4. Just want to see if I'm on the right track.

Anyway, this calls for an automated test that specifically tests FeedsFetcherResult.

Status: Needs review » Needs work

The last submitted patch, 6: feeds-fetcher-raw-2829097-6.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

FeedsFetcherResult::sanitizeRawOptimized() may not be called by the file fetcher, as only variables may be passed by reference. Also, the file fetcher passed an empty string to its parent, causing the $raw property to be set and overwriting things. Let's replace isset($this->raw) with !empty($this->raw).
There may be an endless loop hidden somewhere, but it could also have been caused by the file fetcher...

Status: Needs review » Needs work

The last submitted patch, 8: feeds-fetcher-raw-2829097-8.patch, failed testing.

megachriz’s picture

Status: Needs work » Needs review

Now with a test that specifically tests the FeedsFetcherResult class.

And with that, I spotted the following mistake in the previous patch (this is from the interdiff):

+++ b/plugins/FeedsFetcher.inc
@@ -54,7 +54,7 @@ class FeedsFetcherResult extends FeedsResult {
-    if (!isset($this->raw)) {
+    if (!empty($this->raw)) {
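The difference between the two checks matters here because the file fetcher passed an empty string: isset() is TRUE for '', while empty() treats '' as empty. A standalone illustration:

```php
<?php

// An empty string is "set" but also "empty", which is exactly the case
// the file fetcher triggered by passing '' to the parent constructor.
$raw = '';

$is_set = isset($raw);    // TRUE: the variable exists.
$has_raw = !empty($raw);  // FALSE: '' counts as empty.

var_dump($is_set);   // bool(true)
var_dump($has_raw);  // bool(false)
```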
megachriz’s picture


I looked at the code of a few contrib fetchers that are listed on https://www.drupal.org/node/856644 and saw that most fetchers call the parent constructor with an empty string, as the file fetcher did before (in the patches from this issue I had changed that to passing NULL). So it is better to keep it an empty string to better match what contrib does.

Other than that, I saw no code in these contrib fetchers that worried me in terms of backwards compatibility. So I'm confident that the attached patch is ready, given that it still passes the tests.

I'll wait for a week and maybe a bit longer for feedback before committing it.

megachriz’s picture

During testing I found three issues with this patch:

  1. Instead of one, two temporary files got saved in the in_progress directory. This is because both FeedsFetcher::getFilePath() and FeedsFetcher::__sleep() can call FeedsFetcher::saveRawToFile(). Both methods do this when using the CSV parser and multiple cron runs are needed to complete the import.
  2. When the temporary file in the in_progress directory got removed before the import was complete, the source got refetched instead of the import being aborted. This is a problem because the source may have changed in the meantime, possibly resulting in missing records.
  3. The variable 'feeds_in_progress_dir' was not documented. It was also not removed upon uninstall.

Two patches attached: one with additional tests only, which should fail, so I'm sure that the problems I found are properly covered; the other with the fixes, which should pass.

The last submitted patch, 12: feeds-fetcher-raw-2829097-12-new-tests.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 12: feeds-fetcher-raw-2829097-12.patch, failed testing.

megachriz’s picture

Some tests failed because of failing branch tests (see #2857935: Fix failing test FeedsMapperMultilingualFieldsTestCase::testChangedLanguageImportForExistingNode()). Back to "Needs review".

megachriz’s picture

Status: Needs work » Needs review

The last submitted patch, 12: feeds-fetcher-raw-2829097-12-new-tests.patch, failed testing.

  • MegaChriz committed 314c210 on 7.x-2.x
    Issue #2829097 by MegaChriz: fixed don't store raw source in...
megachriz’s picture

Status: Needs review » Fixed

I did not spot any new issues. So let's call this issue done. Committed #12!

Oof. This must have been my most challenging Feeds issue so far. 1.5 years of work! Well, for #2032333: [Meta] Store remote files in the file system instead of in the database to be completed, there are still some (small) optimizations to be implemented. But the big chunk of it is done now. :)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.