Problem/Motivation
When an import does not complete in one run and the source is fetched via HTTP, the raw source is saved in the database, in the feeds_source table. This is a problem when the raw source is very big: when the content is bigger than the MySQL max allowed packet, the save fails. For example, if the content is 200 MB and max_allowed_packet = 128M, the result is a PDOException with the message "MySQL server has gone away".
It is interesting to note that the content is also saved on the file system (in public://feeds) and that, when an import continues, the content is taken from this file. Saving it to the database as well is therefore redundant, as it is not being used!
Upon completion of the import, the content is deleted from both the feeds_source table and the file system.
How to reproduce
- Create an importer using the HTTP fetcher and the CSV parser. Turn off the "Import on submission" setting. Optionally, set it to import as often as possible and enable the "skip hash check" setting so the issue can be reproduced again on subsequent imports.
- Make sure that the import cannot be completed in one run:
- Set the variable "feeds_process_limit" to a low value, for example "1". Each cycle will then import one item if the parser supports this (the CSV parser supports this setting, the Common syndication parser does not).
- In a custom module, alter the maximum run time of the queue "feeds_source_import". When an import is not completed in one cycle, the job is put on a queue so that multiple cycles can be run on a single cron run. That is why setting only the variable "feeds_process_limit" is not sufficient.
Code example:

```php
/**
 * Implements hook_cron_queue_info_alter().
 *
 * Sets the runtime limit for the feeds_source_import queue to 3 seconds.
 */
function mymodule_cron_queue_info_alter(&$queues) {
  $queues['feeds_source_import']['time'] = 3;
}
```

- Make sure that processing a single item takes more time to complete:
```php
/**
 * Implements hook_feeds_after_save().
 *
 * Ensures that processing a single item takes at least a second to complete.
 */
function mymodule_feeds_after_save() {
  sleep(1);
}
```

- Have a CSV file with enough items. With the example values used above, you will need at least four items.
- Initiate the import.
- Run cron.
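As referenced in the step about "feeds_process_limit" above, lowering the limit could look like the following minimal sketch; the value of 1 is just an example and variable_set() is the standard Drupal 7 API:

```php
// Lower the Feeds process limit so each cycle imports only one item.
// Run once, for example via "drush php-eval" or from an update hook.
variable_set('feeds_process_limit', 1);
```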
Proposed resolution
When the fetcher result is saved in the feeds_source table, unset $fetcher_result->raw before serializing the object. This can be done by implementing the magic __sleep() method on it.
Possible caveat for this solution: private members on the fetcher result may not be serialized. There are no private members on FeedsFetcherResult, but classes that extend it could have them, so this could potentially introduce a BC break.
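A minimal sketch of the __sleep() approach, assuming the raw content lives in the $raw property as described above (the other property name is illustrative); it also shows where the caveat about private members comes from:

```php
class FeedsFetcherResult {
  protected $raw;
  protected $file_path;

  /**
   * Implements the magic __sleep() method.
   *
   * Serializes every property except the potentially huge raw source; the
   * content can be re-read from the file in public://feeds when the import
   * continues.
   */
  public function __sleep() {
    // get_object_vars() called here cannot see private properties of child
    // classes, which is the possible BC break mentioned above.
    return array_diff(array_keys(get_object_vars($this)), array('raw'));
  }
}
```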
I also have plans to refactor FeedsFetcherResult a bit: the file to import will be saved in $feedsdir/in_progress; the file name should contain the importer ID and feed NID (the name may be hashed).
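A hypothetical sketch of how such an in-progress path could be built; the directory, the helper name, and the hashing are assumptions about the planned refactoring, not the actual Feeds code:

```php
/**
 * Returns a hypothetical path for the in-progress fetcher file.
 *
 * The directory and naming scheme ([importer_id]-[feed_nid], hashed) follow
 * the plan described above and may differ from what Feeds ends up doing.
 */
function mymodule_in_progress_path($importer_id, $feed_nid) {
  $directory = 'public://feeds/in_progress';
  // Create the directory if needed and make it writable.
  file_prepare_directory($directory, FILE_CREATE_DIRECTORY | FILE_MODIFY_PERMISSIONS);
  // Hash the importer ID and feed NID so the file name stays short and predictable.
  return $directory . '/' . hash('sha256', $importer_id . '-' . $feed_nid);
}
```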
Remaining tasks
- Write a test that ensures that the content is not saved in the feeds_source table when the import does not complete in one run (see the sketch after this list). (Start: #2032333-20: [Meta] Store remote files in the file system instead of in the database.)
- Write a test that ensures that the import continues where it left off (check this by altering the cache).
- Review!
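As a rough illustration of the first task above, an assertion in DrupalWebTestCase style could look roughly like this; the fetcher_result column name and the marker string are assumptions, not necessarily how the final test will be written:

```php
// After running an import that does not complete in one cron run, check the
// serialized fetcher result stored in the feeds_source table (assumed to be
// the fetcher_result column) and make sure the raw content is not in it.
$serialized = db_query('SELECT fetcher_result FROM {feeds_source} WHERE id = :id AND feed_nid = :feed_nid', array(
  ':id' => 'my_importer',
  ':feed_nid' => 0,
))->fetchField();
$this->assertTrue(strpos($serialized, 'Some raw CSV content') === FALSE, 'The raw source is not saved in the feeds_source table.');
```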
User interface changes
None.
API changes
- The class FeedsFetcherResult may get a __sleep() method, which could potentially break classes that extend FeedsFetcherResult and have private members.
- The class FeedsFetcherResult may be refactored a bit, with a small chance of a BC break.
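To make the first point concrete, a hypothetical contrib subclass with a private member could look like this; the class is invented purely for illustration:

```php
class MyContribFetcherResult extends FeedsFetcherResult {
  // This private property is invisible to get_object_vars() calls made from
  // the parent class, so a parent __sleep() built on the property names it
  // can see would drop it during serialization.
  private $authToken;
}
```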
Data model changes
- The raw source is no longer saved in the feeds_source table.
- The file saved on the file system will probably be renamed to $feedsdir/in_progress/[importer_id]-[feed_nid] (file name may be hashed).
| Comment | File | Size | Author |
|---|---|---|---|
| #12 | interdiff-2829097-11-12.txt | 8.44 KB | megachriz |
| #12 | feeds-fetcher-raw-2829097-12.patch | 28.29 KB | megachriz |
| #12 | feeds-fetcher-raw-2829097-12-new-tests.patch | 26.99 KB | megachriz |
| #11 | interdiff-2829097-10-11.txt | 425 bytes | megachriz |
| #11 | feeds-fetcher-raw-2829097-11.patch | 24.97 KB | megachriz |
Comments
Comment #2
megachriz
Now that #2829096: Cache result of HTTP source on file system is done, this issue can be worked on.
Starting with the tests taken from #2032333-20: [Meta] Store remote files in the file system instead of in the database. Tests should fail.
Comment #4
megachriz
Oops, feeds_tests_cron_queue_info_alter() already existed.

Comment #6
megachriz
This is a start for a fix, which changes quite a lot of code. I took ideas from the patch in #2032333-20: [Meta] Store remote files in the file system instead of in the database and also checked the code of the D8 version of Feeds. I haven't tested yet whether it passes the tests that I added in #4; I just want to see if I'm on the right track.
Anyway, this calls for an automated test that specifically tests FeedsFetcherResult.
Comment #8
megachriz
FeedsFetcherResult::sanitizeRawOptimized() cannot be called by the file fetcher, as only variables may be passed by reference. Also, the file fetcher passed an empty string to its parent, causing the $raw property to be set and overwriting stuff. Let's replace isset($this->raw) with !empty($this->raw).

There may be an endless loop hidden somewhere, but it could also have been caused by the file fetcher...
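For illustration, the pass-by-reference limitation mentioned above means the raw content has to go through a variable first; the calling code below is hypothetical, only sanitizeRawOptimized() comes from the comment:

```php
// sanitizeRawOptimized() takes its argument by reference, so the content
// must be assigned to a variable before it is passed in.
$raw = file_get_contents($file_path);
$result->sanitizeRawOptimized($raw);
```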
Comment #10
megachriz
Now with a test that specifically tests the FeedsFetcherResult class.
And with that, I spotted the following mistake in the previous patch (this is from the interdiff):
Comment #11
megachriz
I looked at the code of a few contrib fetchers that are listed on https://www.drupal.org/node/856644 and saw that most fetchers call the parent constructor with an empty string, which the file fetcher did before (but which I changed to passing NULL instead in the patches from this issue). So it is better to keep it an empty string, to better match what contrib does.

Other than that, I saw nothing in these contrib fetchers that worried me in terms of backwards compatibility. So I'm confident that the attached patch is ready, provided it still passes the tests.
I'll wait for a week and maybe a bit longer for feedback before committing it.
Comment #12
megachriz
During testing I found three issues with this patch:

FeedsFetcher::getFilePath() and FeedsFetcher::__sleep() can call FeedsFetcher::saveRawToFile(). Both methods do this when using the CSV parser and multiple cron runs are needed to complete the import.

Two patches are attached: one with only the additional tests, which should fail, so I'm sure that the problems I found are properly covered; the other with the fixes, which should pass.
Comment #15
megachriz
Some tests failed because of failing branch tests (see #2857935: Fix failing test FeedsMapperMultilingualFieldsTestCase::testChangedLanguageImportForExistingNode()). Back to "Needs review".
Comment #16
megachriz

Comment #19
megachriz
I did not spot any new issues. So let's call this issue done. Committed #12!
Oof. This must have been my most challenging Feeds issue so far: 1.5 years of work! Well, for #2032333: [Meta] Store remote files in the file system instead of in the database to be completed, there are still some (small) optimizations to be implemented. But the big chunk of it is done now. :)