Problem/Motivation

When an import needs multiple cron runs to complete and the HTTP Fetcher is used, Feeds uses the temporary directory to temporarily store the file that is being imported.

On some servers, the default configured temporary directory is not static: it can change per request. This can, for example, be the case on Acquia and Pantheon. This is problematic because if the temporary directory changes between two import cycles, Feeds loses track of the file being imported and therefore cannot finish the import.

Proposed resolution

Do not use the temporary directory to temporarily store a file to import; use private://feeds/in_progress instead. The D7 version of Feeds already does this.

The first step, however, is adding test coverage for importing data using multiple calls/requests. Right now, there is only test coverage for importing data in one go. There are already tests written for this case for the D7 version of Feeds, so it is probably a good idea to port these over. These tests can be found in the class 'FeedsFileHTTPTestCase'. The following methods from that class would be the most interesting to port over:

  1. ::setUpMultipleCronRuns()
  2. ::getInProgressFile()
  3. ::testImportSourceWithMultipleCronRuns() - ported in #3044983: Add test coverage for importing using multiple cron runs
  4. ::testAbortImportWhenTemporaryFileIsDeleted()

Additionally, no warning seems to be logged when a file that is in the process of being imported suddenly goes missing.

Workaround

A workaround is to set a different path for the temporary directory, one that is static. In settings.php:
$config['system.file']['path']['temporary'] = '/your/temporary/directory';
Be sure to change '/your/temporary/directory' to the directory of your choice.
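
On Drupal 8.8 and later the temporary directory is a setting instead of configuration (see the comments about $settings['file_temp_path'] further down), so there the equivalent would be, for example:

// Drupal 8.8+: the temporary directory is a setting, not config.
// '/your/temporary/directory' is a placeholder; point it at a static path that
// is shared between all web containers and not accessible over the web.
$settings['file_temp_path'] = '/your/temporary/directory';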

Remaining tasks

  1. Write test coverage for importing data that needs multiple cron runs to complete.
  2. Write test coverage for aborting an import because of a missing temporary file.
  3. Log a warning when the file that is in the process of being imported no longer exists.
  4. Move the temporary file to private://feeds/in_progress.
  5. Write test coverage for removing the temporary file after import (this task could also be handled in #2778373: Auto-purge old temporary feed downloads).
  6. Add a setting for the in progress directory called "in_progress_dir". Make it part of a config object called "feeds.settings" (see the sketch after this list).
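
For illustration only, code reading that setting could fall back to the proposed default roughly like this (a sketch, not the final implementation):

// Sketch: read the configurable "in progress" directory and fall back to the
// proposed default when the setting is empty.
$directory = \Drupal::config('feeds.settings')->get('in_progress_dir')
  ?: 'private://feeds/in_progress';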

Original report by bburg

I'm running several sites using the Pantheon platform on an elite plan. These sites use feeds to import content from a central site. They also run in a load-balanced set up, so multiple instances of Drupal running on separate servers with a shared database.

The problem is that on cron, the Feeds module will download the feed source (an XML document from the central site) and store it in the temporary directory. It appears that the parser isn't triggered immediately, but is instead added to the queue. The queue entry references the temporary file by its full path. On Pantheon, the temporary directory may have an inconsistent path, e.g. /srv/bindings/580b80cb8cb344dfa6f2c1059d896835/tmp. The hashed number is, I believe, something called the "Pantheon Binding", which I assume is a reference to the current container, but I'm not certain. So, on subsequent cron runs, the queue processor may end up trying to access a temporary file that no longer exists because the Pantheon Binding has changed since the queue job was saved to the queue table. This results in an accumulation of queue items that point to non-existent temporary files and will never be removed, since the queue job doesn't remove them if the temporary file is missing. This leads to a lot of errors like this one, which also seem to cause performance issues during cron:

RuntimeException: File /srv/bindings/c1e9936fd5b347d7ae98b1226d7724f0/tmp/feeds_http_fetcherELdIb4 does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /srv/bindings/580b80cb8cb344dfa6f2c1059d896835/code/web/modules/contrib/feeds/src/Result/FetcherResult.php).

I'm using a snapshot of the dev branch from August 2016. So if this problem has already been addressed in a previous issue, my apologies. Upgrading to a newer snapshot seems to lead to several errors...

Pantheon support has suggested that we change the temporary files directory to point to a path on the shared, networked file system, under sites/default/files/tmp, which they have documented under "Temporary File Management with Multiple Application Containers: Using Valhalla, Pantheon's Networked Filesystem". I'll link their support team to this ticket as well so they can comment and clarify the issue. Our next scheduled deployment window isn't until early October, so I won't be able to test that suggestion in production until then.

Also, to address the stuck queue items, I think the only option is to delete all queue items whose name matches "feeds_feed_parse%". So adding an update hook with the following:

  db_delete('queue')
    ->condition('name', 'feeds_feed_parse%', 'like')
    ->execute();

(db_delete() is deprecated, but I don't know the new method yet). This might be something to address in this module. Also, for the issue itself: might it be better to store the file reference in the queue item as a stream wrapper URI instead? e.g. temporary://feeds_http_fetcherELdIb4.
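
For reference, a roughly equivalent cleanup using the Drupal 8 database API would look something like this (a sketch to adapt to your own update hook):

// Sketch: the same cleanup with the D8 database API instead of db_delete().
\Drupal::database()->delete('queue')
  ->condition('name', 'feeds_feed_parse%', 'LIKE')
  ->execute();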

I'm not sure if supporting such environments is really in the scope of this project (which is why I've filed this as a support request). I'm really filing this issue here to improve the searchability of the problem in general, in case other people are running into it.

Comment | File | Size | Author
#136 | feeds_s3fs.patch | 1.25 KB | rdivince
#114 | feeds--temporary-files-handling--2912130-114.patch | 9.69 KB | someshver
#110 | feeds--temporary-files-handling--2912130-110.patch | 8.37 KB | jmcerda
#107 | feeds--temporary-files-handling--2912130-107-D8.patch | 8.37 KB | meladawy
#82 | interdiff-2912130-81-82.txt | 1.51 KB | MegaChriz
#82 | feeds-in-progress-test-2912130-82.patch | 4.31 KB | MegaChriz
#81 | interdiff-2912130-80-81.txt | 1.18 KB | MegaChriz
#81 | feeds-in-progress-test-2912130-81.patch | 3.97 KB | MegaChriz
#80 | feeds-2912130-80-test-cases.patch | 3.57 KB | mstrelan
#79 | feeds-2912130-79-test-cases.patch | 3.57 KB | mstrelan
#73 | interdiff_62-73.txt | 1021 bytes | mstrelan
#73 | feeds--temporary-files-handling--2912130-73-D8.patch | 8.45 KB | mstrelan
#62 | feeds--temporary-files-handling--2912130-62-D8.patch | 8.42 KB | bburg
#51 | feeds--temporary-files-handling--2912130-51-D8.patch | 8.44 KB | n4r3n
#49 | feeds--temporary-files-handling--2912130-49-D8.patch | 5.95 KB | n4r3n
#48 | feeds--temporary-files-handling--2912130-48-D8.patch | 4.4 KB | n4r3n
#39 | feeds--temporary-files-handling--2912130-39-D8.patch | 11.24 KB | merauluka
#37 | feeds--temporary-files-handling--2912130-37-D8.patch | 11.22 KB | merauluka
#28 | feeds-temporary-files-path-2912130-23-D8.patch | 9.82 KB | Fant0m
#25 | feeds-temporary-files-path-2912130-22-D8.patch | 9.83 KB | Fant0m
#23 | feeds-temporary-files-path-2912130-21-D8.patch | 9.42 KB | Fant0m
#22 | feeds-temporary-files-path-2912130-20-D8.patch | 6.48 KB | Fant0m
#21 | feeds-temporary-files-path-2912130-19-D8.patch | 10.07 KB | Fant0m
#20 | feeds-temporary-files-path-2912130-19-D8.patch | 10 KB | Fant0m
#19 | feeds-temporary-files-path-2912130-18-D8.patch | 9.21 KB | Fant0m
#18 | feeds-temporary-files-path-2912130-17-D8.patch | 8.16 KB | Fant0m
#17 | feeds-temporary-files-path-2912130-D8.patch | 7.01 KB | Fant0m
#16 | feeds_temporary-files-path_2912130-15-D8.patch | 7.46 KB | Fant0m
#14 | feeds_temporary-files-path_2912130-12-D8.patch | 7.88 KB | Fant0m
#12 | feeds_temporary-files-path_2912130-12-D8.patch | 4.84 KB | artematem
#11 | feeds_temporary-files-path_2912130-11-D8.patch | 4.83 KB | artematem
#10 | feeds_temporary-files-path_2912130-10-D8.patch | 4.82 KB | artematem
#9 | feeds_temporary-files-path_2912130-9-D8.patch | 4.35 KB | artematem
#7 | 2912130.patch | 4.14 KB | Fant0m

Issue fork feeds-2912130

Comments

bburg created an issue. See original summary.

bburg’s picture

Issue summary: View changes

MegaChriz’s picture

@bburg
This has not yet been addressed. In the latest dev, temporary files are still saved to temporary://. I think this is wrong and it would be better if Feeds saved them in an 'in progress' directory. In the D7 version of Feeds a similar issue exists: in that version, Feeds saves temporary files in 'public://feeds'. The file is then removed after six hours, regardless of whether the import has been completed or not.

In the D7 version I have been experimenting with saving a temporary file in 'private://feeds/in_progress' instead. See #1029102: Importing Large CSV document (downloaded and processed in the background). The file gets removed after import has been completed. Maybe it is possible to port that idea to D8?

A similar issue for D8 Feeds (maybe even the same problem): #2778373: Auto-purge old temporary feed downloads

bburg’s picture

Category: Support request » Task

I've been thinking more about this issue lately. In October we deployed the update to the temporary files directory location. Some other issues have been alleviated, but we are still seeing the Feeds errors. I'm not sure saving to temporary:// is an issue per se; it seems that even with the new location, these are still added as queue items that reference the absolute path of the temporary file, which can vary between application containers because the path contains the container ID.

I suspect the only reason this works at all in this setup is that the fetch and the queue processing occasionally happen to run in the same container. But purging the broken queue items would likely be helpful.

If changing the directory from temporary:// is something you were already considering, I'll change the category of this issue to be a task. I'd offer to assist, but I'm still pretty unfamiliar with this code and D8 module architecture.

MegaChriz’s picture

Yes, moving from temporary:// to private://feeds/in_progress is what I'm considering, with the reason being that files may not exist long enough in temporary:// to complete an import. On my local machine for example, temporary:// is set to '/tmp', which is cleared after every machine restart.

For this issue to move forward, we will likely need automated tests. Feeds only has unit tests right now, but I think this issue could use a functional test (or maybe a kernel test - which I'm not familiar with yet).

If you'd like to help on this issue or with development in general, maybe you want to help with writing a base class for tests? See #2917925: Port remaining methods from test base class in D7. That's probably easier to do and helps build the foundation for this issue. Or other things mentioned in #2917928: Plan for 8.x-3.0-alpha1 release? Anyway, it's unlikely I will put time into this issue before 8.x-3.0-alpha1 is done.

MegaChriz’s picture

Issue summary: View changes

Closed #3010908: Articles are not being imported to production Acquia site, possibly because /tmp folder is not on the same container between requests as a duplicate. Some relevant info from there:

From @alberto56:

I think this may be due to the fact that on certain servers, Acquia among them, the production server is in fact a cluster of containers which share /var/www/html/sites/default/files, but do not share /tmp. A quick look at your code suggests you are using the /tmp folder in a batch process between calls, to store the raw feed.

The problem might be that in one call we are storing the raw feed in the /tmp folder of one container in the cluster, and on the second call, which might or might not hit the same container, the /tmp folder might no longer contain that raw data.

A possibly related issue in the Webform module was fixed in #2980276: Webform assumes the /tmp directory is always the same, but if there are multiple servers, each may have its own /tmp directory

From @MegaChriz:

My idea for fixing this is to not use the temporary directory for temporarily storing the file to import, but to use private://feeds/in_progress instead.

The first step, however, is adding test coverage for importing data using multiple calls/requests. Right now, there is only test coverage for importing data in one go. There are already tests written for this case for the D7 version of Feeds, so it is probably a good idea to port these over. These tests can be found in the class 'FeedsFileHTTPTestCase'. The following methods from that class would be the most interesting to port over:

  1. ::setUpMultipleCronRuns()
  2. ::getInProgressFile()
  3. ::testImportSourceWithMultipleCronRuns()
  4. ::testAbortImportWhenTemporaryFileIsDeleted()

A workaround for your issue would be to set a different path for the temporary directory as also suggested in #2980276-18: Webform assumes the /tmp directory is always the same, but if there are multiple servers, each may have its own /tmp directory:
$config['system.file']['path']['temporary'] = '/your/temporary/directory';

The watchdog has no entries. Is there a debug mode? How would you suggest that I try to figure out what is wrong?

Unfortunately, there is no debug mode in Feeds D8. PHP errors during an import would be logged, but I think there are no log entries for when a file to import has gone missing. It would be a good idea to find out if we can make sure that a warning is logged in this case.

Fant0m’s picture

Added a quick patch that adds the possibility of using the public directory for temporary feed files. It adds a settings form at the admin config path '/admin/config/feeds_settings' where you can check "Use public directory for temporary feed files."

MegaChriz’s picture

@Fant0m
Thanks for giving this a go!

  1. +++ b/src/Feeds/Fetcher/HttpFetcher.php
    @@ -81,7 +81,15 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
    +      // Remove all files by mask before imported files as not needed any more.
    +      array_map('unlink', glob($this->fileSystem->realpath($destination) . "/feeds_http_fetcher*"));
    

    Note that this may disrupt other unfinished imports, since you are deleting all files matching the feeds_http_fetcher* mask here.

  2. Using a checkbox for the "feeds_temporary_dir" is fine as a temporary solution, but ideally it would be a textfield so people can choose the private directory as well.
artematem’s picture

OK, here's a first version of the patch. Added a default feeds.system.yml file to see if it helps to pass the tests.

artematem’s picture

artematem’s picture

artematem’s picture

MegaChriz’s picture

+++ b/src/Feeds/Fetcher/HttpFetcher.php
@@ -81,7 +81,15 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
+    if (\Drupal::config('feeds.settings')->get('feeds_temporary_dir')) {

The reason for the test failure is that the feeds.settings config object is not injected.

Fant0m’s picture

Status: Needs review » Needs work

The last submitted patch, 14: feeds_temporary-files-path_2912130-12-D8.patch, failed testing.

Fant0m’s picture

Fant0m’s picture

Fant0m’s picture

Fant0m’s picture

Fant0m’s picture

Fant0m’s picture

Fant0m’s picture

Fant0m’s picture

MegaChriz’s picture

@Fant0m
Thanks for working on this issue! It would be great if you could document what you are working on in each patch and what your motivations are for some of the changes. That helps with reviewing the patches.

Fant0m’s picture

MegaChriz’s picture

+++ b/src/Feeds/Fetcher/HttpFetcher.php
@@ -69,19 +79,42 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
+  /**
+   * {@inheritdoc}
+   */
+  public static function create(ContainerInterface $container, array $configuration, $plugin_id, $plugin_definition) {
+    return new static(
+      $configuration,
+      $plugin_id,
+      $plugin_definition,
+      $container->get('config.factory')
+    );
+  }

Feeds currently uses annotation to specify dependent services. So adding (or overriding) a create() method is not going to work here.

We could open a new issue for Feeds to use ContainerFactoryPluginInterface instead of defining dependent services in the annotation.

Fant0m’s picture

Thank you @MegaChriz.

Fant0m’s picture

Added emulation of a container with 'config.factory' to HttpFetcherTest.

Fant0m’s picture

Status: Needs work » Needs review

maxdev’s picture

Latest patch doesn't work. If you go to the Feeds page it throws a fatal error.

ArgumentCountError: Too few arguments to function Drupal\feeds\Feeds\Fetcher\HttpFetcher::__construct(), 6 passed and exactly 7 expected in Drupal\feeds\Feeds\Fetcher\HttpFetcher->__construct() (line 85 of modules/contrib/feeds/src/Feeds/Fetcher/HttpFetcher.php).

maxdev’s picture

Status: Needs review » Needs work

artematem’s picture

@maxdev,
Did you try with https://simplytest.me/ ?
I ran Feeds 8.x-3.0-alpha4 + patch #28 and didn't see the error.

bburg’s picture

Just jumping in to make a side comment. Since I changed the temporary directory location to address the issue, we are now experiencing a new problem: we have accumulated a large number of temporary Feeds files (200K+) that aren't being cleaned out regularly. While the files are very small and not filling up much space, their sheer quantity is causing performance issues when we run backups on Pantheon, taking over 2 hours on some of the worst-affected sites.

MegaChriz’s picture

Issue summary: View changes

I've opened #3044983: Add test coverage for importing using multiple cron runs for porting ::testImportSourceWithMultipleCronRuns(). A port of this test is also needed for #2978490: Optimize Feeds queue.

MegaChriz’s picture

@Fant0m

+++ b/src/Feeds/Fetcher/HttpFetcher.php
@@ -69,19 +79,42 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
+  public function __construct(array $configuration, $plugin_id, array $plugin_definition, ClientInterface $client, CacheBackendInterface $cache, FileSystemInterface $file_system, ConfigFactoryInterface $config_factory) {
...
+  /**
+   * {@inheritdoc}
+   */
+  public static function create(ContainerInterface $container, array $configuration, $plugin_id, $plugin_definition) {
+    return new static(
+      $configuration,
+      $plugin_id,
+      $plugin_definition,
+      $container->get('config.factory')
+    );
+  }
+

The number of parameters passed to new static in ::create() does not equal the number of parameters in the constructor.

So this function may be removed, but even better would be to start changing Feeds so that it uses the standard pattern instead of defining its dependencies in the annotation.

This is likely also needed for #3001376: entity.query deprecated warnings and as such to make Feeds compatible with Drupal 9 in the future.

So I fear we need to take a step back here and fix other things properly first.

MegaChriz’s picture

merauluka’s picture

We have been seeing this issue on Acquia now that we've switched over to using the production environment. It has created serious issues for us, with feeds constantly jamming up the queue system since they cannot find their temporary files.

I have reworked the patch from #28 and allowed the user to select either the public or private file system for storage. I'm sure there's probably a more comprehensive approach to allowing those stream wrappers, but I wanted to get support for the core options first.

Because my version completely moves away from the use of temporary files, I have updated the logic for file name generation as well. I thought it would be helpful to identify the files using the bundle so, at a glance, they can be associated with their appropriate feed. I also added a new method to the FetcherResult that allows for the removal of a file once it's been processed, and a call to remove that file once the feed has finished processing.

oknate’s picture

The schema for feeds.settings needs to be updated for the changes in #37:

diff --git a/config/install/feeds.settings.yml b/config/install/feeds.settings.yml
new file mode 100644
index 00000000..7fa05cbe
--- /dev/null
+++ b/config/install/feeds.settings.yml
@@ -0,0 +1 @@
+feeds_storage_method: public
diff --git a/config/schema/feeds.schema.yml b/config/schema/feeds.schema.yml
index 4639abdf..65a6374b 100644
--- a/config/schema/feeds.schema.yml
+++ b/config/schema/feeds.schema.yml
@@ -163,3 +163,11 @@ feeds.processor.entity:
 
 feeds.processor.entity:*:
   type: feeds.processor.entity
+
+feeds.settings:
+  type: config_object
+  label: 'Feeds settings'
+  mapping:
+    feeds_temporary_dir:
+      type: boolean
+      label: 'Use public directory for temporary feed files.'
merauluka’s picture

Updated my patch with schema changes. This is still going to break tests. But it will break them differently now?

MegaChriz’s picture

@merauluka
Thanks for working on this! This is definitely an issue I'd like to see fixed at some point.

I did make a rough start on this a few weeks ago: I started by copying code from the D7 version. So far I have not come to anything even somewhat workable. The issue I ran into was that tempnam() seems to ignore subdirectories: I wanted the generated file to go to private://feeds/in_progress, but the file kept going to just private://.

So I think that we need to replace tempnam() with something that supports subdirectories.
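
For illustration, one way around that limitation could be to prepare the subdirectory first and then generate a unique destination inside it, roughly like this (a sketch using core's file_system service; $data stands in for the downloaded feed content):

use Drupal\Core\File\FileSystemInterface;

// Sketch: store the fetched data in a uniquely named file inside a
// subdirectory, instead of relying on tempnam(), which ignores the
// subdirectory part of the destination.
$file_system = \Drupal::service('file_system');
$directory = 'private://feeds/in_progress';

// Make sure the directory exists and is writable.
$file_system->prepareDirectory($directory, FileSystemInterface::CREATE_DIRECTORY | FileSystemInterface::MODIFY_PERMISSIONS);

// Generate a destination that does not collide with an existing file.
$destination = $file_system->createFilename('feeds_http_fetcher_' . md5(microtime(TRUE)), $directory);

// $data is assumed to hold the downloaded feed content.
$file_system->saveData($data, $destination, FileSystemInterface::EXISTS_REPLACE);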

merauluka’s picture

@MegaChriz Yeah. When I reviewed the code, it's actually a part of the tempnam process to strip out any directories and just take the file name, append a random string, and save it in temporary storage.

What I did was update the first patch to use either public or private storage (user's choice, defaulting to public), using the same feed file name prefix but with the machine name of the feed type inserted into the name as well, then appending the random string. This allows a user to see where the stored files came from.

I also had the logic create the files under a subdirectory that is controlled by the module. This was mostly for convenience since allowing the user to create the directory leads to the need for additional discussion/logic around what we would do with files that were in the old directory.

But it's a good start. And I can report that with my patch in place, the issues we were seeing in our load balanced environment, with temporary files being unavailable, are no longer present now that the files are in the private/public directories instead.

merauluka’s picture

As a follow-up comment, this change recently saved a client when some of their feed content was erroring due to changes made by a feed provider. We were able to address the changes, and the cached copies of the feeds in the private directory in Drupal allowed us to retrieve almost 2 months' worth of missing data.

If the files were still stored in the temporary directory, they would more than likely have been removed before we had a resolution to the issue in place.

bburg’s picture

Patch #3 works for me, but one observation: without a default value getting set in an existing installation, the directory just defaulted to one named with just a colon ":" directly in the Drupal root.

jorgediazhav’s picture

@bburg, you mean it only works when you do not set a value? By just applying the patch and clearing the cache?

merauluka’s picture

I have run into an issue with my patch. Before my patch, the Feeds module relied on the system periodically emptying the temporary folder to handle leftover files that were created during an HTTP fetch operation.

I added the option to move these files to public/private storage instead and added a clean up action to remove the file once the queued articles have been processed.

There are two issues:

  1. If there is an error, the files are still created and never removed. #3080098: Delete Empty Files Created If Empty/Error Exception is Returned
  2. If the user uses the UI instead of drush/cron to import feeds, the file is never deleted because the UI uses the batch system. Because these batches can be interrupted at any time, even if a mechanism were in place to remove the file once the import was complete, the file might never be cleared.

I am withdrawing my recommendation to use my patches until a unified file clean up option is available.

merauluka’s picture

bburg’s picture

I noticed that too. The default /tmp directory is usually cleared out by the OS, so using a custom one will require a separate process.

We implemented a cron hook to kick off a process to periodically clear out the files:

function my_module_cron() {
  // Confirm the presence of the temporary directory and create it if needed;
  // it may not exist in newly instantiated dev environments.
  my_module_verify_temp_directory();

  $tmp = \Drupal::config('system.file')->get('path.temporary');

  // Clear out old temporary files.
  my_module_tmp_file_delete(500, $tmp);
  my_module_tmp_file_delete(500, $tmp . '/feeds-ingestion');
}

/**
 * Deletes temporary files that are older than a week.
 */
function my_module_tmp_file_delete($limit = 500, $path = FALSE) {
  if (!$path || !is_dir($path)) {
    return FALSE;
  }

  $handle = opendir($path);
  if ($handle === FALSE) {
    return FALSE;
  }

  // Collect up to $limit file paths, skipping directories.
  $fileList = [];
  $count = 0;
  while (($entry = readdir($handle)) !== FALSE && $count < $limit) {
    $full_path = $path . '/' . $entry;
    if (!is_dir($full_path)) {
      $fileList[] = $full_path;
      $count++;
    }
  }
  closedir($handle);

  if (empty($fileList)) {
    return FALSE;
  }

  // Sort the collected files by modification time, oldest first.
  array_multisort(array_map('filemtime', $fileList), SORT_NUMERIC, SORT_ASC, $fileList);

  $now = time();
  $count = 0;
  $deleted_files = [];

  // Delete files that are older than a week.
  foreach ($fileList as $file) {
    if (file_exists($file) && $now - filemtime($file) >= 60 * 60 * 24 * 7) {
      $deleted_files[] = $file;
      if (unlink($file)) {
        $count++;
      }
    }
  }

  \Drupal::logger('my_module_cron')->notice("Deleted $count files." . '<ul><li>' . implode('</li><li>', $deleted_files) . '</li></ul>');
}

n4r3n’s picture

n4r3n’s picture

Updating HttpFetcherTest case.

MegaChriz’s picture

+++ b/src/Feeds/Fetcher/HttpFetcher.php
@@ -96,7 +109,15 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
+      $destination = 'public://';

Note that by using the public:// destination, you are potentially opening a security hole. Files stored in this directory could in theory be read by anonymous users. They must guess the filename, however.

n4r3n’s picture

n4r3n’s picture

Chris Burge’s picture

Is this issue relevant anymore? Beginning in D8.8, Drupal assumes the path to the temp directory is '/tmp' unless you set $settings['file_temp_path'] in settings.php. The "workaround" described in the issue summary is required configuration for the application to function in a multi-web server environment.

MegaChriz’s picture

@Chris Burge
I think it is. In theory, files in /tmp could be cleaned up before the Feeds import is complete. Feeds divides an import task into multiple chunks. For large imports (hundreds of items), this can take multiple cron runs to complete. If the webserver cleans up temporary files in between, then the import cannot finish. The more items to import and/or the fewer times cron runs, the higher the chance that this can happen.

This issue is more or less fixed in the D7 version. I would like to see a port of that version along with tests, rather than what the current patch does. The D7 version makes it configurable where the files to import are stored, and it defaults to private://feeds/in_progress (or public://feeds/in_progress when the private file system is not configured). The current patch forces these files to be in the public file system and I think that's a potential security risk, because files could come from a protected source. OK, the generated file names may be hard to guess, but if the data to import is sensitive, you'd better be safe than sorry and ensure that these files are on the private filesystem.

mikalaki’s picture

I can confirm the problem continues to exist on Drupal 8.8. I am currently working on a site running Drupal 8.8.1 and I am facing the issue.

j9’s picture

Hi and thanks!

I am also getting errors when attempting to import feeds.

RuntimeException: File /tmp/feeds_http_fetchertuzw4F does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /home1/netup/public_html/cryptos.cool/modules/feeds/src/Result/FetcherResult.php).

I tried the patch, and I tried manually setting a temp directory in settings.php, but haven't gotten through yet.

When I run the import manually from the /admin/content/feed page, I get the feed items imported!

But cron does not seem to import on its own.

In the recent log messages, it looks like it is because cron is being run as an anonymous user, instead of as an admin as when running manually.

Any thought on what I might try next?

Drupal core 8.8.4
Feeds 8.x-3.0-alpha6

On a shared Hostgator server. Feeds was working just great, but somehow stopped.

Much appreciated and thank you again!

MegaChriz’s picture

Marking as a stable release blocker.

jjwfcd’s picture

Thanks for your effort, guys; hoping for a release soon.

bkosborne’s picture

+1 to getting this resolved. Agreed the current patch is not the way to go, since it puts the feeds files in public storage. @MegaChriz, so it seems like we just need to update this patch to use the private file system if available, or public otherwise, just like D7?

MegaChriz’s picture

@bkosborne
Yes, I propose to add temporary files to private://feeds/in_progress if the private file system is available, otherwise it would be public://feeds/in_progress. Do you want to help with resolving this issue? If yes, let me know if you understand the proposed resolution + the remaining tasks from the issue summary.

jwilson3’s picture

Just a note that #51 applied cleanly to 3.0-alpha7 but no longer applies to 3.0-alpha8, so in addition to the fixes requested in #60, this needs a reroll.

bburg’s picture

Conflict is in tests/src/Unit/Feeds/Fetcher/HttpFetcherTest.php with the addition of two "use" statements:

@@@ -17,7 -17,7 +18,11 @@@ use GuzzleHttp\Handler\MockHandler
  use GuzzleHttp\Psr7\Request;
  use GuzzleHttp\Psr7\Response;
  use Prophecy\Argument;
++<<<<<<< HEAD
 +use RuntimeException;
++=======
+ use Symfony\Component\DependencyInjection\ContainerBuilder;
++>>>>>>>

Fixed to allow both.

jwilson3’s picture

We had upgraded to alpha8 and applied #62 but we're still seeing the error on Pantheon.

File /srv/bindings/<uniqueid>/tmp/feeds_http_fetcherz6rDEi does not exist.

We're going to dive in here and see what is going on.

jwilson3’s picture

Facepalm.

The reason the patch wasn't working was a configuration error on my part. I didn't realize the patch exposed a new config option at /admin/config/feeds_settings that has to be checked/enabled for things to work.

Checkbox:
[ ] Use public directory for temporary feed files.
Site public file system path settings.

Sadly, checking the option doesn’t solve the problem. The messages on screen after a manual feed import are now slightly different:

Created 100 [Nodetype] items.
File /srv/bindings/<uniqueid>/files/feeds_http_fetcheroJyPaV does not exist.

There should have been 325 items created in total. Note that the path is now pointing to the files folder on Pantheon, which is supposedly synced across containers. Something is making the file go away after the first batch of 100 items is imported.

I don’t know if this is a Feeds problem yet or a Pantheon issue.

jwilson3’s picture

My facepalm moment above clearly demonstrates that this issue needs work: A) it should be using the private files folder not the public folder, and B) no one should have to manually specify this option.

MegaChriz’s picture

@jwilson3
Yes, I would like to see a similar solution as in the D7 version. The D7 version saves the file to import in private://feeds/in_progress (or public://feeds/in_progress if the private filesystem is not configured). See also the issue summary.

Would you like to work on this? It is on my list to pick up eventually, but it could still take a while before I would get to it myself. There are enough things on the roadmap: #1960800: [meta] Feeds 8.x roadmap. If you do want to pick it up, I hope I can assist you as much as possible. :)

jwilson3’s picture

@MegaChriz, thanks for your vote of support. I've asked my colleague @marthinal to have a look at making the recommended changes to the patch.

I think I also hit upon a related issue in #64, whereby if the line_limit is too low to process all items in a feed, the file gets deleted prematurely, but I think that would have to be a follow-up ticket. Maybe there already is one, I haven't looked. After I bumped the line limit to 1000 AND exported the config yml change for the public folder, I was able to use the manual "import" button to get all of the items ingested.

MegaChriz’s picture

@jwilson3
Thanks for your support! :)

Line Limit setting

The line limit setting dictates how many lines of a file should be parsed at a time. Feeds goes through a few import stages, and these stages are looped until they are completed:

  1. When the import starts, Feeds creates a single "fetch" task.
  2. Fetch stage: the fetcher gets called a single time and returns a fetcher result.
  3. Parse stage: the fetcher result is passed to the parser and the parser gets called a single time. The parser returns a parser result.
  4. Process stage: for each item in the parser result a "process" task is created.
  5. When all process tasks are completed, Feeds asks the parser if there's more to parse. If so, the import process continues with step 3.
  6. If there's nothing more to parse, Feeds asks the fetcher if there's more to fetch. If so, the import process continues with step 2.
  7. If there's nothing more to fetch, the clean stage starts.
  8. Clean stage: during this stage, items that are no longer available in the source can get unpublished/deleted (depending on the 'Previously imported items' setting).
  9. If the Clean stage is done or there's nothing to clean, the import finishes.

So as you see, the line_limit setting shouldn't affect how many items get imported during the whole import process. It shouldn't matter if not everything is parsed on the first call. Feeds gets back to the parser after all items from the first parser result are processed.

The reason you won't get everything imported is probably that the fetcher result - which lives in a temporary file when using the HTTP fetcher - is no longer available at the time Feeds wants to get back to the parser for the second iteration.
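
Purely as an illustration of that loop (the method names below are made up and do not match the real Feeds API):

// Illustrative pseudocode only; the method names are invented.
do {
  // Fetch stage: the HTTP fetcher downloads the source to a local file.
  $fetcher_result = $fetcher->fetch($feed);
  do {
    // Parse stage: parses up to line_limit items from the downloaded file.
    $parser_result = $parser->parse($feed, $fetcher_result);
    foreach ($parser_result as $item) {
      // Process stage: one queue task per item; may span several cron runs.
      $processor->process($feed, $item);
    }
    // Feeds only returns to the parser after all items have been processed,
    // possibly one or more cron runs later -- the downloaded file must still
    // exist at that point.
  } while ($parser->hasMoreToParse($feed, $fetcher_result));
} while ($fetcher->hasMoreToFetch($feed));
// Clean stage: unpublish/delete items that are no longer in the source.
$processor->clean($feed);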

bburg’s picture

For the Pantheon users: Pantheon is in the process of moving to their "compute optimized environment", which moves away from the container IDs in paths. While the issue still exists, I think the bulk of the users here were probably experiencing it on Pantheon.

dpagini’s picture

Just ran into the issue in #3163897 on our team at Acquia, so I found my way over here. I'm a little hesitant to pull in the patches here with all the discussion about a different approach... so the latest patch seems to just give a setting to either a) use the temp directory (existing functionality), or b) change to use the public directory? And from what I'm reading, if I use the public directory for my files, I won't run into the problem outlined in #3163897, but then my files are available to the public?

So just to try and summarize what I'm reading, it sounds like this PR needs 3 things still...
1. Test coverage for running imports.
2. Use the private directory instead of public, if it exists, and if not, fall back to public.
3. A way to clean up the public/private files after the feeds are processed...?

Does that sound right?

Quick review of latest patch...

  1. +++ b/config/install/feeds.settings.yml
    @@ -0,0 +1 @@
    +feeds_temporary_dir: false
    

    I think I saw this in some comments already, but I think we need an update hook to set this for existing sites.

  2. +++ b/src/Feeds/Fetcher/HttpFetcher.php
    @@ -96,7 +109,15 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
    +    if ($this->configFactory->get('feeds.settings')->get('feeds_temporary_dir')) {
    

    Am I missing something? If the name of the setting is feeds_temporary_dir, would this need to be FALSE to use the public filesystem?

MegaChriz’s picture

@dpagini
Yes, that sounds about right. I tried to summarize what needs to be done in the section “Remaining tasks” in the issue summary.

When fetching a file from an HTTP source it needs to be stored locally somewhere, because the file can be too large to process in one go. If you use cron to run imports, then it can happen that multiple cron runs are needed to complete the import. This is because Feeds is given a limited time per cron run to do its job, and that time is about one minute. So the file to import must be kept over multiple cron runs.

Right now, the file is stored in a temporary directory indeed and that can be problematic if that directory is getting cleaned between two cron runs and the import is not done yet. So my proposal is to store the downloaded file in the private directory instead (with a fallback to the public directory if there is no private directory).

When the import is done, the downloaded file is no longer useful, so it should be cleaned up after import. Otherwise, the private directory would grow with files that have been imported over time. Files in the temporary directory are not cleaned up by Feeds right now, hence the issue #2778373: Auto-purge old temporary feed downloads.

Tests are needed to cover exceptions during this process. For example: when a file is manually removed from the private directory, Feeds shouldn't try to continue the import over and over again. So for that situation we need test coverage.

The approach I proposed is already implemented in the D7 version of Feeds, so some code from the D7 version can perhaps be reused. Or at least the D7 implementation can be used as inspiration here.

Lastly, this issue is on my radar to be picked up soon. The plan of the Feeds team is to first resolve all beta blockers and perhaps also the beta targets. My hope is to release the beta before the end of this year. Some of the beta targeted issues need review/testing, so help on them is welcome. :) See the roadmap: https://www.drupal.org/project/feeds/issues/1960800#feeds-beta-blockers

dpagini’s picture

Thanks @MegaChriz - I'm happy to help a little bit as allowed by my company. =)
I pulled in the last patch here and started using it. I think it will work for me for the time being. I'm pulling in a public feed, so there's no concern storing it on the public directory, and I'm only using 1 feed, so the below concern is not a problem for me.
Also, just worth mentioning that the setting as defined in this patch, as mentioned above, is confusing to me b/c I have to set "temp = true" in order to use the public directory.

+++ b/src/Feeds/Fetcher/HttpFetcher.php
@@ -96,7 +109,15 @@ class HttpFetcher extends PluginBase implements ClearableInterface, FetcherInter
+      array_map('unlink', glob($this->fileSystem->realpath($destination) . "/feeds_http_fetcher*"));

Just a question here... if there is more than one feed in a site, is it possible that this line could delete a temp file from another feed?

mstrelan’s picture

Just a question here... if there is more than one feed in a site, is it possible that this line could delete a temp file from another feed?

Yes. Attached patch addresses this.

jerrac’s picture

So, I've been watching this issue for a while, ever since I submitted https://www.drupal.org/project/feeds/issues/3163897.

Is there a reason the patch doesn't just add a "Use private" checkbox the same way it adds the "Use public" checkbox?

jwilson3’s picture

Is there a reason the patch doesn't just add a "Use private" checkbox the same way it adds the "Use public" checkbox?

Checkboxes wouldn't make sense with two options, since they're mutually exclusive. It could be made into a single field with two radio options, default to "Use private" but let the user switch to "Use public". This would clarify what is really going on here.

Chris Burge’s picture

Why not make the setting a string and accept a stream-wrapper path? (e.g. 'private://feeds-tmp')? That would provide a greater degree of flexibility and wouldn't encourage people to use public:// like it's temporary://. One of the requirements for the Drupal temp directory path is that it cannot be accessible over the web: file_directory_temp() is deprecated and moved to the FileSystem service. This is a security-based requirement.

MegaChriz’s picture

Issue summary: View changes

@Chris Burge

Why not make the setting a string and accept a stream-wrapper path? (e.g. 'private://feeds-tmp')?

Good suggestion. The D7 version of Feeds already does this. In the proposed resolution in the issue summary I proposed to have it default to 'private://feeds/in_progress' (or 'public://feeds/in_progress' if no private filesystem is available). In the D7 version this setting is called "feeds_in_progress_dir". I would call it "in_progress_dir" here and make it part of a config object called "feeds.settings".

mstrelan’s picture

Just looking at the remaining tasks, it seems that "Log a warning when the file that is in the process of being imported no longer exists" could be expanded to also remove the queue entry. Currently every cron run just updates the "expire" timestamp.
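
Something along these lines could cover both the warning and the cleanup (a sketch only, placed inside the queue worker; it assumes the fetcher result throws the RuntimeException seen in the logs when the file is gone, and that $feed is the feed being processed):

// Sketch: when the downloaded file has gone missing, log a warning and finish
// the import so the stale queue task is not retried forever.
try {
  $file_path = $fetcher_result->getFilePath();
}
catch (\RuntimeException $e) {
  \Drupal::logger('feeds')->warning('Import of feed %feed was aborted: the file being imported no longer exists. @message', [
    '%feed' => $feed->label(),
    '@message' => $e->getMessage(),
  ]);
  // Finish the import so the queue item does not keep getting retried.
  $feed->finishImport();
  return;
}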

mstrelan’s picture

Attached patch adds the missing test cases as mentioned in the proposed resolution and completes the task "Write test coverage for aborting an import because of a missing temporary file."

This patch does not attempt to change the existing functionality and therefore I believe it makes sense to commit it as-is before continuing with the remaining tasks.

mstrelan’s picture

Not sure why that worked locally. This patch tries to delete the file by its URI instead of the object.

MegaChriz’s picture

@mstrelan
Thanks for working on the tests! And good idea to commit that bit first when that passes.

The first thing to change in the parser configuration is to limit how many items the CSV parser parses per batch. By default the line limit is 100, which means all items from the source file are already parsed on the first cron run if it only contains 9 items.

However, even after I set the line limit, the test still fails.

  • I added the following debug code to testAbortImportWhenTemporaryFileIsDeleted():
        // Remove file.
        $file = $this->getInProgressFile();
        print_r($file);
        \Drupal::service('file_system')->unlink($file->uri);
        $files = scandir('temporary://');
        print_r($files);
        print "\n";
        print time();
        sleep(2);
    
  • And the following at the end of Drupal\feeds\Feeds\Parser\CsvParser::parse():
    print '<pre>';
    $files = scandir('temporary://');
    print_r($files);
    print "\n";
    print "\n";
    print_r($fetcher_result->getFilePath());
    print "\n";
    print_r($result);
    print "\n";
    print time();
    

With a call to time() I could verify that one more parser task ran after deleting the file. But strangely enough, scanning the directory in CronTest.php results in:

Array
(
    [0] => .
    [1] => ..
    [2] => .htaccess
)

But the same code scanning the directory in CsvParser.php (happening later) results in:

Array
(
    [0] => .
    [1] => ..
    [2] => .htaccess
    [3] => feeds_http_fetcherxfDF7P
)

Any ideas?

MegaChriz’s picture

Coding standard fixes and additional code comments.

Apparently, the test does pass on PHP 7 :o. Let's see if that happens on other PHP versions too. I tested locally on PHP 7.3 and there it failed.

MegaChriz’s picture

Okay, tests are passing on drupal.org. I have no idea then why they are failing locally.

@mstrelan
If you think this test is good too, I'll commit it.

mstrelan’s picture

The first thing to change in the parser configuration is to limit how many items the CSV parser parses per batch. By default the line limit is 100, which means all items from the source file are already parsed on the first cron run if it only contains 9 items.

Ah, that explains it, I ran out of time to debug it that far so thanks for explaining it.

Okay, tests are passing on drupal.org. I have no idea then why they are failing locally.

Your patch in #82 passes for me in my local PHP 7.4 env. Not sure what happened in your env; I'm not overly familiar with the handling of temporary:// in tests. What is responsible for deleting the downloaded file? Maybe there was a temp file hanging around from an earlier run. I think getInProgressFile() is a bit flaky because we just assume that a file matching the pattern is the right file rather than checking what's saved in the cron queue.

mventure2021’s picture

Can we get some reasonable solution committed to the 8.x dev branch please?

mstrelan’s picture

@mventure2021 set your feeds to not import automatically and configure a cron job to run drush feeds:import <ID> periodically.

smustgrave’s picture

Has anyone used the patch in #82 on their production site yet? Is it safe?

achap’s picture

This bug caught us out just before we went live :P Changing the temporary directory caused more issues for us than it solved, so we ended up doing as @mstrelan suggested and running from drush directly while turning the other stuff off. I guess it is not subject to the same memory limits so it will always complete? Either way, it worked for us. Thanks!

smustgrave’s picture

What's the current status of this by chance? Do we know when to expect a reasonable solution like #85 mentioned?

jwilson3’s picture

Our workaround approach is similar to what @mstrelan and @achap have said above, but without Drush.

We've installed the Ultimate Cron contrib module, then use Uptime Robot to ping the site hosted on Pantheon (which doesn't provide any way to set up cron jobs through its admin panel) on a more frequent basis than Pantheon's default of once per hour. Ultimate Cron suggests running cron every minute but we do every 5 and now seem to avoid the issues.

Ultimate Cron isolates each module's hook_cron implementation into a separate thread or request, and ensures that there is no overlap when running different jobs in one thread, which could lead to timeouts and/or hitting the batch limit. Caveat emptor, YMMV. I'm not 100% certain this is a sure-fire solution for everyone, but it works for our needs for the time being. Be sure to do your own testing with real feeds data sets in real server scenarios.

loopy1492’s picture

I pegged our site at 3.x-dev and applied the #82 patch, and drush cron still comes up with:

[notice] Starting execution of update_cron(), execution of system_cron() took 14.27ms.
 [notice] Starting execution of webform_cron(), execution of update_cron() took 772.51ms.
 [notice] Starting execution of webform_scheduled_email_cron(), execution of webform_cron() took 2.03ms.
 [notice] Cron task executed. (scheduled = 0; rescheduled = 0; unscheduled = 0; already scheduled = 0; ignored = 0; sent = 0; not sent = 0; skipped = 0)
 [notice] Execution of webform_scheduled_email_cron() took 1.59ms.
 [notice] Cron run completed.
 [error]  RuntimeException: File /mnt/tmp/OUR-ACQUIA-SITE-NAME/feeds_http_fetcher06g66O does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /mnt/www/html/OUR-SITE-NAME/docroot/modules/contrib/feeds/src/Result/FetcherResult.php).

loopy1492’s picture

We have sites on Acquia Cloud and this problem has cropped up after upgrading drupal core. We've tried setting the Acquia environmental variables in settings.php as indicated in their documentation.

if (isset($_ENV['AH_SITE_ENVIRONMENT']) && isset($_ENV['AH_SITE_GROUP'])) {
  $settings['file_temp_path'] = '/mnt/tmp/' . $_ENV['AH_SITE_GROUP'] . '.' . $_ENV['AH_SITE_ENVIRONMENT'];
}

This does not seem to have worked. We also tried changing the temp files folder permissions to 770 and that also didn't work. We also tried installing Ultimate Cron and that hasn't worked either.

Last update of the day.

It seems that running the import **manually** from the feed page DOES create the file in the temp directory, but produces the SSL error:

The feed from https://SITE_URL.COM/apps/cfa/public/api.cfc?method=getTrackingForSite&roundKeyList=all seems to be broken because of error "cURL error 60: SSL certificate problem: unable to get local issuer certificate (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)".

And running **cron** produces the "does not exist" error because, apparently, cron has no idea that it's not running from the Stage environment.

[error] RuntimeException: File /mnt/tmp/SITE/feeds_http_fetcher06g66O does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /var/www/docroot/modules/contrib/feeds/src/Result/FetcherResult.php).

We have a ticket with Acquia now and if they can give us a resolution, I'll be sure to post it here.

vinmassaro’s picture

Anyone have luck with the workarounds getting this to work on Pantheon?

I've got $settings["file_temp_path"] = sys_get_temp_dir(); set and $config['system.file']['path']['temporary'] = 'sites/default/files/tmp'; with the tmp directory created. When running cron, it always fails:

[error] RuntimeException: File /srv/bindings/1e1d9be9ee8c41ab8e72730e7fc637f8/files/feeds_http_fetcherIpFByl does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /code/modules/contrib/feeds/src/Result/FetcherResult.php).

bburg’s picture

@vinmassaro

I was under the impression that Pantheon had moved all sites to their "Compute Optimized Environments," and the paths with their server bindings were no longer necessary.

https://pantheon.io/docs/platform-considerations#compute-optimized-envir...

Also, I think that part of the problem in this issue is that the queue item is saved with the full path to the temporary directory, which includes the non-existent binding ID. I think you can remove those queue items (either by direct SQL or with a module like https://www.drupal.org/project/queue_ui -- I am only guessing you can delete queue items with this), and then attempt to re-import once you have the new settings in place, to finish your feed import process.

Oligerd’s picture

@vinmassaro
I had the same issue on a Drupal 8 site on Pantheon.

if (isset($_ENV['PANTHEON_ENVIRONMENT'])) {

  $settings["file_temp_path"]= 'sites/default/files/tmp';

}

fixed this issue. No more errors after cron runs.

BeauTownsend’s picture

#73 worked for me on a Jenkins/GCP/k8s setup.

For anyone deploying this patch, know that, upon login, I experienced a long load time and, when I executed `drush ws --tail`, I discovered that my site was rolling through what appears to be all of the busted queues and "expiring" each missing temp file. In my case, there were thousands of these, so the browser just sat and spun for a good while before finally logging me in.

vinmassaro’s picture

@Oligerd: for your code in #95, are you using it with any of the patches in this thread?

Oligerd’s picture

@vinmassaro
I am not using any patch. I found that you need to delete the feed from content (at /admin/content/feed) without deleting imported items, and then create the same feed again. In this case, the old path will be deleted. You need to manually create this tmp directory inside the files directory. I am using FileZilla to create tmp on every Pantheon environment.

maskedjellybean’s picture

I think the suggestion to move the tmp directory into sites/default/files as a workaround for Pantheon is incorrect. The field description for Temporary directory at /admin/config/media/file-system says "This directory should not be accessible over the web", which to me implies it should be outside of /web (or /docroot). I assume there are security reasons for this. It may not be a security problem for Feeds, but Feeds is not the only thing using the temp directory.

That said, I really need a solution that works with Pantheon (but doesn't also open up potential security holes).

maskedjellybean’s picture

Ok! I realized that the patch in #73 is not supposed to work immediately after patching. It adds a config item that needs to be set. So after patching, go to /admin/config/feeds_settings and check "Use public directory for temporary feed files". I'll give this a try on Pantheon and report back.

Another thought based on the explanation of the cause of this problem: if this happens when an import takes more than one cron run to complete, perhaps within our feed type settings under Processor settings > Advanced settings, we should at least temporarily make sure that "Force update" is unchecked. I assume having it checked would make the import more likely to take more than one cron run to complete, but this is a complete guess.

MegaChriz’s picture

@maskedjellybean
"Force update" can make the import process longer yes, but even without that option enabled, an import with hundreds of items can easily take more than one cron run to complete. For every item in the feed, Feeds still has to check if there's an existing item for it and then if it also has changed. Unchecking "Force update" only potentially skips a lot of entity save calls.

In the issue summary I proposed to move the file to import to private://feeds/in_progress by default. Only use the public filesystem if there's no private filesystem configured.

maskedjellybean’s picture

@MegaChriz
Thank you for clarifying what force update does.

I'm going to use whatever patch works. I don't believe there is a patch to move the files to private, but there is one to move to public, so that is what I'll use unless I find it doesn't work. For what it's worth, I agree with you that private would be preferable to public, but I wouldn't feel comfortable writing that patch. Unless you're saying the private://feeds/in_progress functionality already exists in some version of feeds or in an existing patch?

andileco’s picture

Tried to post this yesterday, didn't realize my post didn't go through:

@maskedjellybean, just so you know, Pantheon has added "protected_web_paths" in pantheon.yml. I think you could do something like this:

protected_web_paths:
  - /sites/default/files/private/
  - /sites/default/files/config/
  - /sites/default/files/tmp/
maskedjellybean’s picture

Thanks @andileco, I'll have to look into that.

I've found that the patch in #73 is working well for me on Pantheon now that I realized I have to check the checkbox at /admin/config/feeds_settings. Thanks for the patch!

maskedjellybean’s picture

After inspecting the queue table it seems I may either still be having this issue with the patch applied, or I'm seeing queue items that have been stuck for months. I'll keep an eye on it.

I wanted to share some more info:

In the queue table I found items with a name like "feeds_feed_refresh:FEED_NAME". When I ran drush cron I saw a "RuntimeException: File /tmp/feeds_http_fetcher_1_kqRgck does not exist." error for each item. To me this implied these queue items were stuck, would never be processed, and I should delete them manually.

To delete items from the queue you can run:

drush queue:delete feeds_feed_refresh:FEED_NAME

To delete items from the queue on a Pantheon environment, you can run:

terminus remote:drush SITE-NAME.ENV queue:delete feeds_feed_refresh:FEED_NAME

tlo405’s picture

@maskedjellybean I've been seeing the same issues that you are seeing (with the patch applied). If I remove those stuck items from the queue, then at the next cron run everything appears to work again.

If the queue is never cleared out, updates to the feed I am pulling in are never imported into Drupal. Cron runs every 10 minutes in my site, however we import the feed every 15 minutes. The problem is, each cron run Drupal tries to import those items from the queue (if any exist). It fails due to that "RuntimeException: File /tmp/feeds_http_fetcher_1_kqRgck does not exist." error.

When that exception is thrown, I notice $feed->finishImport(); gets called. That method ends up pushing the 'next import' into the future. Then 10 minutes later when cron runs again, the same thing happens. Feeds attempts to run those queued up items...it fails again..then pushes the 'next import' date into the future. So updates to the external feed we pull in never make it into Drupal.

So for instance, say it's 12 noon, cron is set to run at 12:10, but my 'next import' for my feed is set to run at 12:15. If there are no items in the queue, cron runs at 12:10, and the 'next import' stays at 12:15 (which is correct). Then when cron runs at 12:20, the feed runs correctly. The 'next import' gets set to 12:35 (15 minutes into the future) which is correct.

However, if there is an item in the queue, the cron run at 12:10 attempts to run the queued items and fails. Due to the code in $feed->finishImport();, the 'next import' gets pushed from 12:15 to 12:25. However, when cron runs again at 12:20, the queued up items fail once again, then the 'next import' gets pushed to 12:35...and so on. So the 'next import' never runs because it keeps getting pushed into the future every time...therefore Drupal never imports the new data from the feed.
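
To make that feedback loop concrete, here is a minimal sketch with hypothetical variable names, assuming finishImport() always pushes the schedule one full import period past the current cron run:

// Hypothetical illustration of the rescheduling loop described above; not the Feeds API.
$import_period = 15 * 60;                 // feed imports every 15 minutes
$next_import   = strtotime('12:15');

foreach ([strtotime('12:10'), strtotime('12:20'), strtotime('12:30')] as $cron_run) {
  $has_stale_queue_item = TRUE;           // the stuck /tmp/feeds_http_fetcher_* task
  if ($has_stale_queue_item) {
    // The task throws, finishImport() still runs, and the schedule moves one
    // full period past this cron run, so the feed is never "due" at cron time.
    $next_import = $cron_run + $import_period;
  }
}
// $next_import ends up at 12:25, 12:35, 12:45, ... always ahead of the next cron run.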

So it seems that the patches here do not work. They also now fail to apply against the new beta1 release that came out yesterday.

meladawy’s picture

tsurvuli’s picture

Thank you very much for this patch, it's been very helpful so far. However, I upgraded to core 9.3.14 this morning and now the patch will not apply. Any chance of an updated version?

xeiro’s picture

Thank you for the patch @meladawy! I can confirm the patch was working with 9.4.1; however, it fails to apply on 9.4.3.

xeiro’s picture

Thanks for posting 2912130-82, @jmcerda. The following cron error still appears on this version:
RuntimeException: File /tmp/feeds_http_fetcherrsGHI4 does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /code/web/modules/contrib/feeds/src/Result/FetcherResult.php).
- Drupal 9.4.3 / PHP 8.0

Topplestack’s picture

#110 fails for me as well.

MegaChriz’s picture

Assigned: Unassigned » MegaChriz

I'm working on this issue. I've made a draft, but it is currently tangled with possible fixes for other, related issues. The most important of these is #3193610: Items going missing, which only needs some more testing. The other issues I'm working on are #3132198: Clean queue tasks when unlocking a feed to prevent potential data loss and #2778373: Auto-purge old temporary feed downloads.

someshver’s picture

Fixed the patch apply failure. Feeds module version 3.0.0.0-beta2.

someshver’s picture

MegaChriz’s picture

I implemented a possible solution for this issue! The code in the branch 2912130-in-progress-directory does the following:

  1. It puts HTTP downloaded files in a configurable directory that defaults to "private://feeds/in_progress/[Feed ID]" or "public://feeds/in_progress/[Feed ID]" in case no private file system is configured.
  2. It adds a service called "feeds.file_system.in_progress" with which you can interact with the directory for Feeds temporary files.
  3. The config key for the directory is feeds.settings:in_progress_dir. I have not added a UI for configuring it, as I think there is usually no need to change the default. Those who do want to change it can do so via the YAML file for now (see the settings.php sketch after this list for one way to override it).
  4. It includes a possible fix for #2778373: Auto-purge old temporary feed downloads as well, since that issue is so closely related that I worked on it at the same time. I could try to remove the fix for that issue from here upon request; the benefit would be less code to review here, which could help get this finalized quicker.
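
For those who want to override the default without editing exported config, here is a minimal settings.php sketch, assuming only the feeds.settings:in_progress_dir key named in point 3 (the directory value is just an example):

// settings.php: override the Feeds in-progress directory.
$config['feeds.settings']['in_progress_dir'] = 'private://feeds/in_progress_custom';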

I haven't tested this in the "wild" yet. The only testing I did was with the automated tests that I added.
And let's see if it breaks any other tests.

andileco’s picture

@MegaChriz I tested "2912130-in-progress-directory" and the feed import completed in one run (which I didn't expect, but there were no new items, so I guess that makes sense); however, I saw this notice in the dblog: The file private://feeds/in_progress/11 was not deleted because it does not exist.

MegaChriz’s picture

I've fixed the issue reported in #119 by @andileco. The issue was that Feeds tried to clean up files for feed imports that do not use the HTTP Fetcher. This is now fixed by checking, upon finishing (or unlocking) an import, whether there are files to clean up.

I also did some small refactoring in the Feeds Log module: LogFileManager now extends FeedsFileSystemBase, because there was a lot of overlapping behavior.

smustgrave’s picture

Status: Needs review » Needs work

Tested the MR on the environment where we are seeing the issue, and we are still getting the error hundreds of times on cron.

RuntimeException: File /tmp/feeds_http_fetchergiFocL does not exist. in Drupal\feeds\Result\FetcherResult->checkFile() (line 53 of /home/site/wwwroot/modules/contrib/feeds/src/Result/FetcherResult.php).

MegaChriz’s picture

@smustgrave
Okay, but does it happen with completely new imports too? Because that path is supposed to be the old location where things are saved. For new imports, the temporary file is expected to be stored in private://feeds/in_progress/[feed id].
But thanks for helping with testing!

smustgrave’s picture

Sorry, I don't fully follow. We have a number of feeds that we can't recreate, as I believe that would cause duplicate issues.

Didn't test with a brand new import though. For that test, I think I would have to create one across environments and sync the databases to get a valid result.

MegaChriz’s picture

@smustgrave
For clarity, with "new import" I don't mean a new feed. Just make sure that for a particular feed no existing import is still running. If you only have feeds that are in an unfinished import state, try to unlock these. Maybe also clear all existing Feeds tasks from the queue table; it is possible that some old ones got stuck there. In the latest release of Feeds I found a solution for cleaning up stale queue tasks, but that only applies to newly created tasks - not to tasks that were already there from older Feeds releases.

Hm, maybe I should think about whether there is a way to clean up stale Feeds queue tasks that were created in the older format. The challenge would be to detect whether they are really stale. If cron doesn't run often on a site, they could potentially still belong to an active import.

smustgrave’s picture

It may be overkill, but maybe a UI to view the items in the queue and their created dates, and let the user choose which to delete? That may be out of scope for this ticket, but old values in the queue do make it hard to test.

MegaChriz’s picture

@smustgrave
For monitoring what's in the queue, there is already the Queue UI module. Finding the original created date of a queue item is hard, because when an exception occurs while executing a queue item, the item gets requeued (and then gets a new creation date); at least, that's the case for Feeds. Sometimes that is a good thing: for example, if an external source is temporarily unavailable, you would want Feeds to retry. But refining that process (for example, implementing a maximum number of retries) is indeed something for another issue. See for example #3277999: Add DelayedRequeueException feature to feeds queue - D9.1.

By the way, there is another issue closely related to this one: #3158678: FetcherResult::checkFile() throws \RuntimeException, which is never caught. That one could help you out of the loop, but it also stops the retry attempts that I talked about above.

MegaChriz’s picture

Setting this back to "Needs review" because I think that the issue reported here needs to be fixed in another issue, possibly #3158678: FetcherResult::checkFile() throws \RuntimeException, which is never caught.

smustgrave’s picture

We are using Redis for our queue, so maybe the bad values are in there. Clearing the queue table doesn't do anything for us.

MegaChriz’s picture

@smustgrave
Hm, that's unfortunate. Since 8.x-3.0-beta3, Feeds automatically cleans up tasks for a specific feed from the queue table whenever that feed finishes an import or gets unlocked. But if the queue is external, Feeds cannot clean up tasks from it: the queue API has no method to clear a specific task, only a method to clear a whole queue. The Feeds queues are currently defined per feed type. Perhaps each feed should get its own queue? That could potentially lead to a large number of queues; I'm not sure if that's a problem.
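
To illustrate the limitation: the only cleanup the core queue API offers is dropping an entire queue, which for Feeds means every task of a feed type (the queue name below is a guess based on the names mentioned earlier in this thread):

// Sketch of the whole-queue cleanup the core API offers; this removes the
// tasks for all feeds of the feed type, not just the one stuck feed.
$queue = \Drupal::queue('feeds_feed_refresh:my_feed_type');
$queue->deleteQueue();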

smustgrave’s picture

I have to dig more, but if the Feeds module is querying the queue table directly, that seems wrong to me. If it used the queue service, then clearing a queue would also be reflected in whatever backend is actually in use.

We have experienced many times that the default Drupal queue fails and causes performance issues.

MegaChriz’s picture

@smustgrave
We should probably discuss the queue topic further in another issue, but see the methods hasQueueTasks() and clearQueueTasks() in FeedImportHandler. hasQueueTasks() is used by FeedsLockBackend to determine whether an import still appears to be active. clearQueueTasks() is called when finishing an import or when unlocking a feed. When unlocking a feed, it is important that any active imports for that feed get killed, and that is now done by removing specific tasks from the queue table.

The code:

  /**
   * Checks if there are still tasks on the feeds queue.
   *
   * @param \Drupal\feeds\FeedInterface $feed
   *   The feed to look for in the queue.
   *
   * @return bool
   *   True if there are still tasks on the queue. False otherwise.
   */
  public function hasQueueTasks(FeedInterface $feed): bool {
    // First check if the queue table exists.
    if (!$this->database->schema()->tableExists('queue')) {
      // No queue table exists, so no tasks can exist on it.
      return FALSE;
    }

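    // Queue item data for Feeds tasks is a serialized array whose first
    // element is the feed ID, so matching on that serialized prefix finds
    // every task belonging to this feed.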
    $result = $this->database->select('queue')
      ->fields('queue', [])
      ->condition('data', 'a:3:{i:0;i:' . $feed->id() . ';%', 'LIKE')
      ->countQuery()
      ->execute()
      ->fetchField();

    return $result > 0;
  }

  /**
   * Removes all queue tasks for the given feed.
   *
   * @param \Drupal\feeds\FeedInterface $feed
   *   The feed for which to remove queue tasks.
   */
  public function clearQueueTasks(FeedInterface $feed): void {
    if (!$this->database->schema()->tableExists('queue')) {
      return;
    }
    $this->database->delete('queue')
      ->condition('data', 'a:3:{i:0;i:' . $feed->id() . ';%', 'LIKE')
      ->execute();
  }

I didn't see another way to clean up stale queue tasks. The only other option is to remove the whole queue, but that would kill imports of other feeds of the same feed type as well. Queues are currently defined per feed type, not per feed.

rdivince’s picture

Tested the MR and also experienced the RuntimeException error with /tmp/feeds_http_fetcher not existing in our load-balanced environment.

Still testing, but I wonder if it has something to do with our site also using s3fs for a remote file system, which is quite useful for load-balanced environments. I noticed the tempnam() method in the FeedsFileSystemBase class was using $fileSystem->realpath() to get the path. The Drupal API docs say regarding realpath:

"Resolves the absolute filepath of a local URI or filepath. The use of this method is discouraged, because it does not work for remote URIs. Except in rare cases, URIs should not be manually resolved. Only use this function if you know that the stream wrapper in the URI uses the local file system, and you need to pass an absolute path to a function that is incompatible with stream URIs.

https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21File%21Fi...

MegaChriz’s picture

@rdivince
Thanks for helping with testing! I have no experience with s3fs. Can you check what happens on that filesystem if you try to write a file the way FeedsFileSystemBase::tempnam() does? So test that separately, outside of the Feeds context.
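
Something like the following standalone probe (a sketch using core file APIs, with an assumed destination URI - replace it with the scheme that s3fs exposes on your site) should show whether plain writes to that filesystem work at all:

use Drupal\Core\File\FileSystemInterface;

// Standalone probe, outside the Feeds context: create a directory and write
// a small file on the configured filesystem.
$directory = 'private://feeds/in_progress/probe';
\Drupal::service('file_system')->prepareDirectory($directory, FileSystemInterface::CREATE_DIRECTORY | FileSystemInterface::MODIFY_PERMISSIONS);
$uri = \Drupal::service('file_system')->saveData('probe', $directory . '/probe.txt', FileSystemInterface::EXISTS_REPLACE);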

also experienced the RuntimeException error with /tmp/feeds_http_fetcher not existing in our load-balanced environment

That's probably because there are still queue items left in the queue. So in order to check whether the solution here works for future imports, try to empty these Feeds queues first. One way of doing that is by installing the Queue UI module. It can also be done with the drush command drush queue:delete, but then you need to know the exact queue name, which you can see with drush queue:list.

MegaChriz’s picture

@rdivince
I tried to find a way to test the patch with an S3 filesystem. It looks like I would need an account for Amazon’s Simple Storage Service (S3) for that. I tried to create an account there, but I got stuck at the step asking for credit card details; in the Netherlands it is not a given that you have one. So I’m tempted to commit this without the S3 check.

smustgrave’s picture

FYI: after I cleared our Redis queues and applied this patch across all our environments, we are no longer seeing the problem.

We are using Azure with a shared file mount. Our prod and stage environments are scaled up to multiple instances as well.

rdivince’s picture

@MegaChriz thanks for the info! I was able to clear the queue with the drush commands provided, but that did not fix the import errors. I tested file writing with tempnam outside of the Feeds context but did not have any luck. I did, however, get the MR to work as follows:

In FeedsFileSystemBase::tempnam(), instead of writing a temporary file with tempnam, I wrote a file with \Drupal::service('file.repository')->writeData(). This method should support both local and remote stream wrappers. That gave me additional error information: the getAsync call in HttpFetcher::get() appears to open a file in "w+" mode by default, which s3 (and perhaps other remote stream wrappers) does not seem to allow: https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/s3-stream-wra...

You can upload larger files by streaming data using fopen() and a “w”, “x”, or “a” stream access mode. The Amazon S3 stream wrapper does not support simultaneous read and write streams (e.g. “r+”, “w+”, etc). This is because the HTTP protocol doesn’t allow simultaneous reading and writing.

First opening the stream through GuzzleHttp\Psr7\Utils::tryFopen() with "w" mode and passing that file stream as the sink to getAsync worked perfectly. Not sure if other use cases require "w+".
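
For reference, the relevant change is roughly the following (a sketch; $client, $url and $destination_uri are assumed variables, and the exact HttpFetcher code around it may differ):

use GuzzleHttp\Psr7\Utils;
use GuzzleHttp\RequestOptions;

// Open the destination with "w" instead of "w+" so remote stream wrappers
// such as s3:// accept it, then let Guzzle stream the response body into it.
$sink = Utils::tryFopen($destination_uri, 'w');
$response = $client->getAsync($url, [RequestOptions::SINK => $sink])->wait();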

While these tweaks to the MR successfully resolved my errors with s3fs, I know mine is just one of many use cases and you'll know those better than me. So I've attached my tweaks as a patch and would like to hear your thoughts.

rdivince’s picture

Marking as hidden

pingwin4eg’s picture

Status: Needs review » Reviewed & tested by the community

The fix from the merge request works - the default private directory (private://feeds/in_progress/#) is used instead of the temporary one.

For those who still see the error with the tmp file, also check other fetcher classes. For example, in our client's project there is also the feeds_http_auth_fetcher module with its own fetcher, plus a custom one, and existing feed entities use all three. So I had to change those too, in a similar way to the HTTP fetcher.

  • MegaChriz committed 4062e01b on 8.x-3.x
    Issue #2912130 by Fant0m, MegaChriz, artematem, n4r3n, merauluka, bburg...
MegaChriz’s picture

Status: Reviewed & tested by the community » Fixed

@rdivince
I had been giving Feeds a low priority for the past month, so I guess that for your patch I'll need to dive deeper to find out how remote file systems work. Because of that, I'm merging the changes of this issue now and moving your patch to a follow-up issue. Perhaps we can find a way to add test coverage for remote stream wrappers too?

MegaChriz’s picture

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

bkosborne’s picture

Glad to see this landed! thanks everyone who contributed.

@MegaChriz any idea when you might cut a new beta that includes this?

MegaChriz’s picture

@bkosborne
I hope within a month from now. There is a follow-up issue that I would like to include in the next release as well: #3158678: FetcherResult::checkFile() throws \RuntimeException, which is never caught. While the fix for this issue only aims to prevent imports ending abruptly for new imports, that other issue should help clean up existing imports that are (still) stuck in a loop as a result of the issue that was reported here.

C-Logemann’s picture

I understand this new feature for fixing load-balancing issues. But these new tmp files inside the private folder can also break some backup strategies. So where is the "configurable" part?

Put http downloaded files in a configurable directory and added service to interact with this directory - to fix issue with missing temporary files in load balanced environments.

jwilson3’s picture

Use the in_progress_dir config element in the Feeds module's feeds.settings.yml config file.

The default location is now private://feeds/in_progress, assuming you've configured a private files directory; otherwise, it is public://feeds/in_progress.
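
For example, the stored value can be changed without a UI; a sketch assuming the in_progress_dir key described above, with an example directory:

// One-off change of the stored Feeds setting, e.g. via drush php:eval or an
// update hook.
\Drupal::configFactory()
  ->getEditable('feeds.settings')
  ->set('in_progress_dir', 'private://feeds/tmp')
  ->save();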

Review the commit from above for additional context:

https://git.drupalcode.org/project/feeds/-/commit/4062e01bb8471254d9daa5...

C-Logemann’s picture

@jwilson3 Thanks for the description. So it's a good candidate to set in settings.php.

In many situations I think configuration via settings.php is enough, especially for very specialized functionality. And this feature is such a special thing to solve load-balancing problems, but it can raise problems for non-advanced admins. So maybe it would be fine to make this feature and a setting for it more visible. But it's just an idea.

For our customers we will just exclude this folder via our backup scripts. We already do this with other temporary folders like image styles, rendered CSS, etc.