As part of a migration, we've got a bucket with 618,644 objects in it. However, drush s3fs-refresh-cache kept getting killed. The root cause was folder handling: first the monolithic in-memory array of folders caused out-of-memory (OOM) failures, and then the INSERT query tried to write everything at once.

The resolution was to create a separate, unindexed table that stores folders and allows duplicate entries, and to write to that folder table right after writing the file entries. At the end, I used GROUP BY to eliminate duplicates, inserted the result back into the file table, and then dropped the folder table.
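For readers who want to see the shape of that approach, here is a minimal sketch. It is not the patch itself: it uses Python with an in-memory SQLite database rather than Drupal's database layer, and the names refresh_cache, file_cache, folder_staging, and batch_size are hypothetical, chosen only for illustration.

```python
import sqlite3
from pathlib import PurePosixPath


def refresh_cache(conn, object_keys, batch_size=1000):
    """Rebuild a file table from object keys, deduplicating folders in SQL.

    Illustrative only: table and column names are assumptions, not s3fs's schema.
    """
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS file_cache")
    cur.execute("DROP TABLE IF EXISTS folder_staging")
    cur.execute(
        "CREATE TABLE file_cache (uri TEXT PRIMARY KEY, is_dir INTEGER NOT NULL)"
    )
    # Staging table: no key, no index, duplicates allowed, so writes stay cheap.
    cur.execute("CREATE TABLE folder_staging (uri TEXT NOT NULL)")

    files, folders = [], []

    def flush():
        # Write the file rows first, then the folder rows seen in this batch.
        cur.executemany("INSERT INTO file_cache (uri, is_dir) VALUES (?, 0)", files)
        cur.executemany("INSERT INTO folder_staging (uri) VALUES (?)", folders)
        files.clear()
        folders.clear()

    for key in object_keys:
        files.append((key,))
        # Record every ancestor folder; repeats are expected and harmless here.
        for parent in PurePosixPath(key).parents:
            if str(parent) != ".":
                folders.append((str(parent),))
        if len(files) >= batch_size:
            flush()
    flush()

    # Collapse duplicates with GROUP BY and merge folders into the file table.
    # OR IGNORE skips a folder whose URI already exists as a file row.
    cur.execute(
        "INSERT OR IGNORE INTO file_cache (uri, is_dir) "
        "SELECT uri, 1 FROM folder_staging GROUP BY uri"
    )
    cur.execute("DROP TABLE folder_staging")
    conn.commit()
```

The point of the design is that memory use is bounded by the batch size rather than by the number of distinct folders, and deduplication is left to the database instead of a single large in-memory array.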

I've also added instrumentation, debugging, and messaging to s3fs-refresh-cache; if you run drush s3fs-refresh-cache --debug, you'll get running updates about which operations are taking place and how much memory is in use.

Comments

fluxsauce’s picture

Status: Active » Needs review
Attachment: new, 15.85 KB
coredumperror’s picture

At a glance, this looks awesome. I haven't got time right now to review your patch, though. I'll get to it as soon as I have a few free hours at work.

Anonymous’s picture

I can confirm this helped me out on a bucket with more than 500,000 items.

coredumperror’s picture

Status: Needs review » Fixed

Sorry this languished in the issue queue for so long. I've now applied your patch, and pushed it up to git. And I'd say it's about time for a new recommended release, so I'm going to do that as well.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

jacob.embree’s picture

Issue tags: +Missing from 7.x-3.x

A child issue can be created to get this into 7.x-3.x. It cannot be cherry-picked.

ram4nd’s picture

Issue tags: -Missing from 7.x-3.x

Looks like this has been committed already.