As part of a migration, we've got a bucket with 618,644 objects in it. However, drush s3fs-refresh-cache kept getting killed. The root cause was folder handling: the monolithic array caused OOM issues, and the INSERT query then tried to do everything at once.
The resolution was to create a separate, unindexed table that stores folders and allows duplicate entries, and to write to that folder table right after writing the file entries. At the end, I used GROUP BY to eliminate duplicates, inserted the result back into the file table, and dropped the folder table.
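For readers who want to see the shape of that approach, here is a minimal sketch in Python with SQLite. It is not the patch itself: the table names (file, folder_staging), the columns (uri, dir), the function name refresh_cache and the batch_size parameter are all assumptions made for illustration. It writes files and folders in small batches, lets duplicate folders pile up in an unindexed staging table, then merges them with GROUP BY and drops the staging table.

```python
import sqlite3


def refresh_cache(conn, object_keys, batch_size=1000):
    """Index object keys in batches, staging folders with duplicates allowed.

    Hypothetical sketch: table and column names are illustrative only.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS file "
        "(uri TEXT PRIMARY KEY, dir INTEGER NOT NULL)"
    )
    # Staging table: no index and no uniqueness constraint, so writing the
    # same folder many times is cheap and never fails.
    cur.execute("CREATE TABLE folder_staging (uri TEXT NOT NULL)")

    files, folders = [], []

    def flush():
        # Write the file entries first, then the folders seen in this batch.
        cur.executemany(
            "INSERT OR REPLACE INTO file (uri, dir) VALUES (?, 0)", files
        )
        cur.executemany("INSERT INTO folder_staging (uri) VALUES (?)", folders)
        files.clear()
        folders.clear()

    for key in object_keys:
        files.append((key,))
        # Every ancestor folder of the key, e.g. "a" and "a/b" for "a/b/c.txt".
        parts = key.split("/")[:-1]
        for depth in range(1, len(parts) + 1):
            folders.append(("/".join(parts[:depth]),))
        if len(files) >= batch_size:
            flush()
    flush()

    # Collapse duplicates once, at the end, and merge into the file table.
    # OR IGNORE skips a folder whose uri already exists in the file table.
    cur.execute(
        "INSERT OR IGNORE INTO file (uri, dir) "
        "SELECT uri, 1 FROM folder_staging GROUP BY uri"
    )
    cur.execute("DROP TABLE folder_staging")
    conn.commit()
```

Calling refresh_cache(sqlite3.connect(":memory:"), keys) with any iterable of keys keeps memory bounded by batch_size rather than by the total number of objects, because neither the file list nor the folder list is ever held in full.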
I've also added instrumentation, debugging and messaging to s3fs-refresh-cache; if you run drush s3fs-refresh-cache --debug, you'll get running updates about which operations are taking place and how much memory is in use.
| Comment | File | Size | Author |
|---|---|---|---|
| #1 | s3fs-refresh_scaling-2464031-2-D7.patch | 15.85 KB | fluxsauce |
Comments
Comment #1
fluxsauce commented
Comment #2
coredumperror commented
At a glance, this looks awesome. I haven't got time right now to review your patch, though. I'll get to it as soon as I have a few free hours at work.
Comment #3
Anonymous (not verified) commented
I can confirm this helped me out on a bucket with > 500,000 items.
Comment #5
coredumperror commented
Sorry this languished in the issue queue for so long. I've now applied your patch and pushed it up to git. I'd say it's about time for a new recommended release, so I'm going to do that as well.
Comment #7
jacob.embree commented
A child issue can be created to get this into 7.x-3.x. It cannot be cherry-picked.
Comment #8
ram4nd commented
Looks like this has been committed already.