The module works like a dream together with s3fs_migrate and s3fs_file_proxy. I have this specific configuration because I am rolling out and need to slowly, piecemeal, migrate ~60,000 files. Once it has moved all the files, s3fs_migrate will conveniently enable the "use S3 for public / use S3 for private" s3fs settings automatically.

In a module, I issue a file_save() to capture a file in the File API. How can I supply something other than a hardcoded "public://" or "s3://s3fs-public" so that when the file is written, it lands in either public:// or s3:// depending on whether the migration is complete? More abstractly, is there a way to file_save() without caring what the system default scheme is? Note: file_default_scheme() always reports "public" regardless of the s3fs/s3fs_migrate settings and status.

Thanks!
-Bronius

tl;dr solution:
As a developer, if your module uses file_save_data() and s3fs is set to handle public writes, set your scheme to public:// as usual, and s3fs will stream the file out to S3. URLs generated against this public scheme will also present the S3 public link.

(And if that was not TL and you're still R'ing the tl;dr: in my specific recipe with s3fs_migrate and s3fs_file_proxy, this makes all the things happen seamlessly.)
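A minimal sketch of that recipe (the file name, path, and data are made up for illustration):

// Assumes "Use S3 for public:// files" is enabled in s3fs: a plain
// public:// destination is enough, and S3fsStreamWrapper streams the
// data out to the bucket.
$data = file_get_contents('/tmp/generated_image.png');
$file = file_save_data($data, 'public://generated/generated_image.png', FILE_EXISTS_REPLACE);
// file_create_url() then returns the S3 URL for the same URI.
$url = file_create_url($file->uri);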

Comments

texas-bronius created an issue.

coredumperror’s picture

I don't know how s3fs_migrate works, so I can't really answer that question. You'll have to figure out how to detect the "migration is complete" state and use that to determine which scheme to use, I guess?

texas-bronius’s picture

OK, so that also confirms that there is no "Hey Drupal, save this file the way you know how" function or scheme-grabber that s3fs works with. It will be up to my module code to be S3-aware and intentionally write to s3://s3fs-public directly?

coredumperror’s picture

Well, yes and no. You'll need to write the files into the S3 bucket under the s3fs-public folder, but the URI for the file doesn't actually include "s3fs-public/". The way S3FS works with public files is to treat "public://path/to/file" the same way it would treat "s3://s3fs-public/path/to/file". In other words, the "s3fs-public" folder is a transparent implementation detail that file URIs never directly reflect.
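To illustrate the mapping (hypothetical paths):

// With public:// takeover enabled, these two URIs end up at the same
// object in the bucket; the "s3fs-public/" prefix is added internally
// and never appears in the public:// form.
$takeover_uri = 'public://path/to/file';
$explicit_uri = 's3://s3fs-public/path/to/file';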

Due to this, I can't guarantee that your idea will even work. Putting files into S3 and expecting Drupal to treat them as public when the "Use S3 for public:// files" option isn't enabled probably won't work.

I'm curious why you feel the need for this strange hybrid approach. What are you doing that requires this?

texas-bronius’s picture

Ok thanks!

I am treating S3 as "just a replacement for sites/default/files" across this Drupal app, which I think is how it is designed. The s3fs_migrate bit is just a clever module that migrates existing files from public:// to s3:// in batches on cron over time. It insists that S3 for public and private are /not/ enabled so that it can marshal the requests, and then, once it sees it's out of files to migrate, it enables your S3 for public/private checkboxes and steps out of the picture. Files newly uploaded to the system while s3fs_migrate is still active will be saved as public:// and then magically turn into s3://s3fs-public in file_managed once migrated. This is all working just fine.

What's baking my noodle is a custom module that generates a publicly served image file: I assumed I could indefinitely write it as public:// and let s3fs_migrate move it over time (except that it disables itself). Then I thought, "OK, there must instead be a file_write alter or something exposed or implemented by s3fs.module that makes managing files in the File API seamless/transparent to the module." Surely not every custom file write needs to care what Drupal's File API is calling its default scheme? How does s3fs capture files once it is declared the default handler for public and private?

Does that make sense? Am I thinking about it the wrong way? Happy to hop into IRC or elsewhere .. if you're up to it.. :)

coredumperror’s picture

It seems to me that s3fs_migrate's operation depends on a method of running a production website that is exceptionally dangerous, from my point of view. Converting a live site that stores files locally into a site that stores files in S3 "on the fly" is very hard to imagine being safe.

The way I would do it would be to create a copy of the live site (e.g. www.example.com copied to test.example.com), configure that copy to use S3 and copy all the local files into your test site's bucket all at once. Once the test site is fully working with the S3 backend, switch the DNS to point your production URL at the server that's hosting the test site.

You'd have to migrate any changes made to the live site while setting up the test site, but that's a pretty common thing to need to do for any major change like this. The Migrate module should make it pretty easy, too. My code shop does that occasionally, though we have custom migration scripts for that purpose.

As for your questions about how s3fs works:

> Surely not every custom file write needs to care what Drupal's File API is calling its default scheme?
The way this works is that a File field is configured to use a specific scheme, and it uses that scheme forever unless you delete the field and recreate it with another one. The default scheme is only consulted when creating the field. I'm not entirely sure how it works with non-Field files, though (e.g. aggregated CSS), but I would guess that many are just hard-coded to use public://.
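For instance, in Drupal 7 you can read back the scheme a file field was created with (a sketch; 'field_my_image' is a made-up field name):

// 'uri_scheme' is fixed in the field settings at creation time.
$field = field_info_field('field_my_image');
$scheme = $field['settings']['uri_scheme'];  // e.g. 'public' or 's3'.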

> How does s3fs capture files once it is declared the default handler for public and private?
S3fs's public takeover method hooks into Drupal's stream wrapper registration via hook_stream_wrappers_alter() and claims "I own public:// now!". From then on, any URI that starts with public:// gets handled by the S3fsStreamWrapper class instead of DrupalPublicStreamWrapper.
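A stripped-down sketch of that takeover (the real s3fs code also checks its config and handles private://; the variable name here is an assumption):

/**
 * Implements hook_stream_wrappers_alter().
 */
function example_stream_wrappers_alter(&$wrappers) {
  // Claim public:// so S3fsStreamWrapper handles those URIs from now on.
  if (variable_get('s3fs_use_s3_for_public', FALSE)) {
    $wrappers['public']['class'] = 'S3fsStreamWrapper';
  }
}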

texas-bronius’s picture

Thank you for engaging in this kind of in depth back-and-forth! :)

OK, that makes great sense, except the last part: the point of s3fs_migrate is to 1) move files off local physical storage and 2) update the scheme in the file_managed table to reflect that the file has been moved. But what I'm hearing from you is that if one were simply to move the 60,000 files from local disk to S3, then the file_managed table needn't be updated? Because referring to a file as public:// is going to automatically seek out S3 instead?

I feel that I've complicated things by introducing s3fs_migrate, but (from my understanding of it) it's a really elegant solution to bridge the period between deployment and 100% migrated. It's knuckleheads like me who don't know how to reliably generate an img src URI in an outbound email that get caught up in the details. Right now I have something like src="sites/default/files/the_image.png", and I wrote s3fs_file_proxy to fetch the file automatically from S3 if it is not found locally. I /think/ this is a solid solution; I'm not sure how else to output a link that should live "forever."

coredumperror’s picture

> referring to a file as public:// is going to automatically seek out s3 instead?
Yes. You can check out the _s3fs_copy_file_system_to_s3() function (the meat behind the drush s3fs-copy-local command) to see how I've implemented this kind of migration.

I suppose that s3fs_migrate's solution is actually pretty good for the typical Drupal site, run by a non-programmer. Doing it "right" (from my perspective) is certainly quite a lot more complicated than simply using s3fs_migrate.

As for image URLs, I can see how s3fs_file_proxy fills that niche of making the same URL work both before and after a migrate from locally stored files to S3. I've never actually done that before, so I hadn't previously considered the side effects, like stale URLs.

texas-bronius’s picture

OK, brilliant! Very nice. But then back to the top with my original confusion: how should my code write a file via the Drupal File API so that it ends up on Amazon S3?

The original developer had the file-generating process write an unmanaged file to disk under sites/default/files (and not capture it in the File API). I added a couple of lines of code that take the newly written file in its current path and issue a file_save() to create the Drupal file entry, then point an existing node's file field at it. I need help understanding what part of this to change to make it work with s3fs. Please note (just for clarity) that if s3fs_migrate were running all the time (i.e. did /not/ disable itself once it finished all files), it would eventually pick up and move these locally placed files on its own :) We create hundreds of these files almost every day.

-Bronius

coredumperror’s picture

> How should my code write a file via the Drupal File API so that it ends up on Amazon S3?

So if I'm understanding you, you want a file that's being saved to the public file system to end up in S3 even though s3fs isn't currently overriding the public:// scheme? Is that right?

texas-bronius’s picture

Yes, that's correct. Except it's not for uploads (browser -> apache -> drupal) but a custom Drupal module writing the file directly (where it writes it doesn't matter; I can of course write it somewhere other than sites/default/files).

Is there a public s3fs function I should just call directly?

coredumperror’s picture

The problem here is the one that I mentioned near the top of this discussion: the place you put the file within your bucket will matter in a way that will make it inaccessible while public:// takeover remains disabled in s3fs. But I guess s3fs_migrate must somehow bridge that gap?

OK, I just read the migration code in s3fs_migrate, so I think I know what it's doing. If I'm right, you should set up the file creation code to give the file a uri of "s3://s3fs-public/path/to/file". That will cause s3fs to step in and put the file into your bucket, and it'll be in the right place for when you eventually move to public:// takeover mode.

However, that's not a URI that s3fs has been configured to handle normally. When public:// takeover is enabled, s3://s3fs-public/path/to/file is equivalent to public://path/to/file in terms of where the file is stored in your bucket, but not in terms of how s3fs treats those files. I'm not entirely sure what will happen once public:// takeover gets enabled and you have a bunch of files in the s3://s3fs-public/path/to/file format. That could get hairy, but I honestly don't know.

One issue that could crop up is that the URI for the file in the file_managed table may differ from the URI for the file in the s3fs_file table. I'm pretty sure that's bad, but I couldn't really say what problems, if any, it could trigger.

texas-bronius’s picture

Hey, I appreciate your bearing with me on this. What you're proposing is, I think, a sort of workaround to the config I plan to have at deployment: s3fs_migrate enabled and s3fs not yet taking over public://. In that case, I should actually write the file as normal, call it public:// ($file->uri = 'public://' . $uri;), and s3fs_migrate will move it on its next cron batch, and all will be OK thenceforward.

I looked over your unit tests and see you're using file_save_data(). That makes your "s3://s3fs-public/..." recommendation make sense! s3fs will see it and ship the file out to S3. I've been using file_save() because I'm just capturing an unmanaged file already written to sites/default/files.

OK, IT WORKS! I will abandon hope that there is a mechanism that says, "Throw me the file and a name, and I will put it wherever I see fit, because right now you seem to be in two minds about it." I have changed it to write to /tmp initially and then file_save_data() with the s3://s3fs-public prefix, which allows your module to stream it to S3. (I had tried just setting s3://s3fs-public/... with file_save(); no dice: the file stays at sites/default/files, while the file_managed entry shows s3://s3fs-public/... as expected, which is not correct.)
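In code, the working version looks roughly like this (the temp path, file name, and generator function are placeholders):

// Write the generated image to a scratch location first...
$tmp_path = '/tmp/the_image.png';
my_module_generate_image($tmp_path);  // Hypothetical generator.
// ...then let s3fs stream it to the bucket via file_save_data().
$file = file_save_data(file_get_contents($tmp_path), 's3://s3fs-public/the_image.png', FILE_EXISTS_REPLACE);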

Thanks again!
-Bronius

texas-bronius’s picture

Say... How is this line going to blow things up?
$img_uri = $base_uri . '/sites/default/files/' . file_uri_target($column_file->uri);
I use $img_uri in an img src tag. I think it's going to start giving me "sites/default/files/s3fs-public/..." now :(.

#whysitgottabesohard

The solution to this part: file_create_url(). I can use this function now because I am streaming to S3 directly and know with certainty where the file will be when the user opens her email with this embedded img src reference. Phew!
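Concretely, the swap (using the same $column_file as above):

// Before: hard-coded local path; breaks once the file lives in S3.
// $img_uri = $base_uri . '/sites/default/files/' . file_uri_target($column_file->uri);
// After: ask the stream wrapper that owns the URI's scheme for the URL.
$img_uri = file_create_url($column_file->uri);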

coredumperror’s picture

I don't think file_save() actually writes anything to disk. I think it just writes info to the file_managed table and nothing else. I'm pretty sure file_unmanaged_save_data() is the function that writes file data. But if you've got it working, there isn't really much reason to change, I don't think.
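For reference, the division of labor looks like this (a sketch with a made-up destination):

// file_unmanaged_save_data() writes the bytes through the stream wrapper...
$uri = file_unmanaged_save_data($data, 'public://example.png', FILE_EXISTS_REPLACE);
// ...while file_save() only records/updates the {file_managed} row.
$file = new stdClass();
$file->uri = $uri;
$file->filename = drupal_basename($uri);
$file->filemime = file_get_mimetype($uri);
$file->status = FILE_STATUS_PERMANENT;
file_save($file);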

texas-bronius’s picture

Status: Active » Closed (works as designed)

Thanks again for the module and your extensive, in-depth help. I added a tl;dr summary of the solution (which you were probably already screaming).

coredumperror’s picture

> I think it's going to start giving me "sites/default/files/s3fs-public/..." now :(.

Yeah, it probably will. Under normal circumstances, if you've got an s3:// URI, you can call this to convert it into the URL for the file in your bucket:

file_create_url($uri)

I'm pretty sure that'll work regardless of any weirdness with your s3:// vs. public:// stuff.

What that ultimately does is call the getExternalUrl() method on the stream wrapper that's responsible for that URI's scheme. For s3://, that's S3fsStreamWrapper, and for public:// it's DrupalPublicStreamWrapper (unless s3fs has public:// takeover enabled, in which case it's also S3fsStreamWrapper).