Problem/Motivation

  • A number of useful feature requests require a rethink of how this module is structured
  • There are inefficiencies in the code that, once addressed, would make it much easier for developers to extend the module for their own purposes.

Proposed resolution

Create a new 4.x version using the bones of 8.x-3.x with the following changes:

  1. New S3fsBucketType plugin type (a step too far? The idea is that we - or the dev using it - can make small adjustments for things like MinIO by extending a plugin and making edits)
  2. New S3fsAwsBucket plugin - easy to configure, hides a bunch of the customisation options
  3. New S3fsCustomBucket plugin - shows the more complex customisation options such as custom host (required)
  4. Config becomes config entity S3fsBucket; adjust settings forms appropriately, including choice of plugin
  5. Tightly integrate config entities with Key module for safe key storage if not using EC2 profile creds
  6. New S3fsStreamWrapperFactory to build stream wrappers out of config/plugins at runtime
  7. Move all caching out of the stream wrapper into base and child plugins where appropriate
  8. Expose internal read/write/(more?) API endpoints on the S3fs service so developers can (see the sketch after this list):
    • inject the service into their module
    • set the plugin to be used
    • read/write/(?) directly via the service without having to mess with streamwrappers
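
A minimal sketch of how that service API might be consumed. Every name here (the 's3fs' service ID, setPlugin(), write(), read()) is a hypothetical placeholder, not a committed interface:

<?php
// Hypothetical usage of the proposed service API; nothing here is final.
$s3fs = \Drupal::service('s3fs');
// Choose which bucket plugin should handle the calls.
$s3fs->setPlugin('s3fs_aws_bucket');
// Read/write directly against the bucket, no stream wrapper involved.
$s3fs->write('reports/2021-01.csv', $csv_data);
$data = $s3fs->read('reports/2021-01.csv');
?>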

Configuration that isn’t bucket-specific:

  • Bucket and path for public file stream
  • Bucket and path for private file stream

Remaining tasks

Gather feedback
Set up 4.x branch
Create new version including tests
Review / adjust until ready
Release

User interface changes

Many, particularly around configuration.

API changes

Internal APIs will be explicitly documented, rather than leaving developers to search for public methods.

Data model changes

Complete restructure of configuration data.
May need assistance with upgrade path.

Comments

Darvanen created an issue. See original summary.

darvanen’s picture

Title: 4.x Plan » 4.x Version
Priority: Normal » Major
larowlan’s picture

It would be good to extract the client/connection logic into a separate module so you could write custom code that requires an S3 client without hand-rolling your own client factory every time.

rivimey’s picture

First off, great initiative, thanks !

I would suggest that the module ship with a MinIO plugin as standard, because doing so will validate that the plugin structure/API is up to the job. I know when I did a plugin type a year or so ago I missed some important points about the API, which required redoing it a couple of times. Oh, and MinIO is a good target anyway!

I tried implementing bucket entities a while ago and my biggest issue was in making suitable stream wrappers. I suggest you start on that aspect first.

rivimey’s picture

Just a thought: for testing (both of the module and of site installs) it might be very helpful to have a Local bucket - that is, 'buckets' that map directly onto local files. It would make it much easier to find problems caused by the interaction between AWS and Drupal.

darvanen’s picture

re #3:

After speaking with @larowlan, I understand what's being requested is a factory which instantiates clients when given an address and credentials. I'm very happy to separate this out, and I can see how a separate module means that if the factory is all you need, you don't have to load all the bucket plugins into memory.

At this time I'm thinking a sub-module of this namespace that is required by the main module is the best bet here.

re #4:

Having separate plugin types is *all* about being able to support other vendors like MinIO. I might need help developing that particular plugin though.
Can you elaborate on why the stream wrappers were an issue for you?

re #5:

I'll keep that in mind and spin one up if I see an easy way to do it :)

rivimey’s picture

You may be interested to know of the google_api_client module, which is designed to create Google clients from credentials. I would suggest a similar approach, though I think the case here is more complex: while there is an assumption of AWS, that's not the only option.

It was a while ago, but the reason I found stream wrappers difficult is that it seemed the only way to make a performant stream wrapper was to code-generate the wrapper class, the same way Twig does. The option of creating just one class with a variable 'wrapper name' didn't work, because the only distinguishing thing PHP gives you at registration time is the class name, and if that is fixed you only get one wrapper name!
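
For illustration, a minimal sketch of the PHP-level constraint being described; the class is the module's existing wrapper, and the second registration is hypothetical:

<?php
// PHP binds one protocol to one class *name* at registration time.
stream_wrapper_register('s3', \Drupal\s3fs\StreamWrappers\S3fsStream::class);
// Reusing the same fixed class for a second scheme is possible, but the
// instance only learns which scheme invoked it from the URI it receives
// later (e.g. the $path argument to stream_open()).
stream_wrapper_register('minio', \Drupal\s3fs\StreamWrappers\S3fsStream::class);
?>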

There's more info, and the code as far as I got, here: https://www.drupal.org/project/s3fs/issues/2982231#comment-13202854

I think a code-generated option is the only way forward, but it obviously needs to be done right.

darvanen’s picture

Assigned: darvanen » Unassigned

I have to put this down for a little while but I've got a certain distance:

  • As requested, there is a new s3_client module which simply wraps the AWS Client in a factory to enable mocking in tests.
  • There is a new s3_bucket module that defines a plugin system based on the pattern used by search_api.
  • There is a SimpleAWSBucketPlugin with just one field for now, which demonstrates that the configurable plugin forms will render
  • The form submission does not yet work.

My plan from here would be to:

  1. Integrate the config entity (not the plugins) with the key module for credential storage (TBD?)
  2. Make the form submission work, check that config is being saved correctly
  3. Finish off the bare minimum fields for the SimpleAWSBucketPlugin
  4. Attempt to create a FileStreamWrapper that utilises a configured plugin, to determine the way forward in restructuring the current service
  5. Seek further input

I'll reassign to myself if I get another good chunk of time to allocate towards this.

You'll find all the work in the issue fork; it has recently been rebased on the various commits going into 3.x.

rivimey’s picture

Hey Darvanen, thank you very much for your efforts. They sound great!

cmlara’s picture

Version: 8.x-3.x-dev » 4.0.x-dev

Changing version to new 4.0.x-dev branch for easier tracking.

cmlara’s picture

I've rebased on the latest 3.x and pushed a rough concept build up to the issue branch for feedback, to see if there are better ways to do some of these actions than the way I am doing them.

At the moment the S3CustomBucket plugin appears to work as long as the bucket is in the us-east-1 region (or if you manually reimport config to change the region; there is no region detection code yet).

I would like feedback to see if this is poorly implemented before going too much further with it.

Design:
  • A Metadata service holds the majority of the responsibility for working with the local 'cache' database (the s3fs_file table).
  • Plugins handle the majority of the actions where bucket configuration affects how the action is performed.
  • The stream wrapper calls the plugins as needed for decisions.

StreamWrappers are created by a decorated StreamWrapperManager and are cached there. (This means there isn't an actual 'service' for each StreamWrapper as exists in core by default; however, I don't see this causing an issue.) It does raise one concern: since, to my knowledge, few (if any) contrib modules currently decorate the StreamWrapperManager service, an unknown amount of contrib code could be mistakenly typehinted on StreamWrapperManager instead of StreamWrapperManagerInterface.
I'm currently relying on the machine name of the entity to be the protocol, though I suspect in the future we will want to change that.

The majority of this code is copied from the existing StreamWrapper, with changes to variable names and removed settings.

Tests have been modified to remove features that no longer exist and to convert them to configure the Config Entity instead of ::Settings.

At the moment the Drush tests are failing because the drush command looks for the s3fs streamwrapper service (we can likely switch away from this, as I'm not even sure we use it in drush any more).
The FileService test fails because 1) I haven't converted the S3fsFileService to use the new Config Entities yet, and 2) the test has a flaw that shows up when using the native filesystem service: it tries to call PHP's move_uploaded_file() on a file that was not uploaded via a POST.

Thoughts? Better ways this could be implemented? Concerns?

larowlan’s picture

Aw man, I didn't realise there was a separate client module coming. I released https://drupal.org/project/s3client this week as a standalone module spun out of https://drupal.org/project/preview_site

When this gets a stable 4.x release, I'll mothball that.

darvanen’s picture

This is starting to get into realms I do not understand at all, thank you for picking it up!

The only thing I will say is that I intended for bucket plugins to be applicable to more than one configurable stream. So you could configure one bucket and use it for both private and public streams for instance.

I wanted to create named streams that were configured to use a particular bucket - I had not yet worked out how to do this (or if it was even possible).

rivimey’s picture

@cmlara, I have had a browse through the code and I'm not seeing very much to worry about, which is good. It has the feeling of being on the right track. The only exception to that is, perhaps, whether you really need to duplicate the StreamWrapperManager class rather than inherit it. Perhaps I'm missing something?

I would find it helpful if you could post a high-level overview of what you are aiming for, structurally, and perhaps even post that to an appropriate docs location. There are a lot of 'moving parts' here and noticing important ones is hard.

I admit I hadn't been aware of the StreamWrapperManager service before, but am now informed! However, regarding "few (if any) contrib modules currently decorate the StreamWrapper Manager service": if that is so, then it's simply their bug to fix and not a reason to not do it right. I was concerned about one thing -- again possibly my lack of knowledge -- which is: if the manager service keeps a record of the known stream wrappers, is it a bad thing to split that into an S3 wrappers list and another 'others' list? Might that not result in valid names not being recognised correctly?

Personally I find the thought of the machine name being the name of the protocol quite suitable, and if I was decision-making would want a very good reason to make it otherwise!

When looking at the wrapper 'stat' code, I noticed that the 'nlink' field should be 1, not 0, for files. Ideally, 'directories' should have nlink >=2 but it's not so important. nlink is used to signal how many hard links a file has, and some code checks it. '0' would signal an orphan file that was not in any directory -- normally an error. '1' indicates it's in one directory -- the normal case, while >1 would signal a multiply-hardlinked file. Directories often have nlink = 2+{num.dir.entries}; '2+' because of the parent directory link (".."), and its link to itself (".").

How have you solved the stream wrapper problem I mentioned before (in #7)?

@darvanen, Allowing bucket plugins to be usable for more than one stream sounds like a reasonable idea. I guess the most important aspect is that the bucket metadata cache is tied not to the stream definition but to the bucket, and the stream uses the appropriate bucket data, which sounds like a good plan anyway. I guess the stream would have the option of a non-empty 'directory' prefix to separate the stream files, to keep private files private.

I hope the metadata cache is already thread safe but not too heavyweight in its locks, as otherwise updates will be slow or break things.

cmlara’s picture

Per request in #14 here is an architecture layout.

NOTE: I agree with the comments in #13 requesting that the bucket configuration be separate from the StreamWrapper configuration. Those comments will cause significant changes to the Bucket Plugin and either minor or significant changes to S3fsBucketStream. They will cause a rethink of some of the operations, and it's worth having the discussion. This documents the code as published at the time of writing.

Target architecture goals:

  • Multiple buckets across multiple S3 providers should be supported.
  • StreamWrappers can be added/removed as needed through config.
  • Settings.php should be used only when absolutely necessary for a setting.
  • High testability (much of the code in 3.x can currently only be tested at a functional level due to dependencies).

Architecture as proposed (based off of what I believe I understood the suggestions to be), along with reasons for the choice in method:

s3fs_client sub-module:
Provides a service that returns S3Client objects. For s3fs as a project, I believe its main advantage is that it gives us an option to mock the S3 service for tests, which will hopefully lead to better-tested code. It has ancillary benefits for others who may want to access an S3Client as a service.

s3fs_bucket_stream Drupal (Symfony) service:
Its sole mission is to provide NON-SHARED copies of the S3fsBucketStream class to the s3fs_stream_wrapper_manager_decorator service. This allows other modules to override our StreamWrapper class if ever needed. Note: normal services are SHARED; one instance is reused repeatedly.

S3fsStreamWrapperManager (s3fs_stream_wrapper_manager_decorator service):
This is a (Symfony) service decorator. Normally in Drupal a StreamWrapper is registered with the core stream_wrapper_manager service by having a 'stream_wrapper' tag in its service definition. Tags for services to be registered into the stream_wrapper_manager service are only processed during the compiler pass, which happens BEFORE the Drupal container is online; to my knowledge we do NOT have access to the majority of (if any) Drupal services at that point. Of most significance to us, we cannot get to the entity_type.manager service to load ConfigEntities and look up which StreamWrappers we may want to register.

Once the container is initialized and the entity_type.manager service is available as a dependency, the ContainerBuilder will try to add the known (core and contrib) StreamWrappers by calling addStreamWrapper(). The addStreamWrapper() call gives us a chance to build our own list of s3fs-based StreamWrappers. I could think of no other place where we are guaranteed to be called, as to my knowledge there is no way to force a service to be loaded until it is requested as a dependency for injection. Additionally, the addStreamWrapper() function is documented as 'internal use only' even though it is a public function; I interpret that to mean the core team intends only core and service decorators to call it.

Later during bootstrap, core calls the stream_wrapper_manager service to register() the StreamWrappers, which is when they actually become available to PHP for I/O operations. We have a very small window between container build and StreamWrapper register() in which to inject our s3fs-based streams, or we risk breaking core. As such, I chose the service decorator method.
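
A rough sketch of that decorator hook, assuming core's addStreamWrapper($service_id, $class, $scheme) signature; the entity lookup and helper method are illustrative only:

<?php
// Sketch only: piggy-back on core's registration call to append wrappers
// built from s3fs config entities once the container is available.
public function addStreamWrapper($service_id, $class, $scheme) {
  $this->inner->addStreamWrapper($service_id, $class, $scheme);
  if (!$this->s3fsWrappersAdded) {
    $storage = $this->entityTypeManager->getStorage('s3fs_bucket');
    foreach ($storage->loadMultiple() as $id => $entity) {
      // registerS3fsWrapper() is a hypothetical helper that builds and
      // stores a wrapper instance for the entity's scheme.
      $this->registerS3fsWrapper($id, $entity);
    }
    $this->s3fsWrappersAdded = TRUE;
  }
}
?>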

The stream_wrapper_manager service in core maintains a mapping of scheme (public/private/s3/etc.) to service, so that it can serve Drupal-level calls (getExternalUrl() being the most commonly called function). Since we do not have any registered services for our StreamWrappers (because of the ConfigEntities limitation), I have the stream_wrapper_manager maintain an initialized class for each StreamWrapper, in a way that I believe is similar to how the Symfony service system maintains shared instances of services.

S3fsBucketStream StreamWrapper:
This is primarily a shim class that provides the function mappings required by both Drupal and PHP for StreamWrappers, and provides the information necessary for \Aws\S3\StreamWrapper to perform the actual file operations against the S3 bucket. Where possible this shim will return answers directly for settings that should not be impacted by plugins (primarily related to the fact that this is a remote StreamWrapper), and when necessary it will call the Bucket plugin and ask it to provide information. The information requested is primarily along the lines of "where in the bucket should this file be", "what S3 options need to be set before we call hook alter to allow third-party modules to make changes", and "what is the metadata on this file".

S3fs Bucket Plugin:
Bucket plugins exist to allow customization of the StreamWrapper operations without needing to modify the StreamWrapper code itself. A plugin provides a configured S3Client that S3fsBucketStream can pass to the AWS SDK. Additionally, it exists to aid in simplifying configuration by allowing configuration options to be set by the plugins. The StreamWrapper operations can, however, be impacted by the plugin settings.

The Bucket plugins have become the primary workers as it relates to our control of the StreamWrappers. The majority of the code that has historically been in \Drupal\s3fs\StreamWrappers\S3fsStream has now moved to the Bucket plugin. The plugins store the configuration for each StreamWrapper and perform actions as necessary. Should we encounter a provider that does not work with the existing settings, a new plugin can accommodate it without changes to the StreamWrapper itself.

S3fsMetadataService:
It is targeted to have the primary responsibility of writing data to the local metadata cache table (s3fs_file), instead of the current behavior where database reads/writes happen as inline code as needed in the StreamWrapper. Not all code has been fully migrated to use this service yet.

I believe this covers the architecture as submitted to the issue fork.

cmlara’s picture

#13
I will make no claim of being an expert in this area either; this has mostly been a "learn how this is done and see if it applies to us" task for me.

Regarding buckets being separate from the StreamWrapper: I can see validity in that request.

From an implementation standpoint, I think the big question is: does any code go to a new "StreamWrapper" plugin (so the StreamWrapper configuration pages can be customizable), or do I revert to how S3fsStream does it in s3fs, where the majority of the code is in the StreamWrapper and we expect the configuration form to be standardized across StreamWrappers?

I agree in this scenario the cache should be associated to the bucket instead of to the StreamWrapper.

This does mean the whole bucket path likely needs to be stored in the cache table; #2584619: Proper Impementation for Primary Key would become more important, to get a larger keyspace.

A minor concern is that files can show up in multiple StreamWrappers if they share the same path on a bucket; however, that is honestly much better than the current scenario, where files could be silently overwritten in a similar situation.

#14 @rivimey
Hopefully post #15 helps.

Regarding the StreamWrapperManager service code not extending core's: the main reason is that it allows other modules to eventually decorate as well. Our manager only needs to be responsible for s3fs-provided wrappers and can pass everything else on to other code. https://www.phase2technology.com/blog/using-symfony-service-decorators-d... The magic __call() method would allow less boilerplate code; however, it doesn't work if someone calls method_exists().

Regarding names not being recognized correctly: it is of course not risk-free, but I feel that relying on core to register its own StreamWrappers gives us a much lower workload, and it has to be done that way to allow other modules to make changes in the future. #3209813: Breaks other StreamWrappers that use an http/https api calls (S3FS) comes to mind as an issue where we would want other modules to be able to choose how they process stream wrappers.

Regarding the #7 performance concerns:

First off, I haven't profiled this code yet to know just how big an impact it will have (open to advice on the best way to do that). I did, however, take the following steps to attempt to limit the impact as much as possible:

1) Many of the functions inside Drupal that need the StreamWrapper are actually called through the StreamWrapperManager, getExternalUrl() likely being the largest offender. Because of this, the StreamWrapperManager decorator caches each StreamWrapper and reuses it, avoiding the penalty of initializing a new StreamWrapper. (Core normally does this by storing the service_id of the registered stream_wrapper.<protocol> service, while I am storing the initialized classes directly; to my knowledge this is exactly how the Symfony services system stores shared instances, I just had to move it into the StreamWrapperManager to fit our needs.) This saves the time of obtaining the configuration of how the wrapper should be presented (prefix configs, CNAMEs, etc.) by keeping it in memory.

2) Much like we do currently in 3.x, I made sure we maintain a drupal_static of the StreamWrapper configs, so that any new invocation of the StreamWrapper (every time an open/seek/read/etc. occurs from core PHP) will already have this data in memory and will not need to check the database for configs.

I do see one change I could make: currently every setUri() call causes a preg_match() via ::getSchema(). The cached wrappers could store a flag such as $this->isStaticSchema = TRUE, so the lookup becomes if (!$this->isStaticSchema) { S3fsStreamWrapperManager::getSchema(); }. This could knock off a large number of regex calls and improve performance.
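
A sketch of that optimisation in context; the property and method names follow the comment above and are not final:

<?php
// Sketch: skip the per-call regex when the scheme can never change for
// this cached wrapper instance (names follow the proposal above).
public function setUri($uri) {
  $this->uri = $uri;
  if (!$this->isStaticSchema) {
    $this->schema = S3fsStreamWrapperManager::getSchema($uri);
  }
}
?>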

Typehint Contrib:
Agreed, a typehint fault in other modules would not be a hard stop for this; the main reason I brought up the concern was to make sure everyone is aware of it. Based on our experience decorating the FileSystem service, we know this has the potential to unexpectedly impact contrib modules, causing a WSOD; we will want to encourage awareness of this during the testing phases.

Stat code: good catch. That is actually a direct copy from the 3.x branch, so the fault exists in the current code as well.

Machine Name:
My biggest concern, and why we should possibly allow a choice, is that machine names are based on the initial name of the config entity. If my entity is named "LongBucketId bucket", we end up with a very long prefix on every file path, "LongBucketId://path/to/file.txt", using up a lot of the available filename length (Drupal core's file_managed table has a 255-character limit for URIs). Additionally, when entities are renamed, or all share the same prefix, it becomes much harder to understand which scheme belongs to which StreamWrapper config.

darvanen’s picture

#16:
I would absolutely create StreamWrapper plugins that each consume a Bucket plugin. I would have started down that path, but I couldn't see what would be needed at the other end. Agreed this is a process of learning - but you do have a little more experience with the S3 side of things than I do.

Re files existing on multiple streamwrappers... that would be possible even without this separation. We can't protect against everything.

From my perspective it's fairly simple: a bucket plugin represents a single S3 bucket that I have configured in the cloud, and each streamwrapper represents a way of using that bucket, if I choose to use it for multiple purposes.
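
A hypothetical illustration of that one-bucket, many-streams model; every ID and key below is made up for the example:

<?php
// One configured bucket, reused by two stream wrappers with different
// prefixes (purely illustrative data structures).
$bucket = ['id' => 'assets_bucket', 'region' => 'ap-southeast-2'];
$wrappers = [
  ['scheme' => 'public',  'bucket' => 'assets_bucket', 'prefix' => 'web-public'],
  ['scheme' => 'private', 'bucket' => 'assets_bucket', 'prefix' => 'web-private'],
];
?>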

rivimey’s picture

Thanks for the extensive comments... I think many of the comments in #15 could be added as @file comments to the appropriate PHP files.

I'm going to try to rephrase those comments, to check I've understood:

The Drupal StreamWrapperManager service exists to register streams (such as public:) based on tags in the code, which are discovered in the compiler pass of the boot sequence. At that point the framework is not initialized. Once the compiler pass completes, the class's addStreamWrapper() function is called for the discovered tags to make those streams real.

The S3fsStreamWrapperManager extends the existing Drupal service to include dynamically created S3 tags, which are then added using the same addStreamWrapper() function. It is invoked by virtue of having its own statically declared tag: stream_wrapper and thus being found during the compiler pass. [[more here I think]].

The S3 Stream Wrapper uses instances of the Bucket plugin to manage each individual remote bucket, and those have an associated subsection of the s3 file metadata cache. Interactions with S3 files are all performed via the bucket instances and hence with the information from the metadata cache. The stream wrapper instance links the bucket instance with Drupal, [[and should IMO include an optional bucket path prefix]]. Drupal FileSystem settings, etc, are therefore interpreted in the wrapper.

For performance reasons many of the stream wrapper objects are cached in memory, using drupal_static and other methods, and the file metadata cache itself is constructed as a Drupal Cache Bin, enabling offload to e.g. memcache.

The s3client submodule provides instances of the AWS S3 SDK client object to callers, and nothing else. Further configuration or coordination is up to the caller.

Please feel free to extend/correct!

I get the feeling that we should locate other devs who have experience of the Drupal stream wrappers to provide guidance on this architecture, as none of us seems confident that we understand how it should be used. Looking through commit logs, perhaps @Berdir, @heddn or @mikelutz?

jonathanshaw’s picture

#15

Tags for services to be registered into the stream_wrapper_manager service are only processed during the Compiler Pass, this is BEFORE the Drupal Container is online, we do NOT (to my knowledge) have access to the majority (any) of Drupal services at this point. Of most significant importance to us is we can not get to the entity_type.manager service to load ConfigEntities to look up what StreamWrappers we may want to register.

I wonder if you could handle this with a custom compiler pass:
https://www.drupal.org/docs/drupal-apis/services-and-dependency-injectio...
https://symfony.com/doc/3.4/components/dependency_injection/compilation....

The idea would be to let the default compiler pass happen, making entity_type.manager available on the container, and then use that entity_type.manager in a second compiler pass.

No idea if this would really work; this stuff is beyond my knowledge.

darvanen’s picture

I second @rivimey's call for an optional bucket prefix on the stream wrapper instance, that is where I was planning to put the existing behaviour of setting prefixes.

cmlara’s picture

#19 jonathanshaw
I appreciate the suggestion, and indeed it was an area I considered as a possible option. Unfortunately, when I tested it before decorating the StreamWrapperManager service, I discovered it would throw an exception:

Constructing service "entity_type.manager" from a parent definition is not supported at build time.

Two Symfony issues are relevant, and basically indicate that get() should never be called during the container build process:
https://github.com/symfony/symfony/issues/18673#issuecomment-216183042
https://github.com/symfony/symfony/pull/18728
As I understand it, even though we are in a different compiler pass, we are actually still inside the original compile() before the build has completed, without a fully functional container.

#18 @rivimey
If any of them were to opine I would certainly listen.

The situation with the StreamWrapperManager is mostly about ensuring we register all StreamWrappers early enough for Drupal core to utilize them. To date, the modules I have seen that have an arbitrary number of StreamWrappers (flysystem is one that comes to mind) have primarily used settings.php. Settings.php has the advantage of being available during the compiler pass, so it can be used to provision normally; this gives a much easier developer experience. However, it requires StreamWrapper configuration to live in settings.php, where it is subject to less configuration validation and control, arguably giving an administrator experience that could be improved upon.

The S3fsStreamWrapperManager extends (decorates) the existing Drupal service to include dynamically created s3fs-based StreamWrappers, which are added to its internal list during the addStreamWrapper() function. It is invoked by virtue of the fact that core attempts to add the standard Drupal StreamWrappers. Registration of the StreamWrappers as valid path handlers is invoked by Drupal bootstrap calling the register() function.

(Note: as I write that, I realize I should probably not be using the addStreamWrapper() function to populate the internal structure, and should instead move that to the constructor.)

For performance reasons many of the stream wrapper objects are cached in memory, using drupal_static and other methods. Additionally, the file metadata cache is persisted in the database to avoid metadata requests being sent to the S3 bucket, and it utilizes a Drupal cache bin for performance, enabling offload to e.g. memcache.

Note: this is the same as the 3.x architecture. The database holds the long-term data; upon a "cache read" we check the Drupal cache bin and, failing that, fall back to the database, where on a successful read we populate the cache bin.
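
For clarity, a sketch of that two-tier read, assuming a CacheBackendInterface bin and the s3fs_file table; the method, property, and field names are illustrative:

<?php
// Sketch of the metadata read path: cache bin first, database second.
public function readCache(string $uri): ?array {
  if ($hit = $this->cacheBin->get($uri)) {
    return $hit->data;
  }
  $record = $this->database->select('s3fs_file', 'f')
    ->fields('f')
    ->condition('uri', $uri)
    ->execute()
    ->fetchAssoc();
  if ($record) {
    // Populate the bin so the next read skips the database.
    $this->cacheBin->set($uri, $record);
    return $record;
  }
  return NULL;
}
?>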

Beyond the minor proposed tweaks, I believe your rephrasing is accurate, and I agree much of this information can be put into the files as documentation.

cmlara’s picture

I've pushed a new version up for opinion. I've populated more of the documentation this time.

WARNING: This version is no longer compatible with the 3.x s3fs_file table. Ensure 3.x is uninstalled before deploying. There is no cache refresh code yet; this code is really only useful in a dev lab on an isolated bucket for previewing the changes. Assuming the feedback is positive, I will next work on adding the cache refresh and file migration code to allow use on testing sites.

A couple of new 'helper' services have been added. I didn't think these belong inside the PluginManager services; however, I'm open to opinions on that.
S3fsStreamWrapperProvider
-- Provides a list of enabled streamWrapper plugins.
-- Caches and returns constructed plugins on request.

S3fsBucketProvider
-- Provides a list of enabled buckets.
-- Caches and returns constructed plugins on request.
--- Used by the streamWrapper plugins.

I tried to implement this so that a Bucket plugin can make changes that work all the way back to the streamWrapper, and the same for the streamWrapper plugins modifying data in the middle and passing it in both directions. In theory I believe the design allows for a situation where the bucket does not match the same structure as the streamWrapper; I don't expect this to be a common scenario and am not designing around it.

The Bucket plugins work with paths, while the streamWrapper plugins work with URIs. This was done to allow the Bucket plugins to manage the cache and operations in a manner agnostic of the streamWrapper plugins, and to allow the streamWrappers to share the data. I did a quick test and was able to access file data on two different registered streamWrappers sharing the same bucket, showing consistency.

It is probably worth a discussion on whether all these functions belong in sub-modules or in the main module, to decide where components go.

cmlara’s picture

I've pushed up additional changes, with the commit messages detailing their purposes, to help flesh this out some more and make testing easier.

For those looking at the code: the modules/ folder is the main area of active code. Some code outside the modules/ folder is still used, as it has yet to be ported away from the s3fs.settings page (mostly the css/js aggregation code); however, the S3fsService.php that has provided the majority of the heavy lifting in 3.x is currently not used at all by the active streamWrappers.

Going back to #14 @rivimey:
I did hit one issue during testing where you were right to be concerned about the two stream_wrapper_managers keeping their own lists. Testing 'public://' takeover without disabling public:// created a scenario where the Drupal streamWrapper was registered for PHP to use, but the s3fs-based module was serving Drupal requests. I've solved this by changing the order of registering: we now register the core wrappers first, then register ours, forcing a stream_wrapper_unregister() the same way that core does. Alternatively, we could register the core wrappers, check whether any schemes already exist in core, and refuse to register on our side. I chose to have our streams take priority, as I believe it will make it easier for s3fs to take over from any other streamWrapper should a site need to migrate from one streamWrapper to another; however, I could see arguments for the more conservative route of refusing to register. I'm open to more opinions on this.
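
A sketch of that registration order; the variable names are illustrative, and the unregister call mirrors what core does when replacing a built-in wrapper:

<?php
// Core wrappers are registered first; s3fs wrappers then take priority.
foreach ($s3fs_wrappers as $scheme => $class) {
  if (in_array($scheme, stream_get_wrappers(), TRUE)) {
    stream_wrapper_unregister($scheme);
  }
  stream_wrapper_register($scheme, $class);
}
?>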

I do think we should decide on a list of protocols that should be forbidden as machine names in the UI, and possibly even checked inside S3fsStreamWrapperManager before registering. Just looking at my local PHP install, I can see the following names that we would likely want to forbid, and I'm sure there are more we should consider.

    [0] => https
    [1] => ftps
    [2] => compress.zlib
    [3] => compress.bzip2
    [4] => php
    [5] => file
    [6] => glob
    [7] => data
    [8] => http
    [9] => ftp
    [10] => sqlsrv
    [11] => zip
    [12] => phar
    [14] => temporary

The http/https entries are very important, as registering those could break the S3Client's ability to reach the bucket provider; the rest seem very reasonable to protect.
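
A hedged sketch of how such a deny-list might be enforced when validating a machine name; the constant, variable, and exception choice are all illustrative:

<?php
// Reserved schemes that s3fs config should refuse to register.
const S3FS_FORBIDDEN_SCHEMES = [
  'http', 'https', 'ftp', 'ftps', 'php', 'file', 'glob', 'data',
  'compress.zlib', 'compress.bzip2', 'sqlsrv', 'zip', 'phar', 'temporary',
];

if (in_array($machine_name, S3FS_FORBIDDEN_SCHEMES, TRUE)) {
  throw new \InvalidArgumentException("Scheme '$machine_name' is reserved.");
}
?>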

#22:
It's been asked in Slack whether we actually need the helper services, and I can see an argument for removing them. Originally I thought I would need them much more often than I have actually used them. Additionally, I've since come to the opinion that it is better practice to use the entity to create the instance rather than calling the pluginManager, as it keeps the code in one place and is shorter than loading the entity, obtaining the config, and initiating the plugin each time. At the very least they may need a rename from 'provider' to fit better with what other developers may be expecting.

I've also pushed up a commit to the issue fork that restores the ability to refresh the metadata cache and to copy the local files to a bucket. An uninstall is still needed to reset the schema (we can look at a migration path from 3.x later, once things are more stable), though at least now this code can be used against existing buckets and dev sites for testing.

While I questioned having sub-modules for each component, I am starting to appreciate the separation they provide, at the very least making the code easier to manage. At this point I'm thinking, unless objection is raised, that s3fs_client, s3fs_bucket and s3fs_streamwrapper will stay as sub-modules, and the code currently in s3fs will (where possible) continue to be moved into the appropriate module.

#16
My concerns about the machine name being used to identify the protocol have subsided. For some reason I wasn't thinking about the fact that the UI allows the machine name to be edited when first created, and that the UI shows the machine name on its pages. I agree with @rivimey in #14 that, unless a good reason shows up, the machine name is the best field for defining the protocol.

Currently known not working features:
StreamWrapper Configuration > Link Generation > Deliver using Drupal (aka private:// takeover and the ability to use any bucket as private://)

rivimey’s picture

@cmlara, thanks for investigating the stream_wrapper_managers issue. I'm glad you have been able to characterise the issue further. I presume that calling stream_wrapper_unregister() means core will no longer respond for that name in any way, even if by some unknown means it did get asked about it. I did tweet at @berdir about help but didn't get a response at that time. It may be useful for you to contact him directly?

I would encourage you to actively delete code that is no longer used :-) I agree there should be a list of forbidden protocols, and I think the list you showed is a very good starting point. I'm not sure whether the list should be externally modifiable (e.g. from settings.php) - I suspect it should be, but cannot think of a reason so far.

Unless the helper services are adding something, I suggest they be removed, unless we know of instances where they need to be used externally. Perhaps an interim solution would be to prove the current service implementations can be replaced by new implementations that call the new code.

Agreed a migration path is needed, but well done for getting the metadata cache and local files working again. Submodules are good, I think, also because they allow unneeded functionality to be disabled, reducing the attack surface and complexity of the module.

darvanen’s picture

Just installed to take a look.

I think that in making the Key module a hard requirement for bucket config, and putting bucket config in a submodule, I've set us up to fail.

It doesn't make sense to require drupal/key in the composer.json file for people who only want the client factory. By the same measure, if someone only wants the client factory, why make them download the rest of the package? I wonder if it would be better to split at least that piece out into a separate project, like @larowlan already did (I think)?

cmlara’s picture

#24 @rivimey:

A little background: StreamWrapper classes are used in two places. First, PHP constructs them directly for any request to access a registered stream wrapper; only one class can be registered at a time for any protocol. Second, Drupal constructs them as services, which are used by the StreamWrapperManager to access the classes.

Once unregistered, PHP core no longer calls the class and waits until a new class is registered. As for the StreamWrapperManager: since we are a decorator, we intercept any requests from other modules (either via the file_system service or direct to the StreamWrapper service) and (as long as we don't have a bug in our service) provide our own class, so the core one doesn't matter.

Since core uses services, it would be possible to do $container->get('stream_wrapper.public') and have direct access to the StreamWrapper class. However, I saw no location in core doing this. I believe this would be considered a bug in most modules, as they should not need to interact with the StreamWrapper classes directly and should always call the StreamWrapperManager; however, we can't say it will never happen. I made this same mistake myself in the 3.x branch for the batch update process.

Second, from memory I recall it is possible to get the original core StreamWrapperManager. I don't have the exact code in front of me, but from memory I believe it is something like $innercontainer = $container->get('StreamWrapperManager')->getDecoratedServiceName(); $container->get($innercontainer);. This type of code would error out if the container wasn't decorated, so it is not likely to be just sitting around in some random module. It's not impossible, and I disclose it as a method, but I don't believe this is something we need to fear.

We could handle both of these by having a settings.php variable that lists protocols to disable, and a container pass where we de-register existing ones, making both scenarios above impossible. We can't do this based on the actually configured protocols, for the same reason we need to decorate the StreamWrapperManager: we do not have entity access at container build time.

#25:
I would be OK with the client being its own Drupal project. If we use @larowlan's, which he linked previously, we would, if I recall correctly, need to have an extra config field added to the createClient() function, but that is minor. Your implementation and his ended up being very similar.

#7 Performance Testing:
I sat down this afternoon and did some quick performance testing. I found that I had failed to cache the client in the S3BucketPlugin, which added around 400ms of extra time to my initial test run figures. I've patched that in my local branch and will soon push it up to the issue fork for others (along with some additional cleanup patches moving files around and getting closer to removing unused code).

I ran two different test scenarios: one calling file_create_url() and the other calling is_dir(). I called each function 1000 times per run, executed through the Devel "Execute PHP" form in the UI. Each run of each function was done twice against each version. Cache was cleared through the Devel cache clear button just before starting the first test on both the 3.x and 4.x versions. After each test I increased the starting value of $i by 1000 and the max value by 1000.

Sample of code used in the php-execute field:

// Generate external URLs for 1000 uncached URIs.
for ($i = 1000; $i < 2000; $i++) {
  file_create_url("s3://$i.tmp");
}

Values 0-3999 were used against the 3.x version of s3fs and 4000-7999 against the 4.x branch.

Results:

file_create_url() test

| Test                             | 3.x Run 1 | 3.x Run 2 | 4.x Run 1 | 4.x Run 2 |
|----------------------------------|-----------|-----------|-----------|-----------|
| Total                            | 1813218   | 1831283   | 1120725   | 1147932   |
| file_create_url()                | 1783128   | 1809568   | 1099787   | 1131605   |
| S3Client→getObjectUrl()          | 1182878   | 1188114   | 711359    | 734633    |
| S3fsStream→readCache()           | 540014    | 573968    | n/a       | n/a       |
| CustomS3BucketPlugin→readCache() | n/a       | n/a       | 327932    | 335137    |

is_dir() test

| Test                         | 3.x Run 1 | 3.x Run 2 | 4.x Run 1 | 4.x Run 2 |
|------------------------------|-----------|-----------|-----------|-----------|
| Total                        | 354926    | 322647    | 366035    | 372398    |
| is_dir()                     | 342581    | 312436    | 353007    | 359131    |
| S3fsStream→stat()            | 330499    | 301281    | n/a       | n/a       |
| S3fsStream→readCache()       | 325494    | 298722    | n/a       | n/a       |
| S3fsBucketStream→stat()      | n/a       | n/a       | 347935    | 353947    |
| S3fsBucketStream→readCache() | n/a       | n/a       | 313301    | 318738    |

I selected the major time-consuming functions and did not list the smaller functions or core functions unrelated to the code under test. Times are in microseconds, from Xdebug profiler runs on a laptop.

I am a bit concerned that for the file_create_url() test the 4.x getObjectUrl() and readCache() calls finished much quicker, possibly showing some variability in my local laptop setup. However, at least from a relative standpoint, we can use the total times vs the time of each function as an initial comparison for any 'red flags' on timing.

To me these relative times appear similar for file_create_url(). The AWS client took between 64-66% of the time in all four file_create_url() tests, with the database reads the next largest consumer.

The is_dir() test shows somewhat larger differences. On the 3.x branch the database took up around 95-96% of the time, while in 4.x it was only around 88%. Not sure if this is significant yet.

alfaguru’s picture

#21 cmlara

Regarding the error constructing entity_type.manager, I found a workaround which I think might be useful to others. I am doing my processing in an extra compiler pass, and at that stage at least it is possible to access the config.storage service directly, so one can load and read config entities as pure configuration, which is adequate for my needs.

So in the service provider:

<?php
  // In the module's service provider class.

  /**
   * {@inheritdoc}
   */
  public function register(ContainerBuilder $container) {
    $container->addCompilerPass(new DefineBucketsCompilerPass());
  }
?>

And then in the CompilerPass class:

<?php
  /**
   * {@inheritDoc}
   */
  public function process(ContainerBuilder $container) {
    // Use the low-level service as the container is not fully built yet.
    $configStorage = $container->get('config.storage');
    $prefix = '??'; // Depending on module and config entity id.
    $names = $configStorage->listAll($prefix);
    // Process the config items ...
  }
?>
cmlara’s picture

@alfaguru

Thank you for that example!

I was unaware of the config.storage service and how we could utilize it during a compiler pass.

A very quick glance does indeed appear to indicate that this will provide us with enough information to perform actions that depend on the configured and active streamWrappers in a compiler pass.

It appears this should allow us to drop the custom code for decorating the stream_wrapper_manager and bring ourselves more in line with a vanilla core deployment.

alfaguru’s picture

@cmlara, there's another thing I discovered later: after the custom compiler pass you need to add an extra instance of the core pass that registers streams, to actually have the stream wrappers registered. So, for example:

<?php
    $container->addCompilerPass(new DefineBucketsCompilerPass());
    // Re-run core's pass so the newly defined wrappers get registered.
    $container->addCompilerPass(new RegisterStreamWrappersPass());
?>
cmlara’s picture

Pushed some commits that have been sitting on my machine for a bit. The biggest changes are increasing the required PHP version to 8.1 (not strictly necessary, but with 7.4 being EOL and 8.1 being supported in D9.4 it gives us the option to use newer PHP features) and the base code to use Drupal for the file delivery method.

Also included is the suggestion from #27/#29 that allows us to remove the decorator on stream_wrapper_manager.

The start of tests for the s3fs_bucket sub-module are included in the latest push.

I've not ported SA-CONTRIB-2022-057, primarily because the factors behind it are not equivalent in 4.x: private:// and public:// do not have to be under s3://, and we can instead make s3:// live under a prefix in the bucket. The primary case where this remains relevant is migration from 3.x to 4.x; we will want to discuss that as part of how we handle the version migration, to see if we need to port some form of path exclusion or provide documentation for changing file locations.

One thought that came to mind:
I would like to get rid of the S3fs bucket file migration code. This code is really only useful when provisioning for public:// and private:// takeover and is never used again. In the current design it requires every bucket plugin to implement some level of code to import the files (even if it's just inherited from the base classes) for a feature that is essentially unused.

With the new layout, where a bucket is configured and ANY scheme can be created, we no longer need to use the S3 API directly to push the files into the bucket. Instead we can mount public_s3_migrate:// and just do copy('public://test.txt', 'public_s3_migrate://test.txt'), and later create the real 'public://' s3fs streamWrapper to take over public://. Not needing to use the S3 API directly gives us the option of using more generic file management utilities.
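
A sketch of what that generic migration could look like using only core file APIs; the scheme names follow the example above, and the mask and loop details are illustrative:

<?php
// Copy everything under public:// into the bucket via a plain stream
// copy; no direct S3 API calls are needed.
$file_system = \Drupal::service('file_system');
foreach ($file_system->scanDirectory('public://', '/.*/') as $uri => $file) {
  $target = str_replace('public://', 'public_s3_migrate://', $uri);
  copy($uri, $target);
}
?>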

At the moment no such generic utilities exist to my knowledge; however, I'm thinking it may be worth the effort to create a new (non-s3fs) module that handles files and file entities, which would allow us to pull the code from s3fs while providing options to solve the requests discussed in #3317315: Migration old files per field and #3061878: Uninstall does not allow reversion to local file system.

darvanen’s picture

This version upgrade is definitely the place to introduce anything that requires 8.1; soon enough Drupal 9 will be EOL, and Drupal 10's minimum PHP version is 8.1, so it will be a moot point by then.

I also strongly support separating the file migration code into a different module that can be uninstalled once finished with. I have less of an opinion on how to structure that.

rivimey’s picture

I'm getting a little more invested in v4 again now, and wondering what strands of work are currently blocking, etc. Any suggestions? Also, where do 'compiler passes' come into things?

Re the migration code: the best bet I can think of is a submodule you enable to do the migration as needed, which can then be turned off again -- which I think is the gist of #31. As #30 says, the v4 API should be up to the job of migration without specialised interfaces.

alfaguru’s picture

@rivimey compiler passes get involved because the container is still being built ('compiled') at the point the register method of a service provider is called, and hence not all services are available at that point. One can work around this by using compiler passes, which are executed later. The gotcha in this case is that you are then too late for the pass that registers stream wrappers, so you need to invoke it a second time.

cmlara’s picture

wondering what strands of work are currently blocking etc.

At the moment the code loads and to my knowledge functions at least at a basic level.

The general todo list is:

  • Write tests for the s3fs_streamwrapper module, with a preference for unit tests for the bulk of operations and a few minor functional tests where needed to verify end-to-end operation. Hopefully the s3fs_bucket tests are a good example of where I want this to go.
  • Fix any bugs the testing finds (I'm sure some exist).
  • If something can't be tested, rework the API to make it testable.
  • Clean up the API and methods. Since a lot of this was pulled from 3.x and written to make migration easy, I'm sure there is code that can be removed. I pulled a few methods off the bucket module's API while looking at tests and asking "why is that a public method?"
  • Address any inline @todo's for issues we saw at the time of porting into the new architecture but didn't make design decisions on.

Testing is the big one for me. I don't want to merge this into the 4.x branch without good testing to start with (lack of in-depth testing has been a pain point for the 3.x branch); beyond that, though, I imagine we're close to actually opening up the 4.x branch for a more structured development cycle.

For migration there are two aspects:

The first part is the file import to the bucket from local disk:
This is the "copy files from the Drupal host to the bucket" step -- this is what I'm speaking of removing in #30. Locally I have already built up a basic module to do this using basic APIs. I'm hoping that if we run this as an independent 'bulk_file_management' project we may be able to get more community support for more advanced features in the long term.

The second aspect of migration is going to be with dealing with sites that move from 3.x to 4.x:

  • Config migration: are we able to (and should we) attempt to migrate the existing 3.x config into 4.x equivalents when a site upgrades?
  • Database schema: the 4.x version of the s3fs_file table is incompatible with the 3.x table, so we need to test and document this impact on upgrading. If we convert config we may be able to convert the database schema at the same time.
  • Do we need to have sites move objects in their bucket, or do we need to code a new solution to deal with the existing layout?
  • At the moment there is no code in 4.x to protect against SA-CONTRIB-2022-057. Personally I would prefer to avoid having code in the built-in plugins that tries to exclude paths, and to work only with prefixes; however, doing so could require admins to migrate objects in their buckets to new paths.

In any case, the upgrade from 3.x to 4.x is going to be a major architecture change; the design decisions we make will determine how much effort a conversion requires.

cmlara’s picture

I think I'm happy with this as a base.

There is still a decent amount of work to do: a few more plugins are needed to meet the goal of easier deployments, API cleanup may still be needed, I want to clear up a lot of phpstan warnings, this needs a phpcs pass over it, docblocks may need some work, a few bug fixes might still need to be ported forward from 3.x, additional testing could be useful, we might want to reorganize class locations, etc.

That said, I think this is a solid base to start working with for a 4.x branch, and everything else can be handled in its own issues.

Opened the MR so that tests can run against the current code and inline comments can be added if desired.

This is a sizeable MR, considering it's essentially an overhaul, and likely not the easiest to review. I'm not expecting this to be "let's tag the first alpha" code on its own; however, it is not non-functioning code either.

cmlara’s picture

Status: Active » Needs review

Forgot to set NR to trigger DrupalCI.

  • cmlara committed 9c089f3b on 4.0.x
    Issue #3206162 by cmlara, darvanen, rivimey, alfaguru, larowlan,...
cmlara’s picture

Status: Needs review » Fixed

Merged into 4.0.x branch to allow this to be our base to work with.
-- Side note: I will probably move to a 4.x mainline dev branch when we have 4.1.0 plans; however, for now I will keep the existing 4.0.x as the target to avoid having to sync multiple branches.

Updating the Git Default branch to become the new target for future development

I will create followup tasks for the issues I can already think of that I want to see worked on, and for any new tasks as they come up. Assistance is always appreciated to move them forward.

Thank you all for the assistance and input in helping to overcome the design and implementation challenges for this conversion.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.