Problem/Motivation
When using CPS, a site can collect many revisions over time. Most often these revisions contain long text fields, which gradually consume a lot of database disk space. Purging legitimate revisions isn't an option. One alternative is to export old field values to "cold storage" on disk and replace them in the database with a placeholder.
Proposed resolution
Given a revision, replace supported field values with a placeholder token. The original value is written to the private filesystem. When the revision is loaded later, the cps_archiver module checks whether archived field values exist on disk and, if so, restores them.
Remaining tasks
- A warning that disabling the cps_archiver module can lead to data loss unless archived revision field values are restored
- A cps_archiver_archived table to track revisions which have been exported
- Archived field values must be encrypted
Issue fork cps-3212174
Comments
Comment #2
mglaman commented
Comment #4
mglaman commented
Ready for an initial review, but there is still work to be done.
Comment #5
mglaman commented
There are a few problems here. Technically, Drupal always loads the latest field revision value, and when a site version is published, that version is also archived. The current code results in field_data_body and the matching revision field table row both containing the placeholder.
Comment #6
mglaman commented
Comment #7
douggreen CreditAttribution: douggreen as a volunteer commented
If there is any archived data, couldn't we just make the archiver a hard dependency, and thus not have to warn the user?
Comment #8
douggreen CreditAttribution: douggreen as a volunteer commented
Why do we need to encrypt the file data? Where does this requirement come from? If it's in the database unencrypted, it's no less secure than being on the file system unencrypted. ... Maybe what we should do is make sure that the directory permissions are 700 and the file permissions are 400, so that only the web server can read them. We might need to use a umask, because it's possible that we'll need 770 and 440 respectively.
Comment #9
douggreen CreditAttribution: douggreen as a volunteer commented
I think that we need a way to restore the archived fields, mainly so that this module can be safely disabled.
Comment #10
douggreen CreditAttribution: douggreen as a volunteer commented
I've pushed a commit to this PR that does the following:
* uses a shorter placeholder to take up less space
* removes the cron queue because this already runs inside cron or drush #3226803: Add a new drush command that archives everything that can be archived
* prevents disabling the module if anything is archived
* adds encryption
* adds a directory hierarchy that hashes based on the revision ID, so that we have at most 1000 files per directory
* stores files in the /fields subdirectory of cps_get_archive_location(), which works best with #3226803: Add a new drush command that archives everything that can be archived
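The comment doesn't show the exact bucketing scheme, but one common way to cap a directory at roughly 1000 files is to bucket by revision ID divided by 1000. A minimal sketch, assuming a hypothetical archive_path helper:

```python
from pathlib import Path

def archive_path(base: Path, revision_id: int) -> Path:
    """Bucket archive files into subdirectories of at most 1000 entries.
    Revisions 0-999 land in '0/', 1000-1999 in '1/', and so on."""
    return base / str(revision_id // 1000) / f"{revision_id}.json"
```

Integer division keeps sibling revisions together on disk, which also makes bulk operations (such as the drush archive-everything command referenced above) cheaper per directory.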
Comment #11
douggreen CreditAttribution: douggreen as a volunteer commented
I think we should rename this to cps_field_archiver to avoid confusion with the cps_entity archiving that already happens as part of cps.module.
Comment #12
douggreen CreditAttribution: douggreen as a volunteer commented