Currently the drush sql-sanitize command works by invoking a hook to identify data to delete or munge. This means that every time new tables or columns are added to a Drupal site the HOOK_drush_sql_sync_sanitize must be updated to cover those new tables/columns. If a developer forgets to update the hook to cover the new table/column then data will be preserved that possibly should be deleted.

It would be nice to have a system where specific data has to be whitelisted for preservation and/or munging so that newly added tables will be emptied by default until they are explicitly whitelisted for preservation.

I'm thinking of a system where a hook returns an array like so:

  $tables = [
    'users' => [
      'queries' => [
        'DELETE FROM {users} WHERE status = 0',
        "UPDATE {users} SET name = concat('user', uid), mail = concat(uid, 'example.com')",
      ],
      'fields' => ['name', 'mail'],
    ],
  ];
  • The 'queries' should be a query with table prefixes for passing to db_query.
  • Any columns that are not in the array would be set back to their default value as defined in hook_schema if possible, if not possible, the table should be truncated and an error logged.
  • Any tables that are not listed would be truncated.

I'm currently planning to add this code into a module called paranoiasanitize.module which will be a sub-module of paranoia.

Comments

greggles created an issue. See original summary.

c4rl’s picture

It seems possible that the action will need to be an array of actions.

It seems to me that if the table isn't wholly truncated (by making absent in the structure), some (or all) data is kept either whole or munged.

Could then the 'action' key be 'delete where', and it provides a SQL string WHERE clause? Thus, to keep the table whole, one omits 'delete where'.

I'd also propose that the 'columns' structure allow for three forms of return:

'columns' => [
  // 1. Predefined constants that point to default callbacks.
  'uid' => KEEP_WHOLE,
  'name' => MUNGE_STRING,
  'pass' => MUNGE_PASS,
  // 2. Closures that receive the entire native row.
  'mail' => function ($row) {
    return sprintf('foo-%d@example.com', $row['uid']);
  },
  // 3. Literal values for every row
  'signature' => 'foo',
]
greggles’s picture

Could then the 'action' key be 'delete where', and it provides a SQL string WHERE clause? Thus, to keep the table whole, one omits 'delete where'.

I think that works, I'll give it a shot. My initial thought with explicitly saying truncate was that it might make it easier to understand what is happening, but maybe that's a bit anticipatory.

I'd also propose that the 'columns' structure allow for three forms of return:

I believe a closure would require running an update on each row which seems likely to be dramatically slower than doing an update. But maybe this table will be dealing with so few rows that it would be acceptable if it made the code simpler? I'll reconsider as I get closer to that work.

coltrane’s picture

I think a hook should return:

  1. a sanitize action that's a full SQL query
  2. a list of columns that are covered by the query

This drush command doesn't need to be too smart.

The drush command could:

  • invokes hooks
  • for tables and data returned from a hook
  • run introspection on the table (schema or describe?) to get columns
  • if any columns on the table and not returned from the hook
  • see if they can be set to default or empty string
  • if cannot set default truncate whole table and log error (but continue drush command)
  • if no new columns on table run table SQL sanitization query
  • move on to next table

The drush command needs to have default actions (truncate or e.g. update column to empty string) but otherwise its first responsibility is to keep private data out of the final DB and secondly to record when a table's registered sanitization is inadequate (and to fallback to safety). Leave the specifics up to the developer within each hook.

hook return for failing users table sanitization:

'users' => [
  'query' => "update {users} set name = concat('user', uid), set mail = concat(uid, 'example.com');"
  'fields' => ['name', 'mail']
]
greggles’s picture

Issue summary: View changes

Just want to start saying that I think most of Ben's question in #4 deals with the column level data as opposed to the table level data.

OK, so I got to the point of having a working version of tables and here's what I've found as the node_paranoia_sql_sanitize_tables_list() implementation:

    // Current:
    'block_node_type' => array('delete where' => '1 <> 1'),
    'history' => array('delete where' => ''),
    'node' => array('delete where' => 'status = 0'),
    // TODO: Make these aware of prefixes.
    'node_access' => array('delete where' => 'nid <> 0 AND NID NOT IN (SELECT nid FROM node)'),
    'node_revision' => array('delete where' => 'nid NOT IN (SELECT nid FROM node)'),
    'node_type' => array('delete where' => '1 <> 1'),

I think that restricting to the where clause is nice because it reduces the text that gets repeated, but it's also really limiting. For example, it forces the node_access/node_revision to be a subselect when it might be better for them to be written as a DELETE with an inner join.

I think Ben's proposed structure works quite well with some small adjustments, so I'm going to update the issue summary to show that.

  • greggles committed dbc42b4 on 7.x-1.x
    Issue #2640480: add a sanitize whitelist system
    
greggles’s picture

Status: Active » Needs review

I've now pushed some code that provides this feature to git. Given it's a new sub-module rather than a patch to the existing module that seemed like an easier and reasonable solution. It's had some internal review, so it's at least not horrible :)

I see that as the first version and am happy to revert or amend as appropriate. Reviews welcome!

  • greggles committed 944b9af on 7.x-1.x
    Issue #2640480 greggles, dpintats, c4rl: fix review feedback
    
coltrane’s picture

Status: Needs review » Needs work

Found that the generated hooks include prefixed table names and that drush_paranoiasanitize_paranoia_sql_sanitize_whitelist() also applies prefixes resulting in doubly-prefixed tables.

  • greggles committed 1ac83a7 on 7.x-1.x
    Issue #2640480 greggles, coltrane: this and the build caused double-...
greggles’s picture

Status: Needs work » Fixed

I fixed that by deleting the line and filed #2660678: Sort out how to ship hooks from d.o and handle prefixes to track future work on it.

I think it might be time to release this and followup work can be in followup issues.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.