Is there a way to exclude tables from backup sets, either through the UI or through a code hook?

There are many reasons why it might be undesirable to have certain tables included in backup sets. Some, such as cache tables, take up disproportionately huge amounts of space and are simply not needed. Others, such as notifications_queue, can actually cause a bit of trouble if backed up and restored (people may start getting duplicate emails from the site).

Comments

Dane Powell’s picture

Does this issue better belong in another Aegir queue, or on the community site? I'd really like some more information on it... even just advice on how I might go about writing a contrib module to deal with it.

Like I said, there are very good reasons to exclude these tables (caches and search index are huge, and can always be rebuilt; and other tables such as the notifications queue can cause real trouble), and I have no idea how else to get around this problem.

Steven Jones’s picture

Project: Hostmaster (Aegir) » Provision
Version: 6.x-0.4-alpha3 » 6.x-1.2
Category: support » feature

This probably belongs in the provision queue, so moving over there.

As for how to do this, I suspect we'd need to pass some extra options to mysqldump so that it skips the data for specific tables. However, I don't think the logic is going to be as simple as a list of tables not to dump, because we could very easily break a site that way (by clearing the D7 registry, for example) or create a cache stampede.

I suspect that most of what you want to do here could be covered by some kind of 'environment' migrate: a migrate to 'DEV' would know to empty the notifications table after migration, whereas a migrate to LIVE would not.

Dane Powell’s picture

@Steven Jones - good point, I would want to exclude tables during regular backups and migrations to dev platforms, but NOT exclude tables during migration of live sites.

anarcat’s picture

So you can use --ignore-table in mysqldump, but the problem with that is that it *completely* excludes the table from the dump, even the CREATE TABLE statement. Using that flag alone would result in a broken dump... The trick I have used in the past to work around this problem is to dump the DB structure first (with -d/--no-data) and then dump the table data only for the relevant tables (--no-create-info/-t plus --ignore-table), but that means calling mysqldump twice and appending to the file on the second pass...

In other words, it's not just adding a flag. :) But I do agree this would be a useful optimization...
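The two-pass dump described above can be sketched as follows. The database name and table list are placeholders, and the snippet only builds and prints the two command lines rather than running them:

```shell
# Sketch of the two-pass dump: pass 1 keeps every CREATE TABLE statement,
# pass 2 dumps data only, skipping the excluded tables.
DB=example_db
SKIP_TABLES="cache cache_page watchdog"

# Pass 1: structure only (-d / --no-data).
STRUCTURE_CMD="mysqldump --no-data $DB"

# Pass 2: data only (--no-create-info / -t), minus the excluded tables.
IGNORE_FLAGS=""
for t in $SKIP_TABLES; do
  IGNORE_FLAGS="$IGNORE_FLAGS --ignore-table=$DB.$t"
done
DATA_CMD="mysqldump --no-create-info$IGNORE_FLAGS $DB"

# Run the first, then append the second to the same file:
#   $STRUCTURE_CMD  > database.sql
#   $DATA_CMD      >> database.sql
echo "$STRUCTURE_CMD"
echo "$DATA_CMD"
```

Note that --ignore-table takes fully-qualified db.table names and must be repeated once per table, which is why the flags are built in a loop.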

Dane Powell’s picture

Another useful and related feature would be to replace all of the email addresses in the user table with user@example.com or null@localhost or something like that when migrating to a Dev platform, so that emails can't be sent from a development site to actual users.
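One common way to do what's described above is a single UPDATE against the dev copy. This is a sketch only: users, mail and uid are Drupal core's schema, and the query would be run against the migrated dev database, never the live one. The snippet just builds and prints the statement:

```shell
# Rewrite every address to a per-user alias under example.com so a dev
# site cannot mail real users. uid 0 (anonymous) is left alone.
SQL="UPDATE users SET mail = CONCAT('user+', uid, '@example.com') WHERE uid > 0;"
# It would be run against the dev database, e.g.:
#   drush sql-query "$SQL"
echo "$SQL"
```

Using a per-user alias (rather than one shared address) keeps the mail column unique, which some modules expect.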

joestewart’s picture

#5 seems to be a different feature, but for a little context on sanitizing email addresses: drush 4.5 should have a command for it - #1112142: Run sanitize without running full sql-sync

Dane Powell’s picture

@joestewart- thanks for the tip! Not to hijack this thread too much, but can you point to any advice on how to utilize that with Aegir migrations? i.e., is it possible to implement hook_drush_sql_sync_sanitize() and somehow detect the source and target platforms of a site migration?

jason.fisher’s picture

Title: Exclude tables (i.e. cache*) from backups » Exclude tables (i.e. cache*) from backups (w/workaround)
Version: 6.x-1.2 » 6.x-1.9

If you have sed installed, you can use this workaround to strip data from database.sql after extraction and prior to gzip. A PHP-native solution would probably use stream filters to avoid memory issues.

YMMV -- if you are using this in production, you had better test restores thoroughly. The regexp can certainly be improved.

Paste the following into: /var/aegir/.drush/zbackup_filter.drush.inc -- be careful of excess whitespace.


<?php

/**
 * Implements drush_hook_pre_COMMAND() for provision-backup: strip the data
 * of unwanted tables from database.sql before it is added to the archive.
 */
function drush_zbackup_filter_pre_provision_backup() {

  // Table names are treated as regular expressions by the sed call below.
  $exclude_tables = array(
    'watchdog',
    'cache',
    'cache_.*',
    'search_dataset',
    'search_index',
    'search_node_links',
    'search_total',
    'captcha_sessions',
    'civicrm_import_job_.*',
    'civicrm_task_action_.*',
  );

  $olddir = getcwd();
  if (!chdir(d()->site_path)) {
    return drush_set_error('PROVISION_BACKUP_PATH_NOT_FOUND', dt('cannot change directory to %dir', array('%dir' => d()->site_path)));
  }

  $backup_file = drush_get_option('backup_file');
  drush_log(dt("Filtering !backup_file database.sql to save space ..", array('!backup_file' => $backup_file)), 'backup');

  // Delete everything between each excluded table's LOCK TABLES line and
  // the next UNLOCK TABLES line (i.e. its INSERT statements), leaving the
  // CREATE TABLE statement intact.
  copy('database.sql', 'database.sql.in');
  foreach ($exclude_tables as $table) {
    exec("sed '/LOCK TABLES `" . $table . "`/,/UNLOCK TABLES/d' database.sql.in >database.sql.out");
    rename('database.sql.out', 'database.sql.in');
  }
  rename('database.sql.in', 'database.sql');

  chdir($olddir);
}

Dane Powell’s picture

Nice bit of lateral thinking there, jason - sanitizing the database while it's in text form using a simple regex. It's still not quite ideal, since mysqldump will still burn unnecessary cycles dumping tables that just get thrown away, but it's a start!

Of course, it's still true that you might not want to sanitize the database on every backup: during a simple migration, for instance, you might not want to kill the caches and search_index, which would have to be rebuilt if the migration fails. A good alternative would be to add this script to Hosting Backup GC as a way to free up space without losing any valuable data.

jason.fisher’s picture

Yeah, not quite ideal .. but was helpful in making our Cloud Files offsite a bit more manageable.

Here's an update that adds filtering of certain files from the sites/*/files/ directory. I am using it to exclude backup_migrate backup files from aegir backups. Again .. it is another hack. This one moves backup_migrate out of the sites/*/files directory prior to gzip and moves it back in after. If the process fails, those backup_migrate files may be left in /tmp. YMMV.

I would love to see some way to extract tables as individual files, so space could be saved by de-duping those tables that rarely change..


<?php

/**
 * Implements drush_hook_pre_COMMAND() for provision-backup: strip table data
 * from database.sql and hold back unwanted files before archiving.
 */
function drush_zbackup_filter_pre_provision_backup() {

  // Table names are treated as regular expressions by the sed call below.
  $exclude_tables = array(
    'watchdog',
    'cache',
    'cache_.*',
    'search_dataset',
    'search_index',
    'search_node_links',
    'search_total',
    'captcha_sessions',
    'civicrm_import_job_.*',
    'civicrm_task_action_.*',
  );

  // Directories under files/ to keep out of the backup.
  $exclude_files = array(
    'backup_migrate',
  );

  $olddir = getcwd();
  if (!chdir(d()->site_path)) {
    return drush_set_error('PROVISION_BACKUP_PATH_NOT_FOUND', dt('cannot change directory to %dir', array('%dir' => d()->site_path)));
  }

  $backup_file = drush_get_option('backup_file');
  drush_log(dt("Filtering !backup_file database.sql to save space ..", array('!backup_file' => $backup_file)), 'backup');

  // Strip the INSERT statements for each excluded table, leaving CREATE TABLE.
  copy('database.sql', 'database.sql.in');
  foreach ($exclude_tables as $table) {
    exec("sed '/LOCK TABLES `" . $table . "`/,/UNLOCK TABLES/d' database.sql.in >database.sql.out");
    rename('database.sql.out', 'database.sql.in');
  }
  rename('database.sql.in', 'database.sql');

  // Temporarily hold files that should not be included in the backup.
  mkdir('/tmp/files-held/' . $backup_file, 0700, TRUE);
  foreach ($exclude_files as $file) {
    rename('files/' . $file, '/tmp/files-held/' . $backup_file . '/' . $file);
  }

  chdir($olddir);
}

/**
 * Implements drush_hook_post_COMMAND() for provision-backup: put the held
 * files back once the archive has been created.
 */
function drush_zbackup_filter_post_provision_backup() {

  $olddir = getcwd();
  if (!chdir(d()->site_path)) {
    return drush_set_error('PROVISION_BACKUP_PATH_NOT_FOUND', dt('cannot change directory to %dir', array('%dir' => d()->site_path)));
  }

  $backup_file = drush_get_option('backup_file');

  // Restore held files.
  exec('mv -f /tmp/files-held/' . $backup_file . '/* files/');
  rmdir('/tmp/files-held/' . $backup_file);

  chdir($olddir);
}

Steven Jones’s picture

Version: 6.x-1.9 » 6.x-2.x-dev
mvc’s picture

That's an enormous chunk of text to delete with a single sed expression. At minimum I would recommend searching for the INSERT lines instead.

Why don't we just use the --structure-tables-* options for drush sql-dump?

ergonlogic’s picture

Version: 6.x-2.x-dev » 7.x-3.x-dev

New features need to be implemented in Aegir 3.x, then we can consider back-porting to Aegir 2.x.

anarcat’s picture

@mvc is right, we should just use --structure-tables-list or --structure-tables-key.

Aegir 3 has a new local.drushrc.php config file, editable by site admins, which lets operators drop in options like this. You can then use this drush config to skip the data for the tables you want to exclude:

/**
 * List of tables whose *data* is skipped by the 'sql-dump' and 'sql-sync'
 * commands when the "--structure-tables-key=common" option is provided.
 * You may add specific tables to the existing array or add a new element.
 */
$options['structure-tables']['common'] = array('cache', 'cache_*', 'history', 'search_*', 'sessions', 'watchdog');

then calling:

drush sql-dump --structure-tables-key=common

... will do what you'd expect. You could also omit those tables entirely (structure and data) with the analogous skip-tables option:

$options['skip-tables']['common'] = array('cache', 'cache_*', 'history', 'search_*', 'sessions', 'watchdog');

I guess we could add those as options to provision-backup, provision-migrate and (maybe) provision-clone ...

This is where an eventual patch should go...

helmo’s picture

We would have to change the implementation of generate_dump() in e.g. db/Provision/Service/db/mysql.php. It now just calls mysqldump.

There must have been a reason why we didn't use drush sql-dump, but chances are that Drush now does all we need.

Steven Jones’s picture

@helmo drush sql-dump wasn't around at first, and then when it was it leaked DB credentials into the system process listing. Anarcat will know more on these topics I suspect.

ergonlogic’s picture

I believe the DB credential leaks are solved in Drush 7.x. We should consider refactoring as suggested by @helmo in #16. It'd be pretty straightforward to modify the tables in a backup in any number of ways at that point.

colan’s picture

Yes, those leaks appear to be plugged as per mysql credentials leak in drush sqlc.

So we can basically just replace this line in generate_dump()

$cmd = sprintf("mysqldump --defaults-file=/dev/fd/3 --single-transaction --quick --no-autocommit %s", escapeshellcmd(drush_get_option('db_name')));

...with drush sql-dump --structure-tables...?

Not sure what that --defaults-file is about.

Looks like we can simplify importing as well, dropping safe_shell_exec(), but let's do that in a follow-up issue.
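The swap suggested above might look roughly like this. This is a sketch only: the database name, structure-tables key and result file are illustrative, and the snippet just builds and prints the command lines rather than running them:

```shell
# Current call in generate_dump() passes credentials via --defaults-file;
# the proposed replacement delegates the dump to drush, which reads
# credentials from the site's settings and honours structure-tables.
OLD_CMD='mysqldump --defaults-file=/dev/fd/3 --single-transaction --quick --no-autocommit example_db'
NEW_CMD='drush sql-dump --structure-tables-key=common --result-file=database.sql'
echo "$OLD_CMD"
echo "$NEW_CMD"
```

With --structure-tables-key=common, drush still emits CREATE TABLE statements for the listed tables but skips their data, which is exactly the behaviour this issue asks for.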

helmo’s picture

The 'defaults-file' was to pass in credentials.

Big +1 to rewriting this; hosting_sync already uses sql-sync.