It has been requested that Tripal should support multiple chado instances in the same database.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

guignonv’s picture

I started an extension module to achieve this task so people would be able to use it if needed. In order to have it working, I need the chado_set_active() function to be changed. Here is the (first) patch.

This patch just adds a "hook_chado_search_path_alter" to that function in order to allow contributed modules to interfere and change either current search path or even the current database connection using Drupal Database::addConnectionInfo() and db_set_active().

guignonv’s picture

FileSize
3.53 KB

I updated the patch to fix a minor issue and add some documentation on how to use the hook added.

spficklin’s picture

Thanks for submitting the patch. Tripal should definitely support multiple Chado installations. However, I'm not quite sure I understand the purpose of the hook. Can you explain more? Also, how do you deal with the chado_xxxxx linker tables where the association is made to the node and the chado table where the data is housed? How does the code know, when loading a particular node, to switch to a different chado database?

guignonv’s picture

Handling different database is quite tricky. I'll try to explain the idea and the tricks. But first, I'd like to point out that this hook could also be used by other module, not just the multi-chado one. You could think of a module that would need to do some initialization stuff at the time a connection is made to a same Chado instance (create temporary tables/views, prepare some cache, call some initialization function, record who has just connected to the db,...). So the point of this hook is not just to answer to the Multi-Chado needs.

Then, about the multi-chado module in particular... When installed, the multi-chado will create duplicate tables of "public.chado_xxxxx" tables (as "public.multichado_xxxxx"). Those tables will have the same columns as their template tables plus one more which would store a db connection setting identifier. When an authenticated user needs to connect to a different chado instance than the default one, the hook implementation (see code below) will be called. It will check if the connection to that instance has already been initialized or not (there are several ways to do it). If it's a new connection instance, it will create in that connection session temporary views that have the same name as each "public.chado_xxxxx". Those views will query the "public.multichado_xxxxx" tables and filter results using the additional column we described before (nb. that column won't be returned). So when a tripal (core/extension) module will use any of those tables to serve data, they will just work like before without knowing they are working on a different chado instance. Note: by the time the database connection is gone, the temporary views are also gone, so it's clean: temporary views are session-specific and are not shared between database connections (even for concurrent connections of a same user).

The thing is that you will have to synchronize each "public.multichado_xxxxx" for each chado instance. So, when you will run drush, you will have to specify which "connection" to use. So far, I just plan to assign to a user a specific connection, so you would just have to use that user in the drush command line but we can imagine other ways. It's just a start.

Here is the hook implementation I would use in the multi-chado module (not finished yet):

/**
 * Implements hook_chado_connection_alter().
 */
function tripal_multi_chado_chado_connection_alter(&$settings) {
  global $user;

  // try to find a connection associated to current user
  $query = new EntityFieldQuery();
  $query->entityCondition('entity_type', 'tripal_multi_chado_connection')
    ->fieldCondition('field_associated_users', 'target_id', $user->uid)
    ->range(0, 1) // only take in account first connection (if several)
  ;
  $result = $query->execute();

  // user has a specific connection?
  if (!empty($result)) {
    // yes
    // do we want to connect to Chado?
    if ('chado' == $settings['dbname']) {
      // yes, load user Chado connection settings
      $connection_id   = array_keys($result['tripal_multi_chado_connection']);
      $connection_data = entity_load('tripal_multi_chado_connection', $connection_id);
      $connection_data = array_values($connection_data)[0];

      // prepare database connection settings
      $new_database = array(
        'database' => $connection_data->database,
        'username' => $connection_data->username,
        'password' => $connection_data->password,
        'host'     => $connection_data->host,
        'port'     => $connection_data->port,
        'driver'   => 'pgsql',
        'prefix'   => '',
      );
      $connection_name = 'tripal_multi_chado_' . $user->uid;

      // add user database connection on-the-fly
      Database::addConnectionInfo($connection_name, 'default', $new_database);
      // and activate that connection
      db_set_active($connection_name);

      // alter Chado connection settings in case schema name is not 'chado'
      $settings['new_active_db'] = $connection_data->schemaname;
      $settings['new_search_path'] = $connection_data->schemaname . ",public";
      
      //+FIXME: check if new connection has been initialized
      // if not, call hook_tripal_synchronized_tables to get the list of
      //  synchronized tables
      // check which tables are missing and create them
      // create temporary views
    }
    else {
      // user used a specific connection and now don't want Chado any more
      // then fall back to default database
      db_set_active();
    }
  }
}

I also though that other modules may want to handle other synchronizations than the tripal core ones. So I'll include in the multi-chado module a hook that will allow those modules to tell the multi-chado module to create also temporary views for their tables.

To conclude, so far, I just want to start by associating a specific "Chado Connection" to a given authenticated user. But in future plans, I could think of use several connection for a same user on a same page. But I'll start with the 'easy' stuff first. ;-)

I hope I'm clear enough but if not, let me know! But I guess the first alpha/beta of the multi-chado module will demonstrate better how it works. (but I know I'll have to document it anyway... ;-) )

spficklin’s picture

Thanks for the explanation. That helps. One more question before I give some feedback. If you have multiple instances of Chado, each living in separate databases, how are you dealing with Views integration for each of those Databases? Unless I'm wrong, Views doesn't support cross database joining right?

guignonv’s picture

You are right, as far as I know, cross-database queries are not supported by postgreSQL (yet) while cross-schema queries are. There might be some issues left and I'll have to figure out how to solve them. At least, I think everything would work in a same database but with different schema. I'll implement that first. For different databases (that might also be on different servers), we'll see once we got there.

spficklin’s picture

So, if your database resides in a separate database (not in the same schema) can you use the $database setting of Drupal (set in the settings.php file )to initiate the connection rather than create the connection in the hook?

So, to make support for multiple Chado databases more a part of Tripal itself, what if we did the following:

1) Added a 'schema' field to the chado_xxxxx tables. It can be set to allow nulls so as it's backwards compatible. If it's null it defaults to using the 'chado' schema when loading a node. Otherwise we can adjust the code to load from the specified schema. This way, you don't need to create additional tables and views.

2) We can add a global variable somewhere, perhaps a Drupal variable, that indicates the current Chado database. When the chado_set_active is called it will default to using the active chado database if one is not specifically provided.

Would those two cases get you what you need?

guignonv’s picture

So, if your database resides in a separate database (not in the same schema) can you use the $database setting of Drupal (set in the settings.php file )to initiate the connection rather than create the connection in the hook?

Short answer: not really convenient. See after for a long answer.

So, to make support for multiple Chado databases more a part of Tripal itself, what if we did the following:

1) Added a 'schema' field to the chado_xxxxx tables. It can be set to allow nulls so as it's backwards compatible. If it's null it defaults to using the 'chado' schema when loading a node. Otherwise we can adjust the code to load from the specified schema. This way, you don't need to create additional tables and views.

It would work but it would mean that Tripal would natively support multiple schema. I have nothing against that but I wonder if it should be part of Tripal core or a separate module. And what about other Tripal extensions that are not aware of this modification?

2) We can add a global variable somewhere, perhaps a Drupal variable, that indicates the current Chado database. When the chado_set_active is called it will default to using the active chado database if one is not specifically provided.

We would definitely need something like that.

Would those two cases get you what you need?

Not really and to explain my point of view, I'll try with 2 use cases.

"settings.php" is a secured file. Only system admins can access this file and modify it; it's read-only to Drupal system. Therefore, you can't have a Drupal interface that allows you to manage connection and password change. So using settings.php is convenient if you use what I would call "static anonymized access". What I would like to do is different than that. I would like to manage different user access to databases (same or different DB).

Let's say I have 2 users John and Sarah. John is a gene curator and uses an external software to connect to Chado, like Apollo, Artemis or some other direct database connection system. With his personal account, he can modify the Chado database content. For security reasons, his account is not allowed to access to the "public" (Drupal) schema of the database. When he makes some change on the database, the audit module (Chado Controller History) can record what he personally did because he is using his personal account.

The second user, Sarah, is a reviewer or a partner. She needs to access to some parts of the database content that are not open to public yet but she is also not allowed to access to everything. She only has a read-only access to the database. To manage that, we use the Chado Controller and set specific rights to her account. And like for John, she has no access to the "public" schema of the database.

Both of those 2 users can access the database directly and through Tripal. If they browse Chado features using Tripal, John should be able to see everything and Sarah should be able to see some parts that the default "public" account does not see. In my current design of the multi-chado module, when John connects to Drupal, his Drupal user account is associated to a specific Chado connection settings. When he will access to features through Tripal, the multi-chado module will use John's connection settings and not the default public Tripal connection settings. Fortunately, we have some background system that auto-synchronize Tripal Chado data when some modifications are made. So John will be able to check what he changed using Tripal while a regular user won't have access to his work. Now, about Sarah, same kind of story: she can see some data that are not public yet.

If we store their access in settings.php file, it would work but what if we want to add a new collaborator Patty? The system admin will have to edit settings.php again? And if 1 day later, we would like to remove Sarah access, we'll have to ask the sys admin once again to edit that file?
What if they use an external database interface to change their password? "settings.php" would need to be updated. How?

If we manage Chado database access in Drupal database schema rather than in settings.php, we can tell the user his/her password changed and needs to be updated when it happens. The user can enter his/her new password that can be saved for the next time he/she needs to connect. It could also enable users to manage their own set of connections. Let's imagine this scenario: I'm a banana genome collaborator and I use the public Banana Genome Hub (BGH, that runs Tripal). I also have a private Chado database that has some specific data I'm working on. I'd like to compare my data with the public one using the available online tools of the Banana Genome Hub. I could add to my personal set of connections to Chado databases a new one to my private database thanks to the multi-chado module and then use another new Tripal extension module (not existing yet...) that would allow me to query each database and display results on 2 panes for comparison. I wouldn't need provide the credential to my private database to any BGH admin.

Now, about security concerns, each Chado database access settings would be stored in Drupal database (public schema) however, they could include or not a password. If no password is stored for a given access, the user will be prompted to provide it during his session and it will be stored in temporary session data. But if a password is stored, it would be encrypted (not hashed) using a hash made from the Drupal user password (generated for the session when the user connects to Drupal). The encrypted version will also be "salted" using a hash salt similar to the Drupal one ($drupal_hash_salt in settings.php). If, for some bad reason, some hacker could get a dump of the database, it wouldn't be enough to steal users credential to their Chado databases. The hacker would also have to get access to settings.php (or the separate hash file) and the user Drupal account password.

That answers why I wouldn't use settings.php to store Chado connection settings and how I would ensure that settings are securely stored in database. Now about avoiding duplicate tables, your proposition to add a field to existing chado_xxxx tables could be a solution. I even thought about doing that using Drupal schema alter hook (https://api.drupal.org/api/drupal/modules!system!system.api.php/function...) but I did not investigate enough yet to see the pros and cons. If it would be natively supported by Tripal, it would be nice of course (less code for me and less worries). :-)

spficklin’s picture

Assigned: Unassigned » spficklin
Status: Active » Closed (fixed)

Phone conversation better clarified the issue. The settings.php file can not be used in this case because user access to Chado is controlled at the database level. This is to enforce consistent access controls for other tools besides Tripal (such as Artemis). The patch properly allows a module to alter the database connection details on a per-user bases. Because the chado_set_active() function is called by other Tripal API functions the custom module has no ability, other than with the new hook supported by the patch, to change the active database connection.