I'm implementing #609892: Multiple custom variables, and for privacy reasons I'm trying to block all tokens that contain personally identifying information in the form validation. I have collected a good number of them, but I think maintaining a list like this will become a never-ending story:

  $token_blacklist = array(
    '[current-user:edit-url]',
    '[current-user:name]',
    '[current-user:mail]',
    '[current-user:uid]',
    '[current-user:url]',
    '[current-user:path]',
    '[node:author]',
    '[node:author:edit-url]',
    '[node:author:mail]',
    '[node:author:name]',
    '[node:author:path]',
    '[node:author:url]',
    '[node:author:uid]',
  );

I'm really asking myself how to make this generic... My first idea was to block all node:author:* tokens, but this may leave other tokens out. You may argue that the admin/marketing people need to decide which tokens may be used, but I do not think so. I know users like to collect personally identifying information and I do not want to help or support them in any way to achieve their illegal goal.

Comments

Dave Reid

I'm not sure how this could be accomplished.

hass

This is very bad... Am I the only person who cares about personal data? For now I'm doing this with the code below, but it is not safe and false positives may occur, too.

We should think about implementing something like a prefix or an info flag that defines whether a value could contain personally identifying information. This is becoming more and more of an issue, and so many people do not follow the data protection rules. Additionally, we need a way to return only tokens *without* personally identifying data in the token_tree view.

/**
 * Validate if a string contains forbidden tokens not allowed by privacy rules.
 *
 * @param $token_string
 *   A string with one or more tokens to be validated.
 * @return boolean
 *   TRUE if blacklisted token has been found, otherwise FALSE.
 */
function _googleanalytics_contains_forbidden_token($token_string) {
  // List of strings in tokens with personal identifying information not allowed
  // for privacy reasons. See section 8.1 of the Google Analytics terms of use
  // for more detailed information.
  //
  // This list can never ever be complete. For this reason it tries to use a
  // regex and may kill a few other valid tokens, but it's the only way to
  // protect users as much as possible from admins with illegal ideas.
  //
  // User tokens are not prefixed with colon to catch 'current-user' and 'user'.
  //
  // TODO: If someone has better ideas, please share them!
  $token_blacklist = array(
    ':author]',
    ':author:edit-url]',
    ':author:url]',
    ':author:path]',
    ':mail]',
    ':name]',
    ':uid]',
    'user:edit-url]',
    'user:url]',
    'user:path]',
  );

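  // Escape each entry for use inside a '/'-delimited pattern.
  $patterns = array();
  foreach ($token_blacklist as $token) {
    $patterns[] = preg_quote($token, '/');
  }

  // Cast to boolean so the return value matches the documented type.
  return (bool) preg_match('/' . implode('|', $patterns) . '/i', $token_string);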
}
Dave Reid

Priority: Critical » Normal

Did I ever say I don't care about personal data? I just have no idea how this can be solved currently. The only thing I can think of is for you to perform your own validation on the textfields where users can enter tokens and raise form errors if specific tokens are found. And how do we define which tokens are or are not personally identifying, and how do we enforce that standard in other contrib modules?
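A rough sketch of that kind of validation, assuming a hypothetical helper example_find_forbidden_tokens() that is not part of Token or any other module. It returns the offending tokens instead of a plain TRUE/FALSE, so a form validator could name them in its error message. The token pattern is a simplified assumption, not the one Token itself uses.

```php
<?php
/**
 * Hypothetical helper: finds tokens in $text that end with a forbidden suffix.
 *
 * @param string $text
 *   The submitted text that may contain tokens such as [node:author:mail].
 * @param array $forbidden_suffixes
 *   Token endings to reject, for example ':mail]' or 'user:url]'.
 *
 * @return array
 *   The forbidden tokens found in $text, in order of first appearance.
 */
function example_find_forbidden_tokens($text, array $forbidden_suffixes) {
  // Simplified assumption: a token looks like [type:name] or [type:name:sub].
  preg_match_all('/\[[^\s\[\]:]+:[^\s\[\]]+\]/', $text, $matches);

  $found = array();
  foreach ($matches[0] as $token) {
    foreach ($forbidden_suffixes as $suffix) {
      // Case-insensitive comparison of the token's ending with the suffix.
      $ending = substr($token, -strlen($suffix));
      if (strcasecmp((string) $ending, $suffix) === 0 && !in_array($token, $found, TRUE)) {
        $found[] = $token;
        break;
      }
    }
  }
  return $found;
}

$found = example_find_forbidden_tokens(
  'Author [node:author:mail], title [node:title]',
  array(':mail]', 'user:url]')
);
```

A validation callback could then raise one form error per entry in $found and leave the field untouched when the array is empty.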

hass

Component: Miscellaneous » Code
Category: support » feature

Well, nobody has said he does not care, but the simple fact that I seem to be the very first person asking for this clearly shows that nobody cared about it - until now :-)

I'm not sure about the design of such a feature. With the current placeholder structure it's very difficult to maintain a list. Every module names its variables however it likes, and there is no generic naming scheme that could be caught. It may also be a good idea to simply rename some placeholders to be generic and maybe prefix them in a way that makes it possible to run a regex on them. Today this is not easy, if not impossible.

nicksanta

This is a terrible, terrible idea. Why are you making the assumption that :author tokens contain personally identifiable information? Drupal user accounts aren't necessarily people, nor do they necessarily contain personal information.

There should absolutely be a way to override this, or alter the list to the developer's requirements.

hass

You are completely wrong!!! A Drupal user account makes an individual personally identifiable (IP, email, names, timestamps, etc.). In addition, usernames very often contain real names. You are NOT allowed to push this data to the Google Analytics system.

nicksanta

A website I'm maintaining (online publication with 25+ writers) uses user accounts to associate nodes to their writer. The accounts are never used to log in, the editorial team simply changes the node author.
The analytics team wants the author of the current node to be set as a custom variable in the page scope to track whether people choose to read articles by the same author.

For the most part the exclude list works perfectly fine, and stops laymen from abusing personal information and violating Google's T&Cs. But if someone has the tech skills to write an alter hook and change that exclude list for evil, then they have the ability to write their own custom GA tracking code to do it anyway.

In the end, it's not up to the Drupal-integration maintainer to enforce Google's terms and conditions. I think this patch is a reasonable request all things considered, because you shouldn't be making assumptions about how sites have been put together.

nicksanta

Bumping this.

I still do not understand why it is the responsibility of this module to ensure that people do not violate their agreement with Google.

nicksanta

If the maintainer of this module insists on including this functionality, then I implore him to include a patch from this ticket: http://drupal.org/node/1307452

rcross

I think the idea of form validation is better than trying to blacklist these tokens. Anyone who wanted to circumvent this could just rewrite some tokens to expose the details. It would be better to have a notice that tells users about the privacy risk and links to Google's ToS.

hass

Title: How to block tokens with personally identifying information? » Extend token API to support flag for personally identifying information
Issue tags: +privacy

@Dave: Can I get your support for adding a personal-identifying-data = TRUE flag (or any other named flag) to the token info and follow up requirements? If there is no explicit === FALSE we default to TRUE. I hope you can share your ideas so I'm not implementing something that has no chance to get in.

Idea:

  • We expect that a module that does not support the personal-identifying-data flag provides tokens containing privacy-relevant data.
  • If a token has personal-identifying-data = FALSE explicitly set, we expect that this token can be used.
  • token_tree needs to allow filtering for personal-identifying-data = FALSE items.
  • hook_token_element_validate() and/or token_scan() need to be able to check for this boolean value, or token_scan() needs to simply refuse such a token as invalid.

Example:

  // Current user tokens.
  $info['tokens']['current-user']['ip-address'] = array(
    'name' => t('IP address'),
    'description' => t('The IP address of the current user.'),
    'personal-identifying-data' => TRUE,
  );
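
A minimal sketch of how the default-to-TRUE rule from the idea list could be evaluated. The 'personal-identifying-data' key is only the flag proposed here and does not exist in the token API. The functions example_token_is_personal() and example_safe_tokens() are hypothetical names, and the $info array is hand-built for illustration rather than taken from any module.

```php
<?php
/**
 * Hypothetical helper: treats a token as personal unless the proposed
 * 'personal-identifying-data' flag is explicitly set to FALSE.
 */
function example_token_is_personal(array $token_info) {
  return !(isset($token_info['personal-identifying-data'])
    && $token_info['personal-identifying-data'] === FALSE);
}

/**
 * Hypothetical helper: lists the "[type:name]" tokens that are safe to offer.
 */
function example_safe_tokens(array $info) {
  $safe = array();
  foreach ($info['tokens'] as $type => $tokens) {
    foreach ($tokens as $name => $token_info) {
      if (!example_token_is_personal($token_info)) {
        $safe[] = '[' . $type . ':' . $name . ']';
      }
    }
  }
  return $safe;
}

// Hand-built sample data in the shape of token info (an assumption).
$info = array('tokens' => array());
$info['tokens']['current-user']['ip-address'] = array(
  'name' => 'IP address',
  'personal-identifying-data' => TRUE,
);
$info['tokens']['node']['title'] = array(
  'name' => 'Title',
  'personal-identifying-data' => FALSE,
);
// No flag at all: counts as personal under the proposed default.
$info['tokens']['node']['author'] = array(
  'name' => 'Author',
);

$safe = example_safe_tokens($info);
```

Only the token that explicitly opts out ends up in $safe. The unflagged token is withheld, which is the conservative behavior the proposal asks for.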