The taxonomy field and widgets were developed while looking at options widgets and list fields. Now is the time to fix some of the bad or non-performing code.

See taxonomy_field_validate:

function taxonomy_field_validate($obj_type, $object, $field, $instance, $langcode, $items, &$errors) {
  $allowed_values = taxonomy_allowed_values($field);
...

taxonomy_allowed_values is a fine function to call when a select widget or an options widget for this field is being rendered, because it returns an array suitable for the #options property of the element. But for validation of all fields? No way.

function taxonomy_allowed_values($field) {
  $options = array();
  foreach ($field['settings']['allowed_values'] as $tree) {
    $terms = taxonomy_get_tree($tree['vid'], $tree['parent']);
    if ($terms) {
      foreach ($terms as $term) {
        $options[$term->tid] = str_repeat('-', $term->depth) . $term->name;
      }
    }
  }
  return $options;
}

A vocabulary with 3 million terms... we want to load that, make an options array out of it, just so we can check whether the item's value is a key in the array?

Because of the possibilities that multiple vocabulary subtrees are indicated in the field's allowed values option, we probably need a function like:

/**
 * Determine if the item's value is allowed for the field's domain of terms.
 */
function taxonomy_field_is_allowed($item, $field) {
...
}

Remember that the field's allowed values is an array of subtree definitions. This will be the simplest and most common arrangement, and it's what Drupal's Field UI can do.

array(
  array(
    'vid' => '1',
    'parent' => '0',
  ),
);

That can be validated by loading the item as a term and checking the vid. The whole vocabulary (vid=1) is used when the parent is 0.

Fields which use multiple whole vocabularies will not perform badly either.

array(
  array(
    'vid' => '1',
    'parent' => '0',
  ),
  array(
    'vid' => '2',
    'parent' => '0',
  ),
);

The performance of fields that use subtrees is going to be a problem because these will always require determining the ancestry of the term being loaded.

array(
  array(
    'vid' => '2', // organizations
    'parent' => '423', // international organizations
  ),
);

This is where the performance will be sticky, because there is no quick and easy way to determine in one step whether term ID X has term 423 in its ancestry. In these cases, the vid is not really important, because the parent implies the vid.

Making notes here:

When the parent is 0, we get to do simple stuff and use the vid in the allowed values.
When the parent isn't 0, we gotta do it the hard way and the vid won't help us.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

catch’s picture

subscribing.

catch’s picture

Issue tags: +Performance
FileSize
2.38 KB

Here's a patch.

Instead of loading the full tree for the vocabulary, we do the following:

Load all the terms to be validated with taxonomy_term_load_multiple().

If the setting only contains a vid (this is the only option via the uid), we just check $term->vid against the setting and that's it.

If the setting contains a sub-tree, we call taxonomy_get_parents_all() for each term, and check if the parent is found in that array. This way performance only starts to get affected if you have to validate a lot of terms at deep hierarchy, and it's small individual queries rather than a memory sink from loading the full sub-tree.

We have existing tests for this, which pass with the patch, and failed while I was working on it, so ample coverage there.

catch’s picture

Priority: Normal » Critical
Status: Active » Needs review

Also this is critical - we need to get to a situation in core where we never, ever call taxonomy_get_tree() unless it's absolutely necessary, since there's no way to know how big a vocabulary is, and it's equivalent to having code which just randomly loaded all nodes in one go.

Dries’s picture

I think this function would benefit from a higher level description explaining what we're validating, and how it should be validated. Only when you understand the big picture, the code comments will start to make sense. Can we try to add one or two sentences painting that picture?

catch’s picture

Hmm, we probably need similar comments in taxonomy_allowed_values() too, but I'm not sure where best to put this without repeating it (inline in hook_field_info() doesn't seem that good either), so I added something to the top of taxonomy_field_validate() for now, but we might need a spin-off issue to improve the documentation of this feature more generally, especially since it's not exposed via the UI (you can only choose a single vocabulary via the UI, no fancy stuff).

yched’s picture

Status: Needs review » Reviewed & tested by the community

Fully agreed with the patch.

chx’s picture

Yes. We will port nowpublic.com to Drupal 7 rather soon and we have millions of terms. In older Drupals we hacked taxonomy module so that taxonomy_get_tree simply does not run and I would be rather delighted not to hack core for D7.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to CVS HEAD.

yched’s picture

Is there any sense in keeping taxonomy_allowed_values() as an API func that can bring your site on its knees in a snap ?
It's now only used by hook_options_list() (generate the list of options for select widgets).

catch’s picture

Everywhere we build a select list from a taxonomy vocabulary we potentially bring a site to its knees - that's currently only here, and in _taxonomy_term_select().

This needs something like #191360: Scalable menu selector or #30993: Scaleable/themeable taxonomy selection widget - both of which are at D8. We'd need a select which converts to an autocomplete if there's more than a certain number of terms to display, and possibly one which shows hierarchy as well. I think if there was a valid patch, there'd be a chance of getting something simple into D7 since a lot of sites fall down on this, but I don't know if that's feasible at all.

chx’s picture

Easy, don't use options_select as a widget for large vocabs or if you do then set parent to -1 and then alter the taxonomy_get_tree query for that query to select certain things with a -1 parent: ->addExpression('SELECT CASE tid IN :tids THEN -1 ELSE h.parent END', 'parent') I can post more code later.

Status: Fixed » Closed (fixed)
Issue tags: -Performance

Automatically closed -- issue fixed for 2 weeks with no activity.