The inform module works well with reasonably sized data sets but does not scale well to large ones. It was built on Drupal's taxonomy system to provide good integration, although taxonomy has performance issues of its own with large vocabularies.

However, as a site grows toward 100,000 nodes, 100,000 terms, and 1,000,000 tags, performance of the whole system starts to suffer: inserts get slow, pages that display terms get slow, and the taxonomy admin pages get slow.

The module will need a major rewrite to move away from Drupal taxonomy to a more scalable system.

I am researching this and have found the following useful.

Tagging Systems

Comments

JeremyFrench’s picture

Looking at the queries used, the worst offender by far is the related topics to topics query.

JeremyFrench’s picture

Another thing to look at is excluding terms from pathauto and having a dedicated menu hook.

moshe weitzman’s picture

We can get related topics to topics from apachesolr. Something to consider. It is the 'More like this' feature of solr.

moshe weitzman’s picture

Yeah, inform_related_subjects() has a nasty query. Could you explain here or in the code exactly what we are trying to do there? It would help me think of simplifications. Thanks. Here it is:

db_query_range("SELECT {term_data}.tid as term_id,
                       coalesce(ts.name, {term_data}.name) as name,
                       sum(nt1.inform_score)
                FROM {term_data}
                INNER JOIN {term_node} nt1 ON nt1.tid = {term_data}.tid
                INNER JOIN {term_node} nt2 ON nt1.nid = nt2.nid
                LEFT OUTER JOIN {term_synonym} ts ON ts.tid = {term_data}.tid
                WHERE nt2.tid = %d
                AND {term_data}.tid != nt2.tid
                GROUP BY {term_data}.tid, coalesce(ts.name, {term_data}.name)
                ORDER BY sum(nt1.inform_score) desc", array($tid), 0, $count);

JeremyFrench’s picture

It should actually be ORDER BY sum(nt1.inform_score*nt2.inform_score) desc", array($tid), 0, $count); so there is a bug in it anyway. The query is trying to find which topics occur together the most.

I think unless we can find a quicker way to write this query it should be turned off (or at least only optionally on). It is not an essential feature.
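To make the intent of that query concrete, here is a minimal sketch in plain PHP of the co-occurrence score described above. It works on an in-memory array standing in for {term_node} rows. The function name inform_cooccurrence_scores() and the sample data are hypothetical, and this is an illustration of the scoring only, not how the module does it (the module does it in SQL).

```php
<?php
/**
 * Hypothetical helper: score terms by how strongly they co-occur with $tid.
 *
 * $rows mimics {term_node}: each row has nid, tid and inform_score.
 * A term's score is the sum, over nodes it shares with $tid, of
 * (its inform_score * $tid's inform_score), matching the corrected ORDER BY.
 */
function inform_cooccurrence_scores(array $rows, $tid, $count) {
  // Index the target term's score by node id (the nt2 side of the join).
  $target = array();
  foreach ($rows as $row) {
    if ($row['tid'] == $tid) {
      $target[$row['nid']] = $row['inform_score'];
    }
  }

  // Accumulate weighted scores for every other term on those nodes (nt1 side).
  $scores = array();
  foreach ($rows as $row) {
    if ($row['tid'] != $tid && isset($target[$row['nid']])) {
      if (!isset($scores[$row['tid']])) {
        $scores[$row['tid']] = 0;
      }
      $scores[$row['tid']] += $row['inform_score'] * $target[$row['nid']];
    }
  }

  // Highest score first, limited to $count results; keys are term ids.
  arsort($scores);
  return array_slice($scores, 0, $count, TRUE);
}

// Made-up sample data: three nodes, three terms.
$rows = array(
  array('nid' => 1, 'tid' => 10, 'inform_score' => 2),
  array('nid' => 1, 'tid' => 20, 'inform_score' => 3),
  array('nid' => 2, 'tid' => 10, 'inform_score' => 1),
  array('nid' => 2, 'tid' => 30, 'inform_score' => 5),
  array('nid' => 2, 'tid' => 20, 'inform_score' => 1),
  array('nid' => 3, 'tid' => 30, 'inform_score' => 9),
);
$related = inform_cooccurrence_scores($rows, 10, 2);
```

With this sample data, term 20 shares two nodes with term 10 and term 30 shares one, so term 20 ranks first. Node 3 does not carry term 10, so its high score for term 30 is ignored.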

JeremyFrench’s picture

There are some easy things we can do to inform that may help it scale better.

  • Turn off related stories to stories
  • Use some caching
  • Use slave db servers (if available)
  • Put the admin functions into an include (already a ticket)
  • Add option to not use path module as terms may bloat the path table
  • others...

I am going to try these and benchmark the results before looking at more drastic measures like denormalized tables.
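As an illustration of the "use some caching" item, here is a minimal sketch of a time-limited cache around a slow lookup. It is plain PHP so it can be run on its own: the array $cache stands in for a cache table, and inform_cached_lookup(), the cache ID format, and the one-hour lifetime are all hypothetical. Inside a Drupal module the same role would normally be played by cache_get() and cache_set().

```php
<?php
/**
 * Hypothetical time-limited cache wrapper.
 *
 * $cache is a plain array passed by reference, standing in for a cache table.
 * $loader is any callable that performs the slow work (e.g. a heavy query).
 * $now is passed in explicitly so the behaviour is easy to test.
 */
function inform_cached_lookup(array &$cache, $cid, $loader, $lifetime, $now) {
  // Serve from cache while the entry has not expired.
  if (isset($cache[$cid]) && $cache[$cid]['expire'] > $now) {
    return $cache[$cid]['data'];
  }
  // Otherwise do the slow work once and remember the result.
  $data = call_user_func($loader);
  $cache[$cid] = array('data' => $data, 'expire' => $now + $lifetime);
  return $data;
}

// Stand-in for the expensive related-topics query; counts how often it runs.
$calls = 0;
$loader = function () use (&$calls) {
  $calls++;
  return array(20, 30);
};

$cache = array();
$first  = inform_cached_lookup($cache, 'inform:related:10', $loader, 3600, 1000);
$second = inform_cached_lookup($cache, 'inform:related:10', $loader, 3600, 2000);
$third  = inform_cached_lookup($cache, 'inform:related:10', $loader, 3600, 5000);
```

The second call falls inside the lifetime and skips the loader. The third call comes after the entry has expired, so the loader runs again. The closure syntax needs PHP 5.3 or later.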

moshe weitzman’s picture

Drupal already has a notion of related terms. It is stored in the term_relation table. But Drupal core does not really use this feature. It just has functions for loading and saving this stuff. See taxonomy_get_related().

I propose that inform_cron() populate the term_relations table once a week or something like that. Then we use taxonomy_get_related() or similar custom query in inform_related_subjects().
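A rough sketch of what such a periodic rebuild could compute, in plain PHP so it runs standalone. Both functions are hypothetical, the input array stands in for {term_node} rows, and "related" is simplified here to "shares at least $min_shared nodes" rather than the score-weighted ranking. In the module, inform_cron() would call something like this and write the resulting pairs to the relation table.

```php
<?php
/**
 * Hypothetical: is the periodic rebuild due? Default interval is one week.
 */
function inform_relations_due($last_run, $now, $interval = 604800) {
  return ($now - $last_run) >= $interval;
}

/**
 * Hypothetical: build (tid1, tid2) pairs for terms that appear together
 * on at least $min_shared nodes. $rows mimics {term_node}: nid and tid.
 */
function inform_build_relations(array $rows, $min_shared) {
  // Group the distinct term ids attached to each node.
  $by_node = array();
  foreach ($rows as $row) {
    $by_node[$row['nid']][$row['tid']] = TRUE;
  }

  // Count shared nodes per unordered pair, smaller tid first.
  // Note this is quadratic in the number of terms per node.
  $shared = array();
  foreach ($by_node as $tids) {
    $tids = array_keys($tids);
    sort($tids);
    $n = count($tids);
    for ($i = 0; $i < $n; $i++) {
      for ($j = $i + 1; $j < $n; $j++) {
        $key = $tids[$i] . ':' . $tids[$j];
        $shared[$key] = isset($shared[$key]) ? $shared[$key] + 1 : 1;
      }
    }
  }

  // Keep only the pairs that meet the threshold.
  $pairs = array();
  foreach ($shared as $key => $times) {
    if ($times >= $min_shared) {
      list($tid1, $tid2) = explode(':', $key);
      $pairs[] = array('tid1' => (int) $tid1, 'tid2' => (int) $tid2);
    }
  }
  return $pairs;
}

// Made-up sample data: terms 10 and 20 share nodes 1 and 2.
$rows = array(
  array('nid' => 1, 'tid' => 10),
  array('nid' => 1, 'tid' => 20),
  array('nid' => 2, 'tid' => 10),
  array('nid' => 2, 'tid' => 20),
  array('nid' => 2, 'tid' => 30),
  array('nid' => 3, 'tid' => 30),
);
$pairs = inform_build_relations($rows, 2);
```

Doing this work in cron moves the expensive self-join out of page requests, at the cost of related terms being up to one interval stale.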

JeremyFrench’s picture

Attached file (status, size): new, 23.81 KB

Here is a patch, with a few things which have been worked on locally to improve performance overall. It covers the first three points above.

rgristroph’s picture

Attached files (status, size):
new, 2.21 KB
new, 49.54 KB

This patch is very comprehensive and well tested in a large environment. It is the accumulation of a rewrite and several months of testing, and I suggest that we make it the next dev release.

There is the patch, and then two additional files providing drush commands and a default view.

This addresses most of the scalability issues by not using the core taxonomy module. The drush commands allow for background processing and various types of inspection. There is a new hook, hook_inform_request, which allows you to implement custom data manipulation on the node body before it is sent to inform.

I think I have CVS commit access, but I have not tried yet. I can commit this if it seems like the right thing to do.

--Rob

JeremyFrench’s picture

Go for it. Well done.