Scalability. [#684558]

The inform module works well with reasonably sized data sets but does not scale well to large ones. It was built on Drupal's taxonomy system to provide good integration (although taxonomy itself also has performance issues with large vocabularies).

However, when it scales towards 100,000 nodes, 100,000 terms, and 1,000,000 tags, the performance of the whole system starts to suffer: inserts get slow, pages that display terms get slow, and the taxonomy admin pages get slow.

There will have to be a major rewrite of the module to move away from Drupal taxonomy and onto a more scalable system.

I am researching this and have found the following useful.

Tagging Systems

Comment	File	Size	Author
#9	inform-684558.patch	49.54 KB	rgristroph
#9	inform_new_files.tgz	2.21 KB	rgristroph
#8	scalability.patch	23.81 KB	JeremyFrench

Comments

Comment #1

JeremyFrench commented 14 January 2010 at 15:26

Looking at the queries used, the worst offender by far is the related topics to topics query.

Comment #2

JeremyFrench commented 14 January 2010 at 15:48

Another thing to look at is excluding terms from pathauto and having a dedicated menu hook.

Comment #3

moshe weitzman commented 26 January 2010 at 18:19

We can get related topics to topics from apachesolr. Something to consider. It is the 'More like this' feature of solr.

Comment #4

moshe weitzman commented 27 January 2010 at 21:32

Yeah, inform_related_subjects() has a nasty query. Could you explain, here or in the code, exactly what we are trying to do there? It would help me think of simplifications. Thanks. Here it is:

db_query_range("SELECT {term_data}.tid as term_id,
                       coalesce(ts.name, {term_data}.name) as name,
                       sum(nt1.inform_score)
                FROM {term_data}
                INNER JOIN {term_node} nt1 ON nt1.tid = {term_data}.tid
                INNER JOIN {term_node} nt2 ON nt1.nid = nt2.nid
                LEFT OUTER JOIN {term_synonym} ts ON ts.tid = {term_data}.tid
                WHERE nt2.tid = %d
                AND {term_data}.tid != nt2.tid
                GROUP BY {term_data}.tid, coalesce(ts.name, {term_data}.name)
                ORDER BY sum(nt1.inform_score) desc", array($tid), 0, $count);

Comment #5

JeremyFrench commented 28 January 2010 at 07:15

The last line should actually be ORDER BY sum(nt1.inform_score * nt2.inform_score) desc", array($tid), 0, $count); so there is a bug in it anyway. The query is trying to find which topics occur together most often.

I think that unless we can find a quicker way to write this query, it should be turned off (or at least made optional). It is not an essential feature.
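To make the intent of the query concrete, here is a minimal sketch of the co-occurrence ranking with the corrected product score, run against an in-memory SQLite database instead of Drupal. The table and column names follow the query above; the sample rows, the function names, and the `tid` tie-breaker are assumptions for illustration only, and the synonym join is left out to keep it short.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory stand-in for term_data / term_node (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term_data (tid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE term_node (nid INTEGER, tid INTEGER, inform_score REAL);
    """)
    conn.executemany("INSERT INTO term_data VALUES (?, ?)",
                     [(1, "politics"), (2, "economy"), (3, "sport")])
    conn.executemany("INSERT INTO term_node VALUES (?, ?, ?)", [
        (10, 1, 0.9), (10, 2, 0.8),
        (11, 1, 0.5), (11, 2, 0.4), (11, 3, 0.2),
        (12, 3, 1.0),
    ])
    return conn


def related_topics(conn, tid, count):
    """Return (term_id, name, score) for the terms that co-occur most with `tid`.

    nt2 selects the nodes tagged with `tid`; nt1 selects every other term on
    those same nodes. Each shared node contributes the product of both scores.
    """
    sql = """
        SELECT td.tid AS term_id,
               td.name AS name,
               SUM(nt1.inform_score * nt2.inform_score) AS score
        FROM term_data td
        INNER JOIN term_node nt1 ON nt1.tid = td.tid
        INNER JOIN term_node nt2 ON nt1.nid = nt2.nid
        WHERE nt2.tid = ?
          AND td.tid != nt2.tid
        GROUP BY td.tid, td.name
        ORDER BY score DESC, td.tid ASC
        LIMIT ?
    """
    return conn.execute(sql, (tid, count)).fetchall()
```

Calling related_topics(demo_connection(), 1, 5) ranks "economy" ahead of "sport", because "economy" shares two nodes with term 1 and has higher scores on both. The self-join on term_node is what makes this expensive at scale: the work grows with the number of nodes carrying the term and the number of terms on each of those nodes.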

Comment #6

JeremyFrench commented 28 January 2010 at 11:56

There are some easy things we can do to inform that may help it scale better:

Turn off related stories to stories
Use some caching
Use slave db servers (if available)
Put the admin functions into an include (already a ticket)
Add option to not use path module as terms may bloat the path table
others...

I am going to try these and benchmark the results, before looking at more drastic measures like denormalized tables.
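As a rough illustration of the caching point, here is a minimal time-based cache sketch. It is not how inform or Drupal's cache API works; the class name, the key shape, and the 60-second lifetime in the test are all assumptions, and the idea is only that an expensive lookup such as the related-topics query runs once per expiry window rather than on every page view.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be exercised in tests
        self._store = {}        # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss or after expiry."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value
```

The trade-off is staleness: related topics may lag behind new tags by up to the lifetime chosen, which seems acceptable for a non-essential feature.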

Comment #7

moshe weitzman commented 28 January 2010 at 13:56

Drupal already has a notion of related terms. It is stored in the term_relation table. But Drupal core does not really use this feature. It just has functions for loading and saving this stuff. See taxonomy_get_related().

I propose that inform_cron() populate the term_relation table once a week or something like that. Then we use taxonomy_get_related() or a similar custom query in inform_related_subjects().
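A minimal sketch of that precompute-then-look-up idea, again using in-memory SQLite instead of Drupal. The function names and sample rows are made up, and the `score` column is an assumption added here so results can be ordered; as far as I know, core's table stores only the pair of term ids.

```python
import sqlite3


def demo_connection():
    """In-memory stand-in for the tables involved (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term_node (nid INTEGER, tid INTEGER, inform_score REAL);
        -- `score` is an assumption for this sketch, not a core column.
        CREATE TABLE term_relation (tid1 INTEGER, tid2 INTEGER, score REAL);
        CREATE INDEX term_relation_tid1 ON term_relation (tid1);
    """)
    conn.executemany("INSERT INTO term_node VALUES (?, ?, ?)", [
        (10, 1, 0.9), (10, 2, 0.8),
        (11, 1, 0.5), (11, 2, 0.4), (11, 3, 0.2),
        (12, 3, 1.0),
    ])
    return conn


def rebuild_term_relation(conn):
    """Recompute every co-occurring term pair in one batch, as a periodic cron job might."""
    with conn:  # a single transaction, so readers never see a half-built table
        conn.execute("DELETE FROM term_relation")
        conn.execute("""
            INSERT INTO term_relation (tid1, tid2, score)
            SELECT nt1.tid, nt2.tid, SUM(nt1.inform_score * nt2.inform_score)
            FROM term_node nt1
            INNER JOIN term_node nt2
                    ON nt1.nid = nt2.nid AND nt1.tid != nt2.tid
            GROUP BY nt1.tid, nt2.tid
        """)


def get_related(conn, tid, count):
    """Cheap page-time lookup: an indexed read of precomputed pairs, no self-join."""
    rows = conn.execute(
        "SELECT tid2 FROM term_relation WHERE tid1 = ? "
        "ORDER BY score DESC, tid2 ASC LIMIT ?", (tid, count))
    return [row[0] for row in rows]
```

The expensive join moves out of the page request and into the batch job; the cost is that related topics are only as fresh as the last rebuild.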

Comment #8

JeremyFrench commented 29 January 2010 at 17:00

Status	File	Size
new	scalability.patch	23.81 KB

Here is a patch, with a few things which have been worked on locally to improve performance overall. It covers the first three points above.

Comment #9

rgristroph commented 24 January 2011 at 02:21

Status	File	Size
new	inform_new_files.tgz	2.21 KB
new	inform-684558.patch	49.54 KB

This patch is very comprehensive and well tested in a large environment. It is the accumulation of a rewrite and several months of testing, and I suggest that we make it the next dev release.

There is the patch, and then two additional files providing drush commands and a default view.

This addresses most of the scalability issues by not using the core taxonomy module. The drush commands allow for background processing and various types of inspection. There is a new hook, hook_inform_request, which allows you to implement custom data manipulation on the node body before it is sent to inform.

I think I have CVS commit access, but I have not tried it yet. I can commit this if it seems like the right thing to do.

--Rob

Comment #10

JeremyFrench commented 24 January 2011 at 09:13

Go for it. Well done.
