Active
Project: Inform
Version: 6.x-1.x-dev
Component: Code
Priority: Critical
Category: Bug report
Assigned:
Reporter:
Created: 14 Jan 2010 at 14:29 UTC
Updated: 24 Jan 2011 at 09:13 UTC
The inform module works well with reasonably sized data sets but doesn't scale well to large ones. It was built on Drupal's taxonomy system to provide good integration (though taxonomy has performance issues of its own with large vocabularies).
However, as a site approaches 100,000 nodes, 100,000 terms, and 1,000,000 tags, performance of the whole system starts to suffer: inserts get slow, pages that display terms get slow, and the taxonomy admin pages get slow.
The module will need a major rewrite to move away from Drupal taxonomy and onto a more scalable system.
I am researching this and have found the following useful.
| Comment | File | Size | Author |
|---|---|---|---|
| #9 | inform-684558.patch | 49.54 KB | rgristroph |
| #9 | inform_new_files.tgz | 2.21 KB | rgristroph |
| #8 | scalability.patch | 23.81 KB | JeremyFrench |
Comments
Comment #1
JeremyFrench commented
Looking at the queries used, the worst offender by far is the related topics to topics query.
Comment #2
JeremyFrench commented
Another thing to look at is excluding terms from pathauto and having a dedicated menu hook.
Comment #3
moshe weitzman commented
We can get related topics to topics from apachesolr, using Solr's 'More like this' feature. Something to consider.
Comment #4
moshe weitzman commented
Yeah, inform_related_subjects() has a nasty query. Could you explain, here or in the code, exactly what we are trying to do there? That would help me think of simplifications. Thanks. Here it is:
Comment #5
JeremyFrench commented
It should actually be
ORDER BY sum(nt1.inform_score*nt2.inform_score) desc", array($tid), 0, $count);
so there is a bug in it anyway. The query is trying to find which topics occur together the most. I think that unless we can find a quicker way to write this query, it should be turned off (or at least made optional). It is not an essential feature.
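For readers who want to see the shape of such a co-occurrence query outside Drupal, here is a minimal standalone sketch in Python with SQLite. It is not the module's code: the `node_topic` table, its columns other than `inform_score`, and the function names are assumptions made for illustration. It self-joins the node-to-topic table on the node id and ranks the other topics by the summed product of scores, as in the corrected ORDER BY above.

```python
import sqlite3


def related_topics(conn, tid, count):
    """Return (tid, weight) pairs for topics co-occurring with `tid`, strongest first."""
    sql = """
        SELECT nt2.tid, SUM(nt1.inform_score * nt2.inform_score) AS weight
        FROM node_topic nt1
        JOIN node_topic nt2 ON nt2.nid = nt1.nid AND nt2.tid <> nt1.tid
        WHERE nt1.tid = ?
        GROUP BY nt2.tid
        ORDER BY weight DESC, nt2.tid
        LIMIT ?
    """
    return conn.execute(sql, (tid, count)).fetchall()


def demo_connection():
    """Build a tiny in-memory data set (hypothetical schema and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node_topic (nid INTEGER, tid INTEGER, inform_score REAL)"
    )
    # An index on (tid, nid) lets the lookup for one topic avoid a full scan.
    conn.execute("CREATE INDEX node_topic_tid ON node_topic (tid, nid)")
    conn.executemany(
        "INSERT INTO node_topic VALUES (?, ?, ?)",
        [
            (1, 10, 0.9), (1, 20, 0.8), (1, 30, 0.1),
            (2, 10, 0.5), (2, 20, 0.6),
            (3, 10, 0.4), (3, 30, 0.2),
        ],
    )
    return conn


if __name__ == "__main__":
    for tid, weight in related_topics(demo_connection(), 10, 5):
        print(tid, round(weight, 2))
```

Even with an index, the self-join touches every node tagged with the topic and every other topic on those nodes, which is why it gets expensive at the data sizes described in this issue.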
Comment #6
JeremyFrench commented
There are some easy things we can do to inform that may help it scale better.
I am going to try these and benchmark the results before looking at more drastic measures like denormalized tables.
Comment #7
moshe weitzman commented
Drupal already has a notion of related terms. It is stored in the term_relation table. But Drupal core does not really use this feature. It just has functions for loading and saving this stuff. See taxonomy_get_related().
I propose that inform_cron() populate the term_relation table once a week or so. Then we use taxonomy_get_related(), or a similar custom query, in inform_related_subjects().
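The precompute-then-look-up idea in this proposal can be sketched outside Drupal as follows. This is an illustration only, in Python with SQLite: the `node_topic` and `topic_relation` tables, the `weight` column, and both function names are assumptions, not the module's or core's schema. A periodic job runs the expensive self-join once for all topics and stores the top few relations per topic, so page requests only do a cheap indexed read.

```python
import sqlite3


def rebuild_relations(conn, per_topic=10):
    """Periodic job: recompute and store the strongest related topics per topic."""
    pairs = conn.execute("""
        SELECT nt1.tid, nt2.tid,
               SUM(nt1.inform_score * nt2.inform_score) AS weight
        FROM node_topic nt1
        JOIN node_topic nt2 ON nt2.nid = nt1.nid AND nt2.tid <> nt1.tid
        GROUP BY nt1.tid, nt2.tid
        ORDER BY nt1.tid, weight DESC, nt2.tid
    """).fetchall()
    kept = {}  # number of relations already kept for each topic
    rows = []
    for tid1, tid2, weight in pairs:
        if kept.get(tid1, 0) < per_topic:
            kept[tid1] = kept.get(tid1, 0) + 1
            rows.append((tid1, tid2, weight))
    with conn:  # replace the old relations in a single transaction
        conn.execute("DELETE FROM topic_relation")
        conn.executemany("INSERT INTO topic_relation VALUES (?, ?, ?)", rows)


def get_related(conn, tid):
    """Request-time lookup: read the precomputed relations, strongest first."""
    cur = conn.execute(
        "SELECT tid2 FROM topic_relation WHERE tid1 = ? ORDER BY weight DESC, tid2",
        (tid,),
    )
    return [row[0] for row in cur]


def demo_connection():
    """Build a tiny in-memory data set (hypothetical schema and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node_topic (nid INTEGER, tid INTEGER, inform_score REAL)"
    )
    conn.execute(
        "CREATE TABLE topic_relation (tid1 INTEGER, tid2 INTEGER, weight REAL)"
    )
    conn.execute("CREATE INDEX topic_relation_tid1 ON topic_relation (tid1)")
    conn.executemany(
        "INSERT INTO node_topic VALUES (?, ?, ?)",
        [
            (1, 10, 0.9), (1, 20, 0.8), (1, 30, 0.1),
            (2, 10, 0.5), (2, 20, 0.6),
            (3, 10, 0.4), (3, 30, 0.2),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    rebuild_relations(conn, per_topic=2)
    print(get_related(conn, 10))
```

The trade-off is staleness: related topics only change when the job runs, which seems acceptable for a feature described earlier in this thread as non-essential.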
Comment #8
JeremyFrench commented
Here is a patch with a few things that have been worked on locally to improve overall performance. It covers the first three points above.
Comment #9
rgristroph commented
This patch is very comprehensive and well tested in a large environment. It is the accumulation of a rewrite and several months of testing, and I suggest that we make it the next dev release.
There is the patch, and then two additional files providing drush commands and a default view.
This addresses most of the scalability issues by not using the core taxonomy module. The drush commands allow for background processing and various types of inspection. There is a new hook, hook_inform_request, which allows you to implement custom data manipulation on the node body before it is sent to inform.
I think I have CVS commit access, but I have not tried it yet. I can commit this if it seems like the right thing to do.
--Rob
Comment #10
JeremyFrench commented
Go for it. Well done.