Hi,

I've been using Biblio for the last year on a university department website. The site has been live for a few months now and users are putting lots of publications in. However, over the past month our server keeps falling over: a machine with dual quad-core processors, 4GB of RAM and a striped RAID, and it's only running Drupal, nothing else. If several requests are made for a Biblio listing one after the other (such as when I get hit by Google or Yahoo, who ignore my robots.txt!), it causes absolute mayhem. For some reason the Biblio view eats up vast amounts of RAM in each Apache process and drives mysqld up to 100% CPU usage. Eventually mysqld seems to get stuck in some bizarre infinite loop, not actually doing anything. This only happens when consecutive requests are made to Biblio, not any other page.

Is there anything anyone can think of that I can do to optimise the Biblio queries in some way? I've checked, and it looks like there are several Biblio-specific indexes in the database, so I assume the queries themselves are fairly sane?

Originally I had Biblio set to display 25 results per page, so I dropped it to 10, but that doesn't appear to make any difference to performance.

Please let me know if anyone else is having similar issues.

Many Thanks,
Matt.

File attached in comment #5: biblio_add_plus_groups.GIF (10.69 KB) by cminor9

Comments

Kador’s picture

Hi,

I'm currently running Biblio with 11,000 records and things are indeed getting slower (Sun Solaris machine). One place I've noticed slowness is the filter form, which takes a long time to load. I've looked a bit at the _get_biblio_filters() function and it seems to take quite a long time; it also appears to load and parse every node from the database just to get the possible filter values.

One simple optimisation seems to be adding a DISTINCT clause to the following query: SELECT $taxo_fields FROM $table $taxo_joins ORDER BY $taxo_order. This cut the filter form's page load time by about 5 seconds.
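
In code, the change amounts to something like this (using the variable names already present in _get_biblio_filters(); consider it a sketch rather than a tested patch):

// Before: one row comes back per node, duplicating the filter values.
// $result = db_query("SELECT $taxo_fields FROM $table $taxo_joins ORDER BY $taxo_order");
// After: DISTINCT collapses the result down to the unique filter values.
$result = db_query("SELECT DISTINCT $taxo_fields FROM $table $taxo_joins ORDER BY $taxo_order");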

I've been looking at the first query too (SELECT $fields FROM $table $join ORDER BY $order), and I wonder whether it wouldn't be better served by three queries: one to determine the years, one for the authors, and one for the types. I've run some tests on the database and I think this might help, but I haven't tested it thoroughly.
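
Roughly what I have in mind is sketched below. The table and column names are my guesses at the Biblio schema, so treat this as hypothetical rather than tested:

// Three small DISTINCT queries instead of one wide join; each one only
// touches the column it needs. Table/column names are assumptions.
$years   = db_query("SELECT DISTINCT biblio_year FROM {biblio} ORDER BY biblio_year DESC");
$types   = db_query("SELECT DISTINCT tid, name FROM {biblio_types} ORDER BY name ASC");
$authors = db_query("SELECT DISTINCT lastname FROM {biblio_contributor_data} ORDER BY lastname ASC");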

Of course, this only helps the filter form. Not sure what can be done to speed up the list records view.

cminor9’s picture

Version: 5.x-1.9 » 6.x-1.x-dev

For what it's worth, I am having the same problems. It looks like there are two queries causing trouble:

SELECT DISTINCT n.*, b.*, bt.name AS biblio_type_name
FROM node n
LEFT JOIN biblio b ON n.vid = b.vid
INNER JOIN biblio_types bt ON b.biblio_type = bt.tid
WHERE n.type = 'biblio'
ORDER BY b.biblio_year DESC, b.biblio_date DESC, SUBSTR(n.title,1,3) ASC
LIMIT 0, 15;

which takes well over 30 seconds to run. I added an index on biblio.biblio_type (since it's used in a join), but that didn't help.

and

SELECT t.word AS realword, i.word 
FROM search_total t 
LEFT JOIN search_index i ON t.word = i.word 
WHERE i.word IS NULL;

which only takes two seconds but runs a LOT (I suspect it runs on each iteration of a Biblio import, which could mean thousands of executions).

Looking at the first query, I am automatically suspicious of any query that uses * in the SELECT clause. If nothing else, returning all those unneeded columns is a waste of bandwidth.

Biblio is doing what I need for a project I am working on and seems to be very well thought out. But when I run an import, things grind to a halt. Since importing a bibliography can take serious time (hours!), that hurts performance for a considerable stretch. Right now I have over 45,000 records, and may eventually have millions of rows. Clearly, performance problems at only 45K rows are a bad sign. I hope either I or someone else can solve this performance problem; if I get it figured out I'll certainly share my results.

rjerome’s picture

I don't doubt there could be some optimization of the queries, and that is getting very close to the top of my TODO list.

I'm guessing the inclusion of the title (SUBSTR(n.title,1,3) ASC) in the ordering of that first query isn't helping much. What if you were to remove it? (You'll find it around line 178 in biblio.pages.inc.)
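
The edit would look something like this (a sketch from memory; the exact variable name in your copy may differ):

// Around line 178 of biblio.pages.inc: drop the SUBSTR() term so MySQL
// only has to sort on the biblio columns.
// $order = "b.biblio_year DESC, b.biblio_date DESC, SUBSTR(n.title,1,3) ASC";
$order = "b.biblio_year DESC, b.biblio_date DESC";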

When you say an import takes hours, how many records does that entail? What file format are you using? I hadn't thought of the ramifications that re-indexing the search data might have on bulk imports; I'll definitely look into eliminating that during file imports. In the meantime, you can turn that indexing off on the "admin/settings/biblio" page, in the "Search" category, where it says: "Re-/Index a biblio node when creating or updating."

Could you give a brief description of your hardware/software stack?

I haven't tested with more than about 10-15K records, but at that level, on a relatively modest machine (dual 2.4GHz Xeon with 2GB of RAM, running Red Hat Enterprise Linux 5.2), I'm seeing nowhere near 30-second load times.

cminor9’s picture

Hey there rjerome, just noticed your reply. Thank you for getting back to me.

I am running Drupal on a dev box with a single 2.2GHz P4 and 1GB of RAM. A modest box, for sure, but it's a dev box. It's running Debian Lenny, PHP 5.2.6, and MySQL 5.0.5.

First, importing: on a single import, I did ~35K rows. I foresee this being a fairly typical use case based on sample data I have been given. The project I am working on can be found at http://biodiversitylibrary.org, associated with Encyclopedia of Life (and the Lifedesk part of that site is already using Biblio, AFAIK). We are working on a new feature to allow researchers or research teams to upload bibliographies.

I'll try turning off the indexing, thanks for the tip.

Second, the query: the original query's EXPLAIN plan showed no table scans or anything horrid like that, which backed up my initial assessment that the Biblio DB is pretty well designed and indexed.

I found that the following query ran very quickly and, more importantly, had a good execution plan (range, eq_ref, eq_ref for the joins):

SELECT b.biblio_date, b.biblio_year, n.title, bt.name AS biblio_type_name
FROM biblio b
LEFT JOIN biblio_types bt ON b.biblio_type = bt.tid
INNER JOIN node n ON b.vid = n.vid
WHERE n.type = 'biblio'
ORDER BY b.biblio_year DESC, b.biblio_date DESC, n.title ASC
LIMIT 0, 15;

Really, this query resulted in *no* change to the execution plan from the original. But it DID cut the response time to about 1.04 seconds, as opposed to 30+. That makes me think the root cause is that the SELECT * was simply returning more data than the read buffer could handle in one shot (read_buffer_size is set to 1M on my box). So I figured reducing the output would fit the bill... for now.

Now, I know I need more columns in the SELECT clause, but I don't know which ones. I haven't yet traced the query through the PHP code to find out, and I haven't totally wrapped my head around biblio_build_query() yet. Obviously, the law of unintended consequences could come back to bite me if I leave off a column that is needed. What is the minimal set of columns that can be selected? If I can find that out, I'll try this query in the code and see what happens.

Back to the read buffer thing. I assume (but don't know) that the read buffer is filled before the LIMIT kicks in; in other words, the LIMIT is the last thing processed. That would be a problem as data sets grow really large, since a search could easily find 50K matches. Obviously, you need ORDER BY and LIMIT if you want a usable search with pagination and so on. OTOH, I have never seen a DB search that worked great on millions of rows; DB searches have their limitations. So that got me thinking: why not look at SOLR/Lucene to index this data? Have you thought about that, or is that something you have any interest in?
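
For what it's worth, one standard MySQL trick for exactly this ORDER BY ... LIMIT pattern is a "deferred join": sort and limit over the narrow key column first, then join back for the wide rows. I haven't tried this against Biblio, and the table and column names below are just lifted from the queries quoted above, so treat it as a sketch:

// Deferred join: find the 15 winning vids first over narrow rows, then
// fetch the wide node/biblio columns for only those 15. Untested here.
$result = db_query("SELECT n.*, b.*, bt.name AS biblio_type_name
                    FROM (SELECT b2.vid
                          FROM {biblio} b2
                          INNER JOIN {node} n2 ON n2.vid = b2.vid AND n2.type = 'biblio'
                          ORDER BY b2.biblio_year DESC, b2.biblio_date DESC
                          LIMIT 0, 15) ids
                    INNER JOIN {node} n ON n.vid = ids.vid
                    INNER JOIN {biblio} b ON b.vid = ids.vid
                    LEFT JOIN {biblio_types} bt ON bt.tid = b.biblio_type
                    ORDER BY b.biblio_year DESC, b.biblio_date DESC");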

cminor9’s picture

Attached: biblio_add_plus_groups.GIF (10.69 KB)

Ok, so the whole SOLR thing was a lame question. I got that up and running, mostly because there already IS a SOLR module for Drupal.

Been getting more and more familiar with Drupal (I'm kind of new to it). I have added the Organic Groups module, along with some others. When I go to add an item to Biblio, I now have a little block for selecting a group. Is there any reason I can't seem to get this working for the import feature? No matter what I do, it doesn't seem to work; I'm guessing it wasn't intended to. I have attached a picture of the add/edit Biblio item page.

rjerome’s picture

That's a whole other issue... You see, the main input form is built through the "standard" Drupal form hooks, so when it is assembled, other modules get the opportunity to add items to it before it is rendered. The import form, on the other hand, is specific to the Biblio module and thus doesn't get the same treatment. I might be able to implement something with the form_alter hook; I'll look into it.
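
Something along these lines is what I have in mind (purely hypothetical: 'biblio_import_form' is a guess at the form ID, and og_all_groups_options() is an assumption about what the og module exposes):

// Hypothetical sketch: bolt a group selector onto the import form via
// hook_form_alter. The form ID and the og helper are assumptions.
function mymodule_form_alter(&$form, &$form_state, $form_id) {
  if ($form_id == 'biblio_import_form') {
    $form['og_groups'] = array(
      '#type' => 'select',
      '#title' => t('Groups'),
      '#options' => og_all_groups_options(), // assumption: provided by og.module
    );
    // A submit handler would still need to attach the chosen group to
    // each node created during the import.
  }
}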

Ron.

cminor9’s picture

Ron, thanks for the responses. Agreed, the groups thing was a separate issue; I guess I kind of shoehorned it into this conversation.

I just wanted to say, too, that turning off the indexing on import really helped, as does your new batch import feature. Thanks a lot!

rjerome’s picture

Is that groups thing a show stopper?

Adding a call to the hook that the node module uses to build the main input form would open the door to all modules adding stuff to that form, which could get messy.

bekasu’s picture

Status: Active » Closed (fixed)

Marking closed.
bekasu