The query

db_result(db_query_range("SELECT CASE WHEN status=1 THEN 0 ELSE 1 END FROM {access} WHERE type = '%s' AND LOWER('%s') LIKE LOWER(mask) ORDER BY status DESC", $type, $mask, 0, 1));

Is the the slowest query on drupaldb.drupal.org (after all the pager queries and the search queries have been sent to the slave).

This is probably due to the " AND LOWER('%s') LIKE LOWER(mask) " part.

We should fix this query.

mask can be an IP, a username, or an email address.

I suggest we make sure these are all lowercase in the database and then we can omit the LOWER() in the query by using tolower() on the argument.

Comments

killes@www.drop.org’s picture

Title: Replace quers in drupal_is_denied by somethign with better performance » Replace quers in drupal_is_denied by something with better performance

The real problem appears to be the ORDER BY without an appropriate index:

mysql> explain SELECT CASE WHEN status=1 THEN 0 ELSE 1 END FROM access WHERE type = 'XXX' AND 'XXX' LIKE mask ORDER BY status DESC LIMIT 0, 1;
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------+
| 1 | SIMPLE | access | ALL | NULL | NULL | NULL | NULL | 125 | Using where; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------+

killes@www.drop.org’s picture

Status: Active » Needs review
FileSize
1.68 KB

Here's an attempt at a patch, needs testing with pgsql.

killes@www.drop.org’s picture

FileSize
1.69 KB

No reason to not do a limited query, reintroducing db_query_range

snufkin’s picture

tested on pgsql with banning email and username, both worked fine. of course non-banneds can register/enter.

hswong3i’s picture

first of all, this query will NEVER function within DB2 as i mentioned before: http://drupal.org/node/168403#comment-291944. i have no idea for pgsql now (even though i will start my studying about pgsql after beta1 release), therefore sorry that i can't comment about its correctness under pgsql :(

P.S. sorry for off topic: i have proposed another issue, which not only letting this part function, but even improve its features, plus none of performance degrade: http://drupal.org/node/168403. as that implementation is database and query independent, it should consider as an alternative solution of this fancy query problem.

BTW, as chx mentioned (http://drupal.org/node/165160#comment-298860), solving problem with a much complicated solution shouldn't be a good choice for us, especially after code freeze and JUST before beta release ;p

hswong3i’s picture

some minor footnote: since we MUST call drupal_is_denied() for EVERY cached or non-cached pages, this is a static overhead. BTW, as MySQL is not a suitable source material for enterprise-scale clustering solution, but the one and the only one and the "most suitable" choice under our current implementation (PgSQL still need more love...), we should try to reduce some query-level overhead. it should always be a better choice, if we are able to implement similar feature (or even better) in PHP level, as web server clustering is something that more easy to be implemented (e.g. with apache).

on the other hand, db_query_range() will always introduce more database-level and abstract-level overhead, based on different database requirement and driver implementation. world is always not as perfect as case of MySQL and PgSQL, with LIMIT + OFFSET supporting. most database require to implement db_query_range() with 2~3 level subqueries (please refer to http://drupal.org/node/39260 and http://drupal.org/node/165788 for latest Oracle/DB2 drivers implementation, and http://phplens.com/lens/adodb/tips_portable_sql.htm for a more complete picture of database specific limitation). implement a feature with fancy query will always introduce cross database compatibility problem.

finally, as rules are now defined in SQL format (and used for compare, directly), it may easily cause SQL injection. it should be something that we always hope to escape from it :)

chx’s picture

Status: Needs review » Reviewed & tested by the community

based on #4 and my feeble attempts with mysql this one reads OK. hswong3i, there is no SQL injection as both the select and the insert uses proper escaping, just because we are using mask as something that contains SQL wildcards that hardly means security hole.

moshe weitzman’s picture

Title: Replace quers in drupal_is_denied by something with better performance » Optimize query in drupal_is_denied

Looks good to me as well.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Needs review

Hwsong3i: your alternate patch (http://drupal.org/node/168403) is for D7, it includes lots of new features and changes. This patch is quite simple and clean. Maybe you can help people actually making it DB2 compatible. It seems that the comment you point to talks about LIKE usage. So seems like if we swap the LOWER('%s') LIKE LOWER(mask) to LOWER(mask) LIKE LOWER('%s') then it should be fine. Any input to bring this patch forward?

hswong3i’s picture

@Gabor: please don't swap them: that is what we are actually needed for. I try to swap it once before, and so break our handling (and, rollback...: http://drupal.org/node/165160). On the other hand, I am not trying to introduce my other patch for D6, as it seems to be (but actually not) too complicated. I will try to improve that API for D7, so we can make it useful for many cases, e.g. block visibility control with "Apache-mode" (http://drupal.org/node/171131): a feature that I have dreamed for more than years :)

I guess committing this simple patch will be the only choose of our current status, even though it isn't the most perfect solution. DB2 should be something that need not to be consider within D6 official life cycle (I may try for something "unofficial", but please simply forget that right now).

Anyway, I am really question about the effect of this patch: can't it really help? I may try for some benchmarking and so prove its effectiveness. I will really suggest not to use db_query_range() if it is possibile, as it may introduce unnecessary complicated abstraction handling (even though it seems to be quite simple within MySQL implementation), which MUST slower than normal simple query :)

killes@www.drop.org’s picture

Status: Needs review » Reviewed & tested by the community

This patch has nothing to do with hswong's DB2 effort. It won't improve his situation, but it won't make it worse either.

I am afraid his comments in this issue did obfuscate the real topic somewhat.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

hwsong3i: The patch does not introduce a db_query_range(), this was there before the patch. The patch however removes a CASE and an ORDER BY on non-indexed fields in favor of a simpler WHERE check, which makes for a smaller result set to do the range limiting on. Whether the range limit helps performance or not probably depends on multiple factors: eg. bigger result set vs. more time to do the limiting. In the db platforms core supports, the limit only chops off the list at the first item (there is no ordering or anything else to do), so there is not much a time cost to do the limit.

That said, I committed the patch. Feel free to come up with benchmarks if you strongly feel the query is speedier without the limit.

hswong3i’s picture

according to my interest about performance different, with different query implementation, i try out a benchmarking for both CVS (before commit), #2 and #3. the benchmarking reuse previous environment (http://drupal.org/node/172541#comment-298711), with 1000 access rule generated by a simple random script (http://edin.no-ip.com/html/?q=node/313). the test preform for 3 different concurrency level with each 3 times of testing (so comes with totally 27 result sets). the result shown as below:

                  +--------------+---------------+---------------+
                  | c1, n500     | c5, n500      | c10, n500     |
+-----------------+--------------+---------------+---------------+
| cvs             | 290 +- 13.6  | 1449 +- 277.4 | 2870 +- 447.5 |
+                 +--------------+---------------+---------------+
|                 | 291 +- 9.6   | 1422 +- 201.5 | 2831 +- 773.3 |
+                 +--------------+---------------+---------------+
|                 | 292 +- 14.3  | 1432 +- 220.9 | 2835 +- 647.6 |
+-----------------+--------------+---------------+---------------+
| is_denied       | 292 +- 13.9  | 1448 +- 245.9 | 2849 +- 757.3 |
+                 +--------------+---------------+---------------+
|                 | 306 +- 254.7 | 1425 +- 239.2 | 2820 +- 525.5 |
+                 +--------------+---------------+---------------+
|                 | 292 +- 12.9  | 1440 +- 288.4 | 2967 +- 737.2 |
+-----------------+--------------+---------------+---------------+
| is_denied_0     | 292 +- 12.0  | 1450 +- 245.6 | 2872 +- 818.9 |
+                 +--------------+---------------+---------------+
|                 | 293 +- 27.3  | 1449 +- 233.0 | 2900 +- 686.5 |
+                 +--------------+---------------+---------------+
|                 | 293 +- 16.7  | 1446 +- 212.8 | 2862 +- 801.6 |
+-----------------+--------------+---------------+---------------+

i can't believe about the result... it is almost without any different... we may seems this benchmarking as not solid, since rules are randomly generated, none of them will be matched and so always return as allowed. it may not show off the differences between db_query() and db_query_range(), as both return empty result set :(

BTW, i guess we should also consider this benchmarking result from other point of view: we usually duel with limited amount of rules (says within 50, i guess), but not in thousand-scale. the benchmarking don't show for any significant performance different, so should we consider the bottleneck from other point of views? e.g.:

  1. as we just hope to check if there are any rules matched, should we replace it as COUNT(*) but not range query?
  2. maybe the bottleneck is located in the LOWER('%s') LIKE LOWER(mask), as we are able to define mask with SQL % and _ format? since this may introduce a NxN comparison?

it should be no question about patch committed: code clean up is always preferred. i am not sure about the effectiveness of this patch, as i am only able to test it within an enclosed benchmarking environment. the effect may only show off when we are facing a huge project with huge loading as drupal.org, with some accesses denied. i feel so sorry that the benchmarking can't provide a relatively solid supporting for our changes. i am totally agree about optimizing drupal_is_denied() for a better performance, but seems we may still need some more effort and love about it?

Anonymous’s picture

Status: Fixed » Closed (fixed)
killes@www.drop.org’s picture

Status: Closed (fixed) » Active

The patch was an improvement as it removed an unneccessary ORDER BY.

However, the new query still shows up in the slow query log on drupal.org.

The LOWER part is probably to blame for this.

killes@www.drop.org’s picture

Status: Active » Needs review
FileSize
1.1 KB

here is an idea that might help.

The "host" check is done for each page view, the others only on log in or account creation. It makes thus sense to optimze the host check separately.

The patch does this my omitting the LOWER part for this type of check. IPs are numbers after all.

I also wonder if it would make sense to add an index on the "type" column.

killes@www.drop.org’s picture

FileSize
2.75 KB

After some discussion with Larry, I revert to storing all rules as lower case. No, this is not "user data".

I also add an index on "type".

killes@www.drop.org’s picture

FileSize
3.31 KB

oops, some bugs crept in...

catch’s picture

Status: Needs review » Closed (duplicate)

Marking this as duplicate of: http://drupal.org/node/181625

Yes this is the older issue, but the approach is similar, and 181625 deals with more occurrences. I'll copy killes' last patch over there for easier review.

catch’s picture

Status: Closed (duplicate) » Needs review

No, no it's not.

hswong3i’s picture

Title: Optimize query in drupal_is_denied » Optimize query in drupal_is_denied: remove LOWER(), add index
Assigned: Unassigned » hswong3i
Priority: Normal » Critical
FileSize
4.41 KB

The handling is already fully discussed in LOWER() issue (http://drupal.org/node/83738): it is not necessary to save natural case of access.mask, which indexing will be broken by LOWER() during runtime. On the other hand, type and mask should be indexed in order to boost up performance if possible.

This is a critical issue which degrade overall performance. Should we consider this issue as individual from user.name handling, and commit this bug fix on time? This should delay no more.

Patch reroll from #18 via latest CVA HEAD, plus some required comment and schema documentation. It is tested for both D6 flesh install, and D5.3 -> D6 upgrade with MySQL with no error. It is ready for commit.

hswong3i’s picture

Title: Optimize query in drupal_is_denied: remove LOWER(), add index » Optimize query in drupal_is_denied: remove LOWER(), add indexes
hswong3i’s picture

FileSize
5.13 KB

minor update: add strtolower() in user_block_ip_action()

Shiny’s picture

tested on postgres 8.2.5 adding access, with and without mask works. no sql errors.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

code looks fine. based on my review and the test report in #24, is RTBC

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Needs review

hswong3i: why do we need to strtolower() the IP address in user_block_ip_action()?

hswong3i’s picture

@gabor: I just clone this handling from chx's patch in http://drupal.org/node/83738. I don't really think it is meaningful for strtolower($ip); BTW, as chx do so, I guess it should have its meaning ;p

chx’s picture

because of IP6, obviously.

chx’s picture

Status: Needs review » Reviewed & tested by the community

IPv6 , doh!

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Needs work

Guys, this is not going to work. I realized this, while reviewing the other lower() issue (user name lowercase checks).

1. The update stuff does: update_sql('UPDATE {access} SET mask = LOWER(mask)');
2. The runtime code does: db_query_range("SELECT ... '%s' LIKE mask", ... strtolower($mask), 0, 1));

LOWER() is a clever collation aware function, which means it takes the collations used into account (ie. lowers UTF-8 stuff properly) while strtolower() is a dumb ASCII lowercase function, which (no wonder works faster) does not lower anything beyond ASCII chars.

Then even if we use strtolower() in the update, running row by row, we still get inconsistent results, as a filter for "depth" would match "DEPth", but a filter for "exposé" would not match "ExpoSÉ", as these are "exposé" and "exposÉ" after strtolower().

So we should actually use LOWER() as far as I see. We can use it when *setting* the value, and not when searching (instead of strtolower()-ing in PHP when setting the value) and we need to use it on the static value in the query, when searching which I hope the database server is clever enough to not do on each row (did not test myself now). So in short, we should move the string handling magic to the UTF-8 friendly environment.

hswong3i’s picture

Status: Needs work » Needs review
FileSize
5.16 KB

If strtolower() is not UTF-8 aware, how about drupal_strtolower()?

Dries’s picture

all: what is the performance gain obtained with the latest patch? Previous tests with a faster version of this patch actually demonstrate a negligible performance difference.

killes: when you said it is the slowest query on drupal.org -- do you meant that the aggregated time of this query makes it the most expensive query, or that one execution of this query is really expensive.

hswong3i’s picture

I think it is not easy to demonstrate the effect of this patch by benchmarking, unless apply it to a real site with huge traffic. BTW, as mentioned in #83738, LOWER() will ALWAYS cancel the effect of indexing; on the other hand, the benchmarking in #83738 also show the performance different between with and without an effective indexing. So if we may implement in some other way that indexing can ALWAYS functioning, maybe we should give a chance to it?

Gábor Hojtsy’s picture

Status: Needs review » Needs work

hswong3i: Refrain to drupal_strtolower() if we are fine with *requiring* the PHP mbstring extension for Drupal 6, or our code does funny things. If you look into drupal_strtolower(), it either works with the mbstring extension or falls back on a custom implementation with support for latin-1 only (not even suitable for this area of Europe, where I live, and not suitable for the Hong Kong official language either, which is relevant for you I guess).

I wonder why would a database not handle things such as SELECT ... WHERE column_with_lower_case_value = LOWER('literal string') nicely, with the index (the literal string only needs to lowercased once), so we can do UPDATE ... SET column_with_lower_case_value = LOWER('new value').

Again, drupal_strtolower() is far from being the same as LOWER(), so we cannot use these interchangably (ie. LOWER() in the update, drupal_strotolower() runtime).

hswong3i’s picture

FileSize
5.11 KB

Gabor: Thank you very much! How good idea it is! :)

Patch updated based on your suggestion. Test with MySQL, save rules in random case, and all converted to lower case. Top page access fine. It should be same effect as previous patch :)

Gábor Hojtsy’s picture

Status: Needs work » Needs review

OK, now that we seem to technically doing the same thing as before the patch (but moving the dynamic LOWER out of the runtime queries), let's get Dries' concerns/requests answered. I also find it a bit odd to have this performance "NOTE" on the interface. A note in the overview interface overview help text might be better. Anyway, as I don't have technical issues with this patch now, let's get Dries' concerns resolved, before taking time with moving help text around :)

hswong3i’s picture

FileSize
5.96 KB

Patch reroll via CVS HEAD, remove some meaningless comment, and add performance "NOTE" for 'admin/user/rules' help message. It should looks much clean and elegant :)

wmostrey’s picture

Patch applies cleanly, nice work!

hswong3i’s picture

Is it able for RTBC?

wmostrey’s picture

If someone else can test and confirm, it is.

Gábor Hojtsy’s picture

Well, I don't see how this comes to RTBC without answering Dries (http://drupal.org/node/174025#comment-625509).

wmostrey’s picture

The patch itself works, I don't have the set-up to test its actual performance.

hswong3i’s picture

This patch is always "theoretically" correct, where not easy to test it out with benchmarking, as mentioned in http://drupal.org/node/174025#comment-625517. BTW, I can't help about "how is it actually affect drupal.org's performance?" :'(

Gábor Hojtsy’s picture

Version: 6.x-dev » 7.x-dev

Seems like no interesting in validating this patch, and it is past the December 1st deadline, not even being part of the exception list for beta 4, so happy hacking with this in Drupal 7. Thanks for your efforts in making Drupal more performant, it is appreciated. Please understand that we are working towards a release, and it is about stabilizing what is there, and not modifying behaviors.

Fordi’s picture

AFAIK, the 'LIKE' operator operates caselessly on TEXT symbols. Is `mask` a BLOB? Is that why we're explicitly casting both sides to lower case? The other side, '%s', is already explicitly cast to a string; LOWER() shouldn't be needed at all.

kbahey’s picture

@all

I did not see this query come up as a slow query on several sites I run or consulted for. Perhaps d.o is unique because the access table is large from blocking crawlers or bad PCs, but normally that table is not large on other sites. I would be interested in seeing how many rows it has on d.o.

@Fordi

What you say is fine for MySQL. It is case insensitive and the LOWER() is superfluous. However, PostgreSQL is case sensitive and hence the LOWER is needed.

catch’s picture

Status: Needs review » Closed (duplicate)

Query was refactored and override put in here: http://drupal.org/node/228594