We currently have an index on the name column of the users table, but it isn't doing us any good because the typical name-based user queries look for SELECT * FROM users WHERE LOWER(name) = LOWER('%s'). This effectively blocks the database from using the index and forces it to iterate through all user rows, perform LOWER on the name, and do the string comparison. This is a significant performance bottleneck in the following situations:

  • user login
  • user_load(array('name' => 'foobar'));
  • new user validation
  • user search
  • user autocomplete

The motivation for using LOWER is that usernames should be case insensitive. They should be case insensitive to prevent people from spoofing other users' identies (you're "Big_chief" on your site and someone registers as "Big_Chief" and starts acting like the site owner).

The solution I am proposing involves adding a column to the users table, name_lower, which contains an already lowercase version of name. The name_lower column has an index and the index actually gets used because the queries are now written like this:
SELECT * FROM users WHERE name_lower = LOWER('%s').

To test whether the speed increase would be worthwhile, I created a test database with 500K users. Here is the before and after:

Before name_lower

Table size: 66.76 MB.

1237.3 0 eval select * from users where LOWER(name) = LOWER('cfcd208495d565ef66e7dff9f98764da')
1295.39 0 eval select * from users where LOWER(name) = LOWER('c4ca4238a0b923820dcc509a6f75849b')
1287.55 0 eval select * from users where LOWER(name) = LOWER('c81e728d9d4c2f636f067f89cc14862c')
1266.64 0 eval select * from users where LOWER(name) = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3')
1273.97 0 eval select * from users where LOWER(name) = LOWER('a87ff679a2f3e71d9181a67b7542122c')

After name_lower

Table size: 76.29 MB.

0.53 0 eval select * from users where name_lower = LOWER('cfcd208495d565ef66e7dff9f98764da')
0.38 0 eval select * from users where name_lower = LOWER('c4ca4238a0b923820dcc509a6f75849b')
0.39 0 eval select * from users where name_lower = LOWER('c81e728d9d4c2f636f067f89cc14862c')
0.36 0 eval select * from users where name_lower = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3')
0.37 0 eval select * from users where name_lower = LOWER('a87ff679a2f3e71d9181a67b7542122c')

So in a typical case we're comparing 1273.97 ms to 0.37 ms. It is a big difference.

Remaining questions

  1. Where do UTF-8 and collations fit into all of this? I'm not in a position to speak to these issues with authority.
  2. Are there other solutions that don't introduce extra columns of essentially duplicated data?
  3. Will this work as well on Postgres?

I have rolled the patch against 4.7 but will provide patches against HEAD as requested. I'm hoping to get feedback and garner discussion before I spend more time rolling patches and benchmarking.

Oh, and you really have to try the user autocomplete field before and after this patch! Where it once felt sluggish, and often fell behind the user's typing, now it is brilliantly fast, even with 500K users!!!

Comments

chx’s picture

Version:4.7.3» x.y.z
Priority:Normal» Critical
Status:Active» Needs review

This is very true. Taxonomy should follow. I can attest that the slowest query (far surpassing search and all) at a huge site I run is the LOWER(name) query of taxonomy.

chx’s picture

Why critical? Two reasons. One, the autocomplete Robert mentions and the other is the taxonomy that should follow -- have you ever tried taxonomy pages when your table has hundreds of thousands of terms?

Dries’s picture

This is not critical, IMO. Also, the proposed solution is a hack, imo.

robertDouglass’s picture

I'd be thrilled with an alternate solution. But if we cannot get the same performance gain any other way, I am fully comfortable with the proposed solution as well. What are our alternatives?

Dries’s picture

By default, MySQL searches are not case sensitive so the lower()s could be removed. Unfortunately, PostgreSQL is case sensitive so it can't be removed. Maybe there is a clean way to make PostgreSQL not case sensitive?

Dries’s picture

Crazy idea (not tested): can we create an index on lower(field)?

CREATE INDEX my_index ON my_table (LOWER(name));

robertDouglass’s picture

I'll give it a try =)

robertDouglass’s picture

That would be for Postgres, right? (guess I should learn how to use it) It doesn't seem like it would do any good to have the index be lowercase since MySQL can't use the index to begin with. And the goal of creating the index that way would have to be to force Postgres into behaving like MySQL (ie case insensitive search). It seems like the best solution would be if Postgres had some setting that would configure the behavior of their string comparison operator. Lacking that, I don't see another solution (other than whipping out the database-specific queries... yuck).

robertDouglass’s picture

Just to clarify the problem once more: any time we write SQL that DOES_SOMETHING(column-name), eg. LOWER(name), there is no chance for existing indexes of any nature to be used. Forming the index differently won't correct the problem, only taking the function off of the column name can help.

webchick’s picture

Not sure...

CREATE INDEX my_index ON my_table (LOWER(name));

My mad DB skillz are a bit rusty, but it strikes me that even if one or two oddball DBMS systems support something like that, it's going to be the exception rather than the rule. This is especially important now that people are active developing Oracle and MS SQL abstraction layers for Drupal.

An extra column, on the other hand, is something that can be used across the board.

Gerhard Killesreiter’s picture

Just a comment: I don't think that adding a lowered version of the column is a particularly bad hack. it is about similiarly hacky as the introduction of the node_comment_statistics table.

dww’s picture

we're talking 4 or 5 orders of magnitude faster, here. ;) i've seen far worse hacks for far less benefit. i agree w/ killes, this doesn't strike me as particularly hacky at all, and storing an already-LOWER() copy of name seems much more portable than playing with various DB-specific hacks.

that said, a little bit of info from the pgsql docs and google results:

i'll leave it upto the core committers to argue about portability vs. simplicity vs. hackery. personally, i think storing the LOWER(name) as a seperate column is just fine, and will definitely be the easiest to port (if slightly inefficient w/ space in the DB).

m3avrck’s picture

What if we had a global variable ... kind of like $connection, but with the $db_type (or do we have this? I forget):

if mysql:
select * from users WHERE name = 'bob'

else if pgsql:
select * from users WHERE lower(name) == lower('bob')

Wouldn't that be the cleanest way?

chx’s picture

my 2cents are on the separate column. you are heading in the a very dark woods with that if $db_type -- what happens if someone else wants to write a similar query ? Multiple that if? brrrr.

webchick’s picture

The problem with that approach is that core can conceivably have many more than just mysql and postgresql supported. There are patches in the queue to add Oracle and MS SQL support, for example. And only a matter of time before someone does an SQLite port, or a Sybase port, or a DB2, etc. That kind of database-specific code will get unmanagable really, really fast.

webchick’s picture

Also, then postgres and other database systems don't benefit from the incredible performance benefit of not having to lowercase the fields.

So yep, duplicate column still seems the best way to go for now. :)

m3avrck’s picture

I concur about the if $db_type, just playing devils advocate there.

Way back when I talked with Robert about this I actually suggested the extra "lower" column in the first place as probably the best solution. And I think it still stands.

Still doesn't hurt to throw out bad suggestions ... makes the best solution look all that much better :-p

I still agree with my original thought (and the path this patch took), have a separate "lower" column.

jvandyk’s picture

I took the extra column approach a while ago with a very popular site. I think the payoff is worth the extra column.

drinkypoo’s picture

Just in case anyone needed further confirmation, I just applied this patch, and now autocomplete is actually usable. (I'm on a shared host, so it was pretty useless before.)

dww’s picture

Status:Needs review» Needs work

finally had a chance to look closely at the patch itself. ;) mostly, it seems fine on a quick skim (i haven't tried testing it personally). however, system_update_183() is obviously missing the pgsql cases, so this needs work before it could be committed...

robertDouglass’s picture

@dww - did you test on HEAD or 4.7? I rolled it against 4.7 and haven't had time to apply it to HEAD.

dww’s picture

i didn't test at all. i just visually inspected that there's no 'pgsql' case in system_update_183() in http://drupal.org/files/issues/name_lower.patch

speaking of which, if this issue is targetting the trunk (as the version currently states), that update # needs to be changed, too...

sammys’s picture

-1 to the patch as it can't be tested on pgsql. Time for someone to practice some pgsql syntax! :)

Other than that, IMHO the concept is good and the extra 10% of space for 4 orders of magnitude more speed is worth it in core.

Sammy Spets
Synerger
http://synerger.com

sammys’s picture

Status:Needs work» Needs review
StatusFileSize
new5.03 KB

patch including pgsql update code for 4.7.

robertDouglass’s picture

This is getting rolled out on a client site today, so we'll have some reports from the wild for you soon.

Zen’s picture

Or you could just get rid of the "preserve case" feature and store the usernames in lower case right from the get-go? Capitalisation or w/e can be done in the presentation layer.. The only people who'll get ruffled by this are those with 1337 nicks like "deEp ThRo4T" or "Marquis de Sade" ..

Unless I'm missing the big picture here, this is a minor feature that we can just nix..

Just a thought.
-K

webchick’s picture

@Zen: I dislike that idea. It goes against Drupal's philosophy of not mangling user input.

I should be able to have the nick "webchick" and someone else should be able to have the nick "Joan of Arc" -- no amount of logic on the theme layer is going to allow both of those things to happen.

Zen’s picture

Also, if this feature needs to be retained, it might be more elegant to use PgSQL's operator overloading feature to accomplish this. Not sure if it's possible, but I'll try and ask Cvbge about it. The separate column approach is IMHO rather tacky.

Cheers,
-K

robertDouglass’s picture

It would be good to research some of the other rdbms solutions out there as well and see what issues we might run into if we do an Oracle or MSSQL port. The one shining advantage of the name_lower approach is that we can eat our cake and port it too. It is absolutely guaranteed to run the same on any database we use.

I'm not against a db specific approach, especially if we can apply it to the other major systems I mentioned. But the "hackishness" or "tackiness" or "this hurts my normalized eyes-ness" of the patch shouldn't prevent us from harvesting the performance gain. One way or another, this bottleneck needs to be resolved.

drumm’s picture

Version:x.y.z» 4.7.3

Looks like a 4.7 patch.

killes@www.drop.org’s picture

Version:4.7.3» x.y.z

The issue still exists in HEAD:

dww’s picture

i think neil's point is that the performance implications of this fix are so enormous that we could consider this a bug that deserves to be fixed in 4.7, in addition to being fixed in the trunk. i'm not going to move the version again, but i'd vote for fixing this in both branches.

pwolanin’s picture

(subscribing) I'd wondered why the autocomplete lookup of usernames was so painfully slow. I'll try to test this patch on a copy of my 4.7 site.

pwolanin’s picture

With this patch in place, username autocomplete is remarkably faster (qualitatively).

gopherspidey’s picture

Well at least for postgres, it looks like comment #6 is the correct way to help list problem. According to the postgres manual.

http://www.postgresql.org/docs/8.1/interactive/indexes-expressional.html

Since mysql is by default case insensitive, and does not seem to allow indexs on SQL functions. A hack around the problem, we could just remove the "LOWER()" from the where clauses in select statements before we make the query.

dww’s picture

re: pgsql -- yeah, as i linked in #12. however, instead of trying to find DB-specific code for this on every DB we currently (and in the future, might) support, we could just add the extra field. it's simple, it'll always work, we'll never have to port it, and it's 4 or 5 orders of magnitude faster than what we have now.

JohnAlbin’s picture

No one seems to be addressing Robert's comment about spoofing addresses.

The motivation for using LOWER is that usernames should be case insensitive [...] to prevent people from spoofing other users' identies (you're "Big_chief" on your site and someone registers as "Big_Chief" and starts acting like the site owner).

If you look at this comment (http://drupal.org/node/56487#comment-107050) on login spoofing:

AIM (AOL Instant Messenger) [...] allows spaces in logins, but "folds" them, so an AIM login of "JimBob32" is equivalent to "Jim Bob 32" is equivalent to "J i m B o b 3 2".

..we could extend the "second column idea" to be both index-able and less spoof-able.

So this indexable field would be created by lowercasing name and stripping all non-alphanumeric characters.

Ideally, this new value could be placed in name, with a new field called display_name which has the capitalization, spacing and punctuation the user prefers to be displayed in comments, etc. (As a bonus, you could allow apostrophes in the the display name for those named O'Brian). And since the display_name can be "folded" to get the name, you could still use the display name in logins and Drupal Authentication.

Of course, this would mean the ubiquitous $user->name would need to be changed to $user->display_name. And would affect user login validation and a slew of other components.

I don't know if this is feasible, but I thought I would make the suggestion before we implement a second column.

pwolanin’s picture

That is certainly a valuable suggestion, but I think that changing the meaning of $user->name is unlikely to find much support (at least prior to Drupal 6.0) due to the widespread breakage that would ensue. For now, it seems it would be pretty trivial to strip the whitespace, etc. from the entry in the name_lower column and that would make the system more robust.

robertDouglass’s picture

Nice suggestions. I think we'll most likely keep $user->name (as the display name), but I can definitely see stripping white space and non-alphnumeric chars from the name_lower column. This, like you said, would make it harder to spoof user names and easier to customize your own user name.

This is for a later patch, however, because there is still a lot of resistance to the idea of having a second column for lookups to begin with.

Hayakawa’s picture

Version:7.x-dev» x.y.z
Component:database system» user system
Status:Closed (duplicate)» Needs review

i was really lost in the comments. is this final version of patch after the suggestions? and are there any bottlenecks or imcompatibilities caused by this patch?
thank you very much

robertDouglass’s picture

@Hayakawa: this patch is ready to go, if it is decided that it is an approach that we want to pursue. If someone comes up with a different approach that people feel is a better architecture, it might win out. So far, many prominent developers have expressed support for the patch, and a couple have expressed misgivings. What that means is that a better solution is hoped for. But if you want to use the patch, I feel confident that it will work well for you. It is being used by the biggest client that I work for on their production site and has caused the performance gains that are hoped for.

JohnAlbin’s picture

If Robert's latest patch can be finalized and put into Drupal 5.0 (or even 4.7), I would recommend that we do it. It provides (right now) a substantial performance improvement.

And for those developers who are reticent to add a new column simply to get a performance gain, think of it this way...

In Drupal 6.0, the new column could be a trifecta for these issues:

  1. username lookup performance
  2. prevention of much username spoofing
  3. better display name capabilities (allowing apostrophes, ampersands, commas, etc.)
Steven’s picture

Stripping the lower column of whitespace is a neat idea, but could break existing sites. It would be cool if we could add this for new sites though.

One possible approach would be to only use literal lowercasing in system_update_N() to create the column, but to insert the stripped version for any new users from then on. Their usernames will be checked with the strict restrictions too.

That way we don't have to provide an option and we don't break any site.

pwolanin’s picture

@Steven - I'm not sure that would work, since you need a apply a function to lower case and strip whitespace before you run the query. If you don't know in advance whether or not to strip, that won't work.

Is it possible to provide a pre-update tool to check whether there will be conflicting usernames after the whitespace stripping? Or is there any way to provide an interactive process during the update?

sammys’s picture

FYI: Correct syntax for creating pgsql indexes for drupal tables is:

CREATE INDEX {table}_fieldname_idx ON {table} (fieldname)

moshe weitzman’s picture

Status:Needs review» Needs work

needs a 5.0 port.

but further, this is one of those updates which is difficult in our update syste, once you update your code, you can't login. we handled that at one time by having people manually update their sessions table. might be needed here too.

chx’s picture

Status:Needs work» Needs review
StatusFileSize
new4.96 KB

I rerolled for HEAD, was surprisingly easy, only hunk failed for user.module.

Why you can't login? I do not see it.

moshe weitzman’s picture

Status:Needs review» Reviewed & tested by the community

sorry, i was unclear. if a user is not logged as uid=1 after updating source, he is unable to login and perform update because the new column doesn't exist. however, i tested setting $access_check = FALSE in update.php and that works so is a good workaround for this situation. we'll just have to document this in the upgrade notes for 5.0

i tested chx' recent patch for functionality and update path. works well. rtbc.

Dries’s picture

Assigned:Unassigned» Dries
Category:bug» task
Priority:Critical» Normal

I still think this is not the cleanest approach. Also, I am going to benchmark this before it can get committed. I'm tempted to postpone this, as I think we might be able to do a better job with this ...

Changing the status from 'critical bug report' to 'normal task'. This doesn't break your website and therefore isn't critcal.

Dries’s picture

Status:Reviewed & tested by the community» Needs work

Also changing the status to code needs work. Will update the status once benchmarked.

robertDouglass’s picture

Let me know if there is a way I can help with the benchmarking.

coreb’s picture

Version:x.y.z» 5.x-dev

Moving out of the "x.y.z" queue to a real queue.

It doesn't appear to be a new feature, so I'm putting it in 5.x-dev.

m3avrck’s picture

Let's bring this back into the light -- lower() sucks :-)

Gary Feldman’s picture

I just came across a problem where users can't login, because the clause "LOWER(name) = LOWER('The Users Name') was failing when the name was in mixed case (on MySQL 4.1, Drupal 4.7.5). This seems to be related to this thread. I fixed it by simply removing the lower (for now) in user_load, trusting that the unique key will prevent duplicates that differ in case.

It seems incredible to me that a problem this severe could exist for this long (though given the number of keyboards without shift keys, perhaps it's not so incredible) . So I'm willing to assume for the moment that I'm missing something. If so, I'd appreciate it if someone could let me in on the secret. If not, should I raise the priority of this or submit it as a separate bug?

Gary Feldman’s picture

Removing the LOWER made the user id case sensitive. So I changed it to

   ...CONVERT($key using utf8) = '%s'";

This is specific to MySql 4.1.1 and higher.

The way I read the MySQL docs, the existing Drupal code using LOWER would work on previous versions of MySQL. Furthermore, I think it would work on a clean install of Drupal with later versions of MySQL, because the clean install sets users to be a varchar. It's the upgrade process that converts varchar columns to varbinary, for reasons I have yet to discern.

JohnAlbin’s picture

Version:5.x-dev» 6.x-dev

Hi Gary,

This thread is about a proposed performance improvement. And while the patch does remove the LOWER() MySQL function call, it's really unrelated to the bug you are experiencing.

Please, submit a new bug so that we can tackle both of these issues separately.

Thanks!

quicksketch’s picture

Revisiting Dries's idea of making pgSQL case insensitive. It's possible to override operators in PostGreSQL (so I've heard). If it would be possible to make the = operator case insensitive for varchar comparisions, maybe that would be better solution as it would work for all database columns and keep the entire problem contained in the pgsql database implementation.

From http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:49827:

Some operations on data can be case insensitive. So your solution is not case insensitive data, your
solution is a case insensitve operator. Luckily it is very easy to define your own operators in
PostgreSQL. For instance, this defines case-insensitive equality and inequality operators for text
datatypes:

CREATE FUNCTION case_insensitive_equality(text, text) RETURNS boolean
    AS 'SELECT LOWER($1) = LOWER($2)'
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;
CREATE FUNCTION case_insensitive_inequality(text, text) RETURNS boolean
    AS 'SELECT LOWER($1) <> LOWER($2)'

    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;
CREATE OPERATOR === (
    leftarg = text,
    rightarg = text,
    procedure = case_insensitive_equality,
    commutator = ===,
    negator= <=>
);
CREATE OPERATOR <=> (
    leftarg = text,
    rightarg = text,
    procedure = case_insensitive_inequality,

    commutator = <=>,
    negator= ===
);

I would recommend against overwriting the current = and <> operators with these operators because I
have no idea what the side-effects may be and I don't think they are indexable.

So it looks like we have some questions to answer here. They might not indexable, but surely this will offer a performance improvement over the LOWER() method currently in use which kills the indexes anyway for all database types.

dww’s picture

They might not indexable, but surely this will offer a performance improvement over the LOWER() method currently in use

huh? ;) read the definition of the === operator you just pasted above:

SELECT LOWER($1) = LOWER($2)

by defining your own case-insensitive equality operator, you're just hiding the LOWER() calls, not removing them.

kbahey’s picture

@robertDouglass

Are you sure the query cache is not a factor in the improved results at the top of this page? Try restarting MySQL between tests and see.

@All
This approach is one of the no-nos in database design. The reason is that you have data redundancy and one field is a product of the other in the same row. Things get out of hand if one part of the application would update one field but forget to update the other, and then you have data integrity issues and all sorts of problems.

This may not be such an issue if we only have one place for the INSERT and one place for the UPDATE of the name. But if we have more then we are just opening ourselves to trouble in the future.

One approach that allows a new column, yet ensure its integrity, is if we had triggers: this would automatically update the name_lower field every time we have an insert/update to the row, and would be transparent to the application. But these are only available in MySQL 5.0.2 onwards.

The other option to solve this is to get rid of the LOWER()s in SQL altogether. The database is case insensitive anyway, and uniqueness is guaranteed since we have a unique index on name. We can do the lower part in PHP on the name before we pass it to db_query().

kbahey’s picture

I mean instead of this:

<?php
db_query
("SELECT uid FROM {users} WHERE LOWER(name) = LOWER('%s')", $name);
?>

We do something like this:

<?php
db_query
("SELECT uid FROM {users} WHERE name = '%s'", strtolower($name));
?>

Can someone benchmark this?

m3avrck’s picture

I agree with Khalid. Do this at the PHP level with strtolower(), the database should not store 2 representations of the same thing, it's going to cause problems eventually.

quicksketch’s picture

by defining your own case-insensitive equality operator, you're just hiding the LOWER() calls, not removing them.

Yeah you're right, I should read code more carefully. Shoot, another option down the drain :(

quicksketch’s picture

The problem with kbahey's approach is that the user table stores the data including mixed case usernames. Using strtolower() in PHP and then searching will cause any username containing an uppercase letter to fail when using pgSQL whose comparisons are case sensitive, the original cause of all the trouble in this issue.

kbahey’s picture

But Robert's original test and patch was on MySQL.

The motivation was bad performance on MySQL because of the lack of using the index since the name was wrapped in LOWER().

pwolanin’s picture

Until reliable triggers are available in MYSQL, it seems very hard to think of a solution other than storing the lower-cased username as another column. While this may not be a great practice, it's database-agnostic.

Even if Drupal was pushed in MySQL 5.0, it's not clear that most users would be able to create a TRIGGER:

CREATE TRIGGER was added in MySQL 5.0.2. Currently, its use requires the SUPER privilege.

http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html

robertDouglass’s picture

I'm still in favor of the extra column, but rerolled to remove whitespace as was suggested earlier in the thread. Anybody is welcome to beat me to it, since I won't be doing it this week.

pwolanin’s picture

How do you deal with the update path then? by removing the white space, previously valid usernames may collide.

kbahey’s picture

I still don't like an extra derived column. It is not good practice and a whole can of worms.

Here are some benchmarks using MySQL 5.0.22.

I used today's HEAD and created 10,000 users using devel.

Then I ran two queries, searching for a user called "paritoh" (lower case p in the db):

  $sql = "SELECT * FROM {users} WHERE LOWER(name) = LOWER('%s')";
  db_query($sql, 'Paritoh'));

This is how it is in core today.

Result: Time=22.491 Rows=1 SQL=select * from users where LOWER(name) = LOWER('%s')

  $sql = "SELECT * FROM {users} WHERE name = '%s'";
  db_query($sql, strtolower('Paritoh'));

This is using the case conversion outside the database:

Result: Time=0.395 Rows=1 SQL=select * from users where name = '%s'

56X orders of magnitude improvement.

So, for MySQL, the performance problem is solved by doing the case conversion in PHP, with no backward compatibility issues or the need for a derived column.

This leaves PostgreSQL, which is case sensitive.

By using database abstraction, we always have compromises such as lowest common denominator, or optimizing for one but not the other.

Alternatives:

- Store names in lower case. They are already unique and have a unique index, so no data issues are there, but will be offputting to users to see their names change case. In this case, we can still use LOWER in the SQL, but like so SELECT * FROM {users} WHERE name = LOWER('%s') which works with all databases, and is as fast as the PHP strtolower() for MySQL.

- Do some magic in the abstraction layer to rewrite the SQL for each engine. For example, if we have LOWER(name), we make it just name for MySQL.

- Other alternatives?

catch’s picture

Status:Needs work» Closed (duplicate)
catch’s picture

Status:Closed (duplicate)» Needs work

oops, setting status back, too many tabs.

merlinofchaos’s picture

Category:task» bug
Priority:Normal» Critical

Coming back over from http://drupal.org/node/181625

I am setting this back to critical and back to bug report.

From firebus' report (I've been working with him), not fixing this made nanowrimo.org unusuable with 800 logged in users; fixing this got him up to 1400 logged in users and a slow but usable site.

This is a critical performance problem.

hswong3i’s picture

Subscribing

chx’s picture

Assigned:Dries» chx
Status:Needs work» Needs review
StatusFileSize
new10.71 KB

I would side with merinofchaos, as posted in another issue, one single term query on NowPublic runs for 2.42s w/ LOWER and 0.02 without. So, I would argue Dries' earlier

This doesn't break your website and therefore isn't critcal.

-- while the website does run but if its so slow that it can't be used, then IMO that's a broken. The patch is not difficult. The update is not nice but that's life.

chx’s picture

StatusFileSize
new10.93 KB

Doh! Forgot the indexes.

chx’s picture

StatusFileSize
new12.91 KB

forgot access table as I only search modules. I do not think there is any point keeping the mask case sensitive it's not the usual user submitted data that can not be altered. So, I simply lowercase that and not keep a lower_mask.

hswong3i’s picture

Give a simple scan about the patch in #74 and sounds beauty: including both upgrade and normal queries patch.

BTW, seems not yet include the handling of drupal_is_denied() (http://drupal.org/node/174025)? It is also a critical issue that affected by non-indexed LOWER() handling. Should we also take care of that within this issue? May be also handled by 2 columns? One for compare (lower case) and the other for admin UI (natural case) ?

chx’s picture

StatusFileSize
new13.47 KB

hswong3i says, I should warn the admin about the mask change. Be it.

John Morahan’s picture

Status:Needs review» Needs work

Applied #77, installed a new site into a fresh database, and on completing installation, user 1 was not logged in. Trying to log in gives this error: "The username admin has not been activated or is blocked."

Also, the indexes were not created. (looks like they were only added to the upgrade path)

hswong3i’s picture

Sorry, please hold on for a while! A simple suggestion: should we split the user "real name", "account name" and "user ID" as 3 different type of fields? Let's take vexim (http://silverwraith.com/vexim/) database users table schema as example:

  1. user_id: as like as our uid, auto_increment.
  2. localpart: the local part of email address (e.g. localpart@domain). It is always unique, and restricted with email local part restriction: [a-z0-9_-]. Similar as our name field, plus lower case limitation.
  3. realname: An additional field with NO strict limitation. User will able to input whatever string as their display name.

So my suggested changes:

  1. uid: unchange.
  2. name: field keeped, and translate all account name into lower case. As like as name_lower proposed in chx's patch.
  3. realname: new field, free typing. As backward compatible, we can simply copy the existing name into this new field.

Benefit include:

  1. No change to name and keep its function. Contribute developers should not be confused by the lower case transformation.
  2. realname is much more meaningful and useful. E.g. I always use profile.module and create some field similar as this nature for user full/real name.
  3. name_lower is only meaningful when compare, but no other benefit. On the other hand, contribute developer may feel confusing: which field should use for which cases? Should I always use name_lower during compare? Or combine the use of both field? Or some other style? We need to document the change clearly.

Yes, this suggestion is a bit complicated when compare with chx's patch: patch in #77 is much simple and direct, and should be functioning with no question (may be we need some more debug, but this shouldn't be difficult). This suggestion may require for more changes, e.g. most UI should update and display both user real name and account name. But this is not a new idea: most forum system supported for this style.

Please feel free to work hard with chx's name_lower patch, it is target for today. BTW, my suggestion shouldn't be too complicated, and is suitable for a long term development. If it should belongs to D7, I will come back for this afterward, and focus on chx's work for the needs of today :)

alpritt’s picture

@#79: While I agree with a need for real display names, it seems like something that would complicate this issue. Whatever the outputted username is, will be the one people search for and must therefore be translatable into the lowercase version. So we cannot have a display name of 'John Doe' and then a username of 'jdoe' since other users would end up searching for the wrong name.

Comment #37 seems like a much more workable idea. And probably something that should be taken into consideration in support of the above patch -- even if we intend to implement it at a later date. If this is the case, maybe the naming of the field should not be as specific as 'name_lower', but should be something more generic like 'name_unique'.

alpritt’s picture

From #79: "This suggestion may require for more changes, e.g. most UI should update and display both user real name and account name. But this is not a new idea: most forum system supported for this style."

Yes, this would solve the problem I raised above. However, it will do so at the cost of the UI. Displaying both names would be confusing.

webchick’s picture

I think hswong's suggestion is when someone enters "John Doe" as their username on registration, behind the scenes, "john doe" is stored as the user.username field, and "John Doe" is stored in the user.realname field.

How does all this lowercase business work with accented letters, Japanese characters, etc?

catch’s picture

@webchick: This suggests extra lower case column for accented characters so shouldn't be an issue in this case.

Japanese doesn't have a concept of lower case, so at the db level I assume it makes no difference at all (could only find a mysql internals list discussion from 2000 to support this, and it wasn't directly related).

hswong3i’s picture

For clarify my idea (http://edin.no-ip.com/html/?q=node/346), here are some useful screenshots from vexim and eGroupWare:
http://edin.no-ip.com/html/files/vexim_adminuseradd.png
http://edin.no-ip.com/html/files/vexim_adminuserchange.png
http://edin.no-ip.com/html/files/egroupware_edit_user.png
http://edin.no-ip.com/html/files/egroupware_setup_authentication.png

The proposal (split field into realname, name and uid) is not too complicated, backward compatible, and widely adopted by many other systems. Should this be one of the possible solution?

hswong3i’s picture

May be we should restrict "account name" (name) as strict as email local part: [a-z0-9-_.], and leave "real name" (realname) as free as possible. This can simplify the checking (whatever UTF8 or not, we only accept simple text input), and also extensible if we may adopt SMTP authentication backend.

catch’s picture

hswong3i - username currently accepts most UTF-8 characters - including Japanese etc., your suggestion would be a regression if it restricted to [a-z0-9-_.], and I really think the "real name" idea should be left for D7 - we already have profile.module for this.

Looks like #77 needs schema declaration updated in .install, will try to reproduce John Morahan's bug later on.

John Morahan’s picture

I'm using MySQL version 5.0.22-Debian_0ubuntu6.06.5-log - if that helps.

hswong3i’s picture

@catch: agree. The idea of restricted name ([a-z0-9-_.]) may be too new and also too late within D6 dev cycle. We may need more time and discussion for this additional issue. I am just exploring about its possibility.

BTW, It should be a good idea if we employ the pair up of realname + name, rather than name + name_lower. This should be more meaningful, and backward compatible: no harm to contribute developer even they don't update the old LOWER(name) handling, will not break existing code but just without performance boost.

kbahey’s picture

As I said earlier in #68, the problem does not exist in MySQL, and we can overcome it with a simple code change that does not involve any schema changes.

Again, I am objecting to chx's approach of derived columns that only hold a variation of an existing column. These are just trouble down the road, and we will pay dearly for them in maintenance and complexity.

If Postgres needs that, then why not limit the complexity to it?

As to hswong3i's suggestion, I need clarifications on:

- The local part of the email may not be unique. There can be someone@hotmail and someone@yahoo with the same "someone" part, but different people. Unlikely but possible.

- Your scheme enforces that the login name ('name' in your writeup) is always lower case. This means that now we cannot have 'Dries', or 'Amazon', but only 'dries' and 'amazon'? That may not be acceptable to some people ("Why can't the login name be what I want it to be?") Also, what about UTF-8 characters? How would Jose/Andre with accents on the e be handled? Strip those as well?

moshe weitzman’s picture

your suggestion to use a different schema for pg users is is more complex than the extra column, IMO. what about al lthe other DBMS? you have objected without proposing an acceptable alternative.

hswong3i’s picture

@moshe: totally agree. We shouldn't have a hybrid solution which is case-by-case and database-by-database dependent.

@kbahey: local part is not unique, just as like as it is not unique when using drupal.module within D5.3. SMTP authentication should always binded with A authentication server. In case of eGroupWare, we can define the "Mail domain (for Virtual mail manager)", so user will able to use both "username@example.com" and "username" if Mail domain is set as "example.com" (Sorry this topic shouldn't belongs to here, I will create a new one for D7)

On the other hand, login name is restricted, but not realname. Realname allow UTF-8, with no strict limitation. The idea is similar as the "Full name" field which is used within drupal.org but created by profile.module. Realname should be something always binded with user, no matter how the drupal site is used for (in general speaking...). It is not necessary to split this essential field into a non-standardized profile.module handling.

chx’s picture

realname whatever it is , too big for D6 and does not belong to this thread. Bring that somewhere else. (So much banter and noone fixed the patch!)

John Morahan’s picture

StatusFileSize
new15.18 KB

Found the login bug: there was a missing break; in the switch statement in user_save(). Leaving as CNW because this still needs the schema update.

John Morahan’s picture

Also, user_save() does not populate {users}.name_lower for newly created accounts.

John Morahan’s picture

StatusFileSize
new15.68 KB

Now it does.

John Morahan’s picture

Status:Needs work» Needs review
StatusFileSize
new17.12 KB

Added the schema change, and added placeholder-for-uid-1 to {users}.name_lower in system_install(), so as not to break the new unique key.

catch’s picture

Ok so I patched a fresh checkout and installed with no issues. offset with comment module but I think my tarball is a little out of date.

Made a new user manually and logged in with them, then logged out and back in with the admin, all fine. Posted nodes and comments, did some user admin etc. no problems there either. Also checked the db and the name_lower column and respective index were there as expected.

Looking pretty good!

catch’s picture

Status:Needs review» Reviewed & tested by the community

missed a bit.

jvandyk’s picture

Status:Reviewed & tested by the community» Needs review
StatusFileSize
new17.02 KB

Fixed some grammar issues.

The tests for lowercase masks in the access rules UI were incorrect.

David Strauss’s picture

The index on user.name needs to go, but this patch looks great otherwise.

John Morahan’s picture

StatusFileSize
new17.13 KB

Like so?

John Morahan’s picture

StatusFileSize
new17.89 KB

Seems user_autocomplete() got lost somewhere along the way

kbahey’s picture

@moshe and @hswong3i:

My comment on MySQL, is that 95% of Drupal sites use MySQL already (we can debate the figure separately). So for the sake of the 5% we are introducing a derived column, denormalizing the schema, which is not needed by those 95%. Dowing a strlower() in PHP solves the problem for MySQL.

I don't use PostgreSQL, nor my clients do, and hence have no need to solve that problem. Someone who uses it can scratch that itch.

The issue with using derived columns is that they have to computed every time the original changes, and the probability of errors because they get out of sync is there. Yes, the UPDATE and INSERT can be in only a few places, but there are modules that can mess this up easily.

Anyways, it seem that I am the only one voicing this concern. I wish we can poll a PostgreSQL expert on what options are there before we see this as the ultimate solution to this problem.

hswong3i’s picture

Sorry that since drupal_is_denied() is more high priority than user.name (drupal_is_denied() is called for EVERY page generation, whatever deny or not), I adopt both killes's patch (http://drupal.org/node/174025#comment-317309) and chx's patch comment (http://drupal.org/node/83738#comment-332240), and shift the focus of drupal_is_denied() back to its own issue (http://drupal.org/node/174025#comment-332305).

drupal_is_denied() should delay no more, and shouldn't wait for user.name :)

hswong3i’s picture

I folk this issue and handle with what I proposed in #79: http://drupal.org/node/188734. That issue should not be as simple as name_lower, but should be more complete and elegant :)

markus_petrux’s picture

hmm... not sure if someone checked this already... there's a way to extend PostgreSQL by creating new data types, so it is possible to create a "Case insensitive text data type". Here's a project that does this:

http://gborg.postgresql.org/project/citext/projdisplay.php

chx’s picture

denormalizing is not for speed purposed, next version we can even put in triggers, about citext, well, that's not civarchar so it's somewhat problematic.

John Morahan’s picture

Status:Needs review» Needs work
StatusFileSize
new18.03 KB

I worked on this a bit while d.o was down.

I don't think there should be a unique key on {term_data}.name_lower - what if you had terms with the same name in different vocabularies, or in different locations in a hierarchy? I changed it to a regular index in this version.

I also renamed _system_update_6036_strotolower() to _system_update_6036_strtolower() (assuming this was just a typo) and corrected a comment "fail over" to "fall through" in user_load().

I tested the upgrade path and found several problems: db_add_field() needs 'type' => 'varchar', not 'name' => 'varchar'; the access update was checking for db_table_exists('term_data'); the unique key on {users}.name_lower was added before the field was populated; and the multipart update did not increment its counter. These should all be fixed in the attached patch.

I guess the access / drupal_is_denied() stuff should probably be removed from this patch, now that it's being dealt with in the other issue. I haven't done that yet.

John Morahan’s picture

StatusFileSize
new18.03 KB

Note to self: read patch before uploading.

John Morahan’s picture

Another thing: edit and save a user account, the email address is folded to lowercase. This should not happen as email addresses (left of the @) are case sensitive.

David Strauss’s picture

"This should not happen as email addresses (left of the @) are case sensitive."

I think it's OK to lowercase the email addresses. I don't know a single email server that actually observes case.

hswong3i’s picture

There is a miss typing in system_update_6035 and system_update_6036, from #102:

db_add_field($ret, 'users', 'name_lower', array('name' => 'varchar', 'length' => 60, 'not null' => TRUE, 'default' => ''));

It should be 'type' => 'varchar' but not 'name' => 'varchar'

chx’s picture

Assigned:chx» Unassigned
Status:Needs work» Needs review
catch’s picture

OK tested adding and deleting users, created rules, logging in etc. everything fine. The message in access rules is sound as well (rules are stored in lower case etc., reassuring).

However changing e-mail address to something uppercase in user/edit you see it being LOWERed when it saves. I can't actually remember ever seeing an e-mail address with uppercase letters in it but I guess some people might use them - does it need a message to say "All e-mail addresses are stored in lower case, however this will not affect e-mails the system sends to you?" on user/edit maybe? But it should only do this if there's uppercase characters otherwise it'd be annoying. That's tiny, and could be a very small followup patch.

hswong3i's comments in #113 are already fixed in the patch from #109. I can't see any other issues but this could probably do with one more set of eyes.

John Morahan’s picture

Status:Needs review» Needs work

Well, if email addresses are going to be lowercased, then they need to be lowercased on creating a new account, and on updating from D5, too.

chx’s picture

Status:Needs work» Needs review
StatusFileSize
new16.33 KB

Creating a new acc, installing or not, goes through user_save which lowercases. The update needed some love indeed.

John Morahan’s picture

Status:Needs review» Needs work

Creating a new account goes through the else branch of the big if (is_object($account) && $account->uid) in user_save(), and that part lowercases only name_lower, not mail / init.

John Morahan’s picture

Status:Needs work» Needs review
StatusFileSize
new18.62 KB

So I added another case to the switch.

chx’s picture

StatusFileSize
new16.85 KB

I have changed the INSERT switch to default strtolower the same as the UPDATE one. Signature is held case sensitive.

John Morahan’s picture

Okay, that works too.

John Morahan’s picture

StatusFileSize
new18.71 KB

There's now another system_update_6035(); rerolling.

JirkaRybka’s picture

Status:Needs review» Needs work
StatusFileSize
new2.14 KB

I tested the patch by upgrading an existing 6.x clone of real site. Everything seems to work well (upgrade too), adding access rules, editing users... By reading the patch, I got an impression that more things are probably affected than I managed to test, though.

Minor problems:
- The upgrade functions should return some sensible message. Currently updates 6038, 6040, and 6041 claim that "No queries were executed", which is quite not true.
- In user.module, one of the switch ($key) blocks, there are two cases on the same line.

I tried to benchmark this patch, expecting a bit of extra speed due to the access rules check optimized. The results are attached. Unfortunately, the gain with my site's data seems to be rather small:
- Cached pages: 2.0%
- Not cached pages: 1.4%

This makes me inclined to say, although there IS some gain indeed, that such a small gain doesn't make a good reason for schema change this late in D6 release cycle. Other changes than access check are not covered by my benchmarking, but I believe these are targetting rarely requested pages, making no good reason either.

Marking as 'Needs work' for the two tiny problems mentioned above, AND for the need to do some more benchmarking and/or consideration to decide whether this should go into 6.x or 7.x (I'll vote for 7.x).

David Strauss’s picture

@JirkaRybka Your performance tests are completely unusable unless we know the size of your user and access tables.

chx’s picture

Status:Needs work» Reviewed & tested by the community

Two cases on the same line, great, noone said that's a problem, evah. Update messages can be added later. As you say it's not only this one. Benchmark, what more benchmark? I benchmarked above. Create a term_data table with a couple hundred thousand terms and then try to SELECT with LOWER('foo')=LOWER(name) and then 'foo'=name (see above) 2.42s vs. 0.02s. What more do you want? Do not expect this patch to boost your Drupal install with a hundred terms or users. Try, as I say, a hundred thousand.

JirkaRybka’s picture

Status:Reviewed & tested by the community» Needs work

users: 554 rows
access: 17 rows
term_data: 102 rows

JirkaRybka’s picture

Status:Needs work» Reviewed & tested by the community

I didn't mean to change status now...

Re: #126 - if you think other than me, OK, you've all right to that. I said my opinion. Schema change seem late to me, unless there's a lot of gain. (I can't test with hundreds of thousands, if you expect that, then I just wasted hours of my time.)

chx’s picture

No you did not waste your time because functionality testing is needed and welcome.

Gábor Hojtsy’s picture

I am handing this issue to Dries for consideration, whether such a database change is worth the gains so late in the cycle.

chx’s picture

Of course it is! :)

m3avrck’s picture

FWIW, I have been running a variation of this patch on most of my sites for the past 4-5 months... have not noticed any ill effects and my database is happy :-)

merlinofchaos’s picture

As firebus said, fixing this problem allowed him to double the number of logged in users his site was able to handle on the same hardware.

If this is not a critical performance enhancement, I don't know what is.

mfer’s picture

subscribing

Gábor Hojtsy’s picture

Status:Reviewed & tested by the community» Needs work

LOWER() and strtolower() are not the same, the former is UTF-8 aware, the later is not. So they are not interchangable. Read more here: http://drupal.org/node/174025#comment-625306

I mark this issue "needs work", but actually I would welcome if we could solve the smaller issue at http://drupal.org/node/174025 for which there is an alternate fix embedded in the bigger patch here for whatever reason. Once the UTF-8 friendlyness issue is fixed there, that patch can be committed, and this patch can also get smaller, and can stand on its own instead of trying to solve multiple issues at once.

moshe weitzman’s picture

sounds like a job for drupal_strtolower() in unicode.inc

Gábor Hojtsy’s picture

Well, refrain to drupal_strtolower() if we are fine with *requiring* the PHP mbstring extension, or our code does funny things. If you look into drupal_strtolower(), it either works with the mbstring extension or falls back on a custom implementation with support for latin-1 only (not even suitable for this area of Europe, where I live).

hswong3i’s picture

So if mbstring our standard and suggested PHP extension? e.g. I found that admin/logs/status (D5.3) have a "Unicode library" checking. Should we assume users understand the risk if they don't have mbstring installed when using unicode with drupal?

hswong3i’s picture

Are strtolower() nor drupal_strtolower() suitable for solving slow LOWER() issue? NO, they don't; they can't replace LOWER() unicode compatible feature (please refer to Gabor's comment for more detail: http://drupal.org/node/174025#comment-625544).

What we should do with slow LOWER() is: use LOWER() to filter INSERT/UPDATE, and use LOWER() for filtering user input ONLY during SELECT (but don't use LOWER() for column or else cancel indexing). The key point is: fully utilize LOWER(), on the other hand don't cancel indexing :)

P.S. I have already update both http://drupal.org/node/174025 and http://drupal.org/node/188734 based on Gabor's suggestion. If we decided to keep users.name_lower as slow LOWER() solution, some works may needed for it :(

klando’s picture

I haven't read the all the thread, but options I see are :

* the so perfect index on table(lower(field))
* use ILIKE operator which work under postgresl as an insensitive comparaison, but the first option is the best. in this second one, index on the name should be : ceate index foo on table(field text_pattern_ops);

adding a new field for that ?! it is very very strange idea .....

David Strauss’s picture

Why don't we just use the LOWER() function in the database for the constant values? There's no performance penalty.

markus_petrux’s picture

...there is also the option to add a new data type to PostgresSQL, as pointed to by #106, or the possibility to build different queries depending on the SQL engine.

BTW, citext seems to work for text and varchars. It would be nice to hear from someone versed in PostgreSQL about this approach.

klando’s picture

as far as only 'lower(field)=lower($name)' is resquested (i.e. not 'field like ...'), the best approach in postgresql *is* the index on lower(). I am not a drupal user's since long time, and I will investigate more in the core code next weeks (I hope). I can claim to be versed to postgresql...

In postgresql good result ordering are defined by lc_collate well positionned for the language indexed when using like/ilike operator. Also, you have to use the special 'text_pattern_ops' when indexing in the idea to use LIKE. when using the '=', functionnal index can be used (the lower() for example, or special criteria more complex-the words > x letters and with no numbers..or everything you want)

chx’s picture

yes, use LOWER on constants. About pgsql being able to index on lower, that's not too significant as we won't have two different schema and query and all for pgsql.

klando’s picture

Well, afais it is an issue for big website, those which get people able to add an index on a specific demand.

hswong3i’s picture

@klando: I am totally agree with chx's comment in #144. We should always avoid database specific tweaking and hacking if possible. If an universal solution is available as alternative choose, seems it should be chosen as a better solution :)

kbahey’s picture

A far saner approach to this can be seen in #191504 without the hoops we have to jump through to do this.

hswong3i’s picture

Assigned:Unassigned» hswong3i
Status:Needs work» Needs review
StatusFileSize
new5.48 KB

Recall my previous comment for users.fullname (http://drupal.org/node/188734#comment-628268):

On the other hand, I guess we are now able to conclude name_lower as the most suitable solution for D6, by the counter prove of fullname for D7, based on its complete research founding and successful foreseeing. Yes, name_lower is very late within D6 development cycle; it also seems ugly, lossy and temporary solution. BTW, name_lower hidden most handling from UI, and there is no change to client behavior. It also give no impact to existing API, as no additional abstraction is required. And the most important is, name_lower should give us the most smooth upgrade path for D7 with minimum impact :)

I merge the progress result of users.fullname with users.name_lower, which utilizing the use of LOWER() in order to boost up overall performance if possibile. Since the performance bottleneck is mainly related to user_load() and user_access(), I try JUST FOCUSING on the use of users.name_lower within this 2 functions, so reduce the impact for D6, and also the impact for D6 (name_lower) -> D7 (fullname) migration as minimum as possible.

Patch is tested with MySQL (5.0.32-Debian_7etch1-log) and PostgreSQL (8.1.9, Debian 4.1.1-21) with no error. Also tested with D5.3 -> D6 upgrade (MySQL). Currently, only handle users related patch; taxonomy patch will handle afterward.

hswong3i’s picture

StatusFileSize
new8.71 KB

Follow up with taxonomy patch. UPDATE is required after drupal_write_record() since LOWER() is not replaceable by strtolower() nor drupal_strtolower(). JUST FOCUSING on taxonomy_save_term() and taxonomy_get_term_by_name() as they are the key point of performance bottleneck. Change as minimum as possible in order to reduce the impact for D6 and the upgrade path for D7 (with whatever alternative and better solution).

Patch is tested with MySQL (5.0.32-Debian_7etch1-log) and PostgreSQL (8.1.9, Debian 4.1.1-21) with no error. Also tested with D5.3 -> D6 upgrade (MySQL).

chx’s picture

Upcoming is a patch which will use triggers to maintain name_lower on databases that support that and something else for MySQL 4.1. The MySQL part is done, I am waiting for the postgresql trigger.

hswong3i’s picture

Component:user system» database system

@chx: WOW! Trigger! I love it (Oracle's trigger give me A LOT OF FUN once before...)! Just one minor concern: seem schema API don't currently support trigger officially, how can we support trigger without affecting existing D6 implementation?

klando’s picture

Component:database system» user system

1. SQL language does not provide trigger facility to postgreql, we need at least plpgsql.
2. perhaps some rules/views
3. I still find *bad* the adding column, or this implementation.
4. some operator rewriting are also possible, perhaps ....

As I read(in posts), what is needed is a NAME and a UNIX_NAME. Which mean some name for webpage and some unix_name for auth. Not a name and name_lower (even only the field name change, it is important for the model)

Another point is that, core drupal devs claims same queries for any db, I am not sure it is good:
* when you have an issue on mysql, patchs impact postgresql and vice-versa
* when you want to optimise the code you can not.

Every db engine have is own positive point, and I think it is important to catch the advantages of the db server. It can not be done with only identical queries.

I am not already really versed to drupal code, so I am disapointed : I can not offer any help until I get time to look at the code.

Is there timeline in drupal which make so urgent the patch ? Be patient, working on a solution can be longest than few days.

chx’s picture

Interested parties can get the code from http://drupalbin.com/107 combine with postgresql: create function term_name_lower() returns trigger language plpgsql as $f$ begin NEW.name_lower := lower(NEW.name); return NEW; end; $f$; and select has_language_privilege(oid,'USAGE') from pg_language where lanname='plpgsql'; to check for plpgsql hence trigger possibility. I have no time to roll together right now.

chx’s picture

Assigned:hswong3i» chx

The PostgreSQL code is from AndrewSN on #postgresql who also said:

[AndrewSN] if you absolutely have to use the same query on mysql and pg, then the trigger is probably the best solution
[chx] AndrewSN: may I quote this on drupal.org ?
[AndrewSN] yeah, you can quote me

So here were are. Patch, as said, forthcoming either very late today or early tomorrow. If someone does not do it before, of course. All code is provided.

hswong3i’s picture

Component:user system» database system

@klando: If we are talking about the pair up of NAME and UNIX_NAME, it is just similar idea as users.name (for auth) and users.fullname (for webpage): http://drupal.org/node/188734. As proved by users.fullname, we will need some time for UI update, which seems not suitable for current D6 status. I am giving a great hope for this in D7.

Every DB engine have its own positive point, e.g. Oracle support both trigger and sequence (I use it for emulating MySQL auto_increment within Oracle driver), on the other hand SQLite don't. BTW, that's not means we MUST have trigger and sequence supporting; complicated handling can always split into small and simple alternative solution, which usually simpler for code maintenance and more portable.

Anyway, I am agree about time limitation. name_lower + trigger should able to simplify overall implementation, but debug of trigger may also delay overall D6 release schedule (well, I don't really know if we still have such schedule...). If works can be done by a simpler solution, it should be more suitable for our current requirement with smaller impact.

klando’s picture

for info, I provide here some infos from mysql:
http://dev.mysql.com/doc/refman/5.1/en/case-sensitivity.html

So mysql can perfectly be case sensitive.

Additionnaly, in this thread, the guy provide an explain of a request which *use* index. (WHERE LOWER(name) LIKE LOWER('a%') ORDER BY name LIMIT 100;)

I have no mysql here, perhaps having a look at that ?

klando’s picture

chx’s picture

klando, I told you on IRC we are aware of the fact that MySQL can be case sensitive but I pray how that's relevant here? We are adding a name_lower column anyways.

cburschka’s picture

What is needed to review this patch? (Unfortunately I don't have pgSQL, and I can only benchmark a large site if I first generate one with devel).

chx’s picture

Status:Needs review» Needs work

Um sorry for not updating the status.

hswong3i’s picture

Status:Needs work» Needs review

We seems now a bit off-topic. This issue is all about LOWER(name) queries perform badly; add column name_lower and index. From my point of view, it is already complete in #149. name_lower + indexing can solve slow LOWER() for D6, without require any other additional API/database abstraction/schema changes. It is the most simple solution that we can now figure out, with minimum impact to existing implementation (yes, it is even simpler than users.fullname issue).

Using trigger for name_lower maintenance should always be a better and handy solution than calling UPDATE manually, but this is completely another issue. We still have many other concern about trigger, e.g. what I can guess about: 1. it is not the lowest common factor of databases, some database (e.g. SQLite) may not able to support this; 2. we need most modern database for trigger supporting (e.g. MySQL 4.x+, PgSQL + plpgsql); 3. do we really need trigger? will it really give us benefit when compare with alternative simple solution? 4. do we really have time and chance for committing trigger within D6? etc.

This issue seems now hijacked by trigger. We may come back and update the ugly and lossy UPDATE handling by trigger, when trigger goes core support; but that's not means we need to wait for it or else postpone the commit of name_lower for D6. According to our understanding, slow LOWER() is a critical performance impact which should fix within D6 if possible. Should we shift focus back to #149, in order to commit this issue on time?

robertDouglass’s picture

Just for the record, I agree with hswong3i in #161. I have (obviously) supported the name_lower + UPDATE solution from the beginning.

cburschka’s picture

It seems to be the lowest common denominator, and a very straightforward space-for-searchtime trade, like any form of indexing. Trying to get it to use the special API of each of the database systems looks clean in theory, but the schema split would be impractically complex.

As a precedent example, what about db_next_id and {sequences}? MySQL's auto_increment feature hasn't been used since D5...

chx’s picture

StatusFileSize
new15.07 KB

First of all, you left out a few LOWER(name) calls in users. Second, taxonomy_autocomplete searched for '%foo%' which is not consistent with user_autocomplete, not consistent with how the world works (specifically checked with gmail) and the world works this way because if you start with a % then you can't index. There is no worse place to not index than in an autocomplete function.

Keeping the name_lower update in the update query only is burying our heads in the sand. You presume people will always use the API to change the database. We all know this is not the case. Of course, they can also take care about keeping the DB in sync but if it comes for free, why not have core do that? And indeed it comes for free if your database supports triggers and as MySQL 4.1 general support ended almost a year ago, it's quite OK to suppose that 5.0 is prevalent. You might say, even if it's so, some users have ISPs who do not support triggers. For them, we provide a function which does a table scan, this comes at the same cost as one LOWER(%s)=LOWER(name) did which also scanned the table, and during the scan, it picks those fields where the name_lower is out of sync -- presumedly very few -- and updates those. This function is called from taxonomy_save_term and taxonomy_get_term_by_name but not from taxonomy_autocomplete -- autocomplete must be speedy. users table is synced in user_search and in user_save but nowhere else, this should be more than enough. If you are worried by there are reads without a sync call, please note that on every save we sync every row in the table and there are extra synces too -- so this is a lot better than just simply keeping an update in the save functions.

Almost forgot to credit David Strauss for the underlying idea. Good we can edit comments now.

markus_petrux’s picture

What about performance? cost of using triggers? cost of db_sync_name_lower() for those that can't use triggers?

Also, does this change affect/limit the possibilities to implement more DB engines into Drupal? What about performance in those cases? (I belive there was something already made for Oracle or DB2).

Then, maybe I missed this if it was already commented before... comparing name_lower = LOWER('%') might be unnecessary, if when you inserted the name_lower value using drupal_strtolower(), the same conversion would apply when comparing, no?

sammys’s picture

Status:Needs review» Needs work

Looking ok with a visual check. Not testing yet as chx and I have discussed caching the return value of db_has_trigger() into a variable. Of course, this will require some care if the system is moved from one version of MySQL to another. However, the need to avoid the query overhead outweighs the number of incidents AND the likelihood of downgrading is very low.

--
Sammy Spets
Synerger
http://synerger.com

hswong3i’s picture

@markus_petrux: LOWER() is not replaceable by strtolower() nor drupal_strtolower(). Please read #139 for more information.

@chx and sammys: just a little request: would you mind to free name_lower + indexing from trigger, and handle trigger as another issue? This issue may still have chance for sliding into D6 even it is such late, if we are able to limit the impact as small as possible; but I have no idea if it is hijacked by trigger. As mentioned in #161, we can raise TOO MANY different questions about trigger, and the duty of answering these questions should belong to its own issue :'(

chx’s picture

Status:Needs work» Needs review

I will implement sammys' suggestion later but that should not stop reviews coming. @hswong3i, the change without triggers/sync is not acceptable for me because it's utterly fragile. @markus, I have detailed the cost of sync (almost the same as an old SELECT unless there is sync to do, then one UPDATE) and the cost of trigger, since when we care about the speed of writes? I guess it's a bit slower but also, a very little bit.

sammys’s picture

Giving chx a little backup on this... cost of using the trigger only affects DB writes. Not worth worrying about write cost on the user table when the benefit far outweighs. Thanks to chx for rolling the patch!

sammys’s picture

StatusFileSize
new16.22 KB

Patch updated to fix following:
* Unique index fails because creation attempted before field was populated
* Update/install fails to create triggers on PostgreSQL because of incorrect CREATE TRIGGER syntax

Also rolled in the latest commits made to system.install so the update numbers are correct.

Checked update/insert with queries on PostgreSQL and it behaves as expected.

Please double check this new patch on a MySQL installation and then mark as RTBC if all ok.

markus_petrux’s picture

Oh, I see the LOWER() stuff... good then. Thanks for pointing that out :-|

So if using Triggers... shouldn't db_sync_name_lower() be invoked at INSERT time, not at SELECT time?

Also, I believe SHOW TRIGGERS needs SUPER privilege on MySQL, is this ok with the MySQL user used to access the DB?

Then, maybe checking for Triggers support could be cached somehow? Maybe detecting if CREATE TRIGGER works or not?

Edited: hmm... if using SHOW TRIGGERS, then what if that statement returns a lot of informacion. Wouldn't it be better to query the MySQL system catalogs?

markus_petrux’s picture

Oh, variable $name_field seems to be undefined in db_sync_name_lower().

chx’s picture

OK, I will check db_sync_name_lower and also store in a variable the trigger results. This we already agreed on, just not implemented.

The crux of the matter is that sync is not enough on write time because then you need to always do it. This is very fragile as triggers fire automatically and it's quite safe to presume that the developer will have MySQL 5.0. If you say, no triggers, put the update into user_save, then the first contrib query direct against users can break this -- if we denormalize then we should not put the burden introduced by denormalizaiton on contrib authors if we can avoid.

markus_petrux’s picture

What about using cron to do the db_sync_name_lower stuff?

Also, say you're running MySQL 4.x and install D6 (how many sites are like this today, is everyone using MySQL 5.x outthere?). Well, you have no triggers, so D6 installation/upgrade doesn't populate them. Let say your hosting upgrades to MySQL 5.x, so you could use triggers, but... how could you migrate? So... would it be possible to trigger this from a site option? ...maybe a script to make the relevant changes?

chx’s picture

The site status report screen could try installing 'em and warning you if !db_has_trigger. And I believe most of the world is on MysQL 5.x, again active support ended for 4.1 on 2006 dec 31. I would not rely on cron for the syncing when we are aiming at low end sites here -- what if they run cron daily?

David Strauss’s picture

"what if they run cron daily?"

Or never.

chx’s picture

StatusFileSize
new15.26 KB

Added a variable, kept fresh on system_status about triggers. Fixed up db_sync_name_lower.

chx’s picture

StatusFileSize
new15.57 KB

Doh, I really fixed name_lower field but then throwed it away because i forogt to merge in sammys' patch first ah well here it is.

chx’s picture

StatusFileSize
new15.18 KB

W/o menu.inc part.

adrian’s picture

Why not use rules in postgres? (as in what are the actual issues with them)

[?
create or replace "users_upd" as on update do also (
update users set name=LOWER(NEW.real_name) WHERE uid = NEW.uid;
);

create or replace "users_insert" as on insert do also (
update users set name=LOWER(NEW.real_name) WHERE uid = NEW.uid;

);

?]
although i concede we might need to use currval('{users}_uid_seq') on insert to get the value regardless of whether the insert statement had a uid field or not.

chx’s picture

Rule is very debated. let's leave that for a followup. klando even says they will fail for certain queries. please leave them out of this already very long and troubled issue.

chx’s picture

StatusFileSize
new15.18 KB

Applied with offset only so I rerolled.

chx’s picture

StatusFileSize
new15.18 KB

System.install had a conflict in update numbering.

chx’s picture

StatusFileSize
new16 KB

install works again without triggers.

catch’s picture

Status:Needs review» Needs work

Patched clean d6, ran install from scratch. No problems.

Then I tried to create a user from admin >> users and got this:

    * user warning: Duplicate entry '' for key 3 query: INSERT INTO users (name, mail, pass, status, timezone, init, created, access) VALUES ('catch', 'xxx@googlemail.com', '088f6bcd4621d373cade4e832627b4f6', 1, 0, 'xxx@googlemail.com', 1196505455, 1196505455) in drupal6/modules/user/user.module on line 315.
    * notice: Trying to get property of non-object in drupal6/modules/user/user.module on line 330.
    * notice: Trying to get property of non-object in drupal6/modules/user/user.module on line 2180.
    * notice: Trying to get property of non-object in drupal6/modules/user/user.module on line 2183.
    * notice: Undefined property: stdClass::$uid in drupal6/modules/user/user.module on line 2198.
    * notice: Undefined property: stdClass::$name in drupal6/modules/user/user.module on line 2198.
chx’s picture

Status:Needs work» Needs review
StatusFileSize
new15.99 KB

I created a uniq key on users.name_lower which collided with anonymous user. Checked: term_data.name_lower is just an index, as users is now.

chx’s picture

StatusFileSize
new16.04 KB

Patch had a typo. Sigh.

catch’s picture

Status:Needs review» Reviewed & tested by the community

OK I tested this with and without triggers in MySQL 5 by adding SUPER for my db user during the second of two patched installs.

With both I tested creating users, adding rules, adding users who should be denied by the rules, and adding taxonomy terms. All worked as expected, so rtbc.

Output of SHOW TRIGGERS when the MySQL user has SUPER included for reference:

mysql> SHOW TRIGGERS LIKE '%';
+------------------------------------+--------+-----------+--------------------------------------+--------+---------+----------+-----------------------+
| Trigger                            | Event  | Table     | Statement                            | Timing | Created | sql_mode | Definer               |
+------------------------------------+--------+-----------+--------------------------------------+--------+---------+----------+-----------------------+
| term_data_name_lower_before_insert | INSERT | term_data | SET NEW.name_lower = LOWER(NEW.name) | BEFORE | NULL    |          | user@localhost |
| term_data_name_lower_before_update | UPDATE | term_data | SET NEW.name_lower = LOWER(NEW.name) | BEFORE | NULL    |          | user@localhost |
| user_name_lower_before_insert      | INSERT | users     | SET NEW.name_lower = LOWER(NEW.name) | BEFORE | NULL    |          | user@localhost |
| user_name_lower_before_update      | UPDATE | users     | SET NEW.name_lower = LOWER(NEW.name) | BEFORE | NULL    |          | user@localhost |
+------------------------------------+--------+-----------+--------------------------------------+--------+---------+----------+-----------------------+
4 rows in set (0.04 sec)
markus_petrux’s picture

ahem... SHOW TRIGGERS requires the MySQL user to have SUPER privilege. I belive this might be a problem for some installations when Trigger support could be checked at installation time by checking the result of CREATE TRIGGER, or maybe using a query against the system catalogs.

Gábor Hojtsy’s picture

Version:6.x-dev» 7.x-dev

These level of changes, and the introduction of triggers is way past what is possible now with Drupal 6. Please focus on stabilizing what's there in the code, and happy hacking with this early in Drupal 7. We are moving in circles and would never reach Drupal 6 stable if we introduce new stuff on this scale still.

chx’s picture

Assigned:chx» Unassigned

If that is so then I unassign myself, I do not really care, my site has solved it simply (we run mysql, you see) and i can keep my core hacks around. *shrug*

markus_petrux’s picture

Status:Reviewed & tested by the community» Needs work

Well, it looks like this needs a bit more love.

catch’s picture

At the risk of repeating myself, I still think it'd be fairly simple to do it like this:

Expand username to allow more characters, currently apostrophes are excluded. This field is always displayed in theme_user_name() whatever happens, and by default, it's also the name that people use to log in (which will need to happen on all existing Drupal sites that upgrade anyway).

Add a new 'login-name' field - which is lowercased in the database to save the bad LOWER() queries we currently have. This is very similar to: http://drupal.org/node/83738 which only dealt with the performance aspects.

Add an admin option for users to specify separate login and display names. When this happens, there's two fields on registration user/edit instead of one, and they're populated separately instead of from the same field. We populate the column even if the option isn't switched on (to fix the LOWER() issue), the difference is you'd be able to edit them separately in the UI instead of one field. In fact that'd be pretty much the only difference.

login names are forced to be unique - since we do LOWER() on usernames this ought to work with existing data. Admins get options for the following:

1. Allow users to change their login name AND/OR display name (if enabled) so two permissions.
2. Specify whether 'display names' should be unique (if display names are enabled).

If you disable display names on a live site, you're warned that all display names will be deleted from the database and that usernames will be displayed instead (which'd in code terms would be update users set display_name = login_name

We might need to keep the login name 'as typed' somewhere (a third column, maybe another table since it'd only be used very rarely) - this would make the 'switch display names off' behaviour less destructive, for example - so if I'm Dries, and I make a display name Dries Buytaert, it get's set back to Dries instead of dries when the admin disables display names).

klando’s picture

I agree with catch.

Except (perhaps) the update users set display_name = login_name, I am not enought confident with that part of the code, but it sound better to me that the admin choice is only done at php level (what to display ? login | username).

catch’s picture

I have a patch for taxonomy over at http://drupal.org/node/277209 using mb_strtolower() which appears to be working fine. This approach ought to be applicable to user.module if it's any good.

Brumisek’s picture

Any glue for 5.x version of Drupal? I have big site with Drupal 5.x and exactly THIS problem is reason why my site is slow. I'm try find any ideas how to take it out, but here are patchs only for 7.0 version :( Could you anybody help me?

ThomasH’s picture

Version:7.x-dev» 5.2
StatusFileSize
new5.67 KB

Drupal 5 patch attached... hope it useful for someone!

catch’s picture

Version:5.2» 7.x-dev

Back to 7.x.

mikhailian’s picture

I am appying this kind of patches for a few years already, from 4.x times (I am running 6.x now).

Just wondering when this LOWER() nonsense finally disappears from the Drupal codebase.

mfer’s picture

@mikhailian This will change in drupal when someone writes a patch that's a generic solution (not just MySQL). Care to take a shot at it?

catch’s picture

Status:Needs work» Closed (duplicate)

There's a more up-to-date patch not requiring trigger/SUPER permissions at #279851: Replace LOWER() with db_select() and LIKE() where possible - just needs a re-roll and some benchmarking. Was leaving this open since it's a different solution, but not been much activity here for a while so closing this as duplicate.

mikhailian’s picture

@mfer No, I do not need case-insensitive behavior. My patches are all about removing LOWER(), that's it.

This solution is database-agnostic.

Status:Needs review» Closed (duplicate)