This patch attempts to address the issues at #188734: LOWER(name) queries perform badly; add fullname and restrict name as lowercase., #83738: LOWER(name) queries perform badly; add column name_lower and index., #102679: Add a Display Name field to core in addition to Username and #181625: LOWER() is slow and can not be indexed - but it since it doesn't fit that neatly into any of those issues, takes a new approach, and they've all got quite long discussions already about different approaches, I'm hoping a new issue will be OK. However, if you're planning to suggest a radically different approach to this, please read those issues first - since many different approaches have been discussed and rejected in those issues already. Thanks :)

What this patch does:
Adds users.username to {users} table in addition to user.name
Modifies user queries to use that table to avoid expensive LOWER() queries, however $user->name is returned just the same as now.

What it doesn't do:
Change anything in the UI or break backwards compatibility for either site owners or modules which use $user->name

What it will allow us to do:
Populate $user->name to anything they like via a contrib module or followup core patch either programmatically or via form_alter on user forms - they'll need to also handle validation and searches for custom user names if they do so - since there could be various approaches, it's best that complexity is left out of this patch.

So, we get a big performance win, and the flexibility to have more variations in user name handling down the line in either core or contrib.

Current version of the patch causes 6 failures in UserRegistrationTestCase() (starting with user_check_password(), which ought to be unrelated), but marking as needs review for general concept.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Senpai’s picture

Subscribing for a review.

cburschka’s picture

You didn't mention whether {user}.username will be constrained to lower case, but I presume that that's how you got rid of the LOWER() query?

I'll review and run some test cases...

cburschka’s picture

I'm getting the same 6 fails in registration, plus 25 fails in uploading a picture (haven't compared with the unpatched site yet). I'll investigate a bit.

catch’s picture

{user}.username will be constrained to lower case, yeah.

cburschka’s picture

False alarm on upload. It's just curl being stupid again, returning zero-length responses. When I'm done with this, I'll rewrite my copy of simpletest to use drupal_http_request; it's really completely unusable as it is for me.

This is the content of a debug file for user_check_password:

Password: foo
Account:  {"uid":"3","name":"simpletest_zyFE","username":"simpletest_zyfe","pass":"$P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.","mail":"simpletest_zyFE@example.com","theme":"","signature":"","created":"1215551808","access":"0","login":"0","status":"1","timezone":"3600","language":"","picture":"","init":"simpletest_zyFE@example.com","data":null}
Old hash: $P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.
New hash: $P$Cy92GxnBfSETyVzJNGe2gNaGPEMKBQ1

Password: 4xB6FEL37E
Account:  {"uid":"3","name":"simpletest_zyFE","username":"simpletest_zyfe","pass":"$P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.","mail":"simpletest_zyFE@example.com","theme":"","signature":"","created":"1215551808","access":"1215551816","login":"1215551816","status":"1","timezone":"3600","language":"","picture":"","init":"simpletest_zyFE@example.com","data":null,"roles":{"2":"authenticated user"}}
Old hash: $P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.
New hash: $P$Cy92GxnBfI76LOCSfUyxhbe1n0no1K1

Password: 4xB6FEL37E
Account:  {"uid":"3","name":"simpletest_zyFE","username":"simpletest_zyfe","pass":"$P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.","mail":"simpletest_zyFE@example.com","theme":"","signature":"","created":"1215551808","access":"1215551816","login":"1215551816","status":"1","timezone":"3600","language":"","picture":"","init":"simpletest_zyFE@example.com","data":null}
Old hash: $P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.
New hash: $P$Cy92GxnBfI76LOCSfUyxhbe1n0no1K1

I'll look at _crypt_password next...

pwolanin’s picture

Status: Needs review » Needs work

bug here:

+    else if ($key == 'username') {
+      $query[] = "username = LOWER '%s')";
+      $params[] = $value;
+    }

catch’s picture

Status: Needs work » Needs review
FileSize
7.86 KB

Fixed that bug. Also spoke to pwolanin about this in irc, and he suggested a way to avoid drupal_strotolower in user_save() - which means that all the lowercasing is handled by the database now instead of some php, some database.

Still 6 fails in user registration tests. Thanks Arancaytar for the debugging :)

pwolanin’s picture

perhaps this line in the test should change too?

+    $this->assertEqual($user->username, drupal_strtolower($name), t('Username matches. '));
Crell’s picture

Bless you. :-) Visual review:

- I still do not like not having a unique constraint on the display name (nee name). As coded, what would happen if someone set their name to "Crell" and someone else to "crell"? The unique key is on username, which is the case-folded version. So what would happen is username would get set to "crell" for both of them, which would create an SQL error, wouldn't it?

- This code:

+  // Set the username to a lower case version of $user->name if not set.
+  if (empty($array['username'])) {
+    $array['username'] = db_result(db_query("SELECT LOWER('%s')", $array['name']));
+  }

Why are you using the database for the lowercasing? Would its result be any different than doing it in PHP space and avoiding the extra query?

- So then for a contrib module to modify the name value, it would need to:
+ form_alter to show "username"
+ form_alter to NOT show "name"
+ inject its own validate and submit hooks *before* the default ones to set the "name" field by its own logic.

Or alternatively, do the same in hook_user validate/save. Right?

Otherwise I think this is a good direction at this point.

catch’s picture

FileSize
7.84 KB

@Crell: I've added the unique constraint back to users.name - if a contrib module wants to handle that other issue that's been discussed elsewhere, they could remove the unique key via schema API.

Using the database for lowercasing exclusively was pwolanin's idea, which I reluctantly agree with. There's a chance that certain collations on certain databases are going to operate differently to mb_strtolower (I don't have enough practical experience with collations to know how much of a risk this is) - but by using the database we ensure consistency. Also the database is only being used for lowercasing on writes now.

Additionally, php.net suggests that LOWER() is actually faster than mb_strtolower, so for micro-optimisation there's probably not much in it.

Contrib needs to do exactly what you suggest, yes.

I'll probably update user.test to attempt creating users with siMiLar usernames if that's not in there already. Still at a loss as to why the password hashing is failing.

Dries’s picture

Personally, I'd find it very confusing if my display name was 'Dries Buytaert' but my login/username was 'dries'. I'd consider that to be a serious usability regression. Unfortunately, this patch seems to take us down that road and I'm not a big fan of that. Maybe I misunderstood where this was heading though.

cburschka’s picture

I would argue that it makes sense for login names to be permanent and displayed names to be changeable. The big problem with changing login names is that it is hell for the password manager - you don't necessarily want a different login each time the displayed name changes...

Damien Tournoud’s picture

Edit: I misunderstood the scope of this patch.

@Dries: the new 'username' column is just pregenerated to 'LOWER(name)', in order to avoid a LOWER operation during SELECTs (that can't be indexed).

There is no change whatsoever here on the UI or user-experience part.

@others: would it be possible to improve that patch a little by removing spaces from the username column? That way you will not be able to register both 'Dries Buytaert' and 'DriesBuytaert'.

Did a quick visual review of the code, and it looks great.

catch’s picture

@Dries, my primary concern here is to remove the LOWER() in user queries. Robert's comparisons from 2006 show the difference in performance with 500k users. And the queries haven't changed since then:

in a typical case we're comparing 1273.97 ms to 0.37 ms

I agree 100% on the 'display name' issue, however, there's a lot of demand for it (including from people who wouldn't argue for it without good reason like Crell and Webchick). This issue is an enabler for that - either in core or contrib, but I'm absolutely not going to introduce any explicit support for such a thing with this patch - there are zero API or UI changes here. The only connection is that this patch doesn't introduce the extra column in such a way that it would block the other issue (by requiring yet another column or something).

@Damien - existing sites might already have a Dries Buytaert and DriesBuytaert - which'd give us a duplicate key error on update and lock one of them out. This patch currently has zero effect on the username policies of any existing sites, and I'd like to keep it that way.

@Arancaytar: #102679: Add a Display Name field to core in addition to Username is over there ;)

cburschka’s picture

Sorry, I misunderstood the scope of this patch.

How severe should we make this uniqueness constraint, as in how much should be stripped from the username field? Beyond spaces, we could strip everything but letters, so that "Dries Buytaert02345" would be unavailable as well. However, the stronger the constraint, the more problems people will have when updating their databases that may violate the new rules.

Crell’s picture

@Dries: There are actually extremely good reasons to allow that distinction. In my case, I can't control the login name people choose when signing up. However, nobody on my site knows each other by their login name. We actually use real mother-given names for everything. Then when I try to select someone's name using the auto-complete author field, how do I know that "Dries Buytaert", who I'm trying to set as the author of a node, is really "trkfan849"? I don't. Allowing these fields to be split is a usability improvement, because right now if you use theme_username then you get a disconnect between user's identifying strings in different contexts. That said, this patch as pwolanin said doesn't change the default behavior other than silently setting up a pre-lowercased username for performance. I suspect Drupal.org would never override that.

@pwolanin: If that's the reason for using LOWER() in SQL, that should be documented in the code.

Also, having written a D5 module that tries to custom-set the login name field, I can tell you that it is not as easy as that to set a custom value there because the user workflow sucks. :-) I'd have to try it with this mechanism, but let's be careful that "just form alter it" isn't our answer unless we know that will actually work. Anyone have time to write up a demo module as proof of concept? (I wish I did, but I probably don't.)

catch’s picture

The unique constraint is exactly like it is now - a unique index on the column and form validation to avoid duplicates with case variations. This means you can't have Dries and dries, but you can have Dries1 etc. - again, exactly the same as now. Anyone wanting to force more unique usernames also needs to find another issue to do so in, since it's not within scope.

catch’s picture

FileSize
8.14 KB

@Crell. I've added documentation to both user.module and user.test to explain the justification for using LOWER(). If someone wants to come along and tell me it's not necessary, that's fine too.

JohnAlbin’s picture

I think this patch also dovetails nicely as an excellent first step on the road to "display name" capabilities in core (and that’s the last I’ll say about it in this issue.) Given how out of control the others issues have become, we definitely need to limit the scope of this issue.

Catch, are you re-rolling the patch after your IRC chat with chx? Or did I misread that discussion?

catch’s picture

FileSize
8.5 KB

Following discussion with chx, we keep LOWER() within the database during user_save(). I've also modified user.test to use a real SELECT query as well.

Also, this version of the patch passes all user tests now (as a result of that change), and doesn't harm any other tests either.

catch’s picture

FileSize
9.46 KB

Last patch had some silly errors in it, this one should be good.

dahacouk’s picture

@catch: Well done for coming up with this approach! Will your patch make it easier for a contrib modules to add non-unique display name functionality (emphasis on "non-unique")? In other words will $user->name (the display name) be able to be made non-unique if desired?

@Dries in #11: There are many webadmins calling out for display names being distinct from login names. There are many hacks being produced for what, some see, as a core function of a web site: being able to display a real name as different from their login name. This scenario would be optional for webadmins. And there's no reason why the current situation (of just one display/login name) couldn't be maintained too. But it's good to get some standardisation for this functionality into core so that developers can start to deliver more consistent solutions. Have a peek at #102679: Add a Display Name field to core in addition to Username to see more people wanting display name functionality in core. Sure, this patch doesn't go all the way but it's a start.

@Arancaytar in #12: Why is the "big problem with changing login names... hell for the password manager"? Surely, before a login name change request is actioned, the password manager just has to check that there are no duplicates. How is that "hell"? I say keep things flexible unless there's a real security issue or big hit on performance.
-1 to permanent login names

@Damien Tournoud in #13: Why "removing spaces"? If you are talking about "removing spaces from the username" - as in display name - then we wont be able to mimic real names. So, I see no advantage here.
-1 to removing spaces from the username column

@JohnAlbin #19: Nicely restrained! ;-) You're right, it does seem that we need to do this in small steps. Looks like we'll get "display name" capabilities in core around Drupal9. ;-)

Cheers Daniel

Damien Tournoud’s picture

FileSize
10.16 KB

@dahacouk: I misunderstood the scope of this patch at first, please don't broaden it. This patch will ease the job of contrib modules to implement display names distinct from login names. It is only a first step in that direction.

Here is an updated version of catch#21:

  • Change queries in user_save() to use uid not name as the key,
  • Change user_is_blocked() to use the username, (this needs to be discussed)
  • Fix query in user_search(),
  • Fix _user_edit_validate() to allow modules to implement not distinct names.
catch’s picture

Title: Add column 'username' to {user}, remove LOWER(name) » Performance: Add column 'username' to {user}, remove LOWER(name)
FileSize
12.39 KB

Changing the title to be more descriptive.

I checked coverage.cwgordon.com and noticed that some of the user queries aren't being tested yet (i.e. user search), so this adds a test for search/user to verify they work as expected.

I'm not really set up for benchmarks, but will see if I can do some basic comparisons.

JohnAlbin’s picture

Title: Performance: Add column 'username' to {user}, remove LOWER(name) » Add column 'username' to {user}, remove LOWER(name)
Status: Needs review » Needs work

Visual review:

  1. user_is_blocked still has a $name variable in it. Should be $username.
  2. In _user_edit_validate, it should be $has_username = !empty($edit['username']); note the missing “!”.
  3. Also, Damien, I think your additions to _user_edit_validate to allow non-distinct names is scope creep.
  4. So I’m looking at the un-patched user_is_blocked and I’m wondering how WHERE status = 0 AND name = LOWER('%s') even works; shouldn't that be WHERE status = 0 AND LOWER(name) = LOWER('%s')?
Damien Tournoud’s picture

@JohnAlbin:

1. user_is_blocked still has a $name variable in it. Should be $username.
2. In _user_edit_validate, it should be $has_username = !empty($edit['username']); note the missing “!”.

Indeed, thanks.

3. Also, Damien, I think your additions to _user_edit_validate to allow non-distinct names is scope creep.

We are not far from being able to drop the unique constraint on {user}.name, why not following the route a little further? At the minimum, we need to only validate the uniqueness of {user}.username, because if we don't, contrib modules will not be able to drop the constraint themselves.

4. So I’m looking at the un-patched user_is_blocked and I’m wondering how WHERE status = 0 AND name = LOWER('%s') even works; shouldn't that be WHERE status = 0 AND LOWER(name) = LOWER('%s')?

It works on MySQL because the default collation is case-insensitive. Obviously there is a bug here.

catch’s picture

Title: Add column 'username' to {user}, remove LOWER(name) » Performance: Add column 'username' to {user}, remove LOWER(name)
Status: Needs work » Needs review
FileSize
12.44 KB

Crossposted with Damien, and looks like JohnAlbin crossposted with me. #23 introduces 5 test failures which aren't in #24 and also uses uid as a string in user_save.

Rolled the user_search fix from #23 into this one, which otherwise is identical to #24. I'm not knee-jerk against the other changes, but please no test regressions or scope creep.

catch’s picture

FileSize
12.48 KB

Or typos ;)

JohnAlbin’s picture

Status: Needs review » Needs work

Trying VERY hard not to cross post again…

My review in #25 was of Damien's patch in #23. :-p

We are not far from being able to drop the unique constraint on {user}.name, why not following the route a little further? At the minimum, we need to only validate the uniqueness of {user}.username, because if we don't, contrib modules will not be able to drop the constraint themselves.

Again, Damien, this idea is scope creep. Dropping the unique constraint on {user}.name is highly debatable (even for contrib.) But the debate is out-of-scope for this issue. We can revisit your proposed change to _user_edit_validate in a follow up issue. :-)

Catch, I think Damien is correct; the user_save db queries do need to do WHERE uid = %d instead of WHERE name = '%s'", $array['name']. That’s what user_save uses currently: http://api.drupal.org/api/function/user_save/7 Or does that cause simpletest breakage? I haven't run them as I’m still visually reviewing the patches.

Also patch in #28 has “No newline at end of file”. But that’s highly pedantic, so you can ignore me on that one if you want.

Damien Tournoud’s picture

Dropping the unique constraint on {user}.name is highly debatable (even for contrib.) But the debate is out-of-scope for this issue. We can revisit your proposed change to _user_edit_validate in a follow up issue. :-)

Well, we can drop the INDEX constraint on the {user}.name column while keeping it on {user}.username. In the standard behavior, the uniqueness of the {user}.name column is still guaranteed (because out of the box the {user}.username column is the lowercase form of {user}.name, thus if the former is unique, the latter also is).

catch’s picture

Status: Needs work » Needs review
FileSize
12.58 KB

Bug in the test. Should try to get patches right harder than I try not to crosspost.

@JohnAlbin's question in #25 - using uid instead of $array['name'] there caused simpletest breakage, yeah. Ways around this much appreciated.

Damien Tournoud’s picture

FileSize
12.85 KB

Catch: sorry for #23 I should have indicated that those were completely untested.

Here is a update from #31, including the change to use uid as a key in user_save().

Edit: fix bad spelling that hurt my eyes.

catch’s picture

#32 is fine. Thanks Damien :)

Here's a quick comparison for performance on 98,000 users generated by devel, these run from the command line:

SELECT * FROM users WHERE LOWER(name) = LOWER('nogahi');
1 row in set (0.14 sec)

Subsequent queries for different names always take between 0.13 to 0.14 seconds. query_cache is only used if the name is identical.

SELECT * FROM users WHERE username = LOWER('nogahi');
1 row in set (0.04 sec)

query_cache gets used even for different names - so 0.00 seconds for subsequent similar queries.

So we get a double improvement - 3-4 times faster on these selects the first time they're run, but the query_cache in MySQL will handle most subsequent requests.

If someone has a decent benchmarking setup, it'd be great to get more scientific data than this - or with some advice I can probably do something a bit better at home.

pwolanin’s picture

minor concerns re: text/code comments:

The schema description text seems a little incorrect or misleading:

-        'description' => t('Unique user name.'),
+        'description' => t('User name for display.'),
+      ),
+      'username' => array(
+        'type' => 'varchar',
+        'length' => 60,
+        'not null' => TRUE,
+        'default' => '',
+        'description' => t('Unique user name for login.'),

$user->name *is* still constrained to be unique, so I think the description does not need to change at all.
The username description should reference the fact that it's a LOWER() version of $user->name used to speed lookups during login, search, etc.

+    // We use the database to do the lowercasing to ensure that there are\
+    // no discrepencies in collation handling between the database and PHP.

I think should reference character handling or character set, not collation, since I think collation just determines sort order.

something like:

+    // We use the database to do the lowercasing to ensure correct searches
+    // in case of any discrepancy in character handling between the database and PHP.
Damien Tournoud’s picture

@pwolanin: in MySQL terminology, "collations" refers to the set of rules for comparing characters (for both inequalities and equalities). Collations for are used for sorting, for matching and for normalizing strings before putting them into index classes.

This said, I like the wording of your comment better :)

catch’s picture

FileSize
13.15 KB

Those changes are fine, and while collation is correct, I agree with Damien that the revised version explains the motivation better.

New patch attached.

pwolanin’s picture

to be consistent with other schema comments, this:

t('LOWER() version of $user->name for login, search performance.'),

should be written more like:

t('LOWER() version of {users}.name for login, search performance.'),
catch’s picture

FileSize
13.18 KB

Yes. good point.

R.Muilwijk’s picture

Tests look good to me , ran them all. It doesn't break any tests that don't fail normally. I also like the idea of not having to use a LOWER() on the database. So +1 from me!

Senpai’s picture

Status: Needs review » Reviewed & tested by the community

Tested against a fresh checkout of HEAD, but not a clean DB.

I created 10 users with all manner of mixed case, spaces, symbols, and the like. All form submissions were treated as they should have been, with illegal symbols being prohibited, and mixed-case or spaces being allowed. In each instance, I checked the database to verify core's actions and they were good.

I then deleted nine of those users, and in each case the deletion happened flawlessly as it should have. I kept the last user for a Change Own Username test, and it passed that with flying colors.

This patch allows users to login with either their uppercase name, lowercase name, or any mixed-case variant as long as the $user->name field matches the same sequence of letters and numbers stored in $user->username. I don't see that as a problem, but some people might. It basically allows me to login as either 'Dries Buytaert' or 'dries buytaert' but not 'driesbuytaert' because there's a space in the original $user->name.

Simpletests
User.mod == 231 passes, 0 fails, 0 exceptions
(should we be testing for user name changes as well?)

I'm good with this patch. RTBC.

dahacouk’s picture

Is this patch meant to enable me to have a (display name) $user->name = "Daniel Harris" and a (login name) $user->username = "dahacouk" - as I usually like to do?

Or is $user->username only just a lowercase representation of $user->name?

Damien Tournoud’s picture

@dahacouk: this is out of this patch's scope, but this patch do open the way for contribution modules to implement what you describe.

Senpai’s picture

Daniel, This patch creates a duplicate username of every user in the system, and automatically lowercases it. That's all it does. It's an enabler, if you will. A stepping stone to further greatness. It has to go into core before anything like the Add a Display Name to core issue could even begin to be realized.

The great thing about this patch is that it speeds up Drupal by a magnitude or two when trying to locate or manage users within the system. Once this is committed, other patches can surface to handle what actually happens to these usernames. For now, we're all going to be extremely happy when this is committed.

catch’s picture

FileSize
13.04 KB
JohnAlbin’s picture

FileSize
12.81 KB

The short description of a function's docblock should be on one line, so this updated patch only changes the docblock for user_update_7002(). No code changes from catch's patch, so credits still go to him. :-)

Re-ran ALL tests; they all still pass after the patch. “4940 passes, 0 fails, 0 exceptions”

Senpai’s picture

Bumping this as an RRTBCTRBC (Really Really Reviewed and Tested by the Community and Thoroughly Ready to be Committed).

Commit commit commit!

maartenvg’s picture

Status: Reviewed & tested by the community » Needs work

Patch no longer applies.

patching file modules/system/system.install
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file modules/system/system.install.rej
patching file modules/user/user.install
patching file modules/user/user.module
Hunk #7 succeeded at 1545 (offset 2 lines).
patching file modules/user/user.pages.inc
Hunk #1 FAILED at 12.
1 out of 1 hunk FAILED -- saving rejects to file modules/user/user.pages.inc.rej
patching file modules/user/user.test
Hunk #3 FAILED at 468.
1 out of 3 hunks FAILED -- saving rejects to file modules/user/user.test.rej
drawk’s picture

Status: Needs work » Needs review
FileSize
12.71 KB

Reroll

catch’s picture

This probably ought to be re-rolled again for the new db layer changes. However, since it was at RTBC for a very long time with no comment, I'd quite like view from webchick or Dries on the general approach - which I don't expect to change after conversion.

BioALIEN’s picture

@catch: It's a performance improvement, but I guess they want to see some more benchmarks before they commit this? What you've covered in #33 is great. Can we see some benchmarks on bigger (or even tiny) sites to better understand the performance gain?

+1 on testing this out on scratch.d.o

pwolanin’s picture

The patch adds a new unique key for 'username' but if we are doing all look ups using the new column is there any reason to retain the unique key on user.name?

David Strauss’s picture

@pwolanin: There is no need to keep any key/index on {user}.name. We should never be putting conditions or sort criteria on {user}.name after this patch.

Crell’s picture

Not true. I may want to list all users alphabetically by their public name as part of a member directory. At minimum having an index on that field will keep such queries from killing the server. (I've already said my piece on the uniqueness question.)

pwolanin’s picture

@Crell - I don't understand your comment. This patch maintains a 1-to-1 mapping of name to username.

pwolanin’s picture

FileSize
62.06 KB

I re-rolled the patch to remove the index on name, and altered the code in other modules so all WHERE or ORDER BY should be using username rather than name.

Ran all tests - all tests pass with this patch (5431 passes, 0 fails, 0 exceptions).

maartenvg’s picture

It applies fine, and I get no errors. But I was wondering: although this is a database change, there is nothing for update.php to do, is that by design?

pwolanin’s picture

If you already applied this patch before, the update might not run.

maartenvg’s picture

Nope, that is not the case, I applied it to a fresh install.
- download fresh 7x tarball
- extract
- install
- apply patch
- run update.php

The last step says "No pending updates."

I've checked and the installation does not detect the presence of user_update_7002() (database reports last scheme to be 7001). I already came across the same strange behavior when creating a patch for #165786: DB Silliness: aggregator_item.description ought to be called body, but thought it might be my fault.

I believe this behavior was introduced when #286035: Remove update.php number dropdowns went in, not sure though.

EDIT: I've traced this back to http://drupal.org/cvs?commit=137218, therefore it is unrelated to this issue. You can ignore my comments about not functioning updates.

JohnAlbin’s picture

Status: Needs review » Needs work

Yes, Maarten, #286035: Remove update.php number dropdowns is the culprit for the "no pending updates" issue. But it has just now been fixed (again).

@pwolanin: re: Crell's last comment. He meant he's already talked about the uniqueness several times. starting with #9 above in this issue and all over #102679: Add a Display Name field to core in addition to Username.

And, after a really, REALLY long discussion, we finally got everyone in #102679 to agree that name needs to remain unique.

In addition, removing the unique constraint on name isn't needed for a performance patch. Let's keep the scope focused so as not to repeat #102679.

catch’s picture

Status: Needs work » Needs review

Unnecessary indexes are a performance hit on insert, there's no reason to add that hit if we don't need it. Uniqueness is enforced in PHP anyway, at least without contrib intervention, so I'm marking this back to needs review.

David Strauss’s picture

@Crell: I do agree that a unique index on the case-preserved name would enhance data integrity a little, but it's going to be inconsistent. The index is built off the collation that DB server is using, so the uniqueness on "name" will be case-insensitive on MySQL and case-sensitive on PostgreSQL. Trying to create equivalent contraints on "name" brings us back to the problem of making a portable, case-insensitive index, which is the problem this patch is trying to solve.

Crell’s picture

OK, I think everyone misunderstood what I was saying in #53.

1) I've said my piece on having a UNIQUE constraint on the name column, so I'm not going to say more on it here.

2) There are use cases where you will want to ORDER BY name, so having an index (unique or otherwise) on that field will help performance in those cases. So -1 on not having an index on the name column at all. That field won't be modified often enough that the extra cost on INSERT will matter, IMO.

maartenvg’s picture

I agree with Crell on at least 2). There are enough use cases where we would want to sort on the `name` column, rather than the `username` column. Especially with the future (by core or contributions) in mind where `name` won't be a 1-to-1 mapping to `username` anymore.

dahacouk’s picture

@JohnAlbin in #59: I am not trying to "repeat #102679" here - this is a performance patch after all.

But I must take exception to your statement "And, after a really, REALLY long discussion, we finally got everyone in #102679 to agree that name needs to remain unique."

Firstly, I did not agree to uniqueness of name so please either remove that statement for the record or qualify who exactly you mean by "everyone". Lets be accurate folks!

#102679 and this issue have different goals, remember. This issue does not solve the issue of #102679. Please don't muddy the waters.

After this patch has been accepted I'll be excited to find out how it can really help fulfil #102679.

Full steam ahead! ;-)

Cheers Daniel

JohnAlbin’s picture

Re: catch: "Unnecessary indexes are a performance hit on insert, there's no reason to add that hit if we don't need it."

Agreed, but I don't feel this is an unnecessary index.

Since core won't be using the name column for any of its querying (just for displaying a capitalization-preserved version of username), it opens up the possibility that contrib can extend the data of that field. But actively deciding to remove the existing index means we have to fight to get it back in another issue, because (regardless if its unique or not) that column would need an index to query on it. And if the length of #102679 is any indication, getting it back will never, never happen.

And, yes, I know my argument is scope-creep. But I'm not asking for a new feature of the patch, I'm asking for you to not remove an existing index.

Also, how big is this hit for a db with a large number of users? seconds? miliseconds? http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html shows that, for a table with 320,000,000 records (everyone in the US), each index insertion would be log(320,000,000) = 8.5 slower than an empty table, but that page doesn't show actual numbers. When I tested inserting a single user into a 200,000 users table with and without a name index, both results were the same "Query OK, 1 row affected (0.01 sec)". That, of course, is nowhere near exhaustive testing, but extrapolating to a table with 320 million users means the performance hit would be somewhere less than 0.085 seconds. Unless my math is wrong, which it could be, leaving in the index seems a negligible performance "hit".

Crell’s picture

JohnAlbin said what I was trying to say far better than I did. :-)

Let's also remember we're talking about a table that on a super-busy site like d.o has had ~300,000 inserts in 7 years. It's not like we're talking about hammering watchdog. If we lose a tenth of a second when creating a new user but avoid a full table scan for any contrib that wants to do something useful with that column, that's a huge win.

Really, removing the index is unnecessary premature optimization.

catch’s picture

Status: Needs review » Needs work

I'm fine with putting the index back in. This needs re-rolling for dbtng as well.

webchick’s picture

Holy cow, guys. ;)

Ok. My take on this patch is:

  1. The only point of this patch is to add a lowercase username column is for improved query performance. That's it. It's not for a usability improvement. It's not so we can do non-unique display names. It's not so users can have separate display names from internal usernames. It's not for enhanced security. It's not for $insert_pet_registration_feature_here. Performance! That's it. Any other related features/changes enabled by this patch should be discussed in other issues. Thanks for doing your part to keep patches to one context.
  2. It should not affect any of the existing behaviour of the registration/login system, from a user perspective. They should not see the user.username field exposed to them, and should not be required to log in with anything but their normal users.name. This was Dries's concern, and it's mine as well. The patch as it stands does not (or at least doesn't intend to; haven't tested it myself yet to make sure it works), and that is wonderful and as it should be.
  3. I'm not that fussed about index performance on inserts, to be honest. I agree that there are many use cases, particularly with Views, where pulling up a list of users sorted by users.name, so having an index on it probably makes a lot of sense. The important thing to me, again, is to NOT change the behaviour of the current code in any way.

I think that's everything that needs weighing in on here. Hope that helps.

pwolanin’s picture

In terms of an extra and unneeded index - I think the size in memory of the index is much more of a concern than insert speed - hence I'd assert that #52 is correct. No query should be putting a condition on name after this patch, only on username. Hence we don't need the index.

robertDouglass’s picture

I'm thrilled with the direction of this patch. Thorn in my side since 2006.

catch’s picture

Since username and name with be identical in core, if contrib wants to try to do something funky with name, they could also add the index back via schema API. If I can find the motivation I'll re-roll with both options though if no-one beats me to it.

Crell’s picture

admin/user/user in core allows sorting on the username. Presumably that will use "name" not "login name" in the new setup, so that when contribs do fork off the display name it will make sense. (And let's be honest, that's not the main purpose of this patch but we all know that is going to happen.) Thus, a core page will be sorting by name, not login name. Thus, we do need the index on that column.

pwolanin’s picture

@Crell - see above - a contrib module that needs a different index can add it. Let's deal with what core needs.

catch’s picture

FileSize
17.66 KB

Here's a re-roll to re-sync with HEAD and use named placeholders in all cases except the user_load query builder (which needs updating in a separate patch) and wildcards since I'm not sure how they work now. I'm getting ten fails with dblog.test and some in the user tests which I'll track down tomorrow. Looks like user.admin.inc was missed in pwolanin's patch to me, we should be able to use username for sorting there too really.

catch’s picture

Status: Needs work » Needs review
FileSize
18.82 KB
20.16 KB

Fixed the typo that was causing test failures and updated user.admin.inc to use the correct column for tablesort_sql. So here's a fully functional patch for review.

yoroy’s picture

Status: Needs review » Needs work

Catch polled me about what made moste sense to me in the sorting: A-Z, then a-z,or a combined Aa - Zz.
I'd strongly suggest using Aa - Zz, otherwise you basically get 2 seperate lists.

catch’s picture

Status: Needs work » Needs review

Sticking back to needs review since this is the behaviour in the current patch, I think yoroy cross-posted.

yoroy’s picture

yes, I didn't change status, must've been a crosspost.

chx’s picture

Status: Needs review » Reviewed & tested by the community

This patch is a good step forward. While I am not entirely surely this is the best direction we can go, there are indeed a big number of advantages of this, namely the possibility of having a separate username (edit:), the small code footprint, the simplicity. By fulfilling this popular request, you nicely sidestepped the necessity to sync the username and LOWER(name) . For now, I think it will do. We can always do more later.

webchick’s picture

Status: Reviewed & tested by the community » Needs review

So. Since the upgrade path from D6 => D7 is horribly broken right now, I /know/ none of you all tested the upgrade path. :P

Can someone please bring over their users table from a production D5/D6 site with lots of real life examples including names like "Bill O'Malley" and "SüP3RguRL27! and run user_update_7002() and report results?

catch’s picture

hmm, I tested the upgrade path from d7 - d7 yesterday, but not 6-7. I'm also pretty sure I tested the updates before the 6-7 path got broken (last time it was RTBC), but it seems I didn't post about it and I wouldn't want to swear 100% I actually did.

maartenvg’s picture

I did a test on a user base (about 1200 users). Everything went smooth but during the update, as was the endresult in the database itself. Although I have to admit that I have nothing as extreme as "SüP3RguRL27!, most users have numbers, there are a lot with punctuations (_.:;'^<>) and some use the occasional é and è.

Perhaps someone using a non-Latin alphabet can confirm my positive results?

EDIT: It was a user table from D5, that I copied into the D7 database (and changed the schema a bit to match that of the old D7)

webchick’s picture

Ok, and those people with punctuation and és are able to log in okay, and are displayed properly in the user list at admin/user/user?

maartenvg’s picture

I'm getting some conflicting results. I'll do some more rigorous testing.

Damien Tournoud’s picture

I upgraded the whole users table from drupalfr.org (5000+) users. All went smoothly. The conversion was done properly.

But users with special characters in their names could not login... That's a different issue thou: #310447: Add back SET NAMES="utf8".

With SET NAMES back, all works as expected.

xzilla’s picture

I'm curious if anyone has done testing to see if there is a performance regression by copying all of the usernames into a second column within the table, thereby adding additional footprint and/or i/o load to all of the other queries referencing the table? I ask because I have addressed the above problem (slow lower(name) queries) by adding a function index on lower(name) which solves this specific problem much better while not cuasing additional overhead where it isn't needed.

maartenvg’s picture

Ok. The problems I am having are not related to this issue, but with collation and other char set problems during importing .e.d.

But when I manually create a few users with strange names, and then apply the patch, I do not get problems with displaying or editing these users.

chx’s picture

Status: Needs review » Reviewed & tested by the community

Setting status based on the latest comments especially #85 .

lilou’s picture

Status: Reviewed & tested by the community » Needs work

Patch #75 cannot apply against CVS/HEAD :

drupal\7.x\modules\system\system.install (Cannot apply hunk @@ 17 )
drupal\7.x\modules\user\user.module (Cannot apply hunk @@ 7 )

Back to CNW.

robertDouglass’s picture

In system.install, why is status left out of this query?

+  db_query("INSERT INTO {users} (name, username, mail, created, data) VALUES(:placeholder, :placeholder, :placeholder, :time, :serial)", array(':placeholder' => 'placeholder-for-uid-1', ':time' =>  REQUEST_TIME, ':serial' => serialize(array())));
-  db_query("INSERT INTO {users} (name, mail, created, status, data) VALUES('%s', '%s', %d, %d, '%s')", 'placeholder-for-uid-1', 'placeholder-for-uid-1', REQUEST_TIME, 1, serialize(array()));
robertDouglass’s picture

Status: Needs work » Needs review
FileSize
18.94 KB

Rerolled.

robertDouglass’s picture

100% tests pass: 6988 passes, 0 fails, 0 exceptions

catch’s picture

Status: Needs review » Needs work

We should be using db_insert() for that query too.

c960657’s picture

Patch no longer applies to HEAD.

David Strauss’s picture

FileSize
16.28 KB

Here's a re-roll. I have not tested it.

David Strauss’s picture

Status: Needs work » Needs review
swentel’s picture

Status: Needs review » Needs work

We don't use t() anymore in schema descriptions, so that has to go out.

David Strauss’s picture

Status: Needs work » Needs review
FileSize
16.28 KB

Re-roll to remove t() from schema description.

David Strauss’s picture

FileSize
16.44 KB

Let's also fix the erroneous description of the old "name" column.

Status: Needs review » Needs work

The last submitted patch failed testing.

David Strauss’s picture

Status: Needs work » Needs review
FileSize
16.44 KB

Re-roll with non-conflicting update.

Status: Needs review » Needs work

The last submitted patch failed testing.

David Strauss’s picture

From what I can tell, we seem to be experiencing a bug in the test framework. It doesn't make sense for user login to fail only under the picture submission test case. The exceptions result from trying to insert "u" as the line number into the test results table.

c960657’s picture

This is due to #328781: Fix horrible things in the error reporting. If you apply the patch from that issue, you will get better error messages when running the tests.

David Strauss’s picture

All tests pass if we apply #328781, though we still get "Undefined index: name" exceptions in various places.

stewsnooze’s picture

subscribe

Dave Reid’s picture

Issue tags: +Performance
asimmonds’s picture

Status: Needs work » Needs review
FileSize
17.45 KB

Rerolled for current HEAD.

The "Undefined index: name" exceptions were from that user_save() can be called with only what needs to be updated in the $edit array. For example, in /admin/user when you block/unblock multiple users, $edit only has the status attribute set, and not the user or pass attributes.

Status: Needs review » Needs work

The last submitted patch failed testing.

asimmonds’s picture

Status: Needs work » Needs review
FileSize
18.91 KB

Rerolled for current HEAD.

Dave Reid’s picture

Status: Needs review » Needs work

1) These in system_install() will fail on either PostgreSQL or SQLite because of a bug in PDO that placeholders cannot be reused:

+  db_query("INSERT INTO {users} (name, username, mail) VALUES(:empty, :empty, :empty)", array(':empty' => ''));
+  db_query("INSERT INTO {users} (name, username, mail, created, data) VALUES(:placeholder, :placeholder, :placeholder, :time, :serial)", array(':placeholder' => 'placeholder-for-uid-1', ':time' =>  REQUEST_TIME, ':serial' => serialize(array())));

2) Having the following code twice in user_save is dubious:

+    // Ensure $user->username is lowercase if it's the same as $user->name.
+    // We use the database to do the lowercasing to ensure correct searches
+    // in case of any discrepencies in character handling between the
+    // database and PHP.
+    // LOWER() is also slightly faster than mb_strtolower() for this task.
+    if (!empty($edit['name']) && $edit['username'] == $edit['name']) {
+      db_query("UPDATE {users} SET username = LOWER(username) WHERE uid = :uid", array(':uid' => $account->uid));
+    }

I don't see why we can't do something in the beginning of user_save() instead of having to add an extra query to every call to user_save(). Are the comments still valid even though we now require PHP 5.2 and have drupal_strtolower()? In fact, you have an added section earlier in user_save:

+  // If username isn't set, populate it with name.
+  if (!empty($edit['name']) && empty($edit['username'])) {
+    $edit['username'] = $edit['name'];
+  }

That should be simply:

  $edit['username'] = drupal_strtolower($edit['name']);

3) On the new test case, class UserSearchTestCase needs a phpdoc block right above it, getInfo() does not, setUp() does not, and testUserSearch() does need a phpdoc block. See the testing code standards at http://drupal.org/node/325974.

4) In addition, there are some sections in testUserSearch() that look smooshed together.

5) A few coding issues:

+  db_query("UPDATE {users} SET uid = uid - uid WHERE username = :empty", array(':empty' =>''));  
Crell’s picture

To be clear, the non-reusability of placeholders in PDO is a design decision, not a bug. Some databases support reuse, some don't. Earlier versions of PDO tried to work around that limitation for those databases that didn't, and failed miserably. So in later PDO versions they ripped that out and declared placeholders non-reusable on any DB. (I spoke with Wez Furlong about that about 2 years ago. :-))

catch’s picture

@davereid, we can't guarantee that the result of drupal_strtolower() and LOWER() will be exactly the same on all collations due to (potentially) different handling of characters in their lowercase versions. I haven't yet seen an example of where this would actually break at all, but it's one of the annoyances which makes both this and the equivalent taxonomy issue ugly. If anyone's got decent information on that subject - whether it's critical, whether we could just document the potential behaviour, that'd be handy.

catch’s picture

Marked #83738: LOWER(name) queries perform badly; add column name_lower and index. as duplicate despite being the older issue to concentrate efforts in one area.

markus_petrux’s picture

Minor issue about the patch in comment #110: it mixes 'Username' and 'User name' in translatable strings. Which one is correct?

kbahey’s picture

Subscribing.

(Before dww fixes +1 subscribe)

chx’s picture

I think the db_query("UPDATE {users} SET username = LOWER(username) WHERE uid = :uid", array(':uid' => $edit['uid'])); piece needs to go in a helper function. It can't be avoided, agreed.

catch’s picture

Title: Performance: Add column 'username' to {user}, remove LOWER(name) » Performance: Add column 'username' to {users}, remove LOWER(name)
blake_’s picture

I don't think we should be storing the LOWER() value in the database at all, especially if there's a unique index on it. For any characters that aren't ASCII, each different database may handle the LOWER call differently. For example: MySQL & MSSQL use whatever collation is currently relevant to lower characters, I couldn't find out what Postgres does, SQLlite only lowers ASCII.

My view is that we're introducing here a new column, to provide a sort of "normalised name" that can be used to search for users / prevent a user creating a new account with "similar" characters. I think that the column should be named after what it actually does (something like "normalized_name", "standard_name" etc), and there should be a function (in PHP) that generates this value that then gets stored in the DB when we insert a user. To start with this function can just use PHP string lower function.

The advantages are:

  • We don't depend on an external function (LOWER) that could be implemented differently between different databases.
  • A module can strengthen the function if desired. The admin might decide that spaces should be removed, or that vowels with accents/umlauts/macrons were treated as undecorated, or that the characters I l 1 | all got transformed into one value etc. This would reduce similar looking usernames on such sites.
  • We actually fix this performance issue! We will just be using an index over values in a column, rather than computing something every row.
  • It extends to any future database engine. It is only another database column. We don't need to write triggers, and we don't need to worry about DB case sensitivity/insensitivity.

My $0.02 :) Discuss.

joshmiller’s picture

Just read through this post and wanted to find out how it went from RTBC to needs work and not updated for some time. So here's where we are (for all of us):

Why should this go in? A stat in by catch #14 and a quote by Angie in #65

in a typical case we're comparing 1273.97 ms to 0.37 ms

The only point of this patch is to add a lowercase username column is for improved query performance. That's it.

Issue summary since last RTBC #89

  • #85 through #111 - Issues raised have been addressed
  • #111 - Dave Reid has some issues (see below for summary)
  • #112 - Crell says #111-01 issue is real and should be addressed (but, BTW it is by design) -- I added my own "reading between the lines" filter
  • #113 - catch raises a new issue for LOWER()
  • #118 - chx agrees that #111-02 is the way to go
  • #119 - blake_ points out that we could just make it a php function to avoid #113

What should the next patch include to help it be RTBC?

Ok, so that's the summary, and depending on how people feel about #119, we could potentially find another RTBC prior to 9/1 by addressing the following from #111 and #112

  1. Queries cannot include multiple placeholders
  2. We have duplicated code in user_save
  3. On the new test case, class UserSearchTestCase needs a phpdoc block right above it, getInfo() does not, setUp() does not, and testUserSearch() does need a phpdoc block.
  4. Code formatting issues in testUserSearch()
  5. There is a code problem with an SQL statement
  6. Document or fix the LOWER() issue raised by catch in #113

That's what the next patch needs. Hopefully this helps. Also, lets keep in mind Angie's guidance in #65

joshmiller’s picture

Ignore my last comment, as we have a much better solution provided by catch in IRC:

"I think we can just convert these queries to db_select(), use LIKE which is case insenstive and mapped to ILIKE on postgresql and that'll work without the extra column."

Josh

joshmiller’s picture

Title: Performance: Add column 'username' to {users}, remove LOWER(name) » Rework username queries to improve performance

changing title

Crell’s picture

That particular feature of DBTNG was added with this issue in mind. I am not convinced that it is a good solution, mind you, and a lookup column would likely be faster and more flexible, but in theory it should work.

kbahey’s picture

I am actually in favor of a solution that does not add an extra column. An extra column can introduce inconsistencies, and complicates teh schema. Since only PostgreSQL is case sensitive and the ILIKE solves that, I like that solution since it will remove a slow query for MySQL.

And it seems that we already solved it this way when catch did the user_load_multiple() patch. It already uses db_select(). So the one that caused the most trouble is already out of the way!

    // If the conditions array is populated, add those to the query.
    if ($conditions) {
      // TODO D7: Using LIKE() to get a case insensitive comparison because Crell <--- COMMENT
      // and chx promise that dbtng will map it to ILIKE in postgres. <--- COMMENT
      if (isset($conditions['name'])) {
        $query->condition('u.name', $conditions['name'], 'LIKE');
        unset($conditions['name']);
      }
      if (isset($conditions['mail'])) {
        $query->condition('u.mail', $conditions['mail'], 'LIKE');
        unset($conditions['mail']);
      }
      foreach ($conditions as $field => $value) {
        $query->condition('u.' . $field, $value);
      }
    }

It just needs cleanup by taking out the comment above.

The others that are done by the admin (e.g. searching for a user, ...etc.) are infrequent and not a cause for performance issues, but for the sake of code consistency and DX, it is a nice to have.

There is one in comment that is checked when a comment is posted. So only applicable for sites that get very frequent comments.

robertDouglass’s picture

Sorry. Kbahey, you've stated this many times in the history of the issue:

I am actually in favor of a solution that does not add an extra column. An extra column can introduce inconsistencies, and complicates teh schema.

But you've never really explained what these inconsistencies are? Or how the "complication" of the schema is so evil that it offsets the known performance gain?

I'm glad we've come to a possible solution with LIKE() that we think will make everyone happy. Maybe, like kbahey says, we're just a cleanup patch away from closing it. It seems that we did so without any benchmarking, so I hope that it actually does make Drupal faster.

robertDouglass’s picture

And joshmiller, thank you for the excellent summary!

kbahey’s picture

@Robert

It is all about data redundancy vs normalization. If you have the same data in different columns, then there are risks that contradict good database design. For example, there is the risk that at some point a new piece of code will update one place and forget the other, leading to inconsistency. Also looking at different places gives different answers. There is also the added storage overhead. This is why deriving data that can be derived is better, for consistency, than storing multiple pieces.

Having triggers and stored procedure in the database help minimize this risk, since no external code need to be modified, but we are not using any of those, and they have their own challenges (database portability, opaqueness, ...).

I understand that breaking normalization rules is necessary in some cases. Done right, they are a great help (e.g. tracker2, prev_next, ...). They are sort of external to the schema and the data is totally replaceable any time.

Since we have better options here, this is a case where an extra column is bad.

Yes, we only need cleanup for this just for DX and consistency. The LOWER that is used now is only for admin pages, which are infrequently used.

pwolanin’s picture

Just discussed this problem with Davide Strauss - our consensus was that we should add a case insensitive comparison operator to DBTNG and push this into the driver layer.

This would allow us to optimize this for MySQL (at least), would allow this to be used also for terms, etc, where we are doing text-based lookups and would be readily available for contrib.

postgres version would initially at least just use the LOWER() = LOWER(). So performance would be the same as it is currently in core.

pwolanin’s picture

talking with catch, he confirms what Khalid says above - essentially this is in core already - so this is really a cleanup issue to make sure it works right in SQLite3 and that we are removing LOWER() where we don't need it in core queries.

The bad part is that we are still generally doing %string% matches, which cannot be indexed.

Dave Cohen’s picture

Do I understand correctly that with this patch username is used for authentication and searches, while name is used for display and third-party module can choose to make name non-unique?

If so, FINALLY!

pwolanin’s picture

@Dave Cohen - that has been proposed elsewhere, and there was no consensus. This issue is just about finding ways to improve query performance.

dahacouk’s picture

Display Name Code Sprint Saturday September 5th...

Dave Cohen and I just had a really good face to face discussion with Angie (webchick). In summary, we are going to use the Saturday code sprint to make sure that Drupal 7 will facilitate us building contributed modules to implement display names.

This may mean altering some parts of core in a very small way. If you are passionate about cleaning up "theme_username" throughout core and making it easier for contributed modules to implement display names in a clean and consistent way, then join us at DrupalCon Paris or over the wires.

Saturday September 5th 2009 10:00 to 17:00 CEST
Location: DrupalCon Paris and online elsewhere

If you use IRC then #kendra is ready and waiting for you during those times.

I'll make sure developers are kept well feed and watered. So, if you are at DrupalCon and passionate about making it easy to implement display names then get in touch.

This is pretty much our last chance for making Drupal 7 display name friendly! Let's make it happen!

Cheers Daniel

mobile: +44 7853 627 355
username: dahacouk on skype, jabber, irc...
Kendra Initiative - http://www.kendra.org.uk

dahacouk’s picture

jromine’s picture

subscribe

dahacouk’s picture

OK. So, the strategy over the next few weeks (during code sludge) is to get theme_username() used consistently in core. From there contributed modules will be able to implement display names with more ease.

So, if you have time over next few weeks please help Dave Cohen work on the following issue:

#192056: User's raw login name should not be output directly

Many thanks,

Daniel

catch’s picture

Title: Rework username queries to improve performance » Replace LOWER() with db_select() and LIKE() where possible
Assigned: catch » Unassigned
Category: task » bug
FileSize
832 bytes

Now we have LIKE/ILIKE support in the database layer, it's a bug not to be using it where appropriate.

Here's a start on a conversion, hoping someone else can pick up the other places in core where this needs doing.

Wherever you see LOWER(:placeholder) this can be converted to db_select() and LIKE. The only exception is LIKE LOWER(%:placeholder%) which has its own problems we can't deal with here.

catch@catch-laptop:~/www/7$ grep -rl "LOWER(" *
modules/system/system.install
modules/user/user.module
modules/comment/comment.module
modules/taxonomy/taxonomy.module
modules/statistics/statistics.admin.inc
modules/profile/profile.admin.inc
modules/profile/profile.pages.inc
modules/profile/profile.module

Marking CNR to give test bot a go at it.

catch’s picture

Status: Needs work » Needs review
catch’s picture

FileSize
1.55 KB

Did a couple more.

TODO:
user_autocomplete()
profile_admin_settings_autocomplete()
profile_autocomplete()

catch’s picture

FileSize
2.36 KB

Somehow managed not to inciude the previous patch, todos remain the same.

Also user_is_blocked() should cast to bool, but keeping this to minimum necessary changes for now.

meba’s picture

A small review:

- Applied succesfully
- Created a new account with some uppercase characters - works
- Logged in as this user - works
- Created a new article and commented on it - works
- Created a profile field with a category, including some uppercase characters. Created second with in the same category - works
- Blocked my new user, tried to log in as the blocked user - doesn't allow login - works

c960657’s picture

Status: Needs review » Needs work

The wildcard characters % and _ need to be escaped. Otherwise confusing things happend if they appear in the user-supplied string. I suggest something like this:

-    $query->where('LOWER(category) = LOWER(:category)', array(':category' => $category));
+    $query->condition('category', addcslashes($category, '%_\\'), 'LIKE');

AFAICT SQL-92 says that there is no escape character if ESCAPE is omitted in the LIKE expression (... WHERE foo LIKE 'abc%' ESCAPE '\'), though MySQL and PostgreSQL uses \. In order to make all databases behave the same, we can make the following change to DatabaseCondition::mapConditionOperator() in database/query.inc:

@@ -1326,15 +1326,15 @@ class DatabaseCondition implements Query
       'NOT IN' => array('delimiter' => ', ', 'prefix' => ' (', 'postfix' => ')'),
       'IS NULL' => array('use_value' => FALSE),
       'IS NOT NULL' => array('use_value' => FALSE),
       // These ones are here for performance reasons.
       '=' => array(),
       '<' => array(),
       '>' => array(),
       '>=' => array(),
       '<=' => array(),
-      'LIKE' => array(),
+      'LIKE' => array('postfix' => " ESCAPE '\\''"),
     );
     if (isset($specials[$operator])) {
       $return = $specials[$operator];
     }
     else {

This pattern should be applied elsewhere in core, e.g. in various autocomplete callbacks.

Alternatively, perhaps we can make a new pseudo-operator in mapConditionOperator() that handles escaping automatically:

+    $query->condition('category', $category, 'CASE_INSENSITIVE_COMPARE');

It would make the code easier to read at a glance, though the reader may have trouble figuring out where that operator is defined. If the schema API some day supports case-sensitivity to be specified for each column individually, we cannot necessarily rely on LIKE (on MySQL the behaviour of LIKE depends on the column's collation), and then a special operator may be useful.

Crell’s picture

There was an issue open for a year to figure out what to do with LIKE vs. ILIKE and similar mapping and no one cared about it so we just left it at "fold everything to use case-insensitive LIKE like MySQL". As it's an API change to rewrite that mapping I don't think we're able to do that at this point.

SelectQuery is not cheap, and does incur an additional PHP-space cost due to all the stack calls. I'd want to see benchmarks before we start using it everywhere.

The % question is... ugh. I'm not sure what the correct way to handle that is, since we *do* want to allow both real wildcards and % in a literal. That's probably separate from this thread, though ,which is a disturbingly long performance thread.

catch’s picture

@Crell I'm pretty sure there's benchmarks earlier in this issue or its predecessors, db_select() takes at most 1-2ms, LOWER() can take hundreds of milliseconds or seconds afaik on large datasets.

We're already relying on LIKE/ILIKE mapping elsewhere in core, this issue is just applying it consistently. That means we probably need a spin-off for the % and _ issue which is nasty, and perhaps CASE_INSENSITIVE_COMPARE - both of these would be API additions rather than changes as such, and fixing a critical performance issue, so doesn't seem too bad at all.

c960657’s picture

I have spun off the LIKE xxx ESCAPE '\\' issue into #654662: No way to escape wildcard characters in LIKE queries.

coltrane’s picture

Status: Needs work » Needs review
FileSize
5.06 KB

While #654662 is pending here's 139 rerolled with the following done

user_autocomplete()
profile_admin_settings_autocomplete()
profile_autocomplete()

I removed LOWER, which I think is OK in db_query_range() but let me know if not.

catch’s picture

In db_query_range() we still need LIKE() - so those would have to instead be converted to a db_select() with ->limit().

chx’s picture

with ->range. limit is a --badly named-- pager method.

c960657’s picture

Status: Needs review » Needs work
sun’s picture

catch’s picture

Status: Needs work » Needs review
FileSize
5.31 KB

Updated this after #654662: No way to escape wildcard characters in LIKE queries - haven't run all tests although the ones I did passed, see what the test bot says.

Our code style for db_select() doesn't work well for if ($query) { so I've left that on one line in the case that appeared since it's chained and everything else I tried looked worse.

chx’s picture

Status: Needs review » Reviewed & tested by the community

OK!

webchick’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs documentation

I read through this extremely long issue again. Thanks for the summary mid-way through, Josh!

I too have witnessed LOWER() queries wreaking absolute havoc on large datasets (I think Robert and I are both thinking of the same project ;)). So I'd like to see this fixed.

The gist is that the first solution for this was an extra DB column to store the lowercase version of the username. This felt wrong for a variety of reasons: you'd need a term_lower, an aggregator_item_title_lower, etc. which is just not scalable to all of core/contrib.

This new approach takes the existing db_like() operator, which is already in core, and applies it consistently, instead of using LOWER() in queries. This means we can optimize MySQL/PostgreSQL to use the behaviour that makes the most sense.

Syntax-wise, it looks like:

->condition('name', db_like($name), 'LIKE')

Which is kind of a WTF, esp when compared with a straight-up LOWER() query. But I can't think of anything better, and as noted in the op, many, many, many issues have been started over the past 3.5 years to solve this problem. :P

So.... Committed to HEAD. This practice will need to be documented in the module upgrade guide.

catch’s picture

Status: Needs work » Fixed
c960657’s picture

Status: Fixed » Needs review
FileSize
10.19 KB

The patch left a number of LOWER() calls untouched. Was that intentional?

This patch removes the remaining calls. It also introduces a 'NOT LIKE' operator.

Which is kind of a WTF, esp when compared with a straight-up LOWER() query. But I can't think of anything better,

#141 suggests an approach that at least is easier to read. This may even be extended to STARTS_WITH, CONTAINS and ENDS_WITH operators. But still somewhat WTF.

catch’s picture

Somewhat intentional yeah.

If you use a LIKE query with a string like '%foo', that also doesn't use an index. I've not benchmarked using LOWER() vs. not using LOWER() in that scenario but as far as I know it won't make a lot of difference - would need to be tested on a very large dataset. If for some reason it does make a difference, or if we just wanted to remove LOWER() completely for consistency (or because we hate it so much) I'd be happy though.

webchick’s picture

I think I'd prefer the consistency, personally. I read your description of when to use it and when not to use it like 5 times and I still don't get it. ;)

catch’s picture

Priority: Critical » Normal
Status: Needs review » Reviewed & tested by the community

This is no longer critical. However I read through the patch and the changes look great, also gets rid of some ->where() which is lovely.

c960657’s picture

FileSize
10.03 KB

Reroll.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to CVS HEAD. Thanks.

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -Needs documentation

Automatically closed -- issue fixed for 2 weeks with no activity.