This patch attempts to address the issues at #188734: LOWER(name) queries perform badly; add fullname and restrict name as lowercase., #83738: LOWER(name) queries perform badly; add column name_lower and index., #102679: Add a Display Name field to core in addition to Username and #181625: LOWER() is slow and can not be indexed - but it since it doesn't fit that neatly into any of those issues, takes a new approach, and they've all got quite long discussions already about different approaches, I'm hoping a new issue will be OK. However, if you're planning to suggest a radically different approach to this, please read those issues first - since many different approaches have been discussed and rejected in those issues already. Thanks :)
What this patch does:
Adds users.username to {users} table in addition to user.name
Modifies user queries to use that table to avoid expensive LOWER() queries, however $user->name is returned just the same as now.
What it doesn't do:
Change anything in the UI or break backwards compatibility for either site owners or modules which use $user->name
What it will allow us to do:
Populate $user->name to anything they like via a contrib module or followup core patch either programmatically or via form_alter on user forms - they'll need to also handle validation and searches for custom user names if they do so - since there could be various approaches, it's best that complexity is left out of this patch.
So, we get a big performance win, and the flexibility to have more variations in user name handling down the line in either core or contrib.
Current version of the patch causes 6 failures in UserRegistrationTestCase() (starting with user_check_password(), which ought to be unrelated), but marking as needs review for general concept.
Comment | File | Size | Author |
---|---|---|---|
#158 | 279851-more-lower-2.patch | 10.03 KB | c960657 |
#154 | 279851-more-lower-1.patch | 10.19 KB | c960657 |
#150 | 279851-150.patch | 5.31 KB | catch |
#145 | 279851-remove-lower-145.patch | 5.06 KB | coltrane |
#139 | lower.patch | 2.36 KB | catch |
Comments
Comment #1
Senpai CreditAttribution: Senpai commentedSubscribing for a review.
Comment #2
cburschkaYou didn't mention whether {user}.username will be constrained to lower case, but I presume that that's how you got rid of the LOWER() query?
I'll review and run some test cases...
Comment #3
cburschkaI'm getting the same 6 fails in registration, plus 25 fails in uploading a picture (haven't compared with the unpatched site yet). I'll investigate a bit.
Comment #4
catch{user}.username will be constrained to lower case, yeah.
Comment #5
cburschkaFalse alarm on upload. It's just curl being stupid again, returning zero-length responses. When I'm done with this, I'll rewrite my copy of simpletest to use drupal_http_request; it's really completely unusable as it is for me.
This is the content of a debug file for user_check_password:
I'll look at _crypt_password next...
Comment #6
pwolanin CreditAttribution: pwolanin commentedbug here:
Comment #7
catchFixed that bug. Also spoke to pwolanin about this in irc, and he suggested a way to avoid drupal_strotolower in user_save() - which means that all the lowercasing is handled by the database now instead of some php, some database.
Still 6 fails in user registration tests. Thanks Arancaytar for the debugging :)
Comment #8
pwolanin CreditAttribution: pwolanin commentedperhaps this line in the test should change too?
Comment #9
Crell CreditAttribution: Crell commentedBless you. :-) Visual review:
- I still do not like not having a unique constraint on the display name (nee name). As coded, what would happen if someone set their name to "Crell" and someone else to "crell"? The unique key is on username, which is the case-folded version. So what would happen is username would get set to "crell" for both of them, which would create an SQL error, wouldn't it?
- This code:
Why are you using the database for the lowercasing? Would its result be any different than doing it in PHP space and avoiding the extra query?
- So then for a contrib module to modify the name value, it would need to:
+ form_alter to show "username"
+ form_alter to NOT show "name"
+ inject its own validate and submit hooks *before* the default ones to set the "name" field by its own logic.
Or alternatively, do the same in hook_user validate/save. Right?
Otherwise I think this is a good direction at this point.
Comment #10
catch@Crell: I've added the unique constraint back to users.name - if a contrib module wants to handle that other issue that's been discussed elsewhere, they could remove the unique key via schema API.
Using the database for lowercasing exclusively was pwolanin's idea, which I reluctantly agree with. There's a chance that certain collations on certain databases are going to operate differently to mb_strtolower (I don't have enough practical experience with collations to know how much of a risk this is) - but by using the database we ensure consistency. Also the database is only being used for lowercasing on writes now.
Additionally, php.net suggests that LOWER() is actually faster than mb_strtolower, so for micro-optimisation there's probably not much in it.
Contrib needs to do exactly what you suggest, yes.
I'll probably update user.test to attempt creating users with siMiLar usernames if that's not in there already. Still at a loss as to why the password hashing is failing.
Comment #11
Dries CreditAttribution: Dries commentedPersonally, I'd find it very confusing if my display name was 'Dries Buytaert' but my login/username was 'dries'. I'd consider that to be a serious usability regression. Unfortunately, this patch seems to take us down that road and I'm not a big fan of that. Maybe I misunderstood where this was heading though.
Comment #12
cburschkaI would argue that it makes sense for login names to be permanent and displayed names to be changeable. The big problem with changing login names is that it is hell for the password manager - you don't necessarily want a different login each time the displayed name changes...
Comment #13
Damien Tournoud CreditAttribution: Damien Tournoud commentedEdit: I misunderstood the scope of this patch.
@Dries: the new 'username' column is just pregenerated to 'LOWER(name)', in order to avoid a LOWER operation during SELECTs (that can't be indexed).
There is no change whatsoever here on the UI or user-experience part.
@others: would it be possible to improve that patch a little by removing spaces from the username column? That way you will not be able to register both 'Dries Buytaert' and 'DriesBuytaert'.
Did a quick visual review of the code, and it looks great.
Comment #14
catch@Dries, my primary concern here is to remove the LOWER() in user queries. Robert's comparisons from 2006 show the difference in performance with 500k users. And the queries haven't changed since then:
I agree 100% on the 'display name' issue, however, there's a lot of demand for it (including from people who wouldn't argue for it without good reason like Crell and Webchick). This issue is an enabler for that - either in core or contrib, but I'm absolutely not going to introduce any explicit support for such a thing with this patch - there are zero API or UI changes here. The only connection is that this patch doesn't introduce the extra column in such a way that it would block the other issue (by requiring yet another column or something).
@Damien - existing sites might already have a Dries Buytaert and DriesBuytaert - which'd give us a duplicate key error on update and lock one of them out. This patch currently has zero effect on the username policies of any existing sites, and I'd like to keep it that way.
@Arancaytar: #102679: Add a Display Name field to core in addition to Username is over there ;)
Comment #15
cburschkaSorry, I misunderstood the scope of this patch.
How severe should we make this uniqueness constraint, as in how much should be stripped from the username field? Beyond spaces, we could strip everything but letters, so that "Dries Buytaert02345" would be unavailable as well. However, the stronger the constraint, the more problems people will have when updating their databases that may violate the new rules.
Comment #16
Crell CreditAttribution: Crell commented@Dries: There are actually extremely good reasons to allow that distinction. In my case, I can't control the login name people choose when signing up. However, nobody on my site knows each other by their login name. We actually use real mother-given names for everything. Then when I try to select someone's name using the auto-complete author field, how do I know that "Dries Buytaert", who I'm trying to set as the author of a node, is really "trkfan849"? I don't. Allowing these fields to be split is a usability improvement, because right now if you use theme_username then you get a disconnect between user's identifying strings in different contexts. That said, this patch as pwolanin said doesn't change the default behavior other than silently setting up a pre-lowercased username for performance. I suspect Drupal.org would never override that.
@pwolanin: If that's the reason for using LOWER() in SQL, that should be documented in the code.
Also, having written a D5 module that tries to custom-set the login name field, I can tell you that it is not as easy as that to set a custom value there because the user workflow sucks. :-) I'd have to try it with this mechanism, but let's be careful that "just form alter it" isn't our answer unless we know that will actually work. Anyone have time to write up a demo module as proof of concept? (I wish I did, but I probably don't.)
Comment #17
catchThe unique constraint is exactly like it is now - a unique index on the column and form validation to avoid duplicates with case variations. This means you can't have Dries and dries, but you can have Dries1 etc. - again, exactly the same as now. Anyone wanting to force more unique usernames also needs to find another issue to do so in, since it's not within scope.
Comment #18
catch@Crell. I've added documentation to both user.module and user.test to explain the justification for using LOWER(). If someone wants to come along and tell me it's not necessary, that's fine too.
Comment #19
JohnAlbinI think this patch also dovetails nicely as an excellent first step on the road to "display name" capabilities in core (and that’s the last I’ll say about it in this issue.) Given how out of control the others issues have become, we definitely need to limit the scope of this issue.
Catch, are you re-rolling the patch after your IRC chat with chx? Or did I misread that discussion?
Comment #20
catchFollowing discussion with chx, we keep LOWER() within the database during user_save(). I've also modified user.test to use a real SELECT query as well.
Also, this version of the patch passes all user tests now (as a result of that change), and doesn't harm any other tests either.
Comment #21
catchLast patch had some silly errors in it, this one should be good.
Comment #22
dahacouk CreditAttribution: dahacouk commented@catch: Well done for coming up with this approach! Will your patch make it easier for a contrib modules to add non-unique display name functionality (emphasis on "non-unique")? In other words will $user->name (the display name) be able to be made non-unique if desired?
@Dries in #11: There are many webadmins calling out for display names being distinct from login names. There are many hacks being produced for what, some see, as a core function of a web site: being able to display a real name as different from their login name. This scenario would be optional for webadmins. And there's no reason why the current situation (of just one display/login name) couldn't be maintained too. But it's good to get some standardisation for this functionality into core so that developers can start to deliver more consistent solutions. Have a peek at #102679: Add a Display Name field to core in addition to Username to see more people wanting display name functionality in core. Sure, this patch doesn't go all the way but it's a start.
@Arancaytar in #12: Why is the "big problem with changing login names... hell for the password manager"? Surely, before a login name change request is actioned, the password manager just has to check that there are no duplicates. How is that "hell"? I say keep things flexible unless there's a real security issue or big hit on performance.
-1 to permanent login names
@Damien Tournoud in #13: Why "removing spaces"? If you are talking about "removing spaces from the username" - as in display name - then we wont be able to mimic real names. So, I see no advantage here.
-1 to removing spaces from the username column
@JohnAlbin #19: Nicely restrained! ;-) You're right, it does seem that we need to do this in small steps. Looks like we'll get "display name" capabilities in core around Drupal9. ;-)
Cheers Daniel
Comment #23
Damien Tournoud CreditAttribution: Damien Tournoud commented@dahacouk: I misunderstood the scope of this patch at first, please don't broaden it. This patch will ease the job of contrib modules to implement display names distinct from login names. It is only a first step in that direction.
Here is an updated version of catch#21:
user_is_blocked()
to use the username, (this needs to be discussed)user_search()
,_user_edit_validate()
to allow modules to implement not distinct names.Comment #24
catchChanging the title to be more descriptive.
I checked coverage.cwgordon.com and noticed that some of the user queries aren't being tested yet (i.e. user search), so this adds a test for search/user to verify they work as expected.
I'm not really set up for benchmarks, but will see if I can do some basic comparisons.
Comment #25
JohnAlbinVisual review:
$has_username = !empty($edit['username']);
note the missing “!”.WHERE status = 0 AND name = LOWER('%s')
even works; shouldn't that beWHERE status = 0 AND LOWER(name) = LOWER('%s')
?Comment #26
Damien Tournoud CreditAttribution: Damien Tournoud commented@JohnAlbin:
Indeed, thanks.
We are not far from being able to drop the unique constraint on {user}.name, why not following the route a little further? At the minimum, we need to only validate the uniqueness of {user}.username, because if we don't, contrib modules will not be able to drop the constraint themselves.
It works on MySQL because the default collation is case-insensitive. Obviously there is a bug here.
Comment #27
catchCrossposted with Damien, and looks like JohnAlbin crossposted with me. #23 introduces 5 test failures which aren't in #24 and also uses uid as a string in user_save.
Rolled the user_search fix from #23 into this one, which otherwise is identical to #24. I'm not knee-jerk against the other changes, but please no test regressions or scope creep.
Comment #28
catchOr typos ;)
Comment #29
JohnAlbinTrying VERY hard not to cross post again…
My review in #25 was of Damien's patch in #23. :-p
Again, Damien, this idea is scope creep. Dropping the unique constraint on {user}.name is highly debatable (even for contrib.) But the debate is out-of-scope for this issue. We can revisit your proposed change to _user_edit_validate in a follow up issue. :-)
Catch, I think Damien is correct; the user_save db queries do need to do
WHERE uid = %d
instead ofWHERE name = '%s'", $array['name']
. That’s what user_save uses currently: http://api.drupal.org/api/function/user_save/7 Or does that cause simpletest breakage? I haven't run them as I’m still visually reviewing the patches.Also patch in #28 has “No newline at end of file”. But that’s highly pedantic, so you can ignore me on that one if you want.
Comment #30
Damien Tournoud CreditAttribution: Damien Tournoud commentedWell, we can drop the INDEX constraint on the {user}.name column while keeping it on {user}.username. In the standard behavior, the uniqueness of the {user}.name column is still guaranteed (because out of the box the {user}.username column is the lowercase form of {user}.name, thus if the former is unique, the latter also is).
Comment #31
catchBug in the test. Should try to get patches right harder than I try not to crosspost.
@JohnAlbin's question in #25 - using uid instead of $array['name'] there caused simpletest breakage, yeah. Ways around this much appreciated.
Comment #32
Damien Tournoud CreditAttribution: Damien Tournoud commentedCatch: sorry for #23 I should have indicated that those were completely untested.
Here is a update from #31, including the change to use uid as a key in user_save().
Edit: fix bad spelling that hurt my eyes.
Comment #33
catch#32 is fine. Thanks Damien :)
Here's a quick comparison for performance on 98,000 users generated by devel, these run from the command line:
Subsequent queries for different names always take between 0.13 to 0.14 seconds. query_cache is only used if the name is identical.
query_cache gets used even for different names - so 0.00 seconds for subsequent similar queries.
So we get a double improvement - 3-4 times faster on these selects the first time they're run, but the query_cache in MySQL will handle most subsequent requests.
If someone has a decent benchmarking setup, it'd be great to get more scientific data than this - or with some advice I can probably do something a bit better at home.
Comment #34
pwolanin CreditAttribution: pwolanin commentedminor concerns re: text/code comments:
The schema description text seems a little incorrect or misleading:
$user->name *is* still constrained to be unique, so I think the description does not need to change at all.
The username description should reference the fact that it's a LOWER() version of $user->name used to speed lookups during login, search, etc.
I think should reference character handling or character set, not collation, since I think collation just determines sort order.
something like:
Comment #35
Damien Tournoud CreditAttribution: Damien Tournoud commented@pwolanin: in MySQL terminology, "collations" refers to the set of rules for comparing characters (for both inequalities and equalities). Collations for are used for sorting, for matching and for normalizing strings before putting them into index classes.
This said, I like the wording of your comment better :)
Comment #36
catchThose changes are fine, and while collation is correct, I agree with Damien that the revised version explains the motivation better.
New patch attached.
Comment #37
pwolanin CreditAttribution: pwolanin commentedto be consistent with other schema comments, this:
t('LOWER() version of $user->name for login, search performance.'),
should be written more like:
Comment #38
catchYes. good point.
Comment #39
R.Muilwijk CreditAttribution: R.Muilwijk commentedTests look good to me , ran them all. It doesn't break any tests that don't fail normally. I also like the idea of not having to use a LOWER() on the database. So +1 from me!
Comment #40
Senpai CreditAttribution: Senpai commentedTested against a fresh checkout of HEAD, but not a clean DB.
I created 10 users with all manner of mixed case, spaces, symbols, and the like. All form submissions were treated as they should have been, with illegal symbols being prohibited, and mixed-case or spaces being allowed. In each instance, I checked the database to verify core's actions and they were good.
I then deleted nine of those users, and in each case the deletion happened flawlessly as it should have. I kept the last user for a Change Own Username test, and it passed that with flying colors.
This patch allows users to login with either their uppercase name, lowercase name, or any mixed-case variant as long as the $user->name field matches the same sequence of letters and numbers stored in $user->username. I don't see that as a problem, but some people might. It basically allows me to login as either 'Dries Buytaert' or 'dries buytaert' but not 'driesbuytaert' because there's a space in the original $user->name.
Simpletests
User.mod == 231 passes, 0 fails, 0 exceptions
(should we be testing for user name changes as well?)
I'm good with this patch. RTBC.
Comment #41
dahacouk CreditAttribution: dahacouk commentedIs this patch meant to enable me to have a (display name) $user->name = "Daniel Harris" and a (login name) $user->username = "dahacouk" - as I usually like to do?
Or is $user->username only just a lowercase representation of $user->name?
Comment #42
Damien Tournoud CreditAttribution: Damien Tournoud commented@dahacouk: this is out of this patch's scope, but this patch do open the way for contribution modules to implement what you describe.
Comment #43
Senpai CreditAttribution: Senpai commentedDaniel, This patch creates a duplicate username of every user in the system, and automatically lowercases it. That's all it does. It's an enabler, if you will. A stepping stone to further greatness. It has to go into core before anything like the Add a Display Name to core issue could even begin to be realized.
The great thing about this patch is that it speeds up Drupal by a magnitude or two when trying to locate or manage users within the system. Once this is committed, other patches can surface to handle what actually happens to these usernames. For now, we're all going to be extremely happy when this is committed.
Comment #44
catchRe-rolled following #237381: user_save parameter poorly named
Comment #45
JohnAlbinThe short description of a function's docblock should be on one line, so this updated patch only changes the docblock for user_update_7002(). No code changes from catch's patch, so credits still go to him. :-)
Re-ran ALL tests; they all still pass after the patch. “4940 passes, 0 fails, 0 exceptions”
Comment #46
Senpai CreditAttribution: Senpai commentedBumping this as an RRTBCTRBC (Really Really Reviewed and Tested by the Community and Thoroughly Ready to be Committed).
Commit commit commit!
Comment #47
maartenvg CreditAttribution: maartenvg commentedPatch no longer applies.
Comment #48
drawk CreditAttribution: drawk commentedReroll
Comment #49
catchThis probably ought to be re-rolled again for the new db layer changes. However, since it was at RTBC for a very long time with no comment, I'd quite like view from webchick or Dries on the general approach - which I don't expect to change after conversion.
Comment #50
BioALIEN CreditAttribution: BioALIEN commented@catch: It's a performance improvement, but I guess they want to see some more benchmarks before they commit this? What you've covered in #33 is great. Can we see some benchmarks on bigger (or even tiny) sites to better understand the performance gain?
+1 on testing this out on scratch.d.o
Comment #51
pwolanin CreditAttribution: pwolanin commentedThe patch adds a new unique key for 'username' but if we are doing all look ups using the new column is there any reason to retain the unique key on user.name?
Comment #52
David Strauss@pwolanin: There is no need to keep any key/index on {user}.name. We should never be putting conditions or sort criteria on {user}.name after this patch.
Comment #53
Crell CreditAttribution: Crell commentedNot true. I may want to list all users alphabetically by their public name as part of a member directory. At minimum having an index on that field will keep such queries from killing the server. (I've already said my piece on the uniqueness question.)
Comment #54
pwolanin CreditAttribution: pwolanin commented@Crell - I don't understand your comment. This patch maintains a 1-to-1 mapping of name to username.
Comment #55
pwolanin CreditAttribution: pwolanin commentedI re-rolled the patch to remove the index on name, and altered the code in other modules so all WHERE or ORDER BY should be using username rather than name.
Ran all tests - all tests pass with this patch (5431 passes, 0 fails, 0 exceptions).
Comment #56
maartenvg CreditAttribution: maartenvg commentedIt applies fine, and I get no errors. But I was wondering: although this is a database change, there is nothing for update.php to do, is that by design?
Comment #57
pwolanin CreditAttribution: pwolanin commentedIf you already applied this patch before, the update might not run.
Comment #58
maartenvg CreditAttribution: maartenvg commentedNope, that is not the case, I applied it to a fresh install.
- download fresh 7x tarball
- extract
- install
- apply patch
- run update.php
The last step says "No pending updates."
I've checked and the installation does not detect the presence of user_update_7002() (database reports last scheme to be 7001). I already came across the same strange behavior when creating a patch for #165786: DB Silliness: aggregator_item.description ought to be called body, but thought it might be my fault.
I believe this behavior was introduced when #286035: Remove update.php number dropdowns went in, not sure though.
EDIT: I've traced this back to http://drupal.org/cvs?commit=137218, therefore it is unrelated to this issue. You can ignore my comments about not functioning updates.
Comment #59
JohnAlbinYes, Maarten, #286035: Remove update.php number dropdowns is the culprit for the "no pending updates" issue. But it has just now been fixed (again).
@pwolanin: re: Crell's last comment. He meant he's already talked about the uniqueness several times. starting with #9 above in this issue and all over #102679: Add a Display Name field to core in addition to Username.
And, after a really, REALLY long discussion, we finally got everyone in #102679 to agree that name needs to remain unique.
In addition, removing the unique constraint on name isn't needed for a performance patch. Let's keep the scope focused so as not to repeat #102679.
Comment #60
catchUnnecessary indexes are a performance hit on insert, there's no reason to add that hit if we don't need it. Uniqueness is enforced in PHP anyway, at least without contrib intervention, so I'm marking this back to needs review.
Comment #61
David Strauss@Crell: I do agree that a unique index on the case-preserved name would enhance data integrity a little, but it's going to be inconsistent. The index is built off the collation that DB server is using, so the uniqueness on "name" will be case-insensitive on MySQL and case-sensitive on PostgreSQL. Trying to create equivalent contraints on "name" brings us back to the problem of making a portable, case-insensitive index, which is the problem this patch is trying to solve.
Comment #62
Crell CreditAttribution: Crell commentedOK, I think everyone misunderstood what I was saying in #53.
1) I've said my piece on having a UNIQUE constraint on the name column, so I'm not going to say more on it here.
2) There are use cases where you will want to ORDER BY name, so having an index (unique or otherwise) on that field will help performance in those cases. So -1 on not having an index on the name column at all. That field won't be modified often enough that the extra cost on INSERT will matter, IMO.
Comment #63
maartenvg CreditAttribution: maartenvg commentedI agree with Crell on at least 2). There are enough use cases where we would want to sort on the `name` column, rather than the `username` column. Especially with the future (by core or contributions) in mind where `name` won't be a 1-to-1 mapping to `username` anymore.
Comment #64
dahacouk CreditAttribution: dahacouk commented@JohnAlbin in #59: I am not trying to "repeat #102679" here - this is a performance patch after all.
But I must take exception to your statement "And, after a really, REALLY long discussion, we finally got everyone in #102679 to agree that name needs to remain unique."
Firstly, I did not agree to uniqueness of name so please either remove that statement for the record or qualify who exactly you mean by "everyone". Lets be accurate folks!
#102679 and this issue have different goals, remember. This issue does not solve the issue of #102679. Please don't muddy the waters.
After this patch has been accepted I'll be excited to find out how it can really help fulfil #102679.
Full steam ahead! ;-)
Cheers Daniel
Comment #65
JohnAlbinRe: catch: "Unnecessary indexes are a performance hit on insert, there's no reason to add that hit if we don't need it."
Agreed, but I don't feel this is an unnecessary index.
Since core won't be using the name column for any of its querying (just for displaying a capitalization-preserved version of username), it opens up the possibility that contrib can extend the data of that field. But actively deciding to remove the existing index means we have to fight to get it back in another issue, because (regardless if its unique or not) that column would need an index to query on it. And if the length of #102679 is any indication, getting it back will never, never happen.
And, yes, I know my argument is scope-creep. But I'm not asking for a new feature of the patch, I'm asking for you to not remove an existing index.
Also, how big is this hit for a db with a large number of users? seconds? miliseconds? http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html shows that, for a table with 320,000,000 records (everyone in the US), each index insertion would be log(320,000,000) = 8.5 slower than an empty table, but that page doesn't show actual numbers. When I tested inserting a single user into a 200,000 users table with and without a name index, both results were the same "Query OK, 1 row affected (0.01 sec)". That, of course, is nowhere near exhaustive testing, but extrapolating to a table with 320 million users means the performance hit would be somewhere less than 0.085 seconds. Unless my math is wrong, which it could be, leaving in the index seems a negligible performance "hit".
Comment #66
Crell CreditAttribution: Crell commentedJohnAlbin said what I was trying to say far better than I did. :-)
Let's also remember we're talking about a table that on a super-busy site like d.o has had ~300,000 inserts in 7 years. It's not like we're talking about hammering watchdog. If we lose a tenth of a second when creating a new user but avoid a full table scan for any contrib that wants to do something useful with that column, that's a huge win.
Really, removing the index is unnecessary premature optimization.
Comment #67
catchI'm fine with putting the index back in. This needs re-rolling for dbtng as well.
Comment #68
webchickHoly cow, guys. ;)
Ok. My take on this patch is:
I think that's everything that needs weighing in on here. Hope that helps.
Comment #69
pwolanin CreditAttribution: pwolanin commentedIn terms of an extra and unneeded index - I think the size in memory of the index is much more of a concern than insert speed - hence I'd assert that #52 is correct. No query should be putting a condition on name after this patch, only on username. Hence we don't need the index.
Comment #70
robertDouglass CreditAttribution: robertDouglass commentedI'm thrilled with the direction of this patch. Thorn in my side since 2006.
Comment #71
catchSince username and name with be identical in core, if contrib wants to try to do something funky with name, they could also add the index back via schema API. If I can find the motivation I'll re-roll with both options though if no-one beats me to it.
Comment #72
Crell CreditAttribution: Crell commentedadmin/user/user in core allows sorting on the username. Presumably that will use "name" not "login name" in the new setup, so that when contribs do fork off the display name it will make sense. (And let's be honest, that's not the main purpose of this patch but we all know that is going to happen.) Thus, a core page will be sorting by name, not login name. Thus, we do need the index on that column.
Comment #73
pwolanin CreditAttribution: pwolanin commented@Crell - see above - a contrib module that needs a different index can add it. Let's deal with what core needs.
Comment #74
catchHere's a re-roll to re-sync with HEAD and use named placeholders in all cases except the user_load query builder (which needs updating in a separate patch) and wildcards since I'm not sure how they work now. I'm getting ten fails with dblog.test and some in the user tests which I'll track down tomorrow. Looks like user.admin.inc was missed in pwolanin's patch to me, we should be able to use username for sorting there too really.
Comment #75
catchFixed the typo that was causing test failures and updated user.admin.inc to use the correct column for tablesort_sql. So here's a fully functional patch for review.
Comment #76
yoroy CreditAttribution: yoroy commentedCatch polled me about what made moste sense to me in the sorting: A-Z, then a-z,or a combined Aa - Zz.
I'd strongly suggest using Aa - Zz, otherwise you basically get 2 seperate lists.
Comment #77
catchSticking back to needs review since this is the behaviour in the current patch, I think yoroy cross-posted.
Comment #78
yoroy CreditAttribution: yoroy commentedyes, I didn't change status, must've been a crosspost.
Comment #79
chx CreditAttribution: chx commentedThis patch is a good step forward. While I am not entirely surely this is the best direction we can go, there are indeed a big number of advantages of this, namely the possibility of having a separate username (edit:), the small code footprint, the simplicity. By fulfilling this popular request, you nicely sidestepped the necessity to sync the username and LOWER(name) . For now, I think it will do. We can always do more later.
Comment #80
webchickSo. Since the upgrade path from D6 => D7 is horribly broken right now, I /know/ none of you all tested the upgrade path. :P
Can someone please bring over their users table from a production D5/D6 site with lots of real life examples including names like "Bill O'Malley" and "SüP3RguRL27! and run user_update_7002() and report results?
Comment #81
catchhmm, I tested the upgrade path from d7 - d7 yesterday, but not 6-7. I'm also pretty sure I tested the updates before the 6-7 path got broken (last time it was RTBC), but it seems I didn't post about it and I wouldn't want to swear 100% I actually did.
Comment #82
maartenvg CreditAttribution: maartenvg commentedI did a test on a user base (about 1200 users). Everything went smooth but during the update, as was the endresult in the database itself. Although I have to admit that I have nothing as extreme as "SüP3RguRL27!, most users have numbers, there are a lot with punctuations (_.:;'^<>) and some use the occasional é and è.
Perhaps someone using a non-Latin alphabet can confirm my positive results?
EDIT: It was a user table from D5, that I copied into the D7 database (and changed the schema a bit to match that of the old D7)
Comment #83
webchickOk, and those people with punctuation and és are able to log in okay, and are displayed properly in the user list at admin/user/user?
Comment #84
maartenvg CreditAttribution: maartenvg commentedI'm getting some conflicting results. I'll do some more rigorous testing.
Comment #85
Damien Tournoud CreditAttribution: Damien Tournoud commentedI upgraded the whole users table from drupalfr.org (5000+) users. All went smoothly. The conversion was done properly.
But users with special characters in their names could not login... That's a different issue thou: #310447: Add back SET NAMES="utf8".
With SET NAMES back, all works as expected.
Comment #86
xzilla CreditAttribution: xzilla commentedI'm curious if anyone has done testing to see if there is a performance regression by copying all of the usernames into a second column within the table, thereby adding additional footprint and/or i/o load to all of the other queries referencing the table? I ask because I have addressed the above problem (slow lower(name) queries) by adding a function index on lower(name) which solves this specific problem much better while not cuasing additional overhead where it isn't needed.
Comment #87
maartenvg CreditAttribution: maartenvg commentedOk. The problems I am having are not related to this issue, but with collation and other char set problems during importing .e.d.
But when I manually create a few users with strange names, and then apply the patch, I do not get problems with displaying or editing these users.
Comment #88
chx CreditAttribution: chx commentedSetting status based on the latest comments especially #85 .
Comment #89
lilou CreditAttribution: lilou commentedPatch #75 cannot apply against CVS/HEAD :
Back to CNW.
Comment #90
robertDouglass CreditAttribution: robertDouglass commentedIn system.install, why is status left out of this query?
Comment #91
robertDouglass CreditAttribution: robertDouglass commentedRerolled.
Comment #92
robertDouglass CreditAttribution: robertDouglass commented100% tests pass: 6988 passes, 0 fails, 0 exceptions
Comment #93
catchWe should be using db_insert() for that query too.
Comment #94
c960657 CreditAttribution: c960657 commentedPatch no longer applies to HEAD.
Comment #95
David StraussHere's a re-roll. I have not tested it.
Comment #96
David StraussComment #97
swentel CreditAttribution: swentel commentedWe don't use t() anymore in schema descriptions, so that has to go out.
Comment #98
David StraussRe-roll to remove t() from schema description.
Comment #99
David StraussLet's also fix the erroneous description of the old "name" column.
Comment #101
David StraussRe-roll with non-conflicting update.
Comment #103
David StraussFrom what I can tell, we seem to be experiencing a bug in the test framework. It doesn't make sense for user login to fail only under the picture submission test case. The exceptions result from trying to insert "u" as the line number into the test results table.
Comment #104
c960657 CreditAttribution: c960657 commentedThis is due to #328781: Fix horrible things in the error reporting. If you apply the patch from that issue, you will get better error messages when running the tests.
Comment #105
David StraussAll tests pass if we apply #328781, though we still get "Undefined index: name" exceptions in various places.
Comment #106
stewsnoozesubscribe
Comment #107
Dave ReidComment #108
asimmonds CreditAttribution: asimmonds commentedRerolled for current HEAD.
The "Undefined index: name" exceptions were from that user_save() can be called with only what needs to be updated in the $edit array. For example, in /admin/user when you block/unblock multiple users, $edit only has the status attribute set, and not the user or pass attributes.
Comment #110
asimmonds CreditAttribution: asimmonds commentedRerolled for current HEAD.
Comment #111
Dave Reid1) These in system_install() will fail on either PostgreSQL or SQLite because of a bug in PDO that placeholders cannot be reused:
2) Having the following code twice in user_save is dubious:
I don't see why we can't do something in the beginning of user_save() instead of having to add an extra query to every call to user_save(). Are the comments still valid even though we now require PHP 5.2 and have drupal_strtolower()? In fact, you have an added section earlier in user_save:
That should be simply:
3) On the new test case, class UserSearchTestCase needs a phpdoc block right above it, getInfo() does not, setUp() does not, and testUserSearch() does need a phpdoc block. See the testing code standards at http://drupal.org/node/325974.
4) In addition, there are some sections in testUserSearch() that look smooshed together.
5) A few coding issues:
Comment #112
Crell CreditAttribution: Crell commentedTo be clear, the non-reusability of placeholders in PDO is a design decision, not a bug. Some databases support reuse, some don't. Earlier versions of PDO tried to work around that limitation for those databases that didn't, and failed miserably. So in later PDO versions they ripped that out and declared placeholders non-reusable on any DB. (I spoke with Wez Furlong about that about 2 years ago. :-))
Comment #113
catch@davereid, we can't guarantee that the result of drupal_strtolower() and LOWER() will be exactly the same on all collations due to (potentially) different handling of characters in their lowercase versions. I haven't yet seen an example of where this would actually break at all, but it's one of the annoyances which makes both this and the equivalent taxonomy issue ugly. If anyone's got decent information on that subject - whether it's critical, whether we could just document the potential behaviour, that'd be handy.
Comment #114
catchMarked #83738: LOWER(name) queries perform badly; add column name_lower and index. as duplicate despite being the older issue to concentrate efforts in one area.
Comment #115
markus_petrux CreditAttribution: markus_petrux commentedMinor issue about the patch in comment #110: it mixes 'Username' and 'User name' in translatable strings. Which one is correct?
Comment #116
kbahey CreditAttribution: kbahey commentedSubscribing.
(Before dww fixes +1 subscribe)
Comment #117
chx CreditAttribution: chx commentedI think the db_query("UPDATE {users} SET username = LOWER(username) WHERE uid = :uid", array(':uid' => $edit['uid'])); piece needs to go in a helper function. It can't be avoided, agreed.
Comment #118
catchComment #119
blake_ CreditAttribution: blake_ commentedI don't think we should be storing the LOWER() value in the database at all, especially if there's a unique index on it. For any characters that aren't ASCII, each different database may handle the LOWER call differently. For example: MySQL & MSSQL use whatever collation is currently relevant to lower characters, I couldn't find out what Postgres does, SQLlite only lowers ASCII.
My view is that we're introducing here a new column, to provide a sort of "normalised name" that can be used to search for users / prevent a user creating a new account with "similar" characters. I think that the column should be named after what it actually does (something like "normalized_name", "standard_name" etc), and there should be a function (in PHP) that generates this value that then gets stored in the DB when we insert a user. To start with this function can just use PHP string lower function.
The advantages are:
I l 1 |
all got transformed into one value etc. This would reduce similar looking usernames on such sites.My $0.02 :) Discuss.
Comment #120
joshmillerJust read through this post and wanted to find out how it went from RTBC to needs work and not updated for some time. So here's where we are (for all of us):
Why should this go in? A stat in by catch #14 and a quote by Angie in #65
Issue summary since last RTBC #89
What should the next patch include to help it be RTBC?
Ok, so that's the summary, and depending on how people feel about #119, we could potentially find another RTBC prior to 9/1 by addressing the following from #111 and #112
That's what the next patch needs. Hopefully this helps. Also, lets keep in mind Angie's guidance in #65
Comment #121
joshmillerIgnore my last comment, as we have a much better solution provided by catch in IRC:
"I think we can just convert these queries to db_select(), use LIKE which is case insenstive and mapped to ILIKE on postgresql and that'll work without the extra column."
Josh
Comment #122
joshmillerchanging title
Comment #123
Crell CreditAttribution: Crell commentedThat particular feature of DBTNG was added with this issue in mind. I am not convinced that it is a good solution, mind you, and a lookup column would likely be faster and more flexible, but in theory it should work.
Comment #124
kbahey CreditAttribution: kbahey commentedI am actually in favor of a solution that does not add an extra column. An extra column can introduce inconsistencies, and complicates teh schema. Since only PostgreSQL is case sensitive and the ILIKE solves that, I like that solution since it will remove a slow query for MySQL.
And it seems that we already solved it this way when catch did the user_load_multiple() patch. It already uses db_select(). So the one that caused the most trouble is already out of the way!
It just needs cleanup by taking out the comment above.
The others that are done by the admin (e.g. searching for a user, ...etc.) are infrequent and not a cause for performance issues, but for the sake of code consistency and DX, it is a nice to have.
There is one in comment that is checked when a comment is posted. So only applicable for sites that get very frequent comments.
Comment #125
robertDouglass CreditAttribution: robertDouglass commentedSorry. Kbahey, you've stated this many times in the history of the issue:
But you've never really explained what these inconsistencies are? Or how the "complication" of the schema is so evil that it offsets the known performance gain?
I'm glad we've come to a possible solution with LIKE() that we think will make everyone happy. Maybe, like kbahey says, we're just a cleanup patch away from closing it. It seems that we did so without any benchmarking, so I hope that it actually does make Drupal faster.
Comment #126
robertDouglass CreditAttribution: robertDouglass commentedAnd joshmiller, thank you for the excellent summary!
Comment #127
kbahey CreditAttribution: kbahey commented@Robert
It is all about data redundancy vs normalization. If you have the same data in different columns, then there are risks that contradict good database design. For example, there is the risk that at some point a new piece of code will update one place and forget the other, leading to inconsistency. Also looking at different places gives different answers. There is also the added storage overhead. This is why deriving data that can be derived is better, for consistency, than storing multiple pieces.
Having triggers and stored procedure in the database help minimize this risk, since no external code need to be modified, but we are not using any of those, and they have their own challenges (database portability, opaqueness, ...).
I understand that breaking normalization rules is necessary in some cases. Done right, they are a great help (e.g. tracker2, prev_next, ...). They are sort of external to the schema and the data is totally replaceable any time.
Since we have better options here, this is a case where an extra column is bad.
Yes, we only need cleanup for this just for DX and consistency. The LOWER that is used now is only for admin pages, which are infrequently used.
Comment #128
pwolanin CreditAttribution: pwolanin commentedJust discussed this problem with Davide Strauss - our consensus was that we should add a case insensitive comparison operator to DBTNG and push this into the driver layer.
This would allow us to optimize this for MySQL (at least), would allow this to be used also for terms, etc, where we are doing text-based lookups and would be readily available for contrib.
postgres version would initially at least just use the LOWER() = LOWER(). So performance would be the same as it is currently in core.
Comment #129
pwolanin CreditAttribution: pwolanin commentedtalking with catch, he confirms what Khalid says above - essentially this is in core already - so this is really a cleanup issue to make sure it works right in SQLite3 and that we are removing LOWER() where we don't need it in core queries.
The bad part is that we are still generally doing %string% matches, which cannot be indexed.
Comment #130
Dave Cohen CreditAttribution: Dave Cohen commentedDo I understand correctly that with this patch
username
is used for authentication and searches, whilename
is used for display and third-party module can choose to makename
non-unique?If so, FINALLY!
Comment #131
pwolanin CreditAttribution: pwolanin commented@Dave Cohen - that has been proposed elsewhere, and there was no consensus. This issue is just about finding ways to improve query performance.
Comment #132
dahacouk CreditAttribution: dahacouk commentedDisplay Name Code Sprint Saturday September 5th...
Dave Cohen and I just had a really good face to face discussion with Angie (webchick). In summary, we are going to use the Saturday code sprint to make sure that Drupal 7 will facilitate us building contributed modules to implement display names.
This may mean altering some parts of core in a very small way. If you are passionate about cleaning up "theme_username" throughout core and making it easier for contributed modules to implement display names in a clean and consistent way, then join us at DrupalCon Paris or over the wires.
Saturday September 5th 2009 10:00 to 17:00 CEST
Location: DrupalCon Paris and online elsewhere
If you use IRC then #kendra is ready and waiting for you during those times.
I'll make sure developers are kept well feed and watered. So, if you are at DrupalCon and passionate about making it easy to implement display names then get in touch.
This is pretty much our last chance for making Drupal 7 display name friendly! Let's make it happen!
Cheers Daniel
mobile: +44 7853 627 355
username: dahacouk on skype, jabber, irc...
Kendra Initiative - http://www.kendra.org.uk
Comment #133
dahacouk CreditAttribution: dahacouk commentedComment #134
jromine CreditAttribution: jromine commentedsubscribe
Comment #135
dahacouk CreditAttribution: dahacouk commentedOK. So, the strategy over the next few weeks (during code sludge) is to get theme_username() used consistently in core. From there contributed modules will be able to implement display names with more ease.
So, if you have time over next few weeks please help Dave Cohen work on the following issue:
#192056: User's raw login name should not be output directly
Many thanks,
Daniel
Comment #136
catchNow we have LIKE/ILIKE support in the database layer, it's a bug not to be using it where appropriate.
Here's a start on a conversion, hoping someone else can pick up the other places in core where this needs doing.
Wherever you see LOWER(:placeholder) this can be converted to db_select() and LIKE. The only exception is LIKE LOWER(%:placeholder%) which has its own problems we can't deal with here.
Marking CNR to give test bot a go at it.
Comment #137
catchComment #138
catchDid a couple more.
TODO:
user_autocomplete()
profile_admin_settings_autocomplete()
profile_autocomplete()
Comment #139
catchSomehow managed not to inciude the previous patch, todos remain the same.
Also user_is_blocked() should cast to bool, but keeping this to minimum necessary changes for now.
Comment #140
meba CreditAttribution: meba commentedA small review:
- Applied succesfully
- Created a new account with some uppercase characters - works
- Logged in as this user - works
- Created a new article and commented on it - works
- Created a profile field with a category, including some uppercase characters. Created second with in the same category - works
- Blocked my new user, tried to log in as the blocked user - doesn't allow login - works
Comment #141
c960657 CreditAttribution: c960657 commentedThe wildcard characters
%
and_
need to be escaped. Otherwise confusing things happend if they appear in the user-supplied string. I suggest something like this:AFAICT SQL-92 says that there is no escape character if ESCAPE is omitted in the LIKE expression (
... WHERE foo LIKE 'abc%' ESCAPE '\'
), though MySQL and PostgreSQL uses\
. In order to make all databases behave the same, we can make the following change to DatabaseCondition::mapConditionOperator() in database/query.inc:This pattern should be applied elsewhere in core, e.g. in various autocomplete callbacks.
Alternatively, perhaps we can make a new pseudo-operator in mapConditionOperator() that handles escaping automatically:
It would make the code easier to read at a glance, though the reader may have trouble figuring out where that operator is defined. If the schema API some day supports case-sensitivity to be specified for each column individually, we cannot necessarily rely on LIKE (on MySQL the behaviour of LIKE depends on the column's collation), and then a special operator may be useful.
Comment #142
Crell CreditAttribution: Crell commentedThere was an issue open for a year to figure out what to do with LIKE vs. ILIKE and similar mapping and no one cared about it so we just left it at "fold everything to use case-insensitive LIKE like MySQL". As it's an API change to rewrite that mapping I don't think we're able to do that at this point.
SelectQuery is not cheap, and does incur an additional PHP-space cost due to all the stack calls. I'd want to see benchmarks before we start using it everywhere.
The % question is... ugh. I'm not sure what the correct way to handle that is, since we *do* want to allow both real wildcards and % in a literal. That's probably separate from this thread, though ,which is a disturbingly long performance thread.
Comment #143
catch@Crell I'm pretty sure there's benchmarks earlier in this issue or its predecessors, db_select() takes at most 1-2ms, LOWER() can take hundreds of milliseconds or seconds afaik on large datasets.
We're already relying on LIKE/ILIKE mapping elsewhere in core, this issue is just applying it consistently. That means we probably need a spin-off for the % and _ issue which is nasty, and perhaps CASE_INSENSITIVE_COMPARE - both of these would be API additions rather than changes as such, and fixing a critical performance issue, so doesn't seem too bad at all.
Comment #144
c960657 CreditAttribution: c960657 commentedI have spun off the
LIKE xxx ESCAPE '\\'
issue into #654662: No way to escape wildcard characters in LIKE queries.Comment #145
coltraneWhile #654662 is pending here's 139 rerolled with the following done
user_autocomplete()
profile_admin_settings_autocomplete()
profile_autocomplete()
I removed LOWER, which I think is OK in db_query_range() but let me know if not.
Comment #146
catchIn db_query_range() we still need LIKE() - so those would have to instead be converted to a db_select() with ->limit().
Comment #147
chx CreditAttribution: chx commentedwith ->range. limit is a --badly named-- pager method.
Comment #148
c960657 CreditAttribution: c960657 commented#654662: No way to escape wildcard characters in LIKE queries has now landed.
Comment #149
sunsubscribing
Comment #150
catchUpdated this after #654662: No way to escape wildcard characters in LIKE queries - haven't run all tests although the ones I did passed, see what the test bot says.
Our code style for db_select() doesn't work well for if ($query) { so I've left that on one line in the case that appeared since it's chained and everything else I tried looked worse.
Comment #151
chx CreditAttribution: chx commentedOK!
Comment #152
webchickI read through this extremely long issue again. Thanks for the summary mid-way through, Josh!
I too have witnessed LOWER() queries wreaking absolute havoc on large datasets (I think Robert and I are both thinking of the same project ;)). So I'd like to see this fixed.
The gist is that the first solution for this was an extra DB column to store the lowercase version of the username. This felt wrong for a variety of reasons: you'd need a term_lower, an aggregator_item_title_lower, etc. which is just not scalable to all of core/contrib.
This new approach takes the existing db_like() operator, which is already in core, and applies it consistently, instead of using LOWER() in queries. This means we can optimize MySQL/PostgreSQL to use the behaviour that makes the most sense.
Syntax-wise, it looks like:
->condition('name', db_like($name), 'LIKE')
Which is kind of a WTF, esp when compared with a straight-up LOWER() query. But I can't think of anything better, and as noted in the op, many, many, many issues have been started over the past 3.5 years to solve this problem. :P
So.... Committed to HEAD. This practice will need to be documented in the module upgrade guide.
Comment #153
catchThree years, woot! Added docs http://drupal.org/update/modules/6/7#nomorelower
Comment #154
c960657 CreditAttribution: c960657 commentedThe patch left a number of LOWER() calls untouched. Was that intentional?
This patch removes the remaining calls. It also introduces a 'NOT LIKE' operator.
#141 suggests an approach that at least is easier to read. This may even be extended to STARTS_WITH, CONTAINS and ENDS_WITH operators. But still somewhat WTF.
Comment #155
catchSomewhat intentional yeah.
If you use a LIKE query with a string like '%foo', that also doesn't use an index. I've not benchmarked using LOWER() vs. not using LOWER() in that scenario but as far as I know it won't make a lot of difference - would need to be tested on a very large dataset. If for some reason it does make a difference, or if we just wanted to remove LOWER() completely for consistency (or because we hate it so much) I'd be happy though.
Comment #156
webchickI think I'd prefer the consistency, personally. I read your description of when to use it and when not to use it like 5 times and I still don't get it. ;)
Comment #157
catchThis is no longer critical. However I read through the patch and the changes look great, also gets rid of some ->where() which is lovely.
Comment #158
c960657 CreditAttribution: c960657 commentedReroll.
Comment #159
Dries CreditAttribution: Dries commentedCommitted to CVS HEAD. Thanks.