Drupal Association members fund grants that make connections all over the world.
We currently have an index on the
name column of the users table, but it isn't doing us any good because the typical name-based user queries look for
SELECT * FROM users WHERE LOWER(name) = LOWER('%s'). This effectively blocks the database from using the index and forces it to iterate through all user rows, perform LOWER on the name, and do the string comparison. This is a significant performance bottleneck in the following situations:
- user login
- user_load(array('name' => 'foobar'));
- new user validation
- user search
- user autocomplete
The motivation for using LOWER is that usernames should be case insensitive. They should be case insensitive to prevent people from spoofing other users' identies (you're "Big_chief" on your site and someone registers as "Big_Chief" and starts acting like the site owner).
The solution I am proposing involves adding a column to the users table, name_lower, which contains an already lowercase version of name. The name_lower column has an index and the index actually gets used because the queries are now written like this:
SELECT * FROM users WHERE name_lower = LOWER('%s').
To test whether the speed increase would be worthwhile, I created a test database with 500K users. Here is the before and after:
Table size: 66.76 MB.
1237.3 0 eval select * from users where LOWER(name) = LOWER('cfcd208495d565ef66e7dff9f98764da') 1295.39 0 eval select * from users where LOWER(name) = LOWER('c4ca4238a0b923820dcc509a6f75849b') 1287.55 0 eval select * from users where LOWER(name) = LOWER('c81e728d9d4c2f636f067f89cc14862c') 1266.64 0 eval select * from users where LOWER(name) = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3') 1273.97 0 eval select * from users where LOWER(name) = LOWER('a87ff679a2f3e71d9181a67b7542122c')
Table size: 76.29 MB.
0.53 0 eval select * from users where name_lower = LOWER('cfcd208495d565ef66e7dff9f98764da') 0.38 0 eval select * from users where name_lower = LOWER('c4ca4238a0b923820dcc509a6f75849b') 0.39 0 eval select * from users where name_lower = LOWER('c81e728d9d4c2f636f067f89cc14862c') 0.36 0 eval select * from users where name_lower = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3') 0.37 0 eval select * from users where name_lower = LOWER('a87ff679a2f3e71d9181a67b7542122c')
So in a typical case we're comparing 1273.97 ms to 0.37 ms. It is a big difference.
- Where do UTF-8 and collations fit into all of this? I'm not in a position to speak to these issues with authority.
- Are there other solutions that don't introduce extra columns of essentially duplicated data?
- Will this work as well on Postgres?
I have rolled the patch against 4.7 but will provide patches against HEAD as requested. I'm hoping to get feedback and garner discussion before I spend more time rolling patches and benchmarking.
Oh, and you really have to try the user autocomplete field before and after this patch! Where it once felt sluggish, and often fell behind the user's typing, now it is brilliantly fast, even with 500K users!!!