Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Problem/Motivation
PostgreSQL is currently not behaving as MySQL or SQLite in terms of case-sensitivity for text. Please see #1518506: Normalize how case sensitivity is handled across database engines for a more detailed description of the problem.
Decision questions:
- How often is postgresql-contrib package available on shared web hosts?
- Is postgresql-contrib available on most Drupal web hosting providers that support Drupal 8?
Proposed resolution
Mimic the CITEXT-extention for PostgreSQL. The CITEXT extention mimics the MySQL and SQLite behavior in terms of case-sensitiveness. By mimicing the CITEXT-extention there is no need to have the end-user install the real CITEXT-extention.
Remaining tasks
Decide
Write patch.
User interface changes
None.
API changes
None.
Comment | File | Size | Author |
---|---|---|---|
#82 | 2464481-nr-bot.txt | 172 bytes | needs-review-queue-bot |
#71 | postgres-case-insensitivity_2464481-71.patch | 5.08 KB | kalpaitch |
#32 | test-results-for-31.txt | 21.41 KB | mradcliffe |
#31 | drupal-2464481-pgsql-case-sensitive-31.patch | 7.24 KB | mradcliffe |
#31 | interdiff-2464481-29-31.txt | 5.76 KB | mradcliffe |
Comments
Comment #1
andypostAccording http://www.postgresql.org/docs/9.3/static/citext.html there's still dependency on
LC_*
settings.Another trick that CITEXT introduced in 8.3 but as option to compile so probably we need to update requirements https://www.drupal.org/requirements pointing that we need 8.4 at least
Comment #2
mradcliffeAdded some questions.
It's easy to install postgresql-contrib package on CentOS and Debian dedicated and managed servers, but what about Drupal hosting companies from the larger companies to the smaller companies (postgresql hosts)?
Comment #3
bzrudi71 CreditAttribution: bzrudi71 commented@andypost: I'm not sure if we will ever have docker testing for PG 8.4 and IMHO we should require PG 9.1 (released 2011). I think to finish this task we need to:
@mradcliffe: Yes, I already thought about this. But I don't think that there are many shared Servers around supporting PG at all?
Comment #4
mradcliffeI know one smaller web host, and I'm not sure if Acquia's clients that use postgres are using DevCloud.
I'm not sure if everything should be case insensitive.
The issue with LC_* may be relevant for a Drupal 8 Multilingual install where a site builder has enabled multiple languages but the database collation is the same. Does it matter for UTF-8? Gabor may have an answer for that.
Comment #5
bzrudi71 CreditAttribution: bzrudi71 commented@mradcliffe: I'm with you regarding the requirement of case insensitiveness. But I think that will lead to DrupalWTF in possible future tests if only PG is doing it the other way? I'm totally open to close this issue and make use of #2459745: Allow the database driver to skip test classes ;-)
The LC_* should not matter for UTF-8 IMO and we force UTF-8 on install and connection.
[EDIT] I'm currently doing a full test run with citext and will post it here later to help us make a decision if it's worth or not :-) Interesting BTW, it seems that some of the migrate tests have pass now with enabled citext!?!?
Comment #6
chx CreditAttribution: chx at MongoDB commentedDebian Jessie drops this month https://lists.debian.org/debian-devel-announce/2015/03/msg00016.html making wheezy oldstable which contains 9.1.15. Ubuntu 10.04 also ends in a month, Ubuntu 12.04 defaults to 9.1.15 as well. http://packages.ubuntu.com/precise/postgresql We can't deal with the 10 year cycle of RHEL/CentOS, RHEL5 EOL is extended to March 31, 2017 -- released in 2007. PostgreSQL has a http://yum.postgresql.org/repopackages.php repo for them.
http://www.postgresql.org/docs/9.1/static/release-9-1-2.html 9.1.2 was released more than three years ago and given the citext related upgrade problem that's the release where I would put the requirement. Also, http://www.postgresql.org/support/versioning/ 9.0 is EOL this September so putting a requirement on it makes little sense.
tl;dr: raise to 9.1.2 and be done.
Comment #7
jaredsmith CreditAttribution: jaredsmith commentedIt seems to me like this is approaching the problem backward -- trying to fix something with a workaround in PostgreSQL to "fix" something that is arguably a problem (or simple difference of opinion) with regards to text comparisons in MySQL.
Here are my reasons for leaning away from using the CITEST module:
Comment #8
bzrudi71 CreditAttribution: bzrudi71 commentedTest run finally completed and attached. Here is the list of new failing tests for the impatient:
@jaredsmith: Thanks for the feedback! Now that we have a first impression of what will fail I tend more and more to a nogo for CITEX ;-)
Comment #9
chx CreditAttribution: chx at MongoDB commented> Using CITEXT assumes that Drupal is *always* going to do case-insensitive comparisons. Is that the case? Will there every be a time when we *do* want to do case-sensitive comparisons?
We have a field and a schema setting for case sensitive collations. Please refer to #2068655: Entity fields do not support case sensitive queries
Comment #10
stefan.r CreditAttribution: stefan.r commentedWould using UPPER()/LOWER() on comparisons with binary columns be a valid workaround?
Comment #11
stefan.r CreditAttribution: stefan.r commentedHmm it looks like that is what the CITEXT extension does anyway. Only problem there seems to be that not all upper case letters have a matching lower case and vice versa so results won't 100% match MySQL.
So just to clarify, I was suggesting we update the pgsql query classes so that case insenstive columns turn into LOWER('String') = LOWER('STRING') .. or if we want to be brave we parse the final query and do the LOWER() wrapping there. We'd also have to update the Schema class and create indexes on LOWER(case_insensitive_column).
Comment #12
chx CreditAttribution: chx at MongoDB commentedYou'd need to
ORDER BY
onLOWER()
as well. SQL string parsing from PHP is a questionable choice I am afraid although of course using prepared statements as we have it helps tremendously because you do not need to account for user data in there.Also, you'll have some problems knowing on the SQL level which columns are case sensitive.
Edit: I looked up the citext source, it's very short and understandable and yes, it does lowercase everything before comparison and doesn't do anything else.
Comment #13
stefan.r CreditAttribution: stefan.r commented@chx what do you mean by that, can you explain why this is a problem? Are we talking about unique constraints etc? Aren't most of the issues covered by sending the right query? Thanks to the schema we know in PHP which columns are case sensitive...
@jaredsmith:
Not necessarily, it looks like with CITEXT you can define case insensitivity on a column level. Only columns which have binary=TRUE in the schema need have case sensitivity.
Not only that, it could also be a showstopper on large enterprise deployments where there are rigid constraints in terms of customization. Some of which still use 8.4 by the way, although with it being EOL'd that may change by the time D8 comes out .
Comment #14
chx CreditAttribution: chx at MongoDB commentedThe DBTNG driver will face an almost insurmountable challenge of figuring out which database columns are case sensitive and which are not. This information is available to, say, an entity field storage driver but not the DB driver.
What you can do instead of parsing SQL string is to override various storage drivers in D8 and set up the queries as you need them. This will be quite an endeavour -- there are a lot.
Comment #15
stefan.r CreditAttribution: stefan.r commentedchx yes that was what I suggested earlier, the drawback is we would have no support for case insensitivity in raw db_query statements.
As to figuring out case insensitive fields, couldn't we cache this information and pass it to the DB driver, it's only a handful of table.column combinations which won't really change that often.
Comment #16
chx CreditAttribution: chx at MongoDB commentedWell, the Schema class on create and alter time can certainly capture those columns and store somewhere.
Comment #17
stefan.r CreditAttribution: stefan.r commentedDo we know how well the LOWER() trick works though? Does it work well with the full UTF8 charset?
I looked into this a bit, and it's a long shot as I don't know how this will work with UTF8 data (if at all), but the cs_CZ.ISO8859-2 collation is case insensitive, and PostgreSQL 9.1 allows "per column" collations: http://michael.otacoo.com/postgresql-2/collation-in-postgresql-9-1/
/edit: but seems pretty limited http://collation-charts.org/fbsd54/fbsd54.cs_CZ.ISO8859-2.html
Comment #18
stefan.r CreditAttribution: stefan.r commentedSo as a first step, would the following make sense as far as the pgsql driver is concerned?
Schema::createFieldSql()
we store whether avarchar/character/text
column isbinary
or not somewhere (maybe in a separatedrupal_pgsql_case_sensitive_fields
table in the database?), so this information gets saved on field creation/alteration.Connection
object, we load this information from the database and put it into a property$caseSensitiveFields = array('tablename.fieldname', 'tablename2.fieldname2');
LOWER()
on indexes,orderBy
, on field names in conditions and inSelect::__toString()
Comment #19
stefan.r CreditAttribution: stefan.r commentedJust to give an update, I discussed this with @amateescu and he pointed me to
CREATE FUNCTION
, however I haven't been able to find any functionality that is smart enough to alter incoming queries the way we need to. The only thing I can think of is adding another "copy" column for every case insensitive field with a trigger that lowers the case for the "copied" value, which would essentially double the storage usage of the database. Not really a workable solution :)He also pointed to the table name parsing in the Select class, which would make
db_select
etc easy to process but it seems we can't use that for regulardb_query()
statements. These would be more challenging as without CITEXT we'd inevitably need to do some sort of query parsing. However from an initial look at the db_query statements used in core and contrib, covering 100% of queries in core and ~99.9% of queries in the most popular contrib module looks doable and shouldn't pose a huge performance penalty as 99% of db_query() statements will be simple anyway.We could do smarter parsing for the other 0.9% of queries, and ask people to rewrite the other 0.1% that are too complicated to process in a simple parser (that wouldn't try to cover any possible MySQL query), offering a pgsql_citext driver as a module in contrib for those who need 100% case sensitivity accuracy as opposed to >99.9%.
I will have a look at this over the next week. If we don't manage to implement a workable
LOWER()
solution, I think we'll have to require the postgresql citext module...Comment #20
chx CreditAttribution: chx at MongoDB commentedMy big concern here is security. Are there places where a case sensitive query versus a case insensitive can lead to information disclosure or worse? I do not think so... but I haven't went through the possible cases really.
Comment #21
stefan.r CreditAttribution: stefan.r commentedWell the opposite might be true (doing case insensitive checks where we really want case sensitive checks), but it's hard to imagine accidental case sensitivity to pose a problem, at least in core. We usually don't really care about case.
In any case this would need a security audit, I guess wrapping placeholders in LOWER() would be possible to implement safely but I'd be very wary of doing this in queries with user input where the placeholders have already been applied.
Comment #22
mradcliffePassword columns shouldn't be stored as case-insensitive text (citext), and those columns are in schema as type "text" (varchar), right?
Users table is still generated by hook_schema()? Or does it use TypedData Data Type plugins? PasswordItem exists if we can use the latter.
Comment #23
stefan.r CreditAttribution: stefan.r commentedThe password column is actually case insensitive right now ;)
Technically it shouldn't be but it's not like we store the actual password, just a hash
Created #2475539: Make password entity fields case sensitive by default anyway
Comment #24
xjmAlso see: #2477413: Increase minimum version requirement for Postgres to 9.1.2
Comment #25
bzrudi71 CreditAttribution: bzrudi71 commentedIf we could cover 90% of all cases by an own implementation that would be lovely. I just think it will require more work as we expect right now as the citext module does a bit more magic than just a simple LOWER() ;-) On the other hand citext doesn't seem to work out-of-the box and requires a lot of work also. I'm curious of the results if @stefan.r starts the challenge :-)
Comment #26
mradcliffeI'm going to tackle/discuss this tomorrow with people, but need to organize the comments from the past month.
Comment #27
mradcliffeThis is a citext patch:
1. Requires super user to create extension citext.
2. Fixes TermTest
3. Does not fix GlossaryTest because Views does a SUBSTRING which compares string to citext = fail.
4. I think number of additional fails may be low, but need to run a test.
Tomorrow I think I will not do this and look at a suggestion from @jaredsmith regarding trying to change Conditions into Having statements. I tried to add LOWER in Condition, but ran into too many regular expression hacks in the driver. I could also have been low on blood sugar because lunch was really really really late.
Comment #28
mradcliffeI couldn't quickly get the drupalci testbot working with citext because the postgres user must create the extension in a specific schema, but the web container does not have access to do this and the database container does not know the database.
Comment #29
mradcliffeHere is a different approach that fixes the 2 term tests, but breaks taxonomy create/delete and possibly lots of other tests.
Comment #30
mradcliffeOkay, this gives me an idea of the test fails that the patch above created.
It looks like langcode in condition needs to be handled and that entity delete is not working in tests (I tried taxonomy term delete page manually and it worked). However I didn't try multiple terms.
Comment #31
mradcliffeThis patch fixes some bad query construction in the previous patch, which probably caused many of the exceptions in the test run for #30.
Edit: There are still many fails.
Comment #32
mradcliffeOkay, yeah, that's much better... 1 exception, 1 fatal error, and 3 fails.
Comment #33
mradcliffeVocabularyPermissionsTest passes for me on my local development environment.
Comment #34
chx CreditAttribution: chx commentedSo then we would only support entity queries ?
Comment #35
stefan.r CreditAttribution: stefan.r commented@mradcliffe wouldn't we have to do something similar for db_query() etc?
Comment #36
mradcliffeI wanted to focus on the queries inside the failing tests, but yes, it's possible that something similar needs to be done for the query builder, but I also think it will be very messy per @chx's comment in #14 after I tried to do something similar for entity queries and modify escapeField to handle it.
Entity Query and Entity Manager will be what developers use to fetch entities, but it does use more memory and is not as efficient as fetching one column. Maybe also Views queries?
Other things that needs to happen:
- @jaredsmith mentioned indexes need to be created as case insensitive, but this should be done for any database system.
Comment #37
david_garcia CreditAttribution: david_garcia commentedThe locales_source table is using specific MySQL storage type "BLOB" but the locale() translation algorithm is designed to work with case sensitive data.
The result is that locale() internal caching is half broken on non MySQL engines under some circumstances.
Maybe I'm wrong, but this means a big bunch of queries on every t() call because some strings will not get cached.
#2490976: Locale caching algorithm is broken on Non MySQL/PostgreSQL databases
Comment #38
stefan.r CreditAttribution: stefan.r commented@david_garcia BLOBs are binary data and as such they should always be case sensitive anyway, right?
Comment #39
david_garcia CreditAttribution: david_garcia commentedThat is exactly the issue, take a look at this field definition:
This field is going to be case sensitive on MySQL but case insensitive on any other database engine.
For this field definition to be portable it should be something like this:
Or is the default behaviour of a text field supposed to be 'BINARY=TRUE' ? All the usages of 'binary' I have found do a 'binary=true' which hints as binary=FALSE being the default behaviour.
Or maybe this has changed from D7 to D8 and I'm totally confused :(
Comment #40
david_garcia CreditAttribution: david_garcia commentedComment #41
erik.erskine CreditAttribution: erik.erskine as a volunteer commentedJust tried the citext patch in #27
Getting an error during installation - should this be
:name
instead of%name
?Comment #42
bzrudi71 CreditAttribution: bzrudi71 commentedComment #43
daffie CreditAttribution: daffie commentedFor the testbot.
Comment #46
daffie CreditAttribution: daffie commentedComment #47
daffie CreditAttribution: daffie commentedThe patch from comment #31 fixes the tests from #2443679: PostgreSQL: Fix taxonomy\Tests\TermTest.
I think that for should create a followup issue for fixing the case insensitivity for db_query().
The patch looks good to me. It gets a RTBC from me.
Comment #48
bzrudi71 CreditAttribution: bzrudi71 commentedSetting back for needs review for now ;) I think we need at least a full testbot run to see what else will fail. Also the problem of lowercased indexes seems not addressed within this patch. I will do a full bot run within the next hours and report...
Comment #49
bzrudi71 CreditAttribution: bzrudi71 commentedJust an idea while bot is running. If we decide to go for PG 9.1.x as minimum requirement there is a new feature that could help us here. As of 9.1 there is support for column based collations! So I wonder if we could implement something like amateescu did for SQLite #2454733: Add a user-space case-insensitive collation to the SQLite driver and can get rid of all these workarounds here by implementing an all no case collation?
Comment #50
stefan.r CreditAttribution: stefan.r commentedI mentioned this in #17 as well but PostgreSQL uses the collations from the system -- and there aren't any case insensitive collations for UTF8 that I'm aware of.
Comment #51
bzrudi71 CreditAttribution: bzrudi71 commentedArgh, no ;) Thanks @stefan.r!
I read a bit and it seems my idea is a no-go. Hopefully we will get good results from bot to move forward with the current approach :)
Comment #52
bzrudi71 CreditAttribution: bzrudi71 commentedNice, bot run just completed and here is what fails with patch attached:
Great, I expected more fails :)
Comment #53
daffie CreditAttribution: daffie commentedAs bzrudi71 in the previous comment stated: the patch from #31 does not pass all the tests with a PostgreSQL backend. So back to need work!
Comment #54
bzrudi71 CreditAttribution: bzrudi71 commentedWe have a new patch based on this work to just fix the TermTest part over in #2443679: PostgreSQL: Fix taxonomy\Tests\TermTest. As this patch causes new fails and exceptions which isn't a good idea at this point let's postpone this one for now and do a re-roll as soon as #2443679: PostgreSQL: Fix taxonomy\Tests\TermTest is in.
Comment #55
daffie CreditAttribution: daffie commented#2443679: PostgreSQL: Fix taxonomy\Tests\TermTest is in.
Comment #56
stefan.r CreditAttribution: stefan.r commentedseems chx is still working on that issue so we may want to hold off on the reroll?
Comment #57
stefan.r CreditAttribution: stefan.r commentedattempt at a reroll anyway, I assume this will need further work once @chx finishes working on the other issue
Comment #58
stefan.r CreditAttribution: stefan.r commentedcross-posting from @Damien Tournoud in #2477413: Increase minimum version requirement for Postgres to 9.1.2:
...which would be quite an invasive change but may actually be a good idea...
Comment #59
jaredsmith CreditAttribution: jaredsmith as a volunteer commentedI have to agree with @stefan.r and @damien Tournoud.
I don't think using citext is the right answer either... I'd rather we switch to handling case-sensitive, collation-sensitive everywhere as well.
It may be an invasive change, but I think it's worthwhile.
Comment #64
andypostBtw CITEXT as capability could be exposed at driver level and drivers should take collation under control
Comment #67
andypostPg12 compatibility fixed
Comment #68
andypostIt used to select
pg_trgm
in #2988018: [PP-1] Performance issues with path alias generated queries on PostgreSQLProbably this one is duplicate
Comment #69
daffie CreditAttribution: daffie commentedI do not think this is a duplicate, but it is related. Maybe we can fix this issue in Drupal 9 with the requirement of the pg_trgm extension.
Comment #70
johnwebdev CreditAttribution: johnwebdev commentedYeah, path_alias is definitely slow, but there are multiple queries being affected by this. This problem is a huge bottleneck for performance, and I think the numbers was around 100 % faster on our site after doing a test migration to MySQL.
Comment #71
kalpaitch CreditAttribution: kalpaitch as a volunteer and commentedRe-rolling #2464481-57: PostgreSQL: deal with case insensitivity compatible with latest 8.9.x. We're still using this fix.
Comment #73
luisnicg CreditAttribution: luisnicg as a volunteer commentedThe patch #71 has some conflicts with Pathauto, after applying this patch I got this error:
The alias [alias] is already in use in this language.
The query change from this:
SELECT base_table.revision_id AS revision_id, base_table.id AS id
FROM
{path_alias} base_table
INNER JOIN {path_alias} path_alias ON path_alias.id = base_table.id
WHERE (path_alias.alias LIKE :db_condition_placeholder_0 ESCAPE '\\') AND (path_alias.langcode = :db_condition_placeholder_1) AND (path_alias.path NOT LIKE :db_condition_placeholder_2 ESCAPE '\\')
LIMIT 1 OFFSET 0
to this:
SELECT base_table.revision_id AS revision_id, base_table.id AS id
FROM
{path_alias} base_table
INNER JOIN {path_alias} path_alias ON path_alias.id = base_table.id
WHERE ((LOWER(path_alias.alias) = LOWER(:value))) AND (path_alias.langcode = :db_condition_placeholder_0) AND ((LOWER(path_alias.path) <> LOWER(:value)))
LIMIT 1 OFFSET 0
Which got some results on this file core/lib/Drupal/Core/Path/Plugin/Validation/Constraint/UniquePathAliasConstraintValidator.php on this line:
if ($result = $query->range(0, 1)->execute()) { }
is anyone else experiencing this problem?
Comment #76
kalpaitch CreditAttribution: kalpaitch as a volunteer and commentedYes, same issues with my patch in #71, doesn't work, as you indicate in #73
Comment #78
VladimirAusAs per #73 and #76.
Comment #79
VladimirAusSee patch here that should solve condition issue: https://www.drupal.org/project/drupal/issues/2490294#comment-14546671
Comment #80
daffie CreditAttribution: daffie commentedWe are not going to use the
LOWER
operator on PostgreSQL. The decision has been made to use GIST indexes instead. For that is the pg_trgm extension required for Drupal 10. See: https://www.drupal.org/docs/system-requirements/database-server-requirem....The biggest problem is with performance problems with the path alias queries. The patch from that issue also add the code to create those indexes. See: #2988018-71: [PP-1] Performance issues with path alias generated queries on PostgreSQL . When that has landed, we can use that solution in other places, like with #2490294: User email should not be case sensitive.
Comment #82
needs-review-queue-bot CreditAttribution: needs-review-queue-bot as a volunteer commentedThe Needs Review Queue Bot tested this issue. It either no longer applies to Drupal core, or fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".
Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #83
Chi CreditAttribution: Chi commentedAdded a simple benchmark for the current implementation of case-insensitive conditions in Postgres.
#3361618: Postgres forcing cases case-insensitivity causes serious performance degradation.