Problem/Motivation
The PostgreSQL implementation of the Upsert query added in #2542776: Add an Upsert class can be improved by using common table expressions, or even the native UPSERT syntax (INSERT ... ON CONFLICT) introduced in PostgreSQL 9.5.
The implementation of upsert as used by the cache DatabaseBackend seems to be too slow: it causes Docker to consider the container unresponsive, which times out the test bot.
Proposed resolution
Research into the issue
- Implemented the previous cache DatabaseBackend approach as a last resort.
- Implemented a Common Table Expression (CTE), of which there are various Upsert implementations. This caused type-checking issues because the implicit temporary table did not have the correct data types.
- Tried adding ON COMMIT DROP and committing after an Upsert so that PostgreSQL cleans up the tables itself. This had no effect on the out-of-memory issues.
- Looked into table locking and explicit unlocking to see if that affected performance.
- In #12, Damien suggested an example from the docs, but the implementation seemed more complicated.
Proposed resolution
- Use an implementation similar to the old cache DatabaseBackend in the Upsert class, but prevent the deadlock issue from #2336627: Deadlock on cache_config (DatabaseBackend::setMultiple()).
- Implement native UPSERT query class for PostgreSQL >= 9.5.
The latest patch is in comment #19, not #20.
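For orientation, the native syntax mentioned above can be sketched as follows. The table and column names are illustrative assumptions based on the cache tables discussed in the comments, not the actual core implementation:

```sql
-- PostgreSQL >= 9.5 native upsert (INSERT ... ON CONFLICT).
-- "cache_bootstrap" and its columns are illustrative, not core's schema.
INSERT INTO cache_bootstrap (cid, data, expire)
VALUES (:cid, :data, :expire)
ON CONFLICT (cid) DO UPDATE
SET data = EXCLUDED.data,
    expire = EXCLUDED.expire;
```

Unlike the emulated approaches below, this statement is atomic, so it needs no explicit locking and cannot raise a unique key violation.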
Remaining tasks
- Discuss the alternatives and try to implement them.
- Write patch.
- Review patch.
- Run performance benchmark. Results seem acceptable.
- Figure out if the patch affected testbot test time length. It did not.
- Create follow-up for investigating the commit that slowed the testbot.
User interface changes
Nope.
API changes
Nope.
Data model changes
Nope.
Beta phase evaluation
Issue category | Bug report because this breaks testing infrastructure.
---|---
Issue priority | Critical because this blocks drupalci, which is on #2485119: [meta] The Drupal 8.0.0-rc1 Release Checklist
Prioritized changes | The issue changes the implementation of Upsert, which was a critical performance bug.
Comment | File | Size | Author
---|---|---|---
#19 | interdiff.txt | 5.23 KB | amateescu
#19 | 2549323-19.patch | 9.56 KB | amateescu
#9 | 2549323-9.patch | 8.24 KB | amateescu
#8 | 2549323-8.patch | 1.31 KB | amateescu
#6 | drupal-2549323-commit-upsert-transaction-6.patch | 644 bytes | mradcliffe
Comments
Comment #2
Crell CreditAttribution: Crell at Palantir.net commented

It would probably be reasonably straightforward to have two Postgres Upsert classes, and override the upsert() method on the Connection to load one or the other depending on the version. Then we can leave the current safe version in place, and have a fancier, faster version for 9.4+ users.
Comment #3
mradcliffe

@isntall and I are fairly certain that the current Upsert, as implemented by #2336627: Deadlock on cache_config (DatabaseBackend::setMultiple()), is causing testbot havoc. (@amateescu has been testing revert patches as well.)
I'm going to set this to active, and bump it up to Major at least.
Comment #4
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

I think we have at least two options to improve the current Upsert implementation for pgsql < 9.4:
1) Use writable common table expressions as described here: http://stackoverflow.com/a/8702291/14731
The problem with this solution is that it does not prevent a unique key violation. That is probably fine for cache tables because writing some data with the same key means you're most likely writing the same data. But I'm worried that this might not be the case for other tables. For example, we write in 'sessions' with a Merge query right now, but it would be a good candidate for converting to an Upsert query.
2) Implement the previous cache database backend solution: do a delete on the values of the 'key' field and then a multiple-insert.
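For reference, the writable CTE approach from option 1 might look like the following sketch; the table and columns are hypothetical, and as noted above it is not safe against concurrent inserts:

```sql
-- Writable CTE upsert: update existing rows, then insert the rest.
-- A unique key violation can still occur if another transaction
-- inserts the same key between the UPDATE and the INSERT.
WITH new_values (cid, data) AS (
  VALUES (:cid, :data)
),
updated AS (
  UPDATE cache_config c
  SET data = nv.data
  FROM new_values nv
  WHERE c.cid = nv.cid
  RETURNING c.cid
)
INSERT INTO cache_config (cid, data)
SELECT cid, data
FROM new_values
WHERE cid NOT IN (SELECT cid FROM updated);
```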
@mradcliffe, do you see any other options or do you have any feedback on the proposals above?
Comment #5
mradcliffe

What I think is happening on testbot, and it's happening on my local testbot, is that PostgreSQL is running out of shared memory creating the temporary table in the upsert class.
This might be because cache tables are large and the temporary table persists until the end of the transaction. Maybe we could try a patch that commits the transaction after the upsert is done. I don't know if any other queries are run in that connection. I don't think we would want to commit the temporary table before doing any of the other operations, right?
I'm not sure if this is explicitly causing the test to exceed Docker's timeout.
Here is a sample exception from my latest test run, which was at concurrency 1. I've also attached an sqlite database with the simpletest results. For the curious:
sqlite3 2015-08-18-test.sqlite "SELECT * FROM simpletest WHERE status = 'exception'" | less
Comment #6
mradcliffe

I'm going to assume that this patch is not a good approach, but I want to send something to the test bot before I go to bed.
Comment #7
mradcliffe

Comment #8
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

According to http://stackoverflow.com/a/4961809/1499564 , locks for temp tables are never released, so let's try to see if we can re-use them.
Comment #9
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

Switching course, I'm now trying to implement the query described in http://stackoverflow.com/a/8702291/1499564
The code is pretty much done but I'm getting this exception:
Any help welcomed! :)
Comment #10
mradcliffe

new_values is populated from a select? Maybe the data type that is returned from the select is not the same as the base table's type?
The use of CTE is still going to create a temporary table, right?
I was also looking at temporary table creation.
* If going back to the original upsert implementation, queryTemporary() could be changed to inherit constraints, probably best implemented using LIKE.
* No indexes are created, so with lots of rows, a select on the temporary table may be much slower.
* Should the driver always drop the temporary table on commit? queryTemporary() would need to be modified to append ON COMMIT DROP. I think this would also prevent the need to manually vacuum afterward.
* Currently the drupalci postgresql container has temp_buffers set to the default 8MB. This could be increased for testbot performance. Maybe 1GB?
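The temporary-table ideas above could be combined roughly as follows; this is a sketch, and the table name is illustrative:

```sql
-- Temporary table that copies the base table's structure (LIKE) and is
-- dropped automatically when the transaction commits (ON COMMIT DROP),
-- so no manual cleanup or vacuum of the temp table is needed.
CREATE TEMPORARY TABLE cache_config_tmp
  (LIKE cache_config INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES)
  ON COMMIT DROP;
```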
Comment #11
mradcliffeWhen using Common Table Expressions, the type can be defined for the first row of values, but PostgreSQL will infer the type for the rest of the values.
That's a round trip to table information again. :(
Maybe it can be stored as protected on the Upsert object?
Comment #12
Damien Tournoud CreditAttribution: Damien Tournoud commented

It is unclear to me what the concurrency requirements of our UPSERT should be.
One approach would be to do an INSERT followed by an UPDATE of the rows that couldn't be inserted.

Supposing an UPSERT in table "cache_render", with key "cid":
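A sketch of that approach (with illustrative column names, not Damien's exact code) might be:

```sql
-- 1) Insert only the rows whose key does not exist yet.
INSERT INTO cache_render (cid, data)
SELECT :cid, :data
WHERE NOT EXISTS (SELECT 1 FROM cache_render WHERE cid = :cid);

-- 2) Update the rows that already existed. (As written, this also
-- redundantly rewrites a row inserted in step 1; a real implementation
-- would track which keys were inserted and update only the rest.)
UPDATE cache_render
SET data = :data
WHERE cid = :cid;
```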
Comment #13
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

We don't want to lose any insert or update, if that's what you're asking :)
What you describe with the cache_render code is basically the example implementation from the PostgreSQL docs: http://www.postgresql.org/docs/current/static/plpgsql-control-structures... . This is also an alternative to the wCTE implementation that I tried in #9, but I think it's a bit more complicated to do because we would have to use variable arguments for the stored function.
Comment #14
Damien Tournoud CreditAttribution: Damien Tournoud commented

@amateescu: I don't think you need a stored function at all here. All this logic can be safely handled in the application, and if we batch the UPDATE part into a multi-statement query, the latency is going to be just fine.
Comment #15
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

Trying out my suggestion from #4.2, which is basically implementing the upsert as MySQL's REPLACE INTO rather than INSERT ... ON DUPLICATE KEY UPDATE.
Running a group of tests locally, this almost brings the performance back to the level from before #2336627: Deadlock on cache_config (DatabaseBackend::setMultiple()). I suspect the slight regression is because we're now explicitly locking the table, but I'm not sure. Let's see how the testbot reacts.
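A sketch of this delete-then-insert approach with an explicit lock, assuming a hypothetical cache table keyed on "cid" (the lock mode matches the one discussed in #17):

```sql
-- REPLACE INTO-style upsert: lock, delete existing keys, re-insert.
-- SHARE ROW EXCLUSIVE MODE blocks concurrent writers (including other
-- upserts on the same table) while still allowing reads.
BEGIN;
LOCK TABLE cache_config IN SHARE ROW EXCLUSIVE MODE;
DELETE FROM cache_config WHERE cid IN (:cid1, :cid2);
INSERT INTO cache_config (cid, data)
VALUES (:cid1, :data1), (:cid2, :data2);
COMMIT;
```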
Comment #16
mradcliffe

Ran ab -n 10000 -c 10 http://drupal8.dev/ and drush cr (repeatedly) on CentOS 7, Apache 2.4.6, PostgreSQL 9.2.13 on a bare Drupal installation.
It's slow, but it didn't run into any exceptions.
Results:
Comment #17
mradcliffeSHARE ROW EXCLUSIVE MODE looks like it would mean concurrency issues, but it does lock the table for exclusive data changes.
I have done this in Drupal 7.
If we think about the tables that currently use upsert, it's cache tables (cache_bootstrap and cache_config). The config cache could get quite large, which probably accounts for the slow cache-rebuild in the benchmark tests.
However the request isn't timing out so that seems okay.
These two large writes will probably incur an AUTO VACUUM within 3 minutes on the database server.
That would also account for a performance hit, especially on the testbot. I suggested turning off auto vacuum on the testbot, which we need to test, but it should be OK with the amount of memory the test bot has.
On a standard production server, postgresql should vacuum within 3 minutes of the cache rebuild with this many writes.
I think this is RTBC if the test bot run passes, and someone can confirm that my AB results are respectable for a cache rebuild. A 30-second request after a cache clear seems bad, but it is not deadlocking, and after the cache is rebuilt the requests go back down to being fast with the page cache.
Comment #18
mradcliffe

The pgsql-9.1-web-5.5 test bot was running tests in about 28 minutes before.
The job above is running more than twice as long.
Job Comparison:
Comment #19
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

Since this issue is about improving the overall situation, not just the current pgsql < 9.5 implementation, here's an update which also implements the native upsert for versions >= 9.5.
Comment #20
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

Let's see if the added lock is causing this.
Comment #21
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

So the lock is not the real reason for the performance decline, but I have to point out that my "revert upsert usage" patch from #2126447-71: Patch testing ran for 50 minutes (https://dispatcher.drupalci.org/job/default/7013/), so something else that was committed around Aug 10 made the performance worse. Or maybe some changes to the testbot itself?
In any case, I also ran the "ab -n 10000 -c 10 http://drupal8.dev/ and drush cr (repeatedly)" scenario and I didn't get any exceptions either, so the patch from #19 should be good to go. We can always open a followup to improve things further (e.g. try out Damien's suggestion from #12), but for now the current patch brings back some stability/sanity for the pgsql testbot and allows us to move forward with fixing the random exceptions.
Comment #22
mradcliffe

I updated the issue summary with details from the comments about the various things @amateescu or I tried.
Comment #23
mradcliffe

Specified which patch is the latest and hid #20.
Comment #24
Crell CreditAttribution: Crell at Palantir.net commented

I can't speak to the Postgres code specifically, but the DBTNG architectural bits look good to me. +1 on commit from that angle.
Comment #25
alexpott

Let's get this in because then at least we have testbot results again. Committed 8dc9763 and pushed to 8.0.x. Thanks!
Comment #27
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented

Fixed a (confusing) typo in the issue summary.
Comment #29
Liam Morland

Fixed a typo in the summary.