Problem/Motivation
For quite a while we have been investigating cases in core and contrib where the DB service is not available. They are hard to reproduce because re-running the job usually fixes the issue, but enabling CI_DEBUG_SERVICES provides extra debug information that can be useful.
That was the case here, as reported by @dww in https://www.drupal.org/project/gitlab_templates/issues/3414252#comment-1...
Okay, here's a real failure from a job with CI_DEBUG_SERVICES enabled 🎉
https://git.drupalcode.org/project/address/-/jobs/752891
Logs are full of this:
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851198940Z netcat: connect to localhost (127.0.0.1) port 3306 (tcp) failed: Connection refused
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851225416Z netcat: connect to localhost (::1) port 3306 (tcp) failed: Cannot assign requested address
Here's the real culprit:
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451301665Z 2024-02-05T21:59:23.451125Z 0 [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451322612Z 2024-02-05T21:59:23.451154Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451325887Z 2024-02-05T21:59:23.451266Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451328497Z 2024-02-05T21:59:23.451274Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451331115Z 2024-02-05T21:59:23.451281Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451333559Z 2024-02-05T21:59:23.451285Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451338953Z 2024-02-05T21:59:23.451289Z 0 [ERROR] Failed to initialize builtin plugins.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451341452Z 2024-02-05T21:59:23.451292Z 0 [ERROR] Aborting
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451344050Z
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451346775Z 2024-02-05T21:59:23.451304Z 0 [Note] Binlog end
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451465428Z 2024-02-05T21:59:23.451358Z 0 [Note] Shutting down plugin 'CSV'
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.453778406Z 2024-02-05T21:59:23.453665Z 0 [Note] /usr/sbin/mysqld: Shutdown complete
However, it's not clear why that is happening, just from these logs. Wonder if there's other output being saved somewhere that might be useful. Hopefully @fjgarlin has a chance to review this and knows where to look for the underlying problem.
I followed up on Slack here: https://drupal.slack.com/archives/CGKLP028K/p1707211277736559?thread_ts=...
It seems that the quickest workaround is to add some configuration to the my.cnf file. We might need a bigger and more robust fix somewhere else, but it's not clear where or what yet, so we should address the issue here if possible.
It seems to be happening on mysql-5.7 but I'd probably do it for the other mysql versions too.
Steps to reproduce
Almost impossible to reproduce on demand in GitLab CI, but the example above shows a case where it happened, with detailed output.
It can be reproduced on a local system by starting many database containers on one host:
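for i in $(seq 1 15); do
  # Start a MySQL 5.7 container in the background, discarding its output.
  docker run --rm --name "resource_exhaust_$i" drupalci/mysql-5.7:production > /dev/null 2>&1 &
  # Give the server time to start before launching the next one.
  sleep 30
done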
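While the loop above runs, the kernel's AIO counters show how close the host is to its limit. The following is a minimal sketch, assuming a Linux host; the `aio_usage` helper and its optional directory argument are illustrative names and not part of the issue.

```shell
#!/bin/sh
# Print current AIO context usage against the system-wide limit.
# The optional argument (default /proc/sys/fs) exists only so the helper
# can be pointed at sample files; it is an assumption for illustration.
aio_usage() {
  dir=${1:-/proc/sys/fs}
  nr=$(cat "$dir/aio-nr")
  max=$(cat "$dir/aio-max-nr")
  echo "aio-nr=$nr aio-max-nr=$max free=$((max - nr))"
}

# Only query the live counters where they exist (Linux).
if [ -r /proc/sys/fs/aio-nr ]; then
  aio_usage
fi
```

Calling `aio_usage` between iterations of the loop should show `aio-nr` climbing as each database container starts.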
Proposed resolution
Update the my.cnf files in the database images with a setting that avoids this failure.
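The InnoDB log message quoted in the problem description names the setting involved. The sketch below shows the kind of change that could be made, assuming a drop-in config file; the directory and file name are placeholders and not the paths used in the drupalci images.

```shell
#!/bin/sh
# Write a config fragment that turns off Linux native AIO for InnoDB.
# CONF_DIR and the file name are hypothetical; adjust to the image layout.
CONF_DIR=${CONF_DIR:-./mysql-conf.d}
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/disable-native-aio.cnf" <<'EOF'
[mysqld]
innodb_use_native_aio = 0
EOF
```

With this option set, the server should no longer need kernel AIO contexts at start-up, which is the step that fails with EAGAIN in the log above.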
Remaining tasks
MR
User interface changes
API changes
Data model changes
Issue fork drupalci_environments-3419805
- 3419805-disable-aio
changes, plain diff MR !43
- 3419805-fix-io-problems
changes, plain diff MR !41
Comments
Comment #2
fjgarlin commented
Comment #3
cmlara
Added a bash script that allows reproducing the problem locally, outside of GitLab. Nothing is special about GitLab as it relates to this issue; the only variable is the K8s node's fs.aio-max-nr value. Just periodically follow the logs of each container (or remove the null redirect and have the logs output to the console) to see if the error occurs.
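The log check described in this comment can be sketched as follows. It assumes each container's output is redirected to its own file rather than to /dev/null; the `failed_containers` helper and the `logs` directory are hypothetical names.

```shell
#!/bin/sh
# List the log files in a directory that contain the InnoDB AIO failure.
# Assumes each container's output was redirected to <dir>/<name>.log
# instead of /dev/null (a variation on the loop in the issue summary).
failed_containers() {
  grep -lF 'io_setup() failed with EAGAIN' "$1"/*.log 2>/dev/null
}

# Example (run manually):
#   mkdir -p logs
#   docker run --rm --name resource_exhaust_1 \
#     drupalci/mysql-5.7:production > logs/resource_exhaust_1.log 2>&1 &
#   failed_containers logs
```

Writing to files matters here because the containers are started with `--rm`: once mysqld aborts, the container is removed and `docker logs` can no longer be used on it.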
On a non-tuned laptop with several other containers running (including 3 MariaDB containers), I managed to reach the 7th container (10 database containers in total) before I saw the error.
One should be able to simulate higher concurrency by reducing the fs.aio-max-nr sysctl, since each container then takes up a higher percentage of the limit.
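The tuning suggested above might look like the following sketch. The value 16384 is an arbitrary example and not a recommendation from this thread, and changing the sysctl needs root on a Linux host. `slots_per_container` is a hypothetical helper that estimates the per-container cost from two readings of `/proc/sys/fs/aio-nr`.

```shell
#!/bin/sh
# Estimate AIO contexts consumed per container from two aio-nr readings
# taken before and after starting `count` containers.
slots_per_container() {
  before=$1
  after=$2
  count=$3
  echo $(( (after - before) / count ))
}

# To make each container a larger share of the limit (run manually, as root):
#   sysctl fs.aio-max-nr                 # note the current value first
#   sudo sysctl -w fs.aio-max-nr=16384   # example value only
```

Remember to restore the original value afterwards, since the setting applies to the whole host.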
Using the first results I could pull up on Slack (from March 2023), the largest runner in the fleet at the time was an m4.10xlarge, which has 40 vCPUs.
Assuming I understand the reservation system correctly, assuming the standard (gitlab_templates) configuration of 2-CPU reservations, and assuming the absolute worst case in which all pods are in the PHPUnit stage (where we have the build, helper, php, and mysql containers, along with any other containers a contrib project may have added), 20 SQL container instances may be running on a single physical K8s node (host/worker) at a time, along with the other ancillary containers that consume AIO slots.
That gives a (rough) minimum target for how many instances need to be able to launch for this to be considered 'resolved'.
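The estimate above can be written out as arithmetic. This sketch uses only figures mentioned in this thread (40 vCPUs, 2-CPU reservations, and the default fs.aio-max-nr of 65536) and assumes every job on the node runs exactly one database container, ignoring the other containers that also consume AIO slots.

```shell
#!/bin/sh
# Figures from this thread: 40 vCPUs on the largest runner, 2 CPUs reserved
# per job, and the default fs.aio-max-nr of 65536.
vcpus=40
cpus_per_job=2
aio_max_nr=65536

# Worst case: every reservation on the node is a job with a database container.
jobs_per_node=$((vcpus / cpus_per_job))
# Upper bound on AIO contexts each job could use before the node runs out.
slots_per_job=$((aio_max_nr / jobs_per_node))

echo "jobs_per_node=$jobs_per_node"
echo "slots_per_job=$slots_per_job"
```

Under these assumptions the node would need to support 20 concurrent database containers, leaving at most 3276 AIO contexts for each job.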
Comment #6
andypost
Merged, since I opened a core MR to test it: https://git.drupalcode.org/project/drupal/-/merge_requests/6586
Comment #7
andypost
Comment #8
cmlara
The config change was loaded into the MySQL config for MariaDB.
I was able to run the script above (when set to the mariadb-10.6:dev test image) with all 15 containers starting.
Comment #9
nnewton commented
We are starting to hit this on core GitLab CI as we try to consolidate runs on nodes. I would suggest we globally disable AIO for these containers (mysql/mariadb). There are solutions on the node side, but they are ugly and won't be portable between testing environments.
On our larger nodes we can reproduce this fairly consistently while watching aio-nr.
1 Job - 1 Node
4 Jobs - 1 Node
And if we push 8 jobs to double that, the 8th will fail with:
Edit: Red Hat appears to have discovered this as well and added a way to disable AIO via an environment variable for their OpenShift containers: https://bugzilla.redhat.com/show_bug.cgi?id=1281733
Comment #10
andypost
Maybe this value just needs to be increased,
fs.aio-max-nr=200000
as most distros are doing?
Comment #11
nnewton commented
Which distros set this to something other than 65536? Debian/RHEL/AL2 all seem to use the default of 65536. Either way, our (and everyone else's) EKS/AL2-based clusters will have this set to 65536. Modifying it would require a custom launch template or marking this sysctl as unsafe but allowed at the kubelet level. I would advise changing this at the container level, as that is a far cleaner solution and would resolve the problem portably between clusters.
Comment #13
fjgarlin commented
Based on #9, I added the same setting to all other MariaDB and MySQL images: https://git.drupalcode.org/project/drupalci_environments/-/merge_request...
What else would be needed? We can test this in an MR in core (they won't need to merge it I think).
I'm happy to help things move forward on this.
Comment #14
cmlara
The key point is that this is a default value, not necessarily what everyone runs with.
Changing defaults to match the purpose of the environment is to be expected. Defaults are just that, defaults; a cluster manager is expected to manage their cluster to meet the needs of the design.
Changing the Drupal images is a start; however, that does nothing for projects that don't run the DrupalCI images (few if any at the moment), and not all containers offer an easy environment variable to disable this feature (for example, I couldn't find one in the wodby or Docker Hub mariadb images). Few may use these right now; however, it's not impossible that the gitlab_templates project could move away from the drupalci images if justification is provided.
In my opinion, it makes a lot of sense for D.O. infra to tune the environment for the way the community is expected to use it.
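For images without a dedicated environment variable, one possibility is to pass the server option on the command line. This is an assumption and was not verified in this thread: it relies on the image's entrypoint forwarding extra arguments to the server, as the Docker Hub `mysql` image does. The `aio_off_cmd` helper is a hypothetical name, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Build (but do not run) a docker command that disables native AIO via a
# mysqld option. Assumes the image's entrypoint forwards extra arguments
# to the server; this was not checked for every image named in the thread.
aio_off_cmd() {
  echo "docker run --rm -e MYSQL_ALLOW_EMPTY_PASSWORD=yes $1 --innodb-use-native-aio=0"
}

aio_off_cmd mysql:5.7
```

Whether a given third-party image accepts server options this way would need to be checked against that image's own documentation.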
Comment #15
fjgarlin commented
I agree that it might make sense to change this at the infra level, but I also think that if we have a quick win available within the images that we are using right now (i.e. this MR), we should go ahead and do it.
Comment #16
nnewton commented
The defaults discussion was due to someone suggesting that distros were changing the default, which they are not.
Obviously we change numerous default settings in drupal-infra. As I mentioned in my previous comment, this setting is very difficult to change in a manageable and secure way on an EKS cluster in our config management, and we won't be doing so currently. We are working desperately to reduce maintenance overhead, and this would increase it for no clear advantage (if people start using external images en masse, to the point that 8 would be co-scheduled on a node, we can address that then).
If this change is not merged, what we will do for now is limit per-node concurrency, not change the setting. This is why I suggested the change: it would stabilize the runs without requiring per-node concurrency limits. Changing this setting is not currently an option. We may be able to revisit it in the future.
Comment #18
andypost
Comment #20
andypost
I merged to dev; let's see if all images are built: https://git.drupalcode.org/project/drupalci_environments/-/jobs/1408250
Comment #22
andypost
The current build system requires changes in Dockerfile to automatically rebuild, so I updated the last commit to dev with https://git.drupalcode.org/project/drupalci_environments/-/commit/cad6b4...
Comment #25
andypost
Tuned the outdated repos, and now all images are pushed.
ref https://git.drupalcode.org/project/drupalci_environments/-/jobs/1418214
The images just need to install netcat-traditional psmisc, so I disabled all other repos via sed.
Comment #26
fjgarlin commented
Tested on core D11 (#3443233: [ignore] Test dev images) and core D7 (#3443234: [ignore] Test dev images) with the :dev images, and everything seems correct.
Comment #28
andypost
Thanks to everyone involved; production images are published.
Comment #29
andypost
Btw, I just got a failure for D7: https://git.drupalcode.org/issue/drupal-3443234/-/jobs/1432748
Comment #31
andypost
Testing a new approach: https://git.drupalcode.org/project/drupalci_environments/-/commit/de898f...