Problem/Motivation

We have been investigating for quite a while cases where the DB service is not available, in both core and contrib. It is hard to reproduce because re-running the job usually fixes the issue, but enabling CI_DEBUG_SERVICES gives us extra debug information that can be useful.

That was the case here. From @dww in https://www.drupal.org/project/gitlab_templates/issues/3414252#comment-1...

Okay, here's a real failure from a job with CI_DEBUG_SERVICES enabled 🎉

https://git.drupalcode.org/project/address/-/jobs/752891

Logs are full of this:

[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851198940Z netcat: connect to localhost (127.0.0.1) port 3306 (tcp) failed: Connection refused
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:53.851225416Z netcat: connect to localhost (::1) port 3306 (tcp) failed: Cannot assign requested address

Here's the real culprit:

[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451301665Z 2024-02-05T21:59:23.451125Z 0 [ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451322612Z 2024-02-05T21:59:23.451154Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451325887Z 2024-02-05T21:59:23.451266Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451328497Z 2024-02-05T21:59:23.451274Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451331115Z 2024-02-05T21:59:23.451281Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451333559Z 2024-02-05T21:59:23.451285Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451338953Z 2024-02-05T21:59:23.451289Z 0 [ERROR] Failed to initialize builtin plugins.
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451341452Z 2024-02-05T21:59:23.451292Z 0 [ERROR] Aborting
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451344050Z 
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451346775Z 2024-02-05T21:59:23.451304Z 0 [Note] Binlog end
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.451465428Z 2024-02-05T21:59:23.451358Z 0 [Note] Shutting down plugin 'CSV'
[service:drupalci/mysql-5.7-database] 2024-02-05T21:59:23.453778406Z 2024-02-05T21:59:23.453665Z 0 [Note] /usr/sbin/mysqld: Shutdown complete

However, it's not clear why that is happening, just from these logs. Wonder if there's other output being saved somewhere that might be useful. Hopefully @fjgarlin has a chance to review this and knows where to look for the underlying problem.

I followed up on slack here: https://drupal.slack.com/archives/CGKLP028K/p1707211277736559?thread_ts=...

It seems that the quickest workaround is to add some configuration to the my.cnf file. We might need a bigger and more robust fix somewhere else, but it's not clear where or what yet, so we should address the issue here if possible.

It seems to be happening on mysql-5.7 but I'd probably do it for the other mysql versions too.

Steps to reproduce

Almost impossible to reproduce on demand in GitLab CI, but the example above shows a case where it happened, with detailed output.

It can be duplicated on a local system:

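# Start 15 MySQL containers in the background, 30 seconds apart.
# Each running container holds AIO slots on the host.
for i in $(seq 1 15); do
    docker run --rm --name "resource_exhaust_$i" drupalci/mysql-5.7:production > /dev/null 2>&1 &
    sleep 30
done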
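
A minimal sketch of how the containers started by the loop could be checked for the failure. The function names `has_aio_error` and `check_container` are hypothetical helpers; only `docker logs` and the error text from the logs above are taken as given.

```shell
#!/bin/sh
# Succeeds if the log text on stdin contains the InnoDB AIO startup failure.
has_aio_error() {
    grep -q 'io_setup() failed with EAGAIN'
}

# Report whether one container's logs show the failure.
# Container names follow the resource_exhaust_$i pattern used by the loop.
check_container() {
    if docker logs "$1" 2>&1 | has_aio_error; then
        echo "$1: AIO initialization failed"
    fi
}

# Example: check_container resource_exhaust_7
```

Because the loop starts containers with `--rm`, a container whose mysqld aborts is removed when it exits, so its logs are only available while it is still running. Dropping `--rm` for the test run keeps them around.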

Proposed resolution

Change the my.cnf files in the images to disable native AIO (innodb_use_native_aio = 0), as the InnoDB log message above suggests.
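
As a rough sketch of what such a change could look like, the snippet below writes a drop-in config file containing the setting named in the InnoDB log message. The function name, the file name `disable-native-aio.cnf`, and the use of a separate drop-in directory are assumptions for illustration; the images may carry the setting directly in my.cnf instead.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in file that disables InnoDB native AIO.
# Usage: write_aio_override <config directory>
write_aio_override() {
    conf_dir="$1"
    mkdir -p "$conf_dir"
    cat > "$conf_dir/disable-native-aio.cnf" <<'EOF'
[mysqld]
# Disable Linux native AIO, as suggested by the InnoDB log message.
innodb_use_native_aio = 0
EOF
}

# Example (directory is an assumption): write_aio_override ./conf.d
```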

Remaining tasks

MR

User interface changes

API changes

Data model changes

Comments

fjgarlin created an issue. See original summary.

fjgarlin’s picture

Project: GitLab Templates » DrupalCI: Environments
Component: gitlab-ci » PHP Containers
cmlara’s picture

Issue summary: View changes

Added a bash script that allows duplicating the issue locally, outside of GitLab. Nothing is special about GitLab as it relates to this issue; the only variable is the k8s node's fs.aio-max-nr value. Just periodically follow the logs of each container (or remove the null redirect so the logs go to the console) to see if the error occurs.

On a non-tuned laptop with several other containers running (including 3 MariaDB containers) I managed to reach the 7th container (10 database containers in total) before I saw the error.

One should be able to simulate higher concurrency by reducing sysctl fs.aio-max-nr (each container will then take up a higher percentage of the limit).
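
A small sketch for watching how close a host is to the limit. It assumes a Linux host that exposes the counters under /proc/sys/fs; `aio_usage_percent` is a hypothetical helper, and the value 16384 in the comment is an arbitrary example, not a recommendation.

```shell
#!/bin/sh
# Print AIO slot usage as an integer percentage.
# Usage: aio_usage_percent <fs.aio-nr> <fs.aio-max-nr>
aio_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

# On Linux the current values are readable from /proc/sys/fs.
if [ -r /proc/sys/fs/aio-nr ] && [ -r /proc/sys/fs/aio-max-nr ]; then
    nr=$(cat /proc/sys/fs/aio-nr)
    max=$(cat /proc/sys/fs/aio-max-nr)
    echo "AIO slots in use: $nr of $max ($(aio_usage_percent "$nr" "$max")%)"
fi

# To simulate a busier node, lower the limit (requires root):
#   sudo sysctl -w fs.aio-max-nr=16384
```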

Using the first results I could pull up on Slack (from March 2023), the largest runner in the fleet at the time was m4.10xlarge, which has 40 vCPUs.

Assuming I understand the reservation system correctly, assuming the standard (gitlab_templates) configuration of 2-CPU reservations, and assuming the absolute worst case where all pods are in the PHPUnit stage (where we have the build, helper, php, and mysql containers, along with any other containers a contrib project may have added), 20 SQL container instances may be running on a single physical k8s node (host/worker) at a time, along with the other ancillary containers that consume AIO slots.

That gives a (rough) minimal target for how many instances need to be able to launch for this to be considered 'resolved'.
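
The estimate is simply node vCPUs divided by the CPU reservation per pod, with one SQL container per pod. A sketch of that arithmetic, where `max_pods` is a hypothetical helper:

```shell
#!/bin/sh
# Usage: max_pods <node vCPUs> <CPUs reserved per pod>
max_pods() {
    echo $(( $1 / $2 ))
}

max_pods 40 2   # prints 20
```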

dimitriskr made their first commit to this issue’s fork.

andypost’s picture

Status: Active » Needs review
cmlara’s picture

Installing MariaDB/MySQL system tables in '/var/lib/mysql/' ...
io_setup(8192) returned -11
2024-02-13 21:02:08 0 [Warning] InnoDB: Linux Native AIO disabled.

The config change was loaded into the mysql config for mariadb.

I was able to run the script above (when set to the mariadb-10.6:dev test image) with all 15 containers starting.

nnewton’s picture

We are starting to hit this on core gitlabci as we are trying to consolidate runs on nodes. I would suggest we globally disable AIO for these containers (mysql/mariadb). There are solutions on the node side, but they are ugly and won't be portable between testing environments.

On our larger nodes we can reproduce this fairly consistently while watching aio-nr.

1 Job - 1 Node

root@runner-s4yvuuu9g-project-78834-concurrent-0-hpbibdd6:/var/www/html# sysctl -a 2> /dev/null | grep fs.aio
fs.aio-max-nr = 65536
fs.aio-nr = 8805

4 Jobs - 1 Node

root@runner-s4yvuuu9g-project-78834-concurrent-3-qg6kwxrj:/var/www/html# sysctl -a 2> /dev/null  | grep aio
fs.aio-max-nr = 65536
fs.aio-nr = 35220

And if we push 8 jobs to double that, the 8th will fail with:

[ERROR] InnoDB: io_setup() failed with EAGAIN after 5 attempts.
[service:drupalci/mysql-5.7-database] 2024-04-17T21:58:10.746455040Z 2024-04-17T21:58:10.746096Z 0 [Note] InnoDB: You can disable Linux Native AIO by setting innodb_use_native_aio = 0 in my.cnf
[service:drupalci/mysql-5.7-database] 2024-04-17T21:58:10.746456647Z 2024-04-17T21:58:10.746156Z 0 [ERROR] InnoDB: Cannot initialize AIO sub-system
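
The numbers above are consistent with a fixed cost per job. A sketch of the arithmetic, assuming every job consumes the same number of AIO slots as the single-job measurement (`jobs_that_fit` is a hypothetical helper):

```shell
#!/bin/sh
# Usage: jobs_that_fit <fs.aio-max-nr> <AIO slots per job>
jobs_that_fit() {
    echo $(( $1 / $2 ))
}

per_job=8805    # fs.aio-nr observed with 1 job on the node
limit=65536     # fs.aio-max-nr on the node

echo "4 jobs use $(( 4 * per_job )) slots"                          # 35220, matching the 4-job reading
echo "8 jobs would need $(( 8 * per_job )) slots"                   # 70440, above the limit
echo "Jobs that fit: $(jobs_that_fit "$limit" "$per_job")"          # 7
```

Under that assumption seven jobs fit below the limit and the eighth cannot allocate its slots, which matches the observed failure.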

Edit: Red Hat appears to have discovered this as well and added a way to disable AIO via an environment variable for their OpenShift containers: https://bugzilla.redhat.com/show_bug.cgi?id=1281733

andypost’s picture

Maybe we just need to increase this value, e.g. fs.aio-max-nr=200000, as most distros do?

nnewton’s picture

Which distros have this set to something other than 65536? Debian/RHEL/AL2 all seem to use the default of 65536. Either way, our (and everyone else's) EKS/AL2 based clusters will have this set to 65536. Modifying it would require a custom launch template or marking this sysctl as unsafe but allowed at the kubelet level. I would advise changing this at the container level, as that is a far cleaner solution and would resolve this portably between clusters.

fjgarlin’s picture

Based on #9, I added the same setting to all other MariaDB and MySQL images: https://git.drupalcode.org/project/drupalci_environments/-/merge_request...

What else would be needed? We can test this in an MR in core (they won't need to merge it I think).
I'm happy to help things move forward on this.

cmlara’s picture

Either way, our (and everyone else's) EKS/AL2 based clusters will have this set to 65536.

The key point is that this is a default value, not necessarily what everyone runs with.

Changing these to match the purpose of the environment is to be expected. Defaults are just that, defaults; a cluster manager is expected to manage their cluster to meet the needs of the design.

Changing the Drupal images is a start; however, that does nothing for projects that don't run the DrupalCI images (few if any at the moment), and not all containers offer an easy environment variable to disable this feature (for example, I couldn't find one in the wodby or dockerhub mariadb images). Few may use these right now, but it's not impossible that the gitlab_templates project could move away from the drupalci images if justification is provided.

In my opinion it makes a lot of sense for D.O. infra to tweak the environment to perform as the community is expected to use it.

fjgarlin’s picture

I agree that it might make sense to change at infra level, but I also think that if we have a quick win available within the images that we are using right now (ie: this MR), we should go ahead and do it.

nnewton’s picture

The defaults discussion was due to someone suggesting that distros were changing the default, which they are not.

Obviously we change numerous default settings in drupal-infra. As I mentioned in my previous comment, this setting is very difficult to change in a manageable/secure way on an EKS cluster in our config management, and we won't be doing so currently. We are working desperately to reduce maintenance overhead, and this would increase it for no clear advantage (if people start using external images en masse, enough that 8 would be co-scheduled on a node, we can address that then).

If this change is not merged, what we will do at the moment is limit per-node concurrency, not change the setting. This is why I suggested the change: it would stabilize the runs and not require per-node concurrency limits. Changing this setting is not currently an option. We may be able to re-address it in the future.

fjgarlin changed the visibility of the branch 3419805-fix-io-problems to hidden.

andypost’s picture

  • andypost committed 12da3346 on dev authored by fjgarlin
    Issue #3419805 by fjgarlin, cmlara, nnewton: Disable aio in mariadb and...
andypost’s picture

I merged to dev; let's see if all images are built: https://git.drupalcode.org/project/drupalci_environments/-/jobs/1408250

  • andypost committed cad6b497 on dev authored by fjgarlin
    Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
andypost’s picture

The current build system requires changes in a Dockerfile to trigger an automatic rebuild, so I updated the last commit to dev with https://git.drupalcode.org/project/drupalci_environments/-/commit/cad6b4...

  • andypost committed 1f233b8a on dev authored by fjgarlin
    Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...

  • andypost committed 993ad0a1 on dev authored by fjgarlin
    Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
andypost’s picture

Tuned the outdated repos, and now all images are pushed.

ref https://git.drupalcode.org/project/drupalci_environments/-/jobs/1418214

The images just need to install netcat-traditional and psmisc, so I disabled all other repos via sed.

fjgarlin’s picture

Tested on core D11 #3443233: [ignore] Test dev images and core D7 #3443234: [ignore] Test dev images with the :dev images and everything seems correct.

  • andypost committed 88466756 on production authored by fjgarlin
    Issue #3419805 by fjgarlin, andypost, cmlara, nnewton: Disable aio in...
andypost’s picture

Status: Needs review » Fixed

Thanks to everyone involved; the production images are published.

andypost’s picture

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
