Project: Percona Operator for MySQL based on Percona XtraDB Cluster
Issue: K8SPXC-596

Liveness for pxc container could cause zombie processes

Details

    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels: None

    Description

      The operator returns an error related to the liveness check:

      {"level":"error","ts":1608197347.9033678,"logger":"controller_perconaxtradbcluster","msg":"sync users","error":"exec syncusers: command terminated with exit code 126 / OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:95: starting setns process caused: fork/exec /proc/self/exe: resource temporarily unavailable: unknown\r\n / ","errorVerbose":"exec syncusers: command terminated with exit code 126 / OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:95: starting setns process caused: fork/exec /proc/self/exe: resource temporarily unavailable: unknown\r\n / \ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).syncPXCUsersWithProxySQL\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/users.go:320\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).resyncPXCUsersWithProxySQL.func1\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:993\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).resyncPXCUsersWithProxySQL.func1\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:995"}
      
      

      Such issues happen because of fork bombs or an accumulation of zombie processes: a zombie still occupies a PID, so enough of them exhaust the process limit and fork() starts failing with "resource temporarily unavailable".
      A probe timeout can be simulated in liveness-check.sh by adding "sleep 100" after the TIMEOUT=10 line:

      kubectl describe pod cluster1-pxc-0
      ...
        Warning  Unhealthy  43s (x5 over 20m)  kubelet            Liveness probe errored: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 5s exceeded: context deadline exceeded
      
      

      ps shows many zombie processes after some time:

      bash-4.2$ ps -eFH
      UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
      mysql       1245       0  0  2960  2928   2 14:56 pts/1    00:00:00 bash
      mysql       1677    1245  0 12942  3572   1 15:11 pts/1    00:00:00   ps -eFH
      mysql          1       0  1 885231 462944 1 14:39 ?        00:00:32 mysqld --wsrep_start_position=457332a3-54e3-11eb-b2a8-bfd3c651948d:13
      mysql        881       1  0     0     0   4 14:44 ?        00:00:00   [sleep] <defunct>
      mysql        927       1  0     0     0   2 14:46 ?        00:00:00   [sleep] <defunct>
      mysql       1002       1  0     0     0   4 14:47 ?        00:00:00   [sleep] <defunct>
      mysql       1062       1  0     0     0   7 14:49 ?        00:00:00   [sleep] <defunct>
      mysql       1107       1  0     0     0   5 14:51 ?        00:00:00   [sleep] <defunct>
      mysql       1149       1  0     0     0   6 14:52 ?        00:00:00   [sleep] <defunct>
      mysql       1202       1  0     0     0   1 14:54 ?        00:00:00   [sleep] <defunct>
      mysql       1244       1  0     0     0   5 14:56 ?        00:00:00   [sleep] <defunct>
      mysql       1295       1  0     0     0   5 14:57 ?        00:00:00   [sleep] <defunct>
      mysql       1350       1  0     0     0   1 14:59 ?        00:00:00   [sleep] <defunct>
      mysql       1392       1  0     0     0   4 15:01 ?        00:00:00   [sleep] <defunct>
      mysql       1435       1  0     0     0   3 15:02 ?        00:00:00   [sleep] <defunct>
      mysql       1489       1  0     0     0   6 15:04 ?        00:00:00   [sleep] <defunct>
      mysql       1530       1  0     0     0   4 15:06 ?        00:00:00   [sleep] <defunct>
      mysql       1570       1  0     0     0   1 15:07 ?        00:00:00   [sleep] <defunct>
      mysql       1622       1  0     0     0   7 15:09 ?        00:00:00   [sleep] <defunct>
      mysql       1664       1  0  1095   664   4 15:11 ?        00:00:00   sleep 100
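
      The <defunct> entries above are children that exited while their parent (after re-parenting, PID 1, i.e. mysqld) never reaped them with wait(). A minimal sketch of the mechanism, assuming a Linux host with bash and a procps-style ps; the commands and timings here are illustrative, not taken from the probe script:

      ```shell
      # Create a child, let it exit, and delay reaping it: until the parent
      # calls wait(), ps reports the child in state "Z" (zombie / <defunct>).
      sleep 0.1 &                          # child exits almost immediately
      child=$!
      sleep 1                              # parent is busy, child not reaped yet
      state=$(ps -o stat= -p "$child")     # zombie state string contains "Z"
      echo "state before wait: $state"
      wait "$child"                        # reaping removes the <defunct> entry
      ```

      In the pod above the reaping never happens: mysqld runs as PID 1 and does not wait() for re-parented children, so each timed-out probe appears to leave one more zombie behind.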
      

      Expected behavior:
      The mysql command in liveness-check.sh should be killed before the kubelet's "timeout 5s exceeded" error fires.

      mysql ... --connect-timeout=$TIMEOUT (where TIMEOUT is 10 seconds) is longer than the 5-second probe deadline, and a connect timeout cannot guarantee a correct overall time limit anyway: it only bounds connection establishment, and even simple queries can take a long time under high CPU load on the mysql side.

      E.g. for a 5-second total timeout we can use something like:

      timeout 4s mysql --connect-timeout=3 -uroot -proot_password -e 'select sleep(10)'
      # or even with kill: TERM at 3s, KILL 2s later, still within the 5s budget
      timeout -k 2s 3s mysql --connect-timeout=2 -uroot -proot_password -e 'select sleep(10)'
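
      What makes timeout(1) suitable here, unlike --connect-timeout, is that it bounds total wall-clock time regardless of what the child is doing. A quick check of that behavior, assuming GNU coreutils timeout ("sleep 10" stands in for a slow mysql query):

      ```shell
      # timeout(1) kills the child when the limit elapses and itself exits
      # with status 124, so the probe script can fail fast and predictably.
      timeout 1s sleep 10 && rc=$? || rc=$?
      echo "timeout exit status: $rc"      # 124 indicates the limit was hit
      ```

      The -k variant additionally sends SIGKILL after a grace period, which protects against a client that ignores SIGTERM.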
      
      

          People

            Assignee: Slava Sarzhan (slava.sarzhan)
            Reporter: Nickolay Ihalainen (nickolay.ihalainen)
            Votes: 1
            Watchers: 4


              Time Tracking

                Logged: 7h 10m