Percona Operator for MongoDB / K8SPSMDB-658

Improve operator logs when pausing/unpausing the cluster

Details

    • Type: Improvement
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: 1.11.0
    • Fix Version/s: 1.14.0

    Description

      1. pause
      Currently, when we set "pause: true" for a cluster, this is what we see in the operator logs:

      {"level":"info","ts":1647445131.6547358,"logger":"controller_psmdb","msg":"initiating replset","replset":"rs0","pod":"my-cluster-name-rs0-1"}
      {"level":"info","ts":1647445138.8284142,"logger":"controller_psmdb","msg":"replset was initialized","replset":"rs0","pod":"my-cluster-name-rs0-1"}
      {"level":"info","ts":1647445186.9312146,"logger":"controller_psmdb","msg":"update Mongo version to 5.0.6-5 (fetched from db)"}
      {"level":"info","ts":1647445851.3521862,"logger":"controller_psmdb","msg":"StatefulSet is changed, starting smart update","name":"my-cluster-name-rs0"}
      {"level":"info","ts":1647445851.41177,"logger":"controller_psmdb","msg":"primary pod is my-cluster-name-rs0-1.my-cluster-name-rs0.psmdb-test.svc.cluster.local:27017"}
      {"level":"info","ts":1647445851.4118285,"logger":"controller_psmdb","msg":"apply changes to secondary pod","pod":"my-cluster-name-rs0-2"}
      {"level":"error","ts":1647445912.7067978,"logger":"controller.psmdb-controller","msg":"Reconciler error","name":"my-cluster-name","namespace":"psmdb-test","error":"reconcile StatefulSet for rs0: failed to run smartUpdate: failed to apply changes: wait pod restart: reach pod wait limit","errorVerbose":"reconcile StatefulSet for rs0: failed to run smartUpdate: failed to apply changes: wait pod restart: reach pod wait limit\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:371\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
      

      So the logs only mention that the StatefulSet has changed; to find out what exactly happened I need to check the diff and see that "replicas: 3" in the StatefulSet changed to "replicas: 0".
      Also, an error like "reconcile StatefulSet for rs0: failed to run smartUpdate: failed to apply changes: wait pod restart: reach pod wait limit" shows up every time, which is not nice.
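
      For context, pausing is done by setting the "pause" field in the PerconaServerMongoDB custom resource spec (the cluster name below matches the logs above; the rest of the spec is omitted):

      ```yaml
      # Fragment of the PerconaServerMongoDB custom resource (cr.yaml).
      # Setting spec.pause to true makes the operator scale the cluster's
      # StatefulSets down to 0 replicas, which is the change behind the
      # "StatefulSet is changed, starting smart update" message above.
      apiVersion: psmdb.percona.com/v1
      kind: PerconaServerMongoDB
      metadata:
        name: my-cluster-name
      spec:
        pause: true
      ```

      Applying this (e.g. with kubectl apply) is what produces the log output quoted above.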

      2. unpause
      When we unpause the cluster there is no message in the operator logs at all. For debugging purposes it would be good to at least log a line saying that the cluster is now being unpaused.
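
      Unpausing is the reverse edit of the same custom resource (fragment only; other spec fields unchanged):

      ```yaml
      # Flipping spec.pause back to false brings the StatefulSets back to
      # their configured replica counts. It is exactly this transition that
      # currently produces no operator log line at all.
      spec:
        pause: false
      ```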

      3. This is how it looks in a sharded cluster:

      {"level":"info","ts":1647450208.4828126,"logger":"controller.psmdb-controller","msg":"Starting workers","worker count":1}
      {"level":"info","ts":1647450208.5847135,"logger":"controller.perconaservermongodbrestore-controller","msg":"Starting workers","worker count":1}
      {"level":"info","ts":1647450208.5847921,"logger":"controller.perconaservermongodbbackup-controller","msg":"Starting workers","worker count":1}
      {"level":"info","ts":1647450208.7911856,"logger":"KubeAPIWarningLogger","msg":"batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob"}
      {"level":"info","ts":1647450208.897378,"logger":"KubeAPIWarningLogger","msg":"policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget"}
      #
      # startup
      #
      {"level":"info","ts":1647450219.105958,"logger":"controller_psmdb","msg":"initiating replset","replset":"cfg","pod":"my-cluster-name-cfg-0"}
      {"level":"info","ts":1647450225.1640992,"logger":"controller_psmdb","msg":"replset was initialized","replset":"cfg","pod":"my-cluster-name-cfg-0"}
      {"level":"info","ts":1647450235.7624125,"logger":"controller_psmdb","msg":"initiating replset","replset":"rs0","pod":"my-cluster-name-rs0-0"}
      {"level":"info","ts":1647450241.4913213,"logger":"controller_psmdb","msg":"replset was initialized","replset":"rs0","pod":"my-cluster-name-rs0-0"}
      {"level":"info","ts":1647450251.741539,"logger":"controller_psmdb","msg":"initiating replset","replset":"rs1","pod":"my-cluster-name-rs1-0"}
      {"level":"info","ts":1647450257.3567595,"logger":"controller_psmdb","msg":"replset was initialized","replset":"rs1","pod":"my-cluster-name-rs1-0"}
      {"level":"info","ts":1647450278.3918908,"logger":"controller_psmdb","msg":"update Mongo version to 5.0.6-5 (fetched from db)"}
      {"level":"info","ts":1647450278.4973407,"logger":"controller_psmdb","msg":"adding rs to shard","rs":"rs0"}
      {"level":"info","ts":1647450280.8158197,"logger":"controller_psmdb","msg":"added to shard","rs":"rs0"}
      {"level":"info","ts":1647450280.9134772,"logger":"controller_psmdb","msg":"adding rs to shard","rs":"rs1"}
      {"level":"info","ts":1647450283.2842658,"logger":"controller_psmdb","msg":"added to shard","rs":"rs1"}
      #
      # pause
      #
      {"level":"info","ts":1647450479.270014,"logger":"controller_psmdb","msg":"waiting for config RS update"}
      {"level":"info","ts":1647450479.365072,"logger":"controller_psmdb","msg":"waiting for config RS update"}
      {"level":"info","ts":1647450546.8712163,"logger":"controller_psmdb","msg":"waiting for mongos update"}
      #
      # unpause
      #
      

      So it's even more confusing: the pause part has only "waiting for..." messages and the unpause part has no info at all.


People

    • Assignee: Unassigned
    • Reporter: Tomislav Plavcic (tomislav.plavcic@percona.com)
