Details
-
Bug
-
Status: Done
-
High
-
Resolution: Fixed
-
1.7.0, 1.8.0, 1.9.0
-
None
-
None
Description
-
-
- Description
-
We find that mongodb-operator could accidentally delete the statefulset and PVC belonging to the PerconaServerMongoDB after experiencing a restart and talking to one stale apiserver in an HA k8s cluster. Although the root cause is in the staleness from apiserver, the controller code can be further enhanced to avoid such undesired deletion.
More concretely, if the controller receives a stale update event saying "PerconaServerMongoDB has non-nil DeletionTimestamp" from the stale apiserver, `deletePvcFinalizer` will be invoked to delete both the PVC and the statefulset. Note that the controller only lists the related PVC and statefulset using clusterLabels (i.e., the name), so it cannot tell whether the listed PVC and statefulset really belong to the "to-be-deleted" PerconaServerMongoDB when it shares the name with the currently running one. One potential way to enhance this part is to double-check that the controller reference UID of each listed statefulset matches the UID of the "to-be-deleted" PerconaServerMongoDB. We have attached a patch for the UID checking before deleting a statefulset. The UID checking is located in `deleteAllStatefulsets`:
```
func (r *ReconcilePerconaServerMongoDB) deleteAllStatefulsets(cr *api.PerconaServerMongoDB) error {
stsList, err := r.getAllstatefulsets(cr)
if err != nil
for _, sts := range stsList.Items {
+ if !r.matchUID(cr, &sts)
log.Info("deleting StatefulSet", "name", sts.Name)
err := r.client.Delete(context.TODO(), &sts)
if err != nil
}
return nil
}
```
Since every two objects, even sharing the same name, should have a different UID, checking UID before deletion can help avoid the above problem.
-
-
- Reproduction
-
We list concrete reproduction steps as below (in a HA k8s cluster):
1. Create a PerconaServerMongoDB instance `mdb` with finalizer `delete-psmdb-pvc`, and a `replset` sized 3. The controller is talking to apiserver1 (that is not stale). A statefulset and three PVC will be created in the k8s cluster.
2. Delete `mdb`. Apiserver1 will send the update events with a non-nil DeletionTimestamp to the controller, which triggers `deletePvcFinalizer` to delete the related statefulset and PVC during reconcile. Meanwhile, apiserver2 is partitioned so its watch cache stops at the moment that `mdb` is tagged with a non-nil DeletionTimestamp.
3. Create the PerconaServerMongoDB with the same name `mdb`. Now the statefulset and PVC get back. However, apiserver2 still holds the stale view that `mdb` has a non-nil DeletionTimestamp and is about to be deleted.
4. The controller restarts after a node failure and talks to the stale apiserver2. Reading the stale information (non-nil DeletionTimestamp) about the old `mdb` from apiserver2, `deletePvcFinalizer` gets invoked again and the statefulset and PVC belonging to the existing PerconaServerMongoDB gets listed (as mentioned above) and deleted.
-
-
- Fix
-
We have attached a patch to help fix this issue. As mentioned above, the patch performs UID checking before deleting each statefulset. It seems that PVC does not have a controller reference, we can later send another patch to label each PVC with the PerconaServerMongoDB UID and perform the same check before deleting the PVC.