Details
Bug
Status: Done
Medium
Resolution: Fixed
1.13.0
None
None
Yes
Yes
Yes
Description
Our demand-backup-sharding test fails sporadically because backups end up in error status. What is more, some of them actually complete in PBM even though they are marked as error.
It looks like this:
NAME      CLUSTER           STORAGE      DESTINATION                  STATUS   COMPLETED   AGE
backup1   my-cluster-name   aws-s3       psmdb/2023-01-12T15:41:55Z   ready    87m         87m
backup2   my-cluster-name   gcp-cs       psmdb/2023-01-12T15:43:01Z   error                87m
backup3   my-cluster-name   azure-blob   psmdb/2023-01-12T15:42:34Z   error                87m
backup1 finished, backup3 started and was marked as error, and then backup2 was marked as error as well.
In this case backup3 actually finished in PBM. backup2 was started while backup3 was still running in PBM, so PBM just ignored it and it never finished.
List of backups from PBM:
Backup snapshots:
  2023-01-12T15:39:08Z <logical> [restore_to_time: 2023-01-12T15:39:14Z]
  2023-01-12T15:39:52Z <logical> [restore_to_time: 2023-01-12T15:40:00Z]
  2023-01-12T15:40:29Z <logical> [restore_to_time: 2023-01-12T15:40:35Z]
  2023-01-12T15:41:55Z <logical> [restore_to_time: 2023-01-12T15:42:01Z]
  2023-01-12T15:42:34Z <logical> [restore_to_time: 2023-01-12T15:43:09Z]
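As you can see, two of the three backups finished: snapshots exist for backup1 (2023-01-12T15:41:55Z) and backup3 (2023-01-12T15:42:34Z), but not for backup2 (2023-01-12T15:43:01Z).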
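The comparison between the operator's view and PBM's view can be reproduced with a small cross-check. This is only an illustrative sketch: the variable names are hypothetical, and the timestamps are copied from the two listings above.

```python
# Destination timestamps reported by the operator for each backup object.
operator_backups = {
    "backup1": "2023-01-12T15:41:55Z",
    "backup2": "2023-01-12T15:43:01Z",
    "backup3": "2023-01-12T15:42:34Z",
}

# Snapshot names reported by PBM.
pbm_snapshots = {
    "2023-01-12T15:39:08Z",
    "2023-01-12T15:39:52Z",
    "2023-01-12T15:40:29Z",
    "2023-01-12T15:41:55Z",
    "2023-01-12T15:42:34Z",
}

# A backup actually finished in PBM if its destination timestamp has a snapshot.
finished_in_pbm = {
    name: timestamp in pbm_snapshots
    for name, timestamp in operator_backups.items()
}

for name, finished in sorted(finished_in_pbm.items()):
    print(name, "finished in PBM" if finished else "no snapshot in PBM")
```

backup3 shows up as finished in PBM even though the operator reports it as error, which is the mismatch this ticket is about.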
What happens is that we have "pbmStartingDeadline" set to 120 seconds (or so we thought), and if a backup stays in the starting state for longer than 120 seconds we mark it as error.
The problem is that we never actually waited 120 seconds before marking a backup as error; backups were marked almost instantly.
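The intended behaviour can be sketched as follows. This is a minimal illustration, not the operator's code: the function names and status strings are hypothetical, and only the 120-second value and the "pbmStartingDeadline" name come from the description.

```python
from datetime import datetime, timedelta, timezone

# Assumed deadline, mirroring the "pbmStartingDeadline" value from the description.
PBM_STARTING_DEADLINE = timedelta(seconds=120)


def starting_deadline_exceeded(started_at, now, deadline=PBM_STARTING_DEADLINE):
    """Return True only if the backup has been starting for longer than the deadline."""
    return now - started_at > deadline


def next_status(status, started_at, now):
    """Mark a backup as error only after it has exceeded the starting deadline."""
    if status == "starting" and starting_deadline_exceeded(started_at, now):
        return "error"
    return status


# Hypothetical usage with the start time of backup2 from the listing above.
started = datetime(2023, 1, 12, 15, 43, 1, tzinfo=timezone.utc)
print(next_status("starting", started, started + timedelta(seconds=5)))
print(next_status("starting", started, started + timedelta(seconds=121)))
```

With a check like this, a backup that is only a few seconds into the starting state keeps its status, and only one that is past the full deadline is marked as error.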