Percona Backup for MongoDB / PBM-683

Show PITR restore as failed if an error occurred during data retrieval from storage

Details

    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels: None

    Description

      I had one cluster with some data and PITR chunks, then restored it to a new cluster using the "latest" option.
      Here is the error; the full restore log is attached (rs1-restore.log):

      2021-05-28T10:04:34.000+0000 I [pitr/2021-05-28T10:00:39Z] mongorestore finished
      2021-05-28T10:04:36.000+0000 I [pitr/2021-05-28T10:00:39Z] restoring users and roles
      2021-05-28T10:04:36.000+0000 I [pitr/2021-05-28T10:00:39Z] starting oplog replay
      2021-05-28T10:04:37.000+0000 I [pitr/2021-05-28T10:00:39Z] oplog replay finished on {1622194583 21}
      2021-05-28T10:04:37.000+0000 I [pitr/2021-05-28T10:00:39Z] replay chunks
      2021-05-28T10:05:09.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:05:09.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 1s
      2021-05-28T10:05:10.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:05:42.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:05:42.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 2s
      2021-05-28T10:05:44.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:06:15.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:06:15.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 3s
      2021-05-28T10:06:18.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:06:48.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:06:48.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 4s
      2021-05-28T10:06:52.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:07:23.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:07:23.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 5s
      2021-05-28T10:07:28.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:07:58.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:07:58.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 6s
      2021-05-28T10:08:04.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:08:35.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:08:35.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 7s
      2021-05-28T10:08:42.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:09:14.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:09:14.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 8s
      2021-05-28T10:09:22.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:09:54.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:09:54.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 9s
      2021-05-28T10:10:03.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:10:36.000+0000 W [pitr/2021-05-28T10:00:39Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:10:36.000+0000 W [pitr/2021-05-28T10:00:39Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 10s
      2021-05-28T10:10:46.000+0000 I [pitr/2021-05-28T10:00:39Z] session recreated, resuming download
      2021-05-28T10:10:46.000+0000 E [pitr/2021-05-28T10:00:39Z] download 'tomislav-backup-test2/pbmPitr/rs1/20210528/20210528093623-23.20210528095047-3.oplog.snappy' file from S3: context deadline exceeded (Client.Timeout or context cancellation while reading body)
      2021-05-28T10:10:46.000+0000 I [pitr/2021-05-28T10:00:39Z] oplog replay finished on {0 0} <1622196039>
      2021-05-28T10:10:46.000+0000 I [pitr/2021-05-28T10:00:39Z] recovery successfully finished
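
      The log above implies a retry loop around the chunk download: on every read failure the storage session is recreated and the wait grows by one second per attempt (1s, 2s, ..., 10s), after which the last error is only logged. A minimal sketch of that pattern, using hypothetical names rather than PBM's actual code, where the final error is returned to the caller instead of being swallowed:

      package pitr

      import (
          "context"
          "fmt"
          "log"
          "time"
      )

      // downloadWithRetry is an illustrative stand-in for the behaviour the log
      // suggests: read() is retried with a linearly growing backoff, and once the
      // retry budget is spent the last error is handed back to the caller.
      func downloadWithRetry(ctx context.Context, read func(context.Context) error) error {
          const maxRetries = 10

          var err error
          for attempt := 1; attempt <= maxRetries; attempt++ {
              if err = read(ctx); err == nil {
                  return nil
              }
              log.Printf("got %v, try to reconnect in %ds", err, attempt)
              time.Sleep(time.Duration(attempt) * time.Second)
              // "session recreated, resuming download"
          }
          // This error has to reach the restore logic; otherwise the replay is
          // reported as finished even though a chunk was never fully read.
          return fmt.Errorf("download chunk from S3: %w", err)
      }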
      

      The data also differs between the source cluster (my-cluster-name) and the restored cluster (my-cluster-name2):

      # cluster: my-cluster-name
      mongos> show dbs;
      admin       0.001GB
      config      0.003GB
      ycsb_test   0.131GB
      ycsb_test2  0.132GB
      ycsb_test3  0.132GB
      ycsb_test4  0.132GB
      mongos> use ycsb_test;
      switched to db ycsb_test
      mongos> db.usertable.count()
      100000
      mongos> use ycsb_test2;
      switched to db ycsb_test2
      mongos> db.usertable.count()
      100000
      mongos> use ycsb_test3;
      switched to db ycsb_test3
      mongos> db.usertable.count()
      100000
      mongos> use ycsb_test4;
      switched to db ycsb_test4
      mongos> db.usertable.count()
      100000
      
      # cluster: my-cluster-name2
      mongos> show dbs;
      admin       0.001GB
      config      0.002GB
      ycsb_test   0.130GB
      ycsb_test2  0.130GB
      ycsb_test3  0.130GB
      ycsb_test4  0.103GB
      mongos> use ycsb_test;
      switched to db ycsb_test
      mongos> db.usertable.count()
      100000
      mongos> use ycsb_test2;
      switched to db ycsb_test2
      mongos> db.usertable.count()
      100000
      mongos> use ycsb_test3;
      switched to db ycsb_test3
      mongos> db.usertable.count()
      100000
      mongos> use ycsb_test4;
      switched to db ycsb_test4
      mongos> db.usertable.count()
      83054
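
      For what it's worth, the mismatch can also be checked programmatically; a rough sketch with the Go MongoDB driver, assuming placeholder mongos connection strings and the database names from above:

      package main

      import (
          "context"
          "fmt"
          "log"
          "time"

          "go.mongodb.org/mongo-driver/bson"
          "go.mongodb.org/mongo-driver/mongo"
          "go.mongodb.org/mongo-driver/mongo/options"
      )

      // countUsertable counts usertable documents in each YCSB database on one cluster.
      func countUsertable(uri string) (map[string]int64, error) {
          ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
          defer cancel()

          client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
          if err != nil {
              return nil, err
          }
          defer client.Disconnect(ctx)

          counts := make(map[string]int64)
          for _, db := range []string{"ycsb_test", "ycsb_test2", "ycsb_test3", "ycsb_test4"} {
              n, err := client.Database(db).Collection("usertable").CountDocuments(ctx, bson.D{})
              if err != nil {
                  return nil, err
              }
              counts[db] = n
          }
          return counts, nil
      }

      func main() {
          // Placeholder connection strings for the source and the restored cluster.
          src, err := countUsertable("mongodb://my-cluster-name-mongos:27017")
          if err != nil {
              log.Fatal(err)
          }
          dst, err := countUsertable("mongodb://my-cluster-name2-mongos:27017")
          if err != nil {
              log.Fatal(err)
          }
          for db, n := range src {
              if dst[db] != n {
                  fmt.Printf("%s.usertable: source=%d restored=%d (MISMATCH)\n", db, n, dst[db])
              }
          }
      }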
      

      As you can see, "ycsb_test4.usertable" is missing documents (83054 instead of the expected 100000), and the download error shown above occurred on the replica set that holds the ycsb_test4 database (rs1).
      Yet the restore is reported as successful.
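
      The change this ticket asks for boils down to propagating the chunk-replay error into the restore status instead of only logging it, so the restore shows up as failed in pbm status and in the operator. A minimal sketch of that behaviour, where every type and function name is a placeholder and not PBM's real API:

      package pitr

      import (
          "context"
          "fmt"
      )

      // Placeholder types for illustration only.
      type Status string

      const (
          StatusDone  Status = "done"
          StatusError Status = "error"
      )

      type Chunk struct{ Name string }

      type Restore struct {
          Chunks []Chunk
          Status Status
          Err    string
      }

      // replayChunk stands in for: download the chunk from storage, decompress it,
      // and apply the oplog entries it contains.
      func replayChunk(ctx context.Context, c Chunk) error { return nil }

      // replayChunks hands the first failure back to its caller instead of only
      // writing it to the log, so a broken download can no longer end in
      // "recovery successfully finished".
      func replayChunks(ctx context.Context, chunks []Chunk) error {
          for _, c := range chunks {
              if err := replayChunk(ctx, c); err != nil {
                  return fmt.Errorf("replay chunk %s: %w", c.Name, err)
              }
          }
          return nil
      }

      func finishRestore(ctx context.Context, r *Restore) {
          if err := replayChunks(ctx, r.Chunks); err != nil {
              // Record the failure in the restore metadata so the final state is
              // "error" rather than a claimed success.
              r.Status, r.Err = StatusError, err.Error()
              return
          }
          r.Status = StatusDone
      }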

      PBM version used:

      # pbm version
      $ kubectl exec -ti my-cluster-name-rs1-0 -c backup-agent -- pbm version
      Version:   1.5.0
      Platform:  linux/amd64
      GitCommit: 970bbe2b5de4d590bce117ad3abee9626b24af8c
      GitBranch: main
      BuildTime: 2021-05-13_16:06_UTC
      GoVersion: go1.15.12
      

      Here was my setup:

      $ kubectl get pods
      NAME                                               READY   STATUS    RESTARTS   AGE
      my-cluster-name-cfg-0                              2/2     Running   0          88m
      my-cluster-name-cfg-1                              2/2     Running   2          88m
      my-cluster-name-cfg-2                              2/2     Running   1          87m
      my-cluster-name-mongos-847fdc75c6-8bgvf            1/1     Running   0          88m
      my-cluster-name-mongos-847fdc75c6-rrfl9            1/1     Running   0          88m
      my-cluster-name-mongos-847fdc75c6-rrsw6            1/1     Running   0          88m
      my-cluster-name-rs0-0                              2/2     Running   0          88m
      my-cluster-name-rs0-1                              2/2     Running   0          88m
      my-cluster-name-rs0-2                              2/2     Running   0          87m
      my-cluster-name-rs1-0                              2/2     Running   0          88m
      my-cluster-name-rs1-1                              2/2     Running   0          88m
      my-cluster-name-rs1-2                              2/2     Running   1          87m
      my-cluster-name2-cfg-0                             2/2     Running   0          55m
      my-cluster-name2-cfg-1                             2/2     Running   2          54m
      my-cluster-name2-cfg-2                             2/2     Running   1          54m
      my-cluster-name2-mongos-b6464f4b4-5dnbj            1/1     Running   0          45m
      my-cluster-name2-mongos-b6464f4b4-f729f            1/1     Running   0          45m
      my-cluster-name2-mongos-b6464f4b4-mfvrv            1/1     Running   0          45m
      my-cluster-name2-rs0-0                             2/2     Running   0          55m
      my-cluster-name2-rs0-1                             2/2     Running   0          54m
      my-cluster-name2-rs0-2                             2/2     Running   1          54m
      my-cluster-name2-rs1-0                             2/2     Running   0          55m
      my-cluster-name2-rs1-1                             2/2     Running   0          54m
      my-cluster-name2-rs1-2                             2/2     Running   0          54m
      percona-server-mongodb-operator-58c459565b-7ts6z   1/1     Running   0          89m
      

      Based on this, Kubernetes also claims that the restore succeeded and is ready:

      # kubectl get psmdb-restore restore1 -oyaml
      spec:
        backupSource:
          destination: s3://tomislav-backup-test2/2021-05-28T09:35:52Z
          s3:
            bucket: tomislav-backup-test2
            credentialsSecret: my-cluster-name-backup-s3
            endpointUrl: https://s3.us-west-2.amazonaws.com/
            region: us-west-2
        clusterName: my-cluster-name2
        pitr:
          type: latest
      status:
        completed: "2021-05-28T10:10:47Z"
        pbmName: "2021-05-28T10:04:10.876896413Z"
        state: ready
      

      PBM status cluster1: pbm-status-cluster1.txt
      PBM status cluster2: pbm-status-cluster2.txt
      Full restore log on my-cluster-name2-rs1-0 pod: rs1-restore.log

    People

      Assignee: Andrew Pogrebnoi (andrew.pogrebnoi)
      Reporter: Tomislav Plavcic (tomislav.plavcic@percona.com)
      Votes: 0
      Watchers: 3


    Time Tracking

      Logged: 2h