Uploaded image for project: 'Percona XtraDB Cluster'
  1. Percona XtraDB Cluster
  2. PXC-2555

SST initialization delay

    XMLWordPrintable

    Details

    • Type: Admin & Maintenance Task
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.6.45-28.36, 5.7.27-31.39
    • Component/s: None
    • Labels:
      None

      Description

      Hi Team,

      we need to speed-up SST initialization a lot.
      The issue is when I start PXC cluster each new PXC instance requires waits up to 2 minutes before start transferring sst-info.
      on some systems, it is needed to tune sst-initial-timeout up to 240 seconds!
      delay is on Donor side.

      please look logs below.
      stage 1. donor and joiner negotiated
      DONOR below

      2019-05-27T09:01:17.244244Z 0 [Note] WSREP: (0e255152, 'tcp://0.0.0.0:4567') connection established to fcc1c2c4 tcp://10.233.67.14:4567
      2019-05-27T09:01:17.244850Z 0 [Note] WSREP: (0e255152, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
      2019-05-27T09:01:17.744587Z 0 [Note] WSREP: declaring fcc1c2c4 at tcp://10.233.67.14:4567 stable
      2019-05-27T09:01:17.745733Z 0 [Note] WSREP: Node 0e255152 state primary
      2019-05-27T09:01:17.746309Z 0 [Note] WSREP: Current view of cluster as seen by this node
      view (view_id(PRIM,0e255152,16)
      memb {
      	0e255152,0
      	fcc1c2c4,0
      	}
      joined {
      	}
      left {
      	}
      partitioned {
      	}
      )
      2019-05-27T09:01:17.746336Z 0 [Note] WSREP: Save the discovered primary-component to disk
      2019-05-27T09:01:17.797596Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
      2019-05-27T09:01:17.797773Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: fd1721c7-805d-11e9-8568-1b45e293bf70
      2019-05-27T09:01:17.831776Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: fd1721c7-805d-11e9-8568-1b45e293bf70
      2019-05-27T09:01:17.832279Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: fd1721c7-805d-11e9-8568-1b45e293bf70 from 0 (cluster1-pxc-0)
      2019-05-27T09:01:18.116926Z 304 [Note] WSREP: set_query_id(), assigned new next trx id: 1249
      2019-05-27T09:01:18.245970Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: fd1721c7-805d-11e9-8568-1b45e293bf70 from 1 (cluster1-pxc-1)
      2019-05-27T09:01:18.246011Z 0 [Note] WSREP: Quorum results:
      	version    = 4,
      	component  = PRIMARY,
      	conf_id    = 15,
      	members    = 1/2 (primary/total),
      	act_id     = 15,
      	last_appl. = 0,
      	protocols  = 0/9/3 (gcs/repl/appl),
      	group UUID = fe7d1a31-8059-11e9-9a7e-4268b76fe2e7
      2019-05-27T09:01:18.246026Z 0 [Note] WSREP: Flow-control interval: [141, 141]
      2019-05-27T09:01:18.246033Z 0 [Note] WSREP: Trying to continue unpaused monitor
      2019-05-27T09:01:18.246157Z 4 [Note] WSREP: REPL Protocols: 9 (4, 2)
      2019-05-27T09:01:18.246189Z 4 [Note] WSREP: New cluster view: global state: fe7d1a31-8059-11e9-9a7e-4268b76fe2e7:15, view# 16: Primary, number of nodes: 2, my index: 0, protocol version 3
      2019-05-27T09:01:18.246198Z 4 [Note] WSREP: Setting wsrep_ready to true
      2019-05-27T09:01:18.246208Z 4 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 1 -> 2)
      2019-05-27T09:01:18.246218Z 4 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2019-05-27T09:01:18.246240Z 4 [Note] WSREP: Assign initial position for certification: 15, protocol version: 4
      2019-05-27T09:01:18.246286Z 0 [Note] WSREP: Service thread queue flushed.
      2019-05-27T09:01:20.402902Z 0 [Note] WSREP: (0e255152, 'tcp://0.0.0.0:4567') turning message relay requesting off
      

      JOINER below

      2019-05-27T09:01:17.236948Z 0 [Note] WSREP: Preparing to initiate SST/IST
      2019-05-27T09:01:17.236963Z 0 [Note] WSREP: Starting replication
      2019-05-27T09:01:17.236994Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
      2019-05-27T09:01:17.237472Z 0 [Note] WSREP: Using CRC-32C for message checksums.
      2019-05-27T09:01:17.237732Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
      2019-05-27T09:01:17.238004Z 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
      2019-05-27T09:01:17.238029Z 0 [Note] WSREP: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
      2019-05-27T09:01:17.238641Z 0 [Note] WSREP: GMCast version 0
      2019-05-27T09:01:17.239144Z 0 [Note] WSREP: (fcc1c2c4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      2019-05-27T09:01:17.239170Z 0 [Note] WSREP: (fcc1c2c4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      2019-05-27T09:01:17.240228Z 0 [Note] WSREP: EVS version 0
      2019-05-27T09:01:17.240518Z 0 [Note] WSREP: gcomm: connecting to group 'cluster1-pxc', peer '10.233.64.96:'
      2019-05-27T09:01:17.243249Z 0 [Note] WSREP: (fcc1c2c4, 'tcp://0.0.0.0:4567') connection established to 0e255152 tcp://10.233.64.96:4567
      2019-05-27T09:01:17.243357Z 0 [Note] WSREP: (fcc1c2c4, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
      2019-05-27T09:01:17.743884Z 0 [Note] WSREP: declaring 0e255152 at tcp://10.233.64.96:4567 stable
      2019-05-27T09:01:17.744673Z 0 [Note] WSREP: Node 0e255152 state primary
      2019-05-27T09:01:17.796666Z 0 [Note] WSREP: Current view of cluster as seen by this node
      view (view_id(PRIM,0e255152,16)
      memb {
      	0e255152,0
      	fcc1c2c4,0
      	}
      joined {
      	}
      left {
      	}
      partitioned {
      	}
      )
      2019-05-27T09:01:17.796788Z 0 [Note] WSREP: Save the discovered primary-component to disk
      2019-05-27T09:01:18.242074Z 0 [Note] WSREP: gcomm: connected
      2019-05-27T09:01:18.242427Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
      2019-05-27T09:01:18.243012Z 0 [Note] WSREP: Waiting for SST/IST to complete.
      2019-05-27T09:01:18.243029Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
      2019-05-27T09:01:18.243487Z 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      2019-05-27T09:01:18.243653Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: fd1721c7-805d-11e9-8568-1b45e293bf70
      2019-05-27T09:01:18.243721Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: fd1721c7-805d-11e9-8568-1b45e293bf70 from 0 (cluster1-pxc-0)
      2019-05-27T09:01:18.243822Z 1 [Note] WSREP: wsrep running threads now: 1
      2019-05-27T09:01:18.243932Z 2 [Note] WSREP: wsrep running threads now: 2
      2019-05-27T09:01:18.244464Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: fd1721c7-805d-11e9-8568-1b45e293bf70 from 1 (cluster1-pxc-1)
      2019-05-27T09:01:18.244545Z 0 [Note] WSREP: Quorum results:
      	version    = 4,
      	component  = PRIMARY,
      	conf_id    = 15,
      	members    = 1/2 (primary/total),
      	act_id     = 15,
      	last_appl. = -1,
      	protocols  = 0/9/3 (gcs/repl/appl),
      	group UUID = fe7d1a31-8059-11e9-9a7e-4268b76fe2e7
      2019-05-27T09:01:18.244569Z 0 [Note] WSREP: Flow-control interval: [141, 141]
      2019-05-27T09:01:18.244588Z 0 [Note] WSREP: Trying to continue unpaused monitor
      2019-05-27T09:01:18.244613Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 15)
      2019-05-27T09:01:18.244882Z 1 [Note] WSREP: State transfer required: 
      	Group state: fe7d1a31-8059-11e9-9a7e-4268b76fe2e7:15
      	Local state: 00000000-0000-0000-0000-000000000000:-1
      2019-05-27T09:01:18.244942Z 1 [Note] WSREP: REPL Protocols: 9 (4, 2)
      2019-05-27T09:01:18.244973Z 1 [Note] WSREP: New cluster view: global state: fe7d1a31-8059-11e9-9a7e-4268b76fe2e7:15, view# 16: Primary, number of nodes: 2, my index: 1, protocol version 3
      2019-05-27T09:01:18.245001Z 1 [Note] WSREP: Setting wsrep_ready to true
      2019-05-27T09:01:18.245021Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
      2019-05-27T09:01:18.245032Z 1 [Note] WSREP: Setting wsrep_ready to false
      2019-05-27T09:01:18.246182Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.233.67.14' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1' --mysqld-version '5.7.25-28-57'   '' )
      	2019-05-27T09:01:18.412389Z WSREP_SST: [DEBUG] The xtrabackup version is 2.4.13
      	2019-05-27T09:01:19.168662Z WSREP_SST: [DEBUG] Streaming with xbstream
      	2019-05-27T09:01:19.172618Z WSREP_SST: [DEBUG] Using socat as streamer
      	2019-05-27T09:01:19.190311Z WSREP_SST: [WARNING] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
      	2019-05-27T09:01:19.332278Z WSREP_SST: [DEBUG] Evaluating (@ Joiner-Recv-sst-info) timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio | xbstream $xbstreameopts -x; RC=( ${PIPESTATUS[@]} )
      2019-05-27T09:01:20.742598Z 0 [Note] WSREP: (fcc1c2c4, 'tcp://0.0.0.0:4567') turning message relay requesting off
      

      now DONOR started to wait for something....
      DONOR below

      2019-05-27T09:01:24.717238Z 305 [Note] WSREP: set_query_id(), assigned new next trx id: 1255
      2019-05-27T09:01:28.116088Z 306 [Note] WSREP: set_query_id(), assigned new next trx id: 1256
      2019-05-27T09:01:34.788522Z 307 [Note] WSREP: set_query_id(), assigned new next trx id: 1262
      2019-05-27T09:01:38.101840Z 308 [Note] WSREP: set_query_id(), assigned new next trx id: 1265
      2019-05-27T09:01:44.688598Z 309 [Note] WSREP: set_query_id(), assigned new next trx id: 1271
      2019-05-27T09:01:48.112492Z 310 [Note] WSREP: set_query_id(), assigned new next trx id: 1272
      2019-05-27T09:01:58.110158Z 311 [Note] WSREP: set_query_id(), assigned new next trx id: 1278
      2019-05-27T09:02:04.694379Z 312 [Note] WSREP: set_query_id(), assigned new next trx id: 1284
      2019-05-27T09:02:04.786691Z 313 [Note] WSREP: set_query_id(), assigned new next trx id: 1285
      2019-05-27T09:02:08.113195Z 314 [Note] WSREP: set_query_id(), assigned new next trx id: 1288
      2019-05-27T09:02:18.112517Z 315 [Note] WSREP: set_query_id(), assigned new next trx id: 1294
      2019-05-27T09:02:24.946242Z 316 [Note] WSREP: set_query_id(), assigned new next trx id: 1300
      2019-05-27T09:02:28.118510Z 317 [Note] WSREP: set_query_id(), assigned new next trx id: 1301
      2019-05-27T09:02:34.788921Z 318 [Note] WSREP: set_query_id(), assigned new next trx id: 1307
      2019-05-27T09:02:38.113479Z 319 [Note] WSREP: set_query_id(), assigned new next trx id: 1310
      2019-05-27T09:02:44.958153Z 320 [Note] WSREP: set_query_id(), assigned new next trx id: 1316
      2019-05-27T09:02:48.113695Z 321 [Note] WSREP: set_query_id(), assigned new next trx id: 1317
      2019-05-27T09:02:58.103353Z 322 [Note] WSREP: set_query_id(), assigned new next trx id: 1323
      2019-05-27T09:02:59.511736Z 0 [Note] WSREP: Member 1.0 (cluster1-pxc-1) requested state transfer from '*any*'. Selected 0.0 (cluster1-pxc-0)(SYNCED) as donor.
      2019-05-27T09:02:59.511766Z 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 15)
      2019-05-27T09:02:59.511861Z 4 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2019-05-27T09:02:59.512142Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.233.67.14:4444/xtrabackup_sst//1' --socket '/tmp/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.25-28-57'   '' --gtid 'fe7d1a31-8059-11e9-9a7e-4268b76fe2e7:15' )
      2019-05-27T09:02:59.512662Z 4 [Note] WSREP: DONOR thread signaled with 0
      	2019-05-27T09:02:59.593321Z WSREP_SST: [DEBUG] The xtrabackup version is 2.4.13
      	2019-05-27T09:02:59.921732Z WSREP_SST: [DEBUG] Streaming with xbstream
      	2019-05-27T09:02:59.923491Z WSREP_SST: [DEBUG] Using socat as streamer
      	2019-05-27T09:02:59.962477Z WSREP_SST: [DEBUG] Streaming SST meta-info file before SST
      	2019-05-27T09:02:59.964328Z WSREP_SST: [DEBUG] Evaluating (@ donor-sst-info) xbstream $xbstreameopts -c ${FILE_TO_STREAM} | socat -u stdio TCP:10.233.67.14:4444,retry=30; RC=( ${PIPESTATUS[@]} )
      

      but sst-initial-timeout=120 not enough, and JOINER failed
      JOINER below

      	2019-05-27T09:02:59.343498Z WSREP_SST: [ERROR] ******************* FATAL ERROR ********************** 
      	2019-05-27T09:02:59.347590Z WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid/keyring stage
      	2019-05-27T09:02:59.351414Z WSREP_SST: [ERROR] ****************************************************** 
      	2019-05-27T09:02:59.356009Z WSREP_SST: [ERROR] Cleanup after exit with status:32
      

      we need to get this issue solved, we are receiving bad feedback from early adopters of Kubernetes Operator.

        Smart Checklist

          Attachments

          1. debug.tgz
            39 kB
          2. k8s_pxc_pod1.err
            89 kB
          3. k8s_pxc_pod2.err
            99 kB
          4. k8s_pxc_pod3.err
            116 kB

            Issue Links

              Activity

                People

                Assignee:
                kenn.takara Kenn Takara
                Reporter:
                mykola.marzhan Mykola Marzhan
                Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                  Dates

                  Created:
                  Updated:
                  Resolved:

                    Time Tracking

                    Estimated:
                    Original Estimate - Not Specified
                    Not Specified
                    Remaining:
                    Remaining Estimate - Not Specified
                    Not Specified
                    Logged:
                    Time Spent - 1 week, 7 hours, 4 minutes
                    1w 7h 4m