Uploaded image for project: 'Percona XtraDB Cluster'
  1. Percona XtraDB Cluster
  2. PXC-843

Joiner hangs after failed SST

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Verified this on 5.7.18 with PXB 2.4.7

      https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1707633?comments=all

       

       

      Start 3-nodes cluster.

      export PXC=/bigdisk/sveta/Percona-XtraDB-Cluster-5.7.18-rel15-29.20.1.Linux.x86_64.ssl100
      export GALERA=$PXC/lib/libgalera_smm.so

      export PATH=$PXC:$PATH

      ./bin/mysqld --basedir=. --datadir=./data1 --socket=/tmp/pxc1.sock --port=3371 --log-error --gtid_mode=ON --log-slave-updates --enforce-gtid-consistency --log-bin --server-id=1 --binlog-format=row --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth=root: --wsrep_cluster_address='gcomm://127.0.0.1:3381,127.0.0.1:3382' --wsrep-provider=$GALERA --wsrep_cluster_name=cluster1 --wsrep_node_name=pxc1 --default_storage_engine=InnoDB --innodb_autoinc_lock_mode=2 --wsrep_provider_options='base_host=127.0.0.1;base_port=3381' --wsrep_sst_receive_address=127.0.0.1:3391 --wsrep_node_address=127.0.0.1 --wsrep-new-cluster --innodb-temp-data-file-path=../it/ibtmp1:12M:autoextend &

      ./bin/mysqld --basedir=. --datadir=./data2 --socket=/tmp/pxc2.sock --port=3372 --log-error --gtid_mode=ON --log-slave-updates --enforce-gtid-consistency --log-bin --server-id=2 --binlog-format=row --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth=root: --wsrep_cluster_address='gcomm://127.0.0.1:3381,127.0.0.1:3382' --wsrep-provider=$GALERA --wsrep_cluster_name=cluster1 --wsrep_node_name=pxc2 --default_storage_engine=InnoDB --innodb_autoinc_lock_mode=2 --wsrep_provider_options='base_host=127.0.0.1;base_port=3382;evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S;evs.install_timeout=PT15S' --wsrep_sst_receive_address=127.0.0.1:3392 --wsrep_node_incoming_address=127.0.0.1:3372 --wsrep_node_address=127.0.0.1 --wsrep_on=ON --innodb-temp-data-file-path=../it1/ibtmp1:12M:autoextend &

      ./bin/mysqld --basedir=. --datadir=./data3 --socket=/tmp/pxc3.sock --port=3373 --log-error --gtid_mode=ON --log-slave-updates --enforce-gtid-consistency --log-bin --server-id=3 --binlog-format=row --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth=root: --wsrep_cluster_address='gcomm://127.0.0.1:3381,127.0.0.1:3382' --wsrep-provider=$GALERA --wsrep_cluster_name=cluster1 --wsrep_node_name=pxc2 --default_storage_engine=InnoDB --innodb_autoinc_lock_mode=2 --wsrep_provider_options='base_host=127.0.0.1;base_port=3383;evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S;evs.install_timeout=PT15S' --wsrep_sst_receive_address=127.0.0.1:3393 --wsrep_node_incoming_address=127.0.0.1:3373 --wsrep_node_address=127.0.0.1 --wsrep_on=ON --innodb-temp-data-file-path=../it3/ibtmp1:12M:autoextend &

      Make sure cluster works, data replicated.

      Stop node1.

      Create a table and load a lot of data into it:

      USE test;

      DROP TABLE IF EXISTS `joinit`;

      CREATE TABLE `joinit` (
        `i` int(11) NOT NULL AUTO_INCREMENT,
        `s` varchar(64) DEFAULT NULL,
        `t` time NOT NULL,
        `g` int(11) NOT NULL,
        PRIMARY KEY (`i`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

      INSERT INTO joinit VALUES (NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )));
      INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
      INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
      INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;
      ...
      INSERT INTO joinit SELECT NULL, uuid(), time(now()), (FLOOR( 1 + RAND( ) *60 )) FROM joinit;

      mysql> select count from joinit;
      ----------

      count

      ----------

      67108864

      ----------
      1 row in set (28.93 sec)

      On one of nodes (node 2) run OPTIMIZE TABLE joinit;

      Wait few seconds and start earlier stopped node.

      This node will try to initiate SST from node3 which will fail with error:

      2017-07-29T10:26:12.201338Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '127.0.0.1:3391/xtrabackup_sst//1' --socket '/tmp/pxc3.sock' --datadir '/bigdisk/sveta/Percona-XtraDB-Cluster-5.7.18-rel15-29.20.1.Linux.x86_64.ssl100/data3/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --binlog 'bm-support01-bin' --gtid 'd66a0e04-6b2c-11e7-98ef-c27e1d637690:300039': 22 (Invalid argument)
      2017-07-29T10:26:12.201591Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '127.0.0.1:3391/xtrabackup_sst//1' --socket '/tmp/pxc3.sock' --datadir '/bigdisk/sveta/Percona-XtraDB-Cluster-5.7.18-rel15-29.20.1.Linux.x86_64.ssl100/data3/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --binlog 'bm-support01-bin' --gtid 'd66a0e04-6b2c-11e7-98ef-c27e1d637690:300039'
      2017-07-29T10:26:12.202033Z 0 [Warning] WSREP: Could not find peer: a24ab218-7447-11e7-ac8a-f3686cc44f71
      2017-07-29T10:26:12.202109Z 0 [Warning] WSREP: 0.0 (pxc2): State transfer to -1.-1 (left the group) failed: -22 (Invalid argument)
      2017-07-29T10:26:12.202142Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 300040)
      2017-07-29T10:26:12.202455Z 0 [Note] WSREP: Member 0.0 (pxc2) synced with group.
      2017-07-29T10:26:12.202489Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 300040)

      $ tail -n 45 data3/innobackup.backup.log
      170729 06:26:04 >> log scanned up to (8909890959)
      170729 06:26:05 >> log scanned up to (8909891005)
      170729 06:26:06 >> log scanned up to (8909891051)
      170729 06:26:07 >> log scanned up to (8909891113)
      170729 06:26:08 >> log scanned up to (8909891159)
      170729 06:26:09 >> log scanned up to (8909891205)
      InnoDB: Last flushed lsn: 8901930471 load_index lsn 8909891205
      [FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
          PXB will not be able take a consistent backup. Retry the backup operation
      2017-07-29 06:26:10 0x7fc17076e700 InnoDB: Assertion failure in thread 140468792256256 in file ut0ut.cc line 916
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
      InnoDB: about forcing recovery.
      10:26:10 UTC - xtrabackup got signal 6 ;
      This could be because you hit a bug or data is corrupted.
      This error can also be caused by malfunctioning hardware.
      Attempting to collect some information that could help diagnose the problem.
      As this is a crash and something is definitely wrong, the information
      collection process might fail.

      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0 thread_stack 0x10000
      xtrabackup(my_print_stacktrace+0x35)[0xc81ce5]
      xtrabackup(handle_fatal_signal+0x273)[0xb1cb53]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fc17381a330]
      /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fc171b6bc37]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fc171b6f028]
      xtrabackup[0x75dd4f]
      xtrabackup(_ZN2ib5fatalD1Ev+0xb3)[0x80aa43]
      xtrabackup[0x9a35a1]
      xtrabackup(_Z19recv_parse_log_recsm7store_tb+0x4df)[0x9a61ef]
      xtrabackup[0x766e60]
      xtrabackup[0x767476]
      xtrabackup[0x767823]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fc173812184]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc171c2f37d]

      Please report a bug at https://bugs.launchpad.net/percona-xtrabackup

      Node1 will not try to repeat SST, but stall:

      $ tail -n20 data1/bm-support01.err
              be949bc4,0
              }
      )
      2017-07-29T10:26:12.185200Z 2 [Note] WSREP: Current view of cluster as seen by this node
      view ((empty))
      2017-07-29T10:26:12.185558Z 2 [Note] WSREP: gcomm: closed
      2017-07-29T10:26:12.185595Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2017-07-29T10:26:12.185632Z 0 [Note] WSREP: Flow-control interval: [100, 100]
      2017-07-29T10:26:12.185642Z 0 [Note] WSREP: Received NON-PRIMARY.
      2017-07-29T10:26:12.185651Z 0 [Note] WSREP: Shifting JOINER -> OPEN (TO: 300040)
      2017-07-29T10:26:12.185667Z 0 [Note] WSREP: Received self-leave message.
      2017-07-29T10:26:12.185678Z 0 [Note] WSREP: Flow-control interval: [0, 0]
      2017-07-29T10:26:12.185686Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
      2017-07-29T10:26:12.185693Z 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 300040)
      2017-07-29T10:26:12.185703Z 0 [Note] WSREP: RECV thread exiting 0: Success
      2017-07-29T10:26:12.185733Z 2 [Note] WSREP: recv_thread() joined.
      2017-07-29T10:26:12.185746Z 2 [Note] WSREP: Closing replication queue.
      2017-07-29T10:26:12.185754Z 2 [Note] WSREP: Closing slave action queue.
      2017-07-29T10:26:14.184075Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
      2017-07-29T10:26:14.184137Z 1 [Note] WSREP: rollbacker thread exiting

      But mysqld will still run:

      $ ps -ef | grep sveta | grep mysqld | grep pxc1
      sveta.s+ 46217 26371 0 Jul29 pts/206 00:00:00 ./bin/mysqld --basedir=. --datadir=./data1 --socket=/tmp/pxc1.sock --port=3371 --log-error --gtid_mode=ON --log-slave-updates --enforce-gtid-consistency --log-bin --server-id=1 --binlog-format=row --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth=root: --wsrep_cluster_address=gcomm://127.0.0.1:3381,127.0.0.1:3382 --wsrep-provider=/bigdisk/sveta/Percona-XtraDB-Cluster-5.7.18-rel15-29.20.1.Linux.x86_64.ssl100/lib/libgalera_smm.so --wsrep_cluster_name=cluster1 --wsrep_node_name=pxc1 --default_storage_engine=InnoDB --innodb_autoinc_lock_mode=2 --wsrep_provider_options=base_host=127.0.0.1;base_port=3381 --wsrep_sst_receive_address=127.0.0.1:3391 --wsrep_node_address=127.0.0.1 --innodb-temp-data-file-path=../it/ibtmp1:12M:autoextend

      Suggested fix: retry SST after failure or at least stop joiner node, so some failover script can restart it again

        Attachments

          Activity

            People

            Assignee:
            kenn.takara Kenn Takara
            Reporter:
            kenn.takara Kenn Takara
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Smart Checklist