  1. Percona XtraDB Cluster
  2. PXC-1990

LP #1698863: One node flapping makes whole cluster enter NON-PRIMARY states

    Details

    • Type: Bug
    • Status: On Hold
    • Priority: Low
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Reported in Launchpad by Przemek; last update 20-06-2017 10:12:15

      When just one node has a flapping network connection, it sometimes causes the whole cluster to enter the non-Primary state.
      Fully repeatable with default wsrep settings and a 3- or 4-node cluster.
      Tested with PXC 5.7.18/Galera 3.20, using this example bash script on node pxc4 (172.17.0.3):

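      # cat flap.sh
      #!/bin/bash

      for i in {1..10}; do
        # Drop all outgoing traffic to the cluster subnet for 5 seconds
        iptables -I OUTPUT -p all -d 172.17.0.0/24 -j DROP
        sleep 5
        # Flush the rules to restore connectivity for 3 seconds
        iptables -F
        sleep 3
      done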
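A possible variant of the reproduction script, as a sketch only: `iptables -F` flushes every rule in the filter table, so on a host that already has firewall rules it may be preferable to delete just the rule that was inserted. The `run` and `flap` helpers and the `SUBNET`, `CYCLES`, `DOWN_SECS`, `UP_SECS` and `DRY_RUN` variables are hypothetical names introduced here, not part of the original report. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical variant of flap.sh: removes only the rule it inserted
# instead of flushing all rules with "iptables -F".
SUBNET="${SUBNET:-172.17.0.0/24}"   # cluster subnet from the report
CYCLES="${CYCLES:-10}"              # number of down/up cycles
DOWN_SECS="${DOWN_SECS:-5}"         # seconds with traffic dropped
UP_SECS="${UP_SECS:-3}"             # seconds with traffic restored
DRY_RUN="${DRY_RUN:-1}"             # 1 = print commands only (assumed default)

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

flap() {
  i=1
  while [ "$i" -le "$CYCLES" ]; do
    run iptables -I OUTPUT -p all -d "$SUBNET" -j DROP
    run sleep "$DOWN_SECS"
    # Delete exactly the rule inserted above; other rules stay in place.
    run iptables -D OUTPUT -p all -d "$SUBNET" -j DROP
    run sleep "$UP_SECS"
    i=$((i + 1))
  done
}

flap
```

Running it with `DRY_RUN=0` as root on the flapping node would apply the rules for real, with the same 5 s down / 3 s up timing as the original script.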

      Example error logs from pxc4 and one of the other nodes are in the attachment.

      What is worrisome is that the cluster does not simply expel the faulty node (before which it would only pause writes until the timeouts are reached). Instead, it actually gets confused and enters the non-Primary state at some point, even though the physical connection between the remaining three nodes is absolutely fine, like this:

      2017-06-19T13:22:21.158663Z 0 [Note] WSREP: (87f15ec8, 'tcp://0.0.0.0:4567') connection to peer 40019a16 with addr tcp://172.17.0.3:4567 timed out, no messages seen in PT3S
      2017-06-19T13:22:21.159054Z 0 [Note] WSREP: (87f15ec8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.17.0.3:4567
      2017-06-19T13:22:21.364134Z 0 [Note] WSREP: Current view of cluster as seen by this node
      view (view_id(NON_PRIM,0c4b3f62,202)
      memb {
              87f15ec8,0
              b7f30d85,0
              }
      joined {
              }
      left {
              }
      partitioned {
              0c4b3f62,0
              40019a16,3
              }
      )
      2017-06-19T13:22:21.364252Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
      2017-06-19T13:22:21.364287Z 0 [Note] WSREP: Flow-control interval: [141, 141]
      2017-06-19T13:22:21.364304Z 0 [Note] WSREP: Received NON-PRIMARY.
      2017-06-19T13:22:21.364317Z 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 751686)
      2017-06-19T13:22:21.364392Z 2 [Note] WSREP: New cluster view: global state: 3efe5400-aa6d-11e6-b772-625f9abee4ba:751686, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 3
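To make the view dump above easier to read, here is a minimal sketch that counts the entries in the `memb` and `partitioned` sections and applies a simple unweighted majority check. This is a simplification: Galera's actual quorum calculation is weighted (`pc.weight`) and relative to the previous primary component, so treat the result as illustrative only. The `count_view` helper and the `view_dump` variable are hypothetical names, and the sketch assumes the usual one-node-per-line layout of the dump.

```shell
#!/bin/sh
# Count nodes per section of a Galera view dump (one node ID per line assumed).
count_view() {
  awk '
    $1 == "memb" || $1 == "joined" || $1 == "left" || $1 == "partitioned" {
      section = $1; next
    }
    $1 == "}" { section = ""; next }
    section != "" && NF > 0 { count[section]++ }
    END { printf "%d %d\n", count["memb"], count["partitioned"] }
  '
}

# View dump taken from the log excerpt in this report.
view_dump='view (view_id(NON_PRIM,0c4b3f62,202)
memb {
        87f15ec8,0
        b7f30d85,0
        }
joined {
        }
left {
        }
partitioned {
        0c4b3f62,0
        40019a16,3
        }
)'

counts=$(printf '%s\n' "$view_dump" | count_view)
memb=${counts% *}
partitioned=${counts#* }
total=$((memb + partitioned))

# Unweighted check: the component needs strictly more than half of the nodes.
if [ $((2 * memb)) -gt "$total" ]; then
  verdict="majority"
else
  verdict="no majority"
fi

echo "memb=$memb partitioned=$partitioned total=$total -> $verdict"
```

For this dump the component holds 2 of 4 nodes, which is not a strict majority. Note that the `partitioned` list contains not only the flapping node 40019a16 but also 0c4b3f62, which is consistent with the report's point that a healthy node ended up outside the component.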

              People

              • Assignee: Unassigned
              • Reporter: lpjirasync (Inactive)