Percona XtraDB Cluster / PXC-1978

LP #1690771: Crash of multiple cluster nodes



    • Type: Bug
    • Status: Done
    • Priority: Low
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      Reported in Launchpad by Steffen Boehme; last update 31-05-2017 21:16:46

      We have several databases running on percona-xtradb-cluster, most of them small or medium-sized, some bigger ones.

      On one of the bigger clusters, which has been running as a cluster for years, the nodes started to crash 2 weeks ago under normal workload, and they are getting more and more unstable.
      The cluster contains 5 nodes, and last night between 3 and 4 UTC 3 of these nodes crashed 28 times in total ... no fun.

      We have another cluster running the very same workload with more data, which has been stable the whole time ... that's curious.
      However, I will try to describe the setup and the problem, attach some log files, and hope I can get some help here ...

      As I said, the cluster holds ~110GB of data on 5 nodes: db01, db02, db03, db04, db05.
      db01 acts as the only writing node; db02, db03 and db04 share the workload for read queries; and db05 runs as a hot failover without traffic.

      Now db02, db03 and db04 have been crashing for 2 weeks ... beginning with db02 and db03, very seldom at first (with breaks of 2-3 days).
      At that time all nodes were running 5.6.30-76.3-56.
      I then upgraded all nodes to the latest version, 5.6.35-81.0-56; after that the nodes crashed more frequently ... up to a point where 4 of them (db02 to db05) crashed within minutes and the cluster was non-functional afterwards.

      This happened multiple times ... I now run an endless loop on all servers which monitors the mysql processes and automatically restarts a crashed instance within seconds to keep the cluster working ...
      It's horrible ...
      The nodes do not crash with the same frequency throughout the day ... sometimes they run for a day without problems, sometimes (like today between 3 and 4 UTC) they crash 28 times within the hour.
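
      A watchdog loop of the kind described above might look roughly like the following Python sketch. It is not the script actually running on these servers: the `pgrep` check, the `service mysql start` command, the 5-second interval, and all function names are assumptions for illustration.

```python
import subprocess
import time


def mysqld_is_running():
    """Return True if a process named exactly 'mysqld' exists.

    Assumption: pgrep is available; it exits with status 0 on a match.
    """
    result = subprocess.run(["pgrep", "-x", "mysqld"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def start_mysqld():
    """Start the server. The service name 'mysql' is an assumption."""
    subprocess.run(["service", "mysql", "start"], check=False)


def watchdog(is_running=mysqld_is_running, start=start_mysqld,
             interval=5.0, max_checks=None, sleep=time.sleep):
    """Poll is_running() and call start() whenever it reports False.

    max_checks=None loops forever; a number bounds the loop.
    Returns the number of restarts that were triggered.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not is_running():
            start()
            restarts += 1
        checks += 1
        sleep(interval)
    return restarts
```

      Calling `watchdog()` with no arguments loops forever. Such a loop only hides the crashes; the restart count it returns (or logs) is mainly useful for correlating crash times with traffic.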

      It seems that only the nodes with high read traffic crash during normal operation ...
      I also had a case where I tried to rejoin nodes to the cluster with a full sync, and the donor crashed during the data transfer when live traffic arrived on that server in parallel.

      I really love the Percona cluster functionality and also its stability, which is great on 95% of our servers ... but this one cluster is a pain at the moment, and I hope someone can help with this problem.

      I will attach the log files for the servers ...
      Thanks in advance





