Node stuck in aborting SST

Description

I have the following scenario.

The workload runs on Node-0.

Meanwhile, I disconnect the network to Node-1 for a prolonged period, so that when it rejoins it has to perform SST.

It reconnects and starts SST from Node-2.

While the SST is in progress, I kill Node-2.

Node-1 eventually seems to detect this and reports that it is "trying to abort SST".

But 30 minutes later Node-1 is still running and has never actually aborted the SST.

I would say it is stuck.

Logs from Node-1:

...

Environment

None

Activity

Noemi Lapresta July 14, 2021 at 2:15 PM

Verified fix on PXC 5.7.34-31.51. MTR test passes.

Marcelo Altmann February 19, 2021 at 7:00 PM

Still reproducible on 8.0.22.

On a 3-node PXC cluster:

  • stop node 2

  • remove its grastate.dat file

  • start node 2; it will start SST

  • on the donor node, run: tc qdisc add dev eth0 root netem delay 60000ms loss 99%
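The steps above can be sketched as a pair of small shell helpers that only print the commands, so they can be reviewed before being run as root. This is a rough sketch, not a verified reproduction script: the service name (mysql), the data directory (/var/lib/mysql) and the function names are assumptions that do not appear in this ticket, and the cleanup command is added here for convenience.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction steps; they print commands
# instead of running them. Service name and datadir are assumptions.

# Joiner side (node 2): stop the server, drop grastate.dat, start again,
# so that the node requests a full SST when it rejoins.
joiner_reset_cmds() {
    service="${1:-mysql}"
    datadir="${2:-/var/lib/mysql}"
    printf 'systemctl stop %s\n' "$service"
    printf 'rm -f %s/grastate.dat\n' "$datadir"
    printf 'systemctl start %s\n' "$service"
}

# Donor side: degrade the link with the netem rule quoted in the comment
# (60 s delay, 99% packet loss). The interface defaults to eth0.
netem_add_cmd() {
    iface="${1:-eth0}"
    printf 'tc qdisc add dev %s root netem delay 60000ms loss 99%%\n' "$iface"
}

# Donor side: remove the root qdisc again once the test is finished.
netem_del_cmd() {
    iface="${1:-eth0}"
    printf 'tc qdisc del dev %s root\n' "$iface"
}
```

For example, running netem_add_cmd eth0 on the donor prints the tc command from the last step, and netem_del_cmd eth0 prints the command that restores the interface afterwards.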

Done

Details

Needs Review

Yes

Time tracking

1w 3d 1h 20m logged

Created February 19, 2021 at 6:16 PM
Updated March 6, 2024 at 9:14 PM
Resolved April 14, 2021 at 4:08 PM