Details
- Bug
- Status: Done
- Medium
- Resolution: Fixed
- 5.7.x, 8.0.x
- None
Description
This is a subtask of https://jira.percona.com/browse/PS-7197
The slave hangs while waiting for the workers to exit.
This issue is more likely to happen when slave_transaction_retries is set to 0.
Consider a replica server configured with slave_parallel_workers=3, slave_parallel_type=LOGICAL_CLOCK, slave_preserve_commit_order=1 and slave_transaction_retries=0. With MTS (multi-threaded slave) enabled, workers can easily execute out of order, leading to the following state:
Worker 1 - Processing the events of Transaction T1
Worker 2 - Executed Transaction T2 and is waiting for T1 to commit.
Worker 3 - Processing the events of Transaction T3
- If T1 and T2 modify the same rows in InnoDB, Worker 1 detects a deadlock and signals Worker 2 to roll back.
- Worker 2 wakes up from the cond_wait. It learns that another transaction asked it to roll back, and returns with an error.
- Worker 2 reaches the retry part of the code and checks the value of slave_transaction_retries. Since it is 0, Worker 2 returns from the handle_slave_worker loop and enters the error handling part.
- Worker 2 notifies the coordinator that it is exiting.
- The coordinator thread receives this notification, sets rli->abort_slave=1 to stop replication, and waits until all workers exit.
- Worker 2 exits. There is no Worker 2 from here onwards.
Now the status is:
Worker 1 - Processing the events of Transaction T1
Worker 2 - Not running.
Worker 3 - Processing the events of Transaction T3
- Worker 1 now proceeds, executes its transaction and enters Commit_order_manager::wait_for_its_turn.
- Worker 1 finds out that the previous worker (Worker 2) failed because of an error. Worker 1 signals the next transaction/worker to proceed.
- Worker 3 executes its transaction and enters Commit_order_manager::wait_for_its_turn.
- Worker 1 rolls back and eventually exits.
- No one is left to signal Worker 3, so it waits forever.
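The sequence of events above can be replayed as a small state model. This is only an illustrative sketch, not the server's code: the `Worker` class, the `signal` helper and `replay_scenario` are hypothetical names, and the model assumes that a wake-up sent to a worker that is not yet waiting has no lasting effect, which is one reading of the ordering described above.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    state: str = "processing"  # one of: processing, waiting, exited


def signal(target: Worker) -> bool:
    """Fire-and-forget wake-up (assumption): only wakes a worker that is already waiting."""
    if target.state == "waiting":
        target.state = "processing"
        return True
    return False


def replay_scenario():
    w1, w2, w3 = Worker("worker 1"), Worker("worker 2"), Worker("worker 3")

    # Worker 2 has executed T2 and waits for T1 to commit.
    w2.state = "waiting"

    # Worker 1 detects the deadlock and signals Worker 2 to roll back.
    signal(w2)

    # slave_transaction_retries=0: Worker 2 does not retry and exits.
    w2.state = "exited"

    # Worker 1 sees that Worker 2 failed and signals the next worker.
    # Worker 3 is still processing T3, so nothing is woken up.
    delivered = signal(w3)

    # Worker 3 finishes T3 and starts waiting for its turn.
    w3.state = "waiting"

    # Worker 1 rolls back and exits.
    w1.state = "exited"

    workers = [w1, w2, w3]
    stuck = [w.name for w in workers if w.state == "waiting"]
    running = [w.name for w in workers if w.state == "processing"]
    return delivered, stuck, running


delivered, stuck, running = replay_scenario()
print("signal to worker 3 delivered:", delivered)
print("still waiting:", stuck)
print("still running (could send a signal):", running)
```

In this model Worker 3 ends up as the only waiting worker while no worker is left running, which matches the hang described above.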
mysql> show processlist;
+----+-------------+-----------------+------+---------+------+---------------------------------------------+------------------+-----------+---------------+
| Id | User        | Host            | db   | Command | Time | State                                       | Info             | Rows_sent | Rows_examined |
+----+-------------+-----------------+------+---------+------+---------------------------------------------+------------------+-----------+---------------+
| 2  | root        | localhost:55708 | test | Query   | 0    | starting                                    | show processlist | 0         | 0             |
| 3  | system user |                 | NULL | Connect | 107  | Waiting for master to send event            | NULL             | 0         | 0             |
| 4  | system user |                 | NULL | Connect | 77   | Waiting for workers to exit                 | NULL             | 0         | 0             |
| 7  | system user |                 | NULL | Connect | 84   | Waiting for preceding transaction to commit | NULL             | 0         | 0             |
+----+-------------+-----------------+------+---------+------+---------------------------------------------+------------------+-----------+---------------+
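The combination of states in this output can be checked for mechanically. The following is a minimal sketch, assuming the rows have already been fetched as (state, seconds) pairs. The function name `looks_like_commit_order_hang` and the 60-second threshold are hypothetical choices for illustration; the state strings are copied from the output above.

```python
COORDINATOR_STATE = "Waiting for workers to exit"
WORKER_STATE = "Waiting for preceding transaction to commit"


def looks_like_commit_order_hang(rows, min_seconds=60):
    """Return True if a coordinator and a worker have both been stuck long enough.

    rows: iterable of (state, time_in_seconds) pairs taken from SHOW PROCESSLIST.
    min_seconds: arbitrary threshold (assumption) to ignore short-lived waits.
    """
    rows = list(rows)
    coordinator_stuck = any(
        state == COORDINATOR_STATE and secs >= min_seconds for state, secs in rows
    )
    worker_stuck = any(
        state == WORKER_STATE and secs >= min_seconds for state, secs in rows
    )
    return coordinator_stuck and worker_stuck


# (State, Time) columns from the processlist output above.
sample = [
    ("starting", 0),
    ("Waiting for master to send event", 107),
    (COORDINATOR_STATE, 77),
    (WORKER_STATE, 84),
]

print(looks_like_commit_order_hang(sample))
```

A match is only a hint: both states also occur briefly during a normal replication stop, which is why the sketch requires them to persist for some time.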
Issue Links
- relates to PS-7197 Multi-threaded Replica hangs when slave_trans_retires gets exhausted (Done)