Percona Server for MySQL / PS-7232

Modify Multithreaded Replica to correct the exhausted slave_transaction_retries when replica has `slave_preserve_commit_order` enabled

Details

    Description

      This is a subtask of https://jira.percona.com/browse/PS-7197

      The slave hangs while waiting for the workers to exit.

      This issue is more likely to happen when slave_transaction_retries is set to 0.

      Consider a replica server configured with slave_parallel_workers=3, slave_parallel_type=LOGICAL_CLOCK, slave_preserve_commit_order=1 and slave_transaction_retries=0. When MTS is enabled, it is quite possible for workers to execute out of order, leading to the following state:

      Worker 1 - Processing the events of Transaction T1
      Worker 2 - Executed Transaction T2 and is waiting for T1 to commit.
      Worker 3 - Processing the events of Transaction T3

      1. If T1 and T2 modify the same rows in InnoDB, Worker 1 detects a deadlock and signals Worker 2 to roll back.
      2. Worker 2 wakes up from the cond_wait. It learns that another transaction asked it to roll back and returns with an error.
      3. Worker 2 reaches the retry part of the code and checks the value of slave_transaction_retries. Since it is 0, it returns from the handle_slave_worker loop and enters the error handling part.
      4. Worker 2 notifies the coordinator that it is exiting.
      5. The coordinator thread receives this notification, sets rli->abort_slave=1 to stop replication, and waits until all workers exit.
      6. Worker 2 exits. There is no Worker 2 from here onwards.
        The status is now:
        Worker 1 - Processing the events of Transaction T1
        Worker 2 - Not running.
        Worker 3 - Processing the events of Transaction T3
      7. Worker 1 proceeds, executes its transaction, and enters Commit_order_manager::wait_for_its_turn.
      8. Worker 1 finds out that the previous worker (Worker 2) failed because of an error.
        Worker 1 signals the next transaction/worker to proceed.
      9. Worker 3 executes its transaction and enters Commit_order_manager::wait_for_its_turn.
      10. Worker 1 rolls back and eventually exits.
      11. No one is left to signal Worker 3, so it waits forever.
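
The ordering of steps 7 to 11 can be replayed with a small toy model. This is only a sketch of the sequence as described above, not the server's Commit_order_manager code: `ToyWorker`, `signal` and `replay` are hypothetical names, and the model assumes that a signal only wakes a worker that is already waiting, which the description implies but does not state.

```python
# Toy replay of the steps above; NOT the server implementation.
# Assumption: a signal is lost if the target worker is not yet waiting.

class ToyWorker:
    def __init__(self, name):
        self.name = name
        self.state = "executing"  # executing -> waiting -> exited


def signal(worker):
    """Wake the worker if it is waiting; otherwise the signal is lost."""
    if worker.state == "waiting":
        worker.state = "executing"
        return True
    return False


def replay():
    w1 = ToyWorker("worker 1")
    w2 = ToyWorker("worker 2")
    w3 = ToyWorker("worker 3")

    w2.state = "waiting"    # T2 executed, waiting for T1 to commit
    signal(w2)              # steps 1-2: worker 1 asks worker 2 to roll back
    w2.state = "exited"     # steps 3-6: retries are 0, so worker 2 exits
    delivered = signal(w3)  # steps 7-8: worker 1 signals the next worker
    w3.state = "waiting"    # step 9: worker 3 enters wait_for_its_turn
    w1.state = "exited"     # step 10: worker 1 rolls back and exits

    # Step 11: report whether the signal reached worker 3, and the final states.
    return delivered, {w.name: w.state for w in (w1, w2, w3)}
```

Under that assumption, `replay()` reports that the signal to worker 3 was not delivered and that worker 3 is the only worker left, still in the waiting state.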
      mysql> show processlist;
      +----+-------------+-----------------+------+---------+------+---------------------------------------------+------------------+-----------+---------------+
      | Id | User        | Host            | db   | Command | Time | State                                       | Info             | Rows_sent | Rows_examined |
      +----+-------------+-----------------+------+---------+------+---------------------------------------------+------------------+-----------+---------------+
      |  2 | root        | localhost:55708 | test | Query   |    0 | starting                                    | show processlist |         0 |             0 |
      |  3 | system user |                 | NULL | Connect |  107 | Waiting for master to send event            | NULL             |         0 |             0 |
      |  4 | system user |                 | NULL | Connect |   77 | Waiting for workers to exit                 | NULL             |         0 |             0 |
      |  7 | system user |                 | NULL | Connect |   84 | Waiting for preceding transaction to commit | NULL             |         0 |             0 |
      +----+-------------+-----------------+------+---------+------+---------------------------------------------+------------------+-----------+---------------+
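
The two thread states in the processlist output above can serve as a rough signature of this hang. The following is a minimal sketch of such a check, assuming the processlist rows have already been fetched into dictionaries keyed by column name. The function name `looks_hung` and the 60-second threshold are hypothetical choices for illustration. The threshold is there because these states may also show up briefly during a normal replication stop.

```python
COORDINATOR_STATE = "Waiting for workers to exit"
WORKER_STATE = "Waiting for preceding transaction to commit"


def _stuck(row, state, min_seconds):
    """True if a replication thread has sat in `state` for at least min_seconds."""
    return (
        row.get("User") == "system user"
        and row.get("State") == state
        and int(row.get("Time") or 0) >= min_seconds
    )


def looks_hung(rows, min_seconds=60):
    """Flag the pattern from this report: the coordinator waits for workers
    to exit while a worker still waits for a preceding transaction to commit.

    min_seconds is an arbitrary illustrative threshold, not a server setting.
    """
    rows = list(rows)
    coordinator = any(_stuck(r, COORDINATOR_STATE, min_seconds) for r in rows)
    worker = any(_stuck(r, WORKER_STATE, min_seconds) for r in rows)
    return coordinator and worker


# Rows transcribed from the processlist output in this report.
sample_rows = [
    {"Id": 2, "User": "root", "Time": 0, "State": "starting"},
    {"Id": 3, "User": "system user", "Time": 107,
     "State": "Waiting for master to send event"},
    {"Id": 4, "User": "system user", "Time": 77,
     "State": "Waiting for workers to exit"},
    {"Id": 7, "User": "system user", "Time": 84,
     "State": "Waiting for preceding transaction to commit"},
]
```

With the rows from this report, `looks_hung(sample_rows)` flags the hang. A match is only a hint and should be confirmed against the replica's error log.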

            People

              venkatesh.prasad Venkatesh Prasad