Uploaded image for project: 'Percona Server'
  1. Percona Server
  2. PS-1130

LP #1735555: RBR Replication with concurrent XA in READ-COMMITTED takes supremum pseudo-records and breaks replication

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Done
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.7.22-22
    • Component/s: None

      Description

      **Reported in Launchpad by Kenny Gryp last update 19-01-2018 16:24:54

      The problem that we are is facing is similar to what is described in https://bugs.mysql.com/bug.php?id=85447, which has been fixed in 5.7.18 (and the test case no longer fails).

      This is what happens:

      There are 2 XA Transactions that deadlock which causes replication to fail with the error:

      Last_Error: Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
      

      There are several ongoing XA Transactions at the same time.

      When we look at the locks that are being held, we see:

      mysql> select * from INNODB_LOCKS;
      --------------------------------------------------------------------------------------------------------------------------------------+
      
      lock_id	lock_trx_id	lock_mode	lock_type	lock_table	lock_index	lock_space	lock_page	lock_rec	lock_data
      --------------------------------------------------------------------------------------------------------------------------------------+
      
      28024945201:8439:938215:1	28024945201	X	RECORD	`databse`.`tablezz`	PRIMARY	8439	938215	1	supremum pseudo-record
      28024945046:8439:938215:1	28024945046	S	RECORD	`databse`.`tablezz`	PRIMARY	8439	938215	1	supremum pseudo-record
      --------------------------------------------------------------------------------------------------------------------------------------+
      2 rows in set, 1 warning (0.00 sec)
      
      mysql> select * from INNODB_LOCK_WAITS;
      ------------------------------------------------------------------------------------+
      
      requesting_trx_id	requested_lock_id	blocking_trx_id	blocking_lock_id
      ------------------------------------------------------------------------------------+
      
      28024945201	28024945201:8439:938215:1	28024945046	28024945046:8439:938215:1
      ------------------------------------------------------------------------------------+
      1 row in set, 1 warning (0.00 sec)
      

      There is a S lock on a supremum pseudo-record in one XA transaction which 'conflicts' with the X lock another XA transaction wants to take.

      How does this happen?

      What happens on the original master is as follows...

      Requirements:

      We have the following rows: ...,80000,100000,...
      row 80000 is on for example innodb page 99 and row 100000 is on page 100. There is still some space available to add a row to page 99.
      make sure the slave is using log_slave_updates & binlog_format=mixed
      You can simulate it like this:

      CREATE TABLE t1 (t1_pk DECIMAL(20,0) PRIMARY KEY , t1_blob BLOB) ENGINE=InnoDB;
      
      INSERT INTO t1 VALUES (10000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (20000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (30000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (40000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (50000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (60000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (70000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (80000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (90000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (100000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (110000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (120000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (130000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (140000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (150000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (160000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (170000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (180000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (190000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (200000, REPEAT("a", 2165));
      INSERT INTO t1 VALUES (210000, REPEAT("a", 2165));
      
      DELETE FROM t1 WHERE t1_pk IN (90000, 80000);
      
      CREATE TABLE t2 (t2_pk INT PRIMARY KEY, t1_pk DECIMAL(20,0),
      FOREIGN KEY (t1_pk) REFERENCES t1 (t1_pk)) ENGINE=InnoDB;
      
      --connect(con2,localhost,root)
      XA START '2';
      INSERT INTO t1 VALUES (85000, NULL);
      
      We are inserting a value between 80000,100000
      --connection master
      XA START '1';
      INSERT INTO t2 VALUES (1, 100000);
      
      This causes an S lock on 100000
      XA END '1';
      XA PREPARE '1';
      
      --connection con2
      XA END '2';
      XA PREPARE '2';
      
      --connection con2
      XA COMMIT '2';
      
      --connection master
      XA COMMIT '1';
      

      Now depending on binlog_format and tx_isolation, you will get different results:

      binlog_format=MIXED, tx_isolation=REPEATABLE READ: the events will be logged in SBR and replication will break. You will have supremum pseudo-records S Locks on both master and slave
      binlog_format=MIXED, tx_isolation=READ COMMITTED : the events will be logged in RBR and replication will not break. There will be no supremum pseudo-record S locks
      binlog_format=ROW , tx_isolation=REPEATABLE READ: The events will be logged in RBR and replication will not break. You will have supremum pseudo-records S locks on the master, but not on the slave as RBR binlog events are in READ-COMMITTED tx isolation mode
      binlog_format=ROW , tx_isolation=READ COMMITTED : The events will be logged in RBR and replication will not break. You will have supremum pseudo-records S locks on the master, but not on the slave
      

      All this is to be expected. (This basically means always use RBR with XA)

      I have tried everything to be able to reproduce this issue the customer is facing and am not able to reproduce this yet
      The customer is using binlog_format=MIXED with tx_isolation=READ COMMITTED, the events are RBR, so that test case above does not fail, but the workload of the customer which looks similar does fail.

      You can see with verbose locks in `SHOW ENGINE INNODB STATUS` that there is an S lock on a row in page 938216, and then an supremum S lock on page 938215 :

      RECORD LOCKS space id 8439 page no 938216 n bits 80 index PRIMARY of table `databse`.`tablezz` trx id 28024945046 lock mode S locks rec but not gap
      Record lock, heap no 2 PHYSICAL RECORD: n_fields 45; compact format; info bits 0
      0: len 13; hex 8000000000000001561a65f747; asc V e G;;
      ... REDACTED ...
      
      RECORD LOCKS space id 8439 page no 938215 n bits 112 index PRIMARY of table `databse`.`tablezz` trx id 28024945046 lock mode S
      Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
      0: len 8; hex 73757072656d756d; asc supremum;;
      

      It must be that somehow replication is putting an S lock on a supremum record, even though the TRX is in READ-COMMITTED. I do not yet know when this happens.

        Smart Checklist

          Attachments

            Issue Links

              Activity

                People

                • Assignee:
                  vladislav.lesin Vladislav Lesin (Inactive)
                  Reporter:
                  lpjirasync lpjirasync (Inactive)
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  8 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved: