Details

    • Type: Improvement
    • Status: Confirmation
    • Priority: Medium
    • Resolution: Unresolved
    • Affects Version/s: 5.6.41-28.28, 5.7.23-31.31
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      With current DDL implementation in Galera/wsrep, there is no way even a simple ALTER will be unobtrusive for the live production traffic. Not even using RSU DDL method nor pt-online-schema-change will save us from some unwanted disruptions.

      Below, I would like to present a case, where RSU method is usually a good idea,  when we want to avoid cluster wide stall during schema compatible DDL, in this case, noop-ALTER (aka optimize table).

       

      -- session1
      node1 > select @@wsrep_OSU_method,@@wsrep_on;
      +--------------------+------------+
      | @@wsrep_OSU_method | @@wsrep_on |
      +--------------------+------------+
      | RSU                |          1 |
      +--------------------+------------+
      1 row in set (0.00 sec)
      
      -- session2
      node1 > select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000;
      ERROR 1213 (40001): WSREP detected deadlock/conflict and aborted the transaction. Try restarting the transaction
      
      -- session1
      node1 >  alter table db1.sbtest1 engine=innodb;
      Query OK, 0 rows affected (1.19 sec)
      Records: 0  Duplicates: 0  Warnings: 0
      

      As seen above, a noop alter executed in RSU interrupted simple SELECT query. Both sessions running on the same PXC node!

       

      In the error log, we can see more details:

       

      2018-12-04T15:07:03.039712Z 12 [Note] WSREP: Provider paused at 7bf59bb4-996d-11e8-b3b6-8ed02cd38513:471796 (26)
      2018-12-04T15:07:03.109785Z 12 [Note] WSREP: --------- CONFLICT DETECTED --------
      2018-12-04T15:07:03.109802Z 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2018-12-04T15:07:03.109805Z 12 [Note] WSREP: Winning thread: 
       THD: 12, mode: total order, state: executing, conflict: no conflict, seqno: -1
       SQL: alter table db1.sbtest1 engine=innodb
      2018-12-04T15:07:03.109807Z 12 [Note] WSREP: Victim thread: 
       THD: 11, mode: local, state: executing, conflict: no conflict, seqno: -1
       SQL: select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000
      2018-12-04T15:07:03.109811Z 12 [Note] WSREP: MDL conflict db=db1 table=sbtest1 ticket=MDL_SHARED_READ solved by abort
      2018-12-04T15:07:03.109812Z 12 [Note] WSREP: --------- CONFLICT DETECTED --------
      2018-12-04T15:07:03.109814Z 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2018-12-04T15:07:03.109815Z 12 [Note] WSREP: Winning thread: 
       THD: 12, mode: total order, state: executing, conflict: no conflict, seqno: -1
       SQL: alter table db1.sbtest1 engine=innodb
      2018-12-04T15:07:03.109817Z 12 [Note] WSREP: Victim thread: 
       THD: 11, mode: local, state: executing, conflict: must abort, seqno: -1
       SQL: select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000
      2018-12-04T15:07:03.959399Z 12 [Note] WSREP: resuming provider at 26
      2018-12-04T15:07:03.959418Z 12 [Note] WSREP: Provider resumed.
      

       

      Even though such alter in normal MySQL would be online, and just wait for MDL lock to execute, here it causes any ongoing transactions to immediately abort.

      Theoretically, set wsrep_desync=1; and set wsrep_on=0; before an ALTER would be simialar to what RSU mode offers, but it is not the case. Disabling wsrep_on brings the normal MySQL behavior here:

       

      -- session1
      node1 > set global wsrep_desync=1; set wsrep_on=0;
      Query OK, 0 rows affected, 1 warning (0.00 sec)
      Query OK, 0 rows affected (0.00 sec)
      node1 > select @@wsrep_OSU_method,@@wsrep_on,@@wsrep_desync;
      +--------------------+------------+----------------+
      | @@wsrep_OSU_method | @@wsrep_on | @@wsrep_desync |
      +--------------------+------------+----------------+
      | TOI | 0 | 1 |
      +--------------------+------------+----------------+
      1 row in set (0.00 sec)
      

      Now, the selects are not disrupted at all:

      -- session2
      node1 > select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000;
      +-----------+
      | count(*) |
      +-----------+
      | 423680000 |
      +-----------+
      1 row in set (13.24 sec)

       

      -- session 1
      node1 > alter table db1.sbtest1 engine=innodb;
      Query OK, 0 rows affected (12.98 sec)
      Records: 0 Duplicates: 0 Warnings: 0
      

       

       

      -- session 3
      node1 > select id,command,time,state,info from information_schema.processlist where user="root";
      +----+---------+------+---------------------------------+-----------------------------------------------------------------------------------------+
      | id | command | time | state | info |
      +----+---------+------+---------------------------------+-----------------------------------------------------------------------------------------+
      | 11 | Query | 13 | Sending data | select count(*) from db1.sbtest1 a join db1.sbtest1 b where a.id<10000 |
      | 12 | Query | 12 | Waiting for table metadata lock | alter table db1.sbtest1 engine=innodb |
      | 15 | Query | 0 | executing | select id,command,time,state,info from information_schema.processlist where user="root" |
      +----+---------+------+---------------------------------+-----------------------------------------------------------------------------------------+
      3 rows in set (0.00 sec) 
      

       

      I think there is no reason why RSU mode should abort local transactions. With known risks, it should allow the schema compatible DDLs to behave just like on standalone MySQL. This way we could get non-blocking DDLs at least for safe ALTERs, like table optimize, and adding/removing secondary indexes. 

       

        Smart Checklist

          Attachments

            Activity

              People

              • Assignee:
                krunal.bauskar Krunal Bauskar
                Reporter:
                przemyslaw.malkowski@percona.com Przemyslaw Malkowski
              • Votes:
                2 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated: