Uploaded image for project: 'Percona XtraDB Cluster'
  1. Percona XtraDB Cluster
  2. PXC-2684

Modified error handling to prevent deadlock when stored procedure was aborted.

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: 5.6.45-28.36, 5.7.27-31.39
    • Fix Version/s: 5.6.47-28.40, 5.7.29-31.43
    • Component/s: None
    • Labels:
      None

      Description

      In the situation when a node runs a procedure, accessing a table receives update from another node which could be resolved using InnoDB Buffer Pool Prefetch cache, this node behaves slowly having many "WSREP: BF lock wait long"

      How to Repeat.

      1. Download *tar.gz package, cd to the mysql-test directory:
      2. Start MTR:

      ]$ ./mtr --suite=galera_3nodes --mysqld=--wsrep-provider=`pwd`/../lib/libgalera_smm.so --start galera_certification_ccc --mysqld=--wsrep_debug=on  &
      

      3. Load attached dump:

      ../bin/mysql -h127.0.0.1 -P13010 -uroot test < t1.sql 
      

      4. Connect to node 1 and create a procedure:

      create procedure p1()
      begin
      DECLARE done INT DEFAULT FALSE;
      DECLARE c INT DEFAULT 0;
      DECLARE pk1v VARCHAR(32);
      DECLARE pk2v VARCHAR(127);
      DECLARE orderval INT DEFAULT 0;
      DECLARE printed INT DEFAULT FALSE;
      declare pk2c cursor for SELECT DISTINCT pk2, pk1, ordercol FROM t1 ORDER BY ordercol;
      DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
      
      start transaction;
      update t1 set sp1=0, sp2=UNIX_TIMESTAMP() where sp1=1;
      
      open pk2c;
      pk2c_loop: LOOP
      FETCH pk2c INTO pk2v, pk1v, orderval;
      IF done THEN
      LEAVE pk2c_loop;
      END IF;
      SET c=c+1;
      UPDATE t1 SET sp1=1, sp2=UNIX_TIMESTAMP() where pk2=pk2v and pk1=pk1v;
      
      IF orderval > 1075858151 THEN
      IF NOT printed THEN
      SELECT 'Now run updates';
      SELECT SLEEP(100);                
      SET printed=TRUE;
      ELSE
      LEAVE pk2c_loop;
      END IF;
      END IF;
      END LOOP;
      close pk2c;
      
      update t1 set sp1=10, sp2=UNIX_TIMESTAMP() WHERE pk1='animi iste et odio cupiditate qu' and pk2='voluptatem ipsa dolores ducimus.';
      update t1 set sp1=11, sp2=UNIX_TIMESTAMP() WHERE pk1='animi iste et odio cupiditate qu' and pk2='voluptatem ipsa dolores ducimus.';
      
      COMMIT;
      END
      

      5. Run procedure for the first time:

      mysql> call p1()|
      +-----------------+
      | Now run updates |
      +-----------------+
      | Now run updates |
      +-----------------+
      1 row in set (3.22 sec)
      
      +------------+
      | SLEEP(100) |
      +------------+
      |          0 |
      +------------+
      1 row in set (1 min 43.22 sec)
      
      Query OK, 0 rows affected (1 min 43.36 sec)
      

      6. Once you see "Now run updates" go to another terminal, connect to node 2 and run updates:

      ../bin/mysql -h127.0.0.1 -P13014 -uroot test -e "update t1 set u=2 WHERE pk1='a alias rerum consequuntur eveni' and pk2='excepturi laboriosam incidunt error in.';" &
      
      ../bin/mysql -h127.0.0.1 -P13014 -uroot test -e "update t1 set u=1 WHERE pk1='a alias rerum consequuntur eveni' and pk2='excepturi laboriosam incidunt error in.';" &
      

      So far, so good. You won't see any issue.
      7. Once the procedure finishes executing run it second time:

      mysql> call p1()|
      +-----------------+
      | Now run updates |
      +-----------------+
      | Now run updates |
      +-----------------+
      1 row in set (3.36 sec)
      

      8. Once you see "Now run updates" go to another window and run:

      $ ../bin/mysql -h127.0.0.1 -P13014 -uroot test -e "update t1 set u=2 WHERE pk1='a alias rerum consequuntur eveni' and pk2='excepturi laboriosam incidunt error in.';" &
      

      9. Check SHOW PROCESSLIST on the first node:

      $ time ../bin/mysql -h127.0.0.1 -P13010 -uroot -e "show processlist"
      +----+-------------+-----------------+------+---------+------+-------------------------------------------------+------------------+-----------+---------------+
      | Id | User        | Host            | db   | Command | Time | State                                           | Info             | Rows_sent | Rows_examined |
      +----+-------------+-----------------+------+---------+------+-------------------------------------------------+------------------+-----------+---------------+
      |  1 | system user |                 | NULL | Sleep   | 2064 | wsrep: aborter idle                             | NULL             |         0 |             0 |
      |  2 | system user |                 | NULL | Sleep   | 1593 | System lock                                     | NULL             |         0 |             0 |
      | 11 | root        | localhost:34824 | test | Query   | 1593 | wsrep: initiating pre-commit for write set (12) | COMMIT           |         2 |         35016 |
      | 14 | root        | localhost:49304 | NULL | Query   |    0 | starting                                        | show processlist |         0 |             0 |
      +----+-------------+-----------------+------+---------+------+-------------------------------------------------+------------------+-----------+---------------+
      
      real    0m30.022s
      user    0m0.010s
      sys     0m0.008s
      

      Notice that thread in "System lock" and "wsrep: initiating pre-commit for write set (12)" states are runing for too long. Also SHOW PROCESSLIST command runs ridicoulosly 30 seconds.

      10. Now switch to the terminal where you have procedure on the first node running and find it was sleeping less than requested 100 seconds:

      +------------+
      | SLEEP(100) |
      +------------+
      |          1 |
      +------------+
      1 row in set (6.87 sec)
      

      11. Check error log:

      2019-10-24 15:00:37 41723 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2019-10-24 15:00:37 41723 [Note] WSREP: Winning thread: 
         THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 11
         SQL: (null)
      2019-10-24 15:00:37 41723 [Note] WSREP: Victim thread: 
         THD: 8, mode: local, state: executing, conflict: no conflict, seqno: -1
         SQL: SELECT SLEEP(100)
      2019-10-24 15:00:37 41723 [Note] WSREP: BF kill (1, seqno: 11), victim: (8) trx: 1385
      2019-10-24 15:00:37 41723 [Note] WSREP: Aborting query: SELECT SLEEP(100) conf 0 trx: 1385
      2019-10-24 15:00:37 41723 [Note] WSREP: kill trx QUERY_EXEC for 1385
      2019-10-24 15:00:37 41723 [Note] WSREP: kill query for: 8
      2019-10-24 15:00:40 41723 [Note] WSREP: wsrep_sync_wait: thd->variables.wsrep_sync_wait = 15, mask = 1
      WSREP: BF lock wait long
      

      Killing query, executing the procedure, does not help. All subsequent accesses to this node continue running slowly, queries on the t1 table hanging or getting canceled due to lock wait timeout.

      Log provided for version 5.6.45, but the issue is repeatable with version 5.7.27 too.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              kenn.takara Kenn Takara
              Reporter:
              sveta.smirnova Sveta Smirnova
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - Not Specified
                  Not Specified
                  Logged:
                  Time Spent - 3 weeks, 3 hours, 10 minutes
                  3w 3h 10m

                    Smart Checklist