Uploaded image for project: 'Percona Server for MySQL'
  1. Percona Server for MySQL
  2. PS-7498

Prevent the replication coordinator thread getting stuck due to MASTER_DELAY while handling partial relay log transactions.

Details

    Description

      Hi,

      We're encountering a bug where sql_thread got stuck in Waiting until MASTER_DELAY seconds after master executed event for hours and doesn't apply any transaction.

      The issue happens only when parallel replication (slave_parallel_workers > 1) and sql_delay is enabled on the replica. Disabling the parallel replication fixes the issue and sql_thread resumes the work as expected.

      mysql> show global variables where Variable_name in ('slave_preserve_commit_order','slave_parallel_type','slave_parallel_workers');
      +-----------------------------+---------------+
      | Variable_name               | Value         |
      +-----------------------------+---------------+
      | slave_parallel_type         | LOGICAL_CLOCK |
      | slave_parallel_workers      | 4             |
      | slave_preserve_commit_order | ON            |
      +-----------------------------+---------------+
      3 rows in set (0.00 sec)
      
      $ mysql -e"show slave status\G" | grep  'SQL_Delay'
                          SQL_Delay: 14400
      
      

      The following output confirms the issue, after the relay log rotation the Slave_SQL_Running_State stuck in "Waiting until MASTER_DELAY seconds after master executed event".  The replica is stuck even if Seconds_Behind_Master > SQL_Delay.

      # mysql -e"show slave status\G" | grep -iE 'seconds|delay'
              Seconds_Behind_Master: 74795
                          SQL_Delay: 14400
                SQL_Remaining_Delay: 8752
            Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
      

      The Relay_Master_Log_File and Exec_Master_Log_Pos don't change for hours.

      while true; 
      do echo "========================";
      mysql -e"show slave status\G" | grep -iE 'relay_master_log_file|Exec_Master_Log_Pos'; 
      sleep 5; 
      done
      =============================================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 86759722
      =============================================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 86759722
      =============================================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 86759722
      =============================================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 86759722
      =============================================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 86759722
      =============================================
      

      MySQL processlist showing the same issue.

      mysql> show processlist;
      +------+--------------+-------------------+------+---------+-------+----------------------------------------------------------------+------------------+-----------+---------------+
      | Id   | User         | Host              | db   | Command | Time  | State                                                          | Info             | Rows_sent | Rows_examined |
      +------+--------------+-------------------+------+---------+-------+----------------------------------------------------------------+------------------+-----------+---------------+
      |    1 | system user  |                   | NULL | Connect |  8137 | Waiting for master to send event                               | NULL             |         0 |             0 |
      |    2 | system user  |                   | NULL | Connect |  5964 | Waiting until MASTER_DELAY seconds after master executed event | NULL             |         0 |             0 |
      |    3 | system user  |                   | NULL | Connect | 75111 | Waiting for an event from Coordinator                          | NULL             |         0 |             0 |
      |    4 | system user  |                   | NULL | Connect | 75111 | Waiting for an event from Coordinator                          | NULL             |         0 |             0 |
      |    5 | system user  |                   | NULL | Connect | 75111 | Waiting for an event from Coordinator                          | NULL             |         0 |             0 |
      |    6 | system user  |                   | NULL | Connect | 75111 | Waiting for an event from Coordinator                          | NULL             |         0 |             0 |
      ...
       
      

      From the InnoDB status output (show engine innodb status\G), we could see that sql_thread stuck in Waiting for an event from Coordinator state while keeping the transaction open for 6011 seconds and increasing. 

      ---TRANSACTION 57479112090, ACTIVE 6011 sec
      2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
      MySQL thread id 3, OS thread handle 140093445318400, query id 17809197 Waiting for an event from Coordinator
      

      The behavior is exactly the same as mentioned in the bug - https://jira.percona.com/browse/PS-3716

      After disabling the parallel replication, it was working as expected.

      mysql> show global variables where Variable_name in ('slave_preserve_commit_order','slave_parallel_type','slave_parallel_workers');
      +-----------------------------+---------------+
      | Variable_name               | Value         |
      +-----------------------------+---------------+
      | slave_parallel_type         | LOGICAL_CLOCK |
      | slave_parallel_workers      | 0             |
      | slave_preserve_commit_order | ON            |
      +-----------------------------+---------------+
      3 rows in set (0.00 sec)
      

      SQL_thread started applying events and Seconds_Behind_Master started decreasing.

      # while true;
      > do echo "========================";
      > mysql -e"show slave status\G" | grep -iE 'relay_master_log_file|Exec_Master_Log_Pos';
      > sleep 5;
      > done
      ========================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 984748121
      ========================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 1004479338
      ========================
              Relay_Master_Log_File: mysql-bin.008275
                Exec_Master_Log_Pos: 1024737695
      

      Server and MySQL Configuration on the affected replica.

      cat /etc/redhat-release
      CentOS Linux release 7.7.1908 (Core)
      
      CPU : 16vCPU
      RAM : 125G
      
      Percona Server version
      rpm -qa | grep -i percona-server
      Percona-Server-shared-57-5.7.32-35.1.el7.x86_64
      Percona-Server-shared-compat-57-5.7.32-35.1.el7.x86_64
      Percona-Server-client-57-5.7.32-35.1.el7.x86_64
      Percona-Server-devel-57-5.7.32-35.1.el7.x86_64
      Percona-Server-57-debuginfo-5.7.32-35.1.el7.x86_64
      Percona-Server-server-57-5.7.32-35.1.el7.x86_64
      
      
      My.cnf file
      
      [mysql]
      port                                = 3306
      socket                              = /var/lib/mysql/mysql.sock
      
      [mysqld]
      collation-server                    = utf8_unicode_ci
      character-set-server                = utf8
      sql-mode                            = "STRICT_ALL_TABLES,STRICT_TRANS_TABLES"
      plugin-load-add                     = audit_log.so
      user                                = mysql
      default-storage-engine              = InnoDB
      socket                              = /var/lib/mysql/mysql.sock
      pid-file                            = /var/lib/mysql/mysql.pid
      tmpdir                              = /tmp
      explicit_defaults_for_timestamp     = 1
      bind-address                        = 0.0.0.0
      skip-name-resolve                   = ON
      max-connect-errors                  = 1000000
      sysdate-is-now                      = 1
      datadir                             = /var/lib/mysql
      expire-logs-days                    = 4
      binlog-format                       = ROW
      log_slave_updates
      gtid-mode                           = ON
      enforce-gtid-consistency            = ON
      tmp-table-size                      = 32M
      max-heap-table-size                 = 32M
      query-cache-type                    = 0
      query-cache-size                    = 0
      thread-cache-size                   = 128
      open-files-limit                    = 65535
      table-definition-cache              = 1024
      max_allowed_packet                  = 1G
      net_buffer_length                   = 1M
      thread_cache_size                   = 6000
      innodb-flush-method                 = O_DIRECT
      innodb-log-files-in-group           = 2
      innodb-flush-log-at-trx-commit      = 2
      innodb-file-per-table               = 1
      innodb_page_cleaners                = 16
      log-error                           = /var/lib/mysql/mysql-error.log
      log-queries-not-using-indexes       = 1
      slow-query-log-file                 = /var/lib/mysql/mysql-slow.log
      master_info_repository              = TABLE
      server-id                           = <ID>
      report-host                         = <HOST>
      local_infile                        = OFF
      automatic_sp_privileges             = FALSE
      old_passwords                       = 2
      innodb-max-dirty-pages-pct          = 75
      read_only                           = 1
      super_read_only                     = 0
      skip_ssl
      log-bin                             = mysql-bin
      max_binlog_size                     = 1G
      max-connections                     = 5000
      table_open_cache                    = 10000
      innodb_adaptive_hash_index          = 0
      long_query_time                     = 0
      thread_pool_size                    = 32
      innodb_purge_threads                = 8
      innodb_write_io_threads             = 32
      innodb_read_io_threads              = 32
      innodb_flush_neighbors              = 0
      innodb_thread_concurrency           = 32
      log_output                          = file
      slow_query_log                      = ON
      long_query_time                     = 1
      log_slow_rate_limit                 = 100
      log_slow_rate_type                  = query
      log_slow_verbosity                  = full
      log_slow_admin_statements           = ON
      log_slow_slave_statements           = ON
      slow_query_log_always_write_time    = 1
      slow_query_log_use_global_control   = all
      innodb_monitor_enable               = all
      max_user_connections                = 5000
      max_connections                     = 5050
      thread_cache_size                   = 4000
      innodb_autoinc_lock_mode            = 2
      innodb_sync_array_size              = 1
      innodb_open_files                   = 2000
      skip_symbolic_links
      table_open_cache_instances          = 4
      innodb_log_file_size                = 10737418240
      sync_binlog                         = 0
      slave_net_timeout                   = 250
      innodb_lru_scan_depth               = 4096
      audit_log_exclude_accounts          = "'user1'@'XX.XX.%.%','user2'@'XX.XX.%.%',.............."
      audit_log_rotate_on_size            = 409600
      innodb-io-capacity                  = 1000
      innodb-io-capacity-max              = 2000
      innodb_flushing_avg_loops           = 200
      innodb-buffer-pool-size             = 86G
      innodb_buffer_pool_instances        = 8
      slave_rows_search_algorithms        = "INDEX_SCAN,HASH_SCAN"
      binlog_group_commit_sync_no_delay_count = 8
      slave_parallel_type                 = LOGICAL_CLOCK
      slave_preserve_commit_order         = 1
      slave_parallel_workers              = 4
      
      
      # The master has following settings.
      | binlog_group_commit_sync_delay          | 100                  |
      | binlog_group_commit_sync_no_delay_count | 8                    |
      
      
      # Enable sql_delay on replica after the replication setup.
      stop slave;
      CHANGE MASTER TO MASTER_DELAY=14400;
      start slave; 
      
      

      Attachments

        Activity

          People

            venkatesh.prasad Venkatesh Prasad
            alok.pathak Alok Pathak (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - Not Specified
                Not Specified
                Logged:
                Time Spent - 4 days, 7 hours, 30 minutes
                4d 7h 30m

                Smart Checklist