Details
-
Bug
-
Status: Done
-
Medium
-
Resolution: Fixed
-
5.7.32-35
-
None
Description
Hi,
We're encountering a bug where sql_thread got stuck in Waiting until MASTER_DELAY seconds after master executed event for hours and doesn't apply any transaction.
The issue happens only when parallel replication (slave_parallel_workers > 1) and sql_delay is enabled on the replica. Disabling the parallel replication fixes the issue and sql_thread resumes the work as expected.
mysql> show global variables where Variable_name in ('slave_preserve_commit_order','slave_parallel_type','slave_parallel_workers'); +-----------------------------+---------------+ | Variable_name | Value | +-----------------------------+---------------+ | slave_parallel_type | LOGICAL_CLOCK | | slave_parallel_workers | 4 | | slave_preserve_commit_order | ON | +-----------------------------+---------------+ 3 rows in set (0.00 sec) $ mysql -e"show slave status\G" | grep 'SQL_Delay' SQL_Delay: 14400
The following output confirms the issue, after the relay log rotation the Slave_SQL_Running_State stuck in "Waiting until MASTER_DELAY seconds after master executed event". The replica is stuck even if Seconds_Behind_Master > SQL_Delay.
# mysql -e"show slave status\G" | grep -iE 'seconds|delay' Seconds_Behind_Master: 74795 SQL_Delay: 14400 SQL_Remaining_Delay: 8752 Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
The Relay_Master_Log_File and Exec_Master_Log_Pos don't change for hours.
while true; do echo "========================"; mysql -e"show slave status\G" | grep -iE 'relay_master_log_file|Exec_Master_Log_Pos'; sleep 5; done ============================================= Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 86759722 ============================================= Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 86759722 ============================================= Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 86759722 ============================================= Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 86759722 ============================================= Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 86759722 =============================================
MySQL processlist showing the same issue.
mysql> show processlist; +------+--------------+-------------------+------+---------+-------+----------------------------------------------------------------+------------------+-----------+---------------+ | Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | +------+--------------+-------------------+------+---------+-------+----------------------------------------------------------------+------------------+-----------+---------------+ | 1 | system user | | NULL | Connect | 8137 | Waiting for master to send event | NULL | 0 | 0 | | 2 | system user | | NULL | Connect | 5964 | Waiting until MASTER_DELAY seconds after master executed event | NULL | 0 | 0 | | 3 | system user | | NULL | Connect | 75111 | Waiting for an event from Coordinator | NULL | 0 | 0 | | 4 | system user | | NULL | Connect | 75111 | Waiting for an event from Coordinator | NULL | 0 | 0 | | 5 | system user | | NULL | Connect | 75111 | Waiting for an event from Coordinator | NULL | 0 | 0 | | 6 | system user | | NULL | Connect | 75111 | Waiting for an event from Coordinator | NULL | 0 | 0 | ...
From the InnoDB status output (show engine innodb status\G), we could see that sql_thread stuck in Waiting for an event from Coordinator state while keeping the transaction open for 6011 seconds and increasing.
---TRANSACTION 57479112090, ACTIVE 6011 sec
2 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 1
MySQL thread id 3, OS thread handle 140093445318400, query id 17809197 Waiting for an event from Coordinator
The behavior is exactly the same as mentioned in the bug - https://jira.percona.com/browse/PS-3716
After disabling the parallel replication, it was working as expected.
mysql> show global variables where Variable_name in ('slave_preserve_commit_order','slave_parallel_type','slave_parallel_workers'); +-----------------------------+---------------+ | Variable_name | Value | +-----------------------------+---------------+ | slave_parallel_type | LOGICAL_CLOCK | | slave_parallel_workers | 0 | | slave_preserve_commit_order | ON | +-----------------------------+---------------+ 3 rows in set (0.00 sec)
SQL_thread started applying events and Seconds_Behind_Master started decreasing.
# while true; > do echo "========================"; > mysql -e"show slave status\G" | grep -iE 'relay_master_log_file|Exec_Master_Log_Pos'; > sleep 5; > done ======================== Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 984748121 ======================== Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 1004479338 ======================== Relay_Master_Log_File: mysql-bin.008275 Exec_Master_Log_Pos: 1024737695
Server and MySQL Configuration on the affected replica.
cat /etc/redhat-release CentOS Linux release 7.7.1908 (Core) CPU : 16vCPU RAM : 125G Percona Server version rpm -qa | grep -i percona-server Percona-Server-shared-57-5.7.32-35.1.el7.x86_64 Percona-Server-shared-compat-57-5.7.32-35.1.el7.x86_64 Percona-Server-client-57-5.7.32-35.1.el7.x86_64 Percona-Server-devel-57-5.7.32-35.1.el7.x86_64 Percona-Server-57-debuginfo-5.7.32-35.1.el7.x86_64 Percona-Server-server-57-5.7.32-35.1.el7.x86_64 My.cnf file [mysql] port = 3306 socket = /var/lib/mysql/mysql.sock [mysqld] collation-server = utf8_unicode_ci character-set-server = utf8 sql-mode = "STRICT_ALL_TABLES,STRICT_TRANS_TABLES" plugin-load-add = audit_log.so user = mysql default-storage-engine = InnoDB socket = /var/lib/mysql/mysql.sock pid-file = /var/lib/mysql/mysql.pid tmpdir = /tmp explicit_defaults_for_timestamp = 1 bind-address = 0.0.0.0 skip-name-resolve = ON max-connect-errors = 1000000 sysdate-is-now = 1 datadir = /var/lib/mysql expire-logs-days = 4 binlog-format = ROW log_slave_updates gtid-mode = ON enforce-gtid-consistency = ON tmp-table-size = 32M max-heap-table-size = 32M query-cache-type = 0 query-cache-size = 0 thread-cache-size = 128 open-files-limit = 65535 table-definition-cache = 1024 max_allowed_packet = 1G net_buffer_length = 1M thread_cache_size = 6000 innodb-flush-method = O_DIRECT innodb-log-files-in-group = 2 innodb-flush-log-at-trx-commit = 2 innodb-file-per-table = 1 innodb_page_cleaners = 16 log-error = /var/lib/mysql/mysql-error.log log-queries-not-using-indexes = 1 slow-query-log-file = /var/lib/mysql/mysql-slow.log master_info_repository = TABLE server-id = <ID> report-host = <HOST> local_infile = OFF automatic_sp_privileges = FALSE old_passwords = 2 innodb-max-dirty-pages-pct = 75 read_only = 1 super_read_only = 0 skip_ssl log-bin = mysql-bin max_binlog_size = 1G max-connections = 5000 table_open_cache = 10000 innodb_adaptive_hash_index = 0 long_query_time = 0 thread_pool_size = 32 innodb_purge_threads = 8 innodb_write_io_threads = 32 innodb_read_io_threads = 32 innodb_flush_neighbors = 0 innodb_thread_concurrency = 32 log_output = file slow_query_log = ON long_query_time = 1 log_slow_rate_limit = 100 log_slow_rate_type = query log_slow_verbosity = full log_slow_admin_statements = ON log_slow_slave_statements = ON slow_query_log_always_write_time = 1 slow_query_log_use_global_control = all innodb_monitor_enable = all max_user_connections = 5000 max_connections = 5050 thread_cache_size = 4000 innodb_autoinc_lock_mode = 2 innodb_sync_array_size = 1 innodb_open_files = 2000 skip_symbolic_links table_open_cache_instances = 4 innodb_log_file_size = 10737418240 sync_binlog = 0 slave_net_timeout = 250 innodb_lru_scan_depth = 4096 audit_log_exclude_accounts = "'user1'@'XX.XX.%.%','user2'@'XX.XX.%.%',.............." audit_log_rotate_on_size = 409600 innodb-io-capacity = 1000 innodb-io-capacity-max = 2000 innodb_flushing_avg_loops = 200 innodb-buffer-pool-size = 86G innodb_buffer_pool_instances = 8 slave_rows_search_algorithms = "INDEX_SCAN,HASH_SCAN" binlog_group_commit_sync_no_delay_count = 8 slave_parallel_type = LOGICAL_CLOCK slave_preserve_commit_order = 1 slave_parallel_workers = 4 # The master has following settings. | binlog_group_commit_sync_delay | 100 | | binlog_group_commit_sync_no_delay_count | 8 | # Enable sql_delay on replica after the replication setup. stop slave; CHANGE MASTER TO MASTER_DELAY=14400; start slave;