Details
-
Bug
-
Status: Done
-
High
-
Resolution: Fixed
-
3.0.13
-
Yes
-
Yes
-
Yes
Description
Summary:
pt-online-schema-change was bringing the database into a broken state when applying the "rebuild_constraints" foreign keys modification method if any of child tables were blocked by the metadata lock
The database left in broken state if any of child tables blocked by metadata lock:
create table t_prnt(prnt_id int auto_increment primary key); create table t_chld(chld_id int auto_increment primary key, prnt_id int); alter table t_chld add constraint FOREIGN KEY (prnt_id) REFERENCES t_prnt(prnt_id) ON DELETE CASCADE; replace into t_prnt values (1); insert into t_chld (prnt_id) values (1); # create an mdl against child table: mysql -e 'begin; select sleep(3600) from test.t_chld;' PTDEBUG=1 ./pt-online-schema-change --execute --alter-foreign-keys-method rebuild_constraints D=test,t=t_prnt --alter 'add column c int
There is a part from ptdebug output:
2019-07-20T11:01:33 Swapping tables... # Retry:3823 6891 Try 1 of 10 # pt_online_schema_change:10617 6891 RENAME TABLE `test`.`t_prnt` TO `test`.`_t_prnt_old`, `test`.`_t_prnt_new` TO `test`.`t_prnt` .... # Retry:3835 6891 Calling wait code # Retry:3823 6891 Try 10 of 10 # pt_online_schema_change:10915 6891 ALTER TABLE `test`.`t_chld` DROP FOREIGN KEY `t_chld_ibfk_1`, ADD CONSTRAINT `_t_chld_ibfk_1` FOREIGN KEY (`prnt_id`) REFERENCES `test`.`t_prnt` (`prnt_id`) ON DELETE CASCADE # Retry:3829 6891 Try code failed: DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "ALTER TABLE `test`.`t_chld` DROP FOREIGN KEY `t_chld_ibfk_1`, ADD CONSTRAINT `_t_chld_ibfk_1` FOREIGN KEY (`prnt_id`) REFERENCES `test`.`t_prnt` (`prnt_id`) ON DELETE CASCADE"] at ./pt-online-schema-change line 10916. # # Retry:3845 6891 Try code did not succeed # CleanupTask:6584 6891 Calling cleanup task CODE(0x5649d741e2f8) # pt_online_schema_change:9396 6891 Clean up triggers 2019-07-20T11:11:48 Dropping triggers... # Retry:3823 6891 Try 1 of 10 # pt_online_schema_change:11296 6891 DROP TRIGGER IF EXISTS `test`.`pt_osc_test_t_prnt_del` # Retry:3840 6891 Try code succeeded # Retry:3823 6891 Try 1 of 10 # pt_online_schema_change:11296 6891 DROP TRIGGER IF EXISTS `test`.`pt_osc_test_t_prnt_upd` # Retry:3840 6891 Try code succeeded # Retry:3823 6891 Try 1 of 10 # pt_online_schema_change:11296 6891 DROP TRIGGER IF EXISTS `test`.`pt_osc_test_t_prnt_ins` # Retry:3840 6891 Try code succeeded 2019-07-20T11:11:49 Dropped triggers OK. # pt_online_schema_change:9159 6891 Clean up new table # TableParser:3445 6891 Checking `test`.`_t_prnt_new` # TableParser:3451 6891 SHOW TABLES FROM `test` LIKE '\_t\_prnt\_new' # TableParser:3462 6891 Table does not exist # pt_online_schema_change:9165 6891 New table exists: no # pt_online_schema_change:9036 6891 Clean up done, report if orig table was altered Altered `test`.`t_prnt` but there were errors or warnings. Error updating foreign key constraints: 2019-07-20T11:11:48 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "ALTER TABLE `test`.`t_chld` DROP FOREIGN KEY `t_chld_ibfk_1`, ADD CONSTRAINT `_t_chld_ibfk_1` FOREIGN KEY (`prnt_id`) REFERENCES `test`.`t_prnt` (`prnt_id`) ON DELETE CASCADE"] at ./pt-online-schema-change line 10916. # NibbleIterator:6175 6891 Finish nibble_sth # NibbleIterator:6175 6891 Finish explain_nibble_sth # Cxn:4096 6891 Destroying cxn # Cxn:4106 6891 DBI::db=HASH(0x5649d74176b0) Disconnecting dbh on 2228f53cb7c8 # Cxn:4096 6891 Destroying cxn # Cxn:4106 6891 DBI::db=HASH(0x5649d7410848) Disconnecting dbh on 2228f53cb7c8
At the end there are:
_t_prnt_old old table
t_chld references to the old table: CONSTRAINT `t_chld_ibfk_1` FOREIGN KEY (`prnt_id`) REFERENCES `_t_prnt_old` (`prnt_id`) ON DELETE CASCADE
Due to online nature of pt-osc, the application could continue rely on delete cascade:
Unfortunately
delete from t_prnt where prnt_id=1;
will not delete rows from t_chld after rename stage.
At the same time values inserted during rebuild constraints stage are not going to the old table.
Thus in order to recover after such faults:
1. stop any update against parent and child tables
2. investigate every delete and update statement in binary logs against t_prnt and apply same changes against _t_prnt_old
Possible fixes:
Do not rename _t_prnt_new to t_prnt until rebuild finished:
Application will get errors, but new and old tables will be exactly the same.