Pt-archiver --charset option is not working for MySQL8.0
Description
Environment
AFFECTED CS IDs
Attachments
- 25 Jul 2022, 03:48 AM
- 25 Jul 2022, 03:01 AM
- 25 Jul 2022, 03:01 AM
Smart Checklist
Activity
Sveta Smirnova December 7, 2023 at 4:50 PM
Case with ANSI_QUOTES should be fixed by https://jira.percona.com/browse/PT-2207. Please check and open separate bug if the issue still exists for you. You need to use version 3.5.6 (not released) or download tool from GitHub to test.
Jaime Sicam May 31, 2023 at 4:07 AM
This is the code that forces utf8mb4 on MySQL 8.0 even if you change the charset parameter:
my ($dbh_version) = $dbh->selectrow_array("SELECT version()");
#if ($dbh_version =~ m/^(\d+\.\d+)\.\d+.*/ && $1 ge '8.0' && !$o->get('charset')) {
if ($dbh_version =~ m/^(\d+\.\d+)\.\d+.*/ && $1 ge '8.0') {
PTDEBUG && _d("MySQL 8.0+ detected and charset was not specified.\n Setting character_set_client = utf8mb4 and --charset=utf8");
$dbh->do('/*!40101 SET character_set_connection = utf8mb4 */;');
$o->set('charset', 'utf8');
}
If the commented #if statement is used then the chosen charset set for MySQL 8.0 could be used.
Aaditya Dubey August 12, 2022 at 12:47 PM
Hi @Yijian Zhang ,
Thank you for the report.
I have tried debug the issue and looks like it is not taking charset even it is specified, please find my test case below:
mysql [localhost:8031] {root} (test) > select version(); -> Upstream
+-----------+
| version() |
+-----------+
| 8.0.29 |
+-----------+
1 row in set (0.00 sec)
mysql [localhost:8031] {msandbox} ((none)) > use test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql [localhost:8031] {msandbox} (test) > drop table t1;
Query OK, 0 rows affected (0.08 sec)
mysql [localhost:8031] {msandbox} (test) > create table t1(id int key, year Year);
Query OK, 0 rows affected (0.02 sec)
mysql [localhost:8031] {msandbox} (test) > insert into t1(id,year) values(1,'0000'),(2,'2010'),(3,'2016'),(4,'2017'),(5,'2018'),(6,'2020'),(7,'2020'),(8,'0000'),(9,'0000'),(10,'2022');
Query OK, 10 rows affected (0.01 sec)
Records: 10 Duplicates: 0 Warnings: 0
Executed pt-archive
$ PTDEBUG=1 ./pt-archiver --source h=127.0.0.1,D=test,u=xx,p=xx,t=t1,P=8031 --where '1=1' --purge --charset latin1 2>&1 | grep -i detected -C 5
# Variable_name => 'character_sets_dir'
# }
# ];
# $DBD::mysql::VERSION: 4.023 $DBI::VERSION: 1.627
# pt_archiver:6210 14649 Inspecting table on A=latin1,D=test,P=8031,h=127.0.0.1,p=...,t=t1,u=xx
# pt_archiver:6224 14649 MySQL 8.0+ detected and charset was not specified.
# Setting character_set_client = utf8mb4 and --charset=utf8
# VersionParser:2853 14649 VersionParser got a dbh, trying to get the version
# VersionParser:2901 14649 InnoDB support: DEFAULT
# VersionParser:2913 14649 InnoDB version: 8.0.29
# TableParser:1912 14649 /*!40101 SET @OLD_SQL_MODE := @@SQL_MODE, @@SQL_MODE := '', @OLD_QUOTE := @@SQL_QUOTE_SHOW_CREATE, @@SQL_QUOTE_SHOW_CREATE := 1 */
When Modified Character set to latin1
mysql [localhost:8031] {msandbox} (test) > ALTER DATABASE test CHARACTER SET latin1;
Query OK, 1 row affected (0.01 sec)
mysql [localhost:8031] {msandbox} (test) > ALTER TABLE t1 CONVERT TO CHARACTER SET latin1;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql [localhost:8031] {msandbox} (test) > SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "test";
+----------------------------+
| DEFAULT_CHARACTER_SET_NAME |
+----------------------------+
| latin1 |
+----------------------------+
1 row in set (0.00 sec)
mysql [localhost:8031] {msandbox} (test) > SELECT b.character_set_name FROM information_schema.`TABLES` a,information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` b
-> WHERE b.collation_name = a.table_collation AND a.table_schema = "test" AND a.table_name = "t1";
+--------------------+
| CHARACTER_SET_NAME |
+--------------------+
| latin1 |
+--------------------+
1 row in set (0.01 sec)
$ PTDEBUG=1 ./pt-archiver --source h=127.0.0.1,D=test,u=xx,p=xx,t=t1,P=8031 --where '1=1' --purge --charset latin1
# /usr/bin/perl 5.016003
# Linux xx-xx.xx.xx.xx.xx 5.1.7-1.el7.elrepo.x86_64 #1 SMP Tue Jun 4 09:15:25 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
# Arguments: _[--source]_ _[h=127.0.0.1,D=test,u=xx,p=xx,t=t1,P=8031]_ _[--where]_ _[1=1]_ _[--purge]_ _[--charset]_ _[latin1]_
# OptionParser:930 45700 Option rule: Specify at least one of --dest, --file, or --purge.
...
# pt_archiver:6264 45700 SELECT CONCAT(/*!40100 @@session.character_set_connection, */ "")
Character set mismatch: --source DSN uses utf8mb4, table uses latin1. You can disable this check by specifying --no-check-charset.
Character set mismatch: --source DSN uses utf8mb4, table uses latin1. You can disable this check by specifying --no-check-charset.
--charset=latin1 is not working because pt-archiver force to SET NAMES utf8mb4 (related code release I found).May I know why pt-archiver is doing this and is there any workaround for tables using latin1 in MySQL version 8 (is it safe to comment this part of code?)
--[no]check-charset
default: yes
Ensure connection and table character sets are the same. Disabling this check may cause text to be erroneously converted from one character set to another (usually from utf8 to latin1) which may cause data loss or mojibake. Disabling this check may be useful or necessary when character set conversions are intended.
Apart from the above question, I also found when ANSI sql_mode is enabled, --check-charset failed because of `"` is recognised as an identifier. Is it safe to replace the double quote with single quote? and is there any potential issue when using pt-archiver with ANSI enabled?
ANSI
Equivalent to REAL_AS_FLOAT, PIPES_AS_CONCAT, ANSI_QUOTES, IGNORE_SPACE, and ONLY_FULL_GROUP_BY.
ANSI_QUOTES
Treat " as an identifier quote character (like the ` quote character) and not as a string quote character. You can still use ` to quote identifiers with this mode enabled. With ANSI_QUOTES enabled, you cannot use double quotation marks to quote literal strings because they are interpreted as identifiers.
Sending the concern to engineering for further review and updates.
Yijian Zhang July 25, 2022 at 3:48 AM
And we found that pt-archiver will force to use utf8mb4, even if you already specified --charset=latin1. May I know why remove the original logical && !$o -> get('charset') ?
Source and destination MySQL:
(root@localhost) [(none)]>\s -------------- mysql Ver 8.0.29 for macos12.4 on arm64 (Homebrew)(root@localhost)
pt-archiver command
pt-archiver --source h=x.x.x.x,P=3306,u=xxx,p='xxx',D=test,t=t --dest h=x.x.x.x,P=3306,u=xxx,p='xxx',D=test_archive,t=t --charset=latin1 --where '1=1' --progress 1000 --limit=1000 --commit-each --statistics --no-delete --replace --run-time=10 --sleep=1 --why-quit Character set mismatch: --source DSN uses utf8mb4, table uses latin1. You can disable this check by specifying --no-check-charset.
--charset=latin1 is not working because pt-archiver force to SET NAMES utf8mb4 (related code release I found).May I know why pt-archiver is doing this and is there any workaround for tables using latin1 in MySQL version 8 (is it safe to comment this part of code?)
Apart from the above question, I also found when ANSI sql_mode is enabled, --check-charset failed because of `"` is recognised as an identifier. Is it safe to replace the double quote with single quote? and is there any potential issue when using pt-archiver with ANSI enabled?
As attched Screenshot 2022-07-25 at 10.59.54 AM.png
Thanks for your help.