Details
-
Improvement
-
Status: Done
-
High
-
Resolution: Done
-
2.10.1
-
2
-
Platform Sprint 28
-
Yes
-
Yes
-
Yes
Description
Even when the table stats collectors are disabled due to number of tables being greater than 1000, the perf_schema.file_instances collector is still used, which adds a lot of overhead to the low resolution metrics exporter. This is making it give timeout for all metrics for that resolution (see attached image from the /prometheus/targets page).
SHELL> pmm-admin add mysql --host=127.0.0.1 --port=45888 --username=root --password=msandbox tablestats_test1 MySQL Service added. Service ID : /service_id/a9b52245-f9c1-47c2-befb-c19ce34281ff Service name: tablestats_test1Table statistics collection disabled (the limit is 1000, the actual table count is 145338).
If we check times with curl commands, we have the following. For all collectors as they are now (again, with table stats disabled):
SHELL> curl -u pmm:/agent_id/ed5ade4a-9df2-45b0-8b8c-6f696f2cdfac "http://10.30.8.141:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements&collect%5B%5D=perf_schema.file_instances" > /tmp/curl-mysqld-exporter-lr.out
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 247M 0 247M 0 0 13.4M 0 --:--:-- 0:00:18 --:--:-- 76.1M
It takes 18 seconds, which makes it time out due to global low resolution timeout set to 10s.
If we don't use the perf_schema.file_instances collector, we get:
SHELL> curl -u pmm:/agent_id/ed5ade4a-9df2-45b0-8b8c-6f696f2cdfac "http://10.30.8.141:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements" > /tmp/curl-mysqld-exporter-lr-no-file_instances.out
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 48.6M 0 48.6M 0 0 13.7M 0 --:--:-- 0:00:03 --:--:-- 13.7M
It takes only 3 seconds.
Moreover, we are not currently using these metrics for any dashboard. Test with random metric, to validate procedure:
SHELL> sqlite3 grafana.db "select title, data from dashboard" | grep -c -i file_events_seconds_total 2 SHELL> sqlite3 grafana.db "select title, data from dashboard" | grep -i file_events_seconds_total | cut -d '|' -f 1 MySQL Performance Schema Details MySQL InnoDB Details
This metric is being used in two dashboards. On the other hand:
SHELL> sqlite3 grafana.db "select title, data from dashboard" | grep -c -i file_instance
0
There are no metrics with "file_instance" being used in any stock PMM dashboard.
For now, it should be completely safe to disable this collector, and it will guarantee that other important low resolution metrics that are indeed being used by dashboards to be successfully collected.
Another potential fix here is to extend the timeout for low resolution metrics to something like 30 seconds, although it seems that the time it takes depends on the actual amount of tables present in that MySQL instance. I tested with around 145,000 tables (as shown above), but we've seen systems with 250,000 in which it takes around 28 seconds to collect data.
Proposed solution:
Consider to include perf_schema.file_instances collector in the tablestat group
How to test:
- Setup DB with tables count more than 1000 (ideally more than 100000)
- Add the DB to pmm via pmm-admin
- metrics provided by perf_schema.file_instances collector should not be visible in grafana explore or prometheus.
- list of metrics TBD
When --collect.perf_schema.file_instances is enabled, in exporter's output browsing at http://localhost:9104/metrics-lr there are two metrics:
mysql_perf_schema_file_instances_bytes and mysql_perf_schema_file_instances_total.
Example:
# HELP mysql_perf_schema_file_instances_bytes The number of bytes processed by file read/write operations. # TYPE mysql_perf_schema_file_instances_bytes counter mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="read"} 0 mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="write"} 0 mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="read"} 0 mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="write"} 0 mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/general_log.CSM",mode="read"} 35
# HELP mysql_perf_schema_file_instances_total The total number of file read/write operations. # TYPE mysql_perf_schema_file_instances_total counter mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="read"} 0 mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="write"} 0 mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="read"} 0 mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="write"} 0 mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/general_log.CSM",mode="read"} 1 mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/general_log.CSM",mode="write"} 1 mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/slow_log.CSM",mode="read"} 1
After making the changes requested in this ticket, those metrics shouldn't appear by default.
Attachments
Issue Links
- relates to
-
PMM-7227 Table stats metrics not being collected in instances with millions of tables
-
- Done
-