Uploaded image for project: 'Percona Monitoring and Management'
  1. Percona Monitoring and Management
  2. PMM-6744

Prevent timeout of low resolution metrics in MySQL instances with many tables (~1000's)

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Done
    • Priority: High
    • Resolution: Done
    • Affects Version/s: 2.10.1
    • Fix Version/s: 2.12.0
    • Component/s: PMM ManageD
    • Labels:
    • Story Points:
      2
    • Sprint:
      Platform Sprint 28
    • Needs Review:
      Yes
    • Needs QA:
      Yes
    • Needs Doc:
      Yes

      Description

      Even when the table stats collectors are disabled due to number of tables being greater than 1000, the perf_schema.file_instances collector is still used, which adds a lot of overhead to the low resolution metrics exporter. This is making it give timeout for all metrics for that resolution (see attached image from the /prometheus/targets page).

      SHELL> pmm-admin add mysql --host=127.0.0.1 --port=45888 --username=root --password=msandbox tablestats_test1
      MySQL Service added.
      Service ID  : /service_id/a9b52245-f9c1-47c2-befb-c19ce34281ff
      Service name: tablestats_test1Table statistics collection disabled (the limit is 1000, the actual table count is 145338). 

      If we check times with curl commands, we have the following. For all collectors as they are now (again, with table stats disabled):

      SHELL> curl -u pmm:/agent_id/ed5ade4a-9df2-45b0-8b8c-6f696f2cdfac "http://10.30.8.141:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements&collect%5B%5D=perf_schema.file_instances" > /tmp/curl-mysqld-exporter-lr.out
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  247M    0  247M    0     0  13.4M      0 --:--:--  0:00:18 --:--:-- 76.1M 

      It takes 18 seconds, which makes it time out due to global low resolution timeout set to 10s.

      If we don't use the perf_schema.file_instances collector, we get:

      SHELL> curl -u pmm:/agent_id/ed5ade4a-9df2-45b0-8b8c-6f696f2cdfac "http://10.30.8.141:42000/metrics?collect%5B%5D=binlog_size&collect%5B%5D=custom_query.lr&collect%5B%5D=engine_tokudb_status&collect%5B%5D=global_variables&collect%5B%5D=heartbeat&collect%5B%5D=info_schema.clientstats&collect%5B%5D=info_schema.innodb_tablespaces&collect%5B%5D=info_schema.userstats&collect%5B%5D=perf_schema.eventsstatements" > /tmp/curl-mysqld-exporter-lr-no-file_instances.out
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100 48.6M    0 48.6M    0     0  13.7M      0 --:--:--  0:00:03 --:--:-- 13.7M 

      It takes only 3 seconds.

      Moreover, we are not currently using these metrics for any dashboard. Test with random metric, to validate procedure:

      SHELL> sqlite3 grafana.db "select title, data from dashboard" | grep -c -i file_events_seconds_total
      2
      
      SHELL> sqlite3 grafana.db "select title, data from dashboard" | grep -i file_events_seconds_total | cut -d '|' -f 1
      MySQL Performance Schema Details
      MySQL InnoDB Details 

      This metric is being used in two dashboards. On the other hand:

      SHELL> sqlite3 grafana.db "select title, data from dashboard" | grep -c -i file_instance
      0

      There are no metrics with "file_instance" being used in any stock PMM dashboard.

      For now, it should be completely safe to disable this collector, and it will guarantee that other important low resolution metrics that are indeed being used by dashboards to be successfully collected.

      Another potential fix here is to extend the timeout for low resolution metrics to something like 30 seconds, although it seems that the time it takes depends on the actual amount of tables present in that MySQL instance. I tested with around 145,000 tables (as shown above), but we've seen systems with 250,000 in which it takes around 28 seconds to collect data.

       

      Proposed solution:

      Consider to include perf_schema.file_instances collector in the tablestat group 

      see https://confluence.percona.com/display/PMM/Implement+Three+Scraping+Intervals+for+all+Exporters#ImplementThreeScrapingIntervalsforallExporters-CollectorsMatrix

       

      How to test:

      1. Setup DB with tables count more than 1000 (ideally more than 100000)
      2. Add the DB to pmm via pmm-admin
      3. metrics provided by perf_schema.file_instances collector should not be visible in grafana explore or prometheus.
        1. list of metrics TBD

       

      When --collect.perf_schema.file_instances is enabled, in exporter's output browsing at http://localhost:9104/metrics-lr there are two metrics:
      mysql_perf_schema_file_instances_bytes and mysql_perf_schema_file_instances_total.

      Example:

      # HELP mysql_perf_schema_file_instances_bytes The number of bytes processed by file read/write operations.
      # TYPE mysql_perf_schema_file_instances_bytes counter
      mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="read"} 0
      mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="write"} 0
      mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="read"} 0
      mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="write"} 0
      mysql_perf_schema_file_instances_bytes{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/general_log.CSM",mode="read"} 35 
      # HELP mysql_perf_schema_file_instances_total The total number of file read/write operations.
      # TYPE mysql_perf_schema_file_instances_total counter
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="read"} 0
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/general_log.CSV",mode="write"} 0
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="read"} 0
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/data",file_name="/tmp/12345/data/mysql/slow_log.CSV",mode="write"} 0
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/general_log.CSM",mode="read"} 1
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/general_log.CSM",mode="write"} 1
      mysql_perf_schema_file_instances_total{event_name="wait/io/file/csv/metadata",file_name="/tmp/12345/data/mysql/slow_log.CSM",mode="read"} 1 

      After making the changes requested in this ticket, those metrics shouldn't appear by default.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              andrii.skomorokhov Andrii Skomorokhov
              Reporter:
              agustin.gallego Agustín Gallego
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - Not Specified
                  Not Specified
                  Logged:
                  Time Spent - 1 hour
                  1h

                    Smart Checklist