Details
- Type: Improvement
- Status: Done
- Priority: Medium
- Resolution: Done
- Component: [obsolete] C/S Core
Description
A PoC for vacuum monitoring was implemented in PMM-10244.
Everyone can take a look at the changes and test them in PMM using the following images:
- server docker: perconalab/pmm-server-fb:PR-2643-72e8e6e
- client docker: perconalab/pmm-client-fb:PR-2643-72e8e6e
- client: https://s3.us-east-2.amazonaws.com/pmm-build-cache/PR-BUILDS/pmm2-client/pmm2-client-PR-2643-72e8e6e.tar.gz
- Create Staging Instance: https://pmm.cd.percona.com/job/aws-staging-start/parambuild/?DOCKER_VERSION=perconalab/pmm-server-fb:PR-2643-72e8e6e&CLIENT_VERSION=https://s3.us-east-2.amazonaws.com/pmm-build-cache/PR-BUILDS/pmm2-client/pmm2-client-PR-2643-72e8e6e.tar.gz
My inputs for further steps
I think it would be good to have:
- A collector for these metrics, with additional parameters for databases that have more than 1000 tables (see the sketch below).
- Labels like oid, table, schema, and database can produce high-cardinality issues in the future on large-scale databases with many tables, or on many clusters with many tables.
We also discussed that the Postgres exporter needs to handle PG connections more gently, because PostgreSQL creates one process per connection, so with 50 databases the exporter would need to open 50 connections (I'm not sure about this point).
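To make the first point above more concrete, here is a minimal, hypothetical sketch (not the actual PMM/postgres_exporter code) of a capped collector that reads per-table vacuum counters from pg_stat_user_tables and bounds the number of per-table series it produces. The DSN, the tableCap constant, and the use of lib/pq are assumptions for illustration only.

```go
// Hypothetical sketch: a capped per-table vacuum-stats collector.
// The DSN, tableCap, and driver choice are illustrative assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// tableCap limits the number of per-table series, mirroring the
// "more than 1000 tables" concern above.
const tableCap = 1000

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Only the tableCap tables with the most dead tuples are read;
	// the rest could be aggregated into a single "other" series.
	rows, err := db.Query(`
		SELECT schemaname, relname, n_dead_tup, autovacuum_count, last_autovacuum
		FROM pg_stat_user_tables
		ORDER BY n_dead_tup DESC
		LIMIT $1`, tableCap)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var schema, table string
		var deadTuples, autovacuumCount int64
		var lastAutovacuum sql.NullTime // may be NULL for never-vacuumed tables
		if err := rows.Scan(&schema, &table, &deadTuples, &autovacuumCount, &lastAutovacuum); err != nil {
			log.Fatal(err)
		}
		// In a real collector these would become metrics with
		// schema/table labels instead of log lines.
		fmt.Printf("%s.%s dead_tup=%d autovacuum_count=%d last_autovacuum=%v\n",
			schema, table, deadTuples, autovacuumCount, lastAutovacuum.Time)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

In a real exporter, tables beyond the cap could be rolled up into one aggregated series so the schema/table label cardinality stays bounded.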
Jobin's inputs
When it comes to autovacuum, the ultimate source of information is the PostgreSQL logs.
Generally, every PostgreSQL user will set the log_autovacuum_min_duration parameter and capture the details in the PostgreSQL logs.
It is important to understand which tables become too-frequent candidates for autovacuum and what load that creates.
For example,
2022-07-30 16:15:01.497 IST [62272] LOG:  automatic vacuum of table "postgres.public.pgbench_accounts": index scans: 1
    pages: 0 removed, 32787 remain, 0 skipped due to pins, 0 skipped frozen
    tuples: 1000000 removed, 1000000 remain, 0 are dead but not yet removable, oldest xmin: 820290
    index scan needed: 16394 pages from table (50.00% of total) had 1000000 dead item identifiers removed
    index "pgbench_accounts_pkey": pages: 5486 in total, 0 newly deleted, 0 currently deleted, 0 reusable
    index "idx_pgbench_accounts_bid": pages: 1629 in total, 836 newly deleted, 836 currently deleted, 0 reusable
    avg read rate: 101.908 MB/s, avg write rate: 114.909 MB/s
    buffer usage: 59385 hits, 36348 misses, 40985 dirtied
    WAL usage: 73567 records, 828 full page images, 16119666 bytes
    system usage: CPU: user: 0.82 s, system: 0.40 s, elapsed: 2.78 s
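As a rough illustration of how a log line like the one above could feed metrics, here is a small, hypothetical Go sketch that pulls the table name, removed/remaining tuples, and elapsed time out of such a line with regular expressions. It assumes log_autovacuum_min_duration is set low enough for the line to be written at all; the regexes and the chosen fields are illustrative, not an existing PMM parser.

```go
// Hypothetical sketch: extract a few fields from an autovacuum log line.
// The regexes and chosen fields are illustrative only.
package main

import (
	"fmt"
	"regexp"
)

var (
	tableRe   = regexp.MustCompile(`automatic vacuum of table "([^"]+)"`)
	removedRe = regexp.MustCompile(`tuples: (\d+) removed, (\d+) remain`)
	elapsedRe = regexp.MustCompile(`elapsed: ([0-9.]+) s`)
)

func main() {
	// Shortened version of the log entry shown above.
	line := `automatic vacuum of table "postgres.public.pgbench_accounts": index scans: 1 ` +
		`tuples: 1000000 removed, 1000000 remain, 0 are dead but not yet removable ` +
		`system usage: CPU: user: 0.82 s, system: 0.40 s, elapsed: 2.78 s`

	if m := tableRe.FindStringSubmatch(line); m != nil {
		fmt.Println("table:", m[1])
	}
	if m := removedRe.FindStringSubmatch(line); m != nil {
		fmt.Println("tuples removed:", m[1], "remain:", m[2])
	}
	if m := elapsedRe.FindStringSubmatch(line); m != nil {
		fmt.Println("elapsed seconds:", m[1])
	}
}
```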
Charly's inputs
- We need to be careful with the load that will be added to the database we are monitoring. A few comments above, databases with a large number of tables are mentioned, but there are other cases, for example an instance with many databases. Postgres requires one connection per database, so the more databases in the same instance, the more connections PMM will open. This is a global problem for PMM and PG, but it can be even more harmful here;
- Some queries can take longer depending on the number of objects (schemas, tables, etc.) in the database, and the 1k-table cap may not solve the issue;
- This is a great start for vacuum monitoring, but we need to start thinking about other sources of information, for example the one mentioned by Jobin, the log files. It may be for another PoC or a feature update, but is it on the roadmap?
- Configuration and granularity: would it be possible for the user to choose different granularity for the collected information? For example, would it be possible to enable the collector only for one specific database (similar to schema in MySQL), and/or disable it for a list of them (see the sketch below)? Maybe add this as a future improvement?
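On the granularity question in the last point, one possible shape is an allow/deny list of databases that decides where the collector runs (and therefore which databases get an extra connection). The flag names and the filtering approach below are assumptions for illustration, not existing PMM or postgres_exporter options.

```go
// Hypothetical sketch: per-database granularity for the vacuum collector.
// Flag names and the allow/deny semantics are made up for illustration.
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseCSV turns "a,b,c" into a set; empty input yields an empty set.
func parseCSV(s string) map[string]bool {
	set := map[string]bool{}
	for _, name := range strings.Split(s, ",") {
		if name = strings.TrimSpace(name); name != "" {
			set[name] = true
		}
	}
	return set
}

type dbFilter struct {
	allow map[string]bool // if non-empty, only these databases are collected
	deny  map[string]bool // always skipped
}

func (f dbFilter) enabled(db string) bool {
	if f.deny[db] {
		return false
	}
	if len(f.allow) == 0 {
		return true // no allow list: collect from every database
	}
	return f.allow[db]
}

func main() {
	allow := flag.String("vacuum.collect.databases", "", "comma-separated allow list of databases (hypothetical flag)")
	deny := flag.String("vacuum.skip.databases", "", "comma-separated deny list of databases (hypothetical flag)")
	flag.Parse()

	f := dbFilter{allow: parseCSV(*allow), deny: parseCSV(*deny)}
	for _, db := range []string{"postgres", "app", "reporting"} {
		// Only databases that pass the filter would get a connection and
		// the per-table vacuum queries.
		fmt.Printf("collect vacuum stats from %q: %v\n", db, f.enabled(db))
	}
}
```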