Details
- Type: Bug
- Status: Done
- Priority: High
- Resolution: Fixed
- Affects Version: 2.25.0
- Component: C/S Core
Description
Reported scenario
A VM with the PMM client installed is replicated in a cloud environment. The PMM client on the replica runs with the same configuration data as the original VM, so pmm-managed detects a duplicate agent ID. This causes a failure to add any new resources for monitoring, and pmm-managed is eventually killed by the OOM killer.
User impact
- Deadlock in pmm-managed; no new agents are able to register. (Check attachment)
- pmm-managed accumulates incoming connection requests and is eventually killed by the OOM killer.
- The PMM dashboard's Inventory page shows a "No agents available" message and returns HTTP 504. (Check attachment)
Steps to reproduce
The following steps allow reproduction of the reported issue using Docker:
1. Run 3 pmm-client containers (good-client-1, bad-client-1, good-client-2) and 1 pmm-server container (use attached docker-compose.yml)
2. Generate configuration for pmm-agent on every client container
pmm-agent setup --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml --server-address=pmm-server --server-insecure-tls --server-username=admin --server-password=admin
3. Update the generated configuration on bad-client-1 so that it has the same agent ID as the one on good-client-1 (see the illustrative snippet after these steps)
4. Run pmm-agent on all client containers, in the given order (good-client-1, bad-client-1, good-client-2)
pmm-agent --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml
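
For step 3, the field to duplicate is the agent ID stored at the top of pmm-agent.yaml on each client. The value below is made up and only illustrates that bad-client-1 must end up with the identical ID that good-client-1 received during setup:

id: /agent_id/00000000-0000-0000-0000-000000000001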
Actual result
pmm-agent registers and starts exporters on good-client-1 but fails on bad-client-1 and good-client-2.
Expected result
pmm-agent registers and starts exporters on all client containers.
Root Cause
The agent registration logic for a duplicate ID executes as follows:
register (acquires lock) -> register calls Kick -> Kick calls unregister -> unregister (tries to acquire the same lock)
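
The following minimal Go sketch illustrates the cycle. The registry struct and method bodies are assumptions for illustration only, not the actual pmm-managed code, but they show how a non-reentrant sync.Mutex deadlocks when register -> Kick -> unregister tries to lock it a second time:

package main

import "sync"

// registry is a stand-in for the structure guarding connected agents.
type registry struct {
	mu     sync.Mutex
	agents map[string]chan struct{}
}

// register adds an agent and, on a duplicate ID, kicks the existing one.
func (r *registry) register(agentID string) {
	r.mu.Lock() // lock acquired here ...
	defer r.mu.Unlock()
	if _, ok := r.agents[agentID]; ok {
		r.kick(agentID) // duplicate agent ID path
	}
	r.agents[agentID] = make(chan struct{})
}

// kick asks the old connection to go away by unregistering it.
func (r *registry) kick(agentID string) {
	r.unregister(agentID)
}

// unregister removes the agent and also needs the registry lock.
func (r *registry) unregister(agentID string) {
	r.mu.Lock() // ... and requested again here: sync.Mutex is not reentrant, so this blocks forever
	defer r.mu.Unlock()
	delete(r.agents, agentID)
}

func main() {
	r := &registry{agents: map[string]chan struct{}{}}
	r.register("agent-1")
	r.register("agent-1") // second registration with the same ID deadlocks
}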
Solution
Workaround
Before starting pmm-agent on a cloned VM, the PMM client should be re-initialised so that the duplicate agent ID is not reused (see the example below).
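
In the Docker reproduction above, this amounts to re-running the setup command from step 2 on the clone before starting the agent; pmm-agent setup registers with the server again and writes a fresh agent ID into the configuration file (depending on the environment, a new node name or the --force option may also be needed if the clone reuses the original node name):

pmm-agent setup --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml --server-address=pmm-server --server-insecure-tls --server-username=admin --server-password=admin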
Fix
Scenario: P and Q are two PMM agents with the same agent ID. P is connected to the PMM server when Q tries to connect to the server.
Status | Proposal | Pros | Cons |
---|---|---|---|
ACCEPTED | Server kicks P and registers Q | | |
REJECTED | Server rejects registration attempt by Q | | This doesn't handle the case when an agent goes bad and has a dangling connection on the PMM server's end |
Analysis: The scenario in which two agents end up with the same agent ID is considered less likely. The scenario in which a network issue causes an agent to disconnect from the server while the server remains unaware of the broken connection is considered more likely.
The rejected proposal doesn't handle the network-issue case automatically, which is the more likely one. Currently, a disconnected agent retains no data, hence we want it to reconnect to the PMM server as soon as possible.
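
For illustration, one way the accepted proposal can be implemented without re-entering the registration lock is sketched below. This uses the same assumed registry structure as the root-cause sketch and shows the shape of the fix, not the actual pmm-managed change:

package main

import "sync"

type registry struct {
	mu     sync.Mutex
	agents map[string]chan struct{}
}

// register kicks any existing agent (P) with the same ID while the lock is
// already held, then registers the new agent (Q); nothing re-acquires r.mu.
func (r *registry) register(agentID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if old, ok := r.agents[agentID]; ok {
		close(old)                // signal the old connection to shut down
		delete(r.agents, agentID) // remove it without calling back into unregister
	}
	r.agents[agentID] = make(chan struct{})
}

func main() {
	r := &registry{agents: map[string]chan struct{}{}}
	r.register("agent-1")
	r.register("agent-1") // second registration now kicks the first instead of deadlocking
}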
Future improvements
- Investigate using the gRPC keepalive mechanism to handle network issues (see the sketch after this list)
- Build support in pmm-agent to execute a termination request from the PMM server
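
A minimal sketch of server-side gRPC keepalive is shown below. The interval values are assumptions, but grpc.KeepaliveParams and keepalive.ServerParameters are the standard google.golang.org/grpc API for detecting dead connections such as the dangling one described above:

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		// Ping idle agent connections periodically and drop those that do not
		// answer, so a stale entry does not block re-registration.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    2 * time.Minute,  // send a keepalive ping after this much inactivity
			Timeout: 20 * time.Second, // close the connection if the ping is not acknowledged
		}),
	)
}

func main() {
	_ = newServer()
}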
Acceptance Criteria
Scenario | Adding a new instance to the PMM server via pmm-admin |
When | The new instance reuses an existing agent ID |
And | Another agent is connected to the server with the same agent ID |
Then | The new and old agents both compete for the connection with the PMM server |
And | Both agents get kicked repeatedly (Check attachment) |

Scenario | Adding a new instance to the PMM server via pmm-admin |
When | The new instance uses a unique agent ID |
And | Two agents with the same agent ID are competing to maintain a connection with the PMM server |
Then | The new agent with the unique agent ID is connected to the PMM server |
And | The new agent is listed in the Inventory dashboard on the PMM server |
And | The new agent passes metrics to the PMM server |
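
As a quick manual check for the second scenario, the new agent should also be visible from the client side; assuming pmm-admin is configured inside the client container, listing the local node's services and agents is one way to confirm it:

pmm-admin list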