[obsolete] C/S Core
A VM with PMM client installed is replicated in a cloud environment. The PMM client runs in the replica with the same configuration data as the original VM, so pmm-managed detects a duplicate agent ID. This prevents any new resources from being added for monitoring, and the OOM killer eventually kills pmm-managed.
- Deadlock in pmm-managed; no new agents are able to register. (Check attachment)
- pmm-managed accumulates incoming connection requests and is killed by the OOM killer.
- The PMM dashboard's inventory shows the "No agents Available" message and returns HTTP 504. (Check attachment)
The following steps allow reproduction of the reported issue using Docker:
1. Run 3 pmm-client containers (good-client-1, bad-client-1, good-client-2) and 1 pmm-server container (use attached docker-compose.yml)
2. Generate configuration for pmm-agent on every client container
3. Update the generated configuration in bad-client-1 so that it uses the same agent ID as good-client-1
4. Run pmm-agent on all client containers, in the given order (good-client-1, bad-client-1, good-client-2)
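Step 3 can be scripted instead of edited by hand. The following is a minimal sketch, assuming the generated configuration is a YAML file whose agent ID sits on a top-level `id:` line; that key name, the helper names `getAgentID` and `setAgentID`, and the sample file contents are assumptions made for illustration and are not taken from the attached docker-compose.yml.

```go
package main

import (
	"fmt"
	"strings"
)

// getAgentID returns the value of the first top-level "id:" line in config.
// The "id:" key name is an assumption about the generated file layout.
func getAgentID(config string) (string, bool) {
	for _, line := range strings.Split(config, "\n") {
		if strings.HasPrefix(line, "id:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "id:")), true
		}
	}
	return "", false
}

// setAgentID returns config with every top-level "id:" line rewritten to
// carry id. The second result reports whether such a line was found.
func setAgentID(config, id string) (string, bool) {
	lines := strings.Split(config, "\n")
	found := false
	for i, line := range lines {
		if strings.HasPrefix(line, "id:") {
			lines[i] = "id: " + id
			found = true
		}
	}
	return strings.Join(lines, "\n"), found
}

func main() {
	// Made-up stand-ins for the two generated configuration files.
	good := "id: agent-good-1\nother-setting: value\n"
	bad := "id: agent-bad-1\nother-setting: value\n"

	// Copy the agent ID of good-client-1 into the bad-client-1 configuration.
	id, ok := getAgentID(good)
	if !ok {
		fmt.Println("no agent ID found in good-client-1 configuration")
		return
	}
	bad, _ = setAgentID(bad, id)
	fmt.Print(bad)
}
```

In the Docker setup, the two strings would be read from and written back to the configuration files inside the respective containers.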
Actual result: pmm-agent registers and starts exporters on good-client-1 but fails on bad-client-1 and good-client-2.
Expected result: pmm-agent registers and starts exporters on all client containers.
Agent registration logic for duplicate ID executes as follows:
Before starting pmm-agent on a cloned VM, the PMM client should be re-initialised to avoid reusing the original agent ID.
Scenario: P and Q are two PMM agents with the same agent ID. P is already connected to the PMM server when Q tries to connect to it.
|ACCEPTED|Server kicks P and registers Q| |
|REJECTED|Server rejects registration attempt by Q|This doesn't handle the case when an agent goes bad and has a dangling connection on PMM server's end|
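The accepted behaviour (the server kicks P and registers Q) can be sketched as a small in-memory registry. This is only an illustration under stated assumptions, not the pmm-managed implementation: the `conn` and `registry` types and their methods are hypothetical names, and a real server would close a gRPC stream rather than a channel.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for one agent's connection to the server.
type conn struct {
	name   string        // label used in the demo, e.g. "P" or "Q"
	once   sync.Once     // makes kick safe to call more than once
	closed chan struct{} // closed when the server kicks this connection
}

func newConn(name string) *conn {
	return &conn{name: name, closed: make(chan struct{})}
}

// kick signals the connection's owner that it has been replaced.
func (c *conn) kick() {
	c.once.Do(func() { close(c.closed) })
}

// registry maps agent IDs to the connection currently registered for them.
type registry struct {
	mu     sync.Mutex
	agents map[string]*conn
}

func newRegistry() *registry {
	return &registry{agents: make(map[string]*conn)}
}

// register stores c under agentID. If a different connection already holds
// that ID, it is kicked and returned, so the newcomer always wins.
func (r *registry) register(agentID string, c *conn) (kicked *conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if old, ok := r.agents[agentID]; ok && old != c {
		old.kick()
		kicked = old
	}
	r.agents[agentID] = c
	return kicked
}

// current returns the connection registered for agentID, or nil if none.
func (r *registry) current(agentID string) *conn {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.agents[agentID]
}

func main() {
	r := newRegistry()
	p, q := newConn("P"), newConn("Q")

	r.register("agent-1", p) // P connects first
	if kicked := r.register("agent-1", q); kicked != nil {
		fmt.Printf("kicked %s, registered %s\n", kicked.name, r.current("agent-1").name)
	}
}
```

Because the newcomer always wins, the same code path covers both a cloned agent and an agent reconnecting while its old connection is still dangling on the server.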
Analysis: Two agents sharing the same agent ID is considered the less likely scenario. A network issue that disconnects an agent from the server, while the server remains unaware of the broken connection, is considered more likely.
The rejected proposal doesn't handle that network issue automatically. Currently, a disconnected agent retains no data, hence we want it to reconnect to the PMM server ASAP.
- Investigate usage of the keepalive mechanism in gRPC to handle network issues
- Build support in pmm-agent to execute a termination request from the PMM server