  Percona Monitoring and Management
  PMM-9413

pmm-managed gets deadlocked when an agent connects with a duplicate agent_id


      Reported scenario

      A VM with PMM client installed is replicated in a cloud environment. The PMM client in the replica runs with the same configuration data as the original VM, so pmm-managed detects a duplicate agent ID. As a result, no new resources can be added for monitoring, and the OOM killer eventually kills pmm-managed.

      User impact

      1. pmm-managed deadlocks; no new agents are able to register. (Check attachment)
      2. pmm-managed accumulates incoming connection requests until it is killed by the OOM killer.
      3. The Inventory page of the PMM dashboard shows a No agents Available message and returns HTTP 504. (Check attachment)

      Steps to reproduce

      The following steps reproduce the reported issue using Docker:

      1. Run 3 pmm-client containers (good-client-1, bad-client-1, good-client-2) and 1 pmm-server container (use attached docker-compose.yml)
      2. Generate configuration for pmm-agent on every client container

      pmm-agent setup --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml --server-address=pmm-server --server-insecure-tls --server-username=admin --server-password=admin

      3. Update the generated configuration in bad-client-1 to use the same agent ID as the one in good-client-1
      4. Run pmm-agent on all client containers, in the given order (good-client-1, bad-client-1, good-client-2)

      pmm-agent --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml

      Actual result

      pmm-agent registers and starts exporters on good-client-1 but fails on bad-client-1 and good-client-2.

      Expected result

      pmm-agent registers and starts exporters on all client containers.

      Root Cause

      Agent registration logic for duplicate ID executes as follows:

      register (acquires lock) -> register calls Kick -> Kick calls unregister -> unregister (tries to acquire the same lock)
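
      The call chain above blocks forever because Go's sync.Mutex is not reentrant: a goroutine that already holds the lock cannot acquire it again. The following is a minimal sketch of the pattern and of one common way to avoid it (a helper that assumes the lock is already held). The names registry, newRegistry and unregisterLocked are hypothetical and are not taken from the pmm-managed source; this is not the actual fix.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the agent registry in pmm-managed.
type registry struct {
	mu     sync.Mutex
	agents map[string]chan struct{} // agent ID -> channel closed when the agent is kicked
}

func newRegistry() *registry {
	return &registry{agents: make(map[string]chan struct{})}
}

// register adds an agent and kicks any existing connection with the same ID.
// It holds mu for the whole call, so it may only use helpers that do not lock.
func (r *registry) register(id string) <-chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()

	if _, ok := r.agents[id]; ok {
		// Calling r.unregister(id) here would block forever:
		// sync.Mutex is not reentrant and mu is already held.
		r.unregisterLocked(id)
	}
	kicked := make(chan struct{})
	r.agents[id] = kicked
	return kicked
}

// unregister is the entry point for callers that do not hold mu.
func (r *registry) unregister(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.unregisterLocked(id)
}

// unregisterLocked assumes the caller already holds mu.
func (r *registry) unregisterLocked(id string) {
	if kicked, ok := r.agents[id]; ok {
		close(kicked) // signal the old connection that it was kicked
		delete(r.agents, id)
	}
}

func main() {
	r := newRegistry()
	p := r.register("agent-1") // P connects
	q := r.register("agent-1") // Q connects with the same ID; P is kicked
	<-p                        // does not block: P's channel was closed by the kick
	_ = q
	fmt.Println("registered agents:", len(r.agents))
}
```

      Splitting the operation into a locking wrapper and a *Locked helper keeps the lock acquisition in one place per call path, so the duplicate-ID branch can remove the old entry without re-entering the mutex.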



      Before invoking the PMM agent on a cloned VM, the PMM client should be re-initialised so that it does not reuse the original agent ID.


      Scenario: P and Q are two PMM agents with the same agent ID. P is already connected to PMM server when Q tries to connect.

      Status: ACCEPTED
      Proposal: Server kicks P and registers Q
      Pros:
      • If P goes bad and Q is a new agent on the same machine, it allows kicking off Q
      • If a network issue breaks the connection between P and the server, P cannot reconnect to the server unless it can remove the existing connection (here P and Q are the same agent)
      Cons:
      • A valid agent connection can be closed by another agent
      • Ripe for malicious use
      • Since the agent retries the connection, Q will later be kicked by P; P and Q are in a livelock and will never be able to reliably pass metrics to the server

      Status: REJECTED
      Proposal: Server rejects registration attempt by Q
      Pros:
      • A duplicate agent ID can be a malicious attempt, which is rejected
      • Control of kicking an agent rests with the server and not with an external agent
      Cons:
      • This doesn't handle the case when an agent goes bad and has a dangling connection on PMM server's end

      Analysis: Two agents sharing the same agent ID is considered the less likely scenario. A network issue that disconnects an agent from the server, while the server remains unaware of the broken connection, is considered more likely.

      The rejected proposal doesn't handle the network issue automatically, even though that is the more likely scenario. Currently, a disconnected agent retains no data, so we want it to reconnect to PMM server as soon as possible.
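
      The trade-off between the two proposals can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the server type, the policy constants and the connect method are hypothetical, and the model reflects just one property from the analysis above, namely that the server sees only the agent ID and cannot tell a reconnecting agent from a clone.

```go
package main

import "fmt"

type policy int

const (
	kickExisting policy = iota // ACCEPTED proposal: kick P, register Q
	rejectNew                  // REJECTED proposal: keep P, refuse Q
)

// server is a hypothetical model of the registration side of PMM server.
type server struct {
	policy policy
	conns  map[string]int // agent ID -> number of the registered connection
	next   int            // last connection number handed out
	kicks  int            // how many connections have been kicked so far
}

func newServer(p policy) *server {
	return &server{policy: p, conns: make(map[string]int)}
}

// connect handles a new connection that presents agentID and reports
// whether it was accepted.
func (s *server) connect(agentID string) bool {
	if _, taken := s.conns[agentID]; taken {
		if s.policy == rejectNew {
			return false // the existing (possibly dangling) connection stays
		}
		s.kicks++ // the existing connection is dropped
	}
	s.next++
	s.conns[agentID] = s.next
	return true
}

func main() {
	// Case A: P is connected, its link breaks silently, and P reconnects.
	for _, p := range []policy{kickExisting, rejectNew} {
		s := newServer(p)
		s.connect("agent-1")
		ok := s.connect("agent-1")
		fmt.Printf("policy=%d reconnect accepted=%v\n", p, ok)
	}

	// Case B: P and Q share an ID and each retries after being kicked.
	s := newServer(kickExisting)
	s.connect("agent-1") // P
	for i := 0; i < 4; i++ {
		s.connect("agent-1") // Q, P, Q, P retry in turn
	}
	fmt.Println("kicks after 4 retries:", s.kicks)
}
```

      In this model the accepted policy lets a silently disconnected agent recover on its own, at the cost that every retry by a duplicate agent kicks the other one, which is the livelock listed under Cons.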

      Future improvements

      • Investigate usage of keepalive mechanism in gRPC to handle network issues
      • Build support in pmm-agent to execute termination request from PMM server 

      Acceptance Criteria

      Case 1. Livelock in agents with duplicate agent ID
      Scenario  Adding a new instance to PMM server via pmm-admin
      When      New instance reuses an existing agent ID
      And       Another agent is connected to server with same agent ID
      Then      New and old agents both compete for connection with PMM server
      And       Both agents get kicked repeatedly (Check attachment)

      Case 2. Adding new agent with unique agent ID
      Scenario  Adding a new instance to PMM server via pmm-admin
      When      New instance uses a unique agent ID
      And       Two agents with same agent ID are competing to maintain connection with PMM server
      Then      New agent with unique agent ID is connected to PMM server
      And       New agent is listed in Inventory dashboard on PMM server
      And       New agent passes metrics to PMM server


