I stopped pmm-agent on one of the nodes. The Inventory still reports it as "Running".
Set the agents' status to DONE when the two-way channel between pmm-managed and pmm-agent is closed. Do this in the registry's Run method: https://github.com/percona/pmm-managed/blob/PMM-2.0/services/agents/registry.go#L216
On startup, pmm-managed does not check which agents are actually running. Set the status of all agents to UNKNOWN on startup; the agents that are alive will then report their own status.
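A minimal sketch of the two status rules above, not the actual pmm-managed code. Everything in it is hypothetical: the statusStore type, the statusUpdate message, and the run method only stand in for the real registry and database. It assumes child agents report their status over a channel that is closed when the pmm-agent disconnects.

```go
package main

import (
	"fmt"
	"sync"
)

// AgentStatus uses the status names from this ticket; the type is hypothetical.
type AgentStatus string

const (
	StatusUnknown AgentStatus = "UNKNOWN"
	StatusRunning AgentStatus = "RUNNING"
	StatusDone    AgentStatus = "DONE"
)

// statusUpdate is a hypothetical message sent by a connected pmm-agent.
type statusUpdate struct {
	AgentID string
	Status  AgentStatus
}

// statusStore is a hypothetical in-memory stand-in for the inventory database.
type statusStore struct {
	mu       sync.Mutex
	children map[string][]string // pmm-agent ID -> IDs of the agents it runs
	statuses map[string]AgentStatus
}

// newStatusStore models pmm-managed startup: every known agent starts as UNKNOWN.
func newStatusStore(children map[string][]string) *statusStore {
	s := &statusStore{children: children, statuses: make(map[string]AgentStatus)}
	for _, ids := range children {
		for _, id := range ids {
			s.statuses[id] = StatusUnknown
		}
	}
	return s
}

func (s *statusStore) set(agentID string, st AgentStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.statuses[agentID] = st
}

func (s *statusStore) status(agentID string) AgentStatus {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.statuses[agentID]
}

// run serves one pmm-agent connection. It applies status updates until the
// channel is closed, then marks every child agent of that pmm-agent as DONE.
func (s *statusStore) run(pmmAgentID string, updates <-chan statusUpdate) {
	defer func() {
		for _, id := range s.children[pmmAgentID] {
			s.set(id, StatusDone)
		}
	}()
	for u := range updates {
		s.set(u.AgentID, u.Status)
	}
}

func main() {
	store := newStatusStore(map[string][]string{
		"pmm-agent-1": {"node_exporter-1", "mysqld_exporter-1"},
	})
	fmt.Println("after startup:", store.status("node_exporter-1"))

	updates := make(chan statusUpdate, 1)
	updates <- statusUpdate{AgentID: "node_exporter-1", Status: StatusRunning}
	close(updates) // simulates the pmm-agent stopping or the connection breaking
	store.run("pmm-agent-1", updates)
	fmt.Println("after disconnect:", store.status("node_exporter-1"))
}
```

Putting the DONE update in a deferred function means it runs however the connection handler exits, which matches the requirement that both a stopped pmm-agent and a broken connection lead to the same status.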
How to test
Test case 1
- Check that all agents have status DONE in PMM inventory when their parent pmm-agent stops or when the connection between pmm-server and the pmm-agent breaks.
- A network failure can be simulated, for example, with docker network disconnect, provided pmm-server and the pmm-agents are connected to the same network. In our DBaaS setup, the network pmm-server is on is called minikube.
- Check that the status changes to RUNNING when you start the pmm-agent again or when the connection is re-established.
Test case 2
- Add pmm-agent to pmm-server and run the pmm-agent. After the agent is connected and its agents have status RUNNING, stop pmm-server. Then stop the pmm-agent. When you bring pmm-server up again, the agents of the pmm-agent should have status UNKNOWN, as we don't know anything about them.
- When you start the pmm-agent again, its agents should gradually change status to RUNNING.
Test case 3
- Using pmm-admin, add pmm-agent without starting it. Check that its agents have status UNKNOWN before the pmm-agent is started. Then start the pmm-agent; the status should gradually change to RUNNING.
User's improvement suggestion
The above is a bug in itself, but I can also suggest an improvement. Recently some agents were stopped by accident, and I did not notice until I saw gaps in the graphs. External monitoring could be added, of course, such as monitoring the pmm-agent service status, but perhaps this could be integrated into PMM directly? On the start page, where the numbers of "Monitored nodes" and "Monitored DB Services" are shown, could we also see the number of problem instances? For example, something like "25 (1 failed)", with a mouse-over showing which one failed.
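To make the suggested "25 (1 failed)" label concrete, here is a rough sketch of how it could be computed from agent statuses. The summarize function and the instance names are hypothetical, and treating every status other than RUNNING as failed is an assumption, not something PMM defines.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarize builds a label such as "25 (1 failed)" from a map of instance
// name to status, and returns the names of the instances counted as failed.
// Assumption: any status other than "RUNNING" counts as failed.
func summarize(statuses map[string]string) (string, []string) {
	var failed []string
	for name, st := range statuses {
		if st != "RUNNING" {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed) // stable order for the mouse-over list
	if len(failed) == 0 {
		return fmt.Sprintf("%d", len(statuses)), nil
	}
	return fmt.Sprintf("%d (%d failed)", len(statuses), len(failed)), failed
}

func main() {
	label, failed := summarize(map[string]string{
		"node-a": "RUNNING",
		"node-b": "DONE",
		"node-c": "RUNNING",
	})
	fmt.Println("Monitored nodes:", label)
	fmt.Println("Mouse-over:", strings.Join(failed, ", "))
}
```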