10 - Core
Hi PMM team,
We are currently testing PMM to monitor our databases and we ran into the following issue with the MongoDB Query Collector: after some random delay, the queries information collection fails for a given database with this error message in loop.
WARN[2021-04-12T23:56:52.044+07:00]logrus/entry.go:359 logrus.(*Entry).Logln couldn't retrieve data from cursor connection() : dial tcp <IP_ADDRESS>:<PORT>: i/o timeout agentID=/agent_id/dec3260a-1d54-48ee-83d8-042fe2873733 component=agent-builtin db=<DATABASE_NAME> type=qan_mongodb_profiler_agent
After the first occurrence from this message, we can see with a network capture that the collector never tries again to ask MongoDB for the given database.
We do no yet what is the cause of the i/o timeout here (probably a misleading error, the network capture doesn't show any issue), but we are surprised that the collector doesn't try to re-connect by itself.
Looking at the code here, https://github.com/percona/pmm-agent/blob/master/agents/mongodb/internal/profiler/collector/collector.go#L234, we see that in case of cursor error, the code will just loop with a one second delay and try again to call cursor.TryNext.
However, looking at the MongoDB Go Driver, it seems that once a cursor returns an error, it will always continue to return the same error: https://github.com/mongodb/mongo-go-driver/blob/457710110ff38c16006d73a656d108d432f935ca/mongo/cursor.go#L102
So basically the current MongoDB collector will always fail in loop once the cursor is in error.
We were able to fix that by adding a simple return statement (see small patch attached), so that the outer function retry the connection / cursor creation, and we can see now that the collector properly recovers from the i/o timeout error that pop up from time to time.
Hope this will help.
How to test:
1. Start pmm and monitor a MongoDB instance.
2. Enable profiler in the mongo instance. Related docs here. The profiling information is generated in the <database>.system.profile collection as explained here. That means that if you generate load using a database named testdb the profiling collection will be testdb.system.profile
3. Generate some load foro the selected database.
4. While runing pmm-agent:
i. Stop the MongoDB instance or
ii. Delete the <database>.system.profile
5. The agent should recover from this condition and if you generate new queries, they should appear in QAN
Feature build: https://github.com/Percona-Lab/pmm-submodules/pull/1771