Uploaded image for project: 'Percona Monitoring and Management'
  1. Percona Monitoring and Management
  2. PMM-7910

MongoDB Query Collector never recovers from failed cursor

Details

    • 0.5
    • 10 - Core
    • Yes
    • Yes
    • C/S Core

    Description

      Original report:
      Hi PMM team,

      We are currently testing PMM to monitor our databases and we ran into the following issue with the MongoDB Query Collector: after some random delay, the queries information collection fails for a given database with this error message in loop.

      WARN[2021-04-12T23:56:52.044+07:00]logrus/entry.go:359 logrus.(*Entry).Logln couldn't retrieve data from cursor connection() : dial tcp <IP_ADDRESS>:<PORT>: i/o timeout agentID=/agent_id/dec3260a-1d54-48ee-83d8-042fe2873733 component=agent-builtin db=<DATABASE_NAME> type=qan_mongodb_profiler_agent

       

      After the first occurrence from this message, we can see with a network capture that the collector never tries again to ask MongoDB for the given database.

      We do no yet what is the cause of the i/o timeout here (probably a misleading error, the network capture doesn't show any issue), but we are surprised that the collector doesn't try to re-connect by itself.

       

      Looking at the code here, https://github.com/percona/pmm-agent/blob/master/agents/mongodb/internal/profiler/collector/collector.go#L234, we see that in case of cursor error, the code will just loop with a one second delay and try again to call cursor.TryNext.

       

      However, looking at the MongoDB Go Driver, it seems that once a cursor returns an error, it will always continue to return the same error: https://github.com/mongodb/mongo-go-driver/blob/457710110ff38c16006d73a656d108d432f935ca/mongo/cursor.go#L102

       

      So basically the current MongoDB collector will always fail in loop once the cursor is in error.

       

      We were able to fix that by adding a simple return statement (see small patch attached), so that the outer function retry the connection / cursor creation, and we can see now that the collector properly recovers from the i/o timeout error that pop up from time to time.

       

      Hope this will help.

       

      Cheers,

      Yann

       

      How to test:
      1. Start pmm and monitor a MongoDB instance.
      2. Enable profiler in the mongo instance. Related docs here. The profiling information is generated in the <database>.system.profile collection as explained here. That means that if you generate load using a database named testdb the profiling collection will be testdb.system.profile
      3. Generate some load foro the selected database.
      4. While runing pmm-agent: 
          i. Stop the MongoDB instance or
          ii. Delete the <database>.system.profile
      5. The agent should recover from this condition and if you generate new queries, they should appear in QAN

       Feature build: https://github.com/Percona-Lab/pmm-submodules/pull/1771

       

       

      Attachments

        Activity

          People

            carlos.salguero Carlos Salguero
            yannrouillard Yann Rouillard
            Votes:
            2 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - Not Specified
                Not Specified
                Logged:
                Time Spent - 2 hours
                2h

                Smart Checklist