Uploaded image for project: 'Percona Monitoring and Management'
  1. Percona Monitoring and Management
  2. PMM-6811

MongoDB Cluster Summary: when secondary optime is newer than primary optime, lag incorrectly shows 136 years

Details

    • Bug
    • Status: Done
    • High
    • Resolution: Fixed
    • None
    • 2.12.0
    • MongoDB_Exporter
    • 1
    • Yes

    Description

      Due to the MongoDB design issue, described in PSMDB-752, PMM sees optime on a Secondary which is newer than optime on the Primary. Then, when it calculates slave lag, the lag becomes a negative value and since data for the lag is UNSIGNED INTEGER, it is converted to the maximum UNSIGNED INTEGER value which is then converted to 136 years on the graph. This makes replication lag graphs not usable.

      How to Repeat

      See PMM-4547 and PSMDB-752, practically:

      1. Start Replicaset:

      mlaunch --dir ./data-4.4.1-rs --replicaset   --binarypath ./percona-server-mongodb-4.4.1-3-x86_64.glibc2.17/bin
      

      2. Enable monitoring:

      pmm-admin add mongodb --service-name=mongo_node1 --host=127.0.0.1 --port=27017
      pmm-admin add mongodb --service-name=mongo_node2 --host=127.0.0.1 --port=27018
      pmm-admin add mongodb --service-name=mongo_node3 --host=127.0.0.1 --port=27019
      

      3. Connect to PRIMARY and run:

      for (var i = 1; i <= 25; i++) {    db.collectionName.insert({ x : i }); sleep(1000); }
      
      

      4. While loop is in progress connect to SECONDARY, run rs.status() and check values of optime/optimeDate:

      replset:SECONDARY> st=rs.status()
      ...
      replset:SECONDARY> st.members[0].stateStr
      PRIMARY
      replset:SECONDARY> st.members[1].stateStr
      SECONDARY
      
      replset:SECONDARY> st.members[0].optime.ts
      Timestamp(1603204318, 1)
      replset:SECONDARY> st.members[1].optime.ts
      Timestamp(1603204319, 1)
      
      replset:SECONDARY> st.members[0].optimeDate
      ISODate("2020-10-20T14:31:58Z")
      replset:SECONDARY> st.members[1].optimeDate
      ISODate("2020-10-20T14:31:59Z")
      

      5. Once you noticed difference in times go to PMM and notice it reports 136 years of the replication lag.

      Suggested Quick Fix

      Convert maximum value for UNSIGNED INTEGER to 0: this slight error is better than huge error of 136 years.

      Alternatively use SIGNED INTEGER to store replication lag values.

      Suggested Good Fix

      Implement pt-heartbeat for MongoDB as we have for MySQL

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              sveta.smirnova Sveta Smirnova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Smart Checklist