Details
-
Bug
-
Status: Done
-
High
-
Resolution: Fixed
-
None
-
1
-
Yes
Description
Due to the MongoDB design issue, described in PSMDB-752, PMM sees optime on a Secondary which is newer than optime on the Primary. Then, when it calculates slave lag, the lag becomes a negative value and since data for the lag is UNSIGNED INTEGER, it is converted to the maximum UNSIGNED INTEGER value which is then converted to 136 years on the graph. This makes replication lag graphs not usable.
How to Repeat
See PMM-4547 and PSMDB-752, practically:
1. Start Replicaset:
mlaunch --dir ./data-4.4.1-rs --replicaset --binarypath ./percona-server-mongodb-4.4.1-3-x86_64.glibc2.17/bin
2. Enable monitoring:
pmm-admin add mongodb --service-name=mongo_node1 --host=127.0.0.1 --port=27017 pmm-admin add mongodb --service-name=mongo_node2 --host=127.0.0.1 --port=27018 pmm-admin add mongodb --service-name=mongo_node3 --host=127.0.0.1 --port=27019
3. Connect to PRIMARY and run:
for (var i = 1; i <= 25; i++) { db.collectionName.insert({ x : i }); sleep(1000); }
4. While loop is in progress connect to SECONDARY, run rs.status() and check values of optime/optimeDate:
replset:SECONDARY> st=rs.status() ... replset:SECONDARY> st.members[0].stateStr PRIMARY replset:SECONDARY> st.members[1].stateStr SECONDARY replset:SECONDARY> st.members[0].optime.ts Timestamp(1603204318, 1) replset:SECONDARY> st.members[1].optime.ts Timestamp(1603204319, 1) replset:SECONDARY> st.members[0].optimeDate ISODate("2020-10-20T14:31:58Z") replset:SECONDARY> st.members[1].optimeDate ISODate("2020-10-20T14:31:59Z")
5. Once you noticed difference in times go to PMM and notice it reports 136 years of the replication lag.
Suggested Quick Fix
Convert maximum value for UNSIGNED INTEGER to 0: this slight error is better than huge error of 136 years.
Alternatively use SIGNED INTEGER to store replication lag values.
Suggested Good Fix
Implement pt-heartbeat for MongoDB as we have for MySQL