Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: None
Environment: CentOS 6 (kernel 2.6.32), MongoDB 3.0.11 and 3.2.12, MMAPv1 storage engine, SCCC config servers
We are experiencing performance degradation when moving from 3.0.11 to 3.2.12. Application throughput drops 5-10x on 3.2.12 compared to 3.0.11.
In the past we attempted to upgrade from 3.0.11 to 3.2.8. Application throughput on 3.2.8 was fine, but the mongos processes were randomly crashing due to https://jira.mongodb.org/browse/SERVER-26159, so we rolled back to 3.0.11. SERVER-26159 was fixed in 3.2.10, so we attempted that upgrade, but performance degraded and we rolled back to 3.0.11 again. We opened SERVER-26654 about this issue (several other people reported almost identical problems), and according to JIRA the issue was resolved in 3.2.12. We then attempted to upgrade to 3.2.12 but hit the same performance degradation as on 3.2.10.
The issue we are seeing in the logs after increasing the verbosity from 1 to 2 is the following:
the isMaster command constantly times out on different "TaskExecutorPool" instances.
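For reference, the verbosity increase was done at runtime through the standard setParameter mechanism (a sketch; run against each mongos from the mongo shell):

```
// raise mongos log verbosity from 1 to 2 at runtime
db.adminCommand({ setParameter: 1, logLevel: 2 })
```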
Note: I am not changing the "protocolVersion" to 1 after the 3.0.11 to 3.2.12 upgrade, as doing so makes a rollback harder.
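For context, this is the reconfiguration step we are deliberately skipping (a sketch using the standard mongo shell helpers):

```
// switch a replica set to the pv1 election protocol (NOT applied here,
// because reverting it complicates a rollback to 3.0.11)
cfg = rs.conf()
cfg.protocolVersion = 1
rs.reconfig(cfg)
```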
We managed to reproduce the issue with sysbench-mongodb using 3.2.12 (both the MongoDB Inc. and Percona distributions) on a 10-node sharded cluster, though not at the scale we see on our production system.
To remedy the issue in testing, we changed the taskExecutorPoolSize value:
Our mongos host has 6 CPUs, so I assume the default creates 6 connection pools. Using a smaller value such as "taskExecutorPoolSize"=2 reduces the timeouts, so it seems the more connection pools in use, the more timeouts occur during the benchmark. With "taskExecutorPoolSize"=1, which I believe creates a single connection pool, I no longer see the above timeouts.
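In the test environment the parameter was lowered at mongos startup (a sketch; the --configdb value is a placeholder for our actual config server string):

```
# start mongos with a single sharding task executor connection pool
mongos --configdb cfg1.example.net,cfg2.example.net,cfg3.example.net \
       --setParameter taskExecutorPoolSize=1
```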
We also raised ShardingTaskExecutorPoolRefreshTimeoutMS from the default 20 seconds to 60 seconds, which likewise eliminated the timeouts.
We applied both changes in production, but unfortunately the timeouts didn't go away and we still saw the same performance degradation.
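The combined change we deployed to production can be expressed in the mongos YAML configuration file (a sketch; only the relevant setParameter entries are shown):

```yaml
# mongos.conf -- relevant parameters only
setParameter:
  taskExecutorPoolSize: 1                          # single connection pool instead of one per CPU
  ShardingTaskExecutorPoolRefreshTimeoutMS: 60000  # refresh timeout raised from 20s to 60s
```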
I want to believe it is not our workload that triggers the performance degradation, as the same workload operates fine on 3.0.11.
The purpose of this ticket is to understand what changed between 3.2.8 and 3.2.12 that might cause isMaster requests between mongos and mongod to fail.
It would be much appreciated if anyone has internals on the change, or is facing the same problem and has found a workaround.
Thanks in advance,