One of the nodes in the Galera cluster had high memory usage. Investigation
showed that this node, instead of writing transactions to the galera.cache ring
buffer, was writing them to page stores. It had created 276 pages of 128 MB each
(about 34.5 GB in total). This started happening a few hours after a wsrep process
failed to start on the local MySQL node. I am not sure what caused the wsrep failure.
Note that this doesn’t mean that Galera wasn’t running
or that changes weren’t being applied. Based on metrics at least,
the node was participating in certification, and commits/write-sets were being
applied. It’s just that the write-sets were being retained, and I am not sure why.
2018-08-13T21:26:00.305187Z 0 [ERROR] WSREP: bind: Address already in use
2018-08-13T21:26:00.305338Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
2018-08-13T21:26:00.305370Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
2018-08-13T21:26:00.305567Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'cluster1' at 'gcomm://member1,member2': -98 (Address already in use)
2018-08-13T21:26:00.305593Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
2018-08-13T21:26:00.305608Z 0 [ERROR] WSREP: Provider/Node (gcomm://member1,member2) failed to establish connection with cluster (reason: 7)
2018-08-13T21:26:00.305627Z 0 [ERROR] Aborting
2018-08-14T01:13:29.240162Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 134217728 bytes
2018-08-14T01:35:41.118082Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 134217728 bytes
2018-08-14T01:57:41.416134Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 134217728 bytes
2018-08-14T02:19:25.737846Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000003 of size 134217728 bytes
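To see how fast the page store grows, the "Created page" notes can be counted straight from the error log. The following is a minimal sketch, assuming the log line format shown above; the function name `summarize_pages` is hypothetical and the sample input is just the four lines quoted in this report.

```python
import re

# Matches Galera's "Created page <path> of size <N> bytes" note lines.
PAGE_RE = re.compile(r"WSREP: Created page (\S+) of size (\d+) bytes")


def summarize_pages(log_lines):
    """Return (page_count, total_bytes) over all page-creation log lines."""
    count = 0
    total = 0
    for line in log_lines:
        match = PAGE_RE.search(line)
        if match:
            count += 1
            total += int(match.group(2))
    return count, total


# The four page-creation lines quoted in this report.
sample = [
    "2018-08-14T01:13:29.240162Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 134217728 bytes",
    "2018-08-14T01:35:41.118082Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 134217728 bytes",
    "2018-08-14T01:57:41.416134Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 134217728 bytes",
    "2018-08-14T02:19:25.737846Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000003 of size 134217728 bytes",
]

count, total = summarize_pages(sample)
print(f"{count} pages, {total / 2**20:.0f} MiB")  # prints "4 pages, 512 MiB"
```

Fed the full error log instead of the sample, the same function should report the 276 pages mentioned above, assuming every page creation was logged.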
Restarting the node didn’t clean up the page files. It complained that some buffers in the files were still mmapped.
2018-08-18T02:43:13.427709Z 0 [ERROR] WSREP: Could not delete 276 page files: some buffers are still "mmapped".
However, the files weren’t opened or accessed by the node after the restart, and
they had to be deleted manually.
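Before removing anything by hand, it can help to list the leftover page files and their combined size. This is a minimal sketch, assuming the pages follow the gcache.page.NNNNNN naming seen in the log; the helper name `leftover_pages` is hypothetical, and the demo uses a throwaway directory rather than the real data directory. It only inspects files and deletes nothing.

```python
import tempfile
from pathlib import Path


def leftover_pages(datadir):
    """Return (files, total_bytes) for gcache.page.* files directly under datadir."""
    files = sorted(p for p in Path(datadir).glob("gcache.page.*") if p.is_file())
    total = sum(p.stat().st_size for p in files)
    return files, total


# Demo against a temporary directory so the sketch runs anywhere;
# on the affected node the directory would be /db/mysql (per the log).
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"gcache.page.{i:06d}").write_bytes(b"\0" * 1024)
    # The ring buffer file must not be picked up by the pattern.
    (Path(tmp) / "galera.cache").write_bytes(b"\0" * 10)

    files, total = leftover_pages(tmp)
    names = [p.name for p in files]
    print(names, total)
```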
This behavior resembles the one intentionally enabled by
gcache.freeze_purge_at_seqno, but here it somehow got triggered on its own.
This seems like a regression.
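To rule out a configuration cause, the gcache settings in effect can be pulled out of the wsrep_provider_options value. The sketch below assumes that value is a semicolon-separated list of "key = value" pairs; the helper name `gcache_options` is hypothetical, and the sample string is illustrative only, not taken from the affected node.

```python
def gcache_options(raw):
    """Parse a wsrep_provider_options string and keep only the gcache.* keys."""
    opts = {}
    for item in raw.split(";"):
        if "=" not in item:
            continue  # skip empty trailing fragments
        # Split on the first "=" only, since values may contain "=" themselves.
        key, _, value = item.partition("=")
        key = key.strip()
        if key.startswith("gcache."):
            opts[key] = value.strip()
    return opts


# Illustrative sample only; real values come from
# SHOW VARIABLES LIKE 'wsrep_provider_options' on the node.
sample_options = (
    "gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_size = 0; "
    "gcache.page_size = 128M; gmcast.listen_addr = tcp://0.0.0.0:4567; "
)

parsed = gcache_options(sample_options)
for name in sorted(parsed):
    print(name, "=", parsed[name])
```

If gcache.freeze_purge_at_seqno turns out to be unset on the node, that would support the reading that the retention was not requested by configuration.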