Percona XtraDB Cluster / PXC-2220

An XtraDB node starts writing to page store instead of ring buffer


    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: 5.7.22-29.26
    • Fix Version/s: 5.7.24-31.33, 5.6.42-28.30
    • Component/s: None
    • Labels:
    • Environment:


      Kernel 3.13.0-112-generic

      PXC - 5.7.22-22-57


      One of the nodes in the Galera cluster had high memory usage. Investigation
      showed that, instead of writing transactions to the galera.cache ring
      buffer, this node was writing them to page stores. It had created 276
      pages of 128 MB each.

      This started happening a few hours after a wsrep process failed
      to start on the local MySQL node. I am not sure what caused that wsrep event.

      Note that this doesn't mean Galera wasn't running or that the changes
      weren't being applied. At least based on metrics, the node was
      participating in certification, and commits/write-sets were being
      applied. It's just that the write-sets were being retained; I'm not sure why.
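      The symptom above can be pictured with a toy model (my own sketch, NOT
      Galera's actual implementation): a fixed-size ring buffer that normally
      purges old write-sets, but spills into fixed-size page files once purging
      stops and the ring fills up. The names `ToyGCache`, `purge_frozen`, and
      the sizes are all hypothetical.

      ```python
      # Toy model of a ring buffer that spills to "page" files when old
      # entries cannot be purged. Illustrative only; not gcache code.
      class ToyGCache:
          def __init__(self, ring_size, page_size):
              self.ring_size = ring_size
              self.page_size = page_size
              self.ring_used = 0
              self.purge_frozen = False   # analogue of a stuck/frozen purge
              self.pages = []             # bytes used in each page file

          def write(self, nbytes):
              # Normal case: when the ring is full, old write-sets are
              # purged to make room (modeled here as resetting the ring).
              if not self.purge_frozen and self.ring_used + nbytes > self.ring_size:
                  self.ring_used = 0
              if self.ring_used + nbytes <= self.ring_size:
                  self.ring_used += nbytes
                  return "ring"
              # Purge is frozen and the ring is full: spill to page store
              # ("Created page gcache.page.NNNNNN" in the log above).
              if not self.pages or self.pages[-1] + nbytes > self.page_size:
                  self.pages.append(0)
              self.pages[-1] += nbytes
              return "page"
      ```

      In this model, once `purge_frozen` flips on, every further write lands in
      a new or existing page file and the page count only grows, which matches
      the 276 accumulated 128 MB pages described above.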

      2018-08-13T21:26:00.305187Z 0 [ERROR] WSREP: bind: Address already in use
      2018-08-13T21:26:00.305338Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://', asio error 'bind: Address already in use': 98 (Address already in use)
      at gcomm/src/asio_tcp.cpp:listen():836
      2018-08-13T21:26:00.305370Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
      2018-08-13T21:26:00.305567Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'cluster1' at 'gcomm://member1,member2': -98 (Address already in use)
      2018-08-13T21:26:00.305593Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
      2018-08-13T21:26:00.305608Z 0 [ERROR] WSREP: Provider/Node (gcomm://member1,member2) failed to establish connection with cluster (reason: 7)
      2018-08-13T21:26:00.305627Z 0 [ERROR] Aborting
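      The bind failure in the log is ordinary EADDRINUSE (errno 98): something
      was already listening on the gcomm port. A minimal reproduction with
      plain sockets (an assumption on my part; this is not Galera's gcomm
      code, just the same OS-level error):

      ```python
      import errno
      import socket

      # First listener takes a port (kernel picks a free one).
      s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s1.bind(("127.0.0.1", 0))
      s1.listen(1)
      port = s1.getsockname()[1]

      # Second bind to the same port fails with EADDRINUSE, just as the
      # second wsrep process did in the log above.
      s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      try:
          s2.bind(("127.0.0.1", port))
          bind_failed = False
      except OSError as e:
          bind_failed = (e.errno == errno.EADDRINUSE)  # errno 98 on Linux
      finally:
          s2.close()
          s1.close()
      ```

      On a real node, `ss -tlnp` (or `netstat -tlnp`) would show which process
      held the port when the second wsrep start was attempted.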

      2018-08-14T01:13:29.240162Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 134217728 bytes
      2018-08-14T01:35:41.118082Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 134217728 bytes
      2018-08-14T01:57:41.416134Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 134217728 bytes
      2018-08-14T02:19:25.737846Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000003 of size 134217728 bytes


      Restarting the node didn't clean up the data. It complained that the data in the files was still mmapped.

      2018-08-18T02:43:13.427709Z 0 [ERROR] WSREP: Could not delete 276 page files: some buffers are still "mmapped".

      However, the files weren't accessed/opened by the node after the restart,
      and a manual delete was required.
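      The refusal to delete while buffers are "mmapped" follows from POSIX
      semantics: unlinking a file does not tear down existing mappings, so
      deleting a page file that is still mapped would leave its memory and
      disk blocks pinned until munmap. A toy demonstration of that semantics
      (my sketch, not gcache code):

      ```python
      import mmap
      import os
      import tempfile

      # Create and map a small file, then unlink it while mapped.
      fd, path = tempfile.mkstemp()
      os.write(fd, b"x" * 4096)
      m = mmap.mmap(fd, 4096)
      os.remove(path)                    # directory entry is gone...
      still_readable = (m[:1] == b"x")   # ...but the mapping is still live
      m.close()                          # only now can the kernel free the pages
      os.close(fd)
      ```

      This is presumably why gcache tracks live mappings and defers deletion,
      and why the stale check blocked cleanup even after a restart.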

      This behavior seems related to the one intentionally introduced by
      gcache.freeze_purge_at_seqno, but it somehow got triggered here.
      This looks like a regression.
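      For reference, the intentional version of this freeze is toggled through
      the provider options. A sketch of that interface (the option name is
      taken from this report; the exact syntax is assumed from Galera's
      wsrep_provider_options mechanism):

      ```sql
      -- Freeze gcache purge at the current seqno (used e.g. around backups):
      SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=now';

      -- Re-enable purging afterwards:
      SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=-1';
      ```

      The bug here is that the node behaved as if such a freeze were in effect
      without it having been set.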

              • Assignee: Krunal Bauskar (Inactive)
              • Reporter: Shashank Sahni
              • Votes: 0
              • Watchers: 3


                • Created:
                • Time Tracking:
                  Original Estimate: Not Specified
                  Remaining Estimate: 0 minutes
                  Time Spent: 1d 4h