Percona XtraDB Cluster / PXC-2220

An XtraDB node starts writing to page store instead of ring buffer


    Details

    • Type: Bug
    • Status: Done
    • Priority: Medium
    • Resolution: Fixed
    • Affects Version/s: 5.7.22
    • Fix Version/s: 5.7.24-31.33, 5.6.42-28.30
    • Component/s: None
    • Labels:
    • Environment:

      Linux

      Kernel 3.13.0-112-generic

      PXC - 5.7.22-22-57

    Description

      One of the nodes in the Galera cluster had high memory usage. Investigation
      showed that, instead of writing transactions to the galera.cache ring
      buffer, this node was writing them to page stores. It had already created
      276 pages of 128 MB each, roughly 34.5 GB.
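
      A quick way to inspect the gcache settings in effect on the node is shown
      below (a sketch; gcache.size and gcache.page_size are standard Galera
      gcache parameters, and the 128 MB default page size matches the
      134217728-byte pages in the log further down):

      {code}
      # Pull the gcache settings out of wsrep_provider_options.
      # gcache.size is the ring buffer size; gcache.page_size is the size of
      # each overflow page store; gcache.keep_pages_size controls how much
      # page data is retained after use.
      mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G" \
        | tr ';' '\n' | grep -E 'gcache\.(size|page_size|keep_pages_size)'
      {code}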

      This started happening a few hours after the wsrep provider failed to
      start on the local MySQL node (see the log below). I am not sure what
      caused that failure.

      Note that this doesn't mean Galera wasn't running or that the changes
      weren't being applied. At least based on metrics, the node was
      participating in certification, and commits/write-sets were being applied.
      It's just that the write-sets were being retained, and I am not sure why.
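
      As a rough illustration, that participation can be checked from the wsrep
      status counters (a sketch, not output captured from the affected node):

      {code}
      # A Synced wsrep_local_state_comment plus an advancing
      # wsrep_last_committed indicate the node is certifying and applying
      # write-sets normally.
      mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
                ('wsrep_local_state_comment', 'wsrep_cluster_status',
                 'wsrep_last_committed', 'wsrep_local_recv_queue')"
      {code}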

      {code}
      2018-08-13T21:26:00.305187Z 0 [ERROR] WSREP: bind: Address already in use
      2018-08-13T21:26:00.305338Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
      at gcomm/src/asio_tcp.cpp:listen():836
      2018-08-13T21:26:00.305370Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
      2018-08-13T21:26:00.305567Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'cluster1' at 'gcomm://member1,member2': -98 (Address already in use)
      2018-08-13T21:26:00.305593Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
      2018-08-13T21:26:00.305608Z 0 [ERROR] WSREP: Provider/Node (gcomm://member1,member2) failed to establish connection with cluster (reason: 7)
      2018-08-13T21:26:00.305627Z 0 [ERROR] Aborting

      2018-08-14T01:13:29.240162Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 134217728 bytes
      2018-08-14T01:35:41.118082Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 134217728 bytes
      2018-08-14T01:57:41.416134Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 134217728 bytes
      2018-08-14T02:19:25.737846Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000003 of size 134217728 bytes

      ...
      {code}
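
      Assuming the data directory from the log above, the buildup is easy to
      quantify from the page files themselves (a sketch):

      {code}
      # Count the overflow pages and total the disk they occupy;
      # 276 pages x 128 MB comes to roughly 34.5 GB.
      ls /db/mysql/gcache.page.* | wc -l
      du -ch /db/mysql/gcache.page.* | tail -n 1
      {code}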

      Restarting the node didn't clean up the data; it complained that some of
      the buffers in the page files were still mmapped.

      {code}
      2018-08-18T02:43:13.427709Z 0 [ERROR] WSREP: Could not delete 276 page files: some buffers are still "mmapped".
      {code}

      However, the files weren't accessed or opened by the node after the
      restart, and a manual delete was required.
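
      For reference, the manual cleanup amounts to removing the orphaned page
      files while mysqld is down (a sketch; the path assumes the data directory
      from the log, and the service commands are illustrative):

      {code}
      # With mysqld stopped, confirm nothing still holds the page files open,
      # then remove them; gcache recreates pages on demand.
      service mysql stop
      lsof /db/mysql/gcache.page.* || true   # expect no output
      rm -f /db/mysql/gcache.page.*
      service mysql start
      {code}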

      This behavior seems related to the retention intentionally introduced by
      gcache.freeze_purge_at_seqno, but it somehow got triggered here. It looks
      like a regression.
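
      For comparison, this is how that option produces the same retention when
      used deliberately (a sketch; "now" pins purging at the current seqno and
      -1 resumes normal purging):

      {code}
      # Freeze gcache purging so write-sets are retained, e.g. while taking
      # a backup or letting an async replica catch up.
      mysql -e "SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=now'"
      # Resume normal purging.
      mysql -e "SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=-1'"
      {code}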


    People

    • Assignee: Krunal Bauskar (krunal.bauskar)
    • Reporter: Shashank Sahni (shashanksahni12@gmail.com)
    • Votes: 0
    • Watchers: 2


    Time Tracking

    • Original Estimate: Not Specified
    • Remaining Estimate: 0m
    • Time Spent: 1d 4h