Uploaded image for project: 'Percona Server for MySQL'
  1. Percona Server for MySQL
  2. PS-2431

LP #1232101: Add hard timeouts to page cleaner flushes

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Done
    • Priority: Low
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      **Reported in Launchpad by Laurynas Biveinis last update 12-09-2016 12:57:39

      Copy of http://bugs.mysql.com/bug.php?id=70453:

      [27 Sep 15:51] Laurynas Biveinis

      Description:
      Currently the page cleaner thread is designed as follows, if I understand correctly:

      • in a loop:
      • do LRU tail flushing, controlled by max LRU scan depth;
      • do flush list flushing, controlled by I/O capacity;
      • sleep the remaining time until 1s.

      Now the proper tuning (correct me if I'm wrong) would strive to minimize the sleep time and utilize storage as closely as possible to capacity.

      But the flushing time is not only determined by the I/O variable settings, but also by how quickly can cleaner can get access to the shared resources such as buffer pool mutexes too. This means that the page cleaner iteration time, even with properly tuned I/O settings, can exceed 1 second. Under very heavy loads (sysbench, I/O-bound, 512 threads on 32 core) one page cleaner iteration might take as long as minutes. This is bad due to several reasons: if LRU flushing is taking that long, the flush list flushing is not happening, the query threads go to sync preflush. if flush list flushing is taking that long, the LRU tail flushing is not happening, the query threads go to single page LRU flushes. Moreover the page cleaner and adaptive flushing heuristics are designed to be updated roughly every second, and probably will not get a very exact idea of what's going if called once in several minutes.

      How to repeat:
      Benchmarks, code analysis

      Suggested fix:
      A partial solution would be to add a hard timeout, i.e. limit LRU tail flush to 1 second, checked after each batch, and flush list flushing to 1 second too. (the exact value of the constant is debatable). It looks it's more important to re-check heuristics and to alternate periodically between the two than to allow one of them to complete fully.

      This is not a complete solution for the case of multiple buffer pool instances, because with a timeout, some instances may receive no flushing at all. This will be reported separately.

        Attachments

          Activity

            People

            Assignee:
            Unassigned
            Reporter:
            lpjirasync lpjirasync (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Smart Checklist