Percona Toolkit / PT-1571

pt-secure-collect uses incorrect regex for hostname obfuscation

    Details

    • Type: Bug
    • Status: Done
    • Priority: High
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.11
    • Component/s: None
    • Labels: None

      Description

      When using pt-secure-collect, some outputs become unreadable because the regex used for hostname detection is incorrect.

      For example:

      top - 19:49:33 up 10 days, 16:11, 1 user, load average: hostname hostname 0.17
      Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
      %Cpu(s): 21.7 us, 34.8 sy, 0.0 ni, 43.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
      KiB Mem : 3881748 total, 142512 free, 1898868 used, 1840368 buff/cache
      KiB Swap: 1572860 total, 1572820 free, 40 used. 1602908 avail Mem
      
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      5304 vagrant 20 0 1983280 hostname 12272 S 6.2 35.8 456:hostnameprometheus
      5313 root 20 0 142100 16952 5428 S 6.2 0.4 189:hostnamenode_expor+
      10559 mysql 20 0 1518964 221568 9976 S 6.2 5.7 10:hostnamemysqld

      This is the output we get for 2018_06_25_19_49_32-top. Note that the "load average" section contains "hostname" twice, and that "hostname" also appears in the RES and TIME+ columns. The expected output should look like this:

      top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19
      Tasks: 115 total, 1 running, 114 sleeping, 0 stopped, 0 zombie
      %Cpu(s): 1.0 us, 0.3 sy, 0.0 ni, 98.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
      KiB Mem : 3881748 total, 147324 free, 1892824 used, 1841600 buff/cache
      KiB Swap: 1572860 total, 1572748 free, 112 used. 1609372 avail Mem
      
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      5304 vagrant 20 0 1983280 1.327g 12272 S 2.3 35.8 456:34.90 prometheus
      5313 root 20 0 142100 16952 5428 S 1.0 0.4 189:16.81 node_exporter

      This happens because floating-point numbers are (in this case) being treated as hostnames. Note that only floating-point numbers followed by another character are substituted, since the regex ends with a mandatory non-word character (see the regex below).

      The regex used is:

      hostnameRE = regexp.MustCompile(`(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-][A-Za-z0-9]){2,3}(?:\W)`)
      

      which in turn is used by:

      // sanitizeHostnames rewrites every regex match in each line, in place.
      func sanitizeHostnames(lines []string) {
          for i := range lines {
              lines[i] = hostnameRE.ReplaceAllStringFunc(lines[i], replaceHostname)
          }
      }

      and the replace function:

      // replaceHostname keeps a trailing colon; any other match becomes "hostname".
      func replaceHostname(s string) string {
          if strings.HasSuffix(s, ":") {
              return "<hostname>:"
          }
          return "hostname"
      }
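      The snippets above can be combined into a small standalone program to reproduce the problem outside the tool. This is only a minimal sketch: the regex and both functions are copied from this report, while `main` and the sample lines (taken from the expected `top` output, with single spaces between columns) are there purely for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Regex exactly as quoted in this report.
var hostnameRE = regexp.MustCompile(`(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-][A-Za-z0-9]){2,3}(?:\W)`)

// sanitizeHostnames rewrites every regex match in each line, in place.
func sanitizeHostnames(lines []string) {
	for i := range lines {
		lines[i] = hostnameRE.ReplaceAllStringFunc(lines[i], replaceHostname)
	}
}

// replaceHostname keeps a trailing colon; any other match becomes "hostname".
func replaceHostname(s string) string {
	if strings.HasSuffix(s, ":") {
		return "<hostname>:"
	}
	return "hostname"
}

func main() {
	// Illustrative input only: two lines from the expected top output.
	lines := []string{
		"top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19",
		"5304 vagrant 20 0 1983280 1.327g 12272 S 2.3 35.8 456:34.90 prometheus",
	}
	sanitizeHostnames(lines)
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

      With these inputs, `0.01,` and `0.15,` are replaced, while the trailing `0.19` is left alone because nothing follows it on the line. `1.327g` and `34.90` are replaced together with the space after them, whereas `2.3` and `35.8` survive because the final group requires at least two characters after the last dot.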

      I haven't had time to check the regex further, but if I'm not mistaken, the following would match 0.01, for instance:

      (([a-zA-Z0-9])\.)+([A-Za-z0-9]){2,3}(?:\W)

      which is a subset of the regex mentioned above. We can use the following URL to test if this is the case or not:

      https://regex-golang.appspot.com/assets/html/index.html 

      If we use the reduced regex above with the following string:

      top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19
      

      you will see that the matches are exactly the parts that appear above replaced with "hostname".
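      The same check can also be run locally instead of through the online tester. The following is a minimal sketch, assuming the reduced regex is meant to have balanced parentheses, i.e. `(([a-zA-Z0-9])\.)+([A-Za-z0-9]){2,3}(?:\W)`. The names `reducedRE` and `findMatches` are illustrative and not part of pt-secure-collect.

```go
package main

import (
	"fmt"
	"regexp"
)

// Reduced regex from this report, with parentheses balanced (assumption).
var reducedRE = regexp.MustCompile(`(([a-zA-Z0-9])\.)+([A-Za-z0-9]){2,3}(?:\W)`)

// findMatches returns every substring the reduced regex matches in line.
func findMatches(line string) []string {
	return reducedRE.FindAllString(line, -1)
}

func main() {
	line := "top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19"
	for _, m := range findMatches(line) {
		fmt.Printf("%q\n", m)
	}
}
```

      For this line the matches are `0.01,` and `0.15,` (each including the trailing comma). The final `0.19` is not matched because no character follows it.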

      How to reproduce:

      Run the tool with the following command:

      pt-secure-collect collect --mysql-user="root" --mysql-password="" --mysql-host=localhost

      and use whatever password you want for encryption. Then decrypt and decompress the output, and check the generated files. In this case, the file with pt-stalk's `top` output was used: 2018_06_25_19_49_32-top. The same problem appears in many other files, so an exhaustive check should be done (grep -R 'hostname' * checks all generated files).

      I'm setting the priority to "high" because this makes many outputs meaningless, which prevents us from correctly assessing server performance and could mean we miss the window for capturing data.

              People

              • Assignee: Carlos Salguero (carlos.salguero)
              • Reporter: Agustín Gallego (agustin.gallego)

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0m
                  • Logged: 1h