[PT-1571] pt-secure-collect uses incorrect regex for hostname obfuscation Created: 25/Jun/18  Updated: 09/Jul/18  Resolved: 02/Jul/18

Status: Done
Project: Percona Toolkit
Component/s: None
Affects Version/s: None
Fix Version/s: 3.0.11

Type: Bug
Priority: High
Reporter: Agustín Gallego
Assignee: Carlos Salguero
Resolution: Fixed
Votes: 1
Labels: None
Remaining Estimate: 0 minutes
Time Spent: 1 hour
Original Estimate: Not Specified


 Description   

When using pt-secure-collect, some outputs become unreadable because the regex used for hostname detection is incorrect and matches text that is not a hostname.

For example:

top - 19:49:33 up 10 days, 16:11, 1 user, load average: hostname hostname 0.17
Tasks: 127 total, 1 running, 126 sleeping, 0 stopped, 0 zombie
%Cpu(s): 21.7 us, 34.8 sy, 0.0 ni, 43.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881748 total, 142512 free, 1898868 used, 1840368 buff/cache
KiB Swap: 1572860 total, 1572820 free, 40 used. 1602908 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5304 vagrant 20 0 1983280 hostname 12272 S 6.2 35.8 456:hostnameprometheus
5313 root 20 0 142100 16952 5428 S 6.2 0.4 189:hostnamenode_expor+
10559 mysql 20 0 1518964 221568 9976 S 6.2 5.7 10:hostnamemysqld

This is what we get in the output file 2018_06_25_19_49_32-top. Note that the "load average" section contains two "hostname" substitutions, and that the RES and TIME+ columns also contain one in some rows. The expected output is:

top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19
Tasks: 115 total, 1 running, 114 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.0 us, 0.3 sy, 0.0 ni, 98.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3881748 total, 147324 free, 1892824 used, 1841600 buff/cache
KiB Swap: 1572860 total, 1572748 free, 112 used. 1609372 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5304 vagrant 20 0 1983280 1.327g 12272 S 2.3 35.8 456:34.90 prometheus
5313 root 20 0 142100 16952 5428 S 1.0 0.4 189:16.81 node_exporter

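This happens because the regex treats floating point numbers (in this case) as hostnames. Note that only floating point numbers followed by a non-word character are substituted, because the regex below ends with (?:\W). That trailing character is part of the match, so it is replaced too, which is why the last load average value on the line is left intact and why the space after the TIME+ value disappears.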

 

The regex used is:

hostnameRE = regexp.MustCompile(`(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-][A-Za-z0-9]){2,3}(?:\W)`)

which in turn is used by:

func sanitizeHostnames(lines []string) {
    for i := range lines {
        lines[i] = hostnameRE.ReplaceAllStringFunc(lines[i], replaceHostname)
    }
}

and the replace function:

func replaceHostname(s string) string {
    if strings.HasSuffix(s, ":") {
        return "<hostname>:"
    }
    return "hostname"
}
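
To make this easier to check locally, here is a minimal, self-contained sketch that puts the three snippets above into one program and runs them on the expected `top` header line. The regex and both functions are copied from this report; the main function, and the sample line with the hostname db1.example.com, are my own additions for illustration and are not part of pt-secure-collect.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hostnameRE is the pattern quoted in this report, copied verbatim.
var hostnameRE = regexp.MustCompile(`(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-][A-Za-z0-9]){2,3}(?:\W)`)

// replaceHostname is the replace function quoted in this report.
func replaceHostname(s string) string {
	if strings.HasSuffix(s, ":") {
		return "<hostname>:"
	}
	return "hostname"
}

// sanitizeHostnames rewrites every regex match in each line, in place.
func sanitizeHostnames(lines []string) {
	for i := range lines {
		lines[i] = hostnameRE.ReplaceAllStringFunc(lines[i], replaceHostname)
	}
}

func main() {
	lines := []string{
		// Expected top header from this report: contains no hostname at all.
		"top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19",
		// Hypothetical line with a real hostname, for comparison.
		"db1.example.com: connection accepted",
	}
	sanitizeHostnames(lines)
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

With the first line, the first two load average values (each followed by a comma) are replaced, and the last one, at the end of the line, is not. This is the same pattern seen in the broken output above.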

I haven't had time to check the regex further, but if I'm not mistaken, the following would match 0.01, for instance:

(([a-zA-Z0-9])\.)+([A-Za-z0-9]){2,3}(?:\W)

which is a subset of the regex mentioned above. We can use the following URL to test if this is the case or not:

https://regex-golang.appspot.com/assets/html/index.html 

If we use the simplified regex above with the following string

top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19

you will see that the matches are the same substrings that were replaced with "hostname" in the output above.
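
For anyone who prefers to check this without the online tester, the following is a small sketch in Go that lists the matches of the simplified pattern on that string. The pattern is the reduced one from this report, written with balanced parentheses; the names simplifiedRE and findMatches are hypothetical and exist only in this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// simplifiedRE is the reduced pattern from this report (hypothetical name).
var simplifiedRE = regexp.MustCompile(`(([a-zA-Z0-9])\.)+([A-Za-z0-9]){2,3}(?:\W)`)

// findMatches returns every non-overlapping match of simplifiedRE in line.
func findMatches(line string) []string {
	return simplifiedRE.FindAllString(line, -1)
}

func main() {
	line := "top - 20:05:17 up 10 days, 16:27, 1 user, load average: 0.01, 0.15, 0.19"
	for _, m := range findMatches(line) {
		// %q makes the trailing non-word character visible in the output.
		fmt.Printf("%q\n", m)
	}
}
```

Each match includes the trailing comma, because (?:\W) is non-capturing but still consumes the character. The final value has no character after it on the line, so it does not match.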

 

How to reproduce:

Run the tool with the following command:

pt-secure-collect collect --mysql-user="root" --mysql-password="" --mysql-host=localhost

and use whatever password you want for encrypting. Then decrypt and decompress the output, and check the generated files. In this case, the file with pt-stalk's `top` output was used: 2018_06_25_19_49_32-top. The same problem appears in many other files, so an exhaustive check should be done (grep -R 'hostname' * can be used to check all generated files).

 

I'm setting the priority to "High", since this makes many outputs meaningless, which prevents us from correctly assessing server performance and could mean we miss the window of action for capturing data.


Generated at Wed Nov 21 02:20:00 UTC 2018 using Jira 7.12.1#712002-sha1:609a50578ba6bc73dbf8b05dddd7c04a04b6807c.