
As the title says: given a packet capture, I want to extract the top 5 TCP (or UDP) flows, sorted by total bytes in descending order.

I have come up with this so far:
tshark -r test.pcap -q -z conv,tcp | sed "1,5d" | head -n -1 | sort -r -k5 | head -n 5

The sed and head commands remove the first 5 lines and the last line; sort then orders the rows by column 5, and the final head truncates the output to the top 5 lines.

An example of just the tshark command output looks like this (heading rows and last row removed):

10.215.173.1:49248         <-> 49.44.185.78:443                84 312 kB         78 10 kB         162 323 kB      215.775760000        12.0809
10.215.173.1:49212         <-> 49.44.185.78:443                83 312 kB         76 10 kB         159 322 kB      215.740042000        12.1151
10.215.173.1:49302         <-> 49.44.185.78:443                79 211 kB         80 9876 bytes     159 221 kB      215.811485000        12.0465
10.215.173.1:49242         <-> 49.44.185.78:443                82 312 kB         76 10 kB         158 322 kB      215.771412000        12.0851
10.215.173.1:49134         <-> 49.44.185.78:443                80 311 kB         76 10 kB         156 322 kB      215.647900000        12.2038
10.215.173.1:49202         <-> 49.44.185.78:443                83 312 kB         73 10 kB         156 322 kB      215.728497000        12.1263
10.215.173.1:49290         <-> 49.44.185.78:443                77 211 kB         78 9700 bytes     155 221 kB      215.803830000        12.0538
10.215.173.1:49278         <-> 49.44.185.78:443                77 211 kB         77 9612 bytes     154 221 kB      215.797622000         7.7149
10.215.173.1:49342         <-> 49.44.185.78:443                74 211 kB         75 9436 bytes     149 220 kB      215.866905000        11.9925
10.215.173.1:49360         <-> 49.44.185.78:443                73 211 kB         74 9348 bytes     147 220 kB      215.895946000        11.9642

Columns in order: source ip:port, destination ip:port, incoming packets and bytes, outgoing packets and bytes, total packets and bytes, relative start, duration of flow.

I think you can see the problem here: some values are in kB and others in plain bytes, so sorting on the number alone gives the wrong result. And even if all the values were in kB, the sort seems to give the wrong output, which means I am using it incorrectly.

How do I convert all the byte-count values to kB and then sort the output correctly?

Any alternative approach using tshark is also welcome.

  • Not a full answer, but some pointers. For `sort` you need a numeric sort, which is normally `-n`, but you will still have the problem with the kB. There is also `-h`, a human-numeric sort, but that works only if the unit is attached to the number (e.g. `312kB` instead of `312 kB`). You can probably add some more `sed` to attach the unit. All that said, I just tried it on my system and `tshark` gives me parseable numbers without the units; maybe this is version-specific behaviour? (I am using tshark v3.2.3) – gepa May 30 '23 at 17:34
  • FYI `sed` can remove the last line without using `head`: `sed -e '1,5d' -e '$d'` or on most versions `sed '1,5d;$d'` (note `'` not `"`: `$d` in `"` doesn't work) – dave_thompson_085 May 31 '23 at 01:40
  • @dave_thompson_085 Thanks, I had been wondering why it was not working; it turns out I was using `"`. – Trevor Philip May 31 '23 at 09:25

2 Answers


The cleanest way to do what you are asking would be to get tshark to print the actual (machine-readable) numbers, so that you can sort them easily. Unfortunately, tshark seems to have changed the way it prints these values (from machine-readable to human-readable) in version 3.3.0, and looking at the source code this does not seem to be configurable, either with a command-line option or with one of the preferences.

Lacking this option, the easiest way I can see to accomplish this is to convert tshark's human-readable format into the one that sort -h understands, i.e. with no space between the number and the kB, and without the unit bytes.

Something like this should do the trick:

tshark -r test.pcap -q -z conv,tcp |
    sed "1,5d" |
    head -n -1 |
    sed -E -e 's/ ([kMGT]B )/\1/g' |
    sed -e 's/ bytes /     /g' |
    sort -h -r -k9,9 |
    head -n 5

But again, the optimal solution would be for somebody to update tshark and add an option that makes the presentation of these values configurable (human- or machine-readable). There is a reason this format is called human-readable: it is not meant to be parsed by a machine.

gepa
  • I ran the command you suggested, but it does not give the right result: the output does not have the units in some cases. But I understand what you meant about the human-readable format, and I think having an option to choose between human-readable and machine-readable output would be good. (tshark version 3.6.2) – Trevor Philip May 31 '23 at 09:29
  • Yes, it is not perfect, as I mentioned, but at least `sort -h` should sort the unitless numbers first, then the ones with 'kB', then the ones with 'MB', etc. If I understand the code generating the lines you pasted correctly, this should be enough for the sorting to work correctly (if you manually give it numbers like 100000000 and 1kB it will sort them wrongly, but this should not happen with the `tshark` command you mentioned). Can you share an example of output that is sorted wrongly? – gepa May 31 '23 at 11:16
  • Reading your sentence again, I think I misunderstood your "the output does not have the units in some cases". This was intentional in my code (the line with `bytes`), and the thinking behind it was to keep the number of columns consistent in case you want to sort at a later column. Numbers without unit just mean bytes. If you want, you can replace that line (3rd last) with this: `sed -e 's/ bytes /B /g' |`, this should show you `123B` for plain bytes (instead of `123 bytes`). – gepa May 31 '23 at 13:12
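
The caveat in the comments above (unitless numbers first, then 'kB', then 'MB') can be illustrated with a small Python model of the ordering. This is only a sketch of the rule as the GNU coreutils manual describes it for `sort -h` (suffix class first, numeric value second), not the actual implementation; `SUFFIX_RANK` and `human_key` are hypothetical names introduced here.

```python
import re

# Rank of each size suffix; a number with no suffix ranks lowest.
# Assumption: only the suffixes relevant to this thread are modelled.
SUFFIX_RANK = {"": 0, "k": 1, "K": 1, "M": 2, "G": 3, "T": 4}


def human_key(token):
    """Sort key that compares by suffix class first, then by the number."""
    match = re.match(r"(\d+(?:\.\d+)?)([kKMGT]?)", token)
    number, suffix = match.groups()
    return (SUFFIX_RANK[suffix], float(number))


values = ["1kB", "100000000", "312kB", "9876", "2MB"]
ordered = sorted(values, key=human_key)
print(ordered)  # ['9876', '100000000', '1kB', '312kB', '2MB']
```

Note that `100000000` lands before `1kB` even though it is the larger quantity. That is harmless only as long as the tool producing the values always scales them to the nearest unit, which is what the comment above assumes about tshark.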

After looking at the comments and answers, I think it is better to parse the output of tshark and use a programming language to compute the required results.
Using Python with the pandas package makes this task much easier for me than using the sed and sort command-line tools.
I know this was not the approach I originally intended, but it saves time and is easier.
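
For illustration, here is a minimal sketch of that parsing approach. It uses only the Python standard library rather than pandas, and it is not the code the author actually used. The names `UNIT_FACTORS`, `ROW`, `to_bytes` and `top_flows` are hypothetical, and it assumes the row layout shown in the question and decimal units (1 kB = 1000 bytes).

```python
import re

# Assumption: decimal units, i.e. 1 kB = 1000 bytes.
UNIT_FACTORS = {"bytes": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

# One conversation row, as in the sample output from the question:
# A <-> B, then three (packets, size, unit) groups: incoming, outgoing, total.
ROW = re.compile(
    r"^\s*(?P<a>\S+)\s+<->\s+(?P<b>\S+)"
    r"\s+(?P<in_pkts>\d+)\s+(?P<in_size>[\d,]+)\s+(?P<in_unit>\S+)"
    r"\s+(?P<out_pkts>\d+)\s+(?P<out_size>[\d,]+)\s+(?P<out_unit>\S+)"
    r"\s+(?P<tot_pkts>\d+)\s+(?P<tot_size>[\d,]+)\s+(?P<tot_unit>\S+)"
)


def to_bytes(size, unit):
    """Convert a printed size such as ('312', 'kB') to an integer byte count."""
    # Stripping commas is a defensive assumption about thousands separators.
    return int(size.replace(",", "")) * UNIT_FACTORS[unit]


def top_flows(lines, n=5):
    """Return the n flows with the largest total byte count, largest first."""
    flows = []
    for line in lines:
        m = ROW.match(line)
        if m is None or m["tot_unit"] not in UNIT_FACTORS:
            continue  # header, separator, or a row in an unexpected format
        flows.append((to_bytes(m["tot_size"], m["tot_unit"]), m["a"], m["b"]))
    flows.sort(key=lambda flow: flow[0], reverse=True)
    return flows[:n]


# Two rows copied from the question, used here as sample input.
sample = [
    "10.215.173.1:49302         <-> 49.44.185.78:443                79 211 kB         80 9876 bytes     159 221 kB      215.811485000        12.0465",
    "10.215.173.1:49248         <-> 49.44.185.78:443                84 312 kB         78 10 kB         162 323 kB      215.775760000        12.0809",
]
for total, a, b in top_flows(sample):
    print(f"{total:>12d}  {a} <-> {b}")
```

To run it on a real capture, the same function could be fed the lines of `tshark -r test.pcap -q -z conv,tcp`, for example by passing `sys.stdin` to `top_flows`. Because tshark has already rounded the printed values, flows that show the same rounded total (such as the several 322 kB rows above) cannot be ranked against each other this way.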