1

Yesterday I asked this question, and got awesome answers, it's really a joy to ask questions on this site.

Today I got a slightly different question

say I have csv1

1,2,3,4
5,6,7,8 --
9,10,11,12
13,14,15 --

and csv2 has

1,2,3,4,5 --
20,21,22,23,24
24,25,26,27,28
9,10,11,12,30 --
45,46,47,48,60

How can I print only those rows whose 1st 4 fields are only present in one of the two files? In other words, discard all lines from each file whose 1st four fields are also present in a line in the other file.

1,2,3,4
9,10,11,12
20,21,22,23,24
24,25,26,27,28
45,46,47,48,60

Note that -- doesn't exist in the actual files, i added them to help you notice the difference.

So far, I'm loading everything in numpy arrays and comparing each element,

if a[i] == b[i] and ...

But I want to know if there's a better way to do it using Linux tools.

Edit

Every line in csv2 has a corresponding line in csv1 and there are no duplicate lines in the same file. basically i'm trying to remove csv2 from csv1 and output the rest of csv1.

Lynob
  • 6,615
  • 16
  • 76
  • 108

1 Answers1

3

Here's one way:

$ awk -F, 'NR==FNR{a[$1$2$3$4]++; next}!a[$1$2$3$4]' csv2 csv1
110,12,31,345
1,12,14,55 
12,53,22,10
1,12,32,44 

Explanation

  • -F, : set the field separator to ,. Now, the first comma-separated field of each line will be $1, the second $2 and so on.
  • NR==FNR : these are two awk special variables. NR is the current input line and FNR is the line number of the current file. The two will be equal only while the 1st file is being read.
  • NR==FNR{a[$1$2$3$4]++; next} : while reading the 1st file, save the 1st 4 fields as a key in the array a and set their value to 1. This basically saves all 1st 4 fields of csv1. The next ensures that we immediately skip to the next line and don't process the rest of the script.
  • !a[$1$2$3$4] : the default action of awk is to print the current line. So, if you use something that evaluates to true, awk understands that it should print this line. !a[ $1$2$3$4] is true when a[$1$2$3$4] is not define which will happen for lines in csv1 whose 1st 4 fields were not present in any lines of csv2. Therefore, this directive will cause all lines whose 1st 4 fields have never been seen (so their value in the a array is not 1) to be printed.
terdon
  • 98,183
  • 15
  • 197
  • 293
  • is it `csv1 csv1 csv2` or is it `csv1 csv2 result`? – Lynob Jun 25 '15 at 14:00
  • @Lynob It's `csv1 csv1 csv2`. We need to read one of the two files in its entirety before continuing. There's no other way of knowing of the line is present. So, we save all 1st 4 fields of `csv1`, then go over `csv1` again and also `csv2`, printing only lines whose 1st four fields aren't 1. Since we only add 1 once when reading `csv1` the first time, any lines from `csv2` that weren't in `csv1` will have a value of `undef` and will be printed. – terdon Jun 25 '15 at 14:25
  • @Lynob I just realized something. I may have made a bad edit to your question. Do you want lines in file1 whose 1st 4 fields are not in file2? My solution will only print lines in file2 whose 1st 4 fields weren't in file1. Please let me know if that is your desired behavior. – terdon Jun 25 '15 at 14:30
  • i tried `csv1 csv1 csv2 > results` and `csv2 csv2 csv1 > results' both are not what i was looking for, since csv 1 has 10 000 lines and csv2 has 2 000 so results should have 8000 lines, perhaps im doing something wrong – Lynob Jun 25 '15 at 19:44
  • @Lynob yes, as I said in my previous comment, this will only find unique lines from the second file. I'm working on a fixed version now. one more thing, can there be duplicate lines (I mean lines with the same 1st 4 fields) in the _same_ file? – terdon Jun 25 '15 at 21:25
  • @Lynob OK, try the new version. This one should work for unique lines in either file. I don't have time to add an explanation now, sorry, I'll try and add one tomorrow. However, I'm afraid I might not have understood well. Why would the output file have 8000 lines? Do you mean that _every_ line in csv2 has a corresponding line in csv1? – terdon Jun 25 '15 at 21:57
  • yes every line in csv2 has a corresponding line in csv1 - and there are no duplicate lines in the same file - basically im trying to remove csv2 from csv1 and output the rest of csv1 – Lynob Jun 26 '15 at 10:53
  • @Lynob ah! That makes everything quite a bit simpler. Please [edit] your question and make it clear. I'll edit my answer. – terdon Jun 26 '15 at 10:54
  • @Lynob OK, see updated answer. – terdon Jun 26 '15 at 11:01