Shuffle rows independently in a large file

Question

I need to generate a random shuffle a very large csv (where I don't know in advance how many columns would be) in this way. So I have to go from this form

1,a,...
2,b,...
3,c,...

to something like this

3,b,...
1,c,...
2,a,...

I know I can shuffle the rows with shuffle, but I need to shuffle each column independently. I am wondering if it possible with a combination of bash commands.

Kinda feels like the way to do it would be to split the file into two (can `sed` do that?), then shuffle the files separately, and then merge them back together. I'm not a Linux command line guru, but that would be the approach I would be looking to achieve. — Mokubai, Apr 11 '21 at 17:47
Is every `,` character a delimiter? In general csv allows embedded commas [with quoting](https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules). If every comma is a delimiter then the first step may be `awk -F , …`. — Kamil Maciorowski, Apr 11 '21 at 17:49
I don't know in advance how many columns there would in the csv. — emanuele, Apr 11 '21 at 17:49
I would highly suggest a more advanced language than bash (which is *barely* a language). Python, Perl, Java, c# are all very capable of doing this job with ease. This 100% does not answer your question so ignore it if you please 8^P — Señor CMasMas, Apr 11 '21 at 17:57
I tried python, but it is not viable. Has been 3 days and counting... — emanuele, Apr 11 '21 at 20:21
@emanuele https://www.w3schools.com/python/numpy/numpy_array_split.asp and https://stackoverflow.com/questions/473973/shuffle-an-array-with-python-randomize-array-item-order-with-python then https://www.w3schools.com/python/numpy/numpy_array_join.asp — Mokubai, Apr 11 '21 at 21:15

score 0 · Answer 1 · answered Apr 12 '21 at 11:32

I created a python script that generate a bash script. I don't think is the most elegant way, but works quite well.

import csv

FILENAME = 'my_huge_csv.csv'

with open(FILENAME,'r') as f:
    reader = csv.reader(f,delimiter=',')
    NCOL = len(next(reader))

with open("shuffle_{}.sh".format(FILENAME),"w+") as f:
    f.write("#/bin/bash \n")
    f.write("/usr/bin/head -n 1 {} > final.csv \n".format(FILENAME))
    for i in range(NCOL):
        f.write("/usr/bin/tail -n +2 {}|/usr/bin/cut -d, -f{}|shuf > tmp_file_{}.csv &\n".format(FILENAME, i+1,i+1))
    f.write("wait \n")
    cut_arg = ['tmp_file_{}.csv'.format(i+1) for i in range(NCOL)]
    cut_cmd = '/usr/bin/paste -d , ' + ' '.join(cut_arg) + ' >> final.csv \n'
    f.write(cut_cmd)
    f.write('rm '+ ' '.join(cut_arg) + ' \n')

Than I have to simply execute chmod +x on my script and running it.

This looks like an awful way. I'm not downvoting it because for now it's the only answer and I believe it kinda works (i.e. it works at least in some circumstances); and I don't feel it would be right to punish you for posting your own imperfect solution, especially when there are no alternatives (yet). Still I'm not upvoting either. In hope of getting better answers: (1) `cut -d,` suggests the answer to my comment (under the question) is "yes". Please confirm. (2) `head` and `tail` suggest there is a header; but there is no header in the example in the question body. Is there a header or not? — Kamil Maciorowski, Apr 12 '21 at 12:15
It's quite awful indeed, but I can't find a better solution. And yes, there is an header. — emanuele, Apr 12 '21 at 16:20

Shuffle rows independently in a large file

1 Answers1