2

I have a text file (more than 1 GB in size) and it contains lines like these:

1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

All lines start with one of the prefixes below:

10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100

I need to separate it into 8 files, one per prefix. How can I do this with a sed command?

muru
maa
  • Do you have any attempts yet? Why the restriction to use sed? I have an idea how to do this with egrep. – Sheldon Jul 27 '22 at 07:42
  • @Sheldon no, sed may be easy; any other idea welcome... – maa Jul 27 '22 at 08:16
  • The example pattern string `10840110` is repeated twice (*probably by mistake ... so please edit your question to correct that*) in your question ... Therefore I used 7 patterns in my answer, as the patterns need to be unique or otherwise you will get duplicate lines in the output. – Raffa Jul 27 '22 at 10:04
  • How do you want to separate the file? Do you just need to have 8 files of the same size (or as close as possible) or do you want to separate them based on the first few numbers? – terdon Jul 27 '22 at 18:54

4 Answers

10

You could use sed to turn your file of prefixes into a file of sed commands, then use that as the script for a second sed invocation that processes the large file. This will almost certainly be more efficient than using a shell loop to run sed (or grep) multiple times over the same (large) file. For example, given

$ cat file2
10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100

then

$ sed 's:.*:/^&/w&.txt:' file2
/10830110/w10830110.txt
/1083021/w1083021.txt
/10840110/w10840110.txt
/10840110/w10840110.txt
/1088022100/w1088022100.txt
/10850110/w10850110.txt
/1085022100/w1085022100.txt
/1086022100/w1086022100.txt

so that

$ sed 's:.*:/^&/w&.txt:' file2 | sed -n -f - file1

produces

$ head 108*.txt
==> 10830110.txt <==

==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17

==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff

==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9

==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62

==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
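
Note the empty 10830110.txt above: sed creates (or truncates) every file named in a w command when it parses the script, before reading any input, so the file exists even though no input line matched that prefix.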

You may want to de-duplicate the pattern file first, and possibly sort it in reverse numeric order and modify the second sed command to branch after the first match, so that each line is written only once, for its longest matching prefix (with reverse numeric sorting, a prefix with more digits is always the larger number, so longer prefixes are tried first):

$ sort -nru file2 | sed 's:.*:/^&/{w&.txt\nb\n}:' | sed -n -f - file1

giving

$ head 108*.txt
==> 10830110.txt <==

==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17

==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff

==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9

==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62

==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
terdon
steeldriver
  • *"this will almost certainly be more efficient than using a shell loop to run sed multiple times on the same (large) file."* ... For very large files, yes it is ... That earned an up-vote :-) – Raffa Jul 27 '22 at 13:27
  • At first glance, I think it's more efficient to avoid sed entirely and process in a single `read`-type loop (but I haven't tried it yet to prove it to myself), but I do love this answer. Upvoted simply for the meta approach of using a sed script to write a sed script :-) – NotTheDr01ds Jul 27 '22 at 16:11
  • `sort -nru file2 | sed 's:.*:/^&/{w&.txt\nb\n}:' | sed -n -f - file1` is working perfectly, thank you – maa Jul 29 '22 at 06:15
6

prefix.text (contains the 8 prefixes)

1prefix
2prefix
3prefix
4prefix
x1prefix
x2prefix
x3prefix
x4prefix

input.text (like your 1 GB text file)

1prefix90956666
3prefix26588388
1prefix49080634
x3prefix59162307
x1prefix86437679
x4prefix77832956
x3prefix56458412
2prefix37484977
x2prefix73879936
x1prefix44005273
2prefix57156422
x1prefix67751608
4prefix25566629
x2prefix93657051
x3prefix40897616
4prefix93222501
3prefix35680804
x4prefix42979833
x2prefix08229240
1prefix42071365
4prefix67857600
2prefix66384962
x4prefix21482824
3prefix59616880

Loop with grep to write one output file per prefix:

while read -r prefix
do
    grep "^${prefix}" input.text > "output_${prefix}.text"
done < prefix.text

output_x1prefix.text (output example)

x1prefix86437679
x1prefix44005273
x1prefix67751608
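
Since each prefix's grep pass is independent of the others, the passes can also run concurrently. A minimal sketch, assuming GNU xargs (for the -a and -P options) and the same file names as above:

# Run up to 4 greps in parallel, one per line of prefix.text;
# the prefix is passed to the subshell as $1 rather than being
# substituted into the command string.
xargs -a prefix.text -P 4 -I{} sh -c 'grep "^$1" input.text > "output_$1.text"' sh {}

Whether this is actually faster depends on the storage: concurrent readers tend to help when the file is cached or on an SSD, and can hurt on a spinning disk.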
Sheldon
  • This will require reading the entire enormous file once per prefix. Also, `egrep` is deprecated in favor of `grep -E`. In any case, you don't need `grep -E` or `egrep` here, a simple `grep` would do the same. – terdon Jul 27 '22 at 18:56
  • Thank you @terdon, I have replaced `egrep` with `grep`. – Sheldon Jul 27 '22 at 21:24
  • @terdon It seems this answer provides the fastest solution for the OP so far ... as per the **mythbusters** :-) ... Kindly see the **Speed test** part in my answer and let me know the results you get if you happen to test the solutions ... Sheldon I have already up-voted your answer so I can't up-vote it again, although I would if I could :-) – Raffa Jul 28 '22 at 08:30
  • @Raffa well, I'll be... Sheldon, looks like I owe you an apology! I still feel that reading the file multiple times is inefficient, but I also ran a test and got the same result as Raffa, so I was clearly wrong. Sorry! – terdon Jul 28 '22 at 10:25
  • Any reason why you picked `.text` instead of `.txt` or `.tmp`? – Ismael Miguel Jul 28 '22 at 23:13
  • @IsmaelMiguel Yes, I enjoy my freedom since the times of DOS filename restrictions (8.3) are over :-) – Sheldon Jul 29 '22 at 12:26
  • @Sheldon Can't argue with that. It is a good reason. – Ismael Miguel Jul 29 '22 at 13:46
6

This will create, in the current working directory, a new file with a .splt extension for each matched pattern, and write all matching lines to it.

sed in a shell for loop:

for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    sed -n "/^$i/p" FileName > "$i.splt" # Change "FileName" to your file name
    done

You can do the same with awk in a shell for loop:

for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > pat2".splt"}' FileName # Change "FileName" to your file name
    done

awk with an array of patterns:

awk '{pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' YourFile

Or save the patterns as lines (each pattern on its own line) in a pat.txt file and let awk build the array of patterns, like so:

awk 'FILENAME=="pat.txt" { pat[NR]=$0; next } { for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' pat.txt YourFile
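
An equivalent, slightly more conventional form (a sketch, not benchmarked here) uses the FNR==NR idiom instead of comparing FILENAME, and index() for a literal prefix test in place of a dynamic regex:

# While reading the first file on the command line (FNR==NR), collect
# the patterns. For data lines, index()==1 means the pattern occurs at
# position 1, i.e. the line starts with it; no regex is compiled.
awk 'FNR == NR { pat[NR] = $0; next }
     { for (i in pat)
         if (index($0, pat[i]) == 1)
             print > (pat[i] ".splt") }' pat.txt YourFile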

Speed test (for science)

I tested the solutions (three runs each, taking the rounded average) provided in my answer as well as in the answers by @steeldriver and @Sheldon. Here are the results (all on the same average-specification PC), using the same patterns; pat.txt contains:

$ cat pat.txt 
10830110
1083021
10840110
1088022100
10850110
1085022100
1086022100

and the data file file.dat is 1.1 GB, containing 8,484,000 lines, made by duplicating the lines in the example provided by the OP, i.e.:

1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

The results are ordered fastest first, and the code I used for timing is given under each result:

#1 grep in a shell loop @Sheldon (18 seconds)

s=$(date +%s); while read prefix
do
    grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt; e=$(date +%s); echo $(($e-$s))

More accurate timing:

$ time (while read prefix
do
    grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt)

real    0m17.969s
user    0m4.437s
sys     0m2.176s

#2 sed @steeldriver (20 seconds)

s=$(date +%s); sed 's:.*:/&/w&.splt:' pat.txt | sed -n -f - file.dat; e=$(date +%s); echo $(($e-$s))

More accurate timing with ^ added in response to the comment by @terdon:

$ time (sed 's:.*:/^&/w&.splt:' pat.txt | sed -n -f - file.dat)

real    0m18.748s
user    0m10.408s
sys     0m1.546s

#3 sed in a shell loop @Raffa (21 seconds)

s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    sed -n "/^$i/p" file.dat > "$i.splt" # Change "FileName" to your file name
    done; e=$(date +%s); echo $(($e-$s))

#4 awk in a shell loop @Raffa (35 seconds)

s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > pat2".splt"}' file.dat # Change "FileName" to your file name
    done; e=$(date +%s); echo $(($e-$s))

#5 awk @Raffa (414 seconds) <-- That was a shock

s=$(date +%s); awk '{pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' file.dat; e=$(date +%s); echo $(($e-$s))
Raffa
  • Nice! However, steeldriver's answer had an error: they had forgotten to add a `^` to only match the beginning of the line. Adding that (see updated answer) makes steeldriver's sed approach as fast as the grep loop on my system. – terdon Jul 28 '22 at 10:55
  • @terdon It comes close but `grep` in a loop is still leading :-) ... It even uses less CPU time! ... See the updated speed test above. – Raffa Jul 28 '22 at 13:29
  • As the prefixes are fixed beforehand (pun intended), it could be interesting to test whether `grep -F` fixed-string matching would be faster than normal `grep`. – Fjor Aug 09 '22 at 20:24
  • @Fjor No, not really ... It even took a couple more seconds, as it seems to scan whole lines in the absence of `^`, which is obviously regular-expression syntax and can't be used with `-F` ... So I tried `grep -P "^\Q${prefix}\E"`, enabling Perl-style regular expressions to try to get the best of both worlds, but this took even a few more seconds than `grep -F "${prefix}"` ... I would imagine that `-F` used with `-x`, matching whole lines from beginning to end, would be the most efficient use case. – Raffa Aug 10 '22 at 13:54
0

If the file is already split into lines which start with these strings, as it is in the example, you can use awk like this (reference):

 awk '{file="file."(++i)".txt"}{print > file;}' input-file.txt

This will produce a new file for each line.
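
Be aware that this can open a very large number of output files; on millions of lines some awk implementations will fail with a "too many open files" error, while GNU awk works around the descriptor limit by closing and reopening files as needed, at some performance cost.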

If we suppose the starting strings have a fixed length of 7 characters (which is not the case in the example), we can split the input file into separate files for each starting string with something like (reference):

awk '{file="file."(substr($1,1,7))".txt"}{print >> file;}' input-file.txt
pa4080
  • The starting prefixes were 10830110, 1083021, 10840110, 10840110, 1088022100, 10850110, 1085022100, 1086022100. Many lines in the 1 GB file start with the mentioned strings... I need 10830110.txt, 1083021.txt, ..., 1086022100.txt – maa Jul 27 '22 at 09:34
  • Hi, @maa, I've updated the answer with a solution for starting strings with a fixed length. If the starting strings have different length and you need to apply different logic, like in the other answers, it is better to implement this logic via `awk` to achieve satisfactory performance. – pa4080 Jul 27 '22 at 10:08
  • Each line starts with a prefix of a different length, e.g. 1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17 – maa Jul 27 '22 at 12:04