
I have a file containing 2.3M lines, which looks like this:

$less V2.fastq

>TS19_EWP4IQK02JPFP5
CATGCTGCCTCCCGTAGGAGTTTGGTCCGTGTCTCAGTACCAATGTGGGGGACCTTCCTC
TCAGAACCCTATCCATCGTCGGTTTGGTGGGCCGTTACCCGCCAACTGCCTAATGGAACG
CATGCCTATCTATCAGCGATGAATCTTTAGCAAATATCCCCATGCGGGGCCCTGCTTCAT
GCGGTATTAGTCCGACTTTCGCCGGGTTATCCCCTCTGATAGGTAAGTTGCATACGCGTT
ACTCACCGTGCGCCGG
>TS20_EWP4IQK02FSQQL
CATGCTGCCTCCCGTAGGAGTTTGGACCGGTGTCTCAGTTCCAACTGTGGGGGGACCTTC
CTCTCCAGAACCCCCTATCCCATCGAAG
>TS19_EWP4IQK02GBB8K
CATGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCCGATCACCCTC
TCAGGTCGGCTATGTATCGTCGCCTAGGTGAGCCGTTACCTCACCTACTAGCTAATACAA
CGCAGGTCCATCTTGTAGTGGAGCATTTGCCCCTTTCAAATAAATGACATGAGTCACCCA
TTGTTATGCGGTATTAGCTATCGTTTCCAATAGTTATCCCCCGCTACAAGGCAGGTTACC
TACGCG
>TS19_EWP4IQK02FUJRM
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGAAGACTAGGTGGGCCGTTACCCCGCCTACTATCTAATGGAA
CGCACCCCCATCTTACACCGGTAAACCTTTAATCATGCGAAAATGCTTACTCATGATAAC
ATCTTGTATTAATCTCCCTTTCAGAAGGCTGTCCAAGAGTGTAAGGCAGGTTGGATACGC
GTTACTCACCCGTGCGCCGGTCG
>TS119_EWP4IQK02I2KHZ
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGATGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAA
CGCATCCCCATCAATGACCGAAATTCTTTAATAGCTGAAAGATGCCTTTCAGATATACCA
TCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTG
TTACTCACCCGTGCGCCGTCG

A line that starts with ">" denotes a single SampleID. The sample name is the term before "_" in that line: for example TS19, TS20, TS119, etc. I want to make a separate output file for each such sample, containing the SampleIDs and the content beneath them. Can anyone please help me?

Many thanks

edit:1 To get the output for sample TS19 we can use the following command, which returns the output shown below. Command:

sed -n '/>TS19_/, />/p' V2.fasta 

Output (a few lines out of thousands)

>TS19_ok4.40713 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTAGGCCGTTACCCCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGCTTTTTCACACCGTACCATGCGGTACTGTGCGCTTATGCGGTATTAGCAGTCGTTTCCAACTGTTATCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCACCCGTCCG
>TS6.2_ok4.40714 CGTCAGACGGAT
>TS19_ok4.40771 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG
>TS8_ok4.40772 TCGAGACGCTTA
>TS19_ok4.40971 CTAACGCAGTCA
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCATGTGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG
>TS127_ok4.40972 GACCGAGCTATG

I just want to remove the lines that start with > but do not belong to TS19. Can anyone help me?
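One way to keep only the TS19 records and drop every other header line is to toggle a flag in awk on each header: a minimal sketch, with made-up sample data that mirrors the excerpt above.

```shell
# Made-up miniature of V2.fasta, just to demonstrate (not the real data).
cat > sample.fasta <<'EOF'
>TS19_ok4.40713 CTAACGCAGTCA
TTGGGCC
>TS6.2_ok4.40714 CGTCAGACGGAT
>TS19_ok4.40771 CTAACGCAGTCA
TTGGGAA
EOF

# On every ">" line, set the flag to 1 only for TS19 headers; sequence
# lines are printed or skipped according to the header above them.
awk '/^>/{keep = /^>TS19_/} keep' sample.fasta
```

This prints the two TS19 records (header plus sequence lines) and skips the TS6.2 header entirely, which avoids the trailing end-pattern line that the sed range prints.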

edit:2 https://drive.google.com/file/d/17MC0tiIE6axOJqNZukzsQX5bVpuvV312/view?usp=sharing

DEEP
  • Look at [Stackoverflow](https://stackoverflow.com/questions/11313852/split-one-file-into-multiple-files-based-on-delimiter). You'll find a huge amount of similar questions there – kanehekili Oct 16 '22 at 17:59
  • ... or U&L, ex. [Split file into multiple files based on pattern](https://unix.stackexchange.com/questions/263904/split-file-into-multiple-files-based-on-pattern) – steeldriver Oct 16 '22 at 18:05
  • A simple C program would do it with very little fuss – Giorgos Saridakis Oct 16 '22 at 18:21
  • Hi @steeldriver- The solution in the link works only when the SampleID is always the same. In my case, only a portion of the lines starting with ">" denotes the SampleID. – DEEP Oct 16 '22 at 19:00
  • Hi, @steeldriver- the following command, which includes the start pattern (**>TS19**) and end pattern (**>**), works to some extent. But it returns the line with the end pattern, which I don't want. Can you please add some changes to the command so that it works perfectly? **sed -n '/>TS19_/, />/p' V2.fasta** – DEEP Oct 16 '22 at 21:12

3 Answers


Edit 1: Take away the `-n 7` option ... you won't need it.

csplit -z V2.fastq -f TestSample '/>TS/' '{*}'

This will generate files TestSample00, TestSample01, TestSample02, TestSample03, ... TestSamplennnnnn based upon your file.

Finally, you'll want a prefix to identify all these files. Sorry my solution doesn't rename your file to show the Test Sample number naming convention, but at least you can vary it each time you run the command by changing the prefix with the -f parameter.

Edit 2
If however you need all of your data having the same test sample identifier collected together in the same file, then follow up with a command such as

find . -name "TestSample*" | xargs grep -l TS19_ | awk '{print "cat " $1"  >> My_TS19_.fasta " }' | sh

The new file (My_TS19_.fasta) will have all the sequences in it that pertain to TS19_, or whatever case-sensitive string you put after grep.

I've added the xargs command to stream the list of files rather than choking the find command.

The awk command takes the file names and appends each one to an initially non-existent or empty file. Be careful to use a new file each time to avoid making duplicates.
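The regrouping step can also be done without grep at all, by reading each piece's own header: a sketch (not part of the original answer) that appends every TestSampleNN file to a file named after the sample term before "_". The demo files below are made up.

```shell
# Stand-ins for csplit output, just to demonstrate (run in a fresh directory
# so the >> appends don't accumulate duplicates across runs).
printf '>TS19_AAA\nCATG\n' > TestSample00
printf '>TS20_BBB\nGGTA\n' > TestSample01
printf '>TS19_CCC\nTTTT\n' > TestSample02

# Append each piece to a per-sample file: the sample name is the first
# header line with the leading ">" and everything from "_" onward removed.
for f in TestSample*; do
  s=$(head -n 1 "$f" | sed 's/^>//; s/_.*//')  # e.g. TS19
  cat "$f" >> "$s.fasta"
done
```

After the loop, TS19.fasta holds both TS19 pieces and TS20.fasta holds the TS20 piece, so one pass covers every sample instead of one grep per sample name.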

mondotofu
  • Hi @mondotofu- probably my question is not clear enough. Actually, here **TS19_EWP4IQK02JPFP5, TS19_EWP4IQK02GBB8K and TS19_EWP4IQK02FUJRM** belong to the same sample, that is **TS19**. They should be kept inside a single file named **TS19**. Is it possible to do? – DEEP Oct 19 '22 at 06:30
  • I've edited my answer by adding another command at the end. – mondotofu Oct 20 '22 at 00:15
  • If this answer resolved your question, then please show your appreciation by "accepting" it: click the checkmark next to the question. – mondotofu Oct 21 '22 at 01:14
  • Hi @mondotofu- Actually my system hangs with the `csplit` command, probably because it results in millions of output files. And also the `find` command returns the following error: `bash: /usr/bin/find: Argument list too long`. I would request that you take a look at **edit:1** in the question itself, where I have put a command that comes close to the desired answer. And I need your help to make it perfect. Please take a look. – DEEP Oct 21 '22 at 10:51
  • 1
    Revised the commands and used xargs to eliminate the Argument list too long error. Take note of new commands in **Edit 1** and **Edit 2** @DEEP – mondotofu Oct 21 '22 at 21:12
  • 1
    Tried out commands on your file in Google drive. csplit completed in about a minute. The second command finished in about 10 seconds. I have Ubuntu 22.04.1 LTS on a Thinkpad with 32GB RAM. – mondotofu Oct 22 '22 at 02:09
  • Yeah @mondotofu- probably because of RAM issue it is not running. I have only 8GB of RAM – DEEP Oct 22 '22 at 05:17
  • I'd wipe out all the TestSample files and try again with the revised commands in Edit 1 and Edit 2 in my answer. There should be enough gradualism in the csplit command that you don't need large amounts of RAM to complete it. The xargs command also takes away the long list of files, handling one at a time, and the sh command makes sure that it executes once for each file detected. – mondotofu Oct 22 '22 at 15:09

I wrote a Perl script a while ago, specifically for this.

The script takes a fasta file and creates individual files for all sequences. It will also clean the fasta file: line breaks in the sequence, empty lines, and leading whitespace in headers (`> id`) are removed by default. Additionally, non-ACGT characters can be converted to N, and lowercase sequence characters can be converted to uppercase.

The script is called split_fasta.pl and you can find it on my github: https://github.com/nterhoeven/sequence_processing

Wayne_Yux

With awk, you can set > as the record separator and process (match) whole records instead of lines, and search for e.g. records containing "TS19", like so:

awk 'BEGIN {RS=">"; ORS=RS} /TS19/' V2.fasta

Or automatically split each record type into a file with a .split extension, i.e. TS119.split, TS19.split, TS20.split, ... in the same working directory, like so:

awk 'BEGIN {RS=">"; ORS=RS} {split($1, arr, "_"); f=arr[1]".split"; print > f}' V2.fasta
Raffa
  • Hi @Raffa- Thanks a lot. It works somewhat close but not perfect. I have seen the following command returns 21469 lines: `cat V2.fasta | grep ">TS19_" | wc -l`. But your command generates 21435 lines, as obtained from this command: `awk 'BEGIN {RS=">"; ORS=RS} /TS19_/' V2.fasta | grep ">TS19_" | wc -l`. I am uploading my file under **edit:2** with the question so that you can help me properly. – DEEP Oct 21 '22 at 13:44
  • Hi @Raffa- I found some sample IDs that are not being extracted with your command and they have a common pattern. They have four different strings separated by three spaces in the ID names started with `>`. Here are some examples of such IDs: `>TS19_ct4.121197 CTAACGCAGTCA Corrected: CTAATGCAGTCA, Changes: (pos 4, T -> C) >TS19_ct4.121626 CTAACGCAGTCA >TS19_ct4.131873 CTAACGCAGTCA Corrected: CTTACGCAGTCA, Changes: (pos 2, T -> A) >TS19_ct4.131930 CTAACGCAGTCA` – DEEP Oct 21 '22 at 14:08
  • Hi @DEEP `grep` will match lines (each one) directly ... While the above `awk` code will identify blocks (records) separated by `>` first, then print the whole record if a match is found in that record ... I apologize for not having time to inspect your file ... But I have downloaded it and hopefully I will look into it when I have time. :-) – Raffa Oct 21 '22 at 14:20
  • ok @Raffa- Thanks – DEEP Oct 21 '22 at 14:39
  • 1
    @DEEP I quickly looked into your file and found leading white-space in some lines ... That can be solved with `awk 'BEGIN {RS=">"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta` ... Please try it and let me know. – Raffa Oct 21 '22 at 15:35
  • Many many thanks @Raffa. It is working perfectly. I really appreciate your helping hands towards me. However, is it possible to write a loop for all sample IDs like `TS19_, TS119_, TS20_` and so on? – DEEP Oct 21 '22 at 15:42
  • @DEEP Sure, you can loop over patterns in an array or from another file ... Please see my answer here: https://askubuntu.com/a/1420653 ... It includes an example with `awk` that you can adapt with the above code ... It's my pleasure to help :-) – Raffa Oct 21 '22 at 16:03
  • sorry @Raffa- I found some problem with your last command `awk 'BEGIN {RS=">"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta`. It actually adds some distorted lines at the end of the file where the lines are starting with something like `>C) or >G), etc`. An example line is: `>G) CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTGGGCCGTTACCTCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTACCACTGTACCATGCAGTACTGTGGTCTTATGCGGTATTAGCAGCCATTTCTAACTGTTACTCCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCCACCCGTCCCG` – DEEP Oct 22 '22 at 13:45
  • 1
    @DEEP It's OK ... I tested and had the same result so I inspected your file and found `>` in the middle of some lines and treated like a record separator where it shouldn't ... Your file is inconsistent in this regard ... That can be fixed: `awk 'BEGIN {RS="\n>"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta` – Raffa Oct 22 '22 at 14:51
  • I found it works except the fact that it adds a single `>` at the end like: `>TS19_ok4.40771 CTAACGCAGTCA TTGGGCCGTGTCTCAGTCCCAATGTGCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG >TS19_ok4.40971 CTAACGCAGTCA CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG >` – DEEP Oct 22 '22 at 14:57
  • 1
    @DEEP That is because setting `ORS=RS` ... It sets the output record separator as the input record separator and is invoked after each `print` action (not before) and as a result the first record is printed without a `>` as well ... One solution is: `awk 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); /TS19_/ {print ">"$0}' V2.fasta` – Raffa Oct 22 '22 at 15:14
  • Thanks @Raffa- It works. For generating individual files for each sample name (TS19, TS190, TS191 etc.) I have generated a file (`samplenames.txt`) with the sample names from `V2.fasta`. Its contents are like: TS19 `TS190 TS191 TS19.2`. Now I wrote this script to run as a loop over the samples: `#!/usr/bin/bash filename="samplenames.txt" samples=$(cat $filename) for f in $samples do awk 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); /$f/ {print ">"$0}' V2.fasta > ${f}.fasta done`. It generates only blank output files `TS19.fasta, TS190.fasta, etc`. Do you have any idea about the problem? – DEEP Oct 23 '22 at 04:52
  • @DEEP You need to pass the shell variable to awk like so `awk -v f="$f" '…..'` and use it inside awk without `$` … just `f` – Raffa Oct 23 '22 at 06:51
  • @DEEP `awk -v f="$f" 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); $0 ~ f {print ">"$0}' V2.fasta` – Raffa Oct 23 '22 at 09:11
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/140070/discussion-between-deep-and-raffa). – DEEP Oct 23 '22 at 11:23
  • @DEEP Put that awk command back in your for loop in place of the similar awk command part and it will work :-) … I am on my mobile phone now so can’t write code correctly, but will get in front of a PC in a couple of hours if you still need me – Raffa Oct 23 '22 at 11:49
  • Thanks @Raffa for being patient with my problems. It works, but I discovered one problem. I found that for `f in TS1.2_`, instead of only `TS1.2_`, all the following sample names are taken at once: `TS1.2_ TS132_ TS142_ TS152_ TS162_ TS182_ TS192_`. I think the dot `.` character in `TS1.2_` is being treated as a wildcard, which is creating the problem. – DEEP Oct 23 '22 at 12:44
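Since `.` is a regex metacharacter, one way around the wildcard problem is to drop the pattern loop entirely and let awk name each output file from the header itself, using only literal string operations: a sketch (not from the thread), with made-up demo data.

```shell
# Made-up miniature of V2.fasta with a dotted sample name, to demonstrate.
cat > V2_demo.fasta <<'EOF'
>TS1.2_aaa CTAACGCAGTCA
CATG
>TS132_bbb CTAACGCAGTCA
GGGG
>TS1.2_ccc CTAACGCAGTCA
TTTT
EOF

# One pass over all records: the file name is the header text before "_".
# split() with "_" is a literal separator, so "TS1.2" is never treated
# as a regex and cannot match TS132, TS142, etc.
awk 'BEGIN {RS="\n>"} {sub(/^>/,""); gsub(/^[ \t]+/,"");
     split($1, a, "_"); print ">"$0 > (a[1]".fasta")}' V2_demo.fasta
```

This writes both TS1.2 records to TS1.2.fasta and the TS132 record to TS132.fasta, with no samplenames.txt needed. With many distinct samples, some awk implementations limit the number of open files, in which case adding `close(a[1]".fasta")` after the print may be necessary.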