Split a BAM file containing multiple samples
6.9 years ago
martyferr90 ▴ 30

Hi all! I have a small problem: I have one BAM file (about 44 GB) that contains the reads from 11 different samples. I also have 2 txt files with the sample names and a lot of tab-delimited numbers.

How can I split this single BAM file into 11 separate BAM files?

Is it correct to use the following code? samtools view -bhR readids_for_sample_A.txt File.bam > File_A.bam

bam sample split split bam • 4.7k views

Do you have @RG tags in your BAM? See: Split a multisample bam using RG tag information
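A quick way to check is to look at the header. This is a minimal sketch, not a definitive recipe: `list_read_groups` is a hypothetical helper name (not part of samtools), `input.bam` is a placeholder, and the helper assumes the header fields are tab-separated as in the SAM format. It prints each @RG ID together with its SM (sample) value.

```shell
# Print "ID<TAB>SM" for every @RG line of SAM header text read from stdin.
# list_read_groups is a hypothetical helper name used only for illustration.
list_read_groups() {
  awk -F'\t' '
    /^@RG/ {
      id = ""; sm = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^ID:/) id = substr($i, 4)   # strip the "ID:" prefix
        if ($i ~ /^SM:/) sm = substr($i, 4)   # strip the "SM:" prefix
      }
      print id "\t" sm
    }'
}

# Usage (input.bam is a placeholder file name):
#   samtools view -H input.bam | list_read_groups
```

If the SM column shows one value per sample, splitting by read group is likely to help; if every line shows the same SM, the header alone does not separate the samples.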

I'm not sure whether my txt files contain the RG tags; this is the information in them:

sample1.sorted.bam    sample2.sorted.bam    (and 9 others)
14578326    10905678    9856227    14119725    12330675    13952835    12191130    13570563    43751694    6531804    10925343
14551023    10883187    9835887    14095128    12308196    13921506    12160806    13543377    43670814    6517218    10904049
1176495    887865    802050    1150821    1008891    1140798    988629    1106853    3577497    529143    882285
1236009    929736    839184    1204938    1046034    1187097    1032639    1157844    3716544    559122    929523
1025331    762015    688068    988872    859437    982050    851715    952365    3046440    464532    763785
944853    701523    631413    901983    785562    897543    778584    864981    2760462    423342    700155
912696    683217    619527    883587    767157    875331    761220    852915    2718768    413766    686181
878742    657891    596466    852810    743709    848145    737712    819984    2622840    401040    663999
766929    569241    515904    740352    644226    736128    638103    712884    2287110    341982    574176

Any ideas? Could these work like RG tags? How can I produce 11 separate files?

Could you run samtools view input.bam | head -1 and post the result?

This is the output:

L7IZC:01332:11594   4   *   0   0   *   *   0   0TAGAGAGTACGATCTCAGGTTTCAGGGTTATTTGACTACTACCTAGCTCAAGTCTTGAGCCACCATTACGTGTGCTAGAAAGGGTTACTAACCTCTGCCGAAGGGCTATAATGCTTACTGTAGAATTCTACTTGTCTATAGGATAAAGCATGATAATGGATGGTGGTAATTGCTCA   <;;<?<::::;;;;;?>>6==4<;;;2=5:;;/98::;;;7/*//////7499::7<=;<655.486899:88//.7755/59/616955.56;;:::5996::39998:6:=985::99998848599::;7<<<<<;;:7;<@@3///:::<A5<<5;;;5;:578/6278744    ZP:B:f,0.0106096,0.00442751,0.00104526  ZG:i:317    ZB:i:30 ZC:B:i,317,317,1,0  ZA:i:176    ZM:B:s,318,-8,300,-16,-14,306,-4,268,260,24,30,34,284,254,28,262,50,-14,272,228,64,-2,208,80,200,-14,-2,220,250,10,236,16,-12,240,4,22,222,-2,288,-20,250,246,44,2,242,478,-18,50,736,242,10,224,40,32,746,4,450,36,10,270,16,-16,6,-2,768,48,10,180,24,214,208,8,242,-16,8,-14,220,-14,218,-2,266,2,8,248,62,560,26,-2,242,-2,18,290,36,14,134,160,174,44,198,2,-10,500,-8,220,240,222,462,240,282,252,478,244,52,538,12,168,466,32,46,236,-2,52,26,28,326,36,238,84,184,72,6,216,278,2,74,222,16,306,196,32,200,204,12,684,14,10,646,82,372,46,-2,254,16,34,30,36,254,18,48,-14,288,560,446,84,268,126,236,8,258,2,-20,184,102,34,486,12,84,26,176,464,46,42,640,54,74,36,26,42,302,-12,-4,6,242,334,46,90,248,414,104,26,232,14,42,182,82,114,206,120,460,30,28,228,50,192,40,26,228,186,192,206,32,-10,178,28,112,454,94,34,414,52,202,76,280,0,36,18,272,52,210,40,488,58,184,6,212,202,42,84,246,32,12,258,48,22,88,10,240,264,60,372,84,244,14,54,202,26,32,-16,624,186,98,178,276,26,162,274,234,-22,94,444,270,458,6,276,-4,30,100,50,208,108,44,426,260,28,6,396,214,18,8,28,356,62,-16,84,488,42,204,144,134,236,148,74,214,154,44,40,196,292,72,2,282,94,202,32,108,212,450,206,88,112,22,28,334    ZF:i:8  RG:Z:L7IZC  PG:Z:bc

ok, so for this read, the read group is l7izc? How come rg is not capitalized? How was demultiplexing done?

I tried converting my BAM file to a SAM file to understand more. The complete header is:

@HD VN:1.4  GO:none SO:coordinate
@RG ID:L7IZC    PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGADS:IONA Test S5 Run  for use with Traceability Worksheet DT:2017-12-02T15:32:17+0100 SM:Sample_1 KS:TCAG CN:S5TorrentServerVM/S5-0318
@RG ID:L7IZC.1  PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGADS:IONA Test S5 Run  for use with Traceability Worksheet DT:2017-12-02T16:17:34+0100 SM:Sample_1 KS:TCAG CN:S5TorrentServerVM/S5-0318
@RG ID:L7IZC.10 PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGADS:IONA Test S5 Run  for use with Traceability Worksheet DT:2017-12-02T21:44:54+0100 SM:Sample_1 KS:TCAG CN:S5TorrentServerVM/S5-0318

[...]

So, as I understand it, I have 94 distinct IDs (from the complete SAM file), all from 1 sample (SM, right?), but I know that I have 11 samples. Am I right?

Are those chromosomes in the ID by any chance? i.e. samples split into chromosomes?

If so, should I not have 24*11 IDs?

So one would think. Have you looked through the collection of 94 to see if there is a pattern consistent with all?

Every RG looks like this:

@RG ID:L7IZC.1  PL:IONTORRENT   PU:s5/540   FO:TACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGATCGATGTACAGCTACGTACGTCTGAGCATCGA DS:IONA Test S5 Run  for use with Traceability WorksheetDT:2017-12-02T16:17:34+0100 SM:Sample_1 KS:TCAG`    CN:S5TorrentServerVM/S5-0318

The only thing that changes is the ID. The HD line is:

@HD VN:1.4  GO:none SO:coordinate

Any idea?

Then the program I mentioned should work and give you 1 bam file per sample while going through the bam file once.

OK, now I have 94 BAM files, but I have 11 samples. Any idea how I can get 1 file per sample?

The program creates one file per RG in the header. Can you run 'ls -al' in your directory?
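When several read groups belong to the same sample, they can be regrouped through the SM field of the @RG lines. The sketch below only illustrates that idea: `rgs_for_sample` is a hypothetical helper, and `input.bam`, `rgs_Sample_1.txt` and `Sample_1.bam` are placeholder names. It only helps if SM really differs between samples; in the header posted in this thread every @RG line shown carries SM:Sample_1, so it would return a single group.

```shell
# Print the IDs of all @RG header lines whose SM field equals the given sample.
# Reads SAM header text on stdin; rgs_for_sample is a hypothetical helper name.
rgs_for_sample() {
  awk -F'\t' -v want="$1" '
    /^@RG/ {
      id = ""; sm = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^ID:/) id = substr($i, 4)
        if ($i ~ /^SM:/) sm = substr($i, 4)
      }
      if (sm == want) print id
    }'
}

# Usage (all file names are placeholders):
#   samtools view -H input.bam | rgs_for_sample Sample_1 > rgs_Sample_1.txt
#   samtools view -b -R rgs_Sample_1.txt -o Sample_1.bam input.bam
```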

This is the 'ls -al' output.

-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.10.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.11.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.12.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.13.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.9.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.A.bam
-rw-r--r-- 1 user user 46076502142 dic  7 16:49 out.L7IZC.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.B.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.C.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.D.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.E.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.F.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.G.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.H.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.I.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.J.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.K.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.L.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.M.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.N.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.O.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.P.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.Q.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.R.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.S.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.T.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.U.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.V.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.W.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.X.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.Y.bam
-rw-r--r-- 1 user user        3092 dic  7 16:49 out.L7IZC.Z.bam
-rw-r--r-- 1 user user 46076511182 dic  6 10:25 R_2017_12_02_08_20_47_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_Auto_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_183.basecaller.bam
-rw-r--r-- 1 user user 25809747968 dic  7 11:44 R_2017_12_02_08_20_47_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_Auto_user_S5-0318-32-IONA_Test_S5_-_Traceability_Worksheet_-01DIC2017_Sample_1_183.basecaller.sam

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Use Submit Answers only for new answers to original question.

Those do not look like Chromosome names after all and it does not look like the samples were split.
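One way to double-check this is to count how many reads actually carry each RG tag. The listing shows one large file and many files of 3092 bytes each, which suggests that nearly all reads share a single read group. The sketch below assumes SAM text on stdin with optional tags starting at column 12; `count_reads_per_rg` is a hypothetical helper name and `input.bam` a placeholder. Note that it streams the whole file, which will take a while at 44 GB.

```shell
# Count alignment records per RG:Z: tag value in SAM text read from stdin.
# Records without an RG tag are reported under "(none)".
count_reads_per_rg() {
  awk -F'\t' '
    {
      rg = "(none)"
      for (i = 12; i <= NF; i++)            # optional tags start at column 12
        if ($i ~ /^RG:Z:/) rg = substr($i, 6)
      n[rg]++
    }
    END { for (rg in n) print rg "\t" n[rg] }' | LC_ALL=C sort
}

# Usage (input.bam is a placeholder file name):
#   samtools view input.bam | count_reads_per_rg
```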

Yep, you're right! It's a strange output: I have 1 big BAM file and the others are very small. This is all the information I have. Is there any process I can use to split this file?

Was this data produced by Torrent Suite? Perhaps you can export individual samples from there?

Yes, this data was produced by Torrent Suite. Honestly, I don't know; I only know that this file comes from a specific analysis with a specific workflow. I was asked whether I could analyze this file, but my analysis needs 1 file per sample. I tried to analyze the whole file as it is, but I couldn't. Is there maybe a script, R package or similar for analyzing a file like this? In particular for aneuploidy research.

6.9 years ago
Hussain Ather ▴ 990

What you've written should work.
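One detail worth checking: in samtools view, -R FILE keeps reads whose read group is listed in FILE, so the file is expected to hold read group IDs (one per line) rather than read names. To my knowledge, recent samtools releases offer -N FILE for filtering by read name instead. A minimal sketch under that assumption follows, where `extract_by_rg_file` is a hypothetical wrapper and all file names are placeholders.

```shell
# Write the reads belonging to the read groups listed in a file to a new BAM.
#   $1 = input BAM, $2 = text file with one read group ID per line, $3 = output BAM
# extract_by_rg_file is a hypothetical wrapper name used only for illustration.
extract_by_rg_file() {
  samtools view -b -R "$2" -o "$3" "$1"
}

# Usage (placeholder file names):
#   extract_by_rg_file File.bam rgs_for_sample_A.txt File_A.bam
```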

6.9 years ago
Gabriel R. ★ 2.9k

If you want one BAM file per sample and have RG tags, you can use my little program here: https://github.com/grenaud/libbam/blob/master/splitByRG.cpp

Otherwise, you can just iterate over each RG using samtools view -r [rg], but then you go over each record 11 times.
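The iteration could look roughly like this. It is a sketch, not a tested pipeline: `split_by_rg` is a hypothetical helper name, `input.bam` and the `out` prefix are placeholders, and the read group IDs are assumed to arrive one per line on stdin (for example extracted from the @RG header lines).

```shell
# Write one BAM per read group ID read from stdin (one ID per line).
#   $1 = input BAM, $2 = output prefix; output files are named PREFIX.ID.bam
# The input BAM is scanned once per read group, so this is slow for large files.
split_by_rg() {
  while IFS= read -r rg; do
    [ -n "$rg" ] || continue                 # skip empty lines
    # </dev/null keeps samtools from consuming the list of IDs on stdin
    samtools view -b -r "$rg" -o "$2.$rg.bam" "$1" < /dev/null
  done
}

# Usage (placeholder names):
#   samtools view -H input.bam \
#     | awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4) }' \
#     | split_by_rg input.bam out
```

samtools also ships a split subcommand that writes one file per read group in a single pass, which may be worth checking in the samtools manual before running a loop like this on a 44 GB file.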
