How to Demultiplex a fastq.gz file.
5.2 years ago
eli_bayat ▴ 90

I am a new postdoc and was given a folder of fastq.gz files. I was told they are not demultiplexed, so I need to extract the reads for each sample from each of these fastq files (each contains data for multiple subjects), save them as separate fastq files, and run the dada2 pipeline on them to get ASVs. My apologies if I am not using some terms correctly; I am very new to this. I have worked with ASV tables before, but have never done demultiplexing. If you can tell me how to do it, or what software or platform I can use to separate these samples, I would appreciate your help.

illumina de-multiplexing Miseq fastq dada2 • 8.9k views

Are the sample barcodes in the indices, or are they internal to the read? Have they been pulled out of the read and moved to the read name? If the usual Illumina indices were used to multiplex, it is far easier to demultiplex as the fastqs are being generated than to do it after the fact.


This is what the data looks like when I open a fastq file in the terminal. There is also a Barcode text file with columns for sample ID and barcode pair name.


MWI006 is the sample ID, and there are several such IDs with different numbers in one fastq file, which means I need to demultiplex the samples.


That pic doesn't work for me; just copy and paste the text.


Sorry about that, I am pretty new to this forum.

@M01380:62:000000000-B547W:1:1102:20819:1013 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1
NGTAGAGTTTGATTCTGGCTCAGGATGAACGCTGACAGAATGCTTAACACATGCAAGTCTACTTGATCCTTCGGGTGATGGTGGCGGACGGGTGAGTAACGCGTAAAGAACTTGCCCTGCAGTCTGGGACAACATTTGGAAACGAATGCTAATACCGGATATTATGCGAACTTCGCATGTAGCTCGTATGAAAGCTATATGCGCTGCAGGATAGCTTTGCGTCCTATTAGCTAGTTGGTGAGGTAACGGATCACCAAGGCCATGATCGGTAGCCGGGCTGAGTGTGTGAACGGCCGCAAGG
+
#8BCCGGGGGGGGGGGGGFGGDFGFFGGFCFGGGDGFF8CEAFGFGGGGEDFGGGGGGFFGGGGGGGGGCFAF7C<+DDFGGGD8@EFFFGGGGFGGGGCCFGGDCGDD?,B?ECG?A<FGDFGGGGGGGFF8FGGFGGG9EFF7BFFFFFFDGCFG7CEFAF@FG,3FGGGG,+FCECGG=CC9:CCFFGGF9>CFFCGGFGGGC*6<@@,9?FC@FG@EC88E?9F?F6>76+>AFC5C5EFAC6C**//02A=EGFEE437>:+1***122)/)/7*)9*:**)01*)87)4),)-1:

@M01380:62:000000000-B547W:1:1102:16288:1015 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1
NTACGTAGGGTTCGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATCATCAAAGATTGCTTTGATGGATGGCGACCGGCGCACGGGTGAGTAACACGTATCCAACCTGCCGACAACACTGGGATAGCCTTTCGAAAGAAAGATTAATACCGGATGGCATAATTATTACGCATGGGATAATTATTAAAGAATTTCGGTGGCCGATGGGGGTGCGTTACATTAGGCAGATGGCGGGGGAAAGGCCTACCAAAACAACGACGGATAGGGTGTGTGG
+
#8@ACGG@BEFF87EFFFFF88CFGGFG,EECCF,CF:,,F<FECCFFDFGFGGFDCCEFFFGEGGG:@FCCDF8FFFGFGG8,9@,,?<C<CFGGEFF8FCCEEC7=7FFCG+8+AE<CBEGFEFF:BFFGFC8,,BF7@7CE8B=FAB8,5,,7@FAE**><@,FCCFA@FFCC;,>11*5*>FGFG9,@C9,6=CEGG88+29+3?C+23+49<=9+?BFD8***3==/:=*;**/*1:C**+2+0:+3<C**+76==7*))*2979C**2)2)9)*)*.1>)87:.,9*.,*4).4(

@M01380:62:000000000-B547W:1:1102:15376:1016 1:N:0:MWI005 NGCCTCTT|1|NTAAGGAG|1
NTACGTAGGGTTCGATTCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAAGCGGTTTGTCGGAAGTTTTCGGATGGAAGATAAACTGACTGAGTGGCGGACGGGTGAGTAACGCGTGGGTAAACTGCCTCATACAGGGGGGTAAAAGTTAGAACTTACTGATAATACAGCATAAGACAACAGCACCGAATGGTGCAGGGGTAAAAACACCGGGGGTATGAGATGGAGTCGAGAATGATAAGCAAGTTGGAGGGGTGAGTGCATACCAAAACGACGCTCAGCA

I looked up what each line means and I understand it; the only part I don't get is NGCCTCTT|1|NCTGCATA|1 at the end of the first line. Can you help me with this? What does it mean?


Those are probably the sequences of the two indices, but why didn't the people who made the fastqs demultiplex them for you? Anyway, you can write a little script in whatever language you like to split out the reads by sample name, since for some reason that's in the read name. If you have a modest number of samples, you can grep for the desired sample names one at a time.
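
A minimal sketch of such a script in Python. It assumes the header layout from the records pasted above, where the sample ID is the last colon-separated field of the second token (e.g. 1:N:0:MWI006), and that every record is exactly four lines. The function names, the `suffix` parameter and the output file naming are made up for illustration, so adjust them to your data.

```python
import gzip
import re
import sys
from collections import Counter
from pathlib import Path


def sample_id(header):
    """Return the sample ID from a header such as
    '@M01380:...:1013 1:N:0:MWI006 NGCCTCTT|1|NCTGCATA|1'.

    Assumption: the ID is the last ':'-separated field of the second
    whitespace-separated token. Anything else becomes 'undetermined'.
    """
    tokens = header.split()
    if len(tokens) < 2 or ":" not in tokens[1]:
        return "undetermined"
    return tokens[1].rsplit(":", 1)[1] or "undetermined"


def demultiplex(fastq_gz, out_dir, suffix=""):
    """Write one <sample><suffix>.fastq.gz per sample ID; return read counts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}  # sample ID -> open gzip text handle
    counts = Counter()
    try:
        with gzip.open(fastq_gz, "rt") as reads:
            while True:
                header = reads.readline()
                if not header:
                    break  # end of file
                if not header.strip():
                    continue  # tolerate blank lines between records
                # Assumption: sequence, '+', and quality are one line each.
                record = [header] + [reads.readline() for _ in range(3)]
                if not record[3].strip():
                    raise ValueError(f"truncated record: {header.strip()}")
                sid = sample_id(header)
                if sid not in writers:
                    safe = re.sub(r"[^A-Za-z0-9._-]", "_", sid)
                    path = out_dir / f"{safe}{suffix}.fastq.gz"
                    writers[sid] = gzip.open(path, "wt")
                writers[sid].write(
                    "".join(line.rstrip("\n") + "\n" for line in record)
                )
                counts[sid] += 1
    finally:
        for handle in writers.values():
            handle.close()
    return counts


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python demux_by_name.py pooled.fastq.gz out_dir
    for sid, n in sorted(demultiplex(sys.argv[1], sys.argv[2]).items()):
        print(sid, n, sep="\t")
```

If the run is paired-end, running the same function on the R1 and R2 files with different suffixes (say `_R1` and `_R2`) should keep the per-sample files in the same read order, provided both input files list the reads in the same order. It is worth checking that the read counts per sample match before handing the files to dada2.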


If you want to try doing this manually yourself, you might look at the posts here: How to subset fastq data based on leading nt of sequences?


That's not what the OP needs. Their indices are not embedded in the read.


Hi eli_bayat,

welcome to Biostars. No need to apologize for being new to the community; we all were at some point. A tip: add data and code examples as plain text and highlight them with the code button (10101), which makes it easy for others to copy and paste them, e.g. to test code they might suggest to you.

For embedding images, please use the image button (the one to the right of the 10101 button). You have to paste in the full link to the image from the image host, e.g. https://i.ibb.co/HF8PH8T/(...).png, to make sure it is properly embedded. I made the changes in this thread this time. Cheers!


Thanks! I appreciate it :)

5.2 years ago
steve ★ 3.5k

You typically demultiplex Illumina sequencing data with the program bcl2fastq. As the name implies, it converts the original basecall files (.bcl) from the sequencer directly into demultiplexed .fastq.gz output, using a .csv-formatted sample sheet that maps indices to samples. This is typically done automatically by the sequencing facility, so your best bet is to figure out who did the sequencing and get them to demultiplex it. Trying to demultiplex after the fact is kind of a waste of time because it will be much harder and slower.
