Calculate numbers of indels <=2 nt within sam
3
0
Entering edit mode
7.4 years ago
Rox ★ 1.4k

Hello everyone,

I am really bad at manipulating sam files... I was looking for a simple way to count, within a sam, the number of indels <=2 nucleotides. I also want it to be position specific, which means for me that, for a single position, if I have a read coverage of 100 and in each read there is a deletion, I only want to count one.

Is this a simple thing to do using samtools or any other scripting method ?

Thanks for your help !

Roxane

script • 2.6k views
ADD COMMENT
0
Entering edit mode

I was looking for a simple way to count, within a sam, the number of indels <=2 nucleotides.

This wouldn't be too hard since this information is readily available in the CIGAR strings.

I also want it to be position specific, which means for me that, for a single position, if I have a read coverage of 100 and in each read there is a deletion, I only want to count one.

That makes it a bit more complicated.

I would probably roll my own solution (using python-pysam), but it's perfectly possible that what you are looking for already exists...

ADD REPLY
0
Entering edit mode
7.4 years ago
Rahul Sharma ▴ 660

Few months ago I did some SNP analyses using mapped sam files. I compared two samples say prep2 and prep3, using samtools and bcftools. Following are the commands I used to investigate all variations including INDELS with a coverage cut-off of 100 and mapping quality cutoff 20. You could grep INDELS from the 'Qualt20DP100_flt.vcf.passed' file and check for position.

 nohup samtools mpileup -t DP -uv -Q 15 -f /media/Storage/Analysis/mm9/genome.fa /media/Storage/Data/bam/prep2.bam /media/Storage/Data/bam/prep3.bam -o prep2-3_mpileup.vcf &       
 bcftools call -mv prep2-3_mpileup.vcf > Only_variations.vcf &
 bcftools filter -s LowQual -e '%QUAL<20 || DP<100' Only_variations.vcf > Qualt20DP100_flt.vcf
 grep -P "\tPASS\t" Qualt20DP100_flt.vcf > Qualt20DP100_flt.vcf.passed

I hope it helps.

Cheers, Rahul

ADD COMMENT
0
Entering edit mode

Yes, but I think OP is interested in indels on the read level, also if those aren't reported by a variant caller.

ADD REPLY
0
Entering edit mode
7.4 years ago

The BBMap package can produce a histogram of such events, like this:

reformat.sh in=mapped.sam indelhist=indelhist.txt

That will show how many insertions and deletions there are of any specific length. You can also produce this directly from BBMap during mapping.

ADD COMMENT
0
Entering edit mode
7.4 years ago

using bioalcidae: http://lindenb.github.io/jvarkit/BioAlcidae.html with the following script:

while(iter.hasNext()) { 
    var rec=iter.next();
    if(rec.getReadUnmappedFlag()) continue;
    var ref= rec.getAlignmentStart();
    var cigarElements =  rec.getCigar().getCigarElements();
    for(var i=0;i< cigarElements.size();i++)
        {
        var ce = cigarElements.get(i);
        var op = ce.getOperator();
        if( op.isIndelOrSkippedRegion()  && ce.getLength()<2)
            {
            out.println(rec.getContig()+"\t"+ref+"\t"+op.name()+"\t"+ce.getLength());
            }
        if( op.consumesReferenceBases())
            {
            ref += ce.getLength();
            }
        }
    }

and pipe in sort | uniq:

 java -jar dist/bioalcidae.jar -f script.js in.bam | sort | uniq
ADD COMMENT

Login before adding your answer.

Traffic: 1831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6