Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug (?): unexpected results with enforced interval size greater than default #162

Open
eboyden opened this issue Oct 15, 2022 · 4 comments

Comments

@eboyden
Copy link

eboyden commented Oct 15, 2022

When I aligned a pair of fastqs using BWA-MEM2, Bowtie2, or SNAP, and an enforced interval size of 0-500, I got similar results. When I aligned with BWA-MEM2 or Bowtie2 and an enforced interval size of 50-500, the results didn't change much; very few on-target alignments had intervals <50 bp. But when I aligned with SNAP and an enforced (using -fs) interval size of 50-500, many fewer reads aligned. Furthermore, here is an example of a read pair that no longer aligned with a minimum interval of 50 bp, despite the fact that it's a 120 bp properly paired interval:

MN01688:12:000H3MG5N:1:23106:10622:10676        99      chr1    215878739       70      119M    =       215878739       120     AAAAATCATAGTCACCTTCTCTTACCTCAAATTAGGTCCATTTGGCTTGGATGGTGGTTGCCAAGAAATCACAACATATGATTCACTTAGTGGAATCACAGACAATGGGCCAACATTCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF LB:Z:lb        MC:Z:120M       MD:Z:119        PG:Z:SNAP       RG:Z:rg   PL:Z:ILLUMINA   NM:i:0  SM:Z:sm        MQ:i:70 UQ:i:0  QS:i:4389       PU:Z:pu   ms:i:4389
MN01688:12:000H3MG5N:1:23106:10622:10676        147     chr1    215878739       70      120M    =       215878739       -120    AAAAATCATAGTCACCTTCTCTTACCTCAAATTAGGTCCATTTGGCTTGGATGGTGGTTGCCAAGAAATCACAACATATGATTCACTTAGTGGAATCACAGACAATGGGCCAACATTCTG        FFFFFFFFFFFFFFFFFFFF=FFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        LB:Z:lb        MC:Z:119M       MD:Z:120        PG:Z:SNAP       RG:Z:rg   PL:Z:ILLUMINA   NM:i:0  SM:Z:sm        MQ:i:70 UQ:i:0  QS:i:4388       PU:Z:pu   ms:i:4388

I also recapitulated this on the NA12878 data NIST7035_TAAGGCGA; the alignment rate fell from 97% to 59%. And it only seems to happen with the -fs option.

samtools flagstat reports:
-s 0 500:

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
3114693 + 0 duplicates
3114693 + 0 primary duplicates
39079698 + 0 mapped (98.36% : N/A)
39079698 + 0 primary mapped (98.36% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
38521349 + 0 properly paired (96.96% : N/A)
38615594 + 0 with itself and mate mapped
464104 + 0 singletons (1.17% : N/A)
19728 + 0 with mate mapped to a different chr
14333 + 0 with mate mapped to a different chr (mapQ>=5)

-s 0 500 -fs (similar results):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2899310 + 0 duplicates
2899310 + 0 primary duplicates
38627431 + 0 mapped (97.22% : N/A)
38627431 + 0 primary mapped (97.22% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
38533349 + 0 properly paired (96.99% : N/A)
38556972 + 0 with itself and mate mapped
70459 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

-s 50 500 (similar results):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
3144369 + 0 duplicates
3144369 + 0 primary duplicates
39074551 + 0 mapped (98.35% : N/A)
39074551 + 0 primary mapped (98.35% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
23181753 + 0 properly paired (58.35% : N/A)
38607952 + 0 with itself and mate mapped
466599 + 0 singletons (1.17% : N/A)
333828 + 0 with mate mapped to a different chr
47673 + 0 with mate mapped to a different chr (mapQ>=5)

-s 50 500 -fs (abnormally low alignment rate):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
1658040 + 0 duplicates
1658040 + 0 primary duplicates
23583000 + 0 mapped (59.36% : N/A)
23583000 + 0 primary mapped (59.36% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
23344882 + 0 properly paired (58.76% : N/A)
23512538 + 0 with itself and mate mapped
70462 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

For comparison:
Bowtie2 --very-fast-local -I 0 --no-discordant (default):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2509954 + 0 duplicates
2509954 + 0 primary duplicates
39685464 + 0 mapped (99.89% : N/A)
39685464 + 0 primary mapped (99.89% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
39514079 + 0 properly paired (99.45% : N/A)
39653236 + 0 with itself and mate mapped
32228 + 0 singletons (0.08% : N/A)
13340 + 0 with mate mapped to a different chr
8318 + 0 with mate mapped to a different chr (mapQ>=5)

Bowtie2 --very-fast-local -I 50 --no-discordant:

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2510277 + 0 duplicates
2510277 + 0 primary duplicates
39685267 + 0 mapped (99.88% : N/A)
39685267 + 0 primary mapped (99.88% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
39382600 + 0 properly paired (99.12% : N/A)
39652842 + 0 with itself and mate mapped
32425 + 0 singletons (0.08% : N/A)
19408 + 0 with mate mapped to a different chr
8593 + 0 with mate mapped to a different chr (mapQ>=5)
@eboyden
Copy link
Author

eboyden commented Jan 12, 2023

Reviewing this a few months later, I noticed that although the alignment rate with -s 50 500 is high (98.35%), the "properly paired" rate is low (only 58.35%), which suggests that nearly half of the alignments have short intervals. So adding -fs to the command is simply preventing those "improperly paired" alignments from occurring. So the problem seems to be with the minimum insert size, not the -fs option.

@bolosky
Copy link
Contributor

bolosky commented Jan 12, 2023

I wonder if SNAP computes the interval differently than the other aligners. I'll take a look at it and see what's up.

@eboyden
Copy link
Author

eboyden commented Jan 12, 2023

Yeah I think you're right - I just took another look at alignments with either -s 0 500 -fs or -s 50 500 -fs, and in the latter case not only did the number of alignments with intervals >=50 sharply decrease, I also still observed plenty of alignments with intervals <50. Note that I'm basing interval size on column 9 of the sam file (TLEN).

@bolosky
Copy link
Contributor

bolosky commented Jan 12, 2023

I'll figure it out and try to harmonize what we're doing with the other aligners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants