Paralogous genes (without introns) stealing mapping in splice mode #96
Description
We've been looking at the recent mapping results of a MinION cDNA run, and discovered a few genes that were curiously absent from the minimap2 output. These genes were highly expressed (according to STAR) in Illumina reads from the same cell line.
I dug into the details of one of these genes (Eno1), and discovered that it has an intron-less paralog (Eno1b, with >99% identity and coverage) in the Mus musculus genome. Mapping a MinION read separately to each region produced a match in both cases, but only one match was returned when the two regions were combined in a multi-FASTA file. The issue persists in a minimal test case containing just the two genes of interest, but disappears when I remove the -p 10 option from the minimap2 command.
While I don't expect minimap2 to prioritise the spliced mapping over the non-spliced mapping, it'd be nice if it reported the spliced mapping at least as a secondary mapping.
Here's a quick console trace to demonstrate the issue:
$ ~/scripts/fastx-length.pl ref_Eno1.fasta
12159 Eno1_chr4 chr4:150236721..150248879 (+ strand) class=gene length=12159
Total sequences: 1
Total length: 12.159 kb
Longest sequence: 12.159 kb
Shortest sequence: 12.159 kb
Mean Length: 12.159 kb
Median Length: 12.159 kb
N10: 1 sequences; L10: 12.159 kb
N50: 1 sequences; L50: 12.159 kb
N90: 1 sequences; L90: 12.159 kb
$ ~/scripts/fastx-length.pl ref_Eno1_Eno1b.fasta
12159 Eno1_chr4 chr4:150236721..150248879 (+ strand) class=gene length=12159
3044 Eno1b_chr18 chr18:48045335..48048378 (+ strand) class=gene length=3044
Total sequences: 2
Total length: 15.203 kb
Longest sequence: 12.159 kb
Shortest sequence: 3.044 kb
Mean Length: 7.601 kb
Median Length: 12.159 kb
N10: 1 sequences; L10: 12.159 kb
N50: 1 sequences; L50: 12.159 kb
N90: 2 sequences; L90: 3.044 kb
$ ~/install/minimap2/minimap2 -p 10 -x splice ~/db/fasta/mmus/ucsc/mmus_ucsc_all.idx MinION_read_ENO1.fa
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::11.306*1.00] loaded/built the index for 22 target sequence(s)
[M::mm_mapopt_update::14.437*1.00] mid_occ = 596
[M::mm_idx_stat] kmer size: 15; skip: 10; is_HPC: 0; #seq: 22
[M::mm_idx_stat::16.115*1.00] distinct minimizers: 97764656 (39.86% are singletons); average occurrences: 5.130; average spacing: 5.435
87531090-7cc9-45d5-b6c0-16726871e5b8 1888 124 1797 + chr18 90702639 48046669 48048362 1555 1693 8 tp:A:P cm:i:261 s1:i:1555 s2:i:1508
[M::worker_pipeline::16.116*1.00] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -p 10 -x splice /home/gringer/db/fasta/mmus/ucsc/mmus_ucsc_all.idx MinION_read_ENO1.fa
[M::main] Real time: 16.634 sec; CPU: 16.632 sec
$ ~/install/minimap2/minimap2 -p 10 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa
[M::main::0.004*0.90] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.005*0.81] mid_occ = 43
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 2
[M::mm_idx_stat::0.005*0.75] distinct minimizers: 4534 (88.82% are singletons); average occurrences: 1.127; average spacing: 2.976
87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1b_chr18 3044 1334 3028 1607 1694 22 tp:A:P cm:i:485 s1:i:1607 s2:i:1484
[M::worker_pipeline::0.007*1.21] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -p 10 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.008 sec; CPU: 0.008 sec
$ ~/install/minimap2/minimap2 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa
[M::main::0.005*0.85] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.005*0.77] mid_occ = 43
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 2
[M::mm_idx_stat::0.006*0.72] distinct minimizers: 4534 (88.82% are singletons); average occurrences: 1.127; average spacing: 2.976
87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1b_chr18 3044 1334 3028 1607 1694 22 tp:A:P cm:i:485 s1:i:1607 s2:i:1484
87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1_chr4 12159 541 12136 1580 11595 0 tp:A:S cm:i:444 s1:i:1484
[M::worker_pipeline::0.007*1.13] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.008 sec; CPU: 0.008 sec
$ ~/install/minimap2/minimap2 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa
[M::main::0.004*1.14] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.004*1.03] mid_occ = 22
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 1
[M::mm_idx_stat::0.004*0.96] distinct minimizers: 4025 (99.75% are singletons); average occurrences: 1.009; average spacing: 2.993
87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1_chr4 12159 541 12136 1580 11595 60 tp:A:P cm:i:444 s1:i:1484 s2:i:0
[M::worker_pipeline::0.005*0.80] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.006 sec; CPU: 0.004 sec
$ ~/install/minimap2/minimap2 -p 10 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa
[M::main::0.006*0.73] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.006*0.66] mid_occ = 22
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 1
[M::mm_idx_stat::0.006*1.23] distinct minimizers: 4025 (99.75% are singletons); average occurrences: 1.009; average spacing: 2.993
87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1_chr4 12159 541 12136 1580 11595 60 tp:A:P cm:i:444 s1:i:1484 s2:i:0
[M::worker_pipeline::0.008*1.03] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -p 10 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.010 sec; CPU: 0.012 sec
I've attached the example sequences to demonstrate this issue.
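For anyone comparing the records in the trace above, here is a rough Python sketch that pulls the scores out of the two PAF lines from the run without -p 10 and computes the secondary-to-primary score ratio. It assumes that -p is the minimal secondary-to-primary score ratio (default 0.8) described in the minimap2 manual, and that the s1 tag is the score being compared; `parse_paf` and `passes_ratio` are hypothetical helper names, and the comparison is only my reading of the option, not minimap2's actual code.

```python
def parse_paf(line):
    """Split one PAF record into a few mandatory columns plus its optional tags."""
    fields = line.split()
    tags = {}
    for tag in fields[12:]:
        key, typ, value = tag.split(":", 2)
        tags[key] = int(value) if typ == "i" else value
    return {
        "target": fields[5],
        # Target end minus target start: includes introns for a spliced hit.
        "target_span": int(fields[8]) - int(fields[7]),
        "mapq": int(fields[11]),
        "tags": tags,
    }


def passes_ratio(primary_score, secondary_score, p):
    """Assumed rule: keep a secondary if its score is at least p times the primary's."""
    return secondary_score >= p * primary_score


# Both records are copied from the two-sequence run without -p 10.
PRIMARY = ("87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1b_chr18 "
           "3044 1334 3028 1607 1694 22 tp:A:P cm:i:485 s1:i:1607 s2:i:1484")
SECONDARY = ("87531090-7cc9-45d5-b6c0-16726871e5b8 1888 123 1797 + Eno1_chr4 "
             "12159 541 12136 1580 11595 0 tp:A:S cm:i:444 s1:i:1484")

pri = parse_paf(PRIMARY)
sec = parse_paf(SECONDARY)
ratio = sec["tags"]["s1"] / pri["tags"]["s1"]

print(pri["target"], pri["target_span"], sec["target"], sec["target_span"])
print("secondary/primary score ratio: %.3f" % ratio)
print("kept at -p 0.8:", passes_ratio(pri["tags"]["s1"], sec["tags"]["s1"], 0.8))
print("kept at -p 10: ", passes_ratio(pri["tags"]["s1"], sec["tags"]["s1"], 10))
```

If that reading of -p is right, the spliced Eno1 hit (score 1484 against 1607 for Eno1b) clears the default threshold but could never clear a ratio of 10, which would be consistent with it vanishing from the -p 10 runs.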