Paralogous genes (without introns) stealing mapping in splice mode #96

Closed
@gringer

Description

We've been looking at the recent mapping results of a MinION cDNA run, and discovered a few genes that were curiously absent from the minimap2-generated files, even though they were highly expressed (according to STAR) in Illumina reads from the same cell line.

I dug into the details of one of these genes (Eno1), and discovered that it has an intron-less paralog (Eno1b, with >99% identity and coverage) in the Mus musculus genome. Mapping a MinION read separately to each region produced a match in both cases, but only one match was returned when the two regions were combined in a multi-FASTA file. The issue remains when reducing to a minimal test case containing just the two genes of interest, but disappears when I remove the -p 10 option from the minimap2 command line.
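The two hits can be told apart from the PAF coordinates alone, which may help when checking other genes for the same problem. The following is a minimal sketch, not part of minimap2: it uses the first 12 columns of the two PAF records from the console trace below, and the helper names `paf_spans` and `looks_spliced`, as well as the factor of 2, are hypothetical choices made only for illustration.

```python
# First 12 PAF columns of the two hits reported for the two-reference index.
CHR18_HIT = ("87531090-7cc9-45d5-b6c0-16726871e5b8\t1888\t123\t1797\t+\t"
             "Eno1b_chr18\t3044\t1334\t3028\t1607\t1694\t22")
CHR4_HIT = ("87531090-7cc9-45d5-b6c0-16726871e5b8\t1888\t123\t1797\t+\t"
            "Eno1_chr4\t12159\t541\t12136\t1580\t11595\t0")


def paf_spans(line):
    """Return (target name, query span, target span) for one PAF record."""
    f = line.rstrip("\n").split("\t")
    query_span = int(f[3]) - int(f[2])    # PAF columns 4 and 3: query end, start
    target_span = int(f[8]) - int(f[7])   # PAF columns 9 and 8: target end, start
    return f[5], query_span, target_span


def looks_spliced(query_span, target_span, factor=2.0):
    """Heuristic (assumed threshold): a target span much longer than the
    query span suggests the alignment jumps over introns."""
    return target_span > factor * query_span


for hit in (CHR18_HIT, CHR4_HIT):
    name, q, t = paf_spans(hit)
    print(name, q, t, looks_spliced(q, t))
# Eno1b_chr18 1674 1694 False
# Eno1_chr4 1674 11595 True
```

The same 1674 bases of the read cover about 1.7 kb on the intron-less paralog and about 11.6 kb on the Eno1 gene, which is consistent with the chr4 hit being the spliced alignment.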

While I don't expect minimap2 to prioritise the spliced mapping over the non-spliced mapping, it'd be nice if it included the spliced mapping as at least a secondary mapping.

Here's a quick console trace to demonstrate the issue:

$ ~/scripts/fastx-length.pl ref_Eno1.fasta
12159 Eno1_chr4 chr4:150236721..150248879 (+ strand) class=gene length=12159
Total sequences: 1
Total length: 12.159 kb
Longest sequence: 12.159 kb
Shortest sequence: 12.159 kb
Mean Length: 12.159 kb
Median Length: 12.159 kb
N10: 1 sequences; L10: 12.159 kb
N50: 1 sequences; L50: 12.159 kb
N90: 1 sequences; L90: 12.159 kb
$ ~/scripts/fastx-length.pl ref_Eno1_Eno1b.fasta
12159 Eno1_chr4 chr4:150236721..150248879 (+ strand) class=gene length=12159
3044 Eno1b_chr18 chr18:48045335..48048378 (+ strand) class=gene length=3044
Total sequences: 2
Total length: 15.203 kb
Longest sequence: 12.159 kb
Shortest sequence: 3.044 kb
Mean Length: 7.601 kb
Median Length: 12.159 kb
N10: 1 sequences; L10: 12.159 kb
N50: 1 sequences; L50: 12.159 kb
N90: 2 sequences; L90: 3.044 kb

$ ~/install/minimap2/minimap2 -p 10 -x splice ~/db/fasta/mmus/ucsc/mmus_ucsc_all.idx MinION_read_ENO1.fa 
[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.
[M::main::11.306*1.00] loaded/built the index for 22 target sequence(s)
[M::mm_mapopt_update::14.437*1.00] mid_occ = 596
[M::mm_idx_stat] kmer size: 15; skip: 10; is_HPC: 0; #seq: 22
[M::mm_idx_stat::16.115*1.00] distinct minimizers: 97764656 (39.86% are singletons); average occurrences: 5.130; average spacing: 5.435
87531090-7cc9-45d5-b6c0-16726871e5b8	1888	124	1797	+	chr18	90702639	48046669	48048362	1555	1693	8	tp:A:P	cm:i:261	s1:i:1555	s2:i:1508
[M::worker_pipeline::16.116*1.00] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -p 10 -x splice /home/gringer/db/fasta/mmus/ucsc/mmus_ucsc_all.idx MinION_read_ENO1.fa
[M::main] Real time: 16.634 sec; CPU: 16.632 sec

$ ~/install/minimap2/minimap2 -p 10 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa 
[M::main::0.004*0.90] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.005*0.81] mid_occ = 43
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 2
[M::mm_idx_stat::0.005*0.75] distinct minimizers: 4534 (88.82% are singletons); average occurrences: 1.127; average spacing: 2.976
87531090-7cc9-45d5-b6c0-16726871e5b8	1888	123	1797	+	Eno1b_chr18	3044	1334	3028	1607	1694	22	tp:A:P	cm:i:485	s1:i:1607	s2:i:1484
[M::worker_pipeline::0.007*1.21] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -p 10 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.008 sec; CPU: 0.008 sec

$ ~/install/minimap2/minimap2 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa 
[M::main::0.005*0.85] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.005*0.77] mid_occ = 43
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 2
[M::mm_idx_stat::0.006*0.72] distinct minimizers: 4534 (88.82% are singletons); average occurrences: 1.127; average spacing: 2.976
87531090-7cc9-45d5-b6c0-16726871e5b8	1888	123	1797	+	Eno1b_chr18	3044	1334	3028	1607	1694	22	tp:A:P	cm:i:485	s1:i:1607	s2:i:1484
87531090-7cc9-45d5-b6c0-16726871e5b8	1888	123	1797	+	Eno1_chr4	12159	541	12136	1580	11595	0	tp:A:S	cm:i:444	s1:i:1484
[M::worker_pipeline::0.007*1.13] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -x splice ref_Eno1_Eno1b.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.008 sec; CPU: 0.008 sec

$ ~/install/minimap2/minimap2 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa 
[M::main::0.004*1.14] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.004*1.03] mid_occ = 22
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 1
[M::mm_idx_stat::0.004*0.96] distinct minimizers: 4025 (99.75% are singletons); average occurrences: 1.009; average spacing: 2.993
87531090-7cc9-45d5-b6c0-16726871e5b8	1888	123	1797	+	Eno1_chr4	12159	541	12136	1580	11595	60	tp:A:P	cm:i:444	s1:i:1484	s2:i:0
[M::worker_pipeline::0.005*0.80] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.006 sec; CPU: 0.004 sec

$ ~/install/minimap2/minimap2 -p 10 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa 
[M::main::0.006*0.73] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.006*0.66] mid_occ = 22
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 1
[M::mm_idx_stat::0.006*1.23] distinct minimizers: 4025 (99.75% are singletons); average occurrences: 1.009; average spacing: 2.993
87531090-7cc9-45d5-b6c0-16726871e5b8	1888	123	1797	+	Eno1_chr4	12159	541	12136	1580	11595	60	tp:A:P	cm:i:444	s1:i:1484	s2:i:0
[M::worker_pipeline::0.008*1.03] mapped 1 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -p 10 -x splice ref_Eno1.fasta.idx MinION_read_ENO1.fa
[M::main] Real time: 0.010 sec; CPU: 0.012 sec
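One possible reading of the traces above: the minimap2 manual describes -p as the minimum secondary-to-primary score ratio (default 0.8), so a value of 10 would be a threshold that no secondary chain can reach. The sketch below checks this against the s1/s2 tags from the two-reference run. It is a simplified model under that assumption, not minimap2's actual selection code, which applies further conditions (for example the cap on the number of secondaries); the helper names `paf_tags` and `secondary_kept` are hypothetical.

```python
# Optional tags of the primary (Eno1b_chr18) record from the two-reference run.
PRIMARY_TAGS = "tp:A:P\tcm:i:485\ts1:i:1607\ts2:i:1484"


def paf_tags(tag_string):
    """Parse tab-separated SAM-style tags (KEY:TYPE:VALUE) into a dict."""
    tags = {}
    for tag in tag_string.split("\t"):
        key, typ, value = tag.split(":", 2)
        tags[key] = int(value) if typ == "i" else value
    return tags


def secondary_kept(primary_score, secondary_score, min_ratio):
    """Simplified filter (assumption): keep a secondary chain only if its
    score is at least min_ratio times the primary chaining score."""
    return secondary_score >= min_ratio * primary_score


tags = paf_tags(PRIMARY_TAGS)
s1, s2 = tags["s1"], tags["s2"]
print(round(s2 / s1, 3))             # 0.923
print(secondary_kept(s1, s2, 0.8))   # True  (default ratio: chr4 hit reported)
print(secondary_kept(s1, s2, 10))    # False (-p 10: chr4 hit dropped)
```

Under this model the spliced Eno1_chr4 chain (score 1484) sits at about 92% of the primary score, which clears the default ratio but can never clear a ratio of 10, matching the difference between the runs with and without -p 10.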

I've attached the example sequences to demonstrate this issue.

Eno1_example_ref_reads.zip
