My data: RNA-seq from single-embryo CEL-Seq (3'-biased data); 35 bp single-end reads; total reads: 361K. Annotation: I have two transcriptome assemblies and no genome information.
Aligners and number of reads aligned to each transcriptome:
#Aligner: Transcriptome-1, Transcriptome-2
#Bowtie2 default: 54K, 41K
#Hisat2 default: 47K, 34K
#Kallisto, index -k 31: 7K, 17K (my usual default setting)
#Kallisto, index -k 21: 17K, 30K
#Kallisto, index -k 15: 102K, 100K
#Kallisto, index -k 7: 118K, 102K
#Kallisto --single-overhang, index -k 31: 40K, 30K
#Kallisto --single-overhang, index -k 21: 77K, 64K
#Kallisto --single-overhang, index -k 15: 154K, 128K
#Kallisto --single-overhang, index -k 7: 128K, 109K
With my usual default kallisto setting, the alignment rate was poor. Then I realized that my data has 3' bias and short reads. So I tried different k-mer lengths (21, 15, 7) for index creation to account for the short read length, and enabled --single-overhang to account for the 3' bias. I am not sure what a good setting would be. Any suggestions are welcome. Note: The sample has a lot of spike-in reads. The publication that used the Transcriptome-1 assembly reported only 16K reads aligned to Transcriptome-1, 173K reads aligned to spike-ins, and 156K reads with no alignment (using bowtie2).
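To put the counts above on a common scale, here is a minimal Python sketch that converts them to percentages of the 361K total reads. The dictionary only restates the numbers from the post; the names `mapping_rate` and `rate_table` are illustrative and not part of any tool.

```python
# Total reads in the sample, as stated in the post (361K).
TOTAL_READS = 361_000

# Aligned read counts (Transcriptome-1, Transcriptome-2), restated from the post.
ALIGNED = {
    "Bowtie2 default": (54_000, 41_000),
    "Hisat2 default": (47_000, 34_000),
    "Kallisto k=31": (7_000, 17_000),
    "Kallisto k=21": (17_000, 30_000),
    "Kallisto k=15": (102_000, 100_000),
    "Kallisto k=7": (118_000, 102_000),
    "Kallisto --single-overhang k=31": (40_000, 30_000),
    "Kallisto --single-overhang k=21": (77_000, 64_000),
    "Kallisto --single-overhang k=15": (154_000, 128_000),
    "Kallisto --single-overhang k=7": (128_000, 109_000),
}


def mapping_rate(aligned, total=TOTAL_READS):
    """Return aligned reads as a percentage of total reads."""
    return 100.0 * aligned / total


def rate_table(aligned=ALIGNED, total=TOTAL_READS):
    """Return (setting, rate for Transcriptome-1, rate for Transcriptome-2) rows."""
    rows = []
    for setting, (t1, t2) in aligned.items():
        rows.append(
            (setting,
             round(mapping_rate(t1, total), 1),
             round(mapping_rate(t2, total), 1))
        )
    return rows


for setting, rate1, rate2 in rate_table():
    print(f"{setting:<34}{rate1:>6.1f}%{rate2:>7.1f}%")
```

For comparison, the 16K transcriptome-aligned reads reported in the publication would be about 4.4% of 361K.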
I doubt the k-mer length has too much of an impact; see my question from years back, "Salmon Quantification for RNA-seq Read Pairs with Different Lengths".
Yes, agreed.
"I doubt the k-mer length has too much of an impact; see my question from years back, 'Salmon Quantification for RNA-seq Read Pairs with Different Lengths'" - Any reasons why I might be getting a huge number of alignments for smaller k values? Could spike-in reads be aligning to transcript sequences?
Other discussions: https://www.reddit.com/r/bioinformatics/comments/1edjrvr/kallisto_effect_of_kmer_size_on_quantification/
Hard for me to comment on optimal k-mer length based on number of reads being mapped. You actually need to get the TPM quantifications and plot how well the different settings correlate (usually they should correlate pretty well).
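As a rough sketch of that comparison, the Python below reads two kallisto `abundance.tsv` files (which have `target_id` and `tpm` columns), computes the Pearson correlation of log2(TPM+1), and lists transcripts detected in only one run. The function names, the `min_tpm` threshold, and the example paths in the comments are assumptions for illustration.

```python
import csv
import math


def read_tpm(path):
    """Read a kallisto abundance.tsv into {target_id: tpm}."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["target_id"]: float(row["tpm"]) for row in reader}


def pearson(xs, ys):
    """Pearson correlation; returns NaN if it is undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def compare(tpm_a, tpm_b, min_tpm=1.0):
    """Compare two runs on their shared transcripts.

    Returns (r, only_a, only_b): the Pearson r of log2(TPM+1), plus the
    transcripts with TPM >= min_tpm in one run and exactly 0 in the other.
    """
    ids = sorted(set(tpm_a) & set(tpm_b))
    xs = [math.log2(tpm_a[i] + 1.0) for i in ids]
    ys = [math.log2(tpm_b[i] + 1.0) for i in ids]
    only_a = [i for i in ids if tpm_a[i] >= min_tpm and tpm_b[i] == 0.0]
    only_b = [i for i in ids if tpm_b[i] >= min_tpm and tpm_a[i] == 0.0]
    return pearson(xs, ys), only_a, only_b


# Hypothetical usage (paths are placeholders, not from the post):
# r, only_k31, only_k15 = compare(read_tpm("k31/abundance.tsv"),
#                                 read_tpm("k15/abundance.tsv"))
```

The `only_a` and `only_b` lists make it possible to inspect by name the transcripts that are quantified under one setting but not the other.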
Enabling --single-overhang is good.
From intuition, I think k=31 is fine; you’re going to miss a few things because a sequencing error may prevent an exact 31-bp match to somewhere in your index, but I don’t anticipate it being a big deal. You can go slightly lower if you need to recover more reads.
"You actually need to get the TPM quantifications and plot how well the different settings correlate" - I did compare log2(TPM+1) values for the different k-mer lengths with the --single-overhang option. They agree in general, but many transcripts receive alignments only at certain k-mer lengths (points hugging the axes).
"Enabling --single-overhang is good." - I agree.
"From intuition, I think k=31 is fine; you’re going to miss a few things because a sequencing error may prevent an exact 31-bp match to somewhere in your index, but I don’t anticipate it being a big deal. You can go slightly lower if you need to recover more reads." - I am planning to go with k=31 and --single-overhang.
Do you think a smaller k-mer length (k ~ 15 or less) somehow allows more multi-mapping? The number of reads I get aligned still seems too high.
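On the earlier point that a sequencing error may prevent an exact 31-bp match: a 35 bp read contains very few 31-mers, so a single mismatch can overlap all of them. The sketch below counts this under a simplified exact-match model. It does not reproduce kallisto's actual pseudoalignment logic, and the function names are made up for illustration.

```python
def kmers_per_read(read_len, k):
    """Number of k-mers in a read of length read_len (0 if the read is shorter than k)."""
    return max(read_len - k + 1, 0)


def positions_hitting_all_kmers(read_len, k):
    """Count read positions where one mismatch overlaps every k-mer of the read.

    k-mer i covers positions i .. i+k-1, so position p lies in all n k-mers
    when n-1 <= p <= k-1, which gives k - n + 1 positions (or none).
    """
    n = kmers_per_read(read_len, k)
    if n == 0:
        return 0
    return max(k - n + 1, 0)


READ_LEN = 35  # read length stated in the post
for k in (31, 21, 15, 7):
    print(k, kmers_per_read(READ_LEN, k), positions_hitting_all_kmers(READ_LEN, k))
```

For a 35 bp read this gives 5 k-mers at k=31, with 27 of the 35 positions overlapping all of them, versus 15 k-mers at k=21 with 7 such positions. At k=15 and k=7 no single position overlaps every k-mer.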
Oh, of course, smaller k will definitely allow more multimapping, especially since you are using such a small k (think about how few distinct k-mers are possible: only 4^k). Larger k is usually more reliable for that reason.
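The 4^k argument can be made concrete with a small Python sketch. The index size used here (50 million distinct k-mers) is a made-up placeholder, and the uniform-random sequence model is a simplifying assumption, not a description of how kallisto scores matches.

```python
def possible_kmers(k):
    """Number of distinct DNA k-mers of length k (4 bases per position)."""
    return 4 ** k


def chance_hit_probability(k, index_kmers):
    """Probability that a random k-mer is present in an index that holds
    index_kmers distinct k-mers, under a uniform-random model (capped at 1)."""
    return min(index_kmers / possible_kmers(k), 1.0)


# Hypothetical index size, for illustration only.
INDEX_KMERS = 50_000_000

for k in (7, 15, 21, 31):
    print(f"k={k:<3} possible={possible_kmers(k):<22} "
          f"chance hit={chance_hit_probability(k, INDEX_KMERS):.3g}")
```

With k=7 there are only 16,384 possible k-mers, so under this model essentially any read k-mer will be found somewhere in a transcriptome-sized index, whether or not the read truly derives from it.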
However, regardless, you MUST include the spike-ins in your index; otherwise you will get a bunch of false-positive mappings.
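One way to include the spike-ins is to concatenate the transcriptome FASTA and the spike-in FASTA before building the index. Below is a minimal sketch, assuming both are plain (uncompressed) FASTA files; the function name and the file names in the comments are hypothetical.

```python
def combine_fasta(inputs, output):
    """Concatenate FASTA files into one file.

    Raises ValueError if a sequence ID (first word of a header line) occurs
    more than once. Returns the number of records written.
    """
    seen = set()
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        seq_id = parts[0] if parts else ""
                        if seq_id in seen:
                            raise ValueError(f"duplicate sequence ID: {seq_id!r}")
                        seen.add(seq_id)
                    # Make sure a file's last line ends with a newline, so the
                    # next file's first header starts on its own line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(seen)


# Hypothetical usage (file names are placeholders, not from the post):
# combine_fasta(["transcriptome1.fa", "spikeins.fa"], "combined.fa")
# The combined file can then be indexed, for example with:
#   kallisto index -i combined.idx -k 31 combined.fa
```

The duplicate-ID check is there because a name clash between a transcript and a spike-in would make the resulting quantifications ambiguous.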
Cross-post: https://stackoverflow.com/questions/78801895/kallisto-effect-of-kmer-size-on-quantification.
People on Reddit are clueless. The following response (in the link you posted) is bizarre about pseudoalignment: “ is, no offense, kind of a fake method, supposedly to be fast and memory efficient but in reality alignment speed and memory requirements are just a tiny part of data analysis for nearly all applications and not worth sacrificing real alignments in favor this fake psuedo alignment method...”
Pseudoalignment is not fake (whatever “fake” is supposed to mean). It provides reliable quantifications, much of which cannot be directly obtained from alignment-based methods. I’m still debating whether I should write a response to that bizarre statement on Reddit (edit: I just did).
Thank you, dsull. I really appreciate your time and help. I didn’t mean to offend you or the other developers of kallisto. It is my favorite pseudoaligner and I use it regularly for my data analysis. I do not agree with that comment (or rather, I am not technically qualified to make that kind of comment). I just agreed on not using tools that are not suitable for your purpose at the cost of performance or other things. Again, I also don’t know what that person meant by that. And for my purpose kallisto is a great tool. I didn’t want to continue a discussion along those lines and wanted to finish it on a general comment. Thanks again for your help.