8 months ago
bioinfo
Hello,
I am aligning my data with kallisto to a reference transcriptome and then assigning gene counts using tximport and biomart. I am trying to understand how kallisto and tximport handle multimapped reads.
Does it discard the multimapped reads or does it add fractions of counts to the transcripts?
Thank you
I don't think tximport has anything to do with reads whatsoever. Relevant GitHub issue: https://github.com/pachterlab/kallistobustools/issues/15
Thank you for the link. I had found this before, but I am not sure whether kallisto and kallisto bustools handle multimapped reads the same way. Do you know if they do?
Are you using kallisto for 10x Genomics single-cell RNA-seq? I ask because you mentioned "bustools". If so, kallisto (w/ bustools), by default, discards all reads that map to more than one gene (this is the same approach taken by other software like Cell Ranger).
If you're using tximport, that means you're interested in bulk RNA-seq, in which case kallisto does indeed perform fractional count assignment using an EM algorithm (as ATpoint mentioned). I can go into it more if you're interested.
I don't think they are dealing with scRNA-seq data. I picked a slightly off-topic issue accidentally.
Thank you so much for replying. I am using kallisto (without bustools) for bulk RNA seq. Would you mind explaining more how kallisto does the fractional count assignment?
Let's say you have exactly 4 reads in your dataset: All four reads map to transcript A while some of the reads also map to transcripts B and/or C.
When you run the EM algorithm, transcript A will get the most "fractional counts", while transcripts B and C will still get some (but far fewer). This is because the EM algorithm gives you probability estimates (i.e., the probability of selecting a read from tx A, from tx B, from tx C, etc.), and those probability estimates (which sum to 1) are multiplied by the number of mapped reads in your dataset. Remember, kallisto is a probabilistic algorithm; it's doing something a bit more intelligent than simply dividing up the counts evenly. See this link for more explanation or pages 16-17 of this paper.
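To make the idea concrete, here is a minimal toy sketch of that kind of EM loop in Python. It is not kallisto's actual code. The compatibility sets below are an assumption (the example above only says that all four reads map to A and some also map to B and/or C), the function name `em_fractional_counts` is hypothetical, and transcript effective lengths, which a real quantifier accounts for, are ignored here.

```python
def em_fractional_counts(read_compat, transcripts, n_iter=5):
    """Toy EM: split each read across its compatible transcripts.

    Assumes all transcripts have equal effective length (a simplification).
    """
    n_reads = len(read_compat)
    # Start from a uniform guess for the transcript probabilities.
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: assign each read fractionally, in proportion to the
        # current probabilities of the transcripts it is compatible with.
        counts = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(alpha[t] for t in compat)
            for t in compat:
                counts[t] += alpha[t] / total
        # M-step: re-estimate probabilities from the fractional counts.
        alpha = {t: counts[t] / n_reads for t in transcripts}
    # Estimated counts = probabilities (summing to 1) times mapped reads.
    return {t: alpha[t] * n_reads for t in transcripts}


# Assumed compatibility sets for the four reads (illustrative only).
reads = [{"A"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C"}]
counts = em_fractional_counts(reads, ["A", "B", "C"])
print(counts)
```

In this toy run A ends up with most of the four counts and B and C share the small remainder. The exact split depends on the compatibility pattern, the effective lengths (ignored here), and how many iterations are run.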
dsull is an active kallisto developer and might give you a good explanation of how exactly the EM algorithm within kallisto works. As for tximport, it does not know about "reads". What it does is take the transcript-level counts and sum them to the gene level. It takes whatever the upstream quantifier/aligner gives it in terms of counts (including however multimappers were handled).
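The gene-level summation step can be illustrated with a short Python sketch. This is not tximport's API (tximport is an R package). The transcript names, the counts, the `tx2gene` mapping, and the function name `sum_to_gene_level` are all made up for illustration.

```python
from collections import defaultdict


def sum_to_gene_level(tx_counts, tx2gene):
    """Sum transcript-level (possibly fractional) counts per gene."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


# Hypothetical fractional counts as a quantifier might report them.
tx_counts = {"txA": 3.2, "txB": 0.5, "txC": 0.3}
# Hypothetical transcript-to-gene mapping (e.g. derived from an annotation).
tx2gene = {"txA": "gene1", "txB": "gene1", "txC": "gene2"}

gene_counts = sum_to_gene_level(tx_counts, tx2gene)
print(gene_counts)
```

The point is that the fractional assignment has already happened upstream. This step only adds up whatever transcript-level numbers it receives.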