pay attention to skani accuracy below 85% ANI #199

jianshu93 · 2024-01-19T04:03:46Z

Dear CoverM team,

I suggested skani some time ago and am glad to see it included in the newest version. However, you may notice that its estimates are inaccurate below 85% ANI, whereas the newest version of fastANI is almost as accurate as blastn-based ANI. See my blog here: https://jianshu93.github.io/blog/ANI-calculator/

By default, I would suggest using skani only for clustering thresholds above 90% ANI and fastANI for thresholds below 90%, rather than leaving the choice between fastANI and skani to users. For example, if I want to dereplicate at 85% ANI to choose genus-level representatives, fastANI is clearly a better option than skani. Does that make sense? Also, if a paper on CoverM is in preparation, I would be happy to contribute the benchmark results I have been working on.
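As a rough illustration of the rule proposed above, the default could be chosen from the requested clustering threshold. This is only a sketch: the function name `pick_ani_tool` and the `cutoff` parameter are hypothetical and are not part of CoverM, and treating exactly 90% as skani territory is an assumption, since the suggestion only covers "above" and "below" 90%.

```python
def pick_ani_tool(threshold_percent: float, cutoff: float = 90.0) -> str:
    """Pick a default ANI calculator for a dereplication threshold.

    Hypothetical helper, not CoverM code. Thresholds at or above
    `cutoff` use skani; lower thresholds fall back to fastANI.
    """
    if not 0.0 < threshold_percent <= 100.0:
        raise ValueError("threshold must be a percentage in (0, 100]")
    # Assumption: exactly `cutoff` is handled by skani.
    return "skani" if threshold_percent >= cutoff else "fastANI"


if __name__ == "__main__":
    for ani in (95.0, 90.0, 85.0):
        print(ani, pick_ani_tool(ani))
```

With this rule, the 85% genus-level example would be routed to fastANI, while species-level thresholds such as 95% would keep using skani.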

Jianshu

wwood · 2024-01-27T22:54:25Z

Thanks @jianshu93. I suppose you would also suggest requiring FastANI 1.34 and using the -correct flag?

@AroneyS it seems the docs are wrong - https://wwood.github.io/CoverM/coverm-genome.html#dereplication-genome-clustering still talks about FastANI - can you make sure skani is being used by default and the docs reflect this please?

jianshu93 · 2024-01-28T06:01:20Z

Hello @wwood,

Yes, I would suggest so, since the corrected estimate is closer to the actual alignment-based ANI. We will update the bioconda channel for the newest --correct option soon.

Jianshu

jianshu93 · 2024-02-20T03:16:38Z

Hello @wwood, the pre-clustering default is now skani, which is very dangerous because skani outputs 0 below 82% ANI. I would suggest using the finch version of MinHash (essentially the same as Mash, without over-sketching). Above 90% ANI, skani is as good as fastANI, so I think it is OK to use there.

For pre-clustering, I would suggest BinDash (https://github.com/zhaoxiaofei/bindash); I am developing BinDash version 2 with Xiaofei. BinDash is as accurate as Mash but 100 to 1000 times faster than Mash, Dashing, or finch, thanks to a theoretical breakthrough called b-bit one-permutation MinHash with optimal/faster densification. BinDash can also be easily installed via bioconda. Theoretically, Dashing has the largest variance, then Mash, and BinDash has the smallest; it also has the locality-sensitive hashing property, which neither of the others has.

This idea is also implemented in my software GSearch, for search and classification of genomes (to be published soon). See GSearch here (https://gitlab.com/Jianshu_Zhao/gsearch); it can be installed via bioconda. Let me know if you also want to include GSearch in CoverM, to classify genomes with extreme speed (almost in seconds).
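For readers unfamiliar with MinHash-based pre-clustering, here is a minimal sketch of the general idea: a bottom-s sketch of k-mer hashes, a Jaccard estimate from two sketches, and the Mash distance formula `D = -ln(2j / (1 + j)) / k`, with ANI approximated as `1 - D`. It is a toy in pure Python, not the b-bit one-permutation scheme that BinDash uses and not the code of finch or Mash. The function names are hypothetical, and canonical (reverse-complement) k-mers are skipped for brevity.

```python
import hashlib
import math


def bottom_sketch(seq: str, k: int = 21, size: int = 1000) -> list[int]:
    """Return the `size` smallest 64-bit hashes of the k-mers in `seq`."""
    seq = seq.upper()
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a: list[int], b: list[int], size: int = 1000) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:size]  # bottom-s of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_ani(jaccard: float, k: int = 21) -> float:
    """Convert a Jaccard estimate to approximate ANI via the Mash distance."""
    if jaccard <= 0.0:
        return 0.0  # no shared k-mers: the estimate is undefined, report 0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    x = "ACGT" * 50
    y = "AACC" * 50
    print(mash_ani(jaccard_estimate(bottom_sketch(x), bottom_sketch(x))))
    print(mash_ani(jaccard_estimate(bottom_sketch(x), bottom_sketch(y))))
```

Any such estimator loses resolution once two genomes share very few k-mers, so a pre-clustering cutoff is best kept loose and followed by a more accurate ANI method on the candidate pairs.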

Thanks,

Jianshu

jianshu93 · 2024-07-28T17:39:55Z

Hi Team,

GSearch is out! It can search GTDB or the entire RefSeq/IMG_VR in almost no time, and the running time does not grow with an increasing number of genomes! https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae609/7714450

I believe it will solve all the problems all other search and classification tools are having!

Thanks,

Jianshu
