pay attention to skani accuracy below 85% ANI #199
Comments
Thanks @jianshu93. I suppose you would also suggest requiring FastANI 1.34 and using the -correct flag? @AroneyS it seems the docs are wrong - https://wwood.github.io/CoverM/coverm-genome.html#dereplication-genome-clustering still talks about FastANI - can you make sure skani is being used by default and the docs reflect this please?
Hello @wwood, yes, I would suggest so, since the corrected output is closer to the actual alignment-based ANI. We will update the bioconda channel for the newest --correct option soon. Jianshu
Hello @wwood, the pre-clustering default is now skani, which is very dangerous because skani outputs 0 below 82% ANI. I would suggest using the finch version of MinHash (essentially the same as Mash, without over-sketching). Above 90% ANI, skani is as good as fastANI, so I think it is OK to use there. For pre-clustering, I would suggest BinDash (https://github.com/zhaoxiaofei/bindash); I am developing BinDash version 2 with Xiaofei. BinDash is as accurate as Mash but 100 to 1000 times faster than Mash, dashing, and finch, thanks to a theoretical breakthrough called b-bit one-permutation MinHash with optimal/faster densification. BinDash can also be easily installed via bioconda. Theoretically, dashing has the largest variance, then Mash, and BinDash has the smallest; BinDash also has the locality-sensitive hashing property, which neither of the other two has. This idea is also implemented in my software GSearch, for search and classification of genomes (to be published soon). See GSearch here (https://gitlab.com/Jianshu_Zhao/gsearch); it can be installed via bioconda. Let me know if you want to also include GSearch in CoverM, to classify genomes with extreme speed (almost in seconds). Thanks, Jianshu
Hi Team, GSearch is out! It can search GTDB or the entire RefSeq/IMG_VR in almost no time, and the running time does not grow with an increasing number of genomes! https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae609/7714450 I believe it will solve the problems that other search and classification tools are having! Thanks, Jianshu
Dear CoverM team,
I remember I suggested skani some time ago, and I am glad to see you included it in the newest version. However, you might notice inaccurate estimates below 85% ANI, whereas the newest version of fastANI is almost as accurate as blastn-based ANI. See my blog here: https://jianshu93.github.io/blog/ANI-calculator/
I would suggest switching to skani only for clustering thresholds above 90% ANI and using fastANI by default for thresholds below 90%, rather than letting users choose between fastANI and skani. For example, if I want to dereplicate at 85% ANI to choose genus-level representatives, fastANI is clearly a better option than skani. Does that make sense? And if there is a paper on CoverM in preparation, I would be happy to contribute and provide the benchmark results that I have been working on.
Jianshu
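To make the suggested default concrete, here is a minimal Python sketch of picking the ANI tool from the clustering threshold. It is only an illustration and not CoverM code: the function name `choose_ani_method` and the `cutoff` parameter are hypothetical, and treating a threshold of exactly 90% as a skani case is an assumption, since the suggestion above only covers "above" and "below" 90%.

```python
def choose_ani_method(ani_threshold: float, cutoff: float = 90.0) -> str:
    """Pick an ANI tool name from the dereplication threshold (in percent).

    Hypothetical helper: returns "skani" at or above `cutoff` and
    "fastANI" below it. The tie-break at exactly `cutoff` is an assumption.
    """
    if not 0.0 < ani_threshold <= 100.0:
        raise ValueError("ani_threshold must be a percentage in (0, 100]")
    return "skani" if ani_threshold >= cutoff else "fastANI"


# Genus-level dereplication at 85% ANI falls below the cutoff.
genus_level_method = choose_ani_method(85.0)
# Species-level dereplication at 95% ANI is above the cutoff.
species_level_method = choose_ani_method(95.0)
```

With this rule, a user dereplicating at 85% ANI would get fastANI without having to know about the accuracy difference, while thresholds of 90% and above would keep the faster skani default.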