pay attention to skani accuracy below 85% ANI #199

Open
jianshu93 opened this issue Jan 19, 2024 · 4 comments

@jianshu93

jianshu93 commented Jan 19, 2024

Dear CoverM team,

I remember suggesting skani some time ago, and I am glad to see it included in the newest version. However, you may notice that its estimates are inaccurate below 85% ANI, whereas the newest version of fastANI is almost as accurate as blastn-based ANI. See my blog here: https://jianshu93.github.io/blog/ANI-calculator/

I would suggest using skani by default only for clustering thresholds above 90% ANI, and fastANI for thresholds below 90%, rather than leaving the choice between fastANI and skani to users. For example, if I want to dereplicate at 85% ANI to choose genus-level representatives, fastANI is clearly a better option than skani. Does that make sense? Also, if there is a paper on CoverM in preparation, I would be happy to contribute and provide the benchmark results from my own studies.
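
As a minimal sketch of the default proposed above (not CoverM code): the helper name `choose_ani_tool` is hypothetical, and routing a threshold of exactly 90% to skani is an assumption, since the suggestion only covers "above" and "below" 90%.

```python
def choose_ani_tool(threshold_percent: float) -> str:
    """Pick an ANI estimator name for a clustering threshold given in % ANI.

    Hypothetical helper, not part of CoverM. Assumption: a threshold of
    exactly 90% goes to skani; the suggestion only says "above" and "below".
    """
    if not 0.0 <= threshold_percent <= 100.0:
        raise ValueError("threshold must be a percentage between 0 and 100")
    # skani for high-identity clustering, fastANI for lower thresholds.
    return "skani" if threshold_percent >= 90.0 else "fastANI"


if __name__ == "__main__":
    for threshold in (85.0, 90.0, 95.0):
        print(threshold, choose_ani_tool(threshold))
```

With this rule, dereplicating at 85% ANI for genus-level representatives would select fastANI without the user having to know about skani's behaviour at low identity.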

Jianshu

@wwood
Owner

wwood commented Jan 27, 2024

Thanks @jianshu93. I suppose you would also suggest requiring FastANI 1.34 and using the -correct flag?

@AroneyS it seems the docs are wrong - https://wwood.github.io/CoverM/coverm-genome.html#dereplication-genome-clustering still talks about FastANI - can you make sure skani is being used by default and the docs reflect this please?

@jianshu93
Author

Hello @wwood,

Yes, I would suggest so, since the corrected estimate is closer to the actual alignment-based ANI. We will update the bioconda channel for the newest --correct option soon.

Jianshu

@jianshu93
Author

jianshu93 commented Feb 20, 2024

Hello @wwood, the pre-clustering default is now skani, which is very dangerous because skani outputs 0 below 82% ANI. I would suggest using the finch implementation of MinHash (essentially the same as Mash, without over-sketching). Above 90% ANI, skani is as good as fastANI, so I think it is OK to use there.

For pre-clustering, I would suggest BinDash (https://github.com/zhaoxiaofei/bindash); I am developing BinDash version 2 with Xiaofei. BinDash is as accurate as Mash but 100 to 1000 times faster than Mash, dashing, or finch, thanks to a theoretical breakthrough called b-bit one-permutation MinHash with optimal/faster densification. BinDash can also be easily installed via bioconda. Theoretically, dashing has the largest variance, followed by Mash, and BinDash has the smallest; it also has the locality-sensitive hashing property, which neither of the other two has.

This idea is also implemented in my software gsearch, for search and classification of genomes (to be published soon). See gsearch here (https://gitlab.com/Jianshu_Zhao/gsearch); it can be installed via bioconda. Let me know if you also want to include gsearch in CoverM, to classify genomes at extreme speed (almost in seconds).
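
To make the b-bit one-permutation MinHash idea mentioned above more concrete, here is a rough Python sketch. It is not BinDash's implementation: the function names, the k-mer size, the number of bins, the use of `blake2b` as the hash, and the very simple "copy from the next non-empty bin" densification are all assumptions chosen for brevity, and the chance-collision correction is only the usual large-set approximation.

```python
import hashlib


def kmers(seq, k=21):
    """Yield all overlapping k-mers of seq (no canonical k-mers, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def oph_sketch(seq, k=21, bins=256, b=8):
    """One-permutation sketch: one hash per k-mer, one minimum kept per bin."""
    mins = [None] * bins
    for km in kmers(seq, k):
        h = hash64(km)
        idx = h % bins    # bin the k-mer falls into
        val = h // bins   # value compared within that bin
        if mins[idx] is None or val < mins[idx]:
            mins[idx] = val
    if all(v is None for v in mins):
        raise ValueError("sequence is shorter than k")
    mask = (1 << b) - 1
    dense = []
    for i in range(bins):
        j = i
        # Simplified densification (an assumption, not the optimal scheme):
        # an empty bin borrows from the next non-empty bin, wrapping around.
        while mins[j] is None:
            j = (j + 1) % bins
        dense.append(mins[j] & mask)  # keep only the lowest b bits
    return dense


def jaccard_estimate(sketch_a, sketch_b, b=8):
    """Estimate Jaccard similarity from two b-bit sketches of equal length."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same number of bins")
    match = sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)
    chance = 2.0 ** -b  # approximate probability that b-bit values collide by chance
    return min(1.0, max(0.0, (match - chance) / (1.0 - chance)))
```

Each k-mer is hashed once and only `b` bits per bin are stored, which is where the speed and memory savings over a classic MinHash with many hash functions would come from in this simplified picture.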

Thanks,

Jianshu

@jianshu93
Author

Hi Team,

GSearch is out! It can search GTDB or the entire RefSeq/IMG_VR in almost no time, and the running time does not grow with an increasing number of genomes! https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae609/7714450

I believe it will solve all the problems that other search and classification tools are having!

Thanks,

Jianshu
