CD-HIT: accelerated for clustering the next-generation sequencing data
- PMID: 23060610
- PMCID: PMC3516142
- DOI: 10.1093/bioinformatics/bts565
Abstract
Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.
Availability: http://cd-hit.org.
Contact: liwz@sdsc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
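As a rough illustration of how the multi-threaded program described in the Summary might be driven from a script, the sketch below assembles a CD-HIT command line in Python. The helper names (`build_cd_hit_command`, `run_cd_hit`), the file names, and the default values are hypothetical and do not come from the article. The flags used (`-i`, `-o`, `-c`, `-T`, `-M`) are the input, output, identity-threshold, thread-count, and memory-limit options as listed in the CD-HIT usage text; they should be checked against `cd-hit -h` for the installed version.

```python
import shutil
import subprocess


def build_cd_hit_command(fasta_in, out_prefix, identity=0.9, threads=8,
                         memory_mb=16000, program="cd-hit"):
    """Assemble a CD-HIT command line as a list of arguments.

    All defaults here are illustrative assumptions, not recommendations
    from the article.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in (0, 1]")
    if threads < 0:
        raise ValueError("threads must be >= 0 (0 asks CD-HIT to use all cores)")
    return [
        program,
        "-i", fasta_in,         # input FASTA file
        "-o", out_prefix,       # output file name
        "-c", str(identity),    # sequence identity threshold
        "-T", str(threads),     # number of threads
        "-M", str(memory_mb),   # memory limit in MB
    ]


def run_cd_hit(cmd):
    """Run the command if the program is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    cmd = build_cd_hit_command("reads.fasta", "reads_nr", identity=0.95, threads=8)
    print(" ".join(cmd))
```

Keeping command construction separate from execution lets the thread count be varied (for example 1, 2, 4, 8) and each run timed independently, which is one way to check the reported speedup on local hardware.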
Similar articles
- Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006 Jul 1;22(13):1658-9. doi: 10.1093/bioinformatics/btl158. Epub 2006 May 26. PMID: 16731699
- CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010 Mar 1;26(5):680-2. doi: 10.1093/bioinformatics/btq003. Epub 2010 Jan 6. PMID: 20053844 Free PMC article.
- Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010 Oct 1;26(19):2460-1. doi: 10.1093/bioinformatics/btq461. Epub 2010 Aug 12. PMID: 20709691
- Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018 Jul 15;34(14):2490-2492. doi: 10.1093/bioinformatics/bty121. PMID: 29506019 Free PMC article.
- MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics. 2016 May 1;32(9):1323-30. doi: 10.1093/bioinformatics/btw006. Epub 2016 Jan 6. PMID: 26743509
Cited by
- Charting Peptide Shared Sequences Between 'Diabetes-Viruses' and Human Pancreatic Proteins, Their Structural and Autoimmune Implications. Bioinform Biol Insights. 2024 Nov 5;18:11779322241289936. doi: 10.1177/11779322241289936. eCollection 2024. PMID: 39502449 Free PMC article.
- PqqD is a novel peptide chaperone that forms a ternary complex with the radical S-adenosylmethionine protein PqqE in the pyrroloquinoline quinone biosynthetic pathway. J Biol Chem. 2015 May 15;290(20):12908-18. doi: 10.1074/jbc.M115.646521. Epub 2015 Mar 27. PMID: 25817994 Free PMC article.
- Public Baseline and shared response structures support the theory of antibody repertoire functional commonality. PLoS Comput Biol. 2021 Mar 1;17(3):e1008781. doi: 10.1371/journal.pcbi.1008781. eCollection 2021 Mar. PMID: 33647011 Free PMC article.
- A novel end-to-end method to predict RNA secondary structure profile based on bidirectional LSTM and residual neural network. BMC Bioinformatics. 2021 Mar 31;22(1):169. doi: 10.1186/s12859-021-04102-x. PMID: 33789581 Free PMC article.
- Residue contacts predicted by evolutionary covariance extend the application of ab initio molecular replacement to larger and more challenging protein folds. IUCrJ. 2016 Jun 15;3(Pt 4):259-70. doi: 10.1107/S2052252516008113. eCollection 2016 Jul 1. PMID: 27437113 Free PMC article.