Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts

doi:10.1093/nar/gkt646

Nucleic Acids Res. 2013 Sep;41(17):e166.

doi: 10.1093/nar/gkt646. Epub 2013 Jul 27.

Liang Sun¹, Haitao Luo, Dechao Bu, Guoguang Zhao, Kuntao Yu, Changhai Zhang, Yuanning Liu, Runsheng Chen, Yi Zhao

Affiliation

¹ Bioinformatics Research Group, Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, College of Computer Science and Technology, Jilin University, Changchun 130012, China and Laboratory of Bioinformatics and Non-coding RNA, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China.

PMID: 23892401
PMCID: PMC3783192
DOI: 10.1093/nar/gkt646

Liang Sun et al. Nucleic Acids Res. 2013 Sep.

Abstract

Classifying transcripts as protein-coding or non-coding is challenging, especially for transcripts reconstructed from high-throughput sequencing data of poorly annotated species. This study developed and evaluated a signature tool, the Coding-Non-Coding Index (CNCI), which profiles adjoining nucleotide triplets to distinguish protein-coding from non-coding sequences independently of known annotations. CNCI is effective for classifying incomplete transcripts and sense-antisense pairs. The implementation of CNCI offered highly accurate cross-species classification of transcripts assembled from whole-transcriptome sequencing data, which demonstrated gene evolutionary divergence between vertebrates and invertebrates, or between plants, and provided a long non-coding RNA catalog of orangutan. CNCI software is available at http://www.bioinfo.org/software/cnci.

Figures

**Figure 1.**
Illustration of the ANT score matrix and the CNCI framework. The score of each ANT is calculated for human (a) or mouse (b). The three black rows or columns represent the three stop codons, UAA, UAG and UGA (corresponding to ATT, ATC and ACT in cDNA sequence, respectively), which show low frequency in protein-coding sequence. (c) The framework of CNCI. The top panel shows how a sequence in a testing set is processed. For a given sequence, six MLCDS regions (represented by six lines) are identified from the six reading frames (represented by six colored arrows) using a sliding window and a dynamic programming algorithm. The MLCDS region with the maximal S-score is then selected as input to a support vector machine (SVM). The bottom panel shows the training and classification process. Reliable protein-coding and non-coding sequences are used as a training set, and five features are extracted to train the SVM, which classifies the input sequence as protein-coding or non-coding.
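
The scoring and region-selection steps in this caption can be sketched roughly as follows. This is a minimal illustration, not the CNCI implementation: the record does not give the score formula, so the log2 ratio of ANT frequencies between coding and non-coding training sequences is an assumption, and Kadane's maximal-sum segment algorithm stands in for the sliding-window and dynamic programming step. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def triplets(seq, frame):
    """Split seq into non-overlapping triplets starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq):
    """Triplet lists for three forward frames, then three reverse-complement frames."""
    rev = seq.translate(COMPLEMENT)[::-1]
    return [triplets(s, f) for s in (seq, rev) for f in range(3)]


def ant_counts(sequences):
    """Count adjoining nucleotide triplet (ANT) pairs in frame 0 of each sequence."""
    counts = Counter()
    for seq in sequences:
        t = triplets(seq, 0)
        counts.update(zip(t, t[1:]))
    return counts


def ant_score_matrix(coding, noncoding, pseudo=1.0):
    """ASSUMED scoring: log2 ratio of smoothed ANT frequencies, coding vs non-coding."""
    c, n = ant_counts(coding), ant_counts(noncoding)
    c_total = sum(c.values()) + pseudo * 64 * 64
    n_total = sum(n.values()) + pseudo * 64 * 64
    return {
        (a, b): log2(((c[(a, b)] + pseudo) / c_total) / ((n[(a, b)] + pseudo) / n_total))
        for a in CODONS for b in CODONS
    }


def best_segment(scores):
    """Kadane's algorithm: maximal-sum contiguous run as (sum, start, end_exclusive)."""
    best = (float("-inf"), 0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:
            run_sum, run_start = s, i  # start a new run at position i
        else:
            run_sum += s
        if run_sum > best[0]:
            best = (run_sum, run_start, i + 1)
    return best


def most_like_cds(seq, matrix):
    """Return (score, frame_index, start, end) of the best segment over six frames."""
    results = []
    for idx, t in enumerate(six_frames(seq)):
        scores = [matrix.get(pair, 0.0) for pair in zip(t, t[1:])]
        if scores:
            total, start, end = best_segment(scores)
            results.append((total, idx, start, end))
    return max(results) if results else None
```

In this sketch, the tuple returned by `most_like_cds` would be the starting point for feature extraction; the five SVM features themselves are not listed in this record, so they are left out.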

**Figure 2.**
CNCI performance. (a) The top panel shows the ANT score distribution (left y-axis) of the six reading frames for each protein-coding transcript, whose length is normalized to 1100 nucleotide triplets on the x-axis. The red line represents the correct transcriptional reading frame and the other five lines (blue or green) represent the other five reading frames. The green line indicates the distribution of the coverage (right y-axis) of the MLCDS region for each protein-coding transcript across the normalized length. The three regions marked in blue, yellow and green indicate the mean length of the 5′UTR (6%), CDS (56.6%) and 3′UTR (37.4%), respectively, across the normalized length. The bottom panel shows an example gene, NM_021222. (b) ROC analyses of CNCI, CPC and phyloCSF. The MAE, denoted by solid squares, is 0.05, 0.11 and 0.28, respectively. (c) The accuracy of CNCI, CPC and phyloCSF in classifying lincRNAs of different lengths. (d) The ROC curves and taxonomic tree of 12 species. The minimum error rate is given after the name of each species.
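
For readers unfamiliar with ROC summaries like those in panels (b) and (d), the sketch below shows one way such a number could be computed from classifier scores. The caption does not define MAE or the minimum error rate, so the definition used here (the smallest mean of the false-negative and false-positive rates over all score thresholds) is an assumption, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for each distinct threshold; label 1 = coding."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A sequence is called coding when its score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, tp / pos, fp / neg))
    return points


def min_error_rate(points):
    """ASSUMED summary: smallest mean of false-negative and false-positive rates."""
    return min(((1 - tpr) + fpr) / 2 for _, tpr, fpr in points)
```

The input must contain at least one sequence of each class, otherwise the rate denominators are zero.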

Cited by

Unraveling the role of long non-coding RNAs in chronic heat stress-induced muscle injury in broilers.
Liu Z, Liu Y, Xing T, Li J, Zhang L, Zhao L, Jiang Y, Gao F. J Anim Sci Biotechnol. 2024 Oct 8;15(1):135. doi: 10.1186/s40104-024-01093-6. PMID: 39375773
The lncRNA SNHG26 drives the inflammatory-to-proliferative state transition of keratinocyte progenitor cells during wound healing.
Li D, Liu Z, Zhang L, Bian X, Wu J, Li L, Chen Y, Luo L, Pan L, Kong L, Xiao Y, Wang J, Zhang X, Wang W, Toma M, Piipponen M, Sommar P, Xu Landén N. Nat Commun. 2024 Oct 5;15(1):8637. doi: 10.1038/s41467-024-52783-8. PMID: 39366968
Integrated Metabolome, Transcriptome and Long Non-Coding RNA Analysis Reveals Potential Molecular Mechanisms of Sweet Cherry Fruit Ripening.
Liu G, Fu D, Duan X, Zhou J, Chang H, Xu R, Wang B, Wang Y. Int J Mol Sci. 2024 Sep 12;25(18):9860. doi: 10.3390/ijms25189860. PMID: 39337346
Full-Length Transcriptome Construction and Systematic Characterization of Virulence Factor-Associated Isoforms in Vairimorpha (Nosema) Ceranae.
Guo S, Zang H, Liu X, Jing X, Liu Z, Zhang W, Wang M, Zheng Y, Li Z, Qiu J, Chen D, Yan T, Guo R. Genes (Basel). 2024 Aug 23;15(9):1111. doi: 10.3390/genes15091111. PMID: 39336702
Whole-transcriptome analyses of ovine lung microvascular endothelial cells infected with bluetongue virus.
Luo S, Chen Y, Ma X, Miao H, Jia H, Yi H. Vet Res. 2024 Sep 27;55(1):122. doi: 10.1186/s13567-024-01372-0. PMID: 39334220

References

1. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
2. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, et al. Landscape of transcription in human cells. Nature. 2012;489:101–108.
3. Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, Weier M, Liechti A, Aximu-Petri A, Kircher M, et al. The evolution of gene expression levels in mammalian organs. Nature. 2011;478:343–348.
4. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, Gao G. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007;35:W345–W349.
5. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 2011;27:i275–i282.
