Abstract

Facilitated by the rapid progress of high-throughput sequencing technology, a large number of long noncoding RNAs (lncRNAs) have been identified in mammalian transcriptomes over the past few years. LncRNAs have been shown to play key roles in various biological processes such as imprinting control, circuitry controlling pluripotency and differentiation, immune responses and chromosome dynamics. Notably, a growing number of lncRNAs have been implicated in disease etiology. With the increasing number of published lncRNA studies, the experimental data on lncRNAs (e.g. expression profiles, molecular features and biological functions) have accumulated rapidly. In order to enable a systematic compilation and integration of this information, we have updated the NONCODE database (http://www.noncode.org) to version 3.0 to include the first integrated collection of expression and functional lncRNA data obtained from re-annotated microarray studies in a single database. NONCODE has a user-friendly interface with a variety of search or browse options, a local Genome Browser for visualization and a BLAST server for sequence-alignment search. In addition, NONCODE provides a platform for the ongoing collation of ncRNAs reported in the literature. All data in NONCODE are open to users, and can be downloaded through the website or obtained through the SOAP API and DAS services.

INTRODUCTION

Long noncoding RNAs (lncRNAs) were first characterized as mRNA-like noncoding RNAs in that they undergo splicing and have features such as a poly(A) signal/tail (1); an arbitrary criterion of ‘transcripts longer than 200 nucleotides’ was later added to their ‘definition’ (2,3). With the development of experimental technology, especially high-throughput sequencing methods, and further advances in computational prediction algorithms, an increasing number of lncRNAs are being identified in mammals. For example, thousands of conserved large intervening (or intergenic) noncoding RNAs (lincRNAs) were discovered in human and mouse using chromatin signature analysis (4–6). Computational methods including ORF-Predictor and a BLASTP pipeline (7) identified 5446 lncRNAs in the human genome, 1859 lncRNAs were found throughout the human genome by high-throughput sequencing across a prostate cancer cohort (8), and a reference catalog of 8195 lincRNAs was established from ∼4 billion RNA-seq reads across 24 human tissues and cell types (9). The functional properties of lncRNAs are also rapidly being revealed. LncRNAs have already been shown to play key roles in imprinting control, circuitries controlling pluripotency and differentiation, immune responses, chromosome dynamics and human diseases (2,10). In parallel, lncRNA studies have produced an accumulating amount of experimental data, such as expression profiles (11) and information on lncRNA functions in a variety of biological processes (3,12).

In order to compile this information into a comprehensive and systematic database, and thereby facilitate further exploration of the molecular mechanisms of lncRNAs, we have updated the NONCODE database to version 3.0 (NONCODE v3.0). For the first time, NONCODE v3.0 includes expression and functional lncRNA data (13,14) obtained from re-annotated microarray studies. At the same time, other classes of ncRNAs have also been updated. The number of ncRNA entries has nearly doubled since NONCODE v2.0, increasing from 206 226 to 411 552. Other improvements include upgraded BLAST and UCSC Genome Browser functions, and the incorporation of SOAP API and DAS services, which simplify queries, visualization and access to the large amounts of data in NONCODE v3.0. To support continuous updating of the data, an online submission system for new ncRNAs has also been provided. An overview of NONCODE v3.0 is shown in Figure 1. In summary, the aim of the NONCODE database is to provide a user-friendly web interface to browse, search, retrieve and update information on ncRNAs, and to facilitate further research on ncRNAs, gene networks and functional genomics. The NONCODE v3.0 database is freely accessible at http://www.noncode.org.

Figure 1. Overview of the NONCODE v3.0 Database. Raw data were mainly obtained from three types of sources: GenBank, specialized databases and literature. Sequences from different sources first go through redundancy elimination and are then included in the database. The ncRNAs in NONCODE can be accessed and analyzed by various tools and services, including a variety of search or browse options, a local Genome Browser for visualization, and a BLAST server for sequence-alignment search. ncRNA sequences can be downloaded directly from the website or accessed through the SOAP API or DAS servers. In addition, an on-line submission system is provided for continuous collection of new ncRNAs.

DATA COLLECTION

NONCODE v3.0 has, as far as possible, collected all published ncRNAs that have been experimentally verified or identified by computational methods. It presently contains 411 552 public sequences distributed across 134 ncRNA classes and 26 cellular processes. The 411 552 ncRNA entries in the database were collected from 1239 different organisms, and 73 370 of the entries represent lncRNAs, covering nearly all published human and mouse lncRNAs (see Figures 2 and 3 for further details on the lncRNAs). This makes NONCODE one of the most comprehensive and systematic ncRNA databases. The new data included in NONCODE v3.0 have mainly been obtained from the following three types of sources.

Figure 2. Length distribution of human and mouse lncRNAs.

Figure 3. Distribution of the lncRNAs across organisms.

GenBank

We extracted ncRNAs from GenBank with the keywords ‘ncRNA’, ‘miRNA’, ‘piRNA’, ‘snRNA’, ‘snoRNA’, ‘snmRNA’, ‘tmRNA’, ‘SRP RNA’ or ‘gRNA’ by using the pipeline built in NONCODE v1.0/v2.0 (15,16). Altogether 8954 unique new ncRNAs which had not been collected by other databases were obtained and labeled as ‘from GenBank’.

Specialized databases

The latest releases of a number of well-known databases were used as the sources for NONCODE: RNAdb v2.0 (17), fRNAdb v3.0 (18), H-InvDB v7.5 (19), FANTOM3 (20), lncRNAdb (21), miRBase v17.0 (22), RefSeq (23), UCSC (24) and Ensembl (25). The ncRNAs were first extracted from these databases, then passed through the redundancy elimination operation (below), and finally entered into NONCODE.

Literature sources

Several new types of ncRNAs have recently been reported in the literature [e.g. lincRNAs (4–6,9) and eRNAs (26)], but have not yet been included in any of the specialized databases. We therefore constructed a pipeline for mining the literature for recently published ncRNAs. We first used EFetch to retrieve literature published since 1 January 2009 from PubMed, employing the keywords ‘ncRNA’, ‘noncoding’, ‘non-coding’, ‘noncode’ or ‘non-code’, and obtained 2605 relevant articles. After manually selecting reports on new ncRNAs, we retrieved sequences, genome locations and other relevant information concerning these transcripts. This resulted in 20 252 new ncRNAs being entered into the database.
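
The keyword search described above can be sketched as an NCBI E-utilities query. This is only an illustration, not the authors' pipeline: in E-utilities a keyword search is normally issued through ESearch, whose returned PubMed IDs are then passed to EFetch. The exact query string, the end date and the `retmax` value below are assumptions, since the text gives only the keywords and the start date.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Keywords listed in the text; combining them with OR is an assumption.
KEYWORDS = ["ncRNA", "noncoding", "non-coding", "noncode", "non-code"]


def build_pubmed_search_url(keywords, mindate, maxdate, retmax=100):
    """Build an ESearch URL for PubMed restricted to a publication-date range.

    mindate and maxdate use the YYYY/MM/DD form and must be given together.
    """
    params = {
        "db": "pubmed",
        "term": " OR ".join(keywords),
        "datetype": "pdat",   # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,   # hypothetical end date, not stated in the text
        "retmax": retmax,     # hypothetical page size
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_pubmed_search_url(KEYWORDS, "2009/01/01", "2011/09/30"))
```

The function only builds the URL; fetching it and passing the resulting IDs to EFetch, followed by the manual selection step, is left out.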

REDUNDANCY ELIMINATION

The data sources mentioned above inevitably overlap to varying extents, so a step to eliminate redundancies across sources is necessary. To decide whether two primary entries might represent the same ncRNA, we took into account their accession numbers, organism information and sequence similarity. The latter was measured by the identity, e-value and overlap-ratio returned by BLAST alignments. The overlap-ratio for each entry was calculated as the length of the matched sequence divided by the full length of the sequence, and the overlap-ratios of both entries were taken into consideration. Based on this information, two entries derived from the same organism fell into one of three categories: (i) Identical entries. These are entries with identical sequences and genomic locations, and with non-conflicting accession numbers (‘non-conflicting accession numbers’ refers to cases where both entries have the same accession number or at least one entry does not have an accession number). (ii) Similar entries. These are entries with similar sequence information (overlap-ratio>0.8, identity>0.8, e-value<1e-10). (iii) Different ncRNAs. These are entries that do not fall into categories (i) or (ii). The ‘identical entries’ were integrated as one record in NONCODE, whereas the ‘similar entries’ were all retained but assigned the same ‘uniqID’ to indicate their relationship. After the redundancy elimination step, a total of 411 552 ncRNAs were recorded in NONCODE v3.0.
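
The three-way classification above can be sketched as follows. This is a minimal illustration rather than the NONCODE pipeline itself: the `Entry` and `Alignment` types are hypothetical, identity is expressed as a fraction, a single matched length is used for both entries, and requiring both overlap-ratios to exceed 0.8 is an assumed reading of ‘taken into consideration’.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical primary entry from one data source."""
    organism: str
    sequence: str
    locus: Optional[str] = None
    accession: Optional[str] = None


@dataclass
class Alignment:
    """Hypothetical summary of a pairwise BLAST alignment."""
    matched_length: int
    identity: float   # fraction in [0, 1]
    e_value: float


def overlap_ratio(matched_length: int, sequence_length: int) -> float:
    """Matched length divided by the full sequence length."""
    return matched_length / sequence_length


def accessions_conflict(a: Entry, b: Entry) -> bool:
    """Conflict only if both entries have accession numbers and they differ."""
    return (a.accession is not None and b.accession is not None
            and a.accession != b.accession)


def classify_pair(a: Entry, b: Entry, aln: Alignment) -> str:
    """Return 'identical', 'similar' or 'different' for two entries."""
    if a.organism != b.organism:
        return "different"  # categories are defined within one organism
    if (a.sequence == b.sequence and a.locus == b.locus
            and not accessions_conflict(a, b)):
        return "identical"
    both_overlap = (
        overlap_ratio(aln.matched_length, len(a.sequence)) > 0.8
        and overlap_ratio(aln.matched_length, len(b.sequence)) > 0.8
    )
    if both_overlap and aln.identity > 0.8 and aln.e_value < 1e-10:
        return "similar"
    return "different"
```

Under this sketch, entries classified as ‘identical’ would be merged into one record, and entries classified as ‘similar’ would share a uniqID.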

ncRNA ANNOTATION

One significant characteristic of NONCODE is its comprehensive annotation information. Each sequence in NONCODE is annotated with (i) basic information including the ncRNA name, alias, sequence, length, organism, references etc. and (ii) additional information concerning its function, cellular role, cellular location and process function class (PfClass). In this update, four important attributes, two of which are dedicated to lncRNAs, have been added as follows:

Coding potential assessment

Since not all published ncRNAs have undergone detailed experimental analysis, we calculated a Coding Potential Calculator score [CPC score (27)] and a Coding Non-Coding Index (CNCI; in-house software) for each ncRNA to evaluate its coding potential. This enables the user to quickly identify transcripts whose coding potential may need further scrutiny.

Mapping information

For most ncRNAs, we collected mapping information from the original source. The remaining ncRNAs were mapped to their reference genomes using BLAT, and the top hit with >99% match to the reference genome was retained as the ‘locus’. The mapped locations can be viewed in the UCSC Genome Browser.
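
The locus selection rule can be illustrated with a short sketch. The `Hit` type is hypothetical, and defining ‘match’ as the number of matching bases divided by the query length is an assumption; the text does not say how the >99% figure was computed from the BLAT output.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """Hypothetical summary of one BLAT alignment of an ncRNA to the genome."""
    chrom: str
    start: int
    end: int
    strand: str
    matches: int      # number of matching bases
    query_size: int   # length of the ncRNA sequence


def match_fraction(hit: Hit) -> float:
    """Fraction of the query that matches the genome (assumed definition)."""
    return hit.matches / hit.query_size


def select_locus(hits: Sequence[Hit], min_fraction: float = 0.99) -> Optional[Hit]:
    """Return the best hit if it exceeds the threshold, otherwise None."""
    best = max(hits, key=match_fraction, default=None)
    if best is None or match_fraction(best) <= min_fraction:
        return None
    return best
```

An ncRNA for which `select_locus` returns None would simply have no locus recorded under this sketch.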

Expression profiles

Three independent sources of multi-tissue expression profiles have been included to facilitate the functional study of 27 408 lncRNAs. These are the FANTOM custom-designed microarray data, which contain the expression profiles of 10 874 mouse lncRNAs across 20 tissues (28); the re-annotated expression profiles of Affymetrix arrays (13), which contain expression profiles of 343 human lncRNAs across 65 tissues and 4075 mouse lncRNAs across 22 tissues; and the expression profiles of 13 565 human lincRNAs from RNA-seq data across 22 tissues and cell lines (9). As more re-annotated or other lncRNA microarray data, along with RNA-seq data, become available, these will be integrated into NONCODE.

Potential functions

Functional predictions may guide and assist future investigations of lncRNAs. A total of 1635 lncRNAs have been annotated with potential functions that have been predicted based on a Coding-Noncoding co-expression network (13,14). The estimated ‘quality’ of each functional prediction is indicated by a P-value.

SERVICE UPDATE

The NONCODE database is based on MySQL and the web site is powered by an Apache server. NONCODE has a user-friendly interface with a number of convenient browse and search options. Several useful services are available for users to access the NONCODE data, including BLAST, the UCSC Genome Browser, a SOAP API, DAS and an online submission system. BLAST and the UCSC Genome Browser have been upgraded in the new NONCODE version, while the other three services are new additions.

Browse and search

Two browse options, ‘Browse by expression profile’ and ‘Browse by functional prediction’, have been added to the new NONCODE version. These give rapid access to the expression profiles or to information on potential functions of the ncRNAs. Search by GO term functional keywords is also supported. All browse and search results can be exported directly from the query page. In addition, search results can be filtered by species (human or mouse) and transcript length (more or less than 200 nt). These options make browsing and searching more convenient for the user.

SOAP API service

Simple Object Access Protocol (SOAP) is a protocol specification for exchanging structured information between client and server computers in the implementation of Web Services. NONCODE now provides a SOAP API that can be easily accessed for custom queries. Users can obtain query results by writing short programs that call six SOAP query functions: ncRNADetails(), QueryByRNA(), QueryByClass(), QueryByReference(), QueryByNucleotide() and QueryByLength(). The SOAP API service in NONCODE can be accessed via the following URL: http://www.noncode.org/soapApi.html. No installation of any program or package is needed to use the functions.
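
To give a sense of what a call to one of these functions involves, the sketch below builds a SOAP 1.1 request envelope using only the Python standard library. Only the function name QueryByLength comes from the text. The service namespace and the parameter names (`minLength`, `maxLength`) are hypothetical placeholders; the real values would have to be taken from the service description on the NONCODE SOAP API page.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the real one is defined by the NONCODE service.
SERVICE_NS = "urn:example:noncode"


def build_soap_request(function: str, params: dict) -> str:
    """Serialize a SOAP 1.1 envelope that calls `function` with `params`."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{function}")
    for name, value in params.items():
        # Parameter names are placeholders, not the documented signature.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_soap_request("QueryByLength", {"minLength": 200, "maxLength": 500}))
```

The resulting XML would be sent as the body of an HTTP POST to the service endpoint; in practice a SOAP client library that reads the service's WSDL would generate this envelope automatically.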

DAS service

The Distributed Annotation System (DAS) allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis on the client side (29), which facilitates the integration and collation of ncRNA annotations from multiple servers. The DAS service is now available in NONCODE. It provides access to all annotation data for the current assemblies featured in NONCODE, and can be visited via: http://www.noncode.org/das.html. Several examples are provided to guide the construction of DAS queries and the fetching of NONCODE tracks through the DAS server.
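
A DAS feature query follows the general pattern `<server>/<data source>/features?segment=<reference>:<start>,<stop>` defined by the DAS 1.x specification. The sketch below builds such a URL. The server prefix and data source name in the usage example are hypothetical; the actual values for NONCODE are listed on its DAS page.

```python
def das_features_url(server: str, source: str, reference: str,
                     start: int, stop: int) -> str:
    """Build a DAS 1.x 'features' request URL for one genomic segment.

    Coordinates are 1-based and inclusive, as in the DAS specification.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    return (f"{server.rstrip('/')}/{source}/features"
            f"?segment={reference}:{start},{stop}")


if __name__ == "__main__":
    # Hypothetical server prefix and data source name.
    print(das_features_url("http://das.example.org/das", "noncode_example",
                           "chr1", 1000, 2000))
```

The server answers such a request with an XML document listing the features that overlap the segment, which a DAS-aware genome browser can display as a track.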

Online submission of new ncRNAs

In order to maintain an up-to-date and comprehensive resource, we encourage users to submit their own data to NONCODE. The submission page offers three different submission options: (i) if the data have already been submitted to NCBI, the user can just submit the NCBI accession number to us; otherwise, (ii) if the data are small, the user can paste them into a text box in FASTA format; or (iii) if the amount of data is large, the user can upload a FASTA format file to our server. In order to ensure data quality, we recommend users provide their names and email addresses. Email is especially necessary for quick and convenient communication.
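
Before pasting or uploading sequences, a submitter may want to check that the data are well-formed FASTA. The following is a minimal sketch of such a check, not the validation performed by the NONCODE server; the accepted alphabet (A, C, G, U, T, N) and the rejection of duplicate names are assumptions.

```python
from typing import Dict

# Assumed nucleotide alphabet for submitted ncRNA sequences.
VALID_BASES = set("ACGUTN")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, raising ValueError on problems."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without a sequence name")
            name = fields[0]  # first token of the header is used as the name
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        else:
            if name is None:
                raise ValueError("sequence data found before any header")
            seq = line.upper()
            bad = set(seq) - VALID_BASES
            if bad:
                raise ValueError(f"unexpected characters in {name}: {sorted(bad)}")
            records[name] += seq
    empty = [n for n, s in records.items() if not s]
    if empty:
        raise ValueError(f"records without sequence: {empty}")
    return records
```

For example, `parse_fasta(">nc1 test\nACGU\nacgu\n")` yields a single record named `nc1` with the two sequence lines joined and upper-cased.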

CONCLUSION

Compared with the previous versions of NONCODE, version 3.0 is a step towards a more integrated knowledge database, particularly with respect to lncRNAs. The total number of ncRNAs, the number of lncRNAs, the functional annotations and the on-line services have all been expanded (Table 1). Beyond mere sequence information, NONCODE v3.0 also integrates various kinds of informative content, such as genome context, process function, coding potential score, re-annotated expression data and potential functions. NONCODE is designed to enable integration with other resources, including the UCSC Genome Browser, GenBank and other databases, and thus provides a location from which researchers can obtain a wide range of information regarding their genes of interest. Although expression data and annotations of predicted functions are currently integrated with only a small portion of the lncRNA entries in NONCODE, we expect this to increase as more data are published.

Table 1. Comparison of NONCODE v1.0, v2.0 and v3.0

| Version | Total ncRNA number | lncRNA number | Functional annotation, etc. | Services |
| --- | --- | --- | --- | --- |
| 1.0 | 5339 | 1557 | PfClass | Browse, Search, Download |
| 2.0 | 206 226 | 35 805 | PfClass | Browse, Search, Download, Blast, Genome Browser |
| 3.0 | 411 552 | 73 370 | PfClass, expression profile, predicted functions | Browse, Search, Download, Blast, Genome Browser, SOAP API, DAS, On-line Submission |

The decreasing cost and improved depth of RNA-sequencing technology have already enabled numerous transcriptome studies in a variety of species. As a result, huge numbers of lncRNAs are expected to be identified and characterized in the near future. NONCODE will continue to keep track of these data and promptly collect them into the database. In addition, the central role of lncRNAs in the molecular etiology of complex diseases, such as cancer, will make them a persistent research hotspot. We therefore expect that NONCODE will remain an informative and valuable data source on the biological roles of lncRNAs for the scientific community.

FUNDING

National Natural Science Foundation of China (Nos. 31071137, 31000586, 30970623); National Key Basic Research and Development Program (973) (Grant No. 0997011001); Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-EW-R-01, KSCX2-EW-R-0102); Natural Science Foundation of Jiangsu Province (BK2008231); Sci-tech Innovation Team of Jiangsu University (2008-018-02). Funding for open access charge: Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-EW-R-01).

Conflict of interest statement. None declared.

REFERENCES

1. Erdmann VA, Szymanski M, Hochberg A, de Groot N, Barciszewski J. Collection of mRNA-like non-coding RNAs. Nucleic Acids Res. 1999, vol. 27 (pg. 192-195)
2. Mercer TR, Dinger ME, Mattick JS. Long non-coding RNAs: insights into functions. Nat. Rev. Genet. 2009, vol. 10 (pg. 155-159)
3. Nagano T, Fraser P. No-nonsense functions for long noncoding RNAs. Cell 2011, vol. 145 (pg. 178-181)
4. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, Cabili MN. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 2009, vol. 458 (pg. 223-227)
5. Khalil AM, Guttman M, Huarte M, Garber M, Raj A, Rivea Morales D, Thomas K, Presser A, Bernstein BE, Van Oudenaarden A. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc. Natl Acad. Sci. USA 2009, vol. 106 (pg. 11667-11672)
6. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 2010, vol. 28 (pg. 503-510)
7. Jia H, Osak M, Bogu GK, Stanton LW, Johnson R, Lipovich L. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA 2010, vol. 16 (pg. 1478-1487)
8. Prensner JR, Iyer MK, Balbin OA, Dhanasekaran SM, Cao Q, Brenner JC, Laxman B, Asangani IA, Grasso CS, Kominsky HD. Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat. Biotechnol. 2011, vol. 29 (pg. 742-749)
9. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011, vol. 25 (pg. 1915-1927)
10. Taft RJ, Pang KC, Mercer TR, Dinger M, Mattick JS. Non-coding RNAs: regulators of disease. J. Pathol. 2010, vol. 220 (pg. 126-139)
11. Dinger ME, Pang KC, Mercer TR, Crowe ML, Grimmond SM, Mattick JS. NRED: a database of long noncoding RNA expression. Nucleic Acids Res. 2009, vol. 37 (pg. D122-D126)
12. Guttman M, Donaghey J, Carey BW, Garber M, Grenier JK, Munson G, Young G, Lucas AB, Ach R, Bruhn L. lincRNAs act in the circuitry controlling pluripotency and differentiation. Nature 2011, vol. 477 (pg. 295-300)
13. Liao Q, Liu C, Yuan X, Kang S, Miao R, Xiao H, Zhao G, Luo H, Bu D, Zhao H. Large-scale prediction of long non-coding RNA functions in a coding-non-coding gene co-expression network. Nucleic Acids Res. 2011, vol. 39 (pg. 3864-3878)
14. Liao Q, Xiao H, Bu D, Xie C, Miao R, Luo H, Zhao G, Yu K, Zhao H, Skogerbø G. ncFANs: a web server for functional annotation of long non-coding RNAs. Nucleic Acids Res. 2011, vol. 39 (pg. W118-W124)
15. Liu C, Bai B, Skogerbø G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Res. 2005, vol. 33 (pg. D112-D115)
16. He S, Liu C, Skogerbø G, Zhao H, Wang J, Liu T, Bai B, Zhao Y, Chen R. NONCODE v2.0: decoding the non-coding. Nucleic Acids Res. 2008, vol. 36 (pg. D170-D172)
17. Pang KC, Stephen S, Dinger ME, Engström PG, Lenhard B, Mattick JS. RNAdb 2.0--an expanded database of mammalian non-coding RNAs. Nucleic Acids Res. 2006, vol. 35 (pg. D178-D182)
18. Mituyama T, Yamada K, Hattori E, Okida H, Ono Y, Terai G, Yoshizawa A, Komori T, Asai K. The Functional RNA Database 3.0: databases to support mining and annotation of functional RNAs. Nucleic Acids Res. 2009, vol. 37 (pg. D89-D92)
19. Yamasaki C, Murakami K, Takeda J, Sato Y, Noda A, Sakate R, Habara T, Nakaoka H, Todokoro F, Matsuya A. H-InvDB in 2009: extended database and data mining resources for human genes and transcripts. Nucleic Acids Res. 2010, vol. 38 (pg. D626-D632)
20. Carninci P, Kasukawa T, Katayama S, Gough J, Frith M, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C. The transcriptional landscape of the mammalian genome. Science 2005, vol. 309 (pg. 1559-1563)
21. Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res. 2011, vol. 39 (pg. D146-D151)
22. Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011, vol. 39 (pg. D152-D157)
23. Pruitt KD, Tatusova T, Klimke W, Maglott DR. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, vol. 37 (pg. D32-D36)
24. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011, vol. 39 (pg. D876-D882)
25. Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S. Ensembl 2011. Nucleic Acids Res. 2011, vol. 39 (pg. D800-D806)
26. Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S. Widespread transcription at neuronal activity-regulated enhancers. Nature 2010, vol. 465 (pg. 182-187)
27. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, Gao G. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 2007, vol. 35 (pg. W345-W349)
28. Bono H, Yagi K, Kasukawa T, Nikaido I, Tominaga N, Miki R, Mizuno Y, Tomaru Y, Goto H, Nitanda H. Systematic expression profiling of the mouse transcriptome using RIKEN cDNA microarrays. Genome Res. 2003, vol. 13 (pg. 1318-1323)
29. Dowell R, Jokerst R, Day A, Eddy S, Stein L. The distributed annotation system. BMC Bioinformatics 2001, vol. 2 pg. 7

Author notes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
