Review. Hum Mol Genet. 2010 Oct 15;19(R2):R162-8.
doi: 10.1093/hmg/ddq362. Epub 2010 Aug 25.

Transcribed dark matter: meaning or myth?

Chris P Ponting et al. Hum Mol Genet. 2010.

Abstract

Genomic tiling arrays, cDNA sequencing and, more recently, RNA-Seq have provided initial insights into the extent and depth of transcribed sequence across human and other genomes. These methods have led to greatly improved annotations of protein-coding genes, but have also identified transcription outside of annotated exons. One resultant issue that has aroused dispute is the balance of transcription of known exons against transcription outside of known exons. While non-genic 'dark matter' transcription was found by tiling arrays to be pervasive, it was seen to contribute only a small percentage of the polyadenylated transcriptome in some RNA-Seq experiments. This apparent contradiction has been compounded by a lack of clarity about what exactly constitutes a protein-coding gene. It remains unclear, for example, whether all transcripts that overlap on either strand within a genomic locus should be assigned to a single gene locus, including those that fail to share promoters, exons and splice junctions. The inability of tiling arrays and RNA-Seq to count transcripts, rather than exons or exon pairs, adds to these difficulties. While there is agreement that thousands of apparently non-coding loci are present outside of protein-coding genes in the human genome, there is vigorous debate about what constitutes evidence for their functionality. These issues will be resolved only upon the demonstration, or otherwise, that organismal or cellular phenotypes frequently result when non-coding RNA loci are disrupted.

Figures

Figure 1.
Exons from known genes are associated with 88% of uniquely mapping short reads, but provide only 22% of the genomic sequence that is transcribed [human data from van Bakel et al. (19)]. By contrast, only 6% of uniquely mapping reads lie in intergenic sequence, but these lowly expressed regions cover about one quarter of all transcribed genomic sequence.
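The contrast in the Figure 1 caption can be restated as a relative read density: the share of reads divided by the share of transcribed sequence. The sketch below uses only the four percentages quoted in the caption; the function name `relative_read_density` and the use of 0.25 for "about one quarter" are assumptions made for illustration, and the calculation is not taken from the paper.

```python
# Shares quoted in the Figure 1 caption (human data from van Bakel et al.).
exon_read_share = 0.88        # uniquely mapping reads in known exons
exon_seq_share = 0.22         # transcribed genomic sequence in known exons
intergenic_read_share = 0.06  # uniquely mapping reads in intergenic sequence
intergenic_seq_share = 0.25   # assumption: "about one quarter" taken as 0.25


def relative_read_density(read_share: float, seq_share: float) -> float:
    """Share of reads per share of transcribed sequence (1.0 = average)."""
    return read_share / seq_share


exon_density = relative_read_density(exon_read_share, exon_seq_share)
intergenic_density = relative_read_density(
    intergenic_read_share, intergenic_seq_share
)

# How many times more densely exons are covered than intergenic regions.
fold_difference = exon_density / intergenic_density

print(f"exon density:       {exon_density:.2f}")        # about 4.00
print(f"intergenic density: {intergenic_density:.2f}")  # 0.24
print(f"fold difference:    {fold_difference:.1f}")     # about 16.7
```

Under these assumptions, known exons carry roughly 17 times more reads per unit of transcribed sequence than intergenic regions, which is one way to see how the same data can be described both as pervasive transcription and as a small fraction of the transcriptome.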
Figure 2.
Cufflinks (6) multiexonic transcript models from short reads mapped to mouse chromosome 6 (assembly mm9, bases 8 200 479–8 555 654) and viewed using the UCSC Genome Browser (56). Short reads (coverage represented at top) map not just to known exons of coding and non-coding loci, but also to introns and gene termini. Multiexonic Cufflinks transcript models consistent with known gene annotations (shown in blue, based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics) are shown as red exons joined by pink lines (blue joining lines indicate introns not supported by the data). Three protein-coding loci (Rpa3, Gig18 and Glcci1) and four FANTOM ncRNA loci (AK039608, AK039954, BC062820 and AK037260) are represented. Transcripts discussed in the main text are labelled A–C. Mouse brain data were generated using the Illumina Genome Analyzer (unpublished data).
Figure 3.
Nucleotide substitution rates tend to be slower in lincRNA sequence (solid line) than in putatively neutral sequence (dotted line). Cumulative frequency histogram of nucleotide substitution rates for 3390 human lincRNA sequences aligned to mouse (solid line) compared with aligned putatively neutral sequence (dotted line). Neutrally evolving sequence was taken from genomically adjacent transposable element (TE) sequence inferred to have been present in the last common ancestor of human and mouse (‘ancestral repeats’, ‘ARs’). Of 16 268 human intergenic seqfrags obtained from an RNA-Seq experiment (19), 3390 were chosen because they contain at least 100 bp of sequence aligned to mouse. Nucleotide substitution rates in lincRNA loci (d_locus) and ancestral repeats (d_AR) were calculated using a method that accounts for GC content (32). LincRNA loci tend to have evolved significantly more slowly, by about 10%, than neighbouring neutral sequence (median d_locus/d_AR = 0.902, median d_locus = 0.438, median d_AR = 0.488; two-sided Mann–Whitney test, P < 10^-15). Further sets of lincRNA loci derived from cDNA sequencing and chromatin signatures show very similar degrees of evolutionary constraint (d_locus/d_AR = 0.887 and 0.904, respectively) (32).
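The "10% more slowly" figure in the Figure 3 caption follows from the median ratio of lincRNA to ancestral-repeat substitution rates. The sketch below reproduces that arithmetic from the values quoted in the caption and shows that a median of per-locus ratios generally differs from the ratio of the two medians. The helper `median_rate_ratio` and the toy rate lists are hypothetical and are not data or code from the paper; the GC-aware rate estimation and the Mann–Whitney test are not reproduced here.

```python
from statistics import median

# Values quoted in the Figure 3 caption.
MEDIAN_RATIO = 0.902    # median of per-locus (lincRNA rate / AR rate)
MEDIAN_D_LOCUS = 0.438  # median lincRNA substitution rate
MEDIAN_D_AR = 0.488     # median ancestral-repeat substitution rate

# Fractional slowdown implied by the median ratio: about 0.098, i.e. ~10%.
slowdown = 1 - MEDIAN_RATIO

# Ratio of the two medians: about 0.898, close to but not equal to 0.902.
ratio_of_medians = MEDIAN_D_LOCUS / MEDIAN_D_AR


def median_rate_ratio(d_locus, d_ar):
    """Median of paired per-locus ratios d_locus / d_AR."""
    return median(locus / ar for locus, ar in zip(d_locus, d_ar))


# Hypothetical toy rates for three loci and their neighbouring repeats.
toy_d_locus = [0.40, 0.45, 0.50]
toy_d_ar = [0.50, 0.45, 0.52]
toy_ratio = median_rate_ratio(toy_d_locus, toy_d_ar)

print(f"slowdown from caption:  {slowdown:.3f}")
print(f"ratio of medians:       {ratio_of_medians:.3f}")
print(f"toy median of ratios:   {toy_ratio:.3f}")
```

Pairing each locus with its neighbouring ancestral repeats, as the caption describes, controls for regional variation in mutation rate, which is why the per-locus ratio is the more informative summary.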

References

    1. Pertea M., Salzberg S.L. Between a chicken and a grape: estimating the number of human genes. Genome Biol. 2010;11:206. doi:10.1186/gb-2010-11-5-206.
    2. Ponting C.P. The functional repertoires of metazoan genomes. Nat. Rev. Genet. 2008;9:689–698. doi:10.1038/nrg2413.
    3. Dinger M.E., Pang K.C., Mercer T.R., Mattick J.S. Differentiating protein-coding and noncoding RNA: challenges and ambiguities. PLoS Comput. Biol. 2008;4:e1000176. doi:10.1371/journal.pcbi.1000176.
    4. Frith M.C., Forrest A.R., Nourbakhsh E., Pang K.C., Kai C., Kawai J., Carninci P., Hayashizaki Y., Bailey T.L., Grimmond S.M. The abundance of short proteins in the mammalian proteome. PLoS Genet. 2006;2:e52. doi:10.1371/journal.pgen.0020052.
    5. Yamada K., Lim J., Dale J.M., Chen H., Shinn P., Palm C.J., Southwick A.M., Wu H.C., Kim C., Nguyen M., et al. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science. 2003;302:842–846. doi:10.1126/science.1088305.