Transcript length bias in RNA-seq data confounds systems biology

Alicia Oshlack¹, Matthew J Wakefield

Biol Direct. 2009 Apr 16:4:14. doi: 10.1186/1745-6150-4-14.

Affiliation

¹ Walter and Eliza Hall Institute of Medical Research, Parkville, Vic, Australia. oshlack@wehi.edu.au

PMID: 19371405
PMCID: PMC2678084
DOI: 10.1186/1745-6150-4-14

Abstract

Background: Several recent studies have demonstrated the effectiveness of deep sequencing for transcriptome analysis (RNA-seq) in mammals. As RNA-seq becomes more affordable, whole genome transcriptional profiling is likely to become the platform of choice for species with good genomic sequences. As yet, a rigorous analysis methodology has not been developed and we are still in the stages of exploring the features of the data.

Results: We investigated the effect of transcript length bias in RNA-seq data using three different published data sets. For standard analyses using aggregated tag counts for each gene, the ability to call differentially expressed genes between samples is strongly associated with the length of the transcript.

Conclusion: Transcript length bias for calling differentially expressed genes is a general feature of current protocols for RNA-seq technology. This has implications for the ranking of differentially expressed genes, and in particular may introduce bias in gene set testing for pathway analysis and other multi-gene systems biology analyses.

Reviewers: This article was reviewed by Rohan Williams (nominated by Gavin Huttley), Nicole Cloonan (nominated by Mark Ragan) and James Bullard (nominated by Sandrine Dudoit).
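
The length dependence described in the Results can be illustrated with a minimal sketch. It assumes that aggregated tag counts are Poisson with expectation proportional to transcript length, and it uses a simple normal-approximation z-statistic rather than the tests applied in the studies analysed. The function name `poisson_z` and the per-base tag rates are hypothetical, chosen only for illustration.

```python
from math import sqrt

def poisson_z(rate_a, rate_b, length):
    """Approximate z-statistic for comparing two Poisson counts whose
    expected values are rate * length (normal approximation)."""
    count_a = rate_a * length
    count_b = rate_b * length
    return (count_a - count_b) / sqrt(count_a + count_b)

# Hypothetical per-base tag rates with a fixed 1.5-fold difference.
rate_a, rate_b = 0.015, 0.010

# Same fold change, three transcript lengths.
z_by_length = {length: poisson_z(rate_a, rate_b, length)
               for length in (500, 2000, 8000)}

for length, z in z_by_length.items():
    print(f"length {length:>5}: z = {z:.2f}")
```

Under these assumptions the statistic grows with the square root of the length, so the same fold change clears a fixed significance cut-off only for the longer transcripts.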

Figures

**Figure 1**
**Differential expression as a function of transcript length**. The data is binned according to transcript length and the percentage of transcripts called differentially expressed using a statistical cut-off is plotted (points). A linear regression is also plotted (lines). a – e use all the data from RNA-seq and the microarrays from studies [4-6] respectively. f and g plot 33% of genes with highest expression levels (blue crosses) and 33% of genes with low expression (red triangles) taken from the microarray data for genes which appear on both platforms in [6]. The regression gives a significant trend for the percent of differential expression with transcript length for a, c, d and f and the lowly expressed genes in g. Note that this figure illustrates common data features between disparate experiments and is not a comparison between platforms, methods or experiments.
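
The binning and regression described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the caption does not say how bins were formed, so equal-sized bins are assumed, and the helper names and toy data are hypothetical.

```python
from statistics import mean

def percent_de_by_length_bin(lengths, is_de, n_bins=10):
    """Sort transcripts by length, split them into equal-sized bins, and
    return (mean length, percent called DE) for each bin."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bin_size = len(order) // n_bins
    points = []
    for b in range(n_bins):
        start = b * bin_size
        # The last bin absorbs any remainder.
        end = (b + 1) * bin_size if b < n_bins - 1 else len(order)
        idx = order[start:end]
        points.append((mean(lengths[i] for i in idx),
                       100.0 * sum(is_de[i] for i in idx) / len(idx)))
    return points

def fit_line(points):
    """Ordinary least-squares slope and intercept through (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Toy input (not from the paper): 8 transcripts with boolean DE calls.
lengths = [500, 700, 1200, 1500, 2500, 3000, 5000, 8000]
is_de = [False, False, False, True, False, True, True, True]

points = percent_de_by_length_bin(lengths, is_de, n_bins=4)
slope, intercept = fit_line(points)
print(points, slope, intercept)
```

A positive slope on real data would correspond to the trend reported for panels a, c, d and f. Assessing its significance needs a standard error for the slope, which this sketch omits.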

**Figure 2**
**Mean-variance relationship**. Here we show the sample variance across lanes in the liver sample from the Marioni et al. [6] data plotted as a function of the mean for each gene (a). Next we have the same data where the tag counts for each gene are divided by the length of the gene (b). The red line fits a linear relationship between the mean and variance for the one third of shortest genes while the blue line is the linear fit to the longest genes. In plot a the fits are very close to the line of equality between mean and variance (black line) which is what would be expected from a Poisson process. In plot b the short genes have higher variance for a given expression level than long genes.
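
The contrast between the two panels follows from a short calculation, given here as a sketch. It assumes the tag count $X$ for a gene is Poisson with mean $\mu$ and that the gene length $L$ is a fixed constant. The symbols $X$, $Y$, $\mu$ and $L$ are introduced for illustration and do not appear in the caption.

```latex
\begin{aligned}
X &\sim \mathrm{Poisson}(\mu), & \mathrm{E}[X] &= \mathrm{Var}[X] = \mu,\\
Y &= X / L, & \mathrm{E}[Y] &= \mu / L, \qquad \mathrm{Var}[Y] = \mu / L^{2} = \mathrm{E}[Y] / L.
\end{aligned}
```

On the raw count scale the variance equals the mean whatever the length, as in panel a. After dividing by length, the variance at a given expression level $\mathrm{E}[Y]$ is inversely proportional to $L$, so short genes sit above long genes, as in panel b.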

**Figure 3**
**Length of genes found in KEGG pathways significantly over-represented with differentially expressed genes**. The first box in the plot represents the length of genes found in the four significant categories from both platforms. The second box is the length of genes found in categories significant only in the sequencing data. The third box is the length of all genes in common to both technologies. It can be seen that categories unique to the sequencing data tend to have longer transcripts.
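
The caption does not state which over-representation test was used. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test on the number of differentially expressed genes in a pathway. The function name and the toy numbers below are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(n_genes, n_de, set_size, de_in_set):
    """P(X >= de_in_set) when set_size genes are drawn without replacement
    from n_genes genes, of which n_de are called differentially expressed."""
    total = comb(n_genes, set_size)
    upper = min(n_de, set_size)
    return sum(comb(n_de, k) * comb(n_genes - n_de, set_size - k)
               for k in range(de_in_set, upper + 1)) / total

# Toy numbers (not from the paper): 20 genes, 5 called DE,
# and a pathway of 4 genes containing 3 of the DE genes.
p_value = hypergeom_upper_tail(n_genes=20, n_de=5, set_size=4, de_in_set=3)
print(p_value)
```

The test treats every gene as equally likely to be called differentially expressed. If long transcripts are called more often, a pathway made up of long genes accumulates a larger `de_in_set` for reasons unrelated to its biology, which is the bias the figure points to.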

Cited by

Analyzing RNA-Seq data from Chlamydia with super broad transcriptomic activation: challenges, solutions, and implications for other systems.
Wan D, Cheng A, Wang Y, Zhong G, Li WV, Fan H. BMC Genomics. 2024 Aug 25;25(1):801. doi: 10.1186/s12864-024-10714-3. PMID: 39182031.
Comprehensive Metatranscriptomic Analysis of Plant Viruses in Imported Frozen Cherries and Blueberries.
Lee GE, Lee HJ, Jeong RD. Plant Pathol J. 2024 Aug;40(4):377-389. doi: 10.5423/PPJ.OA.06.2024.0088. Epub 2024 Aug 1. PMID: 39117336.
A practical introduction to holo-omics.
Odriozola I, Rasmussen JA, Gilbert MTP, Limborg MT, Alberdi A. Cell Rep Methods. 2024 Jul 15;4(7):100820. doi: 10.1016/j.crmeth.2024.100820. Epub 2024 Jul 9. PMID: 38986611. Review.
Analyzing RNA-Seq Data from Chlamydia with Super Broad Transcriptomic Activation: Challenges, Solutions, and Implications for Other Systems.
Wan D, Cheng A, Wang Y, Zhong G, Li WV, Fan H. bioRxiv [Preprint]. 2024 May 21:2024.05.16.594566. doi: 10.1101/2024.05.16.594566. Update in: BMC Genomics. 2024 Aug 25;25(1):801. doi: 10.1186/s12864-024-10714-3. PMID: 38826265.
Normalization of RNA-Seq data using adaptive trimmed mean with multi-reference.
Singh V, Kirtipal N, Song B, Lee S. Brief Bioinform. 2024 Mar 27;25(3):bbae241. doi: 10.1093/bib/bbae241. PMID: 38770720.

References

1. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476. doi: 10.1038/nature07509.
2. Dunning MJ, Barbosa-Morais NL, Lynch AG, Tavare S, Ritchie ME. Statistical issues in the analysis of Illumina data. BMC Bioinformatics. 2008;9:85. doi: 10.1186/1471-2105-9-85.
3. Wu Z, Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J Comput Biol. 2005;12:882–893. doi: 10.1089/cmb.2005.12.882.
4. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956–960. doi: 10.1126/science.1160342.
5. Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008;5:613–619. doi: 10.1038/nmeth.1223.
