Zhide Fang, Xiangqin Cui, Design and validation issues in RNA-seq experiments, Briefings in Bioinformatics, Volume 12, Issue 3, May 2011, Pages 280–287, https://doi.org/10.1093/bib/bbr004
Abstract
Next-generation sequencing technologies are being rapidly applied in biological research. Tens of millions of short sequences generated in a single experiment provide enormous amounts of information on genome composition, genetic variants, gene expression levels and protein binding sites, depending on the application. Various methods are being developed for analyzing the data generated by these technologies. However, the relevant experimental design issues have rarely been discussed. In this review, we use RNA-seq as an example to bring this topic into focus and to discuss experimental design and validation issues pertaining to next-generation sequencing in the quantification of transcripts.
INTRODUCTION
Next-generation sequencing refers to a group of new sequencing technologies based on random amplification and shotgun sequencing techniques. Short sequences (often around 30 bases) are obtained at extremely high throughput. The total sequence generated in each run can reach hundreds of millions of bases owing to the massively parallel nature of these technologies [1–3]. Since their commercial introduction in 2004, next-generation sequencing technologies have improved dramatically, and their application is growing exponentially [4]. They greatly reduce the cost of sequencing new genomes [5] and of resequencing genomes against a reference for genetic variant discovery [6–8]. They have also been widely used to characterize DNA–protein interactions (ChIP-seq) [9, 10], to survey the transcriptome for expression levels and splicing variants [11–13], and to study epigenomics [14–16], with more precise digital count outcomes.
The applications of next-generation sequencing technologies will grow even more rapidly as the technologies continue to improve and the cost continues to drop. However, amid this rapid growth, the experimental design issues related to next-generation sequencing were rarely discussed until very recently [17]. Here, we review some statistical experimental design principles and provide some thoughts on design and validation. We use RNA-seq as an example to discuss the principles of experimental design and technology-specific issues, and briefly mention a few issues specific to other applications of next-generation sequencing at the end.
Next-generation sequencing technologies have been widely applied to measure gene expression levels and transcriptome composition (termed RNA-seq) [16, 18–21]. The power of RNA-seq in quantifying and annotating transcriptomes is striking. By obtaining tens of millions of short sequence reads from the transcript population of interest and mapping these reads to the reference genome, RNA-seq produces digital signals (counts) and thus leads to highly reproducible results with relatively little technical variation [12, 22, 23]. When enough reads are collected from a sample, RNA-seq has the potential to detect and quantify RNAs from all biologically relevant classes, including those of low and moderate abundance [12, 24].
EXPERIMENTAL DESIGN PRINCIPLES
For any experiment that involves variation, there are well-established experimental design principles for achieving validity and efficiency [25, 26]. These principles were originally established for low-throughput experiments but have been widely adopted for microarray experiments. For detailed discussion, please refer to book chapters dedicated to this topic [27, 28]. Although next-generation sequencing technologies differ from microarray technologies in many respects, most of the experimental design principles still apply. Here, we briefly review the principles and highlight their application in experiments employing next-generation sequencing technologies.
Randomization
Randomization dictates that the experimental subjects be randomly assigned to the treatments or conditions under study in order to guard against unknown factors that could otherwise bias the results [29]. For RNA-seq experiments, besides randomization in preparing the research subjects, there are many other steps where randomization should be considered because of the complexity of the technologies. For example, we can randomize the sample order for the various steps of library construction and the order/location of the samples in the sequencer.
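As a minimal illustration, the following Python sketch randomizes the sample processing order for library construction and for loading on the sequencer. The sample names and group sizes are placeholders, not a recommended design.

```python
# A small sketch of randomizing sample processing order for a hypothetical
# two-group comparison; sample names below are placeholders.
import random

random.seed(2024)
samples = [f"ctrl_{i}" for i in range(1, 5)] + [f"treat_{i}" for i in range(1, 5)]

library_prep_order = random.sample(samples, k=len(samples))  # randomize library construction order
sequencing_order = random.sample(samples, k=len(samples))    # randomize order/position on the sequencer

print("library prep order:", library_prep_order)
print("sequencing order:  ", sequencing_order)
```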
Replication
Replication is essential for estimating and reducing the experimental error, and thus for detecting the biological (treatment) effect more precisely. A true replicate is an independent repetition of the same experimental process and an independent acquisition of the observations [26]. As in expression microarray experiments, there are different levels of replication in RNA-seq experiments. The most desirable replicates are biological replicates, which are true replicates and capture the variation among biological samples [30, 31]. In the RNA-seq literature to date, some studies include biological replicates [13, 18, 32–35], while many others have only technical replicates, that is, repeated measurements from the same biological sample [12, 20, 22, 23]. If the goal is to evaluate the technology, technical replicates alone are sufficient. If the goal is to investigate the biological differences between conditions/tissues/treatments, however, biological replicates are essential. In addition to the intrinsic biological variation in gene expression, the sequencing library construction often includes PCR amplification. PCR amplification artifacts can produce a large number of identical sequences, which can be confounded with gene expression level. If consistent results are obtained from biological replicates, the count of reads from a gene is more likely to reflect its expression level [36]. It is worth pointing out, however, that there may be sequence polymorphisms between biological replicates, which can result in a gain or a loss of reads in different biological replicates depending on how well their sequences match the reference genome [37]. Sequence reads containing polymorphisms relative to the reference are more likely to be discarded during mapping, so an extra apparent biological variation can arise in addition to the gene expression variation. This may be less of an issue for inbred model organisms, such as inbred mice, when comparisons are made within an inbred strain. For human studies, however, it may not be a negligible issue.
Different levels of replication often require different statistical analysis methods. For example, when there are only strict technical replicates consisting of different runs from the same library, the variation comes from the random sampling variance of sequencing, which can be modeled fairly well using a Poisson distribution [20]. However, when biological replicates are used, a Bayesian hierarchical error model or a negative binomial (NB) model is more appropriate for modeling the variation resulting from both the random sampling of sequencing and the natural variation among biological replicates [38, 39].
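The following toy Python simulation illustrates this distinction: counts from technical replicates drawn from a Poisson model have variance close to the mean, whereas counts from biological replicates drawn from an NB model are over-dispersed. The parameter values are arbitrary.

```python
# Illustration of Poisson (technical replicates) versus over-dispersed
# negative binomial (biological replicates) count variation; parameters
# below are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
mu, n_rep = 100, 50

technical = rng.poisson(mu, n_rep)                       # sampling noise only: variance ~ mean
tau = 10                                                 # NB dispersion: variance = mu + mu^2/tau
biological = rng.negative_binomial(tau, tau / (tau + mu), n_rep)

print("technical  mean/var:", technical.mean(), technical.var(ddof=1))
print("biological mean/var:", biological.mean(), biological.var(ddof=1))  # variance well above the mean
```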
RNA-SEQ SPECIFIC EFFECTS AND BLOCKING
As in microarray studies, RNA-seq experiments can be affected by variability from nuisance factors, often called technical effects in the RNA-seq literature. Besides processing date, technician and reagent batch, which are commonly known to investigators, there are some recognized technical effects specific to RNA-seq procedures. One of these comes from the generation of libraries of cDNA fragments, which involves various adaptor ligations and PCR amplifications. Beyond the library preparation effect, there are other technology-specific effects. For example, the commonly used Illumina sequencing technology can sequence eight samples simultaneously in the eight lanes of one flow cell, of which one lane is often used for the control sample. There is therefore variation from one flow cell to another, resulting in a flow cell effect. In addition, there exists variation between individual lanes within a flow cell due to systematic variation in sequencing cycling and/or base calling. Among these sources of variation, the library preparation effect is the largest [40], while the flow cell and lane effects are relatively small [20, 41].
From the experimental design point of view, several steps can be taken to handle these effects properly, beyond improvements to the technology itself. For the library preparation effect, introducing replicates before this step (often biological replicates) provides a way to estimate the effect and to handle it properly in the statistical inference. A blocking design can be used to eliminate the flow cell and lane effects. Blocking is another experimental design principle: comparisons are made within a block, defined by a known but uninteresting factor that causes variation, such as the flow cell. Either the randomized complete block design (RCBD) or the balanced incomplete block design (BIBD) can be used to achieve this goal, depending on the number of treatments/groups to be compared. Sequencing lanes can also serve as blocks when bar-coding during library preparation (for the protocol for the Illumina platform, see http://www.illumina.com/Documents/products/datasheets/datasheet_sequencing_multiplex.pdf) is used for multiplexing [17]. However, multiplexing has been shown to reduce sensitivity and reproducibility in miRNA detection [42]. Therefore, caution is needed when multiplexing is considered for the purpose of reducing flow cell and lane effects.
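As a sketch of how an RCBD might be laid out, the Python snippet below treats each flow cell as a block and randomizes the lane assignment of treatments within it. The number of treatments and flow cells, and the reservation of lane 8 for a control, are illustrative assumptions rather than a recommended protocol.

```python
# Minimal sketch of a randomized complete block design (RCBD): each flow cell
# is a block, and every treatment appears once per block in a randomized lane
# order. Treatment labels and lane counts are illustrative only.
import random

random.seed(7)
treatments = ["trt_A", "trt_B", "trt_C", "trt_D", "trt_E", "trt_F", "trt_G"]
n_flow_cells = 2
lanes_per_cell = 8            # lane 8 reserved here for a standard control sample

for fc in range(1, n_flow_cells + 1):
    order = treatments[:]
    random.shuffle(order)     # randomize lane assignment within the block
    layout = {f"lane{idx}": trt for idx, trt in enumerate(order, start=1)}
    layout[f"lane{lanes_per_cell}"] = "control"
    print(f"flow cell {fc}: {layout}")
```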
SEQUENCING DEPTH
One characteristic unique to RNA-seq and some other applications of next-generation sequencing technologies is sequencing depth, or coverage, which is often estimated as the total number of mapped sequences. Genes are expressed at different levels in each transcriptome. Because of the random sampling nature of RNA-seq, a large number of sequences are needed to measure transcripts that are expressed at low levels. For a given budget, it is critical to decide whether to increase the sequencing depth to obtain more accurate measurements of genes expressed at low levels, or to increase the sample size while limiting the sequencing depth of each sample. Since gene expression roughly follows a power law relating expression level to the number of genes [43, 44], Bashir et al. [24] modeled the sequencing depth for RNA-seq to examine the number of reads needed to reach low expression levels. Based on Marioni’s data, they showed that more than 90% of the transcripts were sampled with one million sequence reads. For experimental design, they recommend a small pilot sequencing run (about one million sequences) to estimate the distribution of all transcripts in the population before deciding the actual sequencing depth for the whole experiment. However, as the authors acknowledged, the sequencing process is not an unbiased random sampling process. There are several recognized biases, discussed later in this review, and the power law or other assumed distributions may not fit genome-wide gene expression well. Their model might therefore be too optimistic, and the total number of reads required to achieve the target coverage could be much higher than estimated. In addition, it is well known that low-count reads show low consistency between technical replicates, and low-count transcripts (fewer than a few reads) are often excluded from comparisons across conditions [40]. Therefore, more than one run of the same sample is sometimes used to increase the sequencing depth. However, it is still not clear what the lower limit of gene expression level is for being functionally relevant. If extremely low expression is stochastic rather than biologically meaningful, sequencing may not need to reach an extreme depth. Thus, given the total cost of the experiment, one needs to consider carefully whether to increase the sequencing depth per sample or to increase the number of biological replicates.
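The depth trade-off can be explored with a simple simulation. The toy Python sketch below (not the model of Bashir et al.) draws transcript abundances from an assumed power-law distribution and counts how many transcripts receive at least one read at different sequencing depths; all parameter values are arbitrary.

```python
# Toy illustration of how the fraction of transcripts detected at least once
# changes with read depth, assuming power-law (Zipf) transcript abundances.
import numpy as np

rng = np.random.default_rng(42)
n_transcripts = 20000
abundance = rng.zipf(a=1.5, size=n_transcripts).astype(float)  # assumed power-law abundances
probs = abundance / abundance.sum()

for depth in (1e5, 1e6, 1e7):
    counts = rng.multinomial(int(depth), probs)                # random sampling of reads
    detected = (counts > 0).mean()
    print(f"{int(depth):>10,} reads: {detected:.1%} of transcripts sampled at least once")
```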
RNA-seq is sometimes used to detect differential expression between the two alleles of the same gene at heterozygous locations in the genome, when the expressed sequences contain sequence polymorphisms [45–47]. Heap et al. [45] ran a simple power analysis based on a chi-square test at each SNP and found that about 50 reads covering a SNP are needed to detect a 2-fold difference between the two alleles with 19% power at a significance level of 0.001. They therefore focused only on SNPs with at least 50 reads for discovering allelic differential expression. It would take extremely deep coverage to detect allelic differential expression for genes expressed at fairly low levels.
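A power calculation of this kind can be reproduced by simulation, as in the Python sketch below, which tests observed allele counts against a 1:1 ratio with a chi-square goodness-of-fit test. The exact test formulation used by Heap et al. may differ, so this is only a rough check.

```python
# Simulated power of a chi-square test to detect a 2:1 allelic ratio with
# 50 reads at alpha = 0.001; a rough, purely illustrative check of the
# figures quoted above.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
n_reads, true_p, alpha, n_sim = 50, 2 / 3, 0.001, 20000

rejections = 0
for _ in range(n_sim):
    ref = rng.binomial(n_reads, true_p)                        # reads carrying one allele
    obs = [ref, n_reads - ref]
    _, pval = chisquare(obs, f_exp=[n_reads / 2, n_reads / 2]) # test against a 1:1 ratio
    rejections += pval < alpha

print(f"simulated power: {rejections / n_sim:.2f}")
```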
PAIRED-END SEQUENCING
The development of paired-end sequencing brought further improvement to next-generation sequencing [48–50]. In RNA-seq experiments, sequencing both ends of the RNA fragments adds information that is especially useful for detecting alternative splicing and chimeric transcripts [51, 52]. At the same sequencing depth, paired-end sequencing increases the sensitivity and specificity of detecting alternative splicing and chimeras compared with single-end sequencing. Paired-end sequencing is therefore a more efficient strategy for characterizing and quantifying the transcriptome.
BIASES OF NEXT-GENERATION SEQUENCING
The short shotgun sequences in RNA-seq experiments are expected to be obtained randomly from the transcripts. The number of reads from a transcript therefore depends on the transcript length [12], which makes comparisons across samples more efficient for longer transcripts than for shorter ones [40, 53]. In reality, however, sequence reads are not obtained from transcripts entirely at random. Biases have been found related to the GC content of the sequence [54], the use of random hexamer primers [55], 3′ and 5′ depletion or bias towards the 3′-end [36], and bias toward specific RNA species [56]. Most of these biases are related to library preparation methods. They directly affect transcription level comparisons across genes, gene end characterization, and alternative splicing characterization. Some attempts to remove these sequence/spatial biases have been explored, both at the analysis level [55] and through protocol improvements [57]. From the experimental design point of view, these biases increase the required sample size and sequencing depth, which emphasizes the importance of choosing better protocols and selecting the right analysis methods.
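A toy calculation makes the length effect concrete: at the same molar expression level, a longer transcript yields proportionally more reads, so its count has a smaller relative sampling error. The Poisson sampling model and the parameter values below are illustrative assumptions.

```python
# Relative sampling error of counts for transcripts of different lengths at
# the same expression level, under a simple Poisson sampling assumption.
import numpy as np

rng = np.random.default_rng(5)
reads_per_kb = 20                                    # assumed expected reads per kb at this expression level
for length_kb in (0.5, 2.0, 8.0):
    counts = rng.poisson(reads_per_kb * length_kb, size=10000)
    cv = counts.std() / counts.mean()                # coefficient of variation = relative sampling error
    print(f"{length_kb:4.1f} kb transcript: mean count {counts.mean():6.1f}, CV {cv:.2f}")
```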
SAMPLE SIZE CALCULATION FOR RNA-SEQ
The complexity of RNA-seq experiments makes it difficult to determine sample sizes. To our knowledge, no literature has yet been published on this topic. The sample size may be determined at two levels: the number of lanes of technical replicates within one treatment, or the number of biological replicates for each treatment.
When there are biological replicates and over-dispersion is present, NB distributions are more appropriate than Poisson distributions for modeling RNA-seq data [38, 39]. If we denote by NB(μ, τ) a negative binomial distribution with mean μ and variance μ + μ²/τ, then we can assume that the counts from the two conditions follow NB(μ, τ) and NB(δμ, τ), respectively, where δ is the tissue/condition ratio difference. Since there is no closed form for the maximum likelihood estimate (MLE) of τ [61], there is no analytical relation between power and sample size. As these authors suggest, we can perform Monte Carlo simulations to obtain the simulated power for a fixed sample size n (assuming n replicates per condition). Specifically, for an effect size δ ≠ 1, a numerical relation between power and the sample size n can be built with the following steps:
Obtain two independent random samples of size n from NB(μ̂, τ̂) and NB(δμ̂, τ̂), respectively, where μ̂ and τ̂ are the corresponding MLEs from the preliminary data;
Repeat the process in (1) w times;
Calculate the simulated power as the proportion of the w repetitions in which the null hypothesis (δ = 1) is rejected by a likelihood ratio test or a Wald-type Z-test (a minimal simulation sketch follows this list).
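The Python sketch below implements this Monte Carlo procedure for a single gene, assuming the NB(μ, τ) parametrization with variance μ + μ²/τ and using a Wald-type Z-test on the log fold change as one simple instance of the tests mentioned in step (3). The pilot estimates (μ̂ = 20, τ̂ = 5) and the 2-fold effect size are placeholders.

```python
# Monte Carlo power simulation for one gene under a negative binomial model,
# using a Wald-type Z-test on the log fold change; all numbers are placeholders.
import numpy as np
from scipy.stats import norm

def nb_sample(rng, mu, tau, n):
    # numpy parametrizes the NB by (size, prob); with size = tau and
    # prob = tau / (tau + mu), the mean is mu and the variance is mu + mu**2/tau
    return rng.negative_binomial(tau, tau / (tau + mu), size=n)

def simulated_power(mu_hat, tau_hat, delta, n, w=2000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    crit = norm.ppf(1 - alpha / 2)                       # two-sided critical value
    rejections = 0
    for _ in range(w):
        x = nb_sample(rng, mu_hat, tau_hat, n)           # condition 1
        y = nb_sample(rng, delta * mu_hat, tau_hat, n)   # condition 2
        mx, my = x.mean() + 0.5, y.mean() + 0.5          # small offset guards against zero means
        # delta-method variance of log(sample mean) under the assumed NB model
        var_log = lambda m: (m + m ** 2 / tau_hat) / (n * m ** 2)
        z = (np.log(my) - np.log(mx)) / np.sqrt(var_log(mx) + var_log(my))
        rejections += abs(z) > crit
    return rejections / w

# hypothetical pilot estimates: mean count 20, dispersion tau = 5, 2-fold change
for n in (3, 5, 8):
    print(n, "replicates per condition -> power", simulated_power(20, 5, 2, n))
```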
The sample size calculations discussed above are for a single gene. They cannot be applied directly to an RNA-seq experiment as a whole, since thousands of genes are measured at once. However, we can handle this as in microarray experiments: first obtain the sample size for one gene and then determine the overall sample size based on the overall average power. Another way to compute power and sample size could be based on specifying the effect size, the number of non-differential genes, and the expected number of false positives, as is done for microarray data [62], although the error models would need to be adjusted. A third potential approach is to model the P-values as a mixture of distributions from genes that are not differentially expressed and genes that are differentially expressed, as in the method behind the PowerAtlas software (http://www.poweratlas.org/) [63].
VALIDATION
Validation has been an important part of the expression microarray literature. The differentially expressed genes (at least some of them) identified using microarrays are often validated using quantitative RT-PCR (qRT-PCR). Although many journals require validation for publication, it is debatable whether qRT-PCR validation of differentially expressed genes is still necessary, given the large number of publications that have already validated the microarray technologies with qRT-PCR, and, if it is, how exactly it should be done [64, 65]. In RNA-seq studies, RT-PCR has also been used for validation. For example, Camarena et al. [66] validated a handful of differentially expressed genes identified using RNA-seq in the pathogen Acinetobacter baumannii and showed that the fold changes estimated from RNA-seq correlated highly with those from qRT-PCR. High consistency between RNA-seq and qRT-PCR results has also been observed in other studies [13, 32, 67]. A detailed RT-PCR analysis by Ramskold et al. [68] showed that UTRs, especially the 3′-UTR, are actually quite variable, and that excluding the UTRs from the RNA-seq data significantly improves the consistency between RNA-seq and RT-PCR results. This finding is consistent with the observation that microarray and qRT-PCR results agree well for genes whose microarray probes and PCR primers interrogate exactly the same transcript, whereas questionable genes show lower consistency [69]. In addition, studies on allelic expression and alternative splicing have also validated their RNA-seq findings using RT-PCR [46, 70]. It is worth pointing out that validation using qRT-PCR on the same RNA samples assayed in the RNA-seq analysis only validates the technology; it does not validate the conclusions about the treatments/conditions. It is validation using different biological replicates from the same populations that can further validate the biological conclusions drawn from RNA-seq experiments [65].
SOME DESIGN ISSUES RELATED TO OTHER APPLICATIONS OF THE NEXT-GENERATION SEQUENCING
Next-generation sequencing has been used to identify genetic variants in populations. For example, the 1000 Genomes Project (http://www.1000genomes.org/page.phpis) is sequencing at least 1000 individuals to identify all the genetic variants in humans. To identify genetic structural variants (such as CNVs and indels) most efficiently using next-generation sequencing technologies, one factor to consider is the insert length of the library. Bashir et al. [24] pointed out that having two libraries with different insert lengths (for example, 200-bp and 20-kb libraries) is beneficial for mapping the breakpoints of structural variants.
ChIP-seq is another common application of next-generation sequencing technologies. The DNA fragments enriched by chromatin immunoprecipitation (ChIP) are compared with the total DNA to identify the binding sites of a given protein in the genome [10, 12, 24, 71, 72]. For some DNA-binding factors, the total ChIP-seq sequence needed exceeds what can be obtained from one sequencing lane. In this case, one needs to decide whether to increase the sequencing depth by sequencing the same sample in multiple lanes (technical replicates) or to sequence different biological samples in different lanes (biological replicates). Tuteja et al. [73] found that using biological replicates increases the number of peaks identified and the enrichment of consensus sequences for the binding factor. One explanation is that results from technical replicates tend to be affected more by PCR artifacts than are independent biological replicates.
KEY POINTS
Biological replicates are important in most RNA-seq experiments.
Sequencing depth and sample size are interrelated; both are often limited by the experimental budget.
Sequencing biases potentially increase the required sample size and sequencing depth.
Validation using independent biological replicates is more meaningful than validation on the same RNA samples.