BMC Bioinformatics. 2006 Oct 23;7:464.
doi: 10.1186/1471-2105-7-464.

A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database

Simon Katz et al. BMC Bioinformatics.

Abstract

Background: Many of the most popular pre-processing methods for Affymetrix expression arrays, such as RMA, gcRMA, and PLIER, simultaneously analyze data across a set of predetermined arrays to improve the precision of the final measures of expression. One problem with these algorithms is that expression measurements for a particular sample depend heavily on the set of samples used for normalization, so results obtained by normalization with a different set may not be comparable. A related problem is that an organization producing and/or storing large amounts of data sequentially must either re-run the pre-processing algorithm every time an array is added or store the arrays in batches that are pre-processed together. Furthermore, pre-processing large numbers of arrays requires loading all the feature-level data into memory, which is difficult even with modern computers. We use a scheme that derives all the information necessary for pre-processing from a very large training set; this information can then be used to summarize samples outside the training set. All subsequent pre-processing tasks can be done on an individual array basis. We demonstrate the utility of this approach by defining a new version of the Robust Multi-chip Averaging (RMA) algorithm, which we refer to as refRMA.
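The idea of freezing pre-processing parameters on a training set and then summarizing each new array on its own can be sketched as follows. This is a minimal illustration, not the authors' refRMA implementation: the abstract does not say which quantities are stored, so the choice of a frozen quantile distribution plus per-probe effects is an assumption, the one-pass median estimate stands in for RMA's median polish, background correction is omitted, and all function names are hypothetical.

```python
import numpy as np

def quantile_map(array, ref_quantiles):
    """Replace each value by the reference quantile of the same rank."""
    ranks = np.argsort(np.argsort(array))
    return ref_quantiles[ranks]

def train_reference(train, probe_set_ids):
    """Derive frozen parameters from a (n_probes, n_arrays) log2 training matrix.

    Assumption: the stored information is a reference quantile distribution
    and one additive effect per probe.
    """
    # Reference distribution: mean of the sorted training columns.
    ref_quantiles = np.sort(train, axis=0).mean(axis=1)
    normalized = np.column_stack(
        [quantile_map(train[:, j], ref_quantiles) for j in range(train.shape[1])]
    )
    probe_effects = np.zeros(train.shape[0])
    for ps in np.unique(probe_set_ids):
        rows = probe_set_ids == ps
        block = normalized[rows]
        # Remove each array's level, then take a robust per-probe offset.
        centered = block - np.median(block, axis=0)
        effects = np.median(centered, axis=1)
        probe_effects[rows] = effects - np.median(effects)
    return ref_quantiles, probe_effects

def summarize_single_array(array, probe_set_ids, ref_quantiles, probe_effects):
    """Summarize one array using only the frozen training parameters."""
    adjusted = quantile_map(array, ref_quantiles) - probe_effects
    return {
        int(ps): float(np.median(adjusted[probe_set_ids == ps]))
        for ps in np.unique(probe_set_ids)
    }
```

Because `summarize_single_array` reads only the stored parameters, adding a new array never requires reloading or re-processing the training arrays, which is the property the Background argues for.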

Results: We assess performance based on multiple sets of samples run on HG U133A Affymetrix GeneChip arrays. We show that the refRMA workflow, when used with a large, biologically diverse training set, yields the same general characteristics as RMA in its classic form with respect to overall data structure, sample-to-sample correlation, and variation. Further, we demonstrate that the refRMA workflow and reference set can be robustly applied to naïve organ types and to benchmark data, where it performs respectably.

Conclusion: Our results indicate that a biologically diverse reference database can be used to train a model for estimating probe set intensities of exclusive test sets, while retaining the overall characteristics of the base algorithm. Although the results we present are specific for RMA, similar versions of other multi-array normalization and summarization schemes can be developed.

Figures

Figure 1
ROC curves for the Classic RMA, Full refRMA, and MAS5 algorithms via affycomp. Each summarization algorithm is shown as false positive probe sets vs. true positive percentage for the affycomp spiked-in HG U133A data set. The spiked-in probe sets used are limited to a) lowly expressed probe sets (≤ 2 pM) and b) moderately expressed probe sets (≥ 4 and ≤ 32 pM) as defined in [9]. The Full refRMA model performs better than either of the MAS5 algorithms, but does not do as well as the Classic RMA model for the spiked-in probe sets. The likely reason for this result is discussed in the text.
Figure 2
Correlations between Classic and refRMA data trained with GEO samples. Samples randomly selected from normal GEO samples of multiple organs were used to train refRMA models at each of the sample sizes (I) indicated, and the models were applied to 50 exclusive test set samples. The correlation for each test sample summarized by refRMA was calculated relative to the same test sample summarized by Classic RMA (i.e., trained on the 50 test set samples) using all probe sets. The mean correlation across the 50 test samples was then calculated. This process was repeated 100 times using random selection of both training and test sets to yield the correlation distributions shown as box plots. The entry shown as "DB" is the Full refRMA model trained on 1,614 samples from Gene Logic's reference database. Note that the GEO-based models, where test set experiments are not completely exclusive of the training set, show slightly higher correlations than the Full refRMA model, which represents complete exclusivity of training and test sets.
Figure 3
MA plots for Classic RMA and refRMA models. Mean probe set intensities as summarized for each of two sets of 15 normal liver samples are shown. The a) Full refRMA model was calculated based on 1,614 biologically diverse samples exclusive of the 30 normal liver test set samples contributing to this plot. The b) Classic RMA model was trained using only the 30 test set samples. Consistency in general data structure is observed regardless of training set. For each probe set, the mean intensity was calculated by averaging individual expression values across the 15 samples in each set.
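The quantities behind an MA plot of this kind can be computed as in the sketch below. It assumes the standard MA convention (M is the difference of the two per-set means and A is their average, both on the log2 scale) and that each input is a probe sets by samples matrix of log2 expression values; the function name `ma_values` is hypothetical.

```python
import numpy as np

def ma_values(set_a, set_b):
    """Return (M, A) per probe set for two (n_probe_sets, n_samples) log2 matrices."""
    # Per-probe-set mean across the samples of each set, as in the caption.
    mean_a = set_a.mean(axis=1)
    mean_b = set_b.mean(axis=1)
    m = mean_a - mean_b          # log2 ratio between the two sets
    a = (mean_a + mean_b) / 2.0  # average log2 intensity
    return m, a
```

Plotting `m` against `a` gives one point per probe set; a cloud centered on M = 0 across the range of A indicates no intensity-dependent bias between the two sets.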
Figure 4
MA plots showing differences between Classic and refRMA models for the same test set. Mean probe set intensities as summarized by the two different RMA models are shown for the same test set of 15 normal liver samples. In each case, the y-axis (i.e., the axis indicating variability) is scaled relative to the biological variability observed in Figure 3 in order to contextualize the effect of model relative to effect of inherent variability contributed by different test set sampling. a) Classic and refRMA models differ somewhat, while b) the Full refRMA model and a refRMA model where liver has not been used in the training set are almost identical, indicating that the Full refRMA model is unaffected by incorporation of other single organs. For each probe set, the mean was calculated by averaging individual expression values across the 15 samples in each set.
Figure 5
Coefficient of Variation and Correlations for summarization methods. The distributions of a) probe set-specific coefficients of variation resulting from log2 MAS5, Classic RMA, and Full refRMA summarization schemes on 30 normal liver samples and b) correlations of all possible pair-wise comparisons across samples using all probe sets on the HG U133A GeneChip. The two small bumps in the Classic RMA correlation plot are discussed in the manuscript body.
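Both panels rest on simple per-matrix statistics, sketched below for a probe sets by samples matrix of summarized expression values. The caption does not state whether the sample or population standard deviation was used, nor which correlation coefficient; the sample standard deviation (`ddof=1`) and Pearson correlation here are assumptions, and the function names are hypothetical.

```python
import numpy as np

def probe_set_cv(expr):
    """Coefficient of variation (sd / mean) for each row of expr."""
    return expr.std(axis=1, ddof=1) / expr.mean(axis=1)

def pairwise_sample_correlations(expr):
    """Pearson correlation for every unordered pair of samples (columns)."""
    corr = np.corrcoef(expr.T)                  # samples x samples matrix
    upper = np.triu_indices(corr.shape[0], k=1) # each pair counted once
    return corr[upper]
```

For 30 samples, `pairwise_sample_correlations` returns 30 × 29 / 2 = 435 values, which is the population a distribution plot of all pair-wise comparisons would draw on.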
Figure 6
CAT plots showing the overlap of top n selected probe sets. Two types of comparisons are made for each of the a) fold change and b) t-test metrics frequently used for selection of regulated probe sets within a simple experimental design such as the one used here (i.e., 15 normal vs. 15 malignant liver samples). For both fold change and t-test metrics, the consistency of regulation events is based on overlapping selection of top n regulated probe sets. First, multiple test sets of 15 normal and 15 malignant samples are used to assess consistency of regulation using the same summarization algorithm over 20 bootstrap iterations with sample replacement. Second, the same test set of 15 normal and 15 malignant samples is used to assess consistency of regulation using different summarization schemes, namely Classic vs. Full refRMA. For each metric, regulation consistency due to summarization scheme is higher than consistency due to different test sets.
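The overlap measure underlying a CAT (correspondence at the top) plot can be sketched as below. The sketch assumes probe sets are ranked by the absolute value of the metric (fold change or t statistic) and that overlap is reported as a proportion of n; the function name `cat_curve` and its parameters are hypothetical.

```python
import numpy as np

def cat_curve(scores_1, scores_2, sizes):
    """Proportion of shared probe sets among the top n of two rankings.

    scores_1, scores_2: one metric value per probe set, in the same order.
    sizes: list sizes n at which the overlap is evaluated.
    """
    # Rank probe sets by absolute metric, largest first.
    order_1 = np.argsort(-np.abs(np.asarray(scores_1))).tolist()
    order_2 = np.argsort(-np.abs(np.asarray(scores_2))).tolist()
    return [len(set(order_1[:n]) & set(order_2[:n])) / n for n in sizes]
```

Passing metrics from two bootstrap test sets gives the test-set comparison, and passing metrics from Classic RMA and Full refRMA on the same samples gives the summarization-scheme comparison described in the caption.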

References

    1. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15. doi: 10.1093/nar/gng015.
    2. Li C, Hung Wong W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2001;2:RESEARCH0032.
    3. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association. 2004;99:909–917. doi: 10.1198/016214504000000683.
    4. Affymetrix. In: Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. Affymetrix I, editor. Santa Clara, CA; 2005.
    5. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005;6:R16. doi: 10.1186/gb-2005-6-2-r16.
