Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Mar 2:14:75.
doi: 10.1186/1471-2105-14-75.

virtualArray: a R/bioconductor package to merge raw data from different microarray platforms

Affiliations

virtualArray: a R/bioconductor package to merge raw data from different microarray platforms

Andreas Heider et al. BMC Bioinformatics. .

Abstract

Background: Microarrays have become a routine tool to address diverse biological questions. Therefore, different types and generations of microarrays have been produced by several manufacturers over time. Likewise, the diversity of raw data deposited in public databases such as NCBI GEO or EBI ArrayExpress has grown enormously.This has resulted in databases currently containing several hundred thousand microarray samples clustered by different species, manufacturers and chip generations. While one of the original goals of these databases was to make the data available to other researchers for independent analysis and, where appropriate, integration with their own data, current software implementations could not provide that feature.Only those data sets generated on the same chip platform can be readily combined and even here there are batch effects to be taken care of. A straightforward approach to deal with multiple chip types and batch effects has been missing.The software presented here was designed to solve both of these problems in a convenient and user friendly way.

Results: The virtualArray software package can combine raw data sets using almost any chip types based on current annotations from NCBI GEO or Bioconductor. After establishing congruent annotations for the raw data, virtualArray can then directly employ one of seven implemented methods to adjust for batch effects in the data resulting from differences between the chip types used. Both steps can be tuned to the preferences of the user. When the run is finished, the whole dataset is presented as a conventional Bioconductor "ExpressionSet" object, which can be used as input to other Bioconductor packages.

Conclusions: Using this software package, researchers can easily integrate their own microarray data with data from public repositories or other sources that are based on different microarray chip types. Using the default approach a robust and up-to-date batch effect correction technique is applied to the data.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Scheme of the steps performed by virtualArray. A total of seven distinct steps is needed to create a virtual array. The first two require user input while all others are performed without user intervention. Please see “Implementation” for detailed descriptions.
Figure 2
Figure 2
Hierarchical clusterings and principle component analyses of ExpressionSets outputted by virtualArray. On the basis of the combined dataset from three different platforms a hierarchical clustering was calculated based on Euclidian distance matrices. Samples from GSE23402 are marked in red, samples from GSE26428 are marked in green and samples from GSE28688 are marked in blue. ESC, human embryonic stem cells; iPSC, human induced pluripotent stem cells. A, clustering of combined data without batch effect removed, B, clustering of combined data with batch effect removed in non-supervised mode; C, clustering of combined data with batch effect removed in supervised mode. The direct analysis of the combined dataset exhibits strong batch effects (A), that can be reduced by the use of EBM (B) in non-supervised mode. The benefit of the supervised mode can be seen in PCA plots (D, E) but not hierarchical clusterings (C). Principle component analyses were performed on the combined batch effect removed dataset. The batch effects were removed in non-supervised (D) and supervised mode (E), respectively.

Similar articles

Cited by

References

    1. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006;7(1):55–65. doi: 10.1038/nrg1749. - DOI - PubMed
    1. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207–210. doi: 10.1093/nar/30.1.207. - DOI - PMC - PubMed
    1. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E. ArrayExpress update-an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011;39:D1002–D1004. doi: 10.1093/nar/gkq1040. - DOI - PMC - PubMed
    1. Cahan P, Rovegno F, Mooney D, Newman JC, St Laurent G, McCaffrey TA. Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene. 2007;401(1-2):12–18. doi: 10.1016/j.gene.2007.06.016. - DOI - PMC - PubMed
    1. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24(9):1151–1161. doi: 10.1038/nbt1239. - DOI - PMC - PubMed

Publication types

LinkOut - more resources