PeerJ. 2020 Sep 22:8:e9954.
doi: 10.7717/peerj.9954. eCollection 2020.

The reuse of public datasets in the life sciences: potential risks and rewards

Katharina Sielemann et al. PeerJ. 2020.

Abstract

The 'big data' revolution has enabled novel types of analyses in the life sciences, facilitated by public sharing and reuse of datasets. Here, we review the prodigious potential of reusing publicly available datasets and the associated challenges, limitations and risks. Possible solutions to issues and research integrity considerations are also discussed. Due to the prominence, abundance and wide distribution of sequencing data, we focus on the reuse of publicly available sequence datasets. We define 'successful reuse' as the use of previously published data to enable novel scientific findings. By using selected examples of successful reuse from different disciplines, we illustrate the enormous potential of the practice, while acknowledging the respective limitations and risks. A checklist to determine the reuse value and potential of a particular dataset is also provided. The open discussion of data reuse and the establishment of this practice as a norm have the potential to benefit all stakeholders in the life sciences.

Keywords: Bioinformatics; Computational biology; Data science; Databases; Genomics; Open science; Reuse; Sequencing data.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. The evolution of data sharing behaviour.
(1) Technical progress makes global sharing of large datasets possible, (2) increased accessibility to required technology makes it widely available, (3) obligations and benefits for researchers establish sharing behaviour, (4) the size of datasets increases and makes them attractive, (5) reuse develops over time, which results in a positive feedback loop and a habit of sharing all data.
Figure 2. Types of reusable data classified into primary and derived/secondary data.
Specific examples for each data type are provided in parentheses. The data classification is based on: Wooley & Lin, 2005. (Sources of the pictures: ‘Protein Data Bank in Europe—Logo’, https://www.ebi.ac.uk/pdbe/about/logo; Pucker, Holtgräwe & Weisshaar, 2017; Schilbert et al., 2018; Mitchell et al., 2019; Frey & Pucker, 2020).
Figure 3. The increasing size of selected databases over time.
The numbers of bases/sequence entries in GenBank, the Sequence Read Archive (SRA) and UniProtKB/TrEMBL are shown. Note the logarithmic scale of the y-axes. The drop in sequence entries in UniProtKB/TrEMBL (in 2015) can be explained by the removal of duplicates.
Figure 4. Advantages and limitations of data reuse.
Figure 5. Summary of outstanding questions and challenges.

Grants and funding

Support for the Article Processing Charge is provided by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University. K.S. is funded by Bielefeld University. A.H. received the 2018 Richard Hardy Award (St. Catharine’s College, University of Cambridge) which partly supported an internship at Bielefeld University, leading to this collaboration. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.