KARAJ: An Efficient Adaptive Multi-Processor Tool to Streamline Genomic and Transcriptomic Sequence Data Acquisition

Int J Mol Sci. 2022 Nov 20;23(22):14418. doi: 10.3390/ijms232214418.

Mahdieh Labani^{1,2}, Amin Beheshti², Nigel H Lovell^{3,4}, Hamid Alinejad-Rokny^{1,5,6}, Ali Afrasiabi^{1,7}

Affiliations

¹ Biomedical Machine Learning Lab, The Graduate School of Biomedical Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
² Data Analytics Lab, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia.
³ The Graduate School of Biomedical Engineering (GSBmE), University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
⁴ Tyree Institute of Health Engineering (IHealthE), University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
⁵ UNSW Data Science Hub, University of New South Wales (UNSW), Sydney, NSW 2052, Australia.
⁶ Health Data Analytics Program, Centre for Applied Artificial Intelligence, Macquarie University, Sydney, NSW 2109, Australia.
⁷ Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, University of Sydney, Sydney, NSW 2006, Australia.

PMID: 36430895
PMCID: PMC9694301
DOI: 10.3390/ijms232214418

Mahdieh Labani et al. Int J Mol Sci. 2022.

Abstract

Here we developed KARAJ, a fast and flexible Linux command-line tool that automates the end-to-end process of querying and downloading a wide range of genomic and transcriptomic sequence data types. The input to KARAJ is a list of PMCIDs, publication URLs, or various types of accession numbers, from which it automates four tasks: firstly, it provides a summary list of accessible datasets generated by or used in these scientific articles, enabling users to select appropriate datasets; secondly, it calculates the size of the files that users want to download and confirms that adequate space is available on the local disk; thirdly, it generates a metadata table containing sample information and the experimental design of the corresponding study; and lastly, it enables users to download supplementary data tables attached to publications. Further, KARAJ provides a parallel downloading framework powered by Aspera Connect, which significantly reduces download time.

Keywords: Bioinformatics; Download; FASTQ; Genomics; Linux; biological data; sequence data; transcriptomics.
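The second task in the abstract (totalling the size of the requested files and confirming that the local disk can hold them) can be illustrated with a minimal Python sketch. This is not KARAJ's code: the function names `required_bytes`, `fits_on_disk`, and `check_download`, as well as the 5% reserve, are assumptions made purely for illustration.

```python
import shutil


def required_bytes(file_sizes):
    """Total size, in bytes, of the files selected for download."""
    return sum(file_sizes)


def fits_on_disk(required, free, reserve_fraction=0.05):
    """Return True if `required` bytes fit in `free` bytes.

    A fraction of the free space is held back as headroom; the 5%
    default is an illustrative assumption, not a KARAJ setting.
    """
    return required <= free * (1.0 - reserve_fraction)


def check_download(file_sizes, target_dir="."):
    """Compare the total download size with free space in target_dir.

    Returns a tuple (ok, required, free), all sizes in bytes.
    """
    required = required_bytes(file_sizes)
    free = shutil.disk_usage(target_dir).free
    return fits_on_disk(required, free), required, free


if __name__ == "__main__":
    # Hypothetical file sizes (bytes), e.g. as reported by remote file headers.
    ok, required, free = check_download([2_500_000_000, 1_200_000_000])
    print(f"required={required} free={free} ok={ok}")
```

Separating the pure comparison (`fits_on_disk`) from the filesystem query keeps the decision logic easy to test without touching the disk.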

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
The architecture of *KARAJ*. Input and output file formats are shown in green and red boxes, respectively. The blue box represents the processing steps performed by *KARAJ*. The input to *KARAJ* is a list of either PubMed Central PMCIDs or article URLs. *KARAJ* then mines the text of the corresponding articles for accession numbers (Extracted list) and generates a summary report for these accession numbers, including the number of samples, description, experimental design, and sequencing technology. This summary report lets the user choose the accession numbers of interest (Selected list). *KARAJ* fetches the headers for the data linked to these accession numbers, calculates the total size of the data, and checks the local drive to ensure that adequate space is available. When adequate local storage space exists, *KARAJ* downloads all files using a parallel framework powered by the *Aspera* protocol. *KARAJ* also accepts a list of accession numbers as input to retrieve sequence data. Image created with BioRender.com under agreement number *NX24GYLITA*.
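The text-mining step in the caption (scanning article text for accession numbers to build the Extracted list) might be sketched roughly as follows. The record does not state which accession types or patterns KARAJ recognises, so the regular expressions below cover only a few widely used formats (GEO series, BioProject, and SRA/ENA/DDBJ study and run identifiers) and are an assumption; `ACCESSION_PATTERNS` and `extract_accessions` are hypothetical names.

```python
import re

# Illustrative patterns for a few common public accession formats.
# These are assumptions, not the patterns KARAJ itself uses.
ACCESSION_PATTERNS = {
    "GEO series": r"\bGSE\d+\b",
    "BioProject": r"\bPRJ(?:NA|EB|DB)\d+\b",
    "SRA/ENA/DDBJ study": r"\b[SED]RP\d+\b",
    "SRA/ENA/DDBJ run": r"\b[SED]RR\d+\b",
}


def extract_accessions(text):
    """Return {accession type: sorted unique matches} found in text."""
    found = {}
    for label, pattern in ACCESSION_PATTERNS.items():
        matches = sorted(set(re.findall(pattern, text)))
        if matches:
            found[label] = matches
    return found


if __name__ == "__main__":
    # Made-up sentence with made-up identifiers, for demonstration only.
    sample = (
        "Data are available under GSE12345 and PRJNA123456; "
        "runs SRR000001, SRR000001."
    )
    for label, ids in extract_accessions(sample).items():
        print(label, ids)
```

Deduplicating with `set` means an accession mentioned several times in an article appears once in the extracted list.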


Cited by

A Comprehensive Investigation of Genomic Variants in Prostate Cancer Reveals 30 Putative Regulatory Variants.
Labani M, Beheshti A, Argha A, Alinejad-Rokny H. Int J Mol Sci. 2023 Jan 27;24(3):2472. doi: 10.3390/ijms24032472. PMID: 36768794. Free PMC article.

Grants and funding

p3432/UNSW Sydney
