Int J Mol Sci. 2022 Nov 20;23(22):14418.
doi: 10.3390/ijms232214418.

KARAJ: An Efficient Adaptive Multi-Processor Tool to Streamline Genomic and Transcriptomic Sequence Data Acquisition

Mahdieh Labani et al. Int J Mol Sci.

Abstract

Here we developed KARAJ, a fast and flexible Linux command-line tool that automates the end-to-end process of querying and downloading a wide range of genomic and transcriptomic sequence data types. KARAJ takes as input a list of PMCIDs, publication URLs, or various types of accession numbers and automates four tasks: first, it provides a summary list of accessible datasets generated by or used in these scientific articles, enabling users to select appropriate datasets; second, it calculates the size of the files that users want to download and confirms that adequate space is available on the local disk; third, it generates a metadata table containing sample information and the experimental design of the corresponding study; and fourth, it enables users to download supplementary data tables attached to publications. In addition, KARAJ provides a parallel downloading framework powered by Aspera Connect, which reduces download time significantly.
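
The disk-space check described above can be illustrated with a minimal Python sketch. It is not KARAJ's implementation: the function name `has_adequate_space`, the `safety_margin` parameter, and the example file sizes are assumptions made purely for illustration.

```python
import shutil


def has_adequate_space(file_sizes_bytes, target_dir=".", safety_margin=0.05):
    """Return True if the files fit in target_dir with some headroom left.

    file_sizes_bytes: iterable of file sizes in bytes (hypothetical input; in
        practice these would be read from the remote file headers).
    safety_margin: extra fraction of the required size kept free (assumption).
    """
    required = sum(file_sizes_bytes)
    # shutil.disk_usage reports total, used, and free bytes for the filesystem
    # that contains target_dir.
    free = shutil.disk_usage(target_dir).free
    return required * (1 + safety_margin) <= free


if __name__ == "__main__":
    sizes = [2_500_000_000, 3_100_000_000]  # made-up sizes for two FASTQ files
    if has_adequate_space(sizes):
        print("Enough space: the download can start.")
    else:
        print("Not enough space: free some disk space or choose fewer datasets.")
```

Checking the total before any transfer starts avoids a partial download that fails midway once the disk fills up.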

Keywords: Bioinformatics; Download; FASTQ; Genomics; Linux; biological data; sequence data; transcriptomics.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The architecture of KARAJ. Input and output file formats are shown by green and red boxes, respectively. The blue box represents the processing steps provided by KARAJ. The input to KARAJ is a list of either PubMed Central PMCIDs or article URLs. KARAJ then mines the text of the corresponding articles for accession numbers (Extracted list). Next, KARAJ generates a summary report of these accession numbers that includes the number of samples, description, experimental design, and sequencing technology. This summary report lets the user choose the accession numbers of interest (Selected list). KARAJ fetches the header for the data linked to these accession numbers, calculates the size of the data, and checks the local drive to ensure that adequate space is available. When adequate local storage space exists, KARAJ downloads all files using a parallel framework powered by the Aspera protocol. KARAJ also accepts a list of accession numbers as input to retrieve sequence data. Image created with BioRender.com under agreement number NX24GYLITA.
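
The text-mining step in the caption (article text in, extracted list of accession numbers out) can be sketched with regular expressions. This is an illustration under stated assumptions, not KARAJ's code: the source does not say which accession types or patterns KARAJ recognizes, so the three patterns, the function name `extract_accessions`, and the sample sentence with made-up accession numbers are all hypothetical.

```python
import re

# Illustrative patterns only (assumption): common public-repository formats.
ACCESSION_PATTERNS = {
    "GEO series": r"\bGSE\d+\b",
    "BioProject": r"\bPRJ(?:NA|EB|DB)\d+\b",
    "Sequencing run": r"\b[SED]RR\d+\b",
}


def extract_accessions(text):
    """Return a dict mapping each accession type to its sorted, unique matches."""
    found = {}
    for label, pattern in ACCESSION_PATTERNS.items():
        matches = sorted(set(re.findall(pattern, text)))
        if matches:
            found[label] = matches
    return found


if __name__ == "__main__":
    # Made-up sentence and accession numbers, for demonstration only.
    sample = (
        "Raw reads were deposited under GSE12345 and PRJNA000001; "
        "runs SRR0000001 and SRR0000001 were reanalysed."
    )
    for label, accessions in extract_accessions(sample).items():
        print(f"{label}: {', '.join(accessions)}")
```

Deduplicating the matches keeps the extracted list short when an article repeats the same accession number in several sections.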
