PLoS One. 2014 Feb 25;9(2):e89445.
doi: 10.1371/journal.pone.0089445. eCollection 2014.

RNA CoMPASS: a dual approach for pathogen and host transcriptome analysis of RNA-seq datasets

Guorong Xu et al. PLoS One. 2014.

Abstract

High-throughput RNA sequencing (RNA-seq) has become an instrumental assay for the analysis of multiple aspects of an organism's transcriptome. Further, the analysis of a biological specimen's associated microbiome can also be performed using RNA-seq data and this application is gaining interest in the scientific community. There are many existing bioinformatics tools designed for analysis and visualization of transcriptome data. Despite the availability of an array of next generation sequencing (NGS) analysis tools, the analysis of RNA-seq data sets poses a challenge for many biomedical researchers who are not familiar with command-line tools. Here we present RNA CoMPASS, a comprehensive RNA-seq analysis pipeline for the simultaneous analysis of transcriptomes and metatranscriptomes from diverse biological specimens. RNA CoMPASS leverages existing tools and parallel computing technology to facilitate the analysis of even very large datasets. RNA CoMPASS has a web-based graphical user interface with intrinsic queuing to control a distributed computational pipeline. RNA CoMPASS was evaluated by analyzing RNA-seq data sets from 45 B-cell samples. Twenty-two of these samples were derived from lymphoblastoid cell lines (LCLs) generated by the infection of naïve B-cells with the Epstein-Barr virus (EBV), while another 23 samples were derived from Burkitt's lymphomas (BL), some of which arose in part through infection with EBV. Appropriately, RNA CoMPASS identified EBV in all LCLs and in a fraction of the BLs. Cluster analysis of the human transcriptome component of the RNA CoMPASS output clearly separated the BLs (which have a germinal center-like phenotype) from the LCLs (which have a blast-like phenotype) with evidence of activated MYC signaling and lower interferon and NF-kB signaling in the BLs. Together, this analysis illustrates the utility of RNA CoMPASS in the simultaneous analysis of transcriptome and metatranscriptome data.
RNA CoMPASS is freely available at http://rnacompass.sourceforge.net/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic of RNA CoMPASS (RNA comprehensive multi-processor analysis system for sequencing) architecture.
RNA CoMPASS is a graphical user interface (GUI) based parallel computation pipeline for the analysis of both exogenous and human sequences from RNA-seq data. It employs one commercial program and several open-source programs to analyze RNA-seq data sets, including Novoalign, SAMMate, BLAST, and MEGAN. Each step subtracts reads so that the remaining unmapped reads can be analyzed further for pathogen discovery. The mapped reads are analyzed separately. The end result of this pipeline is pathogen discovery and host transcriptome analysis.
Figure 2. Performance Analysis of RNA CoMPASS.
RNA CoMPASS was deployed on a local cluster and benchmarking was performed. An Akata RNA-seq data set was split into six files of varying sizes: (1) 393.4 MB, 1,397,139 reads; (2) 757 MB, 2,685,149 reads; (3) 1.44 GB, 5,120,805 reads; (4) 2.72 GB, 9,651,466 reads; (5) 5.01 GB, 25,465,406 reads; (6) 8.99 GB, 50,930,812 reads. Overall time was calculated for each file on a single machine (blue column) and on the local 4-node cluster (red column). Speedup is represented as a green line.
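The speedup plotted in this figure is presumably the usual ratio of single-machine run time to cluster run time. A minimal Python sketch of that calculation follows. The function names are hypothetical, and the caption does not list the measured run times, so no benchmark values are reproduced here.

```python
def speedup(t_single: float, t_cluster: float) -> float:
    """Return the ratio of single-machine run time to cluster run time."""
    if t_single <= 0 or t_cluster <= 0:
        raise ValueError("run times must be positive")
    return t_single / t_cluster


def parallel_efficiency(t_single: float, t_cluster: float, nodes: int) -> float:
    """Return speedup divided by node count (1.0 would be ideal linear scaling)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup(t_single, t_cluster) / nodes
```

As an illustration with made-up numbers, a file that took 120 minutes on one machine and 40 minutes on a 4-node cluster would have a speedup of 3.0 and a parallel efficiency of 0.75.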
Figure 3. Detection of EBV in Human B-Cells using RNA CoMPASS.
All 45 single-end RNA-seq data sets (22 lymphoblastoid cell lines, 23 Burkitt's lymphomas) were analyzed using RNA CoMPASS. (A) The virome branches of the taxonomy trees for two representative LCLs and Burkitt's lymphomas were generated using the metagenome analysis tool MEGAN 4. (B) EBV reads were quantified in all 45 RNA-seq data sets and are reported as EBV reads per 5,000,000 total sequence reads.
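The normalization in panel (B) can be read as scaling each sample's EBV read count to a common library size of 5,000,000 reads. The sketch below shows that arithmetic in Python. The function name is hypothetical, and the exact counting rules used by the authors (for example, which reads count toward the total) are not stated in the caption.

```python
def reads_per_scaled_library(target_reads: int, total_reads: int,
                             scale: int = 5_000_000) -> float:
    """Scale a read count to a fixed library size (default: 5,000,000 reads)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if target_reads < 0 or target_reads > total_reads:
        raise ValueError("target_reads must be between 0 and total_reads")
    # Multiply before dividing to limit floating-point rounding.
    return target_reads * scale / total_reads
```

With made-up numbers, a sample with 100 EBV reads among 10,000,000 total reads would be reported as 50 EBV reads per 5,000,000 reads, which makes samples of different sequencing depth comparable.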
Figure 4. Circos plot of two EBV samples shows distinct gene expression.
An annotated Circos plot depicts the EBV read coverage across the EBV genome for two samples. The graph displays the number of reads mapped to each nucleotide position of the genome, depicted on a log scale. Blue features represent lytic genes, red features represent latency genes, green features represent potential non-coding genes, and black features represent non-gene features (e.g. repeat regions and origins of replication).
Figure 5. Heat Map representing Human B-Cells analyzed using RNA CoMPASS.
Human transcript counts from the 45 B-cell samples were imported into the R software environment and analyzed using the edgeR package. Genes with low transcript counts (less than 1 CPM (counts per million)) in the majority of samples were filtered out. The Manhattan (L-1) distance matrix for the samples was computed using the remaining transcript counts, and this was taken as input for hierarchical clustering using the Ward algorithm. After assigning each sample to one of two groups identified by hierarchical clustering (Human B-Cell or Burkitt's Lymphoma), the glmFit function was used to fit the mean log(CPM) for each group, and likelihood ratio tests were used to identify those genes that were differentially expressed, with adjusted P < 0.05 following the Benjamini-Hochberg correction for multiple testing. The fitted log(CPM) values for the subset of genes that were differentially expressed in the LCL samples relative to the Burkitt's lymphoma samples were then clustered using the Euclidean distance and complete linkage algorithm to detect groups of co-expressed genes. The expression heat map displays the top 250 differentially expressed genes.
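The first two steps of this analysis, CPM filtering and the Manhattan distance matrix, can be sketched in plain Python. This is an illustration only: the authors worked in R with edgeR, all function names below are hypothetical, and the reading of "majority" as strictly more than half of the samples is an assumption. The clustering and differential expression steps are not reproduced.

```python
from typing import List


def cpm(counts: List[List[int]]) -> List[List[float]]:
    """Convert a genes-by-samples count matrix to counts per million."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[s] for row in counts) for s in range(n_samples)]
    if any(size == 0 for size in lib_sizes):
        raise ValueError("every sample needs at least one read")
    return [[row[s] * 1_000_000 / lib_sizes[s] for s in range(n_samples)]
            for row in counts]


def keep_expressed_genes(counts: List[List[int]], min_cpm: float = 1.0) -> List[int]:
    """Return indices of genes that are NOT below min_cpm in a majority of samples.

    Assumption: "majority" means strictly more than half of the samples.
    """
    n_samples = len(counts[0])
    kept = []
    for gene, row in enumerate(cpm(counts)):
        n_low = sum(1 for value in row if value < min_cpm)
        if n_low * 2 <= n_samples:
            kept.append(gene)
    return kept


def manhattan_matrix(counts: List[List[int]], genes: List[int]) -> List[List[int]]:
    """Return the sample-by-sample Manhattan (L1) distance over the given genes."""
    n_samples = len(counts[0])
    dist = [[0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            d = sum(abs(counts[g][i] - counts[g][j]) for g in genes)
            dist[i][j] = dist[j][i] = d
    return dist
```

The resulting distance matrix is the kind of input a hierarchical clustering routine with Ward linkage would take. Whether the distances were computed on raw counts or on transformed values is not specified in the caption, so the sketch uses the counts as given.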
