Review
Biol Methods Protoc. 2023 Nov 1;8(1):bpad028.
doi: 10.1093/biomethods/bpad028. eCollection 2023.

Protocol for transcriptome assembly by the TransBorrow algorithm

Dengyi Zhao et al. Biol Methods Protoc.

Abstract

High-throughput RNA-seq enables comprehensive analysis of the transcriptome for various purposes. However, the technology typically generates massive numbers of short sequencing reads. Consequently, fast, accurate, and flexible tools are needed to assemble raw RNA-seq data into full-length transcripts and quantify their expression levels. In this protocol, we report TransBorrow, a novel transcriptome assembly tool designed specifically for short RNA-seq reads. TransBorrow is used in conjunction with a splice-aware aligner (e.g. HISAT2 or STAR) and other transcriptome assemblers (e.g. StringTie, Cufflinks, and Scallop). The protocol covers all necessary steps, from downloading and processing raw sequencing data to assembling full-length transcripts and quantifying their abundances. Execution time varies with the size of the dataset and the computational platform.

Keywords: RNA-seq data; splice variants; transcriptome assembly.

Figures

Figure 1.
Workflow for transcriptome assembly from RNA-seq data with TransBorrow.
Figure 2.
The help information of TransBorrow.
Figure 3.
A snapshot of 20 lines of the SRR7807492_1.fastq file, which stores the sequencing reads. Every four lines describe one sequencing read. The first line begins with a “@” character followed by a sequence identifier and an optional description. The second line contains the raw sequence letters. The third line begins with a “+” character, optionally followed by the same sequence identifier. The fourth line records the sequencing quality values for the sequence.
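The four-line record layout described in this caption can be read with a few lines of Python. The following is a minimal sketch, not part of the published protocol: the names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical, the example read is made up, and the quality decoding assumes the common Phred+33 encoding, which the caption does not state.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading "@", including any description
    sequence: str    # raw sequence letters
    quality: str     # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines of a FASTQ file."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.rstrip("\r\n")
        if not header:  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters, ASSUMING Phred+33 encoding."""
    return [ord(char) - offset for char in quality]


# Made-up record for illustration; not taken from SRR7807492_1.fastq.
example = io.StringIO("@read1 example\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)
```

Reading from a file handle rather than loading the whole file keeps memory use flat, which matters for FASTQ files holding millions of reads.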
Figure 4.
A snapshot of the first several lines of the SRR7807492_genome.sam file, which stores the alignments of the RNA-seq reads to the reference genome. Lines starting with “@” are the headers of the file. Each line after the headers records the alignment information of one read (e.g. query name, flag, reference sequence name, start position, CIGAR string, mate reference sequence name, mate start position, insert size, base sequence of the read, etc.).
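To make the column layout concrete, here is a minimal sketch that splits one SAM line into the 11 mandatory fields defined by the SAM specification (the caption names most of them; mapping quality and the quality string fall under its "etc."). `SamAlignment` and `parse_sam_line` are hypothetical names, and the example lines are made up rather than copied from SRR7807492_genome.sam.

```python
from typing import NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost start position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # mate reference sequence name ("=" means same as rname)
    pnext: int   # mate start position
    tlen: int    # observed template length (insert size)
    seq: str     # base sequence of the read
    qual: str    # per-base quality string


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; return None for header lines starting with '@'."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


# Made-up lines for illustration only.
header = parse_sam_line("@SQ\tSN:chr1\tLN:248956422")
alignment = parse_sam_line(
    "read1\t99\tchr1\t14362\t60\t5M\t=\t14500\t143\tACGTA\tIIIII"
)
is_unmapped = bool(alignment.flag & 0x4)  # flag bit 0x4 marks an unmapped read
```

Optional tags after the eleventh column are ignored here; a production pipeline would normally use an established SAM/BAM library instead of hand parsing.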
Figure 5.
A snapshot of several lines of the SRR7807492_TransBorrow.gtf file, which stores the assembled transcripts. The file includes the name of the chromosome, the source of the annotation (TransBorrow), the type of the annotation (“transcript” or “exon”), the start and end positions of the annotation, the score or confidence level of the annotation (usually 1000), the strand of the annotation (with “+” representing the positive strand, “-” representing the negative strand, and “.” indicating no strand information), gene_id (the index of a gene), transcript_id (the index of a transcript), exon_number (the index of exons), cov (estimated coverage), and transcripts per million (TPM).
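The fields listed in this caption map onto the nine tab-separated columns of the standard GTF layout, with gene_id, transcript_id, cov, and TPM stored as key/value pairs in the last column. The sketch below assumes that layout and quoted attribute values; `parse_gtf_line` is a hypothetical helper, and the example line is made up, so the exact attribute spelling in real TransBorrow output may differ.

```python
import re
from typing import Any, Dict

# Matches attribute pairs of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line: str) -> Dict[str, Any]:
    """Split one GTF line into its nine columns and an attribute dict."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line needs exactly 9 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname,   # chromosome name
        "source": source,     # program that produced the annotation
        "feature": feature,   # "transcript" or "exon"
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,     # "+", "-", or "."
        "frame": frame,
        "attributes": dict(ATTRIBUTE_PATTERN.findall(attrs)),
    }


# Made-up line for illustration; not copied from SRR7807492_TransBorrow.gtf.
example_line = (
    "chr1\tTransBorrow\ttranscript\t11869\t14409\t1000\t+\t.\t"
    'gene_id "gene.1"; transcript_id "gene.1.1"; cov "12.5"; TPM "3.2";'
)
record = parse_gtf_line(example_line)
tpm = float(record["attributes"]["TPM"])
```

Unquoted attribute values would not match the pattern above, so this sketch should be adapted if the real file uses them.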
Figure 6.
A snapshot of the SRR7807492_TransBorrow.stats file, which stores the evaluation results for the TransBorrow assembly of the SRR7807492 dataset. According to the output, the number of candidate transcripts was 44,300 and the number of transcripts that correctly match known annotated transcripts was 18,370, with a sensitivity (recall) of 20.4% at the locus level and a precision of 41.5% at the transcript level.
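The transcript-level precision quoted in this caption follows directly from the two counts it reports: correctly matched transcripts divided by candidate transcripts. A short check, with hypothetical variable names:

```python
# Counts reported in the Figure 6 caption for SRR7807492.
candidate_transcripts = 44_300
matched_transcripts = 18_370

# Precision = correctly assembled transcripts / all assembled transcripts.
precision = matched_transcripts / candidate_transcripts
precision_percent = round(100 * precision, 1)

print(f"transcript-level precision: {precision_percent}%")
```

The locus-level sensitivity cannot be recomputed the same way, because the caption does not give the number of reference loci.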
Figure 7.
Performance comparisons of the assemblers on the SRR7807492 dataset (paired-end and nonstranded). (A) The number of correctly assembled transcripts by the assemblers. (B) Assembly accuracy of the assemblers in terms of precision and recall. (C) F-scores of the assemblers.
Figure 8.
Performance comparisons of the assemblers on the ERR3639851 dataset (single-end and nonstranded). (A) The number of correctly assembled transcripts by the assemblers. (B) Assembly accuracy of the assemblers in terms of precision and recall. (C) F-scores of the assemblers.
Figure 9.
Performance comparisons of the assemblers on the SRR10611964 dataset (paired-end and strand-specific). (A) The number of correctly assembled transcripts by the assemblers. (B) Assembly accuracy of the assemblers in terms of precision and recall. (C) F-scores of the assemblers.
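Panels (B) and (C) of Figures 7 to 9 report precision, recall, and F-scores. The captions do not define the F-score, so the sketch below assumes the usual balanced F-score (F1), the harmonic mean of precision and recall. `f_score` is a hypothetical helper, and the input values are illustrative only, not results from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    ASSUMPTION: the figures use this standard definition.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative values only; not taken from Figures 7 to 9.
example_f = f_score(precision=0.5, recall=0.25)
```

Because the harmonic mean is pulled toward the smaller value, an assembler cannot reach a high F-score by maximizing precision or recall alone; both must be measured at the same level (for example, transcript level) for the score to be meaningful.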
