2.6 years ago
Elizabeth Alice
I am trying to analyze some data for aberrant splicing following a previously published protocol. The authors describe building a custom GTF file from both the downloadable UCSC GTF and their own sequencing data, for use with rMATS. I have aligned my reads to GRCh38 with STAR and generated GTF files from the BAM files with StringTie, but I am unsure how to combine the Ensembl GTF with my StringTie GTFs for use with rMATS. I would greatly appreciate any advice. The StringTie output looks like this:
seqname source feature start end score strand frame attributes
chrX StringTie transcript 281394 303355 1000 + . gene_id "ERR188044.1"; transcript_id "ERR188044.1.1"; reference_id "NM_018390"; ref_gene_id "NM_018390"; ref_gene_name "PLCXD1"; cov "101.256691"; FPKM "530.078918"; TPM "705.667908";
chrX StringTie exon 281394 281684 1000 + . gene_id "ERR188044.1"; transcript_id "ERR188044.1.1"; exon_number "1"; reference_id "NM_018390"; ref_gene_id "NM_018390"; ref_gene_name "PLCXD1"; cov "116.270836";
You may want to include more information on what exactly that "previously published protocol" does, especially with respect to your GTF, as well as what you did, which GTF you downloaded, and so on. With the information you have provided, I, at least, would have to do a lot of guesswork.
A classic problem with GTFs/GFFs is that the sequence identifiers in column 1 don't match the expected format, for example UCSC versus NCBI identifiers for the human genome. If that's your issue, vkkodali_ncbi's cthreepo might help you.
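To check whether this is the issue before converting anything, you could compare the column 1 names in the two files. This is only a minimal sketch: the function names are made up for illustration, and it assumes standard tab-separated GTF with `#` comment lines.

```python
import io


def gtf_seqnames(handle):
    """Collect the distinct column-1 sequence names from an open GTF file."""
    names = set()
    for line in handle:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(ref_handle, asm_handle):
    """Report which sequence names are shared or unique to each GTF."""
    ref = gtf_seqnames(ref_handle)
    asm = gtf_seqnames(asm_handle)
    return {
        "shared": ref & asm,
        "only_reference": ref - asm,
        "only_assembly": asm - ref,
    }


if __name__ == "__main__":
    # Toy in-memory records; with real data, pass open("your_file.gtf") instead.
    ref = io.StringIO('X\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n')
    asm = io.StringIO('chrX\tStringTie\texon\t100\t200\t1000\t+\t.\tgene_id "s1";\n')
    print(compare_seqnames(ref, asm))
```

If "shared" comes back empty while both other sets are populated, the two files most likely use different naming schemes (for example `X` versus `chrX`).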
From Dolatshad et al., Leukemia (2016): "reads were aligned using STAR [29] against the human genome assembly (NCBI build37 (hg19) UCSC transcripts). Non-uniquely mapped reads and reads that were identified as PCR duplicates using Samtools [30] were discarded. The aligned reads were reconstructed into transcripts using Cufflinks [31] and were then merged into a single assembly, along with known isoforms from the NCBI build37 (hg19) UCSC transcripts. This reference-guided assembly was then used as the transcripts annotation by rMATS."
The GTF I downloaded is hsGRCH38_HapScaf.gtf.
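The merge step in the quoted protocol was done with Cufflinks. Since the assemblies here come from StringTie, one analogous option (a substitution, not what the paper did) is StringTie's merge mode, which combines per-sample GTFs with a reference annotation given via `-G`. The sketch below only builds and runs that command; the helper names and all file names are hypothetical, and it assumes `stringtie` is on the PATH and that the reference and sample GTFs use the same sequence names.

```python
import shutil
import subprocess


def build_merge_command(reference_gtf, sample_gtfs, out_gtf="merged.gtf"):
    """Build the argument list for `stringtie --merge` (does not run it)."""
    if not sample_gtfs:
        raise ValueError("at least one assembled GTF is required")
    return ["stringtie", "--merge", "-G", reference_gtf, "-o", out_gtf, *sample_gtfs]


def run_merge(reference_gtf, sample_gtfs, out_gtf="merged.gtf"):
    """Run the merge and return the path of the merged GTF."""
    if shutil.which("stringtie") is None:
        raise RuntimeError("stringtie was not found on PATH")
    subprocess.run(build_merge_command(reference_gtf, sample_gtfs, out_gtf), check=True)
    return out_gtf


if __name__ == "__main__":
    # Hypothetical file names; replace with your own reference and sample GTFs.
    cmd = build_merge_command("reference.gtf", ["sample1.gtf", "sample2.gtf"])
    print(" ".join(cmd))
```

The merged GTF would then take the place of the "reference-guided assembly" in the quoted protocol, that is, the annotation handed to rMATS.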