Hi,
I was trying to use IsoformSwitchAnalyzeR (v2.2.0) for differential transcript expression. I used kallisto to create the abundance (h5 & tsv) files. However, I get an error when I try to use importRdata(). I used the same fa file in importRdata(), as the once I used to build the kallisto index. Here is my code and the error:
> kal.quant <- importIsoformExpression(kallisto.dir,addIsofomIdAsColumn=T)
> names(kal.quant)
[1] "abundance" "counts" "infReps" "length" "importOptions"
> aSwitchList <- importRdata(
+ isoformCountMatrix = kal.quant$counts,
+ isoformRepExpression = kal.quant$abundance,
+ ignoreAfterBar = TRUE,
+ designMatrix = mdata.iso,
+ isoformExonAnnoation = paste0(ann.dir, 'gencode.v44.chr_patch_hapl_scaff.annotation.gtf.gz'),
+ isoformNtFasta = paste0(ann.dir,'gencode.v44.pc_transcripts.fa.gz')
+ )
Step 1 of 10: Checking data...
Step 2 of 10: Obtaining annotation...
importing GTF (this may take a while)...
Error in importRdata(isoformCountMatrix = kal.quant$counts, isoformRepExpression = kal.quant$abundance, :
The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
Either isforoms found in the annotation are not quantifed or vise versa.
Specifically:
110962 isoforms were quantified.
170712 isoforms are annotated.
Only 110962 overlap.
0 isoforms quantifed had no corresponding annoation
This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.
If there is no overlap (as in zero or close) there are two options:
.
.
In previous posts on similar errors, it was mentioned that this error sometimes comes up because the fa and gtf files are not fully compatable(?), but not sure if it's the case here.
I have also tried to remove the duplicate annotations in the h5 abundance files manually (thanks sudarshan), but his doesn't seem to effect the error I get. I used:
files <- list.files(kallisto.dir, pattern=".h5", recursive=TRUE, full.names=TRUE)
for (currentFile in files) {
oldids <- h5read(currentFile, "/aux/ids")
newids <- gsub("\\|.*", "", oldids)
h5write(newids, currentFile, "/aux/ids")
}
What should I be doing differently?
Thanks for your help!
Brian
My sessionInfo:
> sessionInfo()
`R version 4.3.0 (2023-04-21) Platform: x86_64-apple-darwin20 (64-bit) Running under: macOS 14.1.1
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Kolkata tzcode source: internal
attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] rhdf5_2.46.1 IsoformSwitchAnalyzeR_2.2.0 pfamAnalyzeR_1.2.0 dplyr_1.1.4
[5] stringr_1.5.1 readr_2.1.4 sva_3.50.0 genefilter_1.84.0
[9] mgcv_1.9-1 nlme_3.1-164 satuRn_1.10.0 DEXSeq_1.48.0
[13] RColorBrewer_1.1-3 AnnotationDbi_1.64.1 DESeq2_1.42.0 SummarizedExperiment_1.32.0
[17] GenomicRanges_1.54.1 GenomeInfoDb_1.38.5 IRanges_2.36.0 S4Vectors_0.40.2
[21] MatrixGenerics_1.14.0 matrixStats_1.2.0 Biobase_2.62.0 BiocGenerics_0.48.1
[25] BiocParallel_1.36.0 limma_3.58.1 biomaRt_2.58.0 fastqq_0.1.3
[29] sleuth_0.30.1 tximeta_1.20.1 ggplot2_3.4.4
loaded via a namespace (and not attached):
[1] splines_4.3.0 later_1.3.2 BiocIO_1.12.0 bitops_1.0-7
[5] filelock_1.0.3 tibble_3.2.1 XML_3.99-0.16 lifecycle_1.0.4
[9] edgeR_4.0.5 vroom_1.6.5 lattice_0.22-5 ensembldb_2.26.0
[13] magrittr_2.0.3 rmarkdown_2.25 yaml_2.3.8 remotes_2.4.2.1
[17] httpuv_1.6.13 sessioninfo_1.2.2 pkgbuild_1.4.3 pbapply_1.7-2
[21] DBI_1.2.0 abind_1.4-5 pkgload_1.3.3 zlibbioc_1.48.0
[25] purrr_1.0.2 AnnotationFilter_1.26.0 RCurl_1.98-1.13 rappdirs_0.3.3
[29] GenomeInfoDbData_1.2.11 annotate_1.80.0 codetools_0.2-19 DelayedArray_0.28.0
[33] xml2_1.3.6 tidyselect_1.2.0 futile.logger_1.4.3 locfdr_1.1-8
[37] farver_2.1.1 BiocFileCache_2.10.1 GenomicAlignments_1.38.0 jsonlite_1.8.8
[41] ellipsis_0.3.2 survival_3.5-7 tools_4.3.0 progress_1.2.3
[45] Rcpp_1.0.11 glue_1.6.2 gridExtra_2.3 SparseArray_1.2.3
[49] xfun_0.41 usethis_2.2.2 withr_2.5.2 formatR_1.14
[53] BiocManager_1.30.22 fastmap_1.1.1 boot_1.3-28.1 rhdf5filters_1.14.1
[57] fansi_1.0.6 digest_0.6.33 R6_2.5.1 mime_0.12
[61] colorspace_2.1-0 networkD3_0.4 RSQLite_2.3.4 tidyr_1.3.0
[65] utf8_1.2.4 generics_0.1.3 data.table_1.14.10 rtracklayer_1.62.0
[69] prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.2.0
[73] pkgconfig_2.0.3 gtable_0.3.4 blob_1.2.4 hwriter_1.3.2.1
[77] XVector_0.42.0 htmltools_0.5.7 profvis_0.3.8 geneplotter_1.80.0
[81] ProtGenerics_1.34.0 scales_1.3.0 png_0.1-8 lambda.r_1.2.4
[85] knitr_1.45 rstudioapi_0.15.0 tzdb_0.4.0 reshape2_1.4.4
[89] rjson_0.2.21 curl_5.2.0 cachem_1.0.8 BiocVersion_3.18.1
[93] parallel_4.3.0 miniUI_0.1.1.1 restfulr_0.0.15 pillar_1.9.0
[97] grid_4.3.0 vctrs_0.6.5 urlchecker_1.0.1 promises_1.2.1
[101] dbplyr_2.4.0 xtable_1.8-4 tximport_1.30.0 evaluate_0.23
[105] VennDiagram_1.7.3 GenomicFeatures_1.54.1 futile.options_1.0.1 cli_3.6.2
[109] locfit_1.5-9.8 compiler_4.3.0 Rsamtools_2.18.0 rlang_1.1.2
[113] crayon_1.5.2 labeling_0.4.3 plyr_1.8.9 fs_1.6.3
[117] stringi_1.8.3 munsell_0.5.0 Biostrings_2.70.1 lazyeval_0.2.2
[121] devtools_2.4.5 Matrix_1.6-4 BSgenome_1.70.1 hms_1.1.3
[125] bit64_4.0.5 Rhdf5lib_1.24.1 KEGGREST_1.42.0 statmod_1.5.0
[129] shiny_1.8.0 interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0 igraph_1.6.0
[133] memoise_2.0.1 bit_4.0.5 `
Seems to me that the problem is that the annotation used for the quantification is not the same as the annotation used to pass to aSwitch List. Kalisto has quantified ~110,000 transcripts, but your annotation has 170,000 transcripts in it.
I wonder if its possible that you have quantified only against protein coding transcripts, but the annotation GTF you have passed to aSwitchList has all transcripts?
As the creator of IsoformSwitchAnalyzeR I approve of this solution :-)
Thanks for the reply!
Yes, that is possible. From the gencode website, I picked:
How can I make out if this gene annotation is compatible with the fasta file? Is there a generic way that is used to match up the annotation and fasta files on gencode?
We tend to build our own fasta files from the GTFs. We have our own tool for this (gtf2fasta from cgat-apps), but I believe that
gffread
from stringtie should also be able to do it. Then you are certain that your fasta and gtf are identical!An alternative, if you didn't want to to do that, would be to filter the GTF. It might be more fiddly to do this way round, but does avoid having to re-quantify.
The first thing to try would just be grepping out the protein_coding transcripts from the GTF:
This might well get you close enough to get past the isoformSwitchAnalyzeR check.