Changes for v1.1.1 (#170)

* Update default_params.config: added the site_representation_cutoff
* Create variant_table_to_fasta.py: added the script which replaces the sed commands used to create the fasta file for phylogeny
* Update default_params.config: moved site representation to the EXPERIENCED USERS section
* Added the resistance database for TBProfiler version 5
* Update magma-env-1.yml
* Update setup_conda_envs.sh
* Update default_params.config
* Update build.sh
* Update Dockerfile
* Update summarize_resistance.py: added the command line argument for the structural variant results directory. They are not yet added to the summary file, as this requires testing with real data.
* replace sed -> python scripts [ci skip]
* tweak for python2 [ci skip]
* document the different GVCF files
* fix typo [ci skip]
* Added the structural variant workflow and the resistance profiling of these variants.
* cleanup
* Update setup_conda_envs.sh
* Update default_params.config
* Update rename_vcf_chrom.py
* Update rename_vcf_chrom.py
* interim commit [ci skip]
* tweak variants to fasta [ci skip]
* accommodate the new design for structural variants [ci skip]
* fix imports [ci skip]
* fix input to workflow [ci skip]
* dev [ci skip]
* dev [ci skip]
* build and push new containers for v1.1.1 [ci skip]
* Fixed the summarize resistance script and added the structural variants to it
* add back the bc dependency [ci skip]
* Update magma-env-1.yml: make sure to use xlsxwriter 3.1.1
* minimal change, add bc to container-2 only [ci skip]
* Update CHANGELOG.md
* Changed the permissions on some files
* Fixed a typo in the script causing structural variants to not show up
* add the default lineage reference files GVCF [ci skip]
* tweak comments [ci skip]
* tweak comments in the config file [ci skip]
* fixed the filtering bug
* finalize filtering bug fix
* added sample filtering for the structural variant workflow
* Update structural_variants_analysis_wf.nf
* Moved the vcf filenames from bcftools merge into a file such that the command does not become impossibly long
* Added multithreading to tbprofiler
* Fixed the file listing samples so that it actually contains the newlines
* removed sorting from merge channels
* Fixed bug in the bcftools merge where the input file was unavailable on AWS
* tweak the readme [ci skip]
* add params specific to iqtree [ci skip]
* Update iqtree.nf: fixed flag for standard bootstrapping
* switch the script to python3 [ci skip]
* add view for file filtering logic [ci skip]
* refactor the location of view [ci skip]
* refactor the location of view [ci skip]
* refactor the location of view [ci skip]
* revert to binary invocation of the script in SNPEFF [ci skip]
* use generic python [ci skip]

---------

Co-authored-by: LennertVerboven <lennert.verboven@uantwerpen.be>
Co-authored-by: Tim H. Heupink <tim.heupink@uantwerpen.be>
Co-authored-by: vrennie <113892099+vrennie@users.noreply.github.com>
TORCH-Consortium · Sep 9, 2023 · c84ee9e
1 parent 86f2188
commit c84ee9e
Showing 48 changed files with 18,834 additions and 4,594 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,5 @@
+Created a parallel workflow for mapping without using the strict seed length for use in the structural variant workflow.
+
+Updated TBProfiler to version 5.0.0 and recreated the resistance database to work with the new version
+
+Updated the summarize resistance script to include the structural variants in the Excel output
diff --git a/README.md b/README.md
@@ -12,12 +12,27 @@ MAGMA (**M**aximum **A**ccessible **G**enome for **M**tb **A**nalysis) is a pipe
   - MAGMA parameters (`default_parameters.config`)
   - Hardware requirements (`conf/standard.config`)
   - Execution (software) requirements (`conf/docker.config` or `conf/conda.config`)
-- An (optional) GVCF reference dataset for ~600 samples is provided for augmenting smaller datasets
 
 
-# (Optional) GVCF for analyzing small number of samples
+# (Optional) GVCF datasets 
+
+We also provide reference GVCF files which you can use for specific use cases.
+
+- For small datasets (20 samples or fewer), we recommend that you download the `EXIT_RIF GVCF` files from https://zenodo.org/record/8054182,
+which contain a GVCF reference dataset of ~600 samples for augmenting smaller datasets.
+
+- To include Mtb lineages and an outgroup (M. canettii) in the phylogenetic tree, you can download the `LineagesAndOutgroup` files from https://zenodo.org/record/8233518
+
+
+```
+use_ref_exit_rif_gvcf = false
+ref_exit_rif_gvcf =  "/path/to/FILE.g.vcf.gz" 
+ref_exit_rif_gvcf_tbi =  "/path/to/FILE.g.vcf.gz.tbi"
+```
+
+> :note: **Custom GVCF dataset**:
+To create a custom GVCF dataset, you can refer to the discussion [here](https://github.com/TORCH-Consortium/MAGMA/issues/162).
 
-You can download the  `EXIT_RIF GVCF` files from https://zenodo.org/record/8054182
 
 ## Tutorials and Presentations
 
@@ -91,7 +106,7 @@ Which could be provided to the pipeline using `-params-file` parameter as shown
 ```console
 nextflow run 'https://github.com/TORCH-Consortium/MAGMA' \
 		 -profile conda_local \
-		 -r v1.0.1 \
+		 -r v1.1.1 \
 		 -params-file  my_parameters_1.yml
 
 ```
@@ -139,9 +154,9 @@ We provide [two docker containers](https://github.com/orgs/TORCH-Consortium/pack
 Although you don't need to pull the containers manually, should you need to, you can use the following commands to pull the pre-built containers
 
 ```console
-docker pull ghcr.io/torch-consortium/magma/magma-container-1:1.1.0
+docker pull ghcr.io/torch-consortium/magma/magma-container-1:1.1.1
 
-docker pull ghcr.io/torch-consortium/magma/magma-container-2:1.1.0
+docker pull ghcr.io/torch-consortium/magma/magma-container-2:1.1.1
 ```
 
 
@@ -154,7 +169,7 @@ Here's the command which should be used
 nextflow run 'https://github.com/torch-consortium/magma' \
 		 -params-file my_parameters_2.yml \
 		 -profile docker \
-		 -r v1.0.1 
+		 -r v1.1.1 
 ```
 
 > :bulb: **Hint**: <br>
@@ -189,7 +204,7 @@ You can then include this configuration as part of the pipeline invocation comma
 ```console
 nextflow run 'https://github.com/torch-consortium/magma' \
 		 -profile docker \
-		 -r v1.0.1 \
+		 -r v1.1.1 \
                  -c custom.config \
 		 -params-file my_parameters_2.yml
 ```

diff --git a/bin/reformat_lofreq.py b/bin/reformat_lofreq.py
@@ -45,6 +45,7 @@ def write_vcf(filename, df, header):
     args = vars(parser.parse_args())
 
     vcf, header, not_empty = read_vcf(args['lofreq_vcf_file'])
+    header = '\n'.join([i for i in header.split('\n') if 'lofreq' not in i])
     if not_empty:
         vcf['FORMAT'] = 'GT:AD:DP:GQ:PL'
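
The added line in this hunk drops every header line that mentions `lofreq` before the VCF is rewritten. Below is a minimal standalone sketch of that filter, assuming the header is a single newline-joined string as in the diff. The helper name `drop_header_lines` and the sample header are hypothetical and serve only as illustration.

```python
def drop_header_lines(header, needle="lofreq"):
    """Return the header with every line containing `needle` removed.

    The match is a plain, case-sensitive substring test, mirroring the
    list comprehension added in the diff.
    """
    return "\n".join(line for line in header.split("\n") if needle not in line)


# Hypothetical VCF header, for illustration only.
example_header = "\n".join([
    "##fileformat=VCFv4.0",
    "##source=lofreq call example",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
])

cleaned = drop_header_lines(example_header)
print(cleaned)
```

Because the test is case-sensitive, a header line that spells the tool name as `LoFreq` would be kept by this sketch.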
 

diff --git a/bin/rename_vcf_chrom.py b/bin/rename_vcf_chrom.py
@@ -1,4 +1,4 @@
-#! /usr/bin/env python3
+#! /usr/bin/env python
 
 '''Original author Jody Phelan at https://github.com/jodyphelan/pathogen-profiler/blob/master/scripts/rename_vcf_chrom.py'''
 import sys
@@ -30,11 +30,11 @@ def cmd_out(cmd,verbose=1):
     stderr.close()
 
 def main(args):
-    generator = cmd_out(f"bcftools view {args.vcf}") if args.vcf else sys.stdin
+    generator = cmd_out("bcftools view " + args.vcf) if args.vcf else sys.stdin
     convert = dict(zip(args.source,args.target))
     for l in generator:
         if l[0]=="#":
-            sys.stdout.write(l)
+            sys.stdout.write(l.strip()+"\n")
         else:
             row = l.strip().split()
             row[0] = convert[row[0]]
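
The renaming logic in this hunk can be sketched as a small generator, shown here without the `bcftools` call so that it runs on plain lists of lines. This is an illustration under assumptions: the function name `rename_chroms` is hypothetical, the tab-joined output of data rows is assumed because the diff does not show how rows are written back, and the contig names in the usage example are placeholders.

```python
def rename_chroms(lines, source, target):
    """Yield VCF lines with the CHROM column renamed from source to target.

    Header lines are stripped and re-terminated with a single newline,
    as in the changed line of the diff. Data rows are re-joined with
    tabs, which is an assumption about the part of the script not shown.
    """
    convert = dict(zip(source, target))
    for line in lines:
        if line.startswith("#"):
            yield line.strip() + "\n"
        else:
            row = line.strip().split()
            row[0] = convert[row[0]]
            yield "\t".join(row) + "\n"


# Placeholder input lines; the first header line carries a stray "\r".
vcf_lines = [
    "##fileformat=VCFv4.2\r\n",
    "#CHROM\tPOS\tREF\tALT\n",
    "NC_000962.3\t100\tA\tG\n",
]
renamed = list(rename_chroms(vcf_lines, ["NC_000962.3"], ["Chromosome"]))
print("".join(renamed), end="")
```

Stripping header lines before writing them normalizes trailing whitespace and line endings, which appears to be the point of the change from `sys.stdout.write(l)`.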

diff --git a/bin/summarize_resistance.py b/bin/summarize_resistance.py
diff --git a/bin/variant_table_to_fasta.py b/bin/variant_table_to_fasta.py
@@ -0,0 +1,30 @@
+#! /usr/bin/env python3
+
+import sys
+import argparse
+
+def main(args):
+    table = []
+    with open(args.table, 'r') as table_file:
+        table.append(table_file.readline().strip().split('\t')) # Get the headerline without modifying
+        # Process the actual variants
+        for idx, l in enumerate(table_file):
+            l = l.strip().split('\t')
+            l = [i.replace('*', '-').replace('.', '-') for i in l]
+            if l.count('-')/len(l) < (1-args.site_representation_cutoff):
+                table.append(l)
+            else:
+                pass
+    with open(args.output_fasta, 'w') as fasta_file:
+        for l in list(map(list, zip(*table))):
+            fasta_file.write('>{}\n{}\n'.format(l[0].replace('.GT', ''), ''.join(l[1:])))
+
+
+
+parser = argparse.ArgumentParser(description='Convert a variant table to a fasta file',formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+parser.add_argument('table', type=str, help='The input table to convert')
+parser.add_argument('output_fasta', type=str, help='The output fasta file')
+parser.add_argument('site_representation_cutoff', type=float, help='Minimum fraction of samples that need to have a call at a site before it is considered')
+parser.set_defaults(func=main)
+args = parser.parse_args()
+args.func(args)
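
The core of `variant_table_to_fasta.py` is a per-site filter followed by a transpose from sites-as-rows to samples-as-sequences. The sketch below reproduces that logic on in-memory lists so that the effect of `site_representation_cutoff` is easy to see. The function name `table_to_fasta_records` and the sample table are hypothetical.

```python
def table_to_fasta_records(rows, site_representation_cutoff):
    """Filter sites by representation, then transpose to (name, sequence) pairs.

    rows[0] is the header of sample columns (e.g. "S1.GT"); every other
    row is one site with one call per sample. Calls of '*' or '.' are
    treated as missing and written as '-'.
    """
    header, sites = rows[0], rows[1:]
    kept = [header]
    for site in sites:
        calls = [c.replace('*', '-').replace('.', '-') for c in site]
        missing_fraction = calls.count('-') / len(calls)
        # Same strict inequality as the script in the diff.
        if missing_fraction < (1 - site_representation_cutoff):
            kept.append(calls)
    # zip(*kept) turns columns (samples) into rows.
    return [(col[0].replace('.GT', ''), ''.join(col[1:])) for col in zip(*kept)]


# Hypothetical table: 4 samples, 3 sites.
rows = [
    ["S1.GT", "S2.GT", "S3.GT", "S4.GT"],
    ["A", "A", "G", "A"],   # no missing calls
    ["C", ".", "*", "."],   # 3 of 4 calls missing
    ["T", "T", ".", "C"],   # 1 of 4 calls missing
]

records = table_to_fasta_records(rows, 0.5)
for name, seq in records:
    print(">{}\n{}".format(name, seq))
```

With a cutoff of 0.5 the second site is dropped and the other two are kept. Because the comparison is strict, a site whose represented fraction equals the cutoff exactly is also dropped: with a cutoff of 0.75, the third site (3 of 4 samples called) does not pass.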
diff --git a/conda_envs/magma-env-1.yml b/conda_envs/magma-env-1.yml
@@ -2,18 +2,16 @@ name: magma-env-1
 channels:
   - conda-forge
   - bioconda
+  - defaults
 dependencies:
-  - bioconda::gatk4=4.2.6.1 
-  - conda-forge::r-ggplot2=3.3.5 
-  - conda-forge::pandas=1.5.1
-  - conda-forge::xlsxwriter=3.0.3
-  - bioconda::datamash=1.1.0 
-  - bioconda::delly=0.8.7 
-  - bioconda::lofreq=2.1.5 
-  - bioconda::tb-profiler=4.1.1 
-  - bioconda::multiqc=1.11 
-  - bioconda::fastqc=0.11.8
-  - bioconda::fastq_utils=0.25.1
-  - conda-forge::bc=1.07.1
-  - conda-forge::sed=4.8
-  - conda-forge::grep=3.11
+  - gatk4=4.2.6.1 
+  - r-ggplot2=3.3.5 
+  - pandas=1.5.1
+  - xlsxwriter=3.1.1
+  - datamash=1.1.0 
+  - delly=0.8.7 
+  - lofreq=2.1.5 
+  - tb-profiler=5.0.0 
+  - multiqc=1.11 
+  - fastqc=0.11.8
+  - fastq_utils=0.25.1
diff --git a/conda_envs/magma-env-2.yml b/conda_envs/magma-env-2.yml
@@ -2,16 +2,15 @@ name: magma-env-2
 channels:
   - conda-forge
   - bioconda
+  - defaults
 dependencies:
-  - conda-forge::python=2.7
-  - bioconda::bwa=0.7.17 
-  - bioconda::samtools=1.9 
-  - bioconda::iqtree=2.1.2 
-  - bioconda::snp-dists=0.8.2 
-  - bioconda::snp-sites=2.4.0 
-  - bioconda::bcftools=1.9 
-  - bioconda::snpeff=4.3.1t 
-  - bioconda::clusterpicker=1.2.3
-  - conda-forge::bc=1.07.1
-  - conda-forge::sed=4.8
-  - conda-forge::grep=3.11
+  - python=2.7
+  - bwa=0.7.17 
+  - samtools=1.9 
+  - iqtree=2.1.2 
+  - snp-dists=0.8.2 
+  - snp-sites=2.4.0 
+  - bcftools=1.9 
+  - snpeff=4.3.1t 
+  - clusterpicker=1.2.3
+  - bc=1.07.1
diff --git a/conda_envs/setup_conda_envs.sh b/conda_envs/setup_conda_envs.sh
@@ -20,7 +20,7 @@ cp -r ../resources/resistance_db_who ./
 cd resistance_db_who
 
 echo "INFO: Load the database within tb-profiler"
-tb-profiler load_library resistance_db_who
+tb-profiler load_library ./resistance_db_who
 
 echo "INFO: Remove the local copy of the database folder"
 cd ..

diff --git a/conf/docker.config b/conf/docker.config
@@ -6,12 +6,12 @@ process {
 
     withName:
     'GATK.*|LOFREQ.*|DELLY.*|TBPROFILER.*|MULTIQC.*|FASTQC.*|UTILS.*|FASTQ.*|SAMPLESHEET.*' {
-        container = "ghcr.io/torch-consortium/magma/magma-container-1:1.1.0"
+        container = "ghcr.io/torch-consortium/magma/magma-container-1:1.1.1"
     }
 
     withName:
     'BWA.*|IQTREE.*|SNPDISTS.*|SNPSITES.*|BCFTOOLS.*|BGZIP.*|SAMTOOLS.*|SNPEFF.*|CLUSTERPICKER.*' {
-        container = "ghcr.io/torch-consortium/magma/magma-container-2:1.1.0"
+        container = "ghcr.io/torch-consortium/magma/magma-container-2:1.1.1"
     }
 
 }

diff --git a/conf/podman.config b/conf/podman.config
@@ -6,12 +6,12 @@ process {
 
     withName:
     'GATK.*|LOFREQ.*|DELLY.*|TBPROFILER.*|MULTIQC.*|FASTQC.*|UTILS.*|FASTQ.*|SAMPLESHEET.*' {
-        container = "ghcr.io/torch-consortium/magma/magma-container-1:1.1.0"
+        container = "ghcr.io/torch-consortium/magma/magma-container-1:1.1.1"
     }
 
     withName:
     'BWA.*|IQTREE.*|SNPDISTS.*|SNPSITES.*|BCFTOOLS.*|BGZIP.*|SAMTOOLS.*|SNPEFF.*|CLUSTERPICKER.*' {
-        container = "ghcr.io/torch-consortium/magma/magma-container-2:1.1.0"
+        container = "ghcr.io/torch-consortium/magma/magma-container-2:1.1.1"
     }
 
 }
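
The `withName` selectors in `conf/docker.config` and `conf/podman.config` are regular-expression alternations that route each process to one of the two containers. The Python sketch below illustrates that routing. It assumes the selector has to match the whole process name, which is an assumption about Nextflow's matching semantics and not something this diff confirms. The process names in the usage example are hypothetical.

```python
import re

# Selector patterns and container tags copied from the config in the diff.
SELECTORS = [
    (r"GATK.*|LOFREQ.*|DELLY.*|TBPROFILER.*|MULTIQC.*|FASTQC.*|UTILS.*|FASTQ.*|SAMPLESHEET.*",
     "ghcr.io/torch-consortium/magma/magma-container-1:1.1.1"),
    (r"BWA.*|IQTREE.*|SNPDISTS.*|SNPSITES.*|BCFTOOLS.*|BGZIP.*|SAMTOOLS.*|SNPEFF.*|CLUSTERPICKER.*",
     "ghcr.io/torch-consortium/magma/magma-container-2:1.1.1"),
]


def pick_container(process_name):
    """Return the container whose selector matches the full process name, else None."""
    for pattern, container in SELECTORS:
        if re.fullmatch(pattern, process_name):
            return container
    return None


# Hypothetical process names, for illustration only.
for name in ["GATK_EXAMPLE", "SAMTOOLS_EXAMPLE", "SAMPLESHEET_EXAMPLE", "OTHER_EXAMPLE"]:
    print(name, "->", pick_container(name))
```

Under this assumption, a process whose name does not start with any of the listed prefixes matches neither selector and gets no container from these blocks.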