A pipeline for comprehensive genomic analyses of Mycobacterium tuberculosis with a focus on clinical decision making as well as research

TORCH-Consortium/MAGMA

MAGMA

MAGMA (Maximum Accessible Genome for Mtb Analysis) is a pipeline for comprehensive genomic analyses of Mycobacterium tuberculosis with a focus on clinical decision making as well as research.

Salient features of the implementation

  • Fine-grained control over resource allocation (CPU/Memory/Storage)
  • Reliance on bioconda for installing packages, for reproducibility
  • Ease of use on a range of infrastructure (cloud, on-prem HPC clusters, servers, or local machines)
  • Resumability for failed processes
  • Centralized locations for specifying analysis parameters and hardware requirements
    • MAGMA parameters (default_params.config)
    • Hardware requirements (conf/standard.config)
    • Execution (software) requirements (conf/docker.config or conf/conda.config)

(Optional) GVCF datasets

We also provide some reference GVCF files that you can use for specific use cases.

  • For small datasets (20 samples or fewer), we recommend that you download the EXIT_RIF GVCF files from https://zenodo.org/record/8054182. This reference dataset contains GVCFs for ~600 samples and is provided for augmenting smaller datasets.

  • For including Mtb lineages and outgroup (M. canettii) in the phylogenetic tree, you can download the LineagesAndOutgroup files from https://zenodo.org/record/8233518

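The EXIT_RIF reference dataset is controlled by the following pipeline parameters (set use_ref_exit_rif_gvcf to true to enable it):

use_ref_exit_rif_gvcf = false
ref_exit_rif_gvcf = "/path/to/FILE.g.vcf.gz"
ref_exit_rif_gvcf_tbi = "/path/to/FILE.g.vcf.gz.tbi"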
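
Before enabling the reference dataset, it can help to confirm that both the GVCF and its index are actually present at the configured paths. The following is a minimal sketch, not part of MAGMA: the function name check_gvcf_pair is hypothetical, and it assumes the index sits next to the GVCF with a .tbi suffix unless a second path is given.

```shell
# Hypothetical helper (not part of MAGMA): succeeds only if a GVCF and its
# tabix index both exist and are non-empty.
# Usage: check_gvcf_pair <gvcf> [<tbi>]   (index defaults to <gvcf>.tbi)
check_gvcf_pair() {
    gvcf_path="$1"
    tbi_path="${2:-$1.tbi}"
    if [ ! -s "$gvcf_path" ]; then
        echo "missing or empty GVCF: $gvcf_path" >&2
        return 1
    fi
    if [ ! -s "$tbi_path" ]; then
        echo "missing or empty index: $tbi_path" >&2
        return 1
    fi
    echo "ok: $gvcf_path"
}
```

For example, `check_gvcf_pair /path/to/FILE.g.vcf.gz` prints an "ok" line only when both files are in place, which is a reasonable point at which to set use_ref_exit_rif_gvcf to true.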

📝 Custom GVCF dataset: To create a custom GVCF dataset, you can refer to the discussion here.

Tutorials and Presentations

For the tutorials (./docs/tutorials.md) and presentations, please refer to the docs folder.

Prerequisites

Nextflow

  • git: the version control system used for the pipeline.
  • Java 11 or Java 17 (preferred)

⚠️ Check the Java version!: The Java version should NOT be an internal JDK release! You can check the release via java -version
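
The Java check above can be scripted. The sketch below is illustrative only: the function names are hypothetical, and it assumes that an internal JDK build is recognizable by the word "internal" in the version string reported by `java -version`, which may not hold for every distribution.

```shell
# Extract the major version from strings such as "17.0.7" or "1.8.0_292".
java_major() {
    ver="$1"
    case "$ver" in
        1.*) ver="${ver#1.}" ;;   # legacy scheme: 1.8.0_292 -> 8.0_292
    esac
    echo "${ver%%[!0-9]*}"        # keep the leading digits only
}

# Read the quoted version from `java -version` (which writes to stderr).
detect_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Accept Java 11 or 17; reject builds whose version string contains
# "internal" (assumption: this is how internal JDK releases are labeled).
check_java() {
    full="$1"
    case "$full" in
        *internal*) echo "internal JDK build ($full): not supported"; return 1 ;;
    esac
    major=$(java_major "$full")
    if [ "$major" = "11" ] || [ "$major" = "17" ]; then
        echo "Java $major: ok"
    else
        echo "Java $major: expected 11 or 17"
        return 1
    fi
}

# Usage on a machine with Java installed:
#   check_java "$(detect_java_version)"
```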

  • Download Nextflow
$ curl -s https://get.nextflow.io | bash
  • Make Nextflow executable
$ chmod +x nextflow
  • Move nextflow to a directory on your PATH (for example /usr/local/bin/)
$ mv nextflow /usr/local/bin
  • Sanity check for nextflow installation
$ nextflow info

  Version: 23.04.1 build 5866
  Created: 15-04-2023 06:51 UTC (08:51 SAST)
  System: Mac OS X 12.6.5
  Runtime: Groovy 3.0.16 on OpenJDK 64-Bit Server VM 17.0.7+7-LTS
  Encoding: UTF-8 (UTF-8)

✔️ With this, you're all set with Nextflow. Next stop: conda or docker, pick one!

Customizing pipeline parameters for your dataset

The pipeline parameters are distinct from Nextflow parameters, and it is therefore recommended that they be provided using a yml file, as shown below:

# Sample contents of my_parameters_1.yml file

input_samplesheet: /path/to/your_samplesheet.csv
only_validate_fastqs: true
conda_envs_location: /path/to/both/conda_envs

Note: The -profile mechanism is used to enable infrastructure-specific settings of the pipeline. The example below assumes you are using a conda-based setup.

The parameters file can be provided to the pipeline using the -params-file option, as shown below:

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' \
    -profile conda_local \
    -r v1.1.1 \
    -params-file my_parameters_1.yml
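
The parameters file and the matching command can also be generated from a script, which is convenient when running several datasets. This is a sketch under stated assumptions: the helper names write_params and magma_cmd are hypothetical, only the three keys from the sample file above are written, and the command is printed rather than executed.

```shell
# Write a parameters file with the keys from the sample above.
# Usage: write_params <out.yml> <samplesheet.csv> <conda_envs_dir>
write_params() {
    params_out="$1"
    samplesheet="$2"
    envs_dir="$3"
    cat > "$params_out" <<EOF
input_samplesheet: $samplesheet
only_validate_fastqs: true
conda_envs_location: $envs_dir
EOF
}

# Print (do not run) the Nextflow command for a profile, version and params file.
# Usage: magma_cmd <profile> <version> <params.yml>
magma_cmd() {
    echo "nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile $1 -r $2 -params-file $3"
}
```

For instance, `write_params my_parameters_1.yml /path/to/your_samplesheet.csv /path/to/both/conda_envs` followed by `magma_cmd conda_local v1.1.1 my_parameters_1.yml` reproduces the file and command shown above.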

Running MAGMA using conda

You can run the pipeline using the Conda, Mamba or Micromamba package managers to install all the prerequisite software from popular repositories such as bioconda and conda-forge.

You can use the conda-based setup to run MAGMA:

  • On a local Linux machine (e.g. your laptop or a university server)
  • On an HPC cluster (e.g. SLURM, PBS) in case you don't have access to container systems like Singularity, Podman or Docker

All the requisite software is provided as conda recipes (i.e. yml files).

These files can be downloaded using the following commands:

wget https://raw.githubusercontent.com/TORCH-Consortium/MAGMA/master/conda_envs/magma-env-2.yml
wget https://raw.githubusercontent.com/TORCH-Consortium/MAGMA/master/conda_envs/magma-env-1.yml

The conda_local profile of the pipeline expects these conda environments to exist, so it is recommended that you create them before using the pipeline, with the following commands. Note that if you have mamba (or micromamba) available, you can rely on that instead of conda.

$ conda env create -n magma-env-1 --file magma-env-1.yml

$ conda env create -n magma-env-2 --file magma-env-2.yml
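
The two environment-creation commands can be generated in a loop, preferring mamba when it is installed. This is only a sketch: pick_frontend and plan_envs are hypothetical names, micromamba is left out because its subcommands may differ, and the commands are printed as a dry run instead of being executed.

```shell
# Prefer mamba when it is on PATH, otherwise fall back to conda.
pick_frontend() {
    if command -v mamba >/dev/null 2>&1; then
        echo mamba
    else
        echo conda
    fi
}

# Print the creation command for each of the two MAGMA environments.
# Usage: plan_envs <frontend>
plan_envs() {
    frontend="$1"
    for n in 1 2; do
        echo "$frontend env create -n magma-env-$n --file magma-env-$n.yml"
    done
}

# Dry run: show what would be executed on this machine.
plan_envs "$(pick_frontend)"
```

To actually create the environments, the printed commands can be piped to a shell, for example `plan_envs "$(pick_frontend)" | sh`, from the directory containing the downloaded yml files.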

Once the environments are created, you can make use of the pipeline parameter conda_envs_location to inform the pipeline of the names and location of the conda envs.

ℹ️ Conda environments and cheatsheet:
You can find out the location of conda environments using conda env list. Here's a useful cheatsheet for conda operations.

Running MAGMA using docker

We provide two docker containers with the pipeline so that you can simply download and run the pipeline with them. There is NO need to build any docker containers yourself; just download them and enable the docker profile.

🚧 Container build script: The script used to build these containers is provided here.

You don't need to pull the containers manually, but should you need to, you can use the following commands to pull the pre-built containers:

docker pull ghcr.io/torch-consortium/magma/magma-container-1:1.1.1

docker pull ghcr.io/torch-consortium/magma/magma-container-2:1.1.1

📝 Have singularity or podman instead?:
If you do have access to Singularity or Podman, then owing to their compatibility with Docker, you can still use the provided docker containers.

Here's the command to run the pipeline with the docker profile:

nextflow run 'https://github.com/torch-consortium/magma' \
    -params-file my_parameters_2.yml \
    -profile docker \
    -r v1.1.1

💡 Hint:
You can use the -r option of Nextflow to work with any specific version/branch of the pipeline.
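
If you do need to pull the containers manually, the pull commands for Docker, Podman or Singularity can be derived from the same two image names. The sketch below uses hypothetical helper names (magma_images, pull_cmds), assumes the image tag matches the pipeline version 1.1.1 used in this document, and only prints the commands. `singularity pull docker://...` is Singularity's usual way of fetching a Docker image; whether your site permits it is an assumption to verify.

```shell
# List the two MAGMA container images for a given tag (default: 1.1.1).
magma_images() {
    tag="${1:-1.1.1}"
    for n in 1 2; do
        echo "ghcr.io/torch-consortium/magma/magma-container-$n:$tag"
    done
}

# Print pull commands for the chosen engine: docker, podman or singularity.
# Usage: pull_cmds <engine> [<tag>]
pull_cmds() {
    engine="$1"
    magma_images "$2" | while read -r img; do
        case "$engine" in
            singularity) echo "singularity pull docker://$img" ;;
            *)           echo "$engine pull $img" ;;
        esac
    done
}
```

For example, `pull_cmds docker` prints the two docker pull commands shown above, and `pull_cmds singularity` prints the Singularity equivalents.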

Customizing the pipeline configuration for your infrastructure

There might be cases when you need to customize the default configuration, such as cpus and memory. For these cases, it is recommended that you refer to the Nextflow configuration docs as well as the default_params.config file.

Shown below is one sample configuration

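  • custom.config => Ideally this file should only contain hardware-level configurations, such as:
process {
    // Retry while task.attempt is below 3; otherwise ignore the failure
    errorStrategy = { task.attempt < 3 ? 'retry' : 'ignore' }

    // Defaults applied to every process
    time = '1h'
    cpus = 8
    memory = 8.GB

    // Override for the FASTQ_VALIDATOR process only
    withName:FASTQ_VALIDATOR {
        cpus = 2
        memory = 4.GB
    }
}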

You can then include this configuration as part of the pipeline invocation command

nextflow run 'https://github.com/torch-consortium/magma' \
    -profile docker \
    -r v1.1.1 \
    -c custom.config \
    -params-file my_parameters_2.yml
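
When the same pipeline is run on machines of different sizes, a minimal custom.config can be generated per machine instead of edited by hand. This is a sketch, not a recommended sizing policy: write_custom_config and host_cpus are hypothetical helpers, the CPU count is read with `getconf _NPROCESSORS_ONLN`, and the memory value is left for you to choose.

```shell
# Report the number of online CPUs, falling back to 1 if it cannot be read.
host_cpus() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Write a minimal hardware-only Nextflow config.
# Usage: write_custom_config <out.config> <cpus> <memory_in_GB>
write_custom_config() {
    config_out="$1"
    cpus="$2"
    mem_gb="$3"
    cat > "$config_out" <<EOF
process {
    cpus = $cpus
    memory = ${mem_gb}.GB
}
EOF
}
```

For example, `write_custom_config custom.config "$(host_cpus)" 8` writes a file that can be passed to the pipeline with -c custom.config, as in the command above.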

Running MAGMA on HPC and cloud executors

  1. For HPC-based execution of MAGMA, please refer to this doc.
  2. For cloud batch (AWS/Google/Azure) based execution of MAGMA, please refer to this doc.

Citation

The MAGMA pipeline paper has been submitted.

The XBS variant calling core was published here: https://doi.org/10.1099/mgen.0.000689

TODO: Update this section and add a citation.cff file

Contributions and Interactions with us

Contributions are warmly accepted! We encourage you to interact with us using the Discussions and Issues features of GitHub.

License

Please refer to the GPL 3.0 LICENSE file.