The repo markgene/atacseq is an ATAC-seq pipeline configured to be run on internal high performance cluster (HPC) with job scheduler platform Load Sharing Facility (LSF). It is originally forked from nf-core/atacseq, a bioinformatics analysis pipeline used for ATAC-seq data. I am working on additional features to the pipeline.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.
- Raw read QC (
FastQC
) - Adapter trimming (
Trim Galore!
) - Alignment (
BWA
) - Mark duplicates (
picard
) - Merge alignments from multiple libraries of the same sample (
picard
)- Re-mark duplicates (
picard
) - Filtering to remove:
- reads mapping to mitochondrial DNA (
SAMtools
) - reads mapping to blacklisted regions (
SAMtools
,BEDTools
) - reads that are marked as duplicates (
SAMtools
) - reads that arent marked as primary alignments (
SAMtools
) - reads that are unmapped (
SAMtools
) - reads that map to multiple locations (
SAMtools
) - reads containing > 4 mismatches (
BAMTools
) - reads that are soft-clipped (
BAMTools
) - reads that have an insert size > 2kb (
BAMTools
; paired-end only) - reads that map to different chromosomes (
Pysam
; paired-end only) - reads that arent in FR orientation (
Pysam
; paired-end only) - reads where only one read of the pair fails the above criteria (
Pysam
; paired-end only)
- reads mapping to mitochondrial DNA (
- Alignment-level QC and estimation of library complexity (
picard
,Preseq
) - Create normalised bigWig files scaled to 1 million mapped reads (
BEDTools
,wigToBigWig
) - Generate gene-body meta-profile from bigWig files (
deepTools
) - Calculate genome-wide enrichment (
deepTools
) - Call broad/narrow peaks (
MACS2
) - Annotate peaks relative to gene features (
HOMER
) - Create consensus peakset across all samples and create tabular file to aid in the filtering of the data (
BEDTools
) - Count reads in consensus peaks (
featureCounts
) - Differential accessibility analysis, PCA and clustering (
R
,DESeq2
) - Generate ATAC-seq specific QC html report (
ataqv
)
- Re-mark duplicates (
- Merge filtered alignments across replicates (
picard
)- Re-mark duplicates (
picard
) - Remove duplicate reads (
SAMtools
) - Create normalised bigWig files scaled to 1 million mapped reads (
BEDTools
,wigToBigWig
) - Call broad/narrow peaks (
MACS2
) - Annotate peaks relative to gene features (
HOMER
) - Create consensus peakset across all samples and create tabular file to aid in the filtering of the data (
BEDTools
) - Count reads in consensus peaks relative to merged library-level alignments (
featureCounts
) - Differential accessibility analysis, PCA and clustering (
R
,DESeq2
)
- Re-mark duplicates (
- Create IGV session file containing bigWig tracks, peaks and differential sites for data visualisation (
IGV
). - Present QC for raw read, alignment, peak-calling and differential accessibility results (
ataqv
,MultiQC
,R
)
i. Install nextflow
# Make sure that Java v8+ is installed:
java -version
# Install Nextflow
curl -fsSL get.nextflow.io | bash
# Add Nextflow binary to your user's PATH:
mv nextflow ~/bin/
# Test with hello-world example
nextflow run hello
ii. Install conda environment.
git clone https://github.com/markgene/atacseq.git
cd atacseq
# The command will create a Conda env named atacseq. Be sure you do not have
# the evironment of the same name.
conda env create -f environment.yml
# Wait for the installlation and then activate it.
conda activate atacseq
iii. Download the pipeline and test it on a minimal dataset with a single command
## Notice the test is run in local executor.
bsub -P MC -J testLocal -q priority -n 2 -R "rusage[mem=8000]" "nextflow run main.nf -profile test -with-report report.html -with-trace -with-timeline timeline.html -with-dag flowchart.png"
# Test with test_sjlsf
bsub -P MC -J testLSF -q priority -n 1 -R "rusage[mem=8000]" "nextflow run main.nf -profile test_sjlsf -with-report report.html -with-trace -with-timeline timeline.html -with-dag flowchart.png"
Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use
-profile institute
in your command. This will enable eitherdocker
orsingularity
and set the appropriate execution settings for your local compute environment.
iv. Start running your own analysis!
# DO NOT RUN. This will take a while.
# It runs four mouse samples with the config specified in `conf/sjlsf.config`.
bsub -P MC -J exampleRun -q priority -n 1 "nextflow run main.nf -profile conf/sjlsf.config -with-report report.html -with-trace -with-timeline timeline.html -with-dag flowchart.png --outdir sjlsf_results -resume"
See usage docs for all of the available options when running the pipeline.
The markgene/atacseq document is found in the docs/
directory:
The original nf-core/atacseq pipeline comes with documentation:
- Installation
- Pipeline configuration
- Troubleshooting
- Add Irreproducible Discovery Rate.
- Downstream analysis. I am going to add the following analyses to the pipeline, i) motif enrichment analysis, ii) transcription factor footprinting, and iii) nucleosome positioning. Perhaps other analyses in the future.
- Generate report. The pipeline organizes the output in a hierarchical structure. It reflects how the workflow goes and sometime the result is stored in a deep manner. I will collect the results I am most concerned to one place, perhaps by creating symbolic links of the different sub-folders or files.