A Nextflow pipeline for processing ChIP-seq data.
Nextflow can be installed by using the following command:
curl -fsSL get.nextflow.io | bash
First you need to pull the pipeline using Nextflow:
$ nextflow pull guigolab/chip-nf
Checking guigolab/chip-nf ...
downloaded from https://github.com/guigolab/chip-nf.git
You can get the pipeline help with the following command:
$ nextflow run chip-nf --help
N E X T F L O W ~ version 0.24.1
Launching `guigolab/chip-nf` [nostalgic_franklin] - revision: 974a45c356 [master]
C H I P - N F ~ ChIP-seq Pipeline
---------------------------------
Run ChIP-seq analyses on a set of data.
Usage:
chipseq-pipeline.nf --index TSV_FILE --genome GENOME_FILE [OPTION]...
Options:
--help                            Show this message and exit.
--index TSV_FILE                  Tab separated file containing information about the data.
--genome GENOME_FILE              Reference genome file.
--genome-index GENOME_INDEX_FILE  Reference genome index file.
--genome-size GENOME_SIZE         Reference genome size for MACS2 callpeaks. Must be one of
                                  MACS2 precomputed sizes: hs, mm, dm, ce. (Default: hs)
--mismatches MISMATCHES           Sets the maximum number/percentage of mismatches allowed for a read (Default: 2).
--multimaps MULTIMAPS             Sets the maximum number of mappings allowed for a read (Default: 10).
--min-matched-bases BASES         Sets the minimum number/percentage of bases that have to match with the reference (Default: 0.80).
--quality-threshold THRESHOLD     Sets the sequence quality threshold for a base to be considered as low-quality (Default: 26).
--fragment-length LENGTH          Sets the fragment length globally for all samples (Default: 200).
--remove-duplicates               Remove duplicate alignments instead of just flagging them (Default: false).
--rescale                         Rescale peak scores to conform to the format supported by the
                                  UCSC genome browser (score must be <1000) (Default: false).
--shift                           Move fragments ends and apply global extsize in peak calling. (Default: false).
The input data and metadata should be specified in a tab-separated file, which is passed to the pipeline with the --index option. Here is an example of the file format:
sample1 sample1_run1 /path/to/sample1_run1.fastq.gz - H3
sample1 sample1_run2 /path/to/sample1_run2.fastq.gz - H3
sample1 sample1_run3 /path/to/sample1_run3.fastq.gz - H3
sample1 sample1_run4 /path/to/sample1_run4.fastq.gz - H3
sample2 sample2_run1 /path/to/sample2_run1.fastq.gz control1 H3K4me2
control1 control1_run1 /path/to/control1_run1.fastq.gz control1 input
The fields in the file correspond to:
- identifier used for merging the BAM files
- single run identifier
- path to the fastq file to be processed
- identifier of the input, or "-" if no control is used
- mark/histone, or "input" if the line refers to a control
- optional sample fragment length. If not specified, the fragment length is estimated using SPP
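The field layout above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function name `read_index` and the dictionary keys are hypothetical, and it assumes the optional sixth field, when present, is an integer fragment length.

```python
import csv
import io
from collections import defaultdict


def read_index(handle):
    """Group runs by merge identifier from a chip-nf style index file (TSV).

    Hypothetical helper for illustration; the pipeline does its own parsing.
    """
    samples = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) not in (5, 6):
            raise ValueError(f"expected 5 or 6 fields, got {len(row)}: {row}")
        merge_id, run_id, fastq, control, mark = row[:5]
        # Assumption: the optional sixth field is an integer fragment length.
        fragment_length = int(row[5]) if len(row) == 6 and row[5] else None
        samples[merge_id].append({
            "run": run_id,
            "fastq": fastq,
            "control": None if control == "-" else control,  # "-" means no control
            "mark": mark,
            "fragment_length": fragment_length,
        })
    return dict(samples)


# Three lines taken from the example index above, written with real tabs.
example = (
    "sample1\tsample1_run1\t/path/to/sample1_run1.fastq.gz\t-\tH3\n"
    "sample2\tsample2_run1\t/path/to/sample2_run1.fastq.gz\tcontrol1\tH3K4me2\n"
    "control1\tcontrol1_run1\t/path/to/control1_run1.fastq.gz\tcontrol1\tinput\n"
)
index = read_index(io.StringIO(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `read_index(open("index.tsv", newline=""))`, where `index.tsv` is a placeholder name.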
The pipeline will produce the following output data:
- Alignments
- pileupSignal, pileup signal tracks
- fcSignal, fold enrichment signal tracks
- pvalueSignal, -log_10(P) signal tracks
- narrowPeak, peak locations with peak summit, pvalue and qvalue (BED6+4)
- broadPeak, similar to narrowPeak (BED6+3)
- gappedPeak, both narrow and broad peaks (BED12+3)
Check MACS2 output files for details.
The output data information is written to a file called chipseq-pipeline.db, created in the folder from which the pipeline is run. Here is an example of the db file:
sample1 /path/to/results/peakOut/sample1.pileup_signal.bw H3 255 pileupSignal 0.9960 0.4393
sample1 /path/to/results/peakOut/sample1_peaks.narrowPeak H3 255 narrowPeak 0.9960 0.4393
sample1 /path/to/results/sample1.bam H3 255 Alignments 0.9960 0.4393
sample1 /path/to/results/peakOut/sample1_peaks.gappedPeak H3 255 gappedPeak 0.9960 0.4393
sample1 /path/to/results/peakOut/sample1_peaks.broadPeak H3 255 broadPeak 0.9960 0.4393
sample2 /path/to/results/peakOut/sample2_peaks.gappedPeak H3K4me2 200 gappedPeak 0.9995 0.7216
sample2 /path/to/results/peakOut/sample2.fc_signal.bw H3K4me2 200 fcSignal 0.9995 0.7216
sample2 /path/to/results/peakOut/sample2.pval_signal.bw H3K4me2 200 pvalueSignal 0.9995 0.7216
sample2 /path/to/results/peakOut/sample2_peaks.broadPeak H3K4me2 200 broadPeak 0.9995 0.7216
sample2 /path/to/results/peakOut/sample2.pileup_signal.bw H3K4me2 200 pileupSignal 0.9995 0.7216
sample2 /path/to/results/peakOut/sample2_peaks.narrowPeak H3K4me2 200 narrowPeak 0.9995 0.7216
sample2 /path/to/results/sample2_GCCAAT_primary.bam H3K4me2 200 Alignments 0.9995 0.7216
The fields in the file correspond to:
- merge identifier
- path
- mark/histone
- (estimated) fragment length
- data type
- NRF (Nonredundant Fraction)
- FRiP (Fraction of Reads in Peaks)
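As an illustration of how the db file might be consumed downstream, here is a minimal sketch that loads the seven fields into named records and selects files by data type. The names `DbRecord` and `read_db` are hypothetical, and the sketch assumes the file is tab-separated and that the fragment length is written as an integer, as in the example above.

```python
import csv
import io
from typing import NamedTuple


class DbRecord(NamedTuple):
    """One line of chipseq-pipeline.db (hypothetical field names)."""
    merge_id: str
    path: str
    mark: str
    fragment_length: int
    data_type: str
    nrf: float
    frip: float


def read_db(handle):
    """Parse the db file into a list of DbRecord tuples.

    Assumption: fields are tab-separated and appear in the documented order.
    """
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        merge_id, path, mark, frag, data_type, nrf, frip = row
        records.append(
            DbRecord(merge_id, path, mark, int(frag), data_type, float(nrf), float(frip))
        )
    return records


# Three lines taken from the example db file above, written with real tabs.
example = (
    "sample1\t/path/to/results/peakOut/sample1_peaks.narrowPeak\tH3\t255\tnarrowPeak\t0.9960\t0.4393\n"
    "sample1\t/path/to/results/sample1.bam\tH3\t255\tAlignments\t0.9960\t0.4393\n"
    "sample2\t/path/to/results/peakOut/sample2_peaks.narrowPeak\tH3K4me2\t200\tnarrowPeak\t0.9995\t0.7216\n"
)
records = read_db(io.StringIO(example))

# Collect the paths of all narrowPeak files.
narrow_peaks = [r.path for r in records if r.data_type == "narrowPeak"]
```

The same list comprehension can filter on any other field, for instance keeping only records whose `frip` exceeds a threshold of your choosing.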