A pipeline that processes multiplexed 4C-seq reads directly from FASTQ files. It generates files in a range of widely used formats to facilitate visualization and further data analysis using standard genome browsers and tools, including our recently developed peak caller for 4C-seq data peakC.
If you have any difficulties using the pipeline, please do not hesitate to contact us (delaatbioinf@hubrecht.eu).
Krijger PHL, Geeven G, Bianchi V, Hilvering CRE, de Laat W. 4C-seq from beginning to end: A detailed protocol for sample preparation and data analysis. Methods. 2019 Jul 26. pii: S1046-2023(18)30474-2. doi: 10.1016/j.ymeth.2019.07.014. https://doi.org/10.1016/j.ymeth.2019.07.014
- A Unix like shell (e.g. Bash v3.2+)
- Bowtie2 v2.3+ available from http://bowtie-bio.sourceforge.net/bowtie2/.
- SAMtools v1.3+ available from http://www.htslib.org/. note: The pipeline will produce a sort error when older versions are used.
- R v3.5+ available from https://www.r-project.org/.
- The following R packages available from CRAN:
- optparse
- caTools
- config
- The following R packages available from Bioconductor:
- shortRead
- genomicRanges
- genomicAlignments
- BSgenome
- BSgenome of interest
- The peakC package available from https://github.com/deWitLab/peakC/.
Download the latest version of the pipeline from this git repository using:
$ wget https://github.com/deLaatLab/pipe4C/archive/master.zip
$ unzip master.zip
$ cd ./pipe4C-master
note: the pipe4C.R and functions.R files need to be placed in the same folder.
-
Reads in (compressed) FASTQ format.
- Illumina Sequencing Systems generate raw data files in binary base call (BCL) format. Illumina offers bcl2fastq conversion software to demultiplex (based on the index used in the non-reading primer) and convert BCL files. If multiple flow cell lanes have been used to sequence the library, a single Read 1 FASTQ file should be created per index per sequence run by combining the FASTQ files for each flow cell lane using a standard “cat” command after BCL conversion or by using the no-lane-splitting option when running bcl2fastq.
-
Configuration file (conf.yml)
- Global and system specific parameters (such as e.g. paths and genome assemblies installed) that are likely to remain constant across different runs of the pipeline are defined in the global configuration file (conf.yml). In each run the pipeline initially loads the parameters defined in this global configuration file, and then proceeds to load run specific parameters and the experiment specific data defined in a separate file (vpFile). The global configuration file can be edited using any standard text editor. The list of parameters that need to be set at least once upon installation on a system in the global configuration file are shown in table 1.
Name | Description |
---|---|
fragFolder | Path to the folder containing the fragment end libraries of the reference genomes |
normalizeFactor | Reads mapped to the 4C fragment end library are normalized to account for sequencing depth according to the normalizeFactor |
enzymes | Enzyme names used in the viewpoint file and their corresponding recognition motifs |
genomes | Genome names used in the viewpoint file plus corresponding BSgenome packages |
bowtie2 | Path to corresponding bowtie2 index of reference genome. The reference genome assembly used to generate the index should match to the reference genome that was used to generate the BSgenome. NOTE: Some genome builds contain haplotype copies of the same chromosome. You should not include multiple haplotypes in the Bowtie2 index because the pipeline will consider reads mapping to two haplotypes as non-uniquely determined alignments. The pipeline removes chromosomes with 'hap' in the name when generating the fragmented genome. |
maxY | Maximal Y value in local 4C cis plot |
plotView | Number of bp to plot around viewpoint in local 4C cis plot |
xaxisUnit | X-axis unit (Mb, Kb or bp) |
plotType | Plots will either be saved as PDF or PNG |
binSize | Genome bin size used in the genome plot |
qualityCutoff | Q-score. Trim 3′-end of all sequences using a sliding window as soon as 2 out of 5 nucleotides have quality encoding less than the Q-score. 0 = no trimming |
trimLength | Trim reads to defined capture length from 3′-end. 0 = no trimming |
minAmountReads | Minimum required amount of reads containing the primer sequence. If less reads are identified the experiment will not be further processed |
readsQuality | Bowtie2 minimum required mapping quality score for mapped reads |
mapUnique | Extract uniquely mapped reads, based on the lack of XS tag |
cores | Number of CPU cores for parallelization |
wSize | The running mean window size |
nTop | Top fragment ends discarded for calculation of normalizeFactor |
nonBlind | Only keep non-blind fragments |
wig | Create wig files for all samples |
plot | Create viewpoint coverage plot for all samples |
genomePlot | Create genomeplot for all samples (only possible if analysis is “all” in vpFile) |
tsv | Create tab separated value file for all samples |
bins | Count reads for binned regions |
mismatchMax | The maximum number of mismatches allowed during demultiplexing |
Table 1. Description of parameters that need to be defined in the configuration file.
- Viewpoint file
- Experiment specific parameters for each 4C-seq experiment are organized in a viewpoint file. Parameters in this file are stored in a tab-delimited format, with each row containing information for a separate experiment:
expname | primer | firstenzyme | secondenzyme | genome | vpchr | vppos | analysis | fastq |
---|---|---|---|---|---|---|---|---|
mESC_Sox2 | GAGGGTAATTTTAGCCGATC | DpnII | Csp6I | mm9 | 3 | 34547661 | all | index1.fastq.gz |
mESC_Mccc1 | TTGCACCCGTCTTCTTGATC | DpnII | Csp6I | mm9 | 3 | 35873313 | cis | index1.fastq.gz |
Table 2. Example of a viewpoint file in which two experiments are demultiplexed from the same FASTQ file based on their primer sequence.
Name | Description |
---|---|
expname | Unique experiment name |
primer | Primer sequence |
firstenzyme | First restriction enzyme name (nearest to reading primer) |
secondenzyme | Second restriction enzyme name |
genome | Reference genome of interest |
vpchr | The chromosome that contains the viewpoint. This information is only required to calculate the stats, if you have multiple integrations or if you do not know your VP location, just use 1. |
vppos | Coordinate of viewpoint position. Any bp position within the VP can be used except the RE motifs. This information is only required to calculate the stats, if you have multiple integrations or if you do not know your VP location, just use 1. |
analysis | The final output tables will contain all reads (all) or only the reads that have been mapped to the VP chromosome (cis). For most analysis cis is sufficient and the generated output files will be smaller and therefore easier to process on local computers |
fastq | Name of the FASTQ file |
spacer (optional) | Spacer length. Number of nt included as spacer in the primer to enable out of phase sequencing. Default = 0. The spacer sequence will not be used for demultiplexing. If the spacer sequence is used as a barcode include the sequence in the primer sequence and set the spacer length to 0 |
Table 3. Description of parameters that are required in the viewpoint file for processing a 4C-seq experiment.
Rscript <path to pipe4C.R script> [any additional arguments].
A list of both required and optional parameters that are recognized by the pipe4C.R script are shown in table 4. Default values (except vpFile, fqFolder, outFolder and confFile) are stored in the configuration file.
For example,
Rscript pipe4C.R –-vpFile [path to vpFile] --fqFolder [path to folder containing the FASTQ files] –-outFolder [path to output folder] --cores 8 --wig --plot --genomePlot
will run the pipeline using 8 cores and generates a wig file, a viewpoint plot and a genome plot as output, next to the default outputs.
Name | Description |
---|---|
vpFile* | path to the viewpoint file |
fqFolder* | path to the folder containing the FASTQ files |
outFolder* | path to the output folder |
confFile | path to configuration file – default is conf.yml in folder containing the pipeline script |
mismatchMax | The maximum number of mismatches allowed during demultiplexing |
qualityCutoff | Q-score. Trim 3′-end of all sequences using a sliding window as soon as 2 out of 5 nucleotides has quality encoding less than the Q-score |
trimLength | Trim reads to defined capture length from 3′-end |
minAmountReads | Minimum required amount of reads containing the primer sequence. If less reads are identified the experiment will not be further processed |
readsQuality | Bowtie2 minimum required mapping quality score for mapped reads |
mapUnique | Extract uniquely mapped reads, based on the lack of XS tag |
cores | Number of CPU cores for parallelization |
wSize | The running mean window size |
nTop | Top fragment ends discarded for normalization |
nonBlind | Only keep non-blind fragments |
wig | Create wig files for all samples |
plot | Create viewpoint coverage plot for all samples |
genomePlot | Create genomeplot for all samples (only possible if analysis is “all” in vpFile) |
tsv | Create tab separated value file for all samples |
bins | Count reads for binned regions |
Table 4. Description of parameters that are recognized by the pipe4C.R script. * are required.
In this tutorial we will explain how to run the pipeline and perform peakC analysis on the Sox2 4C-seq data from Geeven et al. (SRA files GSM2824300, GSM2824301 and GSM2824302).
Download the pipeline and example files using the following command.
$ wget https://github.com/deLaatLab/pipe4C/archive/master.zip
$ unzip master.zip
$ cd ./pipe4C-master
note: the pipe4C.R and functions.R files need to be placed in the same folder. The FASTQ files and viewpoint file can be found in the example folder.
- Change the location in which the fragmented genome will be generated (fragFolder).
- Change the location in which the bowtie2 index is stored.
Rscript pipe4C.R --vpFile=./example/VPinfo.txt --fqFolder=./example/ --outFolder=./outF/ --cores 8 --plot --wig
This will run the pipeline using 8 cores and generates a viewpoint plot and wig file as output, next to the default outputs. note: Running multiple pipe4C processes in parallel that are using the same fragFolder in the conf.yml can corrupt the fragmented genome file.
A report file is generated that contains quality metrics of all the 4C-libraries processed by the pipeline.
The report file indicates that the quality of the experiment is good, as:
- ~80% of the reads map to the viewpoint chromosome (fragMappedCisPercCorr).
- ~75% of the reads mapping to the viewpoint chromosome maps within 1MB from the viewpoint (cov1Mb).
- Furthermore ~90% of the mappable DpnII fragment ends within 100kb from the viewpoint (capt100Kb) have at least one read.
The viewpoint plots can be found in the PLOT folder. A coverage plot of the viewpoint region is generated based on the normalized and smoothened data. The genomic region that is visualized, the height of the Y-axis, the unit of the X-axis (Mb, kb, bp) the file type (PDF or PNG) and window size are defined and can be adjusted in the configuration file. In addition the quality metrics are displayed in this figure for a quick impression of the data.
The normalized smoothened data will be written to a wig file for visualization in genome browsers such as the UCSC Genome Browser (https://genome.ucsc.edu/) and IGV.
The pipeline produces R-objects stored as rds files, which contain all mapped 4C-seq reads as well as the mapping statistics and viewpoint information relevant to the experiment. This allows for a more thorough interactive analysis and visualization of the data in R.
PeakC is a method that was designed to identify reproducible peaks in regions close to the viewpoint in 4C-seq data. It computes non-parametric statistics based on ranks of coverage of 4C fragment ends with respect to a background model. This model estimates the background contact frequency of both the region upstream and downstream of the viewpoint independently for each 4C-seq experiment individually and uses a statistical model to identify genomic regions that are significantly contacted.
To run peakC on the rds files created by the pipe4C pipeline, you need to load the peakC R-package (make sure it's installed first!) and source the pipe4C functions in R:
library(peakC)
pipe4CFunctionsFile <- "functions.R"
source(pipe4CFunctionsFile)
Next, you need to select rds files from 4C-Seq experiments generated with the same VP that you want to analyze.
Select 4C-seq experiments generated using the same VP
resultsDir <- "./outF/RDS/"
setwd(resultsDir)
rdsFiles <- c("set_1_viewpoint_10_ESC_replicate_1.rds","set_1_viewpoint_10_ESC_replicate_2.rds", "set_1_viewpoint_10_ESC_replicate_3.rds")
Now run peakC, with the default parameters
resPeakC <- doPeakC(rdsFiles = rdsFiles)
and plot the results
plot_C(resPeakC,y.max=750)
Finally, extract the identified peak regions from the peakC analysis from the objects and export the results in a BED file which can be used for visualization in a genome browser together with the generated wig files.
resPeaks <- getPeakCPeaks(resPeakC=resPeakC)
peaksFile <- "./outF/set_1_viewpoint_10_ESC_peakC_peaks.bed"
exportPeakCPeaks(resPeakC=resPeakC,bedFile=peaksFile,name="set_1_viewpoint_10_ESC_peakC_peaks")