A snakemake pipeline for processing data generated using the CEL-Seq protocol. It takes BCL or fastq input files and generates a single-cell experiment object using scPipe. It also performs QC using FastQ Screen and FastQC, collated in a MultiQC report. In principle, the pipeline can be used for a range of single-cell protocols.
The only prerequisite is snakemake. To install snakemake, you will need to install a Conda-based Python3 distribution. For this, Mambaforge is recommended. Once mamba is installed, snakemake can be installed like so:
mamba create -c conda-forge -c bioconda -n snakemake snakemake
Now activate the snakemake environment (you'll have to do this every time you want to run the pipeline):
conda activate snakemake
Now clone the repository:
git clone https://github.com/WEHISCORE/CELSeq-pipeline.git
cd CELSeq-pipeline
If you would like to test the pipeline, first download and prepare the test data:
(cd .test && ./download_test_data.sh)
mkdir -p fastq && ln -s "$PWD"/.test/*fastq.gz fastq
You'll now have to generate a STAR index for the test genome:
mamba install -c conda-forge -c bioconda star=2.7.8a
STAR --runMode genomeGenerate --genomeDir .test/ERCC92-STAR-index --genomeFastaFiles .test/ERCC92.fa
And a Bowtie 2 index for FastQ Screen:
mamba install -c conda-forge -c bioconda bowtie2=2.4.2
mkdir -p .test/ERCC-bowtie-index
bowtie2-build .test/ERCC92.fa .test/ERCC-bowtie-index/ERCC
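Before pointing the config files at these paths, it can help to confirm that both index builds actually wrote their files. The following is a minimal sketch and not part of the pipeline: check_indexes is a hypothetical helper, and it assumes the usual output file names for a small genome (Genome, SA and SAindex from STAR, and <prefix>.1.bt2 from bowtie2-build).

```shell
# Hypothetical helper (not part of the pipeline): report index files that
# are missing or empty. Assumes the usual STAR and bowtie2-build file names.
check_indexes() {
    local star_dir="$1" bowtie_prefix="$2" f missing=0
    for f in "$star_dir/Genome" "$star_dir/SA" "$star_dir/SAindex" \
             "$bowtie_prefix.1.bt2"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    # Returns 0 if every file was found, 1 otherwise.
    return "$missing"
}

# Example call for the test data (adjust to wherever you built the indexes):
# check_indexes .test/ERCC92-STAR-index .test/ERCC-bowtie-index/ERCC
```

The function prints nothing and returns 0 when all files are present, so it can be chained in front of the snakemake command with &&.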
Make sure your config/config.yaml and config/fastq_screen.conf reflect these paths (you can comment out other indexes in the FastQ Screen config). Now run as follows:
snakemake --use-conda --conda-frontend mamba --cores 1
The configuration file is found under config/config.yaml and the config file for FastQ Screen is found under config/fastq_screen.conf. Please go through these settings carefully. The main settings to consider are:
process_from_bcl -- set this to True only if converting from BCL files. If so, make sure the demultiplexing argument bcl2fastq (under params) is set properly.
sample_sheet -- this is the sample sheet for bcl2fastq conversion. Please check the bcl2fastq documentation for more info. You can skip this if you're using fastq files.
barcode_file -- a comma-separated file with an ID column (matching your well/cell IDs) and the corresponding barcode, in the following format:
ID,Cell_Barcode
S1,ATATATAT
S2,GCGCGCGC
gtf and star_index under ref -- make sure the chromosome names match for these and that you've generated the index with STAR 2.7.8a, as this is the version used by the pipeline.
read_structure -- ensure barcode_in_r1 is set to TRUE if your barcodes are in R1 (which is standard for CEL-Seq). WEHI's modified CEL-Seq protocol uses a barcode size of 7 (the barcode_len_2 default), so set this to 8 if using a standard version of the protocol.
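The barcode_file format described above is simple enough to sanity-check before a run. The following is a minimal sketch and not part of the pipeline: check_barcode_file is a hypothetical helper, and it assumes the header is exactly ID,Cell_Barcode, that barcodes contain only A, C, G, T or N, and that all barcodes share one length and are unique.

```shell
# Hypothetical helper (not part of the pipeline): sanity-check a barcode file.
# Assumptions: header is "ID,Cell_Barcode"; barcodes use only A, C, G, T, N;
# all barcodes have the same length and none is repeated.
check_barcode_file() {
    awk -F',' '
        NR == 1 {
            if ($0 != "ID,Cell_Barcode") { print "bad header: " $0; bad = 1 }
            next
        }
        NF != 2            { print "line " NR ": expected 2 columns"; bad = 1; next }
        $2 !~ /^[ACGTN]+$/ { print "line " NR ": invalid barcode " $2; bad = 1; next }
        len == 0           { len = length($2) }   # first barcode sets the length
        length($2) != len  { print "line " NR ": barcode length differs"; bad = 1 }
        seen[$2]++         { print "line " NR ": duplicate barcode " $2; bad = 1 }
        END                { exit bad + 0 }
    ' "$1"
}

# Demo on a temporary copy of the example shown above.
example=$(mktemp)
printf 'ID,Cell_Barcode\nS1,ATATATAT\nS2,GCGCGCGC\n' > "$example"
check_barcode_file "$example" && echo "barcode file looks OK"
```

The check does not compare the barcode length against barcode_len_2 in the config; that comparison is left to the reader because it depends on how the config is laid out.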
If you are running from BCL, put your BCL files under the bcl_input directory. If you are running from fastq files, put them all under a fastq directory in the directory from which you run the pipeline, and make sure they are named fastq/{sample}_R1.fastq.gz and fastq/{sample}_R2.fastq.gz.
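The naming convention above can be checked with a few lines of shell before starting a run. This is a minimal sketch and not part of the pipeline: check_fastq_pairs is a hypothetical helper that only checks that every {sample}_R1.fastq.gz has a matching {sample}_R2.fastq.gz in the same directory.

```shell
# Hypothetical helper (not part of the pipeline): list samples whose R1 file
# has no matching R2 file. The directory defaults to ./fastq.
check_fastq_pairs() {
    local dir="${1:-fastq}" r1 sample missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        sample=$(basename "$r1" _R1.fastq.gz)    # strip the R1 suffix
        if [ ! -e "$dir/${sample}_R2.fastq.gz" ]; then
            echo "missing R2 for sample: $sample"
            missing=1
        fi
    done
    # Returns 0 if every R1 has an R2, 1 otherwise.
    return "$missing"
}
```

Calling check_fastq_pairs with no argument from the pipeline directory checks the fastq directory. It does not detect R2 files that lack an R1, or files that do not follow the naming pattern at all.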
Run the pipeline as follows:
conda activate snakemake
snakemake --use-conda --conda-frontend mamba --cores 1
If you want to submit your jobs to the cluster using SLURM, use the following to run the pipeline:
conda activate snakemake
snakemake --use-conda --conda-frontend mamba --profile slurm --jobs 8 --cores 24
The pipeline will generate all results under a results directory. The final output will be under results/sc_demultiplex/{sample}/sce.rds.
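Once a run finishes, the output layout above makes it easy to see which samples produced a final object. This is a minimal sketch and not part of the pipeline: check_outputs is a hypothetical helper, and it assumes that each subdirectory of results/sc_demultiplex corresponds to one sample.

```shell
# Hypothetical helper (not part of the pipeline): for each sample directory
# under <results>/sc_demultiplex, report whether a non-empty sce.rds exists.
check_outputs() {
    local base="${1:-results}/sc_demultiplex" d sample missing=0
    for d in "$base"/*/; do
        [ -d "$d" ] || continue          # glob matched nothing
        sample=$(basename "$d")
        if [ -s "${d}sce.rds" ]; then
            echo "ok: $sample"
        else
            echo "missing sce.rds: $sample"
            missing=1
        fi
    done
    # Returns 0 if every sample directory has an sce.rds, 1 otherwise.
    return "$missing"
}
```

Samples that failed before their output directory was created will not appear in the listing, so an empty result does not by itself mean the run succeeded.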