Bug: cutadapt adapter trimming with default 3 bases overlap instead of 6 - bug in Snakemake templates #21
Description
cutadapt command in snakemake templates (for example cemba_data/mapping/Snakefile_template/mc.Snakefile) has a bug
Trim reads
rule trim_r1:
input:
"fastq/{cell_id}-R1.fq.gz"
output:
fq=temp("fastq/{cell_id}-R1.trimmed.fq.gz"),
stats=temp("fastq/{cell_id}-R1.trimmed.stats.tsv")
threads:
2
shell:
"cutadapt --report=minimal -a {r1_adapter} {input} 2> {output.stats} | "
"cutadapt --report=minimal -O 6 -q 20 -u {r1_left_cut} -u -{r1_right_cut} -m 30 "
"-o {output.fq} - >> {output.stats}"
rule trim_r2:
input:
"fastq/{cell_id}-R2.fq.gz"
output:
fq=temp("fastq/{cell_id}-R2.trimmed.fq.gz"),
stats=temp("fastq/{cell_id}-R2.trimmed.stats.tsv")
threads:
2
shell:
"cutadapt --report=minimal -a {r2_adapter} {input} 2> {output.stats} | "
"cutadapt --report=minimal -O 6 -q 20 -u {r2_left_cut} -u -{r2_right_cut} -m 30 "
"-o {output.fq} - >> {output.stats}"
Adaptor trimming is defined here:
"cutadapt --report=minimal -a {r1_adapter} {input} 2> {output.stats} | "
then output is piped to another instance of cutadapt:
"cutadapt --report=minimal -O 6 -q 20 -u {r2_left_cut} -u -{r2_right_cut} -m 30 "
that has -O 6 option for specifying minimal overlap of adaptor and read
But -O option is not taken into account since adaptor trimming happened already at the first execution of cutadapt
Thus reads are not trimmed from adaptor if there is 6 bases overlap but only if there is 3 bases overlap
Broad institute version of this pipeline (https://broadinstitute.github.io/warp/docs/Pipelines/snm3C/README#snm3c-installation) does not have this problem:
start=$(date +%s)
echo "Run cutadapt"
/opt/conda/bin/cutadapt \
-a R1Adapter=~{r1_adapter} \
-A R2Adapter=~{r2_adapter} \
--report=minimal -O 6 -q 20 \
-u ~{r1_left_cut} \
-u -~{r1_right_cut} \
-U ~{r2_left_cut} \
-U -~{r2_right_cut} \
-Z \
-m ${min_read_length}:${min_read_length} \
--pair-filter 'both' \
-o ${sample_id}-R1_trimmed.fq.gz \
-p ${sample_id}-R2_trimmed.fq.gz \
${sample_id}-R1_sorted.fq ${sample_id}-R2_sorted.fq \
> ${sample_id}.trimmed.stats.txt
end=$(date +%s)
elapsed=$((end - start))
echo "Elapsed time to run cutadapt: $elapsed seconds"