Skip to content

Bug: cutadapt adapter trimming with default 3 bases overlap instead of 6 - bug in Snakemake templates #21

Open
@mbatiuk

Description

cutadapt command in snakemake templates (for example cemba_data/mapping/Snakefile_template/mc.Snakefile) has a bug

Trim reads

rule trim_r1:
input:
"fastq/{cell_id}-R1.fq.gz"
output:
fq=temp("fastq/{cell_id}-R1.trimmed.fq.gz"),
stats=temp("fastq/{cell_id}-R1.trimmed.stats.tsv")
threads:
2
shell:
"cutadapt --report=minimal -a {r1_adapter} {input} 2> {output.stats} | "
"cutadapt --report=minimal -O 6 -q 20 -u {r1_left_cut} -u -{r1_right_cut} -m 30 "
"-o {output.fq} - >> {output.stats}"

rule trim_r2:
input:
"fastq/{cell_id}-R2.fq.gz"
output:
fq=temp("fastq/{cell_id}-R2.trimmed.fq.gz"),
stats=temp("fastq/{cell_id}-R2.trimmed.stats.tsv")
threads:
2
shell:
"cutadapt --report=minimal -a {r2_adapter} {input} 2> {output.stats} | "
"cutadapt --report=minimal -O 6 -q 20 -u {r2_left_cut} -u -{r2_right_cut} -m 30 "
"-o {output.fq} - >> {output.stats}"

Adaptor trimming is defined here:

"cutadapt --report=minimal -a {r1_adapter} {input} 2> {output.stats} | "

then output is piped to another instance of cutadapt:

"cutadapt --report=minimal -O 6 -q 20 -u {r2_left_cut} -u -{r2_right_cut} -m 30 "

that has -O 6 option for specifying minimal overlap of adaptor and read

But -O option is not taken into account since adaptor trimming happened already at the first execution of cutadapt

Thus reads are not trimmed from adaptor if there is 6 bases overlap but only if there is 3 bases overlap

Broad institute version of this pipeline (https://broadinstitute.github.io/warp/docs/Pipelines/snm3C/README#snm3c-installation) does not have this problem:

      start=$(date +%s)
      echo "Run cutadapt"
      /opt/conda/bin/cutadapt \
      -a R1Adapter=~{r1_adapter} \
      -A R2Adapter=~{r2_adapter} \
      --report=minimal -O 6 -q 20 \
      -u ~{r1_left_cut} \
      -u -~{r1_right_cut} \
      -U ~{r2_left_cut} \
      -U -~{r2_right_cut} \
      -Z \
      -m ${min_read_length}:${min_read_length} \
      --pair-filter 'both' \
      -o ${sample_id}-R1_trimmed.fq.gz \
      -p ${sample_id}-R2_trimmed.fq.gz \
      ${sample_id}-R1_sorted.fq ${sample_id}-R2_sorted.fq \
      > ${sample_id}.trimmed.stats.txt
      end=$(date +%s) 
      elapsed=$((end - start)) 
      echo "Elapsed time to run cutadapt: $elapsed seconds"

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions