Filter out sequences in FASTQ files that match a reference genome

petersm3/filterfastq

Summary

Filter out sequences in Illumina HiSeq/MiSeq gzipped FASTQ files (single or paired-end) that match a reference genome, e.g., PhiX (Control Libraries)

Disclaimer

  • For removal of contamination it is recommended that you use other programs, e.g., JGI's BBDuk "Kmer filtering" feature as part of Data Preprocessing
  • filterfastq.pl was originally written as a proof of concept in 2015 (and used in production on a limited basis) due to Illumina's elimination of the dedicated PhiX control lane (#8) on their HiSeq 3000/4000 instruments
    • e.g., Sequencing a non-indexed sample (whole genome) with a large PhiX spike-in (e.g., 10%) may have required filtering of the final set of FASTQ reads
  • filterfastq.pl uses NCBI's command-line BLAST (blastn) to compare sequences against the reference genome (stored as a BLAST database) in a semi-serial fashion: reads are processed in sets of 4 million, with multiple blastn threads per set. This can take a long time
    • e.g., approximately 11.75 days to process 264,724,686 single-end reads with a recorded PhiX spike-in of 0.6%; the script filtered out 0.521% of the reads
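
The semi-serial approach described above can be illustrated with a minimal Python sketch. This is not the code of filterfastq.pl (which is a Perl script); it only shows the general idea of reading a gzipped FASTQ file in fixed-size sets, building a blastn command for each set, and dropping the reads that produced a hit. All function names here (`read_fastq`, `batches`, `blastn_command`, `read_id`, `drop_matches`) are hypothetical, and the choice of tabular output (`-outfmt 6`) is an assumption, not something the script is known to use.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, plus, quality) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:  # an empty header line means end of file
                return
            yield tuple(record)


def batches(records, size=4_000_000):
    """Group an iterable of records into lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def blastn_command(query_fasta, database, threads=1, evalue="1e-15"):
    """Build (but do not run) a blastn command line for one set of reads.

    Assumption: tabular output (-outfmt 6), whose first column is the query ID.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-outfmt", "6",
    ]


def read_id(header):
    """Return the read ID: the header text after '@', up to the first whitespace."""
    return header[1:].split()[0]


def drop_matches(batch, matched_ids):
    """Keep only the records whose read ID is not in the set of matched IDs."""
    return [record for record in batch if read_id(record[0]) not in matched_ids]
```

Assuming sets of exactly 4 million reads, the 264,724,686-read example above would correspond to 67 sets, and 11.75 days works out to roughly 261 reads per second overall.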

Dependencies

Usage

$ ./filterfastq.pl -h

Usage: ./filterfastq.pl -r reference_genome.fa -i fastq_input_dir -o fastq_output_dir [-t threads] [-e evalue]

-r, --reference
       FASTA reference genome for sequences to be filtered against
-i, --input
       Directory containing the gzipped FASTQ file(s) to be filtered
       Sets of paired-end reads need to be in the same directory and are
       identified by their filenames containing either *_R1_* and *_R2_*
-o, --output
       Directory to store the filtered FASTQ and intermediary files
       Script will create output directory but not parents, e.g., mkdir -p
-t, --threads
       Optional number of threads for blastn to use (default is 1)
-e, --evalue
       Optional evalue for blastn to use (default is 1e-15)
-q, --quiet
       Do not write script progress to standard output
-h, --help
       This usage information.
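
The usage text says paired-end files are recognised by `_R1_` and `_R2_` in their filenames. A minimal Python sketch of that kind of matching follows. It is an illustration under assumptions, not the script's actual logic: the function name `group_fastq_files` is hypothetical, the accepted extensions (`.fastq.gz`, `.fq.gz`) are a guess, and treating an unmatched file as single-end is also an assumption.

```python
def group_fastq_files(filenames):
    """Split gzipped FASTQ filenames into (R1, R2) pairs and unpaired files.

    A file containing '_R1_' is paired with the file of the same name in which
    the first '_R1_' is replaced by '_R2_', if that file is present.
    """
    # Assumption: only these two extensions count as gzipped FASTQ files.
    names = sorted(f for f in filenames if f.endswith((".fastq.gz", ".fq.gz")))
    remaining = set(names)
    pairs, singles = [], []
    for name in names:
        if name not in remaining:
            continue  # already consumed as the R2 mate of an earlier file
        mate = name.replace("_R1_", "_R2_", 1) if "_R1_" in name else None
        if mate is not None and mate in remaining:
            pairs.append((name, mate))
            remaining -= {name, mate}
        else:
            singles.append(name)
            remaining.discard(name)
    return pairs, singles


if __name__ == "__main__":
    example = [
        "sampleA_S1_L001_R1_001.fastq.gz",
        "sampleA_S1_L001_R2_001.fastq.gz",
        "sampleB_S2_L001_R1_001.fastq.gz",
    ]
    pairs, singles = group_fastq_files(example)
    print("paired:", pairs)
    print("single:", singles)
```

In practice the filename list might come from `os.listdir()` on the directory passed to `-i`; the example filenames above are made up for illustration.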
