BMC Bioinformatics - HoCoRT: host contamination removal tool
bioRxiv - HoCoRT: Host contamination removal tool
Host Contamination Removal Tool (HoCoRT)
Removes specific organisms from sequencing reads on Linux.
Supports paired and unpaired FASTQ input. Output is written in FASTQ format.
Python 3.7+
External programs:
- Bowtie2 (Tested with version 2.4.5)
- HISAT2 (Tested with version 2.2.1)
- Kraken2 (Tested with version 2.1.2)
- bwa-mem2 (Tested with version 2.2.1)
- BBMap (Tested with version 38.96)
- Minimap2 (Tested with version 2.24)
- BioBloomTools (Tested with version 2.3.5)
- samtools (Tested with version 1.15)
To install with Bioconda, run the following command:
conda install -c conda-forge -c bioconda hocort
HoCoRT's dependencies may conflict with existing packages. This can be solved by installing HoCoRT in a separate environment. To create a new conda environment and install HoCoRT into it, run the following command:
conda create -n hocort -c conda-forge -c bioconda hocort
To install manually with the installation script instead, first ensure that no conda environment called "hocort" already exists.
Now download the necessary files:
wget https://raw.githubusercontent.com/ignasrum/hocort/main/install.sh && wget https://raw.githubusercontent.com/ignasrum/hocort/main/environment.yml
After downloading the files, run the installation bash script to install HoCoRT:
bash ./install.sh
Once the installation has finished, activate the conda environment:
conda activate hocort
Pipelines are named after the tools they utilize. For example, the pipeline bowtie2 uses Bowtie2 to map the reads, and kraken2bowtie2 first classifies using Kraken2, then maps using Bowtie2.
Indexes are required to map sequences. They may be built either manually or with "hocort index", which simplifies the process. A Bowtie2 index may be built using "hocort index" with the following command:
hocort index bowtie2 --input genome.fasta --output dir/basename
To remove multiple organisms from sequencing reads, the input FASTA file should contain the genomes of all of them. The genomes can be concatenated into a single file:
cat genome1.fasta genome2.fasta > combined.fasta
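The same concatenation can be sketched in Python, which may be convenient when the index is built from a script. This is not part of HoCoRT: `combine_fasta` is a hypothetical helper. Unlike plain `cat`, it inserts a newline when an input file does not end with one, so the next file's header cannot be glued onto the previous sequence line.

```python
import os
import shutil


def combine_fasta(inputs, output):
    """Concatenate several FASTA files into one file, in the given order."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream the file so that large genomes are not held in memory.
                shutil.copyfileobj(src, out)
                # If the file is non-empty and lacks a trailing newline,
                # add one before the next file's header line is written.
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
```

Calling `combine_fasta(["genome1.fasta", "genome2.fasta"], "combined.fasta")` should produce a file that can be passed to "hocort index" as shown above.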
To map paired reads and write the mapped or unmapped reads to the output files, use the following command:
hocort map bowtie2 -x dir/basename -i input1.fastq input2.fastq -o out1.fastq out2.fastq
The same as above, but for unpaired reads, with one input file and one output file:
hocort map bowtie2 -x dir/basename -i input1.fastq -o out1.fastq
Most pipelines support .gz compressed input and output. No extra configuration is required aside from giving the filename a ".gz" extension.
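A quick way to see how many reads a run removed is to count the records in the input and output files. The sketch below is not part of HoCoRT: `open_maybe_gzip` and `count_fastq_reads` are hypothetical helpers that assume the standard four lines per FASTQ record and, mirroring the convention above, treat any filename ending in ".gz" as gzip-compressed.

```python
import gzip


def open_maybe_gzip(path):
    """Open a file as text, decompressing it if the name ends in '.gz'."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_reads(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open_maybe_gzip(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Comparing `count_fastq_reads("input1.fastq.gz")` with `count_fastq_reads("out1.fastq.gz")` then gives a rough number of reads that were filtered out.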
The "--filter true/false" argument may be used to switch between outputting mapped and unmapped sequences. For example, if the reads are contaminated with human sequences and the index was built from the human genome, use "--filter true" to output only the unmapped sequences (everything except the human reads).
The "--filter true/false" argument may also be used to extract specific sequences. First, build the index from the genomes of the organisms to extract. Second, map the sequencing reads with "--filter false" to output only the mapped sequences (those which map to the index containing the genomes of the specific organisms).
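The two modes can be summarised with a small conceptual model. This is only an illustration of the semantics described above, not HoCoRT's implementation: `select_reads` and its parameters are hypothetical, and the read IDs are made up.

```python
def select_reads(read_ids, mapped_ids, filter_mapped=True):
    """Return the read IDs that would end up in the output.

    filter_mapped=True  models "--filter true":  mapped reads are discarded.
    filter_mapped=False models "--filter false": only mapped reads are kept.
    """
    mapped = set(mapped_ids)
    if filter_mapped:
        return [r for r in read_ids if r not in mapped]
    return [r for r in read_ids if r in mapped]


reads = ["read1", "read2", "read3"]
mapped_to_index = ["read2"]  # e.g. reads that map to the human genome

# Decontamination: keep everything that did not map to the index.
clean = select_reads(reads, mapped_to_index, filter_mapped=True)
# Extraction: keep only what mapped to the index.
extracted = select_reads(reads, mapped_to_index, filter_mapped=False)
```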
HoCoRT can be imported in Python scripts and programs with "import hocort". This allows precise configuration of the tools being run.
from hocort.pipelines.bowtie2 import Bowtie2
idx = "dir/basename"
seq1 = "in1.fastq"
seq2 = "in2.fastq"
out1 = "out1.fastq"
out2 = "out2.fastq"
# Options are passed to the underlying aligner/mapper, which allows precise configuration of the tool.
options = ["--local", "--very-fast-local"]
returncode = Bowtie2().run(idx, seq1, out1, seq2=seq2, out2=out2, options=options)
It is possible to pass arguments to the underlying tools by specifying them in the -c/--config argument like this:
hocort map bowtie2 -c="--local --very-fast-local --score-min G,21,9"
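When HoCoRT is driven from a script through its command line rather than through "import hocort", it can help to assemble the argument list in one place. The sketch below uses only the flags shown in this README (-x, -i, -o, --filter, -c). `build_map_command` is a hypothetical helper, and it assumes that these flags can be combined in a single invocation.

```python
import shlex


def build_map_command(pipeline, index, inputs, outputs,
                      filter_mapped=None, config=None):
    """Build the argument list for a `hocort map` call.

    inputs/outputs hold one path for unpaired or two paths for paired reads.
    """
    if len(inputs) not in (1, 2) or len(inputs) != len(outputs):
        raise ValueError("expected one or two inputs and as many outputs")
    cmd = ["hocort", "map", pipeline, "-x", index,
           "-i", *inputs, "-o", *outputs]
    if filter_mapped is not None:
        cmd += ["--filter", "true" if filter_mapped else "false"]
    if config:
        # Passed as a single argument, like -c="..." on the shell.
        cmd.append(f"-c={config}")
    return cmd


def to_shell_string(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with the space-separated options inside -c.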