A deep-learning method for detecting methylation state from Oxford Nanopore sequencing reads of plants.
deepsignal-plant applies BiLSTM to detect methylation from Nanopore reads. It is built on Python3 and PyTorch.
deepsignal-plant is built on Python3 and PyTorch. tombo is required to re-squiggle the raw signals from nanopore reads before running deepsignal-plant.
- Prerequisites:
Python3.*
tombo (version 1.5.1) - Dependencies:
numpy
h5py
statsmodels
scikit-learn
PyTorch (version >=1.2.0, <=1.6.0?)
We highly recommend to use a virtual environment for the installation of deepsignal-plant and its dependencies. A virtual environment can be created and (de)activated as follows by using conda:
# create
conda create -n deepsignalpenv python=3.6
# activate
conda activate deepsignalpenv
# deactivate
conda deactivate
The virtual environment can also be created by using virtualenv.
- After creating and activating the environment, download deepsignal-plant (lastest version) from github:
git clone https://github.com/PengNi/deepsignal-plant.git
cd deepsignal-plant
python setup.py install
or install deepsignal-plant using pip:
pip install deepsignal-plant
- PyTorch can be automatically installed during the installation of deepsignal-plant. However, if the version of PyTorch installed is not appropriate for your OS, an appropriate version should be re-installed in the same environment as the instructions:
# install using conda
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
# or install using pip
pip install torch==1.4.0 torchvision==0.5.0
- tombo is required to be installed in the same environment:
# install using conda
conda install -c bioconda ont-tombo
# or install using pip
pip install ont-tombo
Currently we have trained the following models:
- model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt: A 5mC model trained using A. thaliana and O. sativa R9.4 1D reads.
- (deprecated) model.dp2.CG.arabnrice2-1_R9.4plus_tem.bn13_sn16.balance.both_bilstm.b13_s16_epoch6.ckpt: A CpG (5mC) model trained using A. thaliana and O. sativa R9.4 1D reads.
- (deprecated) model.dp2.CHG.arabnrice2-1_R9.4plus_tem.bn13_sn16.denoise_signal_bilstm.both_bilstm.b13_s16_epoch4.ckpt: A CHG (5mC) model trained using A. thaliana and O. sativa R9.4 1D reads.
- (deprecated) model.dp2.CHH.arabnrice2-1_R9.4plus_tem.bn13_sn16.denoise_signal_bilstm.both_bilstm.b13_s16_epoch7.ckpt: A CHH (5mC) model trained using A. thaliana and O. sativa R9.4 1D reads.
- fast5s.sample.tar.gz: 4000 A. thaliana R9.4 raw reads, with a genome reference.
To call modifications, the raw fast5 files should be basecalled (Guppy>=3.6.1) and then be re-squiggled by tombo. At last, modifications of specified motifs can be called by deepsignal. Belows are commands to call 5mC in CG, CHG, and CHH contexts:
# Download and unzip the example data and pre-trained models.
# 1. guppy basecall using GPU
guppy_basecaller -i fast5s/ -r -s fast5s_guppy --config dna_r9.4.1_450bps_hac_prom.cfg --device CUDA:0
# 2. tombo resquiggle
cat fast5s_guppy/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s/ --fastq-filenames fast5s_guppy.fastq --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10
tombo resquiggle fast5s/ GCF_000001735.4_TAIR10.1_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_000 --basecall-group Basecall_1D_000 --overwrite
# 3. deepsignal-plant call_mods
# 5mCs in all contexts (CG, CHG, and CHH) can be called at one time
# C
CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s/ --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt --result_file fast5s.C.call_mods.tsv --corrected_group RawGenomeCorrected_000 --reference_path GCF_000001735.4_TAIR10.1_genomic.fna --motifs C --nproc 30 --nproc_gpu 6
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv --result_file fast5s.C.call_mods.frequency.tsv
# split 5mC call_freq file into CG/CHG/CHH call_freq files
python /path/to/deepsignal_plant/scripts/split_freq_file_by_5mC_motif.py --freqfile fast5s.C.call_mods.frequency.tsv
Before run deepsignal, the raw reads should be basecalled (Guppy>=3.6.1) and then be processed by the re-squiggle module of tombo.
Note:
- If the fast5 files are in multi-read FAST5 format, please use multi_to_single_fast5 command from the ont_fast5_api package to convert the fast5 files before using tombo (Ref to issue #173 in tombo).
multi_to_single_fast5 -i $multi_read_fast5_dir -s $single_read_fast5_dir -t 30 --recursive
- If the basecall results are saved as fastq, run the tombo proprecess annotate_raw_with_fastqs command before re-squiggle.
For the example data:
# 1. basecall using GPU
guppy_basecaller -i fast5s/ -r -s fast5s_guppy --config dna_r9.4.1_450bps_hac_prom.cfg --device CUDA:0
# or using CPU
guppy_basecaller -i fast5s/ -r -s fast5s_guppy --config dna_r9.4.1_450bps_hac_prom.cfg
# 2. proprecess fast5 if basecall results are saved in fastq format
cat fast5s_guppy/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s/ --fastq-filenames fast5s_guppy.fastq --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10
# 3. resquiggle, cmd: tombo resquiggle $fast5_dir $reference_fa
tombo resquiggle fast5s/ GCF_000001735.4_TAIR10.1_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_000 --basecall-group Basecall_1D_000 --overwrite
Features of targeted sites can be extracted for training or testing.
For the example data (By default, deepsignal-plant extracts 13-mer-seq and 13*16-signal features of each CpG motif in reads. Note that the value of --corrected_group must be the same as that of --corrected-group in tombo.):
# extract features of all Cs
deepsignal_plant extract -i fast5s --reference_path GCF_000001735.4_TAIR10.1_genomic.fna -o fast5s.C.features.tsv --corrected_group RawGenomeCorrected_000 --nproc 30 --motifs C
The extracted_features file is a tab-delimited text file in the following format:
- chrom: the chromosome name
- pos: 0-based position of the targeted base in the chromosome
- strand: +/-, the aligned strand of the read to the reference
- pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome
- readname: the read name
- read_strand: t/c, template or complement
- k_mer: the sequence around the targeted base
- signal_means: signal means of each base in the kmer
- signal_stds: signal stds of each base in the kmer
- signal_lens: lens of each base in the kmer
- raw_signals: signal values for each base of the kmer, splited by ';'
- methy_label: 0/1, the label of the targeted base, for training
To call modifications, either the extracted-feature file or the raw fast5 files (recommended) can be used as input.
For the example data:
# call 5mCs for instance
# extracted-feature file as input, use CPU
CUDA_VISIBLE_DEVICES=-1 deepsignal_plant call_mods --input_path fast5s.C.features.tsv --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt --result_file fast5s.C.call_mods.tsv --nproc 30
# extracted-feature file as input, use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s.C.features.tsv --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt --result_file fast5s.C.call_mods.tsv --nproc 30 --nproc_gpu 6
# fast5 files as input, use CPU
CUDA_VISIBLE_DEVICES=-1 deepsignal_plant call_mods --input_path fast5s/ --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt --result_file fast5s.C.call_mods.tsv --corrected_group RawGenomeCorrected_000 --reference_path GCF_000001735.4_TAIR10.1_genomic.fna --motifs C --nproc 30
# fast5 files as input, use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s/ --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt --result_file fast5s.C.call_mods.tsv --corrected_group RawGenomeCorrected_000 --reference_path GCF_000001735.4_TAIR10.1_genomic.fna --motifs C --nproc 30 --nproc_gpu 6
The modification_call file is a tab-delimited text file in the following format:
- chrom: the chromosome name
- pos: 0-based position of the targeted base in the chromosome
- strand: +/-, the aligned strand of the read to the reference
- pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome
- readname: the read name
- read_strand: t/c, template or complement
- prob_0: [0, 1], the probability of the targeted base predicted as 0 (unmethylated)
- prob_1: [0, 1], the probability of the targeted base predicted as 1 (methylated)
- called_label: 0/1, unmethylated/methylated
- k_mer: the kmer around the targeted base
A modification-frequency file can be generated by call_freq
function with the call_mods file as input:
# call 5mCs for instance
# output in tsv format
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv --result_file fast5s.C.call_mods.frequency.tsv
# output in bedMethyl format
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv --result_file fast5s.C.call_mods.frequency.bed --bed
# use --sort to sort the results
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv --result_file fast5s.C.call_mods.frequency.bed --bed --sort
The modification_frequency file can be either saved in bedMethyl format (by setting --bed
as above), or saved as a tab-delimited text file in the following format by default:
- chrom: the chromosome name
- pos: 0-based position of the targeted base in the chromosome
- strand: +/-, the aligned strand of the read to the reference
- pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome
- prob_0_sum: sum of the probabilities of the targeted base predicted as 0 (unmethylated)
- prob_1_sum: sum of the probabilities of the targeted base predicted as 1 (methylated)
- count_modified: number of reads in which the targeted base counted as modified
- count_unmodified: number of reads in which the targeted base counted as unmodified
- coverage: number of reads aligned to the targeted base
- modification_frequency: modification frequency
- k_mer: the kmer around the targeted base
# please use deepsignal_plant denoise -h/--help for instructions
deepsignal_plant denoise --train_file /path/to/train/file
A new model can be trained as follows:
# need to split training samples to two independent datasets for training and validating
# please use deepsignal_plant train -h/--help for instructions
deepsignal_plant train --train_file /path/to/train/file --valid_file /path/to/valid/file --model_dir /dir/to/save/the/new/model
Copyright (C) 2020 Jianxin Wang, Feng Luo, Peng Ni
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Jianxin Wang, Peng Ni, School of Computer Science and Engineering, Central South University, Changsha 410083, China
Feng Luo, School of Computing, Clemson University, Clemson, SC 29634, USA