Deepro Banerjee, Michael A. Jindra, Alec J. Linot, Brian F. Pfleger, Costas D. Maranas, EnZymClass: Substrate specificity prediction tool of plant acyl-ACP thioesterases based on ensemble learning, Current Research in Biotechnology, Volume 4, 2022, Pages 1-9, ISSN 2590-2628, https://doi.org/10.1016/j.crbiot.2021.12.002.
EnZymClass has been tested on macOS and Linux.
foo@bar:~$ conda create -n enzymclass -c conda-forge -c bioconda -c anaconda python=3.9 scikit-learn pandas multiprocess blast wget bioconductor-kebabs
foo@bar:~$ conda activate enzymclass
foo@bar:~$ pip install ngrampro ifeatpro pssmpro
- root_dir: The directory path where all intermediate and final output files generated by EnZymClass will be stored. Look at the section "Output directory structure" for more details.
- train_file: The path to the csv file which contains information about the training sequences required to train EnZymClass. For more details see "Train file format" section.
- test_file: The path to the csv file which contains information about the test sequences whose labels will be predicted by EnZymClass. For more details see "Test file format" section.
The train file should be a CSV file without a header row. The format is as follows:
protein_unique_name,protein_sequence,protein_numerical_category
For example:
A._hypogaea_l._(AhFatA),MLKVSCNGSDRVQFMAQCGFAGQPASVLVRRRSVSAVGFGYPMNRVLSVRAIVSDRDGAVVNRVGAEAGTLADRLRLGSLTEDGLSYKEKFIVRSYEVGINKTATVETIANLLQEVGCNHAQSVGYSTDGFATTPTMRKLGLIWVTARMHIEVYKYPAWSDVVEIETWCQGEGRVGIRRDFILKDYATDQVIGRATSKWLMMNQETRRLQKVSDDVREEVLIYCPREPRLAIPEEDSNCLKKIPKLEDPGQYSRLRLMPRRADLDMNQHVNNVTYIGWVLESMPQEIIDSHELHSITLDYRRECQRDDIVDSLTSIEGDGVLLEVNGTNGSSVAWEHGHAYQQFLHLLKLSTDEGLEINRGRTAWRKKASRL,1
.
.
.
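Before a long run, it can help to sanity-check the train file against the three-column layout described above. The following is a minimal sketch, not part of EnZymClass: the function name `check_train_rows` is hypothetical, and the restriction to the 20 standard amino-acid letters and to unique names is an assumption, since this README does not say how other inputs are handled.

```python
import csv
import io

# Assumption: only the 20 standard residues are expected. Whether EnZymClass
# accepts other letters (e.g. X, B, Z) is not stated in this README.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_train_rows(csv_text):
    """Return (line_number, problem) tuples for a header-less train file."""
    problems = []
    seen_names = set()
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            problems.append((line_no, f"expected 3 columns, found {len(row)}"))
            continue
        name, sequence, category = (field.strip() for field in row)
        if name in seen_names:
            problems.append((line_no, f"duplicate name {name!r}"))
        seen_names.add(name)
        if not sequence or set(sequence.upper()) - AMINO_ACIDS:
            problems.append((line_no, "sequence is empty or has non-standard residues"))
        try:
            int(category)  # the third column is a numerical category
        except ValueError:
            problems.append((line_no, f"category {category!r} is not an integer"))
    return problems
```

For a file on disk, the text can be read with `open("/path/to/train.csv").read()` and passed to the function; an empty list means no problems were found by these particular checks.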
The test file should be a CSV file without a header row. The format is as follows:
protein_unique_name,protein_sequence
For example:
Uncharacterized_protein__ECO_0000313_EMBL_EMT12172.1_,MAGSVASGFFPTPGSSPAASARGSKNMSGELPESLSVRGMVAKPNTPPASMQVKARAQALPKVNGSKVNLKTTGSDKEDTVPYTSSKTFYNQLPDWSMLLAAVTTIFLAAEKQWTMLDWKPKRPDMLVDTFGFGRIIQDGLVFRQNFLIRSYEIGADRTASIETLMNHLQETALNHVKTAGLLGDGFGATPEMSKRNLIWVVSKIQLLVEHYPSWEDMVQVDTWVASAGKNGMRRDWHIRDYNSGRTILKATSVWVMMNKTTRRLSKMPDEVRGEIGPHFNDRSAITEEQGEKLAKPRNKVVDPANKQFIRKGLTPKWGDLDVNQHVNNVKYIGWILESAPISILEKHELASMTLDYRKECCRDSVLQSLTNVSGECVDGSPDSAIQCDHLLQLESGADVVKAHTTWRPKRAHGEGNLGLFPVESA
.
.
.
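If the test sequences are held in memory, for example after parsing them from another format, the two-column layout above can be produced with Python's standard `csv` module. This is a rough sketch under stated assumptions: the helper name `to_test_csv` is hypothetical, and rejecting names that contain commas is a precaution, because this README does not say whether EnZymClass reads quoted CSV fields.

```python
import csv
import io


def to_test_csv(records):
    """Render (name, sequence) pairs as header-less CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name, sequence in records:
        if "," in name:
            # Assumption: avoid quoted fields, whose handling is undocumented.
            raise ValueError(f"name {name!r} contains a comma")
        # Remove spaces and line breaks, e.g. from wrapped sequence text.
        writer.writerow([name, "".join(sequence.split())])
    return buffer.getvalue()
```

The returned text can then be written to the path passed as `test_file`.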
EnZymClass will create the following directory structure to store all intermediate and final output files.
📦root
┣ 📂features
┣ 📂label
┣ 📂mappings
┣ 📂predictions
┣ 📂seq
┗ 📂validation
Here, root refers to the root directory path provided by the user as the first required argument of EnZymClass. Under root there are 6 directories created by EnZymClass whose contents are defined as follows:
- features: The protein sequences provided to EnZymClass are numerically encoded and stored in this directory. For the types of feature encodings used by EnZymClass, please see our paper.
- label: The training protein aliases and their corresponding numerical categories are stored here.
- mappings: Protein aliases created by EnZymClass mapped to their original user provided names are stored here.
- predictions: EnZymClass' test set predictions are stored here. For the format of this file, see the "Output file format" section.
- seq: Protein aliases mapped to their user provided sequences are stored here.
- validation: EnZymClass validation dataset predictions for N runs are stored here. For each run, EnZymClass creates a different training and validation set and assesses its own performance based on these multiple runs. We refer the reader to our paper for more details.
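To confirm that a run produced the layout described above, the six directory names can be checked with `pathlib`. This is only a sketch: the helper name `missing_subdirs` is hypothetical, and the directory names are taken directly from the list above.

```python
from pathlib import Path

# Directory names as listed in this README.
EXPECTED_SUBDIRS = ("features", "label", "mappings", "predictions", "seq", "validation")


def missing_subdirs(root_dir):
    """Return the expected sub-directories that do not exist under root_dir."""
    root = Path(root_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list from `missing_subdirs("/path/to/root_dir")` means all six directories are present; it says nothing about their contents.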
The output prediction file is a CSV file without a header row. The format is as follows:
protein_unique_name,predicted_numerical_category
For example:
Uncharacterized_protein__ECO_0000313_EMBL_EMT12172.1_,1
.
.
.
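Because the prediction file has no header row, a quick summary needs the columns to be addressed by position. The sketch below counts how many proteins were assigned to each category; the function name `count_predictions` is hypothetical and the code only assumes the two-column layout shown above.

```python
import csv
import io
from collections import Counter


def count_predictions(csv_text):
    """Count predicted categories in header-less 'name,category' CSV text."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 2:  # skip blank or malformed lines
            counts[row[1].strip()] += 1
    return counts
```

Categories are kept as strings exactly as they appear in the file, so no assumption is made about how they map back to substrate classes.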
# clone this repo
foo@bar:~$ git clone https://github.com/deeprob/EnZymClass.git
# change directory to repo's root
foo@bar:~$ cd /path/to/EnZymClass
# activate the conda environment created in Step 1
foo@bar:~$ conda activate enzymclass
# run EnZymClass
foo@bar:~$ python src/enzymclass/run_model.py /path/to/root_dir /path/to/train.csv /path/to/test.csv
EnZymClass checks whether the UniRef database has already been downloaded and is present in the "/path/to/root_dir/pssmpro" directory. If it is not present, EnZymClass downloads the database and stores it in that directory. Since it is a very large file, a user can download it before running EnZymClass and store it in the directory mentioned above as uniref50.fasta. After downloading the UniRef database, the user must also convert it to a BLAST-compatible database using the "makeblastdb" command shown below:
foo@bar:~$ makeblastdb -in "/path/to/root_dir/pssmpro/uniref50.fasta" -dbtype "prot" -out "/path/to/root_dir/pssmpro/uniref50"
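When preparing the database by hand, it may be useful to check that both the FASTA file and the formatted BLAST database are in place before starting EnZymClass. The sketch below is an assumption-laden helper, not part of the tool: the name `uniref_status` is hypothetical, and it relies on `makeblastdb` writing `.psq` sequence files for protein databases (possibly split into numbered volumes such as `uniref50.00.psq`).

```python
from pathlib import Path


def uniref_status(root_dir):
    """Return (has_fasta, has_blast_db) for the pssmpro directory under root_dir."""
    pssm_dir = Path(root_dir) / "pssmpro"
    has_fasta = (pssm_dir / "uniref50.fasta").is_file()
    # Protein BLAST databases include .psq files; large ones are split into volumes.
    has_blast_db = pssm_dir.is_dir() and any(pssm_dir.glob("uniref50*.psq"))
    return has_fasta, has_blast_db
```

A result of `(True, False)` would suggest the FASTA file was downloaded but the `makeblastdb` step above has not been run yet.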
Users can set the number of cores used by EnZymClass with the "--threads" optional argument. By default, EnZymClass uses all available threads.
EnZymClass estimates validation performance by simulating N models, each of which uses a unique training and validation set. By default, N=1000. To get faster results, the user can reduce the number of simulations by passing a lower number through the "--nsim" optional argument. To stop EnZymClass from generating this report, set "--nsim" to 0.
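The general idea of averaging performance over many random train/validation splits can be illustrated in a few lines. This is a toy sketch, not EnZymClass's procedure: the function name, the 25% validation fraction, and the placeholder "model" (always predicting the most common training label) are all assumptions chosen for illustration, and random shuffles are not guaranteed to give unique splits. The actual models and splitting scheme are described in the paper.

```python
import random
from collections import Counter


def repeated_holdout_accuracy(labels, n_sim=1000, valid_fraction=0.25, seed=0):
    """Mean validation accuracy of a majority-label baseline over n_sim random splits."""
    if len(labels) < 2:
        raise ValueError("need at least two labelled examples")
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_valid = max(1, int(len(labels) * valid_fraction))
    scores = []
    for _ in range(n_sim):
        rng.shuffle(indices)  # a fresh random split for every simulation
        valid, train = indices[:n_valid], indices[n_valid:]
        # Placeholder model: predict the most common label in the training part.
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in valid)
        scores.append(correct / len(valid))
    # With n_sim == 0 no report is produced, mirroring "--nsim 0".
    return sum(scores) / len(scores) if scores else None
```

Lowering `n_sim` shortens the loop proportionally, which is the same trade-off between runtime and stability of the estimate that "--nsim" controls.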
To create features only, without running the prediction model, run EnZymClass with the "--featurize" argument.
To run the prediction model only, run EnZymClass with the "--predict" argument. This assumes that the features are already stored in the /path/to/root_dir/features/ directory.
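When EnZymClass is launched from another script, the required paths and the optional arguments described above can be assembled into one command list. The sketch below is hedged: `build_command` is a hypothetical helper, and it assumes that "--threads" and "--nsim" take their value as the following token and that "--featurize" and "--predict" are plain switches, none of which is spelled out in this README.

```python
def build_command(root_dir, train_file, test_file,
                  threads=None, nsim=None, featurize=False, predict=False):
    """Build an argument list for run_model.py (to be run from the repo root)."""
    if featurize and predict:
        # The README presents these as alternatives; combining them is undocumented.
        raise ValueError("choose at most one of featurize and predict")
    cmd = ["python", "src/enzymclass/run_model.py", root_dir, train_file, test_file]
    if threads is not None:
        cmd += ["--threads", str(threads)]  # assumed syntax: flag followed by value
    if nsim is not None:
        cmd += ["--nsim", str(nsim)]  # assumed syntax: flag followed by value
    if featurize:
        cmd.append("--featurize")
    if predict:
        cmd.append("--predict")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the `enzymclass` conda environment active.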
- Packaging EnZymClass.
- Incorporating feature embedders in EnZymClass.