Performs de novo protein design using Deep Learning and PyRosetta to generate a novel synthetic protein structure.
- Install PyRosetta as the website describes.
- Use the following command (in GNU/Linux) to update your system and install all necessary programs and python modules for this script to run successfully:
sudo apt update && sudo apt full-upgrade && sudo apt install git dssp gnuplot python3-pip && pip3 install biopython pandas numpy tqdm beautifulsoup4 lxml scipy keras tensorflow
- Get this script:
git clone https://github.com/sarisabban/RamaNet.git
RamaNet (named after Gopalasamudram Ramachandran) is a script that uses Deep Learning (GAN and LSTM networks) along with PyRosetta to perform De Novo Protein Design (from the beginning), i.e. generate and design a synthetic protein structure entirely computationally. The script autonomously generates a backbone topology (random every time), then designs a sequence that fits the generated backbone. It then generates fragments for the structure from its FASTA sequence in preparation for evaluation using the Abinitio folding simulation (the Abinitio protocol script can be found here). From the fragment files it calculates the RMSD for each fragment's position on the designed structure and plots an RMSD vs Position graph to indicate how well the Abinitio folding simulation might go (good-quality fragments have an average RMSD < 2Å).
Here is a video that explains this whole process. For quick structure generation skip to the last step (step 3).
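For orientation, below is a minimal sketch of how such an RMSD-vs-position check can be done, assuming a whitespace-separated fragment quality file (e.g. the output of Rosetta's r_frag_quality application) whose first column is the fragment position and whose last column is the RMSD; the file name and column layout here are assumptions, not RamaNet's fixed format.

```python
# Minimal sketch: average fragment RMSD per position (file name and
# column layout are assumptions about the fragment quality output).
import pandas as pd

frags = pd.read_csv('frag_quality.dat', sep=r'\s+', header=None)
mean_rmsd = frags.groupby(frags.columns[0])[frags.columns[-1]].mean()

mean_rmsd.to_csv('rmsd_vs_position.dat', sep='\t', header=False)
print('Average RMSD: {:.2f} Angstroms'.format(mean_rmsd.mean()))
# Plot with the gnuplot dependency installed above, e.g.:
#   gnuplot -e "set term png; set output 'plot.png'; plot 'rmsd_vs_position.dat' w lines"
```

An average below 2Å across positions suggests the fragments are good enough for the Abinitio simulation.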
- You do not need to generate the Deep Learning datasets; they are already provided and can be downloaded here:
| Dataset name | Description |
|---|---|
| Backbone dataset | Dataset of protein structure Φ and Ψ angles (PS.csv) and contact maps (CM.csv). The downloaded file needs to be uncompressed. |
| Sequence dataset | Dataset of amino acid sequences |
| Fragment dataset | Dataset of amino acid sequences, secondary structures, SASA, phi, psi, and omega angle features from the vall.jul19.2011 database (used by Rosetta for fragment generation) in .csv format |
If you want to replicate our work use the following steps:
1.1 These are the steps to generate the backbone dataset (computation time ~168 hours; requires more than 128GB of free disk space):
The default parameters for generating the dataset are to isolate proteins between 80 and 150 amino acids, with more helices and sheets than loops (rigid structures), and with an Rg (radius of gyration) value of less than 15Å (compact structures). The script results in two dataset (.csv) files: the PS.csv file with each structure's Φ/Ψ angles, and the CM.csv file with each structure's contact map (values 0 < x <= 12Å). In both datasets the first column is the PDB ID, followed by the respective data. If errors occur, that is fine; some protein files will cause errors (they will be deleted/ignored), but the script should continue all the way to the end and produce both .csv dataset files.
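As an illustration of these default filters, here is a minimal sketch that computes a plain (unweighted) Cα radius of gyration with Biopython and applies the length and Rg cut-offs quoted above; the exact Rg formula the script uses may differ, and '1abc.pdb' is a placeholder file name.

```python
# Minimal sketch of the default size/compactness filter (assumed
# unweighted Ca-based Rg; RamaNet's exact formula may differ).
import numpy as np
from Bio.PDB import PDBParser

structure = PDBParser(QUIET=True).get_structure('s', '1abc.pdb')
ca = np.array([atom.coord for atom in structure.get_atoms()
               if atom.get_name() == 'CA'])

centre = ca.mean(axis=0)                               # geometric centre
rg = np.sqrt(((ca - centre) ** 2).sum(axis=1).mean())  # Rg in Angstroms

keep = 80 <= len(ca) <= 150 and rg < 15.0              # defaults quoted above
print('residues: {} | Rg: {:.1f} A | keep: {}'.format(len(ca), rg, keep))
```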
The dataset generation protocol is as follows:
- Download the PDB database.
- Extract files.
- Remove non-protein structures.
- Remove structures shorter or longer than a specified amino acid length.
- Remove structures with broken chains.
- Remove structures with loops longer than a specified length.
- Renumber structures' amino acids.
- Remove structures that are above a specified Radius of Gyration value (keeping only compact structures).
- ########## --- HUMAN EYE FILTERING --- ##########
- Clean every structure in the database.
- Make a list of all paths (if the next step is to be performed on a high-performance computer, HPC).
- Generate HPC submission file (PBS job scheduler).
- Relax each structure multiple times (on an HPC or a local computer) to augment the examples.
- Generate the Φ/Ψ angles (PS.csv) dataset file and the contact map (CM.csv) dataset file.
- Get the largest contact map value from the CM.csv dataset file.
- Combine, normalise, vectorise, and serialise the PS.csv and CM.csv datasets into a PS+CM.h5 file (see the sketch below).
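As a rough illustration of the last three steps above, here is a minimal sketch that normalises and serialises the two .csv files into a single HDF5 file; the angle scaling, the dataset keys, and the use of h5py (installed alongside TensorFlow) are all assumptions, and the real script's normalisation and layout may differ.

```python
# Minimal sketch: combine, normalise, and serialise PS.csv and CM.csv
# (PDB ID assumed in column 0; keys, scaling, and h5py use are assumptions).
import h5py
import numpy as np
import pandas as pd

ps = pd.read_csv('PS.csv').iloc[:, 1:].to_numpy(dtype=np.float32)
cm = pd.read_csv('CM.csv').iloc[:, 1:].to_numpy(dtype=np.float32)

ps = (ps + 180.0) / 360.0   # map angles from [-180, 180] to [0, 1]
cm = cm / cm.max()          # divide by the largest contact map value

with h5py.File('PS+CM.h5', 'w') as f:
    f.create_dataset('PS', data=ps)
    f.create_dataset('CM', data=cm)
```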
Each of these steps can be switched on or off using a binary switch string we developed: each digit turns the corresponding step on (1) or off (0), in the order listed above. So to run the first section of the protocol use this command:
python3 RamaNet.py --dataset 11111111000000000000 DIRECTORY
or python3 RamaNet.py -d 11111111000000000000 DIRECTORY
After the human eye filtering step you can choose what to run: whether to generate the necessary files to augment the examples through relaxation on an HPC or to just run the augmentation locally, and whether to generate the Φ/Ψ/Ω angle and Cα distance dataset files. The switches are ordered as in the written protocol above.
python3 RamaNet.py --dataset 00000000100000001110 DIRECTORY
or python3 RamaNet.py -d 00000000100000001110 DIRECTORY
This system should give you control to run individual steps at will.
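To make the switch system concrete, here is a minimal sketch of how such a binary string can drive a pipeline; the step names and functions below are illustrative only, not RamaNet's actual internals.

```python
# Minimal sketch of a binary switch string driving pipeline steps
# (step names are illustrative, not RamaNet's actual functions).
import sys

steps = [('Download the PDB database', lambda: print('downloading...')),
         ('Extract files',             lambda: print('extracting...')),
         ('Remove non-proteins',       lambda: print('filtering...'))]

switches = sys.argv[1] if len(sys.argv) > 1 else '110'
for flag, (name, run) in zip(switches, steps):
    if flag == '1':          # '1' runs the step, '0' skips it
        print('Running:', name)
        run()
```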
The most difficult step is the Human Eye Filtering step, which requires a person to manually filter out all unwanted structures before moving on to cleaning up each structure and augmenting the data. Unwanted structures, such as non-compact structures, structures with more loops than helices and sheets, and odd-looking structures, are all deleted. This is also the step where you separate structures and collect the ones with the traits you need; the dataset is then augmented (preferably on an HPC to save time).
It is best to contact me if you want to generate your own dataset and I will walk you through the protocol; it is not difficult, but it works on an individual basis.
1.2 These are the steps to generate the sequence dataset (computation time ~168 hours; requires more than 128GB of free disk space):
- Download the PDB database.
- Extract files.
- Remove non-protein structures.
- Remove structures with broken chains.
- Renumber structures' amino acids.
- Generate the sequence dataset (AsPSa.csv) and contact map (M.csv) files.
- Generate headers for both .csv files.
- Fill gaps in both .csv files with zeros (padding), as sketched after the commands below.
- Combine, normalise, vectorise, and serialise the AsPSa.csv and M.csv datasets into an AsPSaM.h5 file.
To complete all these steps use the following command:
python3 RamaNet.py --dataset 11111111100011110001 DIRECTORY
or python3 RamaNet.py -d 11111111100011110001 DIRECTORY
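Here is a minimal sketch of the zero-padding step, assuming the header-generation step gives both .csv files a full-width header so that shorter rows are read in as missing values (the file layout is an assumption):

```python
# Minimal sketch: pad gaps with zeros. With full-width headers in place,
# pandas reads short rows as NaN, so padding reduces to replacing NaN with 0.
import pandas as pd

for name in ('AsPSa.csv', 'M.csv'):
    df = pd.read_csv(name).fillna(0)
    df.to_csv(name, index=False)
```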
1.3 These are the steps to generate the fragment dataset:
- Manually get the vall.jul19.2011 file from Rosetta (or PyRosetta) and make sure it is in the same directory as this script.
- Generate the fragment dataset (Fragment.csv).
- Normalise, vectorise, and serialise the Fragment.csv dataset into Frag_X.h5 (features) and Frag_Y.h5 (labels) files (see the sketch after the commands below).
To complete all these steps use the following command:
python3 RamaNet.py --frag
or python3 RamaNet.py -f
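As a rough illustration of the vectorise/serialise step, the sketch below one-hot encodes amino acid letters into a feature matrix and collects numeric columns as labels. The column names ('AA', 'phi', 'psi', 'omega', 'SASA'), the feature/label split, and the h5py usage are all assumptions about Fragment.csv's layout.

```python
# Minimal sketch: vectorise Fragment.csv into Frag_X.h5 (features) and
# Frag_Y.h5 (labels). Column names and the X/Y split are assumptions.
import h5py
import numpy as np
import pandas as pd

df = pd.read_csv('Fragment.csv')

amino = 'ACDEFGHIKLMNPQRSTVWY'
X = np.zeros((len(df), len(amino)), dtype=np.float32)
for i, res in enumerate(df['AA']):        # one-hot encode each residue
    X[i, amino.index(res)] = 1.0

Y = df[['phi', 'psi', 'omega', 'SASA']].to_numpy(dtype=np.float32)

with h5py.File('Frag_X.h5', 'w') as fx:
    fx.create_dataset('X', data=X)
with h5py.File('Frag_Y.h5', 'w') as fy:
    fy.create_dataset('Y', data=Y)
```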
- You do not need to train the neural networks because they are already trained and the weights files are available here:
| Weights name | Description |
|---|---|
| Backbone Weights | Backbone neural network weights |
| Sequence Weights | Sequence neural network weights |
| Fragment Weights | Fragment neural network weights |
You can use the following commands to train the neural networks on the datasets (whether you use our datasets or generate your own).
Train on backbone dataset:
python3 RamaNet.py --TrainBack
or python3 RamaNet.py -tb
Train on sequence dataset:
python3 RamaNet.py --TrainSeq
or python3 RamaNet.py -ts
Train on fragment dataset:
python3 RamaNet.py --TrainFrag
or python3 RamaNet.py -tf
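For orientation, here is a minimal Keras training sketch in the spirit of the LSTM networks mentioned above. The dataset key, input/output split, architecture, and weights file name are all assumptions; the real training code, shapes, and GAN components are defined in RamaNet.py.

```python
# Minimal Keras LSTM training sketch (dataset key, shapes, architecture,
# and weights file name are assumptions, not RamaNet's real setup).
import h5py
from tensorflow import keras

with h5py.File('PS+CM.h5', 'r') as f:
    data = f['PS'][:]                        # assumed dataset key

X = data[:, :-1].reshape(len(data), -1, 1)   # all but the last angle as input
Y = data[:, -1]                              # predict the final angle

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=X.shape[1:]),
    keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X, Y, epochs=10, batch_size=32)
model.save_weights('backbone_weights.h5')    # a weights file like those above
```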
#TODO
- Vectorise sequence dataset to get AsPSaM.h5
- ADD NEW VIDEO (line 18)
- optimise networks
- Train networks
- make weights available
- write README file's generate instructions
When using these scripts kindly reference the following: