(pLM4Alg: Protein Language Model-Based Predictors for Allergenic Proteins and Peptides)[https://pubs-acs-org.er.lib.k-state.edu/doi/10.1021/acs.jafc.3c07143]
Notice: pLM4Allergen is ONLY freely available for academic research; for commercial usage, please contact us, zhenjiao@ksu.edu; yonghui@ksu.edu;
The dataset used in this study was compiled from four known public databases: CAMPARE, AllergenOnline, IUIS, and Uniprot. Datamining was used for AllergenOnline and IUIS (you can use our code for your latest dataset retrival). See the data mining code.
The dataset can be downloaded at publisher website
The original codes can be downloaded at Google Drive
Implementation platform Google Colab. The majoy dependencies used in this project are as following:
Python 3.8.16
fair-esm 2.0.0
keras 2.9.0
pandas 1.3.5
numpy 1.21.6
scikit-learn 1.0.2
tensorflow 2.9.2
torch 1.13.0+cu116
h5py 1.21.6
More detailed python libraries used in this project are referred to requirements.txt. All the implementation can be down in Google Colab and all you need is just a browser and a google account. Install all the above packages by !pip install fair-esm==2.0.0
In our experiments, we follow our previous architecture design UniDL4BioPep, and conduct a series paramaters tuning based on experiences, mainly focus on filter size selected from [16,32,64,128,256], kernel size selected from [3,6,9,12], stride selected from [1,2,4,8] and units selected from [32,64,128,256,512,1024,2048,4096,8192].
Feel free to make your personalized modifications. Just scroll down to the model architecture sections and make revisions to fit your expectation. Based on our experience in esm protein language models, our model architecture is quite reliable as your initial attempts on your own datasets. But, any attempts in new architecture design are highly encouraged.