tiagolbiotech/NPOmix_python

NPOmix v1.0

Publication:

NPOmix: A machine learning classifier to connect mass spectrometry fragmentation data to biosynthetic gene clusters 

Tiago F Leão, Mingxun Wang, Ricardo da Silva, Alexey Gurevich, Anelize Bauermeister, Paulo Wender P Gomes,
Asker Brejnrod, Evgenia Glukhov, Allegra T Aron, Joris J R Louwen, Hyun Woo Kim, Raphael Reher, Marli F Fiore,
Justin J J van der Hooft, Lena Gerwick, William H Gerwick, Nuno Bandeira, Pieter C Dorrestein

PNAS Nexus, Volume 1, Issue 5, November 2022, pgac257.
DOI: https://doi.org/10.1093/pnasnexus/pgac257

NPOmix PNAS Nexus link

More information about NPOmix (including workshops) at:

https://www.tfleao.com/npomix1

Quick tutorial:

Installing packages and software (using a bash terminal):

– Anaconda: we suggest using the command-line installer, following https://docs.anaconda.com/anaconda/install/.

On macOS, for example, you can run the following commands in a terminal window:

cd ~/Downloads/
wget https://repo.anaconda.com/archive/Anaconda3-2023.03-MacOSX-x86_64.sh
bash Anaconda3-2023.03-MacOSX-x86_64.sh

– Conda packages:

conda install -c bioconda pyteomics
conda install -c anaconda requests
conda install -c anaconda networkx
conda install datetime
conda install -c conda-forge biopython
conda install -c anaconda scikit-learn

– antiSMASH (offline install; we recommend the online version, which has a great interface):

conda install -c bioconda antismash

– BiG-SCAPE:

Download and parse the Pfam database:
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz
gzip -d Pfam-A.hmm.gz
conda install -c bioconda hmmer
hmmpress Pfam-A.hmm

Install the BiG-SCAPE software:
conda create -n bigscape -c bioconda bigscape
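
After installing the conda packages listed above, it can be useful to confirm that they are importable from the Python environment that will run the notebooks. The snippet below is a minimal sketch, not part of the NPOmix repository: `missing_modules` is a hypothetical helper, and the list uses the usual import names (`Bio` for biopython, `sklearn` for scikit-learn) rather than the conda package names.

```python
from importlib.util import find_spec

# Import names (not conda package names) for the packages installed above.
REQUIRED_MODULES = ["pyteomics", "requests", "networkx", "Bio", "sklearn"]


def missing_modules(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any module is reported as missing, re-run the corresponding `conda install` command in the active environment.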

Instructions:

1) Clone the GitHub repository:

cd path/to/root/folder
git clone https://github.com/tiagolbiotech/NPOmix_python.git

2) Download the training BGCs (ideally in the test/ folder):

wget -O antismash_bgcs.zip "https://zenodo.org/record/6637083/files/antismash_only_gbk.zip?download=1"
unzip antismash_bgcs.zip && rm antismash_bgcs.zip

3) Clone the GNPS job with the metabolomes for the strains in the training set and add your files (ideally to the G3 group): https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=b915d1f78c364bf4be9900047760d95c

4) Download the GNPS output. Run antiSMASH (bacterial version) on your genomes and download the resulting files. Place the antiSMASH files in the antismash_gbk_only folder.

5) Run BiG-SCAPE for the antiSMASH BGCs (including from your samples) and concatenate outputs for all classes.

A) Run BiG-SCAPE:
conda activate bigscape
bigscape.py --pfam_dir /path/to/pfam_files/ -i /path/to/antismash/ \
  -o /path/to/bigscape_outputs_220603_1791samples/ -c 8 --include_singletons --mibig

B) Concatenate outputs:
sh concat_bigscape.sh /path/to/bigscape_output/ 2023-11-11_11-11-11_hybrids_glocal/
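
The contents of concat_bigscape.sh are not reproduced here. As a rough illustration of what concatenating per-class outputs could look like in Python, the sketch below merges several text tables into one while writing the header only once. It assumes that every input file is a table with an identical first (header) line, which is an assumption about the file layout rather than a description of the script; `concat_tables` is a hypothetical helper.

```python
def concat_tables(input_paths, output_path):
    """Concatenate text tables that share a header line, keeping the header once.

    Returns the number of data rows written to output_path.
    """
    rows_written = 0
    header = None
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as handle:
                lines = handle.read().splitlines()
            if not lines:
                continue  # skip empty files
            if header is None:
                # The first non-empty file defines the header.
                header = lines[0]
                out.write(header + "\n")
            elif lines[0] != header:
                raise ValueError(f"Header mismatch in {path}")
            for line in lines[1:]:
                if line.strip():  # ignore blank lines
                    out.write(line + "\n")
                    rows_written += 1
    return rows_written
```

Raising an error on a header mismatch is a deliberate choice in this sketch: silently merging tables with different columns would produce a file that later steps cannot parse reliably.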

6) Run either the dereplication or the non-dereplication notebook. Make sure to adjust the paths to the antiSMASH, GNPS and BiG-SCAPE files.


For detailed instructions, please check the video below:

video1828216508.mp4

Submit your samples

If you have difficulties running the NPOmix tool for your samples or have questions, please submit your samples at the link below.

https://www.tfleao.com/general-8

Overview of the methodology

To use the NPOmix approach (Fig. 1, a schematic example using only four samples), you need a dataset of so-called paired genome-MS/MS samples: samples that contain both a genome (or metagenome) and a set of MS/MS spectra obtained via untargeted LC-MS/MS. Many paired datasets are available at the Paired omics Data Platform (PoDP), one of the first initiatives to gather paired genome-MS/MS samples.

Using BiG-SCAPE, each biosynthetic gene cluster (BGC) in the genomes is compared pairwise (Fig. 1A) to every other BGC in the same set of genomes to compute similarity scores (1 minus the BiG-SCAPE raw distance) and, where possible, to assign BGCs to Gene Cluster Families (GCFs). To create a BGC fingerprint (Fig. 1C), we identify the maximum similarity of the query BGC to any of the BGCs in each genome (which can be considered a pool of BGCs) in the dataset. A BGC fingerprint can therefore be represented as a row of values (a vector of maximum similarity scores) in which each column is a different genome from the selected dataset. Similarity scores range from 0.0 to 1.0: identical BGCs have a perfect score of 1.0, a score of 0.8 indicates that a homologous BGC is present in the genome, and a score of 0.0 (or any score below the similarity cutoff of 0.7) indicates that the queried BGC is absent from the genome.

MS/MS fingerprints are created by a similar process (Fig. 1B), except that genomes are replaced by the MS/MS spectra of each sample and the query BGC is replaced by a query MS/MS spectrum, either a reference spectrum from GNPS or a cryptic MS/MS spectrum from a new sample (one that contains a genome and experimental MS/MS spectra). For MS/MS fingerprints (Fig. 1D), we used GNPS to calculate the pairwise modified cosine score and then identified the maximum similarity of the query MS/MS spectrum to any of the MS/MS spectra in each experimental sample.
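
The fingerprint construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the NPOmix notebooks: `similarity_fingerprint` and all identifiers in the toy data are hypothetical, pairwise similarities are assumed to be precomputed (for BGCs, 1 minus the BiG-SCAPE raw distance), and a query that is itself part of a sample is assumed to score 1.0 against that sample.

```python
def similarity_fingerprint(query, pairwise_similarity, item_to_sample, samples, cutoff=0.7):
    """Return the maximum similarity of `query` to any item in each sample.

    pairwise_similarity maps frozenset({a, b}) -> similarity in [0.0, 1.0].
    item_to_sample maps each item (BGC or spectrum) to the sample it came from.
    Scores below `cutoff` are recorded as 0.0, i.e. treated as absent.
    """
    best = {sample: 0.0 for sample in samples}
    for item, sample in item_to_sample.items():
        if item == query:
            score = 1.0  # assumption: an item is identical to itself
        else:
            score = pairwise_similarity.get(frozenset((query, item)), 0.0)
        if score >= cutoff and score > best[sample]:
            best[sample] = score
    # One value per sample, in the column order given by `samples`.
    return [best[sample] for sample in samples]


# Toy data (hypothetical names): four BGCs spread over three genomes.
samples = ["genome_A", "genome_B", "genome_C"]
item_to_sample = {
    "bgc1": "genome_A",
    "bgc2": "genome_B",
    "bgc3": "genome_B",
    "bgc4": "genome_C",
}
pairwise_similarity = {
    frozenset(("bgc1", "bgc2")): 0.8,
    frozenset(("bgc1", "bgc3")): 0.9,
    frozenset(("bgc1", "bgc4")): 0.4,  # below the 0.7 cutoff
}

fingerprint = similarity_fingerprint("bgc1", pairwise_similarity, item_to_sample, samples)
print(fingerprint)  # [1.0, 0.9, 0.0]
```

The same function applies to MS/MS fingerprints if the items are spectra, the samples are sets of experimental MS/MS spectra, and the similarities are modified cosine scores.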
Of particular note, we did not use the full classical GNPS molecular networking capabilities; we only used the functions required to calculate a modified cosine similarity score between a pair of MS/MS spectra.

The BGC fingerprints form a training matrix (Fig. 1E) in which each row is a BGC fingerprint (a vector of maximum similarity scores) and each column is a genome; there are normally thousands of rows (e.g., for our first release, round 4, we used 5,421 BGCs in 1,040 genomes/metagenomes). This matrix is used to train the k-nearest neighbor (KNN) algorithm on the genomic data. One extra column is required in this matrix: a label assigning each BGC fingerprint to a GCF, so that the KNN algorithm knows which fingerprint patterns belong together. The KNN algorithm places the BGC fingerprints in the KNN space (in Fig. 1G, this space is illustrated with only 2 dimensions).

The MS/MS fingerprints form a testing matrix (Fig. 1F), which also contains 1,040 columns because of the 1,040 sets of paired experimental MS/MS spectra. For our first release, this testing matrix contained 15 reference MS/MS fingerprints (rows) for reference spectra from the GNPS database. Each query MS/MS fingerprint (a row in the testing matrix, whose columns are the sets of experimental MS/MS spectra per sample) is placed into the same KNN space (Fig. 1G) so that the algorithm can return the GCF labels of the k nearest neighbors of the query MS/MS fingerprint (e.g., k = 3 for the three most similar BGC neighbors). A GCF label can appear more than once in the returned list if two or more of the nearest BGC neighbors belong to the same gene family; this repetition is common behavior for the KNN approach.

Our approach is suitable for bacterial, fungal, algal, and plant genomes and MS/MS spectra.
Metagenomes and metagenome-assembled genomes can also be used instead of genomes; however, complete genomes are preferred. This KNN approach also supports LC-MS/MS from fractions or from different culture conditions; multiple LC-MS/MS files for the same genome are merged into a single set of experimental MS/MS spectra.
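
To make the KNN step concrete, the sketch below looks up the GCF labels of the k training fingerprints closest to a query MS/MS fingerprint. It uses only the Python standard library and is meant as an illustration, not as the implementation in the NPOmix notebooks: `k_nearest_gcf_labels` and the toy data are hypothetical, and Euclidean distance is an assumption (the notebooks may use a different metric). scikit-learn, which is among the packages installed above, provides `KNeighborsClassifier` for the same kind of lookup at scale.

```python
import math


def k_nearest_gcf_labels(training_fingerprints, gcf_labels, query_fingerprint, k=3):
    """Return the GCF labels of the k training fingerprints closest to the query.

    Labels may repeat when several neighbors belong to the same GCF.
    Distance is Euclidean (an assumption made for this sketch).
    """
    distances = [
        (math.dist(fingerprint, query_fingerprint), index)
        for index, fingerprint in enumerate(training_fingerprints)
    ]
    distances.sort()  # ties are broken by row order
    return [gcf_labels[index] for _, index in distances[:k]]


# Toy training matrix: rows are BGC fingerprints, columns are three genomes.
training_fingerprints = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.8, 1.0],
]
gcf_labels = ["GCF_1", "GCF_1", "GCF_2", "GCF_3"]  # the extra label column

# Query MS/MS fingerprint over the same three paired samples.
query_fingerprint = [0.95, 0.85, 0.0]

candidates = k_nearest_gcf_labels(training_fingerprints, gcf_labels, query_fingerprint, k=3)
print(candidates)  # ['GCF_1', 'GCF_1', 'GCF_3']
```

Note that GCF_1 appears twice in the result, which is the label repetition described above.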

Fig1_part1

Fig1_part2

Video overview:

NPOmix-summary-TFL210827-v1.0-edited.mp4

References:

  1. 3% of the biosynthetic potential: Gavriilidou, A., Kautsar, S. A., Zaburannyi, N., Krug, D., Muller, R., Medema, M. H. & Ziemert, N. A global survey of specialized metabolic diversity encoded in bacterial genomes. bioRxiv (2021).
  2. Paired omics Data Platform: Schorn, M. A. et al. A community resource for paired genomic and metabolomic data mining. Nat. Chem. Biol. 17, 363–368 (2021).
  3. New approved drugs from 1981 to 2014: Newman, D. J. & Cragg, G. M. Natural Products as Sources of New Drugs from 1981 to 2014. J. Nat. Prod. 79, 629–61 (2016).

Video details on the methodology:

NPOmix-detailed-TFL210918-v1.0.mp4
