BMC Bioinformatics. 2014 Feb 24:15:54.
doi: 10.1186/1471-2105-15-54.

FISH Amyloid - a new method for finding amyloidogenic segments in proteins based on site specific co-occurrence of aminoacids

Pawel Gasior et al. BMC Bioinformatics.

Abstract

Background: Amyloids are proteins capable of forming fibrils whose intramolecular contact sites assume a densely packed zipper pattern. Their oligomers can underlie serious diseases, e.g. Alzheimer's and Parkinson's diseases. Recent studies show that short segments of amino acids can be responsible for the amyloidogenic properties of a protein. A few hundred such peptides have been found experimentally, but experimental testing of all candidates is currently not feasible. Here we propose an original machine learning method for classification of amino acid sequences, based on discovering a segment with a discriminative pattern of site-specific co-occurrences between sequence elements. The pattern is based on the positions of residues with correlated occurrence over a sliding window of a specified length. The algorithm first recognizes the most relevant training segment in each positive training instance. Classification is then based on the maximal distances between the co-occurrence matrix of the relevant segments in positive training sequences and the matrix from negative training segments. The method was applied to study amino acid sequences with regard to their amyloidogenic properties.

Results: Our method was first trained on available datasets of hexapeptides annotated with amyloidogenic classification, using 5- or 6-residue sliding windows. Depending on the choice of training and testing datasets, the area under the ROC curve reached up to 0.80 for experimental datasets and 0.95 for datasets generated computationally (with the 3D profile method). Importantly, the results on 5-residue segments were not significantly worse, although the classification required the algorithm to first recognize the most relevant training segments. A dataset of long sequences, such as the sup35 prion and a few other amyloid proteins, was used to test the method and gave encouraging results. Our web tool FISH Amyloid, trained on all available experimental data 4-10 residues long, offers prediction of amyloidogenic segments in protein sequences.

Conclusions: We proposed a new classification method that recognizes co-occurrence patterns in sequences. The method reveals the characteristic classification pattern of the data and finds the segments where its score is strongest, including in long training sequences. Applied to the problem of recognizing amyloidogenic segments, it showed good potential for classification problems in bioinformatics.

Figures

Figure 1
Construction of the co-occurrence matrix. For simplicity, windows are of length 4, and 3 sub-matrices are generated in each direction of the general matrix. Coordinates of the general matrix (large numbers) represent the location of amino acids in the sequences. Each amino acid is represented by a number between 1 and 20 (ordered alphabetically), located within the sub-matrices. For example, the point highlighted in red would indicate a high co-occurrence score between lysine (K) at position 1 of the sequence and tryptophan (W) at position 3 of the sequence.
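The layout described in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cooccurrence_matrix` and `cell`, the toy segments, the skipping of same-position pairs, and the normalization by the number of segments are all choices made for the example. Only the position-by-residue layout and the alphabetical residue ordering come from the captions.

```python
# One-letter amino acid codes in alphabetical order (A = 1 ... Y = 20 in the
# captions; the indices used here are 0-based).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def cooccurrence_matrix(segments, n):
    """Build an (n*20) x (n*20) matrix from equal-length segments.

    Entry [i*20 + a][j*20 + b] is the fraction of segments that have
    residue a at position i and residue b at position j (i != j).
    Normalizing by the segment count is an assumption of this sketch.
    """
    size = n * len(AMINO_ACIDS)
    matrix = [[0.0] * size for _ in range(size)]
    for seg in segments:
        if len(seg) != n:
            raise ValueError("every segment must have length n")
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue  # same-position pairs are skipped (assumption)
                row = i * len(AMINO_ACIDS) + INDEX[seg[i]]
                col = j * len(AMINO_ACIDS) + INDEX[seg[j]]
                matrix[row][col] += 1.0 / len(segments)
    return matrix


def cell(matrix, pos_i, aa_i, pos_j, aa_j):
    """Look up one entry using 1-based positions, as in the figure."""
    row = (pos_i - 1) * len(AMINO_ACIDS) + INDEX[aa_i]
    col = (pos_j - 1) * len(AMINO_ACIDS) + INDEX[aa_j]
    return matrix[row][col]


# Toy data (made up): three of four segments have K at position 1
# and W at position 3.
segments = ["KAWA", "KCWD", "KEWF", "AAAA"]
m = cooccurrence_matrix(segments, 4)
print(len(m))                    # 80
print(cell(m, 1, "K", 3, "W"))   # 0.75
```

A dense list of lists keeps the correspondence with the figure obvious; for real data a sparse structure would likely be more practical, since most of the 80 x 80 (or larger) entries stay at zero.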
Figure 2
Training algorithm of the method. YES (NO) denotes the set of positive (negative) training sequences, containing nYES (nNO) instances, which are tested with a window of length n; MatrixYES (MatrixNO) are the corresponding co-occurrence matrices with coordinates i and j; k is the index of a positive training sequence; Mk is a temporary positive correlation matrix obtained up to the k-th sequence; a denotes the starting position of a tested window; X is the normalized sum of all previously calculated matrices M; l is an iteration counter; w denotes the distance between the current positive and negative co-occurrence matrices; wd is the maximal distance later used in the classification.
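The flowchart itself is not reproduced here, so the following Python sketch only illustrates the general idea the caption suggests: for each positive sequence, try every window start a, and keep the window whose addition to the running positive matrix gives the largest distance w from the negative matrix. It is a simplified greedy single pass written under assumptions, not the authors' algorithm. The distance metric (sum of absolute differences), the sparse dictionary representation, the function names, and the toy sequences are all hypothetical, and the iteration counter l and the matrix X from the caption are not modeled.

```python
from collections import Counter


def site_pairs(segment):
    """Site-specific pairs ((i, a), (j, b)) with i < j inside one window."""
    n = len(segment)
    return [((i, segment[i]), (j, segment[j]))
            for i in range(n) for j in range(i + 1, n)]


def normalize(counts, total):
    """Turn raw pair counts into frequencies (a sparse matrix as a dict)."""
    return {key: value / total for key, value in counts.items()}


def distance(matrix_a, matrix_b):
    """Assumed metric: sum of absolute entry-wise differences."""
    keys = set(matrix_a) | set(matrix_b)
    return sum(abs(matrix_a.get(k, 0.0) - matrix_b.get(k, 0.0)) for k in keys)


def train(yes, no, n):
    """Greedy single-pass sketch; returns (starts, MatrixYES, MatrixNO, wd)."""
    # MatrixNO: every window of every negative sequence contributes.
    neg_counts, neg_windows = Counter(), 0
    for seq in no:
        for a in range(len(seq) - n + 1):
            neg_counts.update(site_pairs(seq[a:a + n]))
            neg_windows += 1
    matrix_no = normalize(neg_counts, neg_windows)

    pos_counts = Counter()  # running sum over the chosen positive windows
    starts = []             # chosen window start a for each positive sequence
    wd = 0.0
    for k, seq in enumerate(yes, start=1):
        if len(seq) < n:
            raise ValueError("positive sequence shorter than the window")
        best_a, best_w, best_counts = None, -1.0, None
        for a in range(len(seq) - n + 1):
            # Temporary matrix M_k: previous choices plus this candidate window.
            trial = pos_counts + Counter(site_pairs(seq[a:a + n]))
            w = distance(normalize(trial, k), matrix_no)
            if w > best_w:
                best_a, best_w, best_counts = a, w, trial
        pos_counts = best_counts
        starts.append(best_a)
        wd = best_w
    return starts, normalize(pos_counts, len(yes)), matrix_no, wd


# Toy data (made up): the selected windows are those least like the negatives.
starts, matrix_yes, matrix_no, wd = train(["GGGKW", "KWGGG"], ["GGG", "GGG"], 3)
print(starts)  # [2, 0]
print(wd)      # 6.0
```

Because the pass is greedy, the window chosen for an early sequence is never revisited; repeating the pass until the choices stop changing would be one way to relax that, and may be closer to what the iteration counter in the caption is for.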
Figure 3
Classification of long proteins. Results of our classification on 4 amyloid proteins. The method was trained on the Waltz dataset. Black blocks indicate the locations of amyloidogenic segments obtained with wl = 0.14, which was equivalent to a specificity of 60% on the Waltz dataset. The brown blocks at the top indicate where the amyloidogenic segments would begin if a different wl value were assumed. The circles show amyloidogenic segments obtained experimentally by different groups working on protein fragments of various lengths (green: above 16, blue: 11, red: 7).
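Scanning a long protein with a sliding window and a score threshold, as this caption describes, might look like the sketch below. The scoring function is passed in as a parameter because the real FISH Amyloid score is not specified in this record; `toy_score`, the example sequence, and the threshold value are hypothetical stand-ins chosen only to make the example runnable.

```python
def amyloidogenic_blocks(sequence, window, score, threshold):
    """Return merged (start, end) blocks, 0-based with end exclusive.

    A window counts as a hit when score(window) >= threshold; overlapping
    or touching hits are merged into one block, like the black blocks in
    the figure.
    """
    blocks = []
    for a in range(len(sequence) - window + 1):
        if score(sequence[a:a + window]) >= threshold:
            start, end = a, a + window
            if blocks and start <= blocks[-1][1]:
                blocks[-1] = (blocks[-1][0], end)  # extend the previous block
            else:
                blocks.append((start, end))
    return blocks


def toy_score(segment):
    """Hypothetical stand-in score: fraction of V and I residues."""
    return sum(segment.count(aa) for aa in "VI") / len(segment)


sequence = "GGGGVIVIVGGGGGGIVVIGG"  # made-up sequence
print(amyloidogenic_blocks(sequence, 5, toy_score, 0.75))
# [(3, 10), (14, 20)]
```

Raising the threshold shrinks or removes blocks and lowering it grows them, which is the effect the brown blocks in the figure illustrate for alternative wl values.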
Figure 4
Classification performance on a complete experimental dataset. ROC curve obtained with FISH Amyloid on all available experimental data (all datasets with peptides 4-10 amino acids long and experimental fragments from sup35). The total ROC AUC is 0.80, and the diagonal classification point has both sensitivity and specificity of 74%. The curve is based on average values of 40 independent trials from 4-fold cross-validations. The quantiles 0.95, 0.85 and the median are presented as a boxplot at the diagonal classification point.
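For readers unfamiliar with the two quantities reported here, the sketch below computes a ROC AUC and the "diagonal" operating point (the threshold at which sensitivity and specificity are closest) from classifier scores. The scores are invented for illustration and do not reproduce the paper's data; the function names are likewise assumptions. The AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the usual Mann-Whitney convention).
    """
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def diagonal_point(pos_scores, neg_scores):
    """Return (threshold, sensitivity, specificity) with the smallest gap."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        sensitivity = sum(s >= t for s in pos_scores) / len(pos_scores)
        specificity = sum(s < t for s in neg_scores) / len(neg_scores)
        gap = abs(sensitivity - specificity)
        if best is None or gap < best[0]:
            best = (gap, t, sensitivity, specificity)
    return best[1:]


# Made-up scores, for illustration only.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2]
print(roc_auc(pos, neg))         # 0.8125
print(diagonal_point(pos, neg))  # (0.6, 0.75, 0.75)
```

The pairwise loop is quadratic in the number of scores, which is fine for small peptide datasets; a rank-based computation would be preferable for large ones.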
Figure 5
Final co-occurrence matrix. Graphical representation of the final co-localization matrix on the extended experimental dataset. Large matrix coordinates represent the locations of amino acid couples, obtained from the 5-residue sliding window. The most frequent couples of amino acids, which indicate the classification pattern, are shown as the darkest dots. Amino acids are denoted with small numbers, ordered alphabetically (A = 1, C = 2, D = 3, E = 4, F = 5, G = 6, H = 7, I = 8, K = 9, L = 10, M = 11, N = 12, P = 13, Q = 14, R = 15, S = 16, T = 17, V = 18, W = 19, Y = 20).
