Background

Cancer, at its most basic, is caused by the accrual of somatic mutations within oncogenes and tumor suppressors in the genome [1]. While mutations within tumor suppressors usually lower or completely disrupt the activity of genes that promote cell apoptosis or regulate the cell cycle, oncogenic mutations typically increase or destabilize the resulting protein output. As it is easier to disrupt protein function than restore it, there has been significant pharmacological research geared towards inhibiting oncogenic mutations as described in [2, 3] and [4]. Coupled with the idea of “oncogene addiction”, that a small set of “driver” genes promote uncontrolled cellular growth in a wide variety of cancers and that inactivation of these genes can significantly impair tumorigenesis [5, 6], the identification of driver oncogenic mutations has become of key importance due to its potential translational benefit.

Due to the biological importance of this problem, a variety of methodologies have been proposed to identify regions where activating mutations may occur. One approach is based on the idea that compared to the background mutation rate, driver mutations will have a higher frequency of non-synonymous mutations [7, 8]. Several improvements to this approach have been made such as normalizing for gene length [9] as well as accounting for different mutation rates due to features such as transitions versus transversions, location of CpG sites and tumor type [10]. Relatedly, instead of comparing the mutational frequency directly to the background rate, one can also compare the ratio of nonsynonymous (K a ) to synonymous (K s ) mutations per site [11]. Recently, [5, 8, 12] and [13] showed that somatic mutations appear to cluster within protein kinases while [14] and [15] demonstrated that mutational clustering can be a sign of positive selection for protein function and thus sites for protein engineering.

Alternatively, several machine learning methods have been developed to determine the nature of a specific mutation. For instance, Polyphen-2[16] attempts to discern whether a mutation is deleterious while CHASM[17] attempts to distinguish between driver and passenger mutations. These methods use a wide range of sequence and non-sequence features to build a set of rules that are then used to score each mutation with the score value determining the mutation classification. The rules are developed on a training data set via a variety of methods such as Random Forests [18], Bayesian Networks [19] and Support Vector Machines [20]. The features used often include information on the size and polarity of the substituted and original residues, available structural information as well as evolutionary conservation [21]. Some classifiers are optimized to use only a small set of features in prediction. For example, the SIFT classifier [22], only uses evolutionary conservation to predict whether the protein functional change is tolerated or damaging.

While all these methods have shown some success in identifying damaging or deleterious mutations, they nevertheless have limitations. Methods that rely upon differentiating the frequency of synonymous and non-synonymous mutations as compared to the background rate may fail to take into account that selection may occur upon only a small region of the gene and that signal loss may occur when the gene is considered in total. The methodologies proposed by [9] and [11] fail to distinguish between activating and non-activating non-synonymous mutations while the method developed by [10] may be biased if the background mutational rate is not accurately estimated. Furthermore, not only do machine learning classifiers require several sources of information that need to be periodically updated to account for new research, it is often the case that much of the requisite information needs to be collected for the first time at significant expense.

Building upon the hypothesis that driver mutations tend to cluster in functionally relevant protein regions, [23, 24] and [25] recently developed several statistical methodologies to identify mutational clusters. Specifically, [23] developed Non-Random Mutational Clustering (NMC) which identifies clustering by testing against the null hypothesis that missense substitution locations are distributed uniformly. If two mutations are closer together than expected by chance, the null hypothesis that every amino acid has an equal probability of mutation is thus rejected. However, as the distance between every pair of mutations needs to be tested, a large multiple comparison penalty is incurred. iPAC[24] and GraphPAC[25] expanded upon NMC by taking into account protein tertiary structure via a MultiDimensional Scaling approach (MDS) [26] and a graph theoretical approach, respectively. While both of these methods improved over the linear NMC method, they nevertheless remap the protein to one dimensional space resulting in information loss.

In this article, we provide an improvement to iPAC, GraphPAC and NMC by considering the protein directly in three dimensional space and thereby avoid the information loss inherent in dimension reduction algorithms. Using this new approach, we are able to identify proteins with significant clusters, such as FGFR3 and CHRM2, which are otherwise missed by iPAC, GraphPAC and NMC (see Section “SpacePAC identifies additional proteins containing clusters”). In addition, SpacePAC provides better “localization” for mutational hotspots (see Section “SpacePAC improves cluster localization”). Finally, we show that many of the mutational hotspots identified by SpacePAC are categorized as activating mutations by CHASM and damaging mutations by PolyPhen-2. Overall, by avoiding the protein remapping step required by iPAC and GraphPAC as well as the multiple comparison penalty these methods incur for looking at every pairwise combination of mutations, we are better able to identify mutational hotspots that are indicative of driver mutations. For the rest of this paper, we refer to the set of NMC, GraphPAC, and iPAC as the “pairwise methods” as they consider every pairwise combination at the cost of an extra multiple comparison adjustment.

Methods

SpacePAC uses a three step process to identify mutational clusters. Step one is to obtain the mutational and structural data (see Sections “Obtaining mutational data” and “Obtaining 3D structural data”). Step two is to reconcile the databases so that the mutational information can be mapped onto the protein structure (see Section “Reconciling structural and mutational data”). The third step is to simulate the distribution of mutation locations over the protein tertiary structure and identify if any regions of the protein have observed mutational counts in the tail of the distribution (see Section “Identifying mutational hotspots”). Finally, although not part of the SpacePAC algorithm, we perform a multiple comparison adjustment to account for the multiple structures considered (see Section “Multiple comparison adjustment for structures”).

Obtaining mutational data

The 65th Oracle version of the COSMIC database, the latest version as of when this paper was written, was used to obtain the mutation data. First, only missense substitution mutations recorded as “confirmed somatic variant” or “Reported in another cancer sample as somatic” were used. Further as both SpacePAC and the pairwise methods are tested against the null hypothesis that mutations are randomly distributed along the protein, only mutations from whole gene or whole genome screens were retained in order to avoid selection bias. Next, only genes labeled with a UniProt Accession Number [27] were kept as the UniProt identification was used to match the protein sequence to the protein structure. Finally, as several studies may report mutational data from a single cell line, mutation duplications were removed in order to avoid over counting specific mutations (see “Additional file 1: COSMIC query” for the SQL code). Mutations are then aggregated from various sources (cell lines, tumor samples, etc.) to help provide a complete mutational distribution for the protein during the analysis described in Section “Algorithm for identifying sphere positions”.

Obtaining 3D structural data

Protein structures were obtained from the PDB. The spatial coordinates of the α-carbon atom in each amino acid were used to represent that amino acid’s location in 3D space. Further, as multiple structures are often available for the same protein, all structures that matched the protein’s UniProt Accession Number were analyzed and an appropriate multiple comparison adjustment applied afterwards (see Section “Multiple comparison adjustment for structures”). If multiple polypeptide chains within the same structure matched the UniProt Accession Number, the first matching chain shown in the file was used (commonly chain “A”). Similarly, if the structure determination provided more than one protein conformation, the first conformation listed in the pdb file was kept. We note that while we only used the first conformation displayed in the file, as done by [24] and [25], the SpacePAC software allows the user to select the desired conformation should an analysis of more than one alternative conformation be required. For a full listing of the 1,903 structure/side-chain combinations considered, see “Additional file 2: Structure files”.

Reconciling structural and mutational data

As the residue numbering scheme often differs between the COSMIC and PDB databases, we reconciled the information in both sources. Similar to iPAC and GraphPAC, SpacePAC provides the user two possible reconciliation options. The first option is based upon a numerical reconstruction from the structural data available directly in the PDB file while the second performs a pairwise alignment as detailed in [28]. As the PDB file structure may change depending upon when the file was added to the database along with other technical difficulties, we used pairwise alignment to reconcile the mutational and positional information for each structure unless specifically noted. For further information on the pairwise alignmnent, please see the iPAC package available on Bioconductor. Successful alignment was obtained for 131 proteins corresponding to 1,110 individual structure/side-chain combinations. Note that structures that did not have tertiary data on at least two mutations were considered blank (as no clustering is possible) and dropped from the analysis. See “Additional file 2: Structure files” for full details of each combination.

Identifying mutational hotspots

The general principle for SpacePACis that we identify the one, two and three non-overlapping spheres that cover as many of the mutations as possible at various radii lengths. We then normalize the number of mutations covered by the spheres and find the maximum normalized value. This value is then compared to a simulated distribution to obtain a p-value. Specifically, we proceed as follows:

  • Let s be the number of spheres we consider. s∈{1,2,3}.

  • Let r be the radius considered. Here we consider, r∈{1,2,3,4,5,6,7,8,9,10}.

  • Simulate T(≥1000) distributions of mutation locations over the protein structure. Specifically, for each simulation, every mutation is randomly assigned to a residue i where 1≤iN and N is the total number of residues in the protein.

Next, let X0,s,r represent the number of mutations that occur within the s spheres over all samples. For instance, suppose we had three tumor samples, A, B and C, and were considering two spheres. Further, suppose that sample A had mutations at residues 5 and 20, sample B had mutations at residues 6 and 21, and sample C had only one mutation at residue 5. Thus, if sphere one overlapped residues 5 and 6 while sphere two overlapped residues 20, 21, then X0,2,r=5. Specifically, there are 3 mutations from samples A, B, C within the first sphere and 2 mutations within the second sphere.

Given, s we then identify the sphere centers in such a way that capture as many of the total mutations as possible (See Section “Algorithm for identifying sphere positions”). For instance, if s=3, we then need to identify three sphere centers, p1,p2 and p3. Let Xi,s,r represent the same statistic but for simulation i. For given {s,r}, calculate μ s , r = mean 1 i T { X i , s , r } and σ s , r = std. dev. 1 i T { X i , s , r }. For each simulation, calculate Z i = maxi{(Xi,s,rμs,r)/σs,r}. The p-value is then estimated as: ( 1 Z 0 > Z i )/T. Note that while we identify up to three spheres (or “hotspots”) that contain mutational clustering, only one p-value is necessary to reflect the statistical significance of all the hotspots. This process is best seen through Figure 1. For the rest of this paper, we will refer to “hotspot” and “cluster” interchangeably.

Figure 1
figure 1

Statistic construction. In this example, we consider r∈{3,9} and up to 3 potential mutational hotspots in the protein. First, μ and σ are calculated over each column. Next, we normalize each entry in the column by calculating Z i , s , r = X i , s , r μ s , r σ s , r . We then take the maximum over each row to get Z0,…,Z1000. The percentage of times Z0Z i , where i∈{1,…,1000}, is the p-value of our observed statistic Z0. Note that if Z0 is greater than Z i for all 1000 simulations, we report a p-value of <1.00 E-03.

While our algorithm is generalizable to m spheres (m<<N, where n is the total number of residues), we note that the worst case running time is O N m . Due to the nature in which we traverse the sample space of possible sphere orientations (see Section “Algorithm for identifying sphere positions”), we rarely hit the worst case scenario, and even then, in the case of three spheres the running time is only O N 3 which is still computationally possible. Further, our algorithm was able to identify only 1 or 2 significant spheres for over 70% of the structures considered. Even more so, for the structures that did have 3 statistically significant spheres (hotspots), the three spheres covered on average more than 75% of all the mutations available. Given these empirical results, three spheres were selected to balance computational costs.

While sphere radii ranging from 1 to 10Å were considered in this study, the researcher can input larger values. However, over all the structures evaluated in this study, our methodology never required spheres of radius 10Å. Optimal sphere orientations were always found at smaller radii. Thus while larger sphere radii could of been considered, they would not have been selected by the algorithm.

Finally, due to our consideration of only non-overlapping spheres, additional constraints on the number of possible sphere orientations are placed which allows SpacePAC to efficiently solve for the optimal sphere orientation. For example, if two nearby residues are mutated, SpacePAC will incorporate them into a larger sphere and then use the remaining spheres to identify more distant clusters. If sphere overlap was permitted, not only would it become difficult to assign a mutation to a specific sphere but it could also lead to uninformative results. For instance, consider a worst case scenario where one residue has very many mutations and two spheres are being used to find the mutational hotspots. If the rest of the protein has no mutations, the two spheres will perfectly overlap which would in turn provide a meaningless result. If the mutations were spread over two nearby residues, the sphere overlap would again be almost perfect and not informative. By using the non-overlapping sphere constraint, SpacePAC will dynamically increase the sphere radius and capture the mutations in one sphere. On the whole, by considering only non-overlapping spheres we impose boundary constraints in order to find meaningful sphere configurations.

Algorithm for identifying sphere positions

In the approach described in Section “Identifying mutational hotspots”, we find the 1, 2 and 3 non-overlapping spheres that cover the most mutations given a pre-specified radius length. Ignoring sphere overlap for the moment, if only one sphere is considered, the number of possible spheres is linear in the length of the protein (namely, a sphere centered at each residue). If two spheres are considered, there are N 2 possible sphere combinations if the protein is N residues long. Similarly, if three spheres are considered, there are N 3 such combinations (once again ignoring sphere overlap). For a medium-sized protein like PIK3C α which is 1,068 residues long, considering three spheres allows for 1 , 068 3 = 202 , 461 , 116 possible positions. This makes it prohibitively expensive to perform a brute force approach. To quickly find the best sphere orientation, we execute Algorithm ?? below. Algorithm ?? is presented for 2 spheres but is trivially extendable to 3 or more spheres as well.

To help illustrate this algorithm, consider a protein N residues long with five mutated amino acid positions and suppose that we are interested in finding the two spheres that capture the most mutations. Without loss of generality and for ease of exposition, number the mutated residues 1 through 5. Further, suppose that after aggregating all the mutational counts from all samples the following mutational counts are recorded: residue 1 - 50 mutations, residue 2 - 40 mutations, residue 3 - 30 mutations, residue 4 - 20 mutations and residue 5 - 10 mutations. For clarity, we assume that the mutation counts are unique, but the algorithm is unaffected if there are identical mutation counts for some residues. We first construct the table shown in Figure 2, where the inside of the matrix is calculated to be the sum of the number of mutations at amino acid i and amino acid j. Observing, that the table is symmetric, we only need to evaluate the residues below the diagonal as the entries on the diagonal originate from residues that overlap each other perfectly. Thus for entry (i,j) we are considering two spheres, one centered at residue i and one centered at residue j.

Figure 2
figure 2

Organizing the data example. In our example, the protein has 5 residues with mutations. The residues are sorted from largest to smallest (so residue 1 has the largest number of mutations, residue 2 the second largest number of mutations, etc.), and the inside of the table is calculated as the sum of the mutations on both residues using all the samples in the study. For instance, as residue 1 has 50 mutations and residue 2 has 40 mutations, there were 90 total samples that had a mutation either on residue 1 or 2. In the actual code, only the lower half of the table is considered and then only sequentially to decrease running time, but we present the whole table here for clarity.

The algorithm proceeds by appending to the “candidate” stack the element directly below and the element directly to the right starting from the (2,1) entry. An “element” consists of a 3-tuple (i,j,s) where i represents the row, j represents the column and s represents the value in position (i,j). After every two potential appends to the stack, the maximum value over the 3rd position is found (with the third position corresponding to the number of mutations covered by both spheres). The two spheres that contribute to this max element are then checked for overlap. If the spheres do not overlap, then a successful case has been found and the algorithm completes. If the spheres do overlap, the next set of elements are appended and the process continues. By proceeding in the way described in Algorithm ??, at each iteration, the pair of spheres being considered contains the maximum number of mutations possible from the remaining set of sphere combinations. Hence, once the first pair of non-overlapping spheres is found, the optimal sphere combination has been found and the algorithm can terminate. To see this process, see Figure 3.

Figure 3
figure 3

Algorithm execution example. This figure refers to the data in Figure 2. The first index, i, represents the row and the second index, j, represents the column. The third index, s, represents the total number of mutations at amino acids i and j. Beginning in position (2,1,90), we then add (3,1,80) to the stack, then {(4,1,70), (3,2,70)} and so forth. After each addition to the stack, we pick the element with the highest value in the third position of the 3-tuple.

It is worth noting, that by proceeding as presented above, we are able to traverse the space of possible solutions in a manner that does not require every combination to be explicitly considered and yet at the same time results in a globally optimal solution. At each step, every subsequent combination of spheres will capture fewer mutations (or in very rare cases as many mutations). As soon as the first combination of non-overlapping spheres is found, it is immediately known that all the other remaining sphere combinations will capture fewer mutations. Therefore, there is no longer any need to explicitly consider them. In the worst-case scenario (where the spheres keep overlapping at every iteration), SpacePAC will still need to consider all N 3 combinations, just as the brute force method. However, by leveraging the fact that at small enough radii the algorithm quickly identifies a case where the spheres do not overlap, the worst-case scenario almost never occurs and the running time is drastically reduced.

Multiple comparison adjustment for structures

A multiple comparison adjustment was performed to account for testing 1,110 protein structures. Since many structures may pertain to one protein, a Bonferroni adjustment is too conservative and an FDR approach was used. Specifically, a rough FDR (rFDR) [29] approach, which is a good approximation to the standard FDR approach [30] when there are a large number of positively correlated or independent tests, was applied. In our case, the cutoff is set at:

rFDR = α k + 1 2 k

where k=1,110, the total number of structures in the study. Using an α=0.05, the r F D R≈0.025023. To be conservative, we rounded down and deemed all clusters with a p-value less than or equal to 0.025 to be significant.

Results and discussion

Of the 131 proteins considered, SpacePAC identified 18 proteins with significant clustering as shown in Table 1. For a full list of which structures were found significant under SpacePAC, GraphPAC and NMC, see “Additional file 3: Results summary”. We note that while Table 1 shows the p-values for the 18 proteins found significant by SpacePAC, there were 5 proteins that were found significant only by GraphPAC, 1 protein only by iPAC and 1 protein only by NMC (see “Additional file 4: Results table” for a complete list of what proteins were identified significant under each method). However, we also note that SpacePAC identified the largest number of proteins with significant clustering at the same false positive thresholda. We further note that several of the proteins identified only by SpacePAC have already been associated with cancer as shown in Section “SpacePAC identifies additional proteins containing clusters”. Furthermore, as shown in Figure 4, 14 out of the 18 proteins identified by SpacePAC have their most significant hotspot overlap a biologically relevant region and three of the remaining four proteins (CTNNB1, FGFR3 and FSHR) have been implicated with cancer. For a full description regarding the overlap between SpacePAC identified hotspots and structurally significant regions, see “Additional file 5: Relevant sites”.

Table 1 Summary of genes with significant clusters
Figure 4
figure 4

Results summary. A breakout of what biologically relevant regions are overlapped by the most significant cluster for each of the 18 proteins. Overall, 77% of the hotspots overlap a binding site or a protein domain.

Specifically, for CTNNB1, the SpacePAC identified hotspot covers mutations G34R and G34V, which are associated with hepatocellular carcinoma and hepatoblastoma [31, 32], respectively. Further, FSHR has been shown to be expressed in the vascular endothelial tissue of a wide range of human tumors including lung, breast, prostate, colon and leiomyosarcoma [33]. For more detail on FGFR3, see Section “SpacePAC identifies additional proteins containing clusters”.

Finally, we evaluated SpacePAC performance via two common machine learning methods, PolyPhen-2[16] and CHASM[17]. Before we summarize the results, we note that PolyPhen-2 and CHASM utilize a large set of features when evaluating each mutation. The advantage of SpacePAC is that it is able to be run with vastly less a priori information. Out of the 38 mutated amino acids that fall within SpacePAC identified hotspots, PolyPhen-2 identifies 36 (95%) as damaging while CHASM identifies 31 (82%) as driver mutations at a FDR of 20%. On the protein level, PolyPhen-2 identifies all the 18 proteins identified by SpacePAC as significant while CHASM identifies 14 proteins as significant. Moreover, SpacePAC identifies several proteins with significant clustering that are missed by the machine learning methods. For instance, SpacePAC identifies FSHR as significant, which, as described above, has recently been associated with cancer. However, CHASM calculates a FDR of 0.45 for FSHR, which is above the significance threshold. See “Additional file 6: Performance evaluation” for more information.

SpacePACidentifies additional proteins containing clusters

As described in Section “Results and discussion”, SpacePAC identified three proteins with significant clustering that were missed by NMC, iPAC and GraphPAC. We will now consider two of these proteins, Fibroblast Growth Factor Receptor 3 (FGFR3), and Muscarinic Acetylcholine Receptor M2 (CHRM2 or M2).

The CHRM2 structure (PDB ID: 3UON) [34] was identified by SpacePAC as having two significant mutational hotspots (p-value = 0.011) located at residues 52 and 144 (see Figure 5). CHRM2, essential for the physiological regulation of cardiovascular function [34] has been implicated in a variety of cardiovascular diseases. Recently, CHRM2 has also been associated with both autoimmune diseases and cancer [35]. Current research shows that M2 receptors are expressed in both glioblastoma cell lines and human samples. Moreover, the M2 agonist arecaidine strongly decreases cell proliferation in both primary cultures and cell lines in a dose and time dependent response profile. This suggests that M2 activation has an important role in suppressing glioma cell proliferation and can provide a novel therapeutic target [36]. Had the spatial structure not been taken into account, as under NMC, or if the structure was accounted for but only via remapping to 1D space followed by a pairwise multiple comparison adjustment, as under iPAC and GraphPAC, this cluster would have been missed.

Figure 5
figure 5

CHRM 2 clustering. The CHRM2 structure (PDB ID: 3UON) where residues 52 and 144 are highlighted.

SpacePAC identified the FGFR3 structure (PDB ID: 1RY7) [37] as having one significant hotspot (p-value =0.020) centered at amino acid 248 (see Figure 6). FGFR3 is a tyrosine-protein kinase which plays a critical role in regulating cell differentiation, proliferation and apoptosis and is often associated with cancer and developmental disorders [38]. Mutation R248C occurs in the Ig-like domain and is a severe and lethal mutation associated with bladder cancer [39] along with a variety of other phenotypes such as thanatophoric dwarfism [40] and epidermal nevi [41]. This cluster represents a perfect example of signal loss when all pairwise mutations are considered. In the case of our data, as all the mutations occur on one residue, a cluster is formed at that one residue only. There is therefore no difference between any of the pairwise methods as the remapping step has no effect. However, since iPAC, GraphPAC and NMC need to account for all pairwise comparisons between mutations (all occurring on residue 248), the signal is lost under all three methods. On the other hand, as SpacePAC does not need to perform such a correction, it is successfully able to detect the cluster. Moreover, we note that [42] recently developed an anti-FGFR3 monoclonal antibody that interferes with FGFR3 binding and inhibits R248C [39].

Figure 6
figure 6

FGFR3 clustering. The FGFR3 structure (PDB ID: 1RY7) where residue 248 is highlighted blue.

SpacePACimproves cluster localization

For protein-structure combinations in which mutational clusters are detected by other methods, SpacePAC is often able to provide a smaller set of clusters compared to the pairwise methods while still covering the majority of mutations. To illustrate this point we consider two examples, the BRAF structure (PDB ID: 3Q96) [43] where SpacePAC identifies 3 mutational hotspots and the ALK structure (PDB ID: 2XBA) [44] where SpacePAC identifies 2 mutational hotspots.

BRAF is a well known oncogene that is part of the RAS-RAF-MEK-ERK-MAP kinase pathway which is often activated in human tumors. Further, it is estimated that approximately 90% of mutations in this gene are a substitution of a glutamate for a valine at residue 600 (V600E) [45]. In our mutational data, 187 (83.5%) mutations were on residue 600 with the remaining 37 mutations spread over 13 other residues. Mutations on V600 typically result in constitutively elevated kinase activity and have been found in a wide range of cancers such as metastatic melanoma, ovarian carcinoma and colorectal carcinoma [5, 7, 4649]. Due to the large number of V600 mutations, SpacePAC and all the pairwise methods identified residue 600 as the most significant “cluster” in all structures where tertiary information was available for that residue. It is worth noting that BRAF V600 inhibitors, such as Vemurafenib, have already been developed, further supporting the hypothesis that mutational clusters may represent pharmaceutical targets [50].

As the signal presented by V600 is so strong, it may mask the signal from other mutations within the BRAF protein. As such, we considered structure 3Q96 which does not have tertiary information for residues 600 and 601. Of the remaining 28 mutations spread over 12 residues, SpacePAC identifies three hot spots with 7-8 mutations per cluster as shown in Table 2 (combined hot spot p-value <1.00 E-03). Moreover, each of the three regions identified have been associated with oncogenic elevated kinase activity as well as a variety of cancers such as lung adenocarcinoma, melanoma, colorectal adenocarcinoma and ovarian serous carcinoma [5, 46, 51]. See Figure 7 for a visual orientation of the clusters presented in Table 2.

Table 2 BRAF clusters
Figure 7
figure 7

BRAF clustering. The BRAF structure (PDB ID: 3Q96) where cluster 464-466 is shown in blue, 469-471 is shown in red and 595-597 is shown in purple. The central residue in each cluster (465, 470 and 596 for the blue, red and purple clusters, respectively) is labeled.

Together, the three hot spots identified by SpacePAC cover 79% of the mutations for which tertiary information is available. Moreoever, while NMC, iPAC and the three GraphPAC methods report approximately 8 to 16 times as many clusters as SpacePAC (see Table 3), the additional clusters only cover the remaining 21% of mutations. Finally, all the residues that do not fall within SpacePAC hot spots are those which have only one mutation. These additional clusters stem from the fact that NMC, iPAC and GraphPAC must consider every pairwise combination of mutations resulting in many clusters that only differ from each other by a few residues. Further, by considering every pairwise combination, many smaller clusters are often combined into larger clusters with a less significant p-value. While technically still a “significant” cluster, these extra clusters provide little additional information. As SpacePAC does not have to consider every combination, it does not suffer from this issue.

Table 3 Methodology comparison for BRAF

We now consider the Anaplastic Lymphoma Kinase (ALK) protein for which SpacePAC identifies two mutational hotspots (combined hot spot p-value <0.001) (see Table 4). The ALK protein is a receptor-type tyrosine kinase that is preferentially expressed in neurons during the late embryonic stages [52]. Mutations in this protein have been associated with both neuroblastoma as well as non-small cell lung cancer [53, 54]. Hotspots A and B in Table 4 both occur in the protein kinase. Further, mutations on F1174 and R1275 can cause constitutive activation which impairs receptor trafficking [55]. We note that SpacePAC perfectly identifies both hotspot locations. Moreover, it has recently been shown that activating mutations F1174L and R1275Q provide therapeutic targets in neuroblastoma [54], supporting the hypothesis that mutational clusters are indicative of activating mutations. See Figure 8 for a visual orientation of the clusters presented in Table 4.

Table 4 ALK clusters
Figure 8
figure 8

ALK clustering. The ALK structure (PDB ID: 2XBA) where cluster 1173-1175 is shown in blue and cluster 1274-1276 is shown in red. The central residue in each cluster (1174 and 1275 for the blue and red clusters respectively) is labeled.

The two hotspots identified by SpacePAC cover 88.5% of all the mutations in our data with the remaining mutations occurring on residues with only one mutation each. Further, as seen in Table 3, the pairwise methods have 3 to 5.5 times as many clusters as SpacePAC. As before, by not having to consider every pairwise combination and thus not reporting similar clusters, SpacePAC is able to better localize the critical mutational areas.

Conclusion

In this article we provide a novel algorithm to account for protein tertiary structure when identifying mutational clusters in proteins. By considering the protein structure directly in 3D space, we avoid the use of a dimension reduction algorithm and potential information loss. Further, by not considering every pair of mutations, we are able to reduce the multiple comparison penalty and identify additional clusters. We show several examples of clusters that are not identified by alternative methods as well as the ability to improve cluster localization while still covering the majority of mutations. Moreover, several of our examples identified clusters which overlap potential therapeutic targets, supporting the hypothesis that clusters may be indicative of activating driver mutations. Finally, since the SpacePAC methodology does not need to look at every pairwise combination of mutations it also runs much faster for proteins with many (>400) mutations. In these situations, while the pairwise methods may take several days to complete, SpacePAC still finishes in a matter of hours. For proteins with fewer mutations, the running time of all the pairwise methods as well as SpacePAC is comparable, with the majority of protein/structure combinations terminating in under 10 minutes when executed on a consumer desktop with an Intel i7-2600k processor (at a frequency of 3.40 GHZ) and 16 GB of DDR3 RAM.

SpacePAC, while presenting an important alternative to the one dimensionality restriction required by iPAC and GraphPAC, is nonetheless subject to several limitations. First, SpacePAC is currently limited to at most three mutational hotspots to save on computational time. While it is unlikely that a single structure will have more than three hotspots, the extension to allow SpacePAC to account for more than three hotspots is algorithmically simple. As SpacePAC was able to process our entire database of protein/structure combinations in under 5 hours (with all the structures evaluated in parallel), this restriction is minimal and will only grow smaller as computational power increases.

Second, to satisfy the uniformity assumption, the mutational status of each amino acid must be known. However, due to improvements in high-throughput sequencing, this is rapidly becoming a non-issue. Next, unequal rates of mutagenesis in specific genomic regions may violate the assumption that each residue has an equal probability of mutation. To help ensure that our data met this statistical assumption, we only considered missense mutations as many insertion and deletion mutations are sequence dependent. Relatedly, while the literature shows that CpG dinucleotides often have a mutational rate ten times or higher when compared to other locations [56], approximately only 14% of the clusters presented in section “SpacePAC identifies additional proteins containing clusters” and “SpacePAC improves cluster localization” overlapped CpG sites. Similarly, cigarette use typically causes transversion mutations within lung carcinomas [23] while colorectal carcinomas result in transition mutations [57]. In the case of KRAS however, the vast majority of mutations are located on residues 12, 13 and 61 for both cancers. This signifies that while the mutational type may differ, the impact on mutation location is minimal and does not violate the uniformity assumption. We also note that while we used a uniform mutational distribution for our simulations (as the goal of this paper was to present a novel technique that utilizes tertiary protein structure), this is not a requirement of the overall approach. Prior knowledge of varying mutation rates can be integrated into the simulation step described in Section “Identifying mutational hotspots”. This would allow for increased accuracy when identifying clusters by potentially increasing both the specificity and sensitivity of the statistic. Further research is required in this area to realize the full benefit of such an approach. Lastly, it is worth noting that as we obtained our mutational data from the COSMIC database, specific tissue types may be more represented than others. However, under this situation our analysis would be more conservative and the resulting findings even more significant. Specifically, aggregating over all tissues increases the total number of mutations, thus increasing the number of simulated mutations within any given segment of the protein. The resulting p-value of any observed hotspot would lose significance as more mutations would be simulated within the cluster. Overall, while this as well as previous studies are impacted by several external factors, it appears that selection of the cancer phenotype is the primary cause of clustering.

In conclusion, SpacePAC presents a novel approach to account for protein tertiary structure when identifying mutational hotspots. We show that SpacePAC identifies novel clusters of biological relevance, improves cluster localization and in several cases identifies pharmaceutical targets for which therapies are already in production. In turn, we further confirm the hypothesis that mutational hotspots are indicative of driver mutations and show that SpacePAC can be used to quickly locate such potential mutations as additional structures are published. Several novel areas for further research are directly applicable as well. First, while the algorithm developed here uses up to three spheres, the methodology can be expanded to use more spheres when identifying clustering. Further, the approach proposed here is not limited to identifying human mutational clusters in proteins, but can be extended to other species as well as DNA and RNA, once the requisite positional information is available.

Endnote

a The GraphPAC algorithm was run using each of the three insertion methods described in [25]. While the methodology is the same, the nature of the algorithm leads to different results when different insertion methods are considered. SpacePAC outperformed GraphPAC in comparison to each of the individual insertion methods.