Why do mutant allele frequencies in oncogenes peak around .40 and rapidly decrease

The mutant allele frequencies in oncogenes peak around .40 and rapidly decrease. In this article, we explain why this is the case. Invoking a key result from mathematical analysis in our model, namely, the inverse function theorem, we estimate the selection coefficients of the mutant alleles as a function of germline allele frequencies. Under complete dominance of oncogenic mutations, this selection function is expected to be linearly correlated with the distribution of the mutant alleles. We demonstrate that this is the case by investigating the allele frequencies of mutations in oncogenes across various cancer types, validating our model for mean effective selection. Consistent with the population genetics model of fitness, the selection function fits a gamma-distribution curve that accurately describes the trend of the mutant allele frequencies. While existing equations for selection explain evolution at low allele frequencies, our equations are general formulas for natural selection under complete dom...


Introduction
Cancer is an evolutionary disease with mutant alleles acting as a unit of selection and, therefore, quantifying selection is necessary. Since, the distribution of the allele frequencies of somatic mutations reflects the nature of selection underlying the mutant clones, modelling this distribution is essential. Recently, Williams, Werner, Barnes, Graham, and Sottoriva (2016) showed neutral tumour evolution results in a power-law distribution of the mutant allele frequencies, and this law fits 303 of 904 cancers of various types. While neutral evolution remains an important aspect in several cancer types, the distribution of the allele frequencies of mutations in genes that undergo positive selection, has not been determined so far. This is difficult to model because the allele frequencies of positively selected mutations are related to their functional consequences, and therefore we have to take into account the degree of dominance exhibited by these mutations. However, in the case of oncogenes, which are primarily dominant, determining the distribution of the mutant allele frequencies should be feasible. It is worth noting that in the context of human polymorphisms, the allele frequencies of new deleterious mutations have been studied and fitness distribution is shown to follow a gamma distribution (Eyre-Walker, Woolfit, & Phelps, 2006). Nonetheless, the fitness function and distribution of allele frequencies of oncogenic mutations in tumours have not been described.
The mutant allele frequencies in oncogenes peak around .40 and rapidly decrease. We built a mathematical model to describe the trend of these frequencies. We assumed complete dominance of oncogenic mutations, although we realize dominance (or partial dominance) can be modelled as a function of the functional impact of these mutations, which can be derived from algorithms such as PolyPhen and SIFT (Adzhubei et al., 2010;Ng & Henikoff, 2003). The scope of this article is restricted to describe the general tendency of the frequencies rather than considering the impact of the mutations on their frequencies. By taking advantage of a key result in mathematical analysis, namely, the inverse function theorem, we estimate the mean effective selection of the mutations as a function of germline allele frequencies. Under complete dominance of oncogenic mutations, this selection function is expected to be linearly correlated with the distribution of the mutant alleles. We demonstrate that this is the case by investigating the allele frequencies of mutations in oncogenes across various cancer types, validating our model for mean effective selection. Consistent with population genetics model of fitness, the selection function fits a gammadistribution curve that accurately describes the trend of the mutant allele frequencies.
This model infers mean effective selection for oncogenic mutations without considering other alterations in the DNA that could change the dynamics of the tumour microenvironment. Moreover, this measure is an effective selection coefficient in the sense that it is selection coefficient relative to the change in mutant/germline allele frequencies.
Combining this estimate with other modifications, such as copy number changes, and integrating the functional impact of the resulting protein will help in understanding the evolution of the mutant clones in various tumours. Although the equations that we derive are currently applied in the context of oncogenes, these are general formulas for natural selection under complete dominance operating at all frequencies. While being consistent with known formulas that explain evolution at low frequencies, one of these equations (Equation (9)) shows that selection against recessive alleles behaves in a power-law-like manner, reiterating the powerful role of natural selection. This would also explain the reason some tumours undergo rapid clonal evolution. Further, at high frequencies of the dominant alleles, the linear expansion exhibited by selection could partly be the reason behind drug resistance, under dominance.

Determining selection coefficients
We use the standard model described in Falconer (1960) for selection under complete dominance. An illustration of this model for dominant and recessive alleles when the frequencies are not time dependent is shown in Supplementary Figure 1. Under this model, if s is the coefficient of selection, p and q are the allele frequencies with p + q = 1, the new allele frequencies f , in terms of p and q are given by (1) Using this model for time-dependent allele frequencies, we derive a new equation for the number of mutant alleles in terms of the germline allele frequency. At time t, let M be a population of cells consisting of mutant and germline genotypes with p 2 t , 2p t q t and q 2 t as frequencies of mutant homozygous, mutant heterozygous and germline genotypes of M, respectively, with p t + q t = 1. Let the strength of the selection be expressed as a coefficient of selection, s, which is proportional to the reduction of the germline genotype, compared to the mutant genotypes which are favoured for the tumour growth. In reality, the selection coefficient should be a function of time. However, we can consider s as mean selection acting over time and so the time dependence can be omitted. Thus, in our model, we assume the frequencies are time dependent, but the selection is time independent. If the fitness of the homozygous and heterozygous mutant genotypes are taken to be 1 as it is likely the case in oncogenes, the fitness of the germline genotype which is selected against is then 1 − s. Thus, after one generation, the new mutant allele frequency, M q t , in terms of the recessive germline allele frequency q t is given by Equation (1), which is Hence, change in mutant allele frequency, M q t , resulting in a small time interval t of selection is Let T 0 and T be the time of tumour initiation and the time of tumour biopsy, respectively. Then, Let C[T 0 , T] be the set of all continuous functions on the interval [T 0 , T] and let q t ∈ C[T 0 , T]. Note that C[T 0 , T] can be viewed as a set of random variables (since continuous functions on R are measurable). Let this set when viewed as a set of random variables be denoted by Denoting F = G −1 and u = q t , we see that, if u 0 = q T 0 and u T = q T are initial and final frequencies, then Equation (3) is This formula allows us to express the number of mutant alleles purely in terms of germline allele frequencies and decouples time dependence. Since G defines a function in C as a random variable, G essentially associates a cumulative distribution function (CDF) (or a probability density function) that describes the random variable. If λ is the fraction of reduction of germline alleles due to selection s at any time t, we can model the CDF associated with G either as a relative increase in mutant alleles or as an absolute increase in mutant alleles that corresponds to an absolute decrease in germline alleles. That is, an absolute decrease in germline allele frequency g t due to λ reduction in germline alleles is given by, Alternatively, a relative increase in mutant allele frequency is given by the CDF (Note: this CDF is a relative increase) This is a relative increase because (ignoring the quadratic and higher powers of λ) In this paper, we will model F as a relative increase in the mutant alleles, and so the derivative of F(u) is given by F (u) = − 1 (1 − λ)u 2 and therefore, Equation (4) reduces to where s λ = s/1 − λ can be interpreted as mean effective selection coefficient favouring mutant genotypes. If the coefficient is extremely small at T 0 , then s λ C ≈ 0, and hence the number of mutations in terms of the observed germline allele frequency q at time T is given by and s λ is therefore, Also, from Equation (1), since we have if s λ is very small or when p is small, the number of mutations M(p) in terms of the mutant allele frequency p can be approximated by s λ p + p. Therefore, we expect s λ to be correlated with the distribution of the mutant allele frequencies for small values of s λ . However, we note from Equations (6) and (7) that extremely small values of q or M(p) = 0 will make s λ unbounded or negative, and hence correlation may not be valid.

Results
We determined the selection coefficients s λ for mutations in oncogenes to see if they are correlated with the distribution of the mutant allele frequencies. To do this, we first identified 574 proto-oncogenes from the Uniprot database (The UniProt Consortium, 2015) out of which 236 were exclusive to Homo sapiens. A total of 42,525 mutations in these genes were queried from the cBio portal (Cerami, 2012;Gao, Aksoy, & Dogrusoz, 2013) and 25,848 mutations for which mutant allele frequencies were available, was retained (Supplementary Table 1 available upon request). The germline allele frequencies were computed by subtracting the mutant allele frequencies from 1 and M(q) was normalized with its standard Euclidean norm. Equation (5) essentially states s λ should be correlated with the random variable M(q) under the random variable q − q 2 /2. Therefore, it is natural to fit the data (q − q 2 /2, M(q)). Since two fitness distributions, the gamma and the exponential have been traditionally applied to model selection coefficients (Gillespie, 1994), we employed both functions. Consistent with the fitness function of deleterious mutations in human polymorphisms (Eyre-Walker et al., 2006), the gamma-distribution curve fitted well with minimal residual error of 6 × 10 −4 with fitting function with respect to x = q − q 2 /2, given by where shape, scale, amplitude, and offset parameters α, β, ρ, and δ were identified as 68.17, .006, .02, and .01, respectively. MATLAB's 1 nlinfit was used to fit the data with initial conditions [1, 1, .5, 1] for the parameters. Figure 1 shows the scatter plot of (q − q 2 /2, M(q)) and the gamma-distribution curve fit. This function allows us to compute s λ for each mutation through Equation (6), i.e. s λ = g (α, β, ρ, δ; x)/x, where x is the transformed allele frequency, q − q 2 /2. Since s λ coefficients are as much as the number of mutations and M(p) is restricted by the frequency bin size, to find the correlation, we sub-sampled s λ for 10,000 iterations and considered mean correlation. Also, for the reasons discussed following Equation (7), mutant allele frequencies greater than .90 and M(p) = 0 were not considered. The mean correlation was determined to be .79 with mean p-value 2.3e −12 . We also determined the line of best fit (MATLAB's polyfit routine with degree 1) to infer the slope to find the optimal s λ that would fit with the mutant allele frequencies. Figure 2(a) shows the correlation between s λ coefficients and M(p) and Figure 2(b) shows optimal s λ that fits the mutant allele frequencies.

Discussion
Good correlation between selection coefficients s λ and the mutant allele frequencies explain why the frequencies are centred around .40 and rapidly decrease. Selection coefficients are maximized around this region and reduce exponentially.
In population genetics, it is known that at low frequencies under dominance, selection for/against dominant alleles follows O(p) and selection for/against recessive alleles grows at O(q 2 ). This is because Equation (1) tells us that While this is true when the unit of measure is generation time, integrating the selection equation with respect to time establishes that at all frequencies selection for the dominant alleles with respect to change in recessive allele frequency are linear. This can be seen by differentiating Equation (5): Similar analysis on selection against recessive allele with respect to dominant allele frequency would reveal (See Supplementary Method 1) where −s θ = −s(1 − λ) can be interpreted as effective selection against recessive alleles. These formulas, Equations (8) and (9), as relative increase and decrease, define natural selection under complete dominance. Equation (8) is natural after all -for a given dominant allele frequency, the rate of change of dominant alleles in terms of (the loss of) germline alleles, should be proportional to the amount of selection that takes place. While both equations are consistent with known observation under low frequencies, Equation (9) suggests that at low frequencies of the dominant alleles, selection against the recessive alleles acts in a power-law-like manner, demonstrating why positive (natural) selection is a potent force. It is worth noting that this power-law growth of relative fitness has been observed in long-term evolution experiments (Lenski, 2015;Wiser, Ribeck, & Lenski, 2013). Under complete dominance, selection would maximize the dominant alleles at all costs.
Equations (8) and (9) also allow us to compute the rate of change of gene/mutation frequencies with respect to time. If α and β are the rate of growth of dominant and recessive alleles, respectively, we can expect an exponential growth. Therefore, dp dt = αp; dq dt = βq.
Hence, denoting M(q) by M q and M(p) by M p , we see and dM p dt = dM p dp · dp dt = −s θ αq 2 .
Thus, if the genes/mutations don't directly depend on time, the total derivative, i.e. selection acting over time, is given by dM dt = s λ βpq − s θ αq 2 .
Further, Equations (10) and (11), help us write the general equation for evolution through natural selection for diploid genomes under complete dominance α 2 s θ dqdM q + β 2 s λ dpdM p = 0.
In the context of cancer, especially in the evolution of mutant clones in oncogenes when mutant allele frequencies are small, the normal linear growth of the mutant alleles along with the power-law-like loss of recessive germline alleles will doubly accelerate the progression of the mutant clones, possibly contributing to heterogeneity in the presence of competing mutations. Similarly, when recessive germline allele frequencies are small, and they grow quadratically, the progression of the mutant alleles will proceed linearly, still dominating and possibly conferring more resistance to therapy and giving rise to metastatic clones. Therefore, incorporating selection measures in evaluating functional impact of the mutations and assessing the aggressiveness of the tumours by taking the degree of selection into account will lead to better understanding of this complex evolutionary disease.

Note
1. The MATLAB routine used to generate the results is available upon request.