Gaussian Process (GP)
Gaussian Process (GP) is a type of stochastic process whose application in the machine learning field enables us to infer a nonlinear function f(x) over a continuous domain x (e.g., time and space). Precisely, f(x) is a draw from a GP if $\{f(x_1), \ldots, f(x_N)\}$ follows an N-dimensional multivariate normal distribution for the N input data points $\{x_i\}_{i=1}^N$. Letting $X = (x_1, \ldots, x_N)^\top$ and $f = (f(x_1), \ldots, f(x_N))^\top$, a GP is formally written as
$$f \sim \mathcal{N}(m(X), k(X, X)),$$
where $m(\cdot)$ denotes the mean function and $k(\cdot,\cdot)$ denotes the kernel function [11]. The simplest kernel function would be the linear kernel, such that $k(X, X) = \sigma^2 XX^\top$, while the automatic relevance determination squared exponential (ARD-SE) kernel is defined as
$$k(x_j, x_k) = \sigma^2 \exp\left[-\sum_{q=1}^{Q} \frac{(x_{jq} - x_{kq})^2}{2\rho_q}\right]$$
for the (j, k) element of k(X, X), where $x_j, x_k \in \mathbb{R}^Q$ are Q-dimensional input vectors. Here $\sigma^2$ is the kernel variance parameter and $\rho = (\rho_1, \ldots, \rho_Q)^\top$ is the vector of characteristic length scales, whose inverse determines the relevance of each element of the input vector. Typically, the mean function is defined as m(X) = 0.
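To make the kernel concrete, here is a minimal sketch in Python/NumPy that evaluates the ARD-SE kernel above and draws one zero-mean GP sample; all names (e.g., `ard_se_kernel`) are illustrative and not taken from any published implementation.

```python
import numpy as np

def ard_se_kernel(X1, X2, sigma2=1.0, rho=None):
    # k(x_j, x_k) = sigma^2 * exp(-sum_q (x_jq - x_kq)^2 / (2 * rho_q))
    Q = X1.shape[1]
    rho = np.ones(Q) if rho is None else np.asarray(rho, dtype=float)
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) / (2.0 * rho)).sum(axis=-1)
    return sigma2 * np.exp(-d2)

rng = np.random.default_rng(0)
N = 50
X = np.linspace(0.0, 1.0, N)[:, None]           # N one-dimensional inputs (Q = 1)
K = ard_se_kernel(X, X, sigma2=1.0, rho=[0.1])  # covariance matrix k(X, X)
# f ~ N(m(X), k(X, X)) with m(X) = 0; a small jitter keeps K numerically PSD
f = rng.multivariate_normal(np.zeros(N), K + 1e-8 * np.eye(N))
```

Smaller values of $\rho_q$ make the function wiggle faster along dimension q, which is why the inverse length scale acts as a relevance measure.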
Because the GP yielding f(x) has various useful properties inherited from the normal distribution, a GP can be used to estimate a nonlinear function f(X) from output data $y = (y_1, \ldots, y_N)^\top$ along the continuous factor X. The extended linear model $y = f(X) + \varepsilon$ is referred to as GP regression and is widely used in the machine learning framework [12]. This model can be used to map dynamic genetic associations for normalized gene expression or other common complex quantitative traits (e.g., human height) along the continuous factor x (e.g., cellular state or donor's age). Letting $g = (g_1, \ldots, g_N)^\top$ denote the genotype vector and R the kinship matrix among N individuals, the mapping model, as proposed by us and others [8, 10], can be expressed as follows:
$$y = \alpha + \beta \odot g + \gamma + \varepsilon,$$
(1)
where
$$\alpha \sim \mathcal{N}(0, K), \quad \beta \sim \mathcal{N}(0, \delta_g K), \quad \gamma \sim \mathcal{N}(0, \delta_d K \odot R)$$
are all GPs with similar covariance matrices, where $\odot$ denotes the element-wise product between two vectors or matrices with the same dimensions, K = k(X, X) denotes the covariance matrix given by a kernel function, and $\varepsilon$ denotes the residuals. Intuitively, $\alpha$ models the average baseline change of y in relation to x, while $\beta$ represents the dynamic genetic effect along x. The effect size $\beta$ is multiplied by the genotype vector g, indicating that the output $y_i$ varies between different genotype groups ($g_i \in \{0, 1, 2\}$). In fact, the effect size $\beta(x_i)$ is additive to the baseline $\alpha(x_i)$ at each $x_i$, as in standard association mapping. Here, statistical hypothesis testing is performed under the null hypothesis $\delta_g = 0$, as the strength of the genetic association is determined by $\delta_g$.
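As a concrete illustration, the following sketch simulates data from model (1) under simplifying assumptions we introduce here (one-dimensional x, unrelated samples so that R = I, and arbitrary variance values); it is not the authors' code.

```python
import numpy as np

def ard_se_kernel(x1, x2, sigma2=1.0, rho=0.1):
    return sigma2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * rho))

rng = np.random.default_rng(1)
N = 200
x = np.sort(rng.uniform(0.0, 1.0, N))          # continuous factor (e.g., cellular state)
g = rng.integers(0, 3, N).astype(float)        # genotypes g_i in {0, 1, 2}
R = np.eye(N)                                  # kinship; identity for unrelated samples
K = ard_se_kernel(x, x) + 1e-8 * np.eye(N)
delta_g, delta_d, sigma2_e = 0.5, 0.3, 0.1
alpha = rng.multivariate_normal(np.zeros(N), K)                # baseline change along x
beta  = rng.multivariate_normal(np.zeros(N), delta_g * K)      # dynamic genetic effect
gamma = rng.multivariate_normal(np.zeros(N), delta_d * K * R)  # donor variation, cov = delta_d K ⊙ R
y = alpha + beta * g + gamma + rng.normal(0.0, np.sqrt(sigma2_e), N)
```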
It is important to note that model (1) includes a correction term $\gamma$ that accounts for the between-donor variation of dynamic changes along x, particularly when multiple data points are measured from the same donor or samples are taken from related donors. This term is essential for the statistical calibration of the genetic effect $\beta$, because other genetic associations scattered over the genome (trans effects) can confound the target genotype effect. Therefore, to adjust for this confounding effect, we need to include the extra GP $\gamma$, which is drawn from a normal distribution with the covariance matrix K multiplied element-wise by the kinship matrix R.
Here, the kinship matrix is estimated by $\hat{R} = \sum_{l=1}^{L} \tilde{g}_l \tilde{g}_l^\top / L$ using genome-wide variants $g_l$ ($l = 1, \ldots, L$), where $\tilde{g}_l$ is a standardized genotype vector (centered and scaled) based on the allele frequency at genetic variant l, and L denotes the total number of variants across the genome [6]. The matrix is initially an $N \times N$ dense matrix, but it can be simplified if donors are (sufficiently) unrelated. Let us introduce a design matrix of donor configuration, $Z \in \mathbb{R}^{N \times N_d}$, for the $N_d$ donors (i.e., $z_{ij} = 1$ if sample i is taken from donor j, and $z_{ij} = 0$ otherwise); the kinship matrix can then be approximated as $R = ZZ^\top$. Thus, $\gamma$ can be expressed as a linear combination of $N_d$ independent GPs $\{\gamma_j \sim \mathcal{N}(0, \delta_d K);\ j = 1, \ldots, N_d\}$, such that $\gamma = \sum_{j=1}^{N_d} \gamma_j \odot z_j$, where $z_j$ denotes the jth column vector of Z. This approximation is particularly useful for parameter estimation with large $N_d$ (as discussed in section 2.4).
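Both constructions are easy to reproduce numerically. The sketch below simulates genotypes to form $\hat{R}$ and builds the donor design approximation $ZZ^\top$; the data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, N_d = 100, 5000, 20
maf = rng.uniform(0.05, 0.5, L)                       # allele frequencies of L variants
G = rng.binomial(2, maf, size=(N, L)).astype(float)   # N x L genotype matrix
G_std = (G - 2 * maf) / np.sqrt(2 * maf * (1 - maf))  # centered and scaled per variant
R_hat = G_std @ G_std.T / L                           # dense N x N kinship estimate

donor = rng.integers(0, N_d, N)                       # donor label of each sample
Z = np.zeros((N, N_d))
Z[np.arange(N), donor] = 1.0                          # z_ij = 1 iff sample i is from donor j
R_approx = Z @ Z.T                                    # low-rank kinship for unrelated donors
```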
When the sample size N is large, an ordinary GP faces a severe scalability issue: the dense matrix K has dimension $N \times N$, resulting in a total computational cost of $\mathcal{O}(N^3)$. As a result, the application of GPs in the GWAS field has been hindered, as sample sizes often reach a million these days. However, there are several alternatives that approximate the full GP model, including the Nyström approximation (a low-rank approximation), the Projected Process approximation [13], the Sparse Pseudo-inputs GP [14], the Fully Independent Training Conditional approximation, and the Variational Free Energy approximation [15]. In this section, we introduce the sparse GP approximation proposed by [16].
The sparse GP is a scalable model using the technique of inducing points [14]. Since the computational cost of the sparse GP is $\mathcal{O}(NM^2)$ with M inducing points, we can greatly reduce the computational cost, which is essentially linear in N under the assumption $M \ll N$. Let us denote the M inducing points by $T = (t_1, \ldots, t_M)^\top$ and the corresponding GP values by $u = (u(t_1), \ldots, u(t_M))^\top$; the joint distribution of f and u is then a multivariate normal distribution. Therefore, a lower bound of the conditional distribution p(y|u) can be written as
$$\log p(y|u) = \log \int p(y|f)\,p(f|u)\,df \ge \int \left[\log p(y|f)\right] p(f|u)\,df = \log \mathcal{N}(y|\bar{f}, \sigma^2 I) - \frac{1}{2\sigma^2}\mathrm{tr}\{\tilde{K}_{NN}\} \equiv \mathcal{L}_1,$$
where
$$\bar{f} = K_{NM} K_{MM}^{-1} u, \quad \tilde{K}_{NN} = K_{NN} - K_{NM} K_{MM}^{-1} K_{MN},$$
and
$$K_{NN} = k(X, X), \quad K_{NM} = k(X, T), \quad K_{MM} = k(T, T).$$
Therefore, the marginal distribution of the output y is approximated by
$$p(y) = \int p(y|u)\,p(u)\,du \ge \int \exp\{\mathcal{L}_1\}\,p(u)\,du = \exp\left\{\log \mathcal{N}(y|0, V) - \frac{1}{2\sigma^2}\mathrm{tr}\{\tilde{K}_{NN}\}\right\} \equiv \exp\{\mathcal{L}_2\},$$
where $V = \sigma^2 I + K_{NM} K_{MM}^{-1} K_{MN}$. The lower bound $\mathcal{L}_2$ is referred to as the Titsias bound and can be used for parameter estimation as well as statistical hypothesis testing.
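The bound is straightforward to evaluate numerically. Below is a minimal sketch of $\mathcal{L}_2$, assuming a one-dimensional ARD-SE kernel with fixed hyperparameters; the function names and data are ours, not from the cited work.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ard_se_kernel(x1, x2, sigma2=1.0, rho=0.1):
    return sigma2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * rho))

def titsias_bound(y, x, T, rho=0.1, sigma2_e=0.1):
    K_NN = ard_se_kernel(x, x, rho=rho)
    K_NM = ard_se_kernel(x, T, rho=rho)
    K_MM = ard_se_kernel(T, T, rho=rho) + 1e-8 * np.eye(len(T))
    Q_NN = K_NM @ np.linalg.solve(K_MM, K_NM.T)   # K_NM K_MM^{-1} K_MN (rank <= M)
    V = sigma2_e * np.eye(len(x)) + Q_NN          # V = sigma^2 I + K_NM K_MM^{-1} K_MN
    trace_term = np.trace(K_NN - Q_NN) / (2.0 * sigma2_e)
    return multivariate_normal.logpdf(y, mean=np.zeros(len(x)), cov=V) - trace_term

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 500))
y = np.sin(4.0 * x) + rng.normal(0.0, 0.3, 500)
T = np.quantile(x, np.linspace(0.05, 0.95, 10))   # M = 10 inducing points
L2 = titsias_bound(y, x, T)
```

The trace term penalizes inducing point configurations that reconstruct $K_{NN}$ poorly, which is what makes the bound usable for choosing T.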
Selecting the number of inducing points M and their coordinates is crucial for accurately approximating a GP. Although a larger M provides a better approximation, increasing M is not feasible when N reaches hundreds of thousands, as in large-scale genetic association studies. Additionally, the accuracy of the approximation is influenced by the complexity of the nonlinearity of y and the dimension Q of the input points x. There are a few approaches for inferring an optimal value of M from data [17], but the example used in that study is too small (48 genes × 437 samples) to extrapolate to real-world data. It is worth noting, however, that the optimal coordinates of the inducing points for a fixed M can easily be learned from data, as described in the next section.
Genetic association mapping involves performing tens of millions of hypothesis tests. Therefore, it is almost impossible to estimate the parameters of the GPs for each pair of trait and variant across the genome, even with the sparse approximation described in the last subsection. Furthermore, both the baseline $\alpha$ and the correction term $\gamma$ share the characteristic length parameter $\rho = (\rho_1, \ldots, \rho_Q)^\top$ and the inducing points T. This can lead to unstable optimization and prolonged parameter estimation. To address this issue, we have previously proposed a three-step parameter estimation strategy for performing the statistical hypothesis testing [10]. In particular, optimizing with respect to $\rho$ using a quasi-Newton approach (such as the BFGS method) is sufficient in the first step, because the variance explained by $\gamma$ is typically much smaller than that explained by $\alpha$. The three steps are:
1. $y = \alpha + \varepsilon$ (baseline model: $H_0$), to estimate $\rho$ and T.
2. $y = \alpha + \gamma + \varepsilon$ (baseline model: $H_1$), to estimate the variance parameters $\delta_d$ and $\sigma^2$. Here, $\hat{\rho}$ and $\hat{T}$ estimated in $H_0$ are plugged into $H_1$.
3. $y = \alpha + \beta \odot g + \gamma + \varepsilon$ (full model: $H_2$), to test whether $\delta_g = 0$. Here, $\{\hat{\rho}, \hat{T}, \hat{\delta}_d, \hat{\sigma}^2\}$ estimated in $H_0$ and $H_1$ are used.
Here the Titsias bounds for these models are given by
$$\mathcal{L}_2^h = \begin{cases} \log \mathcal{N}(y|0, V) - \frac{1}{2\sigma^2}\mathrm{tr}\{\tilde{K}_{NN}\}, & h = H_0,\\ \log \mathcal{N}(y|0, V_d) - \frac{1}{2\sigma^2}\mathrm{tr}\{(1+\delta_d)\tilde{K}_{NN}\}, & h = H_1,\\ \log \mathcal{N}(y|0, V_g) - \frac{1}{2\sigma^2}\mathrm{tr}\{(1+\delta_d)\tilde{K}_{NN} + \delta_g G \tilde{K}_{NN} G\}, & h = H_2, \end{cases}$$
where
$$V_d = V + \delta_d (K_{NM} K_{MM}^{-1} K_{MN}) \odot R, \quad V_g = V_d + \delta_g G K_{NM} K_{MM}^{-1} K_{MN} G,$$
and G = diag(g) denotes the diagonal matrix whose diagonal elements are given by the elements of g. The estimators $\hat{\rho}$ and $\hat{T}$ are obtained by maximizing $\mathcal{L}_2^{H_0}$ with respect to $\rho$ and T, and $\hat{\delta}_d$ and $\hat{\sigma}^2$ are obtained by maximizing $\mathcal{L}_2^{H_1}$ with respect to $\delta_d$ and $\sigma^2$ given $\hat{\rho}$ and $\hat{T}$.
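As an illustration of the first step, the sketch below maximizes $\mathcal{L}_2^{H_0}$ with respect to $\rho$ and T using L-BFGS from SciPy. This is our own minimal reconstruction under simplifying assumptions (unit kernel variance, fixed $\sigma^2$, one-dimensional x), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def kern(x1, x2, rho):
    # ARD-SE kernel in one dimension with unit variance (sigma^2 = 1 for brevity)
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * rho))

def neg_L2_H0(params, y, x, sigma2_e=0.1):
    rho, T = np.exp(params[0]), params[1:]       # log-parameterization keeps rho > 0
    K_MM = kern(T, T, rho) + 1e-8 * np.eye(len(T))
    K_NM = kern(x, T, rho)
    Q = K_NM @ np.linalg.solve(K_MM, K_NM.T)
    V = sigma2_e * np.eye(len(x)) + Q
    ll = multivariate_normal.logpdf(y, mean=np.zeros(len(x)), cov=V)
    return -(ll - np.trace(kern(x, x, rho) - Q) / (2.0 * sigma2_e))

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(4.0 * x) + rng.normal(0.0, 0.3, 200)
M = 10
init = np.concatenate([[np.log(0.1)], np.quantile(x, np.linspace(0.05, 0.95, M))])
fit = minimize(neg_L2_H0, init, args=(y, x), method="L-BFGS-B")
rho_hat, T_hat = np.exp(fit.x[0]), fit.x[1:]     # plugged into H1 and H2
```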
It is worth noting that, when the kinship matrix can be expressed as $R = ZZ^\top$ with a lower-rank matrix $Z = (z_1, \ldots, z_{N_d})$ with $N_d < N$, the covariance matrix $V_d$ can be written as

$$V_d = V + \delta_d \left(K_{NM} K_{MM}^{-1} K_{MN}\right) \odot (ZZ^\top) = \sigma^2 I + A B^{-1} A^\top,$$

where

$$A = (C, \mathrm{diag}(z_1) C, \ldots, \mathrm{diag}(z_{N_d}) C), \quad B = \mathrm{diag}(K_{MM}, \delta_d K_{MM}, \ldots, \delta_d K_{MM}),$$

and $C = K_{NM} K_{MM}^{-1}$; here B becomes an $M(N_d+1) \times M(N_d+1)$ block-diagonal matrix. Since the computational complexity of $H_1$ or $H_2$ is $\mathcal{O}(N_d^2 M^2 N)$, for large $N_d$ such that $MN_d > N$ the total complexity exceeds $\mathcal{O}(N^3)$, and we again face the scalability issue. However, if the donors in the data are unrelated, we can significantly reduce the memory usage and the computational burden to $\mathcal{O}(N_d M^2 N)$. This is because the matrix A becomes sparse, with $z_i^\top z_{i'} = 0$ for $i \ne i'$, so that $NM(N_d - 1)$ elements out of $NMN_d$ are 0. Additionally, the non-zero elements of A are repeated and identical to the elements of C, and the block-diagonal element of B is essentially $K_{MM}^{-1}$.

To perform GWAS with a GP, it is crucial to reduce the computational time required to map the genetic association of each variant. The Score statistic for testing $\delta_g = 0$ can be computed from the first derivative of $\mathcal{L}_2^{H_2}$ with respect to $\delta_g$, and the variance parameters $\{\hat{\sigma}^2, \hat{\delta}_d\}$ of $V_d$ are estimated from $\mathcal{L}_2^{H_1}$ only once for all the variants to be tested. Therefore, this approach is ideal for testing tens of millions of variants independently. Using the fact that the first derivative of $V_g^{-1}$ at $\delta_g = 0$ depends only on $V_d$, such that

$$\left.\frac{\partial V_g^{-1}}{\partial \delta_g}\right\vert_{\delta_g = 0} = -V_d^{-1} G K_{NM} K_{MM}^{-1} K_{MN} G V_d^{-1},$$

the Score statistic can be written explicitly as

$$S = y^\top \hat{V}_d^{-1} G K_{NM} K_{MM}^{-1} K_{MN} G \hat{V}_d^{-1} y,$$

(2)

whose distribution is a generalized $\chi^2$ distribution, that is, the distribution of a weighted sum of M independent $\chi^2$ statistics, $\sum_{m=1}^{M} \lambda_m \chi_m^2$ [8, 10]. It is known that the weights $\lambda_m$ ($m = 1, \ldots, M$) are given by the non-negative eigenvalues of

$$K_{MM}^{-1/2} K_{MN} G \hat{V}_d^{-1} G K_{NM} K_{MM}^{-\top/2},$$

where $K_{MM}^{-1/2}$ can be computed using the Cholesky decomposition $K_{MM} = K_{MM}^{\top/2} K_{MM}^{1/2}$. To compute the p-value from S, we can use the Davies exact method, implemented in the CompQuadForm package in R. Note that, if we use a linear kernel, S can be simplified as described in [8]. Although the Score-based approach is an easy and quick solution for genome-wide mapping, to check the asymptotic behavior and statistical calibration of the Score statistics, we should use a QQ-plot to verify that the p-values obtained from multiple variants follow a uniform distribution under the null hypothesis.
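A compact numerical sketch of (2) follows, with $K_{NM}$, $K_{MM}$ evaluated at the fitted $\{\hat{\rho}, \hat{T}\}$ and $V_d$ at $\{\hat{\sigma}^2, \hat{\delta}_d\}$. Since the Davies method lives in R's CompQuadForm, the p-value here is instead approximated by Monte Carlo draws from the weighted $\chi^2$ sum; all names are ours.

```python
import numpy as np

def score_test(y, g, K_NM, K_MM, Vd, n_sim=100_000, seed=0):
    G = np.diag(g)
    Vd_inv_y = np.linalg.solve(Vd, y)
    P = K_NM @ np.linalg.solve(K_MM, K_NM.T)       # K_NM K_MM^{-1} K_MN
    S = Vd_inv_y @ G @ P @ G @ Vd_inv_y            # Score statistic (2)
    # Weights: eigenvalues of K_MM^{-1/2} K_MN G Vd^{-1} G K_NM K_MM^{-T/2}
    Lc = np.linalg.cholesky(K_MM)                  # K_MM = Lc Lc^T
    B = np.linalg.solve(Lc, K_NM.T @ G)            # Lc^{-1} K_MN G, an M x N matrix
    lam = np.linalg.eigvalsh(B @ np.linalg.solve(Vd, B.T))
    lam = lam[lam > 1e-12]                         # keep non-negative eigenvalues
    # Monte Carlo null: weighted sum of independent 1-df chi-square draws
    rng = np.random.default_rng(seed)
    null = rng.chisquare(1.0, size=(n_sim, lam.size)) @ lam
    return S, (null >= S).mean()
```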
If a colocalisation analysis [18] or a Bayesian hierarchical model [19] is considered as a downstream analysis using the test statistics, a Bayes factor can also be computed from the Titsias bounds, such as

$$\log(BF) = \mathcal{L}_2^{H_2} - \mathcal{L}_2^{H_1}.$$

Here, we would use empirical values $\delta_g = \{0.01, 0.1, 0.5\}$ to average the Bayes factor, instead of integrating out $\delta_g$ from $\mathcal{L}_2^{H_2}$ [20].

In real genetic association mapping, most genetic associations are in fact static and ubiquitous over the factor x. To capture such static associations, we can consider the following model:

$$y = \alpha_0 1_N + \alpha + \beta_0 g + \beta \odot g + \gamma_0 + \gamma + \varepsilon,$$

where $\alpha_0$ denotes the intercept, $1_N$ denotes the N-dimensional vector of all 1s, $\beta_0$ denotes the effect size of the static genetic association, and $\gamma_0 \sim \mathcal{N}(0, \sigma^2 \delta_{d0} R)$ denotes the donor variation that confounds $\beta_0$. For instance, in [8], the static genetic association $\beta_0$ is modeled as a fixed effect, and the dynamic effect $\beta$ is tested using the Score statistic. In [10], on the other hand, the authors modeled both the static and dynamic associations as random effects to test them via a Bayes factor. In this case, the covariance matrix K can be rewritten as

$$K^* = \sigma^2 e^{-\rho_0} 1_N 1_N^\top + K$$

to estimate the model parameters in (1), and then the null hypothesis $\delta_g = 0$ for $\beta$ is tested. Note here that the kernel parameter $\rho_0$ is not necessarily common and shared across $\alpha$, $\beta$, and $\gamma$. Indeed, in [10], the authors estimated $\hat{\rho}_0^\alpha$ and $\hat{\rho}_0^\gamma$ independently in $\mathcal{L}_2^{H_1}$. To compute the Score statistic, the authors assumed $\hat{\rho}_0^\beta = \hat{\rho}_0^\gamma$ for $\beta$ and $\gamma$, because the ratio of the static effect to the dynamic effect can be the same for cis and trans genetic effects.

In longitudinal studies, the factor x is typically observed explicitly (e.g., donor's age or the physical locations where samples were taken). This makes it straightforward to perform genetic association mapping along x using the Score statistics or Bayes factors described above. However, this is often not the case in molecular studies, and we therefore need to estimate the underlying biological state from the data. In single-cell biology, the hidden cellular state x is often referred to as "pseudotime", and principal component analysis is normally used to estimate it as part of dimension reduction [21]. The Gaussian process latent variable model (GPLVM) is a strong alternative for extracting pseudotime when the molecular phenotype changes gradually and nonlinearly along pseudotime x [22, 23]. We have also proposed a GPLVM that uses the baseline model $H_0$ to estimate the latent variable X from single-cell RNA-seq data (see Section 3 for more details). Let $Y = (y_1, \ldots, y_J)$ be the gene expression matrix of J genes, whose jth column is the expression vector of gene j; the Titsias lower bound of the GPLVM based on the baseline model $H_0$ can be written as

$$\log p(Y|X) \ge \log \mathcal{MN}(Y|0, \Sigma, I + K_{NM} K_{MM}^{-1} K_{MN}) - \frac{J}{2}\mathrm{tr}\{\tilde{K}_{NN}\} = \mathcal{L}_2.$$

To obtain the optimal cellular state $\hat{X}$, this lower bound can be maximized with respect to $\{\rho, X, T, \Sigma\}$ [10, 24].
Here, $\Sigma = \mathrm{diag}(\sigma_j^2;\ j = 1, \ldots, J)$ denotes the residual variance parameters of the J genes, and $\mathcal{MN}(\cdot)$ denotes the matrix normal distribution. For the identifiability of the model parameters, the variance parameter in the kernel function is set to $\sigma^2 = 1$. In addition, to maintain the uniqueness of the latent variable estimation, a prior probability on X is required. It is quite common to assume independent standard normal distributions for each element of X, i.e., $X \sim \mathcal{MN}(0, I, I)$ [24], although there are multiple alternatives depending on the nature of the modeled data [10, 23]. For parameter estimation, the limited-memory BFGS method can be used to implement the GPLVM for large N. In addition, the stochastic variational Bayes approach can be used to fit the GPLVM to larger data sets while reducing the fitting time [25,26,27].

For a non-Gaussian output y, the Titsias bound $\mathcal{L}_2$ is not analytically available. However, for the Poisson distribution, a lower bound of the conditional probability p(y|u) can be computed as follows:

$$\mathcal{L}_1 = \sum_i \left[-\log(y_i!) + y_i \bar{f}_i - \exp\left(\bar{f}_i + \frac{\tilde{k}_{ii}}{2}\right)\right],$$

where $\tilde{k}_{ii}$ denotes the ith diagonal element of $\tilde{K}_{NN}$. Let $\nu_i$ and $w_i$ be the working response and the iterative weight of the GLM for the ith sample, such that

$$\nu_i = \bar{f}_i + (y_i - w_i)/w_i \quad \mathrm{and} \quad w_i = \exp\left(\bar{f}_i + \frac{\tilde{k}_{ii}}{2}\right)$$

for $i = 1, \ldots, N$; then the optimal $\hat{u}$ that maximizes $\exp\{\mathcal{L}_1\} p(u)$ satisfies

$$\left(K_{MM}^{-1} + K_{MM}^{-1} K_{MN} W K_{NM} K_{MM}^{-1}\right) u = K_{MM}^{-1} K_{MN} W \nu,$$

(3)

where $W = \mathrm{diag}(w_i;\ i = 1, \ldots, N)$, which suggests

$$\nu | u \sim \mathcal{N}(\bar{f}, W^{-1})$$

as described elsewhere [28]. Therefore, we can maximize

$$\mathcal{L}_2 = \log \mathcal{N}(\nu | 0, W^{-1} + K_{NM} K_{MM}^{-1} K_{MN})$$

with respect to $\{\sigma^2, \rho\}$, where $u = \hat{u}$ is iteratively updated as in (3) (a minimal sketch of this update appears at the end of this section). Thus, to obtain the Score statistic for a non-Gaussian y, we replace $y = \nu$ and $\hat{V}_d = W^{-1} + A B^{-1} A^\top$ in (2).

For a binary output y, the situation is more complicated than in the Poisson case, because it is not even possible to analytically compute the $\mathcal{L}_1$ bound with the logit or probit link function. For the logit link function, several useful alternatives to the $\mathcal{L}_1$ bound have been proposed [29]. For the probit link function, an approximation of $\mathcal{L}_1$ using Gauss-Hermite quadrature has been proposed [30]. However, in both cases, the computational cost is much higher than in the Poisson case, and it is rather impractical to conduct large genome-wide association mapping at this moment.
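For concreteness, here is a minimal sketch of the iterative update (3) for the Poisson case, in the same illustrative notation as the earlier sketches; $C = K_{NM} K_{MM}^{-1}$ is assumed precomputed, and this is our reconstruction rather than the authors' code.

```python
import numpy as np

def poisson_update(u, y, C, K_MM, k_tilde_diag):
    """One update of u per equation (3); C = K_NM @ inv(K_MM)."""
    f_bar = C @ u                                  # f_bar = K_NM K_MM^{-1} u
    w = np.exp(f_bar + 0.5 * k_tilde_diag)         # iterative weights w_i
    nu = f_bar + (y - w) / w                       # working responses nu_i
    # (K_MM^{-1} + C^T W C) u = C^T W nu, since C^T W C = K_MM^{-1} K_MN W K_NM K_MM^{-1}
    lhs = np.linalg.inv(K_MM) + C.T @ (w[:, None] * C)
    rhs = C.T @ (w * nu)
    return np.linalg.solve(lhs, rhs), nu, w

# Iterate until u converges; then use y = nu and V_d = W^{-1} + A B^{-1} A^T in (2).
```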