Chapter 4 Association testing
In this assignment, we will use the kinship matrix and principal components derived from our dataset to fit a generalized linear mixed model using the method of restricted maximum likelihood [15, 16], in order to test the association of each polymorphism with the disease phenotype.
4.1 Statistics
4.1.1 The odds ratio
The logic of any genetic association study is to see whether an allele is enriched in subjects affected with disease. In other words, we want to see if the allele is associated with disease. We do this by collecting many individuals with disease and a similar number of healthy controls from the general population. An individual’s risk is its inherited (unobserved) probability \(p=P\left(\text{Disease}\mid\text{Allele}\right)\) of developing disease over its lifetime [17], and the relative risk or RR is the ratio of the risk for carriers to the risk for non-carriers of the allele. However, the RR is difficult to measure prospectively because it involves waiting a long time for a potentially small number of cases to develop. If we retrospectively select cases that have already developed and match them with healthy controls, we can instead calculate an individual’s probability \(P\left(\text{Allele}\mid\text{Disease}\right)\) of carrying the allele conditioned on its disease status. Related to the probability \(p\) of an event is the odds \(\frac{p}{1-p}\) of the event, and it turns out that the odds ratio \[\begin{equation} \text{OR}=\frac{\frac{P\left(\text{Allele}\mid\text{Cases}\right)}{1-P\left(\text{Allele}\mid\text{Cases}\right)}}{\frac{P\left(\text{Allele}\mid\text{Controls}\right)}{1-P\left(\text{Allele}\mid\text{Controls}\right)}} \tag{4.1} \end{equation}\]is invariant to whether we collect subjects prospectively or retrospectively. Hence we use the OR as a convenient measure of the effect of an allele on disease risk in case-control studies. However, the crude measure (4.1) cannot be adjusted for other factors, such as sex, age, and genetic ancestry, which may also affect disease risk. For this type of analysis, we need logistic regression.
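As a concrete numerical illustration (with made-up counts, not the assignment data), the crude OR in (4.1) can be computed directly from allele counts in cases and controls:
n_case_alleles <- 1000; n_control_alleles <- 1000 # hypothetical total allele counts (2 x number of subjects per group)
risk_in_cases <- 300; risk_in_controls <- 200 # hypothetical risk-allele counts
p_case <- risk_in_cases / n_case_alleles # P(Allele | Cases)
p_control <- risk_in_controls / n_control_alleles # P(Allele | Controls)
(p_case / (1 - p_case)) / (p_control / (1 - p_control)) # crude OR, Eq. (4.1); about 1.71 here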
4.1.2 The logistic model of disease risk
Normally we test the association between two variables using linear regression, but doing so requires that the dependent variable, at least, be continuous. In genetic association studies, neither the outcome (disease) nor the predictor (number of risk alleles) is continuous. However, an individual’s unobserved propensity \(p\) is a continuous variable that ranges between 0 and 1; furthermore, the individual’s log-odds \(\log{\frac{p}{1-p}}\) of disease is a continuous value that ranges between \(-\infty\) and \(+\infty\). Thus if we assume that the log-odds of disease can be represented by the equation \[\begin{equation} \log{\frac{p}{1-p}}=\beta_0+\beta_1X_1, \tag{4.2} \end{equation}\] where \(X_1\) is the number of risk alleles and \(\beta_1\) is the log-odds-ratio, then we can in principle fit a line and estimate its slope. This slope \(\beta_1\) would then be interpreted as the additive increase in the log-odds of disease per risk allele, or equivalently \(e^{\beta_1}\) as the multiplicative increase in the odds.
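For example, under (4.2) a subject carrying one risk allele (\(X_1=1\)) has odds \(e^{\beta_0+\beta_1}\) of disease while a non-carrier (\(X_1=0\)) has odds \(e^{\beta_0}\), so the ratio of their odds is \(e^{\beta_1}\); a carrier of two risk alleles has odds \(e^{\beta_0+2\beta_1}\), i.e., each additional risk allele multiplies the odds by \(e^{\beta_1}\).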
Now, since we cannot observe \(p_i\) for each subject \(i\), we cannot actually fit (4.2) using linear regression. We can, however, use the concept of maximum likelihood. In statistics, the likelihood of a disease model like (4.2) is the probability of the data being generated by the model, or \[\begin{equation} \mathcal{L}\left(\text{Model}\mid\text{Data}\right)=P\left(\text{Data}\mid\text{Model}\right). \tag{4.3} \end{equation}\]Here the model is the set of parameters \(\beta_0,\beta_1\) required to predict disease risk, and we can estimate the parameters by maximizing (4.3), i.e., by finding the model most consistent with the data. Furthermore, if the likelihood function is approximately normal in shape, then we can estimate the statistical significance of our estimates by computing their standard errors, obtained from the curvature (second derivative) of the log-likelihood near the values of \(\beta\) that maximize it.
For a reasonably large sample the binomial likelihood is well approximated by a normal one, so if we model the likelihood of the observed data as \[\begin{equation} \mathcal{L}=\prod_ip_i^{y_i}\left(1-p_i\right)^{1-y_i}, \tag{4.4} \end{equation}\] where \(y_i=0,1\) is an indicator of disease status, then the log-likelihood is \[\begin{align} \ell&=\sum_iy_i\log{p_i}+\left(1-y_i\right)\log{\left(1-p_i\right)}\\&=\sum_iy_i\log{\frac{p_i}{1-p_i}}+\log{\left(1-p_i\right)}, \tag{4.5} \end{align}\]or, using (4.2), \[\begin{equation} \ell=\sum_iy_i\left(\beta_0+\beta_1X_{i1}\right)-\log{\left(1+e^{\beta_0+\beta_1X_{i1}}\right)}. \tag{4.6} \end{equation}\]Eq. (4.6) is a function of the parameter \(\beta_1\). Thus we can find the maximum-likelihood estimate \(\hat{\beta}_1\) of \(\beta_1\) by solving \(\frac{\partial \ell}{\partial \beta_1}\bigr\rvert_{\beta_1=\hat{\beta}_1}=0\) for \(\hat{\beta}_1\), and we can obtain its standard error \(\left(-\frac{\partial^2\ell}{\partial \beta_1^2}\bigr\rvert_{\beta_1=\hat{\beta}_1}\right)^{-1/2}\) from the curvature of the log-likelihood at the best estimate of \(\beta_1\). From these we can also get an estimate of the statistical significance.
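In practice this maximization is carried out numerically. As a minimal, self-contained illustration (using simulated data, not the assignment dataset), R’s glm() function fits exactly the model (4.2) by maximum likelihood and reports the estimate of \(\beta_1\), its standard error, and a p-value:
set.seed(1) # for reproducibility of this toy example
n <- 1000 # number of simulated subjects
x <- rbinom(n, size = 2, prob = 0.3) # number of risk alleles (0, 1, or 2)
p <- exp(-0.5 + 0.4 * x) / (1 + exp(-0.5 + 0.4 * x)) # true beta0 = -0.5, beta1 = 0.4
y <- rbinom(n, size = 1, prob = p) # simulated disease status
fit <- glm(y ~ x, family = binomial) # maximum-likelihood fit of Eq. (4.2)
summary(fit)$coefficients["x", ] # estimate of beta1, standard error, z value, p-value
exp(coef(fit)["x"]) # the corresponding odds ratio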
4.1.2.1 Simulating phenotypes
The PhenotypeSimulator package [11] used to generate the phenotype data for this assignment produces a vector \(\theta\) of log-odds for each individual based on its genotype, given an input list of risk SNPs. To go from the log-odds scale to the binary disease scale, we can rearrange the equation \(\log\frac{p}{1-p}=\theta\) to generate a vector \[\begin{equation} \mathbb{P}\left(\text{Disease}\mid \theta\right)=\frac{e^{\theta}}{1+e^{\theta}} \tag{4.7} \end{equation}\]of disease probabilities. In a balanced case-control study, the probability of disease should be approximately 50%, corresponding to a mean log-odds of \(\overline{\theta}=0\). If this condition does not hold, we can recenter the data so that \[\begin{equation} \mathbb{P}\left(\text{Disease}\mid \theta\right)=\frac{e^{\theta -\overline{\theta}}}{1+e^{\theta-\overline{\theta}}}. \tag{4.8} \end{equation}\]This is always possible (in simulated data) because the intercept term \(\beta_0\) in Eq. (4.2) can be shifted without changing the log-odds-ratio \(\beta_1\). Disease phenotypes can then be generated by drawing \(N\) random numbers \(t_i\) uniformly from the interval \(\left[0,1\right]\) and calling everyone a case whose probability \(p_i\geq t_i\).
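A minimal sketch of this thresholding step, assuming theta is the vector of log-odds returned by the simulator (here replaced with random placeholder values), might look like:
theta <- rnorm(1000, mean = 2, sd = 1) # placeholder log-odds; in practice these come from the simulator
p <- exp(theta - mean(theta)) / (1 + exp(theta - mean(theta))) # recentred disease probabilities, Eq. (4.8)
thresh <- runif(length(p)) # uniform thresholds t_i on [0, 1]
pheno <- as.integer(p >= thresh) # 1 = case, 0 = control
table(pheno) # roughly balanced by construction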
4.1.3 Linear mixed models
In genome-wide association studies (GWAS) we want to estimate the odds ratio \(e^{\beta_1}\) for each SNP to see if any SNPs are associated with disease. However, there are millions of SNPs and only thousands of subjects, so we cannot fit all of the SNP effects simultaneously. Instead, one SNP effect \(\beta_1\) is fit at a time, together with the other fixed covariate effects \(\beta_j\), against a background of the composite, random effects of all the remaining SNPs, so that our model has two components: \[\begin{equation} Y_i=\log{\frac{p_i}{1-p_i}}=\sum_jX_{ij}\beta_j+\sum_jZ_{ij}u_j+\varepsilon_i. \tag{4.9} \end{equation}\]Here, \(\mathbf{Z}\) is an \(n\times m\) matrix of the (standardized) genotypes of \(n\) individuals at \(m\) SNPs and \(u_j\) is the random effect of SNP \(j\) on the log-odds. The mixed-model framework assumes the \(u_j\) are drawn from a normal distribution with mean 0 and standard deviation \(\sigma\), so that each variant has but a small effect on disease risk. The variance components of this model are estimated by restricted maximum likelihood (REML) [15, 16].
Now, the covariance of the log-odds \(Y\) about its mean \(\mathbf{X}\beta\) becomes \[\begin{align} \mathbb{E}\left[\left(Y-\mathbf{X}\beta\right)\left(Y-\mathbf{X}\beta\right)^T\right]&=\mathbf{Z}\,\mathbb{E}\left[uu^T\right]\mathbf{Z}^T+\mathbb{E}\left[\varepsilon\varepsilon^T\right]\\ &=\left(\mathbf{Z}\mathbf{Z}^T+\mathbf{I}\right)\sigma^2=\mathbf{V}, \tag{4.10} \end{align}\]assuming for simplicity that the random SNP effects and the residuals share the same variance \(\sigma^2\). Differentiating the generalized least-squares criterion \(\left(Y-\mathbf{X}\beta\right)^T\mathbf{V}^{-1}\left(Y-\mathbf{X}\beta\right)\) with respect to \(\beta_k\) and setting the result to zero obtains (with summation over \(i\), \(j\), and \(l\)) \[\begin{align} X_{ik}V_{il}^{-1}\left(Y_l-X_{lj}\beta_j\right)+\left(Y_i-X_{ij}\beta_j\right)V_{il}^{-1}X_{lk}&=0\\ X^T_{ki}V_{il}^{-1}Y_l&=X^T_{ki}V_{il}^{-1}X_{lj}\beta_j, \tag{4.11} \end{align}\]since \(\mathbf{V}\) (and hence \(\mathbf{V}^{-1}\)) is symmetric, so the two terms in the first line are equal. In matrix form, and after cancelling the common factor \(\sigma^2\) from \(\mathbf{V}\), we get the maximum-likelihood (generalized least-squares) solution \[\begin{equation} \mathbf{X}^T\left(\mathbf{I}+\mathbf{Z}\mathbf{Z}^T\right)^{-1}\mathbf{X}\hat{\beta}=\mathbf{X}^T\left(\mathbf{I}+\mathbf{Z}\mathbf{Z}^T\right)^{-1}Y, \tag{4.12} \end{equation}\]where \(\frac{1}{m}\mathbf{Z}\mathbf{Z}^T\) is the genomic relationship matrix (GRM) we used to compute principal components in Week 1. Thus we can estimate the fixed effects \(\beta\), including the SNP effect \(\beta_1\), and their standard errors without fitting the random effects of every other SNP simultaneously.
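To see the algebra of Eq. (4.12) in action, here is a toy generalized least-squares solve in R; every input (n, m, Z, X, Y) is a simulated placeholder invented for illustration, and \(\mathbf{V}\) is taken as \(\mathbf{I}+\mathbf{Z}\mathbf{Z}^T\) because the factor \(\sigma^2\) cancels in (4.12):
set.seed(2) # toy example only
n <- 200; m <- 500 # invented numbers of subjects and SNPs
Z <- matrix(rnorm(n * m), n, m) # placeholder for standardized genotypes
X <- cbind(1, rbinom(n, 2, 0.3)) # design matrix: intercept and one test SNP
Y <- X %*% c(0, 0.5) + Z %*% rnorm(m, 0, 0.05) + rnorm(n) # simulated log-odds with a polygenic background
V <- diag(n) + Z %*% t(Z) # I + ZZ^T; sigma^2 cancels in Eq. (4.12)
beta_hat <- solve(t(X) %*% solve(V, X), t(X) %*% solve(V, Y)) # generalized least-squares estimate of beta
beta_hat # estimates of the intercept and SNP effect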
4.2 Practice
4.2.1 Get phenotypes
The first item we shall need when associating SNPs with disease risk is a phenotypes file. Download and save this file to the same location as your work from Week 2, along with the (unzipped) VCF file.
Read in the phenotypes file and rename the columns using
fam <- read.table("/path/to/file/EUR_BCa.fam",header = FALSE) # import
colnames(fam) <- c("FID","IID","PID","MID","Sex","Phenotype") # name the columns (PLINK order: father id before mother id)
This file is in PLINK format: its first column contains the family id of the individual whose id is in the second column; if the samples are unrelated, these two columns will be the same. Similarly, the paternal (PID) and maternal (MID) ids are the ids of the father and mother of the sample, should they happen to be in the dataset. Sex is left unspecified in this analysis. The most important column for our purposes is the Phenotype column, which contains a 1 for controls and a 2 for cases.
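A quick sanity check at this point is to tabulate the phenotype column and confirm that the numbers of cases and controls are roughly balanced:
table(fam$Phenotype) # counts of controls (1) and cases (2)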
Complete the steps KING, PC-AiR, and PC-Relate from Week 2. At the end of this analysis, you should have the results mypcair and mypcrel of PC-AiR and PC-Relate, an updated GRM, an open GenotypeData object called genoData, and a GenotypeBlockIterator object called genoData.iterator. This last will iterate over each of the pruned SNPs included in your model as you test each one for association with the disease phenotype.
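If you need a reminder of how the Week 2 objects feed into this assignment, a minimal sketch of the hand-off might look like the following; the object names are those from Week 2 and the calls are shown with only their essential arguments, so consult the GENESIS documentation for the options you used there:
myGRM <- pcrelateToMatrix(mypcrel) # kinship (GRM) matrix built from the PC-Relate results
genoData.iterator <- GenotypeBlockIterator(genoData) # iterates over blocks of SNPs for association testing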
The first step is to create a dataframe for the null model, which regresses phenotype on genetic ancestry only:
mydat <- data.frame(scanID = mypcair$sample.id,
                    pc1 = mypcair$vectors[,1], pc2 = mypcair$vectors[,2],
                    pc3 = mypcair$vectors[,3], pc4 = mypcair$vectors[,4],
                    pc5 = mypcair$vectors[,5], pc6 = mypcair$vectors[,6],
                    pc7 = mypcair$vectors[,7], pc8 = mypcair$vectors[,8],
                    pc9 = mypcair$vectors[,9], pc10 = mypcair$vectors[,10],
                    pheno = fam$Phenotype - 1) # data frame of fixed effects
Here we have included the first ten PCs. The last column of this dataframe contains the phenotypes from the fam file; we have subtracted 1 from each entry so that 0 corresponds to controls, and 1 to cases.
4.2.2 Fitting the model
Now we will fit the null model, i.e., Eq. (4.9) with just the fixed covariates \(\beta\) and the random SNP effects \(u\); no SNP main effects have been fitted yet. The computation is accomplished efficiently by the GENESIS package [14]. We need to specify the covariates as columns of the dataframe mydat, and pass the GRM as the covariance matrix of the random effects:
scanAnnot <- ScanAnnotationDataFrame(mydat)
nullmod <- fitNullModel(scanAnnot,
                        outcome = "pheno",
                        covars = c("pc1","pc2","pc3","pc4","pc5","pc6","pc7","pc8","pc9","pc10"),
                        cov.mat = myGRM,
                        family = "binomial") # null model
Now, using our GenotypeBlockIterator object, we can test each SNP one at a time:
assoc <- assocTestSingle(genoData.iterator,null.model = nullmod) # model including SNPs
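It is worth glancing at the first few rows of the results before plotting:
head(assoc) # one row per tested variant, including the position, Est (log-OR), and Score.pval columns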
4.2.3 Analyzing the results
The dataframe assoc will contain a column Est of log-odds-ratios and another column Score.pval of p-values. Make a Manhattan plot of the data using:
par(mar = c(5.1,5.1,4.1,2.1) ) # left default plus one
plot(assoc$pos,-log10(assoc$Score.pval),xlab = "Position",ylab = "-log10(p)",pch = 19,cex.axis = 1.5,cex.lab = 1.5,cex.main = 1.5) # Manhattan plot
abline(h = 8 - log10(5),col = 'red',lty = 2) # genome-wide significance threshold, -log10(5e-8)
Do not close the genoData connection yet; we will need it below to extract genotypes for the top SNP, and will close it at the end.
Does any SNP appear to be associated with disease? Is it genome-wide significant?
To check that your answer makes sense, extract the data for the most statistically significant SNP:
assoc[assoc$Score.pval < 1e-5,] # most-significant SNP(s)
including its id and association statistics. Then, using the GWASTools function getVariable on your genoData object, extract the name (rs id) of this SNP:
getVariable(genoData,"snp.rs.id")[...] # insert the position of the variant in the list
To check that the SNP is indeed associated with disease, compute the allele frequency overall, and of the cases and controls separately (being sure to fill in ‘…’ with the variant.id you found above):
cases_genotypes <- getVariable(genoData,"genotype")[fam$Phenotype == 2,...] # genotypes of cases at the SNP
controls_genotypes <- getVariable(genoData,"genotype")[fam$Phenotype == 1,...] # genotypes of controls at the SNP
p_0 <- sum(getVariable(genoData,"genotype")[,...]) / 2 / length(getVariable(genoData,"genotype")[,...]) # combined allele frequency
p_cases <- sum(cases_genotypes) / 2 / length(cases_genotypes) # cases allele frequency
p_controls <- sum(controls_genotypes) / 2 / length(controls_genotypes) # controls allele frequency
Now you can compute a z-score, p-value, and (log) OR:
z <- (p_cases - p_controls) / sqrt(p_0 * (1 - p_0) * (1 / (2 * length(cases_genotypes)) + 1 / (2 * length(controls_genotypes)))) # z-score for the allele-frequency difference; each individual contributes two alleles
pnorm(-abs(z),lower.tail = TRUE) + pnorm(abs(z),lower.tail = FALSE) # two-sided p-value
log( p_cases * (1 - p_controls) / (1 - p_cases) / p_controls) # log-OR
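When you have finished extracting genotypes, close the genotype file:
close(genoData) # close the GenotypeData connection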
How does your answer compare with the adjusted p-value found in the assoc table? What might account for the difference?
4.2.4 To turn in:
- A Manhattan plot of the SNPs in the region
- Information extracted from the assoc dataframe about the most-significant SNP
- Calculation of the crude logOR and p-value of the most-significant SNP
- Biological relevance of the variant you found, located in the Ensembl genome browser: What is the frequency in the European population of the allele you identified? What are some nearby genes? What is the likely causal variant linked to the variant you found? Hint: see [18].