In this assignment we will apply the data-cleaning techniques we learned in Week 1 and Week 2 to a simulated case-control study of disease.  We will create a simulated dataset from one of the 1KG populations, declare some alleles to be risk alleles, and find the association of each SNP with the simulated phenotype.  First we will give an overview of logistic regression and linear mixed models.

Statistics

The odds ratio

The logic of any genetic association study is to see if an allele is enriched in subjects affected with disease.  In other words, we want to see if the allele is associated with disease.  We do this by collecting many individuals with disease and a similar number of healthy controls from the general population.  An individual’s risk is its probability \(P\left(\text{Disease}\mid\text{Allele}\right)\) of developing disease over its lifetime, and the relative risk or RR is the ratio of the risk to carriers to the risk to non-carriers of the allele.  However, the RR is difficult to measure prospectively because it involves waiting a long time for a potentially small number of cases to develop.  If we retrospectively select cases that have already developed and match them with healthy controls, we can instead calculate an individual’s probability \(P\left(\text{Allele}\mid\text{Disease}\right)\) of carrying the allele conditioned on its disease status.  Related to the probability \(p\) of an event is the odds \(\frac{p}{1-p}\) of the event, and it turns out that the odds ratio\[\text{OR}=\frac{\frac{P\left(\text{Allele}\mid\text{Cases}\right)}{1-P\left(\text{Allele}\mid\text{Cases}\right)}}{\frac{P\left(\text{Allele}\mid\text{Controls}\right)}{1-P\left(\text{Allele}\mid\text{Controls}\right)}}\tag{1}\]is invariant to whether we collect subjects prospectively or retrospectively.  Hence we use the OR as a convenient measure of the effect of an allele on disease risk in case-control studies.  However, the crude measure (1) cannot be adjusted for other factors like sex, age, and genetic ancestry, which may also have an effect on disease risk.  For this type of analysis, we need logistic regression.
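
As a concrete illustration of Eq. (1), here is a small example in R with made-up allele counts (these numbers are illustrative, not data from this assignment):

p.case <- 300 / 1000 # P(Allele | Cases): 300 of 1000 case chromosomes carry the allele
p.ctrl <- 150 / 1000 # P(Allele | Controls): 150 of 1000 control chromosomes carry the allele
(p.case / (1 - p.case)) / (p.ctrl / (1 - p.ctrl)) # OR from Eq. (1); about 2.43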

The logistic model of disease risk

Normally we test the association between two variables using linear regression, but doing so requires a continuous outcome.  In genetic association studies the outcome (disease status) is binary, not continuous.  However, an individual’s unobserved disease propensity \(p\) is a continuous variable that ranges between 0 and 1; furthermore, the individual’s log-odds \(\log{\frac{p}{1-p}}\) of disease is a continuous value that ranges between \(-\infty\) and \(+\infty\).  Thus if we assume that the log-odds of disease can be represented by the equation \[\log{\frac{p}{1-p}}=\beta_0+\beta_1X_1,\tag{2}\] where \(X_1\) is the number of risk alleles and \(\beta_1\) is the log-odds-ratio, then we can in principle fit a line and estimate its slope.  This slope \(\beta_1\) would then be interpreted as the additive increase in the log-odds of disease per risk allele, or equivalently \(e^{\beta_1}\) as the multiplicative increase in the odds.
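
To make Eq. (2) concrete, here is a minimal sketch with arbitrary illustrative values of \(\beta_0\) and \(\beta_1\) (not estimates from any data):

X1 <- 0:2 # number of risk alleles
beta0 <- -1 ; beta1 <- 0.5 # illustrative values only
log.odds <- beta0 + beta1 * X1 # Eq. (2)
exp(log.odds) / (1 + exp(log.odds)) # implied disease risk p for 0, 1, or 2 risk alleles

Each additional risk allele adds \(\beta_1\) to the log-odds, i.e., multiplies the odds of disease by \(e^{\beta_1}\approx1.65\).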

Now, since we cannot observe \(p_i\) for each subject \(i\), we cannot actually fit (2) using linear regression.  We can, however, use the concept of maximum likelihood.  In statistics, the likelihood of a disease model like (2) is the probability of the data being generated by the model, or\[\mathcal{L}\left(\text{Model}\mid\text{Data}\right)=P\left(\text{Data}\mid\text{Model}\right).\tag{3}\]Here the model is the set of parameters \(\beta_0,\beta_1\) required to predict disease risk, and we can find estimates for the parameters by maximizing (3), i.e., by finding the model most consistent with the data.  Furthermore, if the likelihood function has approximately the shape of a normal distribution, then we can estimate the statistical significance of our estimates by computing their standard errors, obtained from the curvature or second derivative of \(\mathcal{L}\) near the \(\beta\) that maximize it.

For large samples the binomial likelihood is well approximated by a normal distribution, so if we model the likelihood of the observed data as\[\mathcal{L}=\prod_ip_i^{y_i}\left(1-p_i\right)^{1-y_i},\] where \(y_i=0,1\) is an indicator of disease status, then the log-likelihood is\[\begin{align}\ell&=\sum_iy_i\log{p_i}+\left(1-y_i\right)\log{\left(1-p_i\right)}\\\ &=\sum_iy_i\log{\frac{p_i}{1-p_i}}+\log{\left(1-p_i\right)}\end{align}\]or using (2)\[\ell=\sum_iy_i\left(\beta_0+\beta_1X_{i1}\right)-\log{\left(1+e^{\beta_0+\beta_1X_{i1}}\right)}.\tag{4}\]Eq. (4) is a function of the parameter \(\beta_1\).  Thus we can find the maximum-likelihood estimate \(\hat{\beta_1}\) of \(\beta_1\) by solving \(\frac{\partial \ell}{\partial \beta_1}\bigr\rvert_{\beta_1=\hat{\beta_1}}=0\) for \(\hat{\beta_1}\), and we can get its variance \(-\left(\frac{\partial^2\ell}{\partial \beta_1^2}\bigr\rvert_{\beta_1=\hat{\beta_1}}\right)^{-1}\) (whose square root is the standard error) by evaluating the curvature of the log-likelihood at the best estimate of \(\beta_1\).  From these we can also get an estimate of the statistical significance.
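
To see maximum likelihood in action, the following sketch simulates a toy dataset (every name and value here is illustrative), maximizes the log-likelihood (4) numerically with optim, and checks the estimates against R’s built-in glm:

set.seed(1)
X1 <- rbinom(500,2,0.3) # toy genotypes: 0, 1, or 2 risk alleles
p <- exp(-0.5 + 0.8 * X1) / (1 + exp(-0.5 + 0.8 * X1)) # true disease probabilities
y <- rbinom(500,1,p) # toy disease status
negll <- function(b) -sum(y * (b[1] + b[2] * X1) - log(1 + exp(b[1] + b[2] * X1))) # negative of Eq. (4)
fit <- optim(c(0,0),negll,hessian = TRUE) # minimize the negative log-likelihood
fit$par # maximum-likelihood estimates of beta0 and beta1
sqrt(diag(solve(fit$hessian))) # standard errors from the curvature of the log-likelihood
coef(glm(y ~ X1,family = "binomial")) # should closely match fit$par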

Linear mixed models

In genome-wide association studies (GWAS) we’d like to estimate the odds ratio \(e^{\beta_1}\) for each SNP to see if any are associated with disease.  However, there are millions of SNPs and only thousands of subjects, so we cannot fit all the parameters simultaneously.  Instead, one SNP effect \(\beta_1\) is fit at a time together with the other fixed covariate effects \(\beta_j\) against a background of the composite, random effects of all the remaining SNPs together, so that our model has two components:\[Y_i=\log{\frac{p_i}{1-p_i}}=\sum_jX_{ij}\beta_j+\sum_jZ_{ij}u_j+\varepsilon_i.\tag{5}\]Here, \(\mathbf{Z}\) is an \(n\times m\) matrix of the (standardized) genotypes of \(n\) individuals at \(m\) SNPs and \(u_j\) is the random effect of SNP \(j\) on the log-odds.  The mixed-model framework assumes the \(u_j\) come from a normal distribution with mean 0 and standard deviation \(\sigma\), so that each variant has but a small effect on disease risk.

Now, the variance of the log-odds \(Y\) about its mean \(\mathbf{X}\beta\) becomes\[\begin{align}\left(Y-\mathbf{X}\beta\right)\left(Y-\mathbf{X}\beta\right)^T&=\mathbf{Z}uu^T\mathbf{Z}^T+\varepsilon\varepsilon^T\\\ &=\left(\mathbf{Z}\mathbf{Z}^T+\mathbf{I}\right)\sigma^2=\mathbf{V}.\end{align}\]Rearranging and differentiating with respect to \(\beta_k\) obtains (with summation over \(i\), \(j\), and \(l\))\[\begin{align}V_{li}^{-1}X_{ik}\left(Y_l-X_{lj}\beta_j\right)^T+V_{il}^{-1}\left(Y_l-X_{lj}\beta_j\right)X^T_{ki}&=0\\\ \left(X^T_{ki}V_{il}^{-1}Y_l-X^T_{ki}V_{il}^{-1}X_{lj}\beta_j\right)^T+\left(X^T_{ki}V_{il}^{-1}Y_l-X^T_{ki}V_{il}^{-1}X_{lj}\beta_j\right)&=0,\end{align}\]since \(\mathbf{V}\) and hence \(\mathbf{V}^{-1}\) is a symmetric matrix.  And because the \(k\)th entry of a transposed vector is equal to the \(k\)th entry of the original vector, we get the maximum-likelihood solution\[\mathbf{X}^T\left(\mathbf{I}+\mathbf{Z}\mathbf{Z}^T\right)^{-1}\mathbf{X}\hat{\beta}=\mathbf{X}^T\left(\mathbf{I}+\mathbf{Z}\mathbf{Z}^T\right)^{-1}Y,\tag{6}\]where \(\frac{1}{m}\mathbf{Z}\mathbf{Z}^T\) is the genomic relationship matrix (GRM) we used to compute principal components in Week 1.  Thus we can estimate the fixed effects \(\beta\) (including the SNP effect \(\beta_1\)) and their standard errors without fitting the random effects of every other SNP simultaneously.
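
To make Eq. (6) concrete, here is a self-contained sketch (everything simulated, with the derivation’s simplifying assumption that \(u\) and \(\varepsilon\) share the same \(\sigma\)) that recovers a fixed effect by generalized least squares:

set.seed(2)
n <- 200 ; m <- 50 ; sigma <- 0.5
Z <- scale(matrix(rbinom(n * m,2,0.3),n,m)) # standardized toy genotypes at m background SNPs
X <- cbind(1,rbinom(n,2,0.3)) # intercept plus one tested SNP
Y <- X %*% c(0,0.5) + Z %*% rnorm(m,0,sigma) + rnorm(n,0,sigma) # model (5) with true SNP effect 0.5
V.inv <- solve(diag(n) + Z %*% t(Z)) # sigma^2 cancels on both sides of Eq. (6)
beta.hat <- solve(t(X) %*% V.inv %*% X,t(X) %*% V.inv %*% Y) # Eq. (6)
beta.hat # second entry should be near the true SNP effect 0.5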

Simulating genotypes and phenotypes

Today we will be simulating a case-control study using the package sim1000G and the logistic model.  Then we will perform a “genome-wide” association study to discover the SNPs that affect disease risk.  We will need the following packages:

library(sim1000G)
library(SNPRelate)
library(GENESIS)
library(GWASTools)
library(SeqVarTools)
library(data.table)
library(fields)

Importing data and simulating genotypes

First get the file of individuals and their population codes:

indivs <- read.table("path/to/file/CHB+YRI+CEU.txt",header = FALSE)
colnames(indivs) <- c("id","pop")

Then import the 1KG vcf file for chr1 using the sim1000G function readVCF.  Take 100 variants with minor allele frequencies between 0.10 and 0.90 to simulate genotypes comprising common variants across the chromosome:

vcf <- readVCF("path/to/BIOL 350 Spring 2023/CHB+YRI+CEU.chr1.vcf.gz", maxNumberOfVariants = 100 , min_maf = 0.10 , max_maf = 0.90)

Make a table of variants as we did previously; call it variants.  Use this table to make a map file POP.simulation.chr1.map (POP = CHB, YRI, or CEU) by

write.table(data.frame(chr = variants$chr,rsid = variants$rsid,X = 0,pos = variants$pos),col.names = FALSE,row.names = FALSE,quote = FALSE,file = "path/to/file/POP.simulation.chr1.map") # variant map file, POP = CHB, YRI, or CEU

Next simulate 2000 individuals from your choice of POP, also as we did before.  It is important that we use a large number of subjects so that we have sufficient statistical power to detect ORs near 1.  At the end of this step you should have, using your previous code, two data tables dt.gt1.allele and dt.gt2.allele of alleles on each of the two chromosomes of 2000 subjects.  We will use these tables to make a ped file.

You should also have two 2000-by-100 data tables dt.gt1 and dt.gt2 of 0’s and 1’s, corresponding to the number of copies of the alternative allele on each chromosome.  Turn these into a matrix from which we can compute allele frequencies:

genotype.matrix <- as.matrix(dt.gt1 + dt.gt2) # add alleles together and make genotype matrix
colnames(genotype.matrix) <- 1:ncol(genotype.matrix)
rownames(genotype.matrix) <- 1:nrow(genotype.matrix)

You can compute the frequency of each minor allele by applying the sum function to the columns of genotype.matrix and dividing by twice the number of rows.

apply(genotype.matrix,2,sum) / (2 * nrow(genotype.matrix)) # minor allele frequencies

These values are called the minor allele frequencies (MAF) and will be used to simulate phenotypes.

Simulating phenotypes

We will simulate a binary phenotype by picking three alleles at random to increase the risk of disease.  We will weight the log-OR \(\beta\) by allele frequency according to the formula \[\beta=\log{\left(10\right)}\times-\log_{10}{\left(\text{MAF}\right)}.\tag{7}\]Eq. (7) says that each copy of an allele occurring on only 10% of chromosomes multiplies the odds of disease by 10.  This is a much stronger effect than most SNPs show, but implementing it will increase the power of our small study.  Choose and view the effect sizes using:

set.seed(your_student_id) # set random seed
disease.snps <- sample(1:nrow(variants),3,replace = FALSE) # randomly select three disease SNPs
beta <- log(10/1) * -log10(apply(genotype.matrix[,disease.snps],2,sum) / (2 * nrow(genotype.matrix))) # weight log-OR by allele frequency
rbind(beta,freq = apply(genotype.matrix[,disease.snps],2,sum) / (2 * nrow(genotype.matrix))) # view frequency and beta values

According to Eq. (2), the number of copies of each allele will increase the log-odds of disease for each subject by \(X_{i1}\beta_1+X_{i2}\beta_2+X_{i3}\beta_3\).  Carry out this matrix product by

logits <- genotype.matrix[,disease.snps] %*% (beta) # genotype matrix multiplied by column of beta values

The baseline probability of disease in our study will be 50%, because there will be an approximately equal number of cases and controls, corresponding to a mean log-odds of \(\log{\left(\frac{0.5}{0.5}\right)}=0\).  In order that the mean log-odds should be zero, simply subtract the mean of logits from logits itself:

logits <- logits - mean(logits) # center the log-odds for a balanced case-control study

Now we can convert the log-odds into a probability by rearranging Eq. (2)\[p_i=\frac{e^{\beta_0+X_{i1}\beta_1+X_{i2}\beta_2+X_{i3}\beta_3}}{1+e^{\beta_0+X_{i1}\beta_1+X_{i2}\beta_2+X_{i3}\beta_3}}.\tag{8}\]In R, we can do

probs <- exp(logits) / (1 + exp(logits)) # disease probability 

Now we can simulate the binary phenotype by drawing a random number between 0 and 1.  If the number is less than probs[i] for subject i, then subject i is a case; otherwise it is a control.  Simulate the phenotypes by

pheno <- as.numeric(runif(nrow(probs)) <= probs) + 1 # generate disease phenotypes (2 for cases, 1 for controls)

Verify that the proportion of cases is close to 0.5.  Since pheno is coded 2 for cases and 1 for controls, its raw mean will be near 1.5, so subtract 1 before averaging:
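
mean(pheno - 1) # proportion of cases; should be close to 0.5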

Saving your genotype and phenotypes

To make a ped file with phenotype information, we first need to make a six-column data frame to prepend the genotypes:

fam <- data.frame(fid = row.names(genotype.matrix),id = row.names(genotype.matrix), mother = 0,father = 0,sex = sample(c(0,1),nrow(genotype.matrix),replace = TRUE),phenotype = pheno) # first six columns of the ped file

This fam object is the pedigree of a cohort of unrelated individuals with approximately equal numbers of cases and controls, and of men and women.  Now, prepend these six columns to the vector of each subject’s alleles on each of its two chromosomes:

write.table(cbind(fam,data.frame(cbind(dt.gt1.allele,dt.gt2.allele))[,rep(1:ncol(dt.gt1.allele),each = 2) + (0:1) * ncol(dt.gt1.allele)]),col.names = FALSE,row.names = FALSE,quote = FALSE,file = "path/to/file/POP.simulation.chr1.ped") # genotype .ped file,  POP = CHB, YRI, or CEU

This will be our ped file.  Verify that the object you created has 2000 rows and 206 columns (what do these numbers represent?).
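
One way to check is to re-read the file you just wrote (the object name ped below is ours, introduced only for this check):

ped <- read.table("path/to/file/POP.simulation.chr1.ped",header = FALSE) # re-import the ped file
dim(ped) # expect 2000 rows and 6 + 2 * 100 = 206 columns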

Association testing

In the next part of the analysis, our goal will be to use the GENESIS package to estimate the log-OR associated with each SNP in our dataset.  According to Eq. (6), we need the GRM to do this properly.  GENESIS requires that the GRM be in a specific form, which we will achieve by following the protocol for PC-AiR and PC-Relate.  First of all, we need kinship estimates from the implementation of the KING algorithm in SNPRelate.

PCA, LD pruning, and KING in SNPRelate

If you successfully created map and ped files, you can convert them to a gds object using the SNPRelate package:

snpgdsPED2GDS("path/to/file/POP.simulation.chr1.ped","path/to/file/POP.simulation.chr1.map","path/to/file/POP.simulation.chr1.gds")
snpgdsSummary("path/to/file/POP.simulation.chr1.gds")
genofile <- SNPRelate::snpgdsOpen("path/to/file/POP.simulation.chr1.gds")

As in Week 1, generate PCA estimates by

pca <- snpgdsPCA(genofile) # principal components analysis

As we did before, make a plot of the first few PCs.  We’re only using a single population now; how is your PCA plot different from when we used three?
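
If you no longer have your Week 1 plotting code handy, a minimal version (the PC loadings are stored in pca$eigenvect) is:

plot(pca$eigenvect[,1],pca$eigenvect[,2],xlab = "PC1",ylab = "PC2",main = "PCA of simulated subjects") # PC2 vs. PC1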

Next perform LD-pruning to get a set of unlinked SNPs:

snpset <- snpgdsLDpruning(genofile,method = "corr", slide.max.bp = 10e6,ld.threshold = sqrt(0.1), verbose = FALSE)
pruned <- unlist(snpset, use.names = FALSE)

There is a possibility your disease SNPs will be removed by this analysis, but only if they can be represented by proxy SNPs.  Make and plot the LD matrix for your data using:

LD.mat <- snpgdsLDMat(genofile, snp.id = pruned, slide = 0,method = "corr")
fields::image.plot(1:nrow(LD.mat$LD),1:ncol(LD.mat$LD),LD.mat$LD ^ 2,main = "LD r2",xlab = "SNPs",ylab = "SNPs",asp = 1,frame.plot = FALSE)

Finally, we will use our pruned set of SNPs to get kinship estimates for our subjects:

ibd <- snpgdsIBDKING(genofile,snp.id = pruned)
colnames(ibd$kinship) <- ibd$sample.id
rownames(ibd$kinship) <- ibd$sample.id

Make a plot of the kinship coefficient vs. the IBS0 proportion using

plot(ibd$IBS0,ibd$kinship,ylab = "Kinship coefficient",xlab = "IBS0",main = "KING relatedness estimation")

There are a lot of data points, so be patient!

We can now close our gds object before moving on to PC-AiR and PC-Relate:

closefn.gds(genofile)

If ever you have trouble, try

showfile.gds(closeall = TRUE)

to close all gds files.

PC-AiR, PC-Relate, and the revised GRM

We will follow the same steps as last week to get a revised GRM for association analysis.  First use GWASTools to get a GenotypeData object from your gds file, as required by PC-AiR:

geno <- GdsGenotypeReader("path/to/file/POP.simulation.chr1.gds")
genoData <- GenotypeData(geno)

Now run pcair to get genotype principal components based on an unrelated, ancestry-representative subset of subjects.  You will need the output of the KING kinship analysis first.

mypcair <- pcair(genoData,kinobj = ibd$kinship,divobj = ibd$kinship,snp.include = pruned) # genotype principal components based on a subset of unrelated individuals

See how the PC estimates have changed for the first few PCs using:

plot(mypcair,vx = 1, vy = 2) # plot of PC2 vs PC1 using GENESIS package method

Now we’ll create a GenotypeBlockIterator object, as required by PC-Relate:

genoData.iterator <- GenotypeBlockIterator(genoData,snpInclude = pruned)

Recall that PC-Relate updates the kinship coefficients by adjusting out shared genetic ancestry and looking only for sharing between close relatives.  We can get and plot our new estimates using:

mypcrel <- pcrelate(genoData.iterator,pcs = mypcair$vectors[,1:2],training.set = mypcair$unrels) # kinship based on unrelated individuals
plot(mypcrel$kinBtwn$k0,mypcrel$kinBtwn$kin,xlab = "IBD0 proportion",ylab = "Kinship coefficient",main = "PC-relate relatedness estimation") # kinship vs. IBD0

and update the GRM using

myGRM <- pcrelateToMatrix(mypcrel,scaleKin = 2) # genomic relationship matrix, multiplying each phi by 2  

Print out a ten-by-ten sample of the GRM to see if your results make sense.
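
For example (pcrelateToMatrix returns a Matrix object, so coerce it with as.matrix first):

as.matrix(myGRM)[1:10,1:10] # diagonal entries should be near 1, off-diagonals near 0 for unrelated subjects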

With this GRM, we’re ready to do the association test.

Association testing

To measure the SNP effects, we’re first going to fit a null model \[\log{\left(\frac{p_i}{1-p_i}\right)}=\beta_0+\sum_{j=1}^{10}PC_{ji}\beta_{PC_j}+\sum_jZ_{ij}u_j\tag{9}\]with ten PCs as fixed effects and all SNPs as random effects and then compare it to a model \[\log{\left(\frac{p_i}{1-p_i}\right)}=\beta_0+\sum_{j=1}^{10}PC_{ji}\beta_{PC_j}+X_{ik}\beta_k+\sum_jZ_{ij}u_j\tag{10}\]with a single SNP effect.  We will estimate \(\hat{\beta_k}\) for each SNP \(k\) and estimate its statistical significance.

Fitting models (9) and (10) is not trivial; fortunately, we have the package GENESIS to help us.  GENESIS requires us to convert a data frame \(\mathbf{X}\) of fixed effects to a ScanAnnotationDataFrame object from the package GWASTools.  Your 2000-by-12 data frame should consist of the ten PCs for each subject and the phenotype (0/1 instead of 1/2):

mydat <- data.frame(scanID = mypcair$sample.id,pc1 = mypcair$vectors[,1],pc2 = mypcair$vectors[,2],pc3 = mypcair$vectors[,3],pc4 = mypcair$vectors[,4],pc5 = mypcair$vectors[,5],pc6 = mypcair$vectors[,6],pc7 = mypcair$vectors[,7],pc8 = mypcair$vectors[,8],pc9 = mypcair$vectors[,9],pc10 = mypcair$vectors[,10],pheno = pheno - 1) # data frame of fixed effects

Then convert it to a ScanAnnotationDataFrame using:

scanAnnot <- ScanAnnotationDataFrame(mydat)

We can use this object in the GENESIS function fitNullModel, which fits model (9).  We will have to specify the regression outcome and a vector covars of covariates, both consisting of text labels of the fields of mydat as they appear in scanAnnot.  We also need to specify the covariance matrix GRM for including the random effects, and, since we’re doing logistic regression, an indication that our likelihood function is of the binomial family:

nullmod <- fitNullModel(scanAnnot, outcome = "pheno", covars = c("pc1","pc2","pc3","pc4","pc5","pc6","pc7","pc8","pc9","pc10"), cov.mat = myGRM, family = "binomial")

This step may take a few iterations to complete.  Now, to fit model (10) for each SNP, we’ll use our GenotypeBlockIterator object to see if any SNPs significantly increase the likelihood.  If adding a SNP to the model causes the log-likelihood function (4) to get significantly bigger, we accept it in the model.  Run

assoc <- assocTestSingle(genoData.iterator,null.model = nullmod)

to get a table of the tested variants along with the Score.pval for each one.  Make a Manhattan plot by plotting \(-\log_{10}{p}\) vs. SNP position:

plot(assoc$pos,-log10(assoc$Score.pval),xlab = "Position",ylab = "-log10(p)") # Manhattan plot
abline(h = -log10(5e-8),col = 'red',lty = 2) # genome-wide significance threshold

Are any SNPs above the genome-wide significance level \(p=5.0\times10^{-8}\)?  Extract the SNPs disease.snps from assoc using

assoc[assoc$variant.id %in% disease.snps,]

and see if the log-OR estimates Est agree with the values beta you previously assigned in

rbind(beta,freq = apply(genotype.matrix[,disease.snps],2,sum) / (2 * nrow(genotype.matrix))) # view frequency and beta values

You may have to multiply Est by -1 if the counted allele is the opposite of your risk allele (i.e., if the allele frequency is reversed).

Finally, we’re going to make a QQ (quantile-quantile) plot by sorting the p-values to see if the observed association statistics depart from expectation.  Under the null hypothesis of no association, p-values are uniformly distributed: if we have one hundred p-values, the \(i\)th smallest is expected to be about \(i/100\).  We plot each observed p-value against its expected value, negative-log10-transforming each instead of plotting on the nominal scale.  Departure from expectation is indicated by departure of the observed statistics from a line through the first and third quartiles.  We can generate such a plot by

qqplot(sort(-log10((1:length(assoc$Score.pval))/length(assoc$Score.pval))),sort(-log10(assoc$Score.pval)),xlab = "Expected −log10(p)",ylab = "Observed −log10(p)",pch = 19)
qqline(sort(-log10(assoc$Score.pval)),probs = c(0.25,0.75),col = 'red') # line through (-log10(empirical Q1),-log10(p-value of empirical Q1)) and (-log10(empirical Q3),-log10(p-value of empirical Q3))

Each point represents an association test for one of the SNPs.  Do any SNPs significantly depart from the line?

When you’re done with this analysis, close your genotypeData object:

close(genoData)

To turn in:

  1. Three PCA plots output by PC-AiR to show whether your population has homogeneous genetic ancestry
  2. A Manhattan plot of the single-SNP association tests
  3. A QQ plot for the same
  4. A table of the three disease SNPs and their MAFs comparing the true OR with the estimated OR and p-value