Chapter 3 Kinship Analysis
3.1 Kinship: theory
Kinship can be defined as the expected fraction of alleles that two individuals got from the same ancestor(s). We say that two individuals share an allele of a SNP identical-by-descent or IBD if they inherited the same copy of the allele from a common ancestor. IBD-sharing is different from simply carrying the same allele of a gene (known as identical-by-state or IBS-sharing), which unrelated individuals may do if the allele is common enough in the population. The degree \(R\) of relationship may be defined as the effective number of meioses separating the relatives through the equation \[\begin{equation} \frac{1}{2^R}=\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}, \tag{3.1} \end{equation}\] in which \(R_i\) is the number of meioses separating the relatives through the first relative’s \(i\)th parent. For example, sibs are connected by two meioses through two parents, while a parent and child are connected by one meiosis through one parent: both relationships are degree-1.
The probability that two relatives share an allele IBD is \(\frac{1}{2^R}\), as there is a \(\frac{1}{2}\) probability that an allele is passed on in any meiosis, and \(R\) is the effective number of meioses or steps between the relatives. If \(2\times\frac{1}{2^R}\) is the expected number of alleles shared IBD at any given locus, then the fraction of the genome shared by any two relatives is \[\begin{equation} r=\frac{2\times\frac{1}{2^R}}{2}=\frac{1}{2^R}. \tag{3.2} \end{equation}\]
However, genomic sharing can be realized in different ways depending on the probabilities \(\pi_0\), \(\pi_1\), and \(\pi_2\) that individuals share zero, one, or two copies IBD at a locus. The probability that the relatives inherit both both copies IBD, viz., \[\begin{equation} \pi_2=P\left(\text{share 2 IBD}\right)=2^2\frac{1}{2^{R_1}2^{R_2}}, \tag{3.3} \end{equation}\]is simply the product of the probabilities of sharing through both parents. The probability of sharing exactly one allele IBD, viz., \[\begin{equation} \pi_1=P\left(\text{share 1 IBD}\right)=2\left(\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}\right)-2^3\frac{1}{2^{R_1}2^{R_2}}, \tag{3.4} \end{equation}\]is the got by finding the probability \(2\left(\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}\right)-2^2\frac{1}{2^{R_1}2^{R_2}}\) of sharing at least one allele IBD less the probability \(\pi_2\) of sharing two. Finally, the probability of sharing at least zero alleles IBD, viz., \[\begin{equation} \pi_0=P\left(\text{share 0 IBD}\right)=1-2\frac{1}{2^{R_1}}-2\frac{1}{2^{R_2}}+2^2\frac{1}{2^{R_1}2^{R_2}}, \tag{3.5} \end{equation}\]is got by subtracting the probability \(\pi_1+\pi_2\) of sharing at least one allele IBD from 1. The coefficients account for the fact that there are \(2\) alleles at each locus, and \(2^2\) that can be shared.
From (3.3)–(3.5), the fraction of the genome shared IBD is \[\begin{equation} r=\frac{2\pi_2+1\pi_1}{2}=\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}=\frac{1}{2^R}. \tag{3.4} \end{equation}\]But the same value of \(r\) can obtain from different values of \(\pi_1\) and \(\pi_2\). For example, full sibs have a 25% probability of sharing two alleles and a 50% chance of sharing one allele at a locus, for a total fraction \(r=0.5\) shared. But a parent and child have 0% chance of sharing two alleles and a 100% chance of sharing one, also giving \(r=0.5\). Thus, your parent does is not equal your sibling, despite the fact of your sharing equal amounts of your genome with each of them. Put another way, parents cannot pass on their genotypes to their offspring.
3.1.1 KING
KING [6] computes both the probability \(\pi_0\) that two relatives share 0 alleles IBD as well as the coefficient of relatedness \(\phi=\frac{r}{2}\), defined as the probability that two alleles taken one from each relative are IBD at a locus (the maximum probability is \(\frac{1}{2}\) because there is a 50% chance that the alleles chosen come from different parents). The idea is to compare the counts \(X\) and \(Y\) of the alternative alleles which two individuals each have at a genetic locus. If the pair are from a single ancestral population, the expected values and variances of the allele counts are \(\mathbb{E}\left(X\right)=\mathbb{E}\left(Y\right)=2p\) and \(\sigma_X^2=\mathbb{E}\left(X^2\right)-\mathbb{E}\left(X\right)^2=\mathbb{E}\left(Y^2\right)-\mathbb{E}\left(Y\right)^2=2p\left(1-p\right)\). Thus the expected value of the difference \(\mathbb{E}\left(\left(X-Y\right)^2\right)=\mathbb{E}\left(X^2\right)+\mathbb{E}\left(Y^2\right)-2\mathbb{E}\left(XY\right)\) is \[\begin{equation}\frac{\mathbb{E}\left(\left(X-Y\right)^2\right)}{\sigma_X^2+\sigma_Y^2}=1-\frac{\sigma_{XY}}{\sigma_X\sigma_Y}=1-r, \tag{3.6} \end{equation}\]where \(\sigma_{XY}=\mathbb{E}\left(XY\right)-\mathbb{E}\left(X\right)\mathbb{E}\left(Y\right)\) is the covariance of the genotype counts and \(r_{ij}=2\varphi_{ij}\) is the genetic correlation between individuals \(i\) and \(j\); the latter can be interpreted as the amount of the genome shared IBD.
KING estimates \(\varphi\) from Eq. (3.6) by counting the number \(N\) of loci at which two individuals are heterozygous \(Aa,Aa\) or opposite homozygous \(AA,aa\), as well as the total number of alleles at which each individual is heterozygous \(Aa\): \[\begin{equation} \hat{\varphi_{ij}}=\frac{N_{Aa,Aa}-2N_{AA,aa}}{N_{Aa}^{\left(i\right)}+N_{Aa}^{\left(j\right)}}. \tag{3.7} \end{equation}\]From (3.7) it can be seen that shared heterozygous sites increase the estimated relatedness, and that unshared homozygous sites decrease relatedness. Eq. (3.7) is called a “robust” estimator because it measures relatedness in a purely pairwise fashion: it does not rely on population estimates of allele frequencies. However, if the individuals are not of the same genetic background, the allele frequency \(p\) is not well-defined and Eq. (3.6) does not hold, leading to negative estimates of \(\varphi\); this feature is not necessarily a problem, as it helps us to distinguish different ancestries within a single population.
3.1.2 PC-AiR
KING provides an estimate of relatedness. But as cryptic relatedness can skew ancestry estimates determined by PCA, we should use kinship to correct PCA. This is where principal components analysis in related samples, or PC-AiR [7] comes in handy. PC-AiR identifies a subset of \(\mathcal{U}\) of individuals who are unrelated to each other (via negative KING kinship estimates), but maximally related to individuals in a remaining set \(\mathcal{R}\). Recalling the math in Week 1, the matrix \[\begin{equation} \frac{1}{m}\mathbf{W}=\mathbf{X}^T\mathbf{U}=\mathbf{V}\mathbf{\Sigma}, \tag{3.8} \end{equation}\]so that \(\frac{1}{m}w_{jk}\lambda_{kk}^{-1}=x_{ij}u_{ik}\lambda_{kk}^{-1}=v_{jk}\) is the genotype at SNP \(k\) in ancestry \(j\) inferred from subjects’ genotypes \(\mathbb{X}\) who are in subset \(\mathcal{U}\). Then if we have an \(m\times n_u\) genotype matrix \(\mathbf{X}'\) of individuals in the subset \(\mathcal{R}\), we can “apply” \(\mathbf{W}\) to get an estimate of the “ancestry-corrected” PCs by: \[\begin{equation} \frac{1}{m}\mathbf{X}'\mathbf{W}\mathbf{\Sigma}^{-1}=\mathbf{X}'\mathbf{V}. \tag{3.9} \end{equation}\]Eq. (3.9) is the matrix \(x'_{i\cdot}v_{\cdot j}\) of similarities between subjects \(i\) in set \(\mathcal{R}\) and ancestry \(j\). In this way, we have estimated genotype principal components for all individuals (i.e., those in \(\mathcal{U}\) and \(\mathcal{R}\)) without the problem confounding due to cryptic relatedness.
3.1.3 PC-Relate
Once we have ancestry-corrected principal components for all individuals, we can obtain “corrected” expected genotypes \(\mathbb{E}\left(X_{ik}\right)=2p_{ik}\) of SNPs \(k\) for populations of admixed individuals \(i\) using a method known as PC-Relate [8]. Because in populations with both shared genetic ancestry and cryptic relatedness, we can use the PC estimates \(v_{kj}\) of the allele frequencies in each population \(j\) to express the scaled genotypes \(x_{ij}=\frac{g_{ik}-2p_k}{2p_k\left(1-p_k\right)}\) as \[\begin{equation} \mathbb{E}\left(x_{ik}\mid u_{ij}\right)=u_{ij}\lambda_{jj}v_{kj}, \tag{3.9} \end{equation}\]or, in terms of the measured genotypes: \[\begin{equation} \mathbb{E}\left(g_{ik}\mid u_{ij}\right)=2p_k+u_{ij}\lambda_{jj}v_{kj}\cdot 2p_k\left(1-p_k\right), \tag{3.10} \end{equation}\]in which \(p_k\) is the allele frequency estimated from the entire sample. Eq. (3.10) says that if we make a graph of each individual’s genotype \(g_{ik}=0,1,2\) at SNP \(k\) vs. the (scaled) of the amount the individual’s population-\(j\) genetic ancestry, the slope \(\lambda_{jj}v_{kj}2p_k\left(1-p_k\right)\) should be the expected allele frequency in population \(j\). In practice, the slopes are obtained from linear regression of the observed genotypes on genetic PCs, and the updated expected allele frequencies \(\mathbb{E}\left(g_{ik}\mid u_{ij}\right)=2p_{ik}\) of SNPs \(k\) for individuals \(i\) are found by the corresponding value on the best-fit hyperplane, given individual \(i\)’s genetic ancestry \(u_{ij}\). We can then obtain a new genetic relationship matrix: \[\begin{equation} 2\hat{\varphi}_{ij}=\frac{\sum_k\left(g_{ik}-2p_{ik}\right)\left(g_{jk}-2p_{jk}\right)}{2\sqrt{p_{ik}\left(1-p_{ik}\right)p_{jk}\left(1-p_{jk}\right)}}. \tag{3.11} \end{equation}\]This new GRM will be used in Week 3 when we attempt to fit a linear mixed model for the effect of genotype on disease risk; for now we will use it to make an updated table of the relationship between individuals.
3.2 Kinship: practice
3.2.1 Importing data
Now we’ll try kinship analysis for ourselves using a simulated GWAS dataset. This dataset was created using the program bioGWAS [9], which uses other programs to simulate haplotypes [10] and phenotypes [11] based on a set of input variants previously associated with breast cancer [12] and made available by the MRC Interactive Epidemiology Unit’s Open GWAS Project [13]. This particular vcf file contains 1000 chromosome 10 genotypes at 345,130 variants (304,770 of which are SNPs) generated from 263 1KGP females of European origin. Download and unzip the file:
gunzip EUR_BCa.vcf.gz
and move it to your desired location. The unzipped file is readable in plain text format, although it is very big.
Next load the following R packages:
library(gdsfmt)
library(GWASTools)
library(SNPRelate)
library(GENESIS)
Our first task will be to import the vcf file and convert it to gds format. First specify the directory in which you saved your EUR_BCa.vcf:
directory <- "path/to/folder # fill in the appropriate path to downloads folder
Then define two file names:
vcf.fn <- sprintf("%s/EUR_BCa.vcf",directory) # vcf file name (exists already)
gds.fn <- sprintf("%s/EUR_BCa.gds",directory) # gds file name (about to be created)
Next perform the conversion and importation into R:
snpgdsVCF2GDS(vcf.fn,gds.fn) # create gds file from vcf
genofile <- snpgdsOpen(gds.fn,readonly = FALSE) # import gds object
The GDS format is an efficient representation of vcf files which does not require that the entire file be loaded into memory at once [4]. By typing
genofile
you can see all the fields under your gds object. To access one of these nodes, type
index.gdsn(genofile,"sample.id") # vector of sample names
or
read.gdsn(index.gdsn(genofile,"genotype"),start = c(1,1), count = c(5,5)) # read a 5 x 5 matrix of SNP genotypes
This second line uses the start and count options to extract a range of data from the 1000 × 304,770 matrix of genotypes. If you have a vector of phenotypes (as will will in Week 3), you can add it as a node in your gds:
add.gdsn(genofile,"phenotype",val = fam$Phenotype) # add vector of phenotypes from a table
3.2.2 PCA and LD-pruning
For now, we can run PCA on the genotype matrix by simply typing
pca <- snpgdsPCA(genofile) # principal components analysis
We do not even have to specify the genotype node of genofile. Make a plot of the first few PCs as we did in Week 1.
There are 304770 SNPs in this dataset; many are in LD with their neighbors. To find a set of independent SNPs, which are not in LD (\(r>0.10\)) inside a sliding window of size ten million bp, we run:
set.seed(566) # set random seed to initiate pruning process
snpset <- snpgdsLDpruning(genofile,method = "corr", slide.max.bp = 10e6,ld.threshold = sqrt(0.1), verbose = FALSE) # LD prining
pruned <- unlist(snpset, use.names = FALSE) # set of independent SNPs
3.2.3 KING IBD and kinship calculation
With our set of pruned SNPs, we can run KING to compute the pairwise kinship coefficient (\(\varphi_{ij}=\frac{1}{2}r_{ij}\)). We will extract the kinship matrix from the created ibd object and rename the rows and columns with the ids of the samples:
ibd <- snpgdsIBDKING(genofile,snp.id = pruned) # KING kinship estimation
colnames(ibd$kinship) <- ibd$sample.id
rownames(ibd$kinship) <- ibd$sample.id
Print a 10 × 10 sample of the kinship matrix using:
ibd$kinship[1:10,1:10]
Do the results make sense? Hint: what should an individual’s kinship be with itself?
Finally, we will make a plot of kinship vs. the fraction of the genome shared IBS = 0. This will take a minute or two to plot, because the data consist of all pairwise observations.
par(mar = c(5.1,5.1,4.1,2.1) ) # left default plus one
plot(ibd$IBS0,ibd$kinship,ylab = "Kinship coeffecient",xlab = "IBS0",main = "KING relatedness estimation", ylim = c(0,1),pch = 19,cex.lab = 1.5,cex.axis = 1.5,cex.main = 1.5) # plot kinship coefficient vs. proportion of SNPs not shared IBS
You should notice that related individuals have a low fraction of IBS = 0, whereas unrelated individuals have a larger proportion. The graph should be approximately linear, but with discontinuities as the relationsips become closer. There should not be many close relatives in this dataset. What, for example, is the largest inferred value of \(\varphi\)?
In addition, if you remove the ylim = c(0,1) option, you will see a large cluster of negative kinship coefficients. These values are expected if the (simulated) individuals are from different genetic backgrounds.
Before proceding to the next step, close your gds file:
closefn.gds(genofile)
If ever you should run into trouble, you can close all your open gds connections using:
showfile.gds(closeall = TRUE)
3.2.4 PC-AiR
The PC-AiR method is part of the GENESIS package, which imports gds files through:
geno <- GdsGenotypeReader(sprintf("%s/EUR_BCa.gds",directory)) # import the gds file you created above
genoData <- GenotypeData(geno)
and converts the objects to GenotypeData format [14].
Now we can run PC-Air. It is important to include the kinship estimates we got from KING, as well as the list of pruned SNPs. These values will help PC-AiR determine a set \(\mathcal{U}\) of ancestry-representative individuals and their principal components, which will then imputed into the remaining subset \(\mathcal{R}\) of individuals:
mypcair <- pcair(genoData,kinobj = ibd$kinship,divobj = ibd$kinship,snp.include = pruned) # genotype principal components based on a subset of unrelated individuals
Next we can make a plot of the first few PCs using:
par(mar = c(5.1,5.1,4.1,2.1) ) # left default plus one
plot(mypcair,vx = 1, vy = 2) # plot of PC2 vs PC1
Change the values of vx and vy to change which PCs are plotted.
3.2.5 PC-Relate
With our improved PCA results from PC-AiR, we will next update the GRM using individual-specific expected genotype counts \(2p_{ik}\). First we need to create an interator that evaluates blocks of SNPs one at a time.
genoData.iterator <- GenotypeBlockIterator(genoData,snpInclude = pruned)
From our pruned set of SNPs, you should have only one block.
Next we can run PC-Relate to update the kinship coefficients. From the PCs determined from PC-Air, we can get the new GRM by running:
mypcrel <- pcrelate(genoData.iterator,pcs = mypcair$vectors[,1:2],training.set = mypcair$unrels) # kinship based on unrelated individuals
The output object contains a field called kinBtwn which contains subfields k0 and kin corresponding to the fraction of the genome shared IBD = 0 and the kinship between each pair of individuals. Make a plot of these values, similar to KING’s, using:
par(mar = c(5.1,5.1,4.1,2.1) ) # left default plus one
plot(mypcrel$kinBtwn$k0,mypcrel$kinBtwn$kin,xlab = "IBD0 proportion",ylab = "Kinship coefficient",main = "PC-relate relatedness estimation",pch = 19,cex.axis = 1.5,cex.lab = 1.5,cex.main = 1.5) # kinship vs. IBD0
Finally, we convert the results of pcrelate to a GRM:
myGRM <- pcrelateToMatrix(mypcrel,scaleKin = 2) # genomic relationship matrix, multiplying each phi by 2
We scale by a factor of 2 in this code so that each kinship is multiplied by 2: then the entries of the GRM will represent the fraction of the genome shared IBD. Extract a 10 × 10 subset of this matrix to see if the results make sense, and how they compare to KING’s kinship matrix computed above:
myGRM[1:10,1:10]
Now you can close the file:
close(genoData)
Save all your steps from today’s analysis–we will be repeating them in Week 3 when we perform the association study.
3.2.6 To turn in:
Generate the following plots generated using the different methods, and comment on how they compare.
- PCA plots of a) PC1 vs. PC2, b) PC1 vs. PC3, and c) PC2 vs. PC3 generated by both snpgdsPCA and PC-AiR.
- Plots of kinship vs. IBD0 fraction generated by both KING and PC-Relate.
- 10 × 10 samples of the GRM generated by both KING and PC-Relate.