Chapter 2 Principal Components Analysis

This week we’ll see how individuals can be separated by genetic ancestry using principal components analysis. We’ll practice applying PCA on a subset of the 1KGP data [1].

2.1 Pricipal components analysis: theory

The populations in our dataset can be separated into clusters based on their genotypes. The inferred groups help control for confounding due to ancestry, and are also more reliable than self-reported race in association studies [2]. To see how it works, suppose \(\mathbf{X}\) is an \(n\times m\) (standardized) genotype matrix with individuals down the rows and SNPs across the columns. Principal components analysis (PCA) says we can find an \(m\times n\) matrix \(\mathbf{V}\), a diagonal \(n\times n\) matrix \(\mathbf{\Sigma}\), and an \(n\times n\) matrix \(\mathbf{U}\) satisfying \[\begin{equation} \mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T. \tag{2.1} \end{equation}\]If we think of \(\mathbf{V}\) as the the (standardized) SNP genotypes of an “ideal” person of a certain ancestry, then \[\begin{equation} x_{j\cdot} v_{\cdot i}=u_{ji}\lambda_{ii} \tag{2.2} \end{equation}\]represents the amount of idealized person \(i\) in actual person \(j\), up to some proportionality constant \(\lambda_{ii}\) that depends on the ancestry. Then rows \(j\) of \(\mathbf{U}\) are the ancestries of person \(j\), and columns \(i\) of \(\mathbf{U}\) are the ancestries of each individual on ancestry \(i\). It is important to remember that these “idealized” ancestries do not necessarily correspond with our preconceived notions of ancestry, so we cannot interpret them as “European,” “African,” or “Asian,” say. If we rearrange Eq. (2.1) and use the fact that the columns of \(\mathbf{U}\) and \(\mathbf{V}\) are orthonormal, we can find that \[\begin{equation} \mathbf{X}\mathbf{X}^T\mathbf{U}=\mathbf{U}\mathbf{\Sigma}^2, \tag{2.3} \end{equation}\]meaning that the columns of \(\mathbf{U}\) are the eigenvectors of the matrix \[\begin{equation} \frac{1}{m}\mathbf{X}\mathbf{X}^T \tag{2.4} \end{equation}\] whose \(\left(i,j\right)\) entry is the genetic correlation between individuals \(i\) and \(j\), sometimes known as the genomic relationship matrix or GRM. PCA works by finding the first several eigenvectors of the GRM and plotting each individual’s ancestry along each orthogonal vector in a rectangular grid. Clusters of individuals in this grid represent distinct ancestry groups.

PCA is used in many field besides genetics, and a good tutorial can be found here [3].

2.2 Principal components analysis: practice

We’ll need to use several Biocondunctor packages in the following analyses. If you have not used Biocondunctor before, run:

install.packages("BiocManager",repos = "http://cran.us.r-project.org")

Then to install a package from Biocondunctor for the first time, use the BiocManager::installs() command with the name of your package in quotes. For example:

BiocManager::install("SNPRelate")

2.2.1 Importing the data

To do PCA in R, we’ll need to load the libraries SNPRelate and SeqArray:

library(SNPRelate)
library(SeqArray)

Download the chr1 vcf file containing just the CEU, YRI, and CHB populations. Once you have the file, store its name as a variable:

vcf <- "path/to/file.vcf.gz"

Now we’ll convert from vcf format to gds format, retaining the base filename and changing the vcf.gz extension to gds. (This may take a minute to complete.) Then we’ll import the gds file as a gds object:

seqVCF2GDS(vcf.fn = vcf,"path/to/file.gds") # convert vcf to gds with a new file name
genofile <- seqOpen("path/to/file.gds") # import the gds object

You can can see the various fields under genofile by printing it. To access the data in one of the fields, do

seqGetData(genofile,"sample.id") # view the sample ids

where the name of the field is enclosed in quotes. Other functions for manipulating and view gds files are contained in the SeqArray package [4, 5].

2.2.2 Running PCA

To run PCA in R using the SNPRelate package [4], simply do

pca <- snpgdsPCA(genofile) # runs PCA

to create an objects with 32 eignevectors of the GRM. Make a data frame of the first several eigenvectors along with subject ids:

df.pca <- data.frame(sample = pca$sample.id,EV1 = pca$eigenvect[,1],EV2 = pca$eigenvect[,2],EV3 = pca$eigenvect[,3],stringsAsFactors = FALSE)

We’ll plot individuals along EV1, EV2, and EV3 in several two-dimensional projections. But we’ll want to see how the clustering done by PCA corresponds to individuals’ self-reported race; for that we’ll need another column in our data frame.

2.2.3 Getting the population labels

The indivs file contains each subject id along with its 1KGP population group. Let’s import it now:

indivs <- read.table("path/to/CHB+YRI+CEU.txt",header = FALSE)
colnames(indivs) <- c("id","pop")

The second field of this table is pop, an assignment to each id of one of three 1KG population groups. We want to match the right ID in indivs to the right ID in df.pca so that we can color our PCA plots by population. If the tables are in the same order, matching will be easy, but it not, we have to use the match(x,y) function, which finds the positions in y corresponding to the same items in x. Thus we can make a new column pop in df.pca with the corresponding pop values from indivs by

df.pca$pop[match(indivs$id,df.pca$sample)] <- indivs$pop # find the population group of each individual in df.pca

2.2.4 Plotting

Now that we have a column of population labels, we can make a scatter plot colored by treating the pop column as vector of factors; we can get the unique values of a factor vector by applying the function levels() to it. A plot of the second principal component vs. the first can then be generated by

par(mar = c(5.1,5.1,4.1,2.1) ) 
plot(df.pca$EV1,df.pca$EV2,pch = 19,col = factor(df.pca$pop),xlab = "PC1",ylab = "PC2",cex.lab = 1.5,cex.axis = 1.5,cex.main = 1.5) # plot of PC2 vs. PC1
legend("topright",legend = levels(factor(df.pca$pop)),bty = "n",pch = 19,col = factor(levels(factor(df.pca$pop))),pt.cex = 1,cex = 1.5,x.intersp = 0.2) # with a legend

Move the legend around if it covers any points, and make similar plots for the other two comparisons between the first three PCs.

Finally, close the connection to the gds file when you are done:

seqClose(genofile)

2.2.5 To turn in:

Make three (nicely formatted) plots of:

PC2 vs. PC1
PC3 vs. PC1
PC3 vs. PC2

For each plot, discuss:

Whether the populations appear to be well separated in PCA space
What the gradients of the different PCs represent, that is, what axis of variation each PC appears to explain
How to subset your df.pca data frame to isolate individuals of each population (i.e., provide code)

2.3 References

References

1. The 1000 Genomes Project Consortium, Auton A, Abecasis GR, Steering committee, Altshuler DM, Durbin RM, et al. A global reference for human genetic variation. Nature. 2015;526:68–74.

2. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–9.

3. Shlens J. A Tutorial on Principal Component Analysis. 2014.

4. Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28:3326–8.

5. Zheng X, Gogarten SM, Lawrence M, Stilp A, Conomos MP, Weir BS, et al. SeqArray—a storage-efficient high-performance data format for WGS variant calls. Bioinformatics. 2017;33:2251–7.