pcadapt has been developed to detect genetic markers involved in biological adaptation. pcadapt provides statistical tools for outlier detection based on Principal Component Analysis (PCA).

In the following, we show how the pcadapt package can perform genome scans for selection. The package contains one example dataset: geno3pops. The data contain simulated genotypes for 1,500 diploid markers. A total of 150 individuals from three different populations were genotyped. Simulations were performed with simuPOP (http://simupop.sourceforge.net/) using a divergence model, assuming that 200 SNPs confer a selective advantage.

To use the package, install it and load it with the following commands:

install.packages("pcadapt")
library(pcadapt)

1. Reading the genotype data

The geno3pops data can be loaded using the following command.

data <- read4pcadapt(x="geno3pops",option="example")
print(dim(data))
## [1]  150 1500

For your own dataset, also use the read4pcadapt function. The current version of read4pcadapt supports two formats. The first format assumes that the genotype data form a matrix with individuals in rows and genetic markers in columns. The second supported format is the standard .ped format. Assuming your file is called “mydata” and is available at “path_to_directory”, the command is:

data <- read4pcadapt(x="path_to_directory/mydata")
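
As an illustration of the first format, the sketch below writes a small simulated genotype matrix (individuals in rows, markers in columns) to a temporary file and reads it back with base R. The 0/1/2 allele-count coding, the whitespace delimiter, and the names geno, geno_file and geno_back are assumptions made for this example; they are not taken from read4pcadapt.

```r
set.seed(42)
n_ind <- 6   # individuals (rows)
n_snp <- 4   # markers (columns)

# Simulated diploid genotypes coded as allele counts 0, 1 or 2 (assumed coding)
geno <- matrix(rbinom(n_ind * n_snp, size = 2, prob = 0.3),
               nrow = n_ind, ncol = n_snp)

# Write the matrix as plain whitespace-separated text, without headers
geno_file <- tempfile("mydata")
write.table(geno, file = geno_file, row.names = FALSE, col.names = FALSE)

# Read it back and check that the layout is individuals x markers
geno_back <- as.matrix(read.table(geno_file))
dimnames(geno_back) <- NULL
print(dim(geno_back))
## [1] 6 4
```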

If it is a .ped file, the argument x should contain only the name of the file (without the .ped extension) and the option “ped” should be specified (if you use this option, make sure the .map file is in the same directory). Assuming your file is called “mydata.ped” and is available at “path_to_directory”, the command is:

data <- read4pcadapt(x="path_to_directory/mydata",option="ped")

2. Performing a genome scan with pcadapt

The pcadapt function performs two successive tasks. First, PCA is performed on the centered and scaled genotype matrix. The second stage consists of computing p-values based on the principal component analysis. Two kinds of genome scan are available. The first is based on the loadings, which are defined, up to a constant, as the correlations between the genetic markers and the principal components. The second uses the communality statistic, which is defined as the proportion of variance of a SNP explained by the first K PCs.
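
The two statistics described above can be sketched with base R only. This is a minimal illustration on simulated data, not the package's internal code: the simulated matrix geno and the names geno_std, sv, scores, correlations and communality are assumptions introduced for the example.

```r
set.seed(1)
n_ind <- 150
n_snp <- 300
K <- 2

# Simulated allele counts (0, 1 or 2) with marker-specific allele frequencies
freq <- runif(n_snp, 0.1, 0.9)
geno <- sapply(freq, function(f) rbinom(n_ind, size = 2, prob = f))

# PCA on the centered and scaled genotype matrix, via a singular value decomposition
geno_std <- scale(geno, center = TRUE, scale = TRUE)
sv <- svd(geno_std, nu = K, nv = K)

# Coordinates of the individuals on the first K principal components
scores <- sv$u %*% diag(sv$d[1:K], K, K)

# Correlations between each SNP and each PC (n_snp x K matrix);
# they equal the loadings sv$v scaled by d / sqrt(n_ind - 1)
correlations <- cor(geno, scores)

# Communality: proportion of variance of each SNP explained by the first K PCs
communality <- rowSums(correlations^2)
```

Because the principal components are uncorrelated, the sum of squared correlations over the first K PCs is the R-squared of the regression of a SNP on those PCs, which is why it can be read as a proportion of explained variance.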

To run the function pcadapt, the user should specify the number K of principal components to work with. The parameter minmaf specifies a threshold of minor allele frequency. P-values for SNPs with a smaller minor allele frequency than the threshold are not computed (NA is returned).
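
The effect of a minor allele frequency threshold can be sketched as follows. The 0/1/2 allele-count coding, the threshold value 0.05, and the names maf, min_maf and pval are assumptions chosen for illustration; the placeholder p-values are random numbers, not the output of pcadapt.

```r
set.seed(2)
n_ind <- 150

# Five simulated markers, some of them with a rare allele
freq <- c(0.01, 0.03, 0.2, 0.5, 0.97)
geno <- sapply(freq, function(f) rbinom(n_ind, size = 2, prob = f))

# Frequency of the counted allele, then of the minor allele
allele_freq <- colMeans(geno) / 2
maf <- pmin(allele_freq, 1 - allele_freq)

# Hypothetical threshold: p-values of markers below it are set to NA
min_maf <- 0.05
pval <- runif(length(maf))   # placeholder p-values for the illustration
pval[maf < min_maf] <- NA
print(data.frame(maf = round(maf, 3), pval_is_NA = is.na(pval)))
```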

K <- 5 
x <- pcadapt(data,K=K)
summary(x)
##                    Length Class      Mode   
## scores              750   -none-     numeric
## loadings           7500   -none-     numeric
## correlations       7500   -none-     numeric
## singular_values       5   -none-     numeric
## pvalues               5   data.frame list   
## maf                1500   -none-     numeric
## communality        1500   -none-     numeric
## neutral_sdev          5   -none-     numeric
## p                   100   -none-     numeric
## q                     5   data.frame list   
## proportion_removed    5   -none-     numeric

NB: another possibility is to read outputs from the software PCAdapt (http://membres-timc.imag.fr/Michael.Blum/PCAdapt.html), using the argument file instead of data. The argument file should be the name of the output provided by PCAdapt without any extension. This option should be used when the number of SNPs is higher than 100,000, because the PCAdapt software is much faster than the R package for larger datasets. Assuming your file is called “myoutput” and is available at “path_to_directory”, the command is:

x <- pcadapt(file="path_to_directory/myoutput",K=K)

The object x returned by the function pcadapt contains numerical quantities obtained after performing a PCA on the genotype matrix.

All of these elements are accessible using the $ symbol.

Choice of K

An important parameter of the pcadapt function is the parameter K. One way to choose K is to use a scree plot, which displays the percentage of variance explained by each PC. The ideal pattern in a scree plot is a steep curve followed by a bend and an almost horizontal line. The recommended value of K corresponds to the largest value of K before the plateau is reached. In the provided example, K=2 is the optimal choice of K. By default, the number of principal components taken into account in the scree plot is set to K, but it can be reduced via the argument num_pc.

plot(x,option="screeplot",num_pc=4)
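
The quantity shown in a scree plot can be computed by hand from the singular values of the scaled genotype matrix. The sketch below uses simulated data with three groups of individuals; the simulation and the names pop, freq, geno and prop_var are assumptions made for the example, and the plot is only drawn in an interactive session.

```r
set.seed(3)
n_ind <- 150
n_snp <- 300

# Three groups of 50 individuals with independent allele frequencies,
# which creates population structure along two axes
pop <- rep(1:3, each = 50)
freq <- matrix(runif(3 * n_snp, 0.1, 0.9), nrow = 3)
geno <- matrix(rbinom(n_ind * n_snp, size = 2, prob = freq[pop, ]), nrow = n_ind)

# Proportion of variance explained by each PC
d <- svd(scale(geno))$d
prop_var <- d^2 / sum(d^2)

# Keep only the first num_pc components for the display
num_pc <- 4
print(round(prop_var[1:num_pc], 3))
if (interactive()) {
  plot(prop_var[1:num_pc], type = "b",
       xlab = "PC", ylab = "Proportion of explained variance")
}
```

With three simulated groups, the first two proportions stand well above the following ones, which is the "steep curve followed by a plateau" pattern that points to K=2.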

Visualization of the scores

Another available option in the plot.pcadapt method plots the projection of the individuals onto the specified principal components. In the case where the subpopulations are known, individuals of the same subpopulations can be displayed with the same color when the subpop field is filled with the list of indices of the different subpopulations. In our example, the first subpopulation is composed of the first 50 individuals, the second subpopulation of the next 50 individuals, and so on. If this field is left empty, the points will be displayed in black.

plot(x,option="scores",i=1,j=2,subpop=list(1:50,51:100,101:150))

By default, if the values of i and j are not specified, the projection is done onto the first two principal components.

The function getPopColors generates a list of indices of the different subpopulations based on a vector containing the population index of each individual. Assuming the file containing the population indices (as a single column) is called “mypop” and is available at “path_to_directory”, the command is:

mypop <- read.table(file="path_to_directory/mypop")
listpop <- getPopColors(mypop)
plot(x,option="scores",i=1,j=2,subpop=listpop)
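
A list with the same shape as the subpop argument can also be built with base R. The sketch below is not the implementation of getPopColors; it only shows, on a hypothetical label vector pop, how per-individual labels can be turned into a list of index vectors.

```r
# Hypothetical population labels, one per individual
pop <- rep(c("A", "B", "C"), each = 50)

# Group the individual indices by label; the result is a list of index vectors
listpop <- unname(split(seq_along(pop), pop))

print(sapply(listpop, range))
##      [,1] [,2] [,3]
## [1,]    1   51  101
## [2,]   50  100  150
```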

3. Computation of p-values

Two types of genome scan can be performed based on the loadings.

The first possibility (component-wise p-values) is to perform one genome scan for each principal component. The test statistics are the loadings, which correspond to the correlations between each PC and each SNP. The advantage of this approach is that each outlier is related to a particular PC, so that, for instance, outliers can be compared across PCs. The main drawback is that it becomes cumbersome when the number of PCs is large. P-values are computed by making a Gaussian approximation for each PC and by estimating the standard deviation of the null distribution.

The second possibility is to use the communality statistic, which measures the proportion of variance explained by the first K PCs. It has the advantage of providing a unique genome scan. When there are K+1 populations, this approach is similar to the common Fst statistic. P-values are computed by making a chi-square approximation.

3.1. Component-wise p-values

By default, pcadapt computes the p-values for each principal component.

pval <- x$pvalues
summary(pval)
##        p1                  p2               p3           
##  Min.   :0.0000459   Min.   :0.0000   Min.   :0.0001584  
##  1st Qu.:0.2492524   1st Qu.:0.2375   1st Qu.:0.2507789  
##  Median :0.4965266   Median :0.4888   Median :0.4998370  
##  Mean   :0.4914543   Mean   :0.4888   Mean   :0.4993279  
##  3rd Qu.:0.7403267   3rd Qu.:0.7482   3rd Qu.:0.7547545  
##  Max.   :0.9989524   Max.   :1.0000   Max.   :0.9997539  
##        p4                  p5           
##  Min.   :0.0009528   Min.   :0.0002269  
##  1st Qu.:0.2493065   1st Qu.:0.2529035  
##  Median :0.5130839   Median :0.4942363  
##  Mean   :0.5082326   Mean   :0.5038766  
##  3rd Qu.:0.7718209   3rd Qu.:0.7618044  
##  Max.   :0.9998370   Max.   :0.9997323

pcadapt returns K vectors of p-values (one for each principal component), all of which are accessible using the $ symbol or the [] symbol. For example, typing pval$p3 or pval[,3] in the R console returns the vector of p-values associated with the third principal component (provided that K is at least 3).

The p-values are computed based on the matrix of loadings. The loadings of the neutral markers are assumed to follow a centered Gaussian distribution. The standard deviation of the Gaussian distribution is estimated after removing a proportion of genetic markers with the largest loadings (in absolute value). The removal proportion is the smallest percentage such that the kurtosis of the truncated distribution of the loadings matches the kurtosis of a Gaussian distribution, which is equal to 3. The standard deviation of the loadings is finally estimated by maximum likelihood under a truncated Gaussian distribution.
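
The procedure described above can be sketched in base R as follows. This is one possible reading of the description, not the package's code: the simulated loadings z, the stopping rule "kurtosis at most 3", the use of optimize for the likelihood, and all variable names are assumptions.

```r
set.seed(4)
# Simulated loadings: 1300 neutral markers and 200 markers with inflated variance
z <- c(rnorm(1300, sd = 0.05), rnorm(200, sd = 0.25))
n <- length(z)

kurtosis <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2

# Remove the largest |loadings| one by one until the kurtosis drops to 3
ord <- order(abs(z))          # indices sorted by increasing absolute value
keep_n <- n
while (keep_n > 10 && kurtosis(z[ord[seq_len(keep_n)]]) > 3) {
  keep_n <- keep_n - 1
}
kept <- z[ord[seq_len(keep_n)]]
proportion_removed <- 1 - keep_n / n

# Maximum likelihood for a centered Gaussian truncated to [-cutoff, cutoff]
cutoff <- max(abs(kept))
negloglik <- function(s) {
  -(sum(dnorm(kept, mean = 0, sd = s, log = TRUE)) -
      keep_n * log(2 * pnorm(cutoff / s) - 1))
}
sd_hat <- optimize(negloglik, interval = c(sd(kept) / 10, sd(kept) * 10))$minimum

# Two-sided Gaussian p-values for all markers, using the estimated null sd
pval <- 2 * pnorm(-abs(z) / sd_hat)
```

Estimating the standard deviation on the truncated sample, rather than on all loadings, keeps the markers with extreme loadings from inflating the null distribution.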

3.2. Communality test

The test based on the communality statistic can be performed by setting communality_test to TRUE in the pcadapt function.

x_com <- pcadapt(data,K=2,communality_test = TRUE)

This function rescales the communality so that it follows a chi-square distribution, which makes it possible to compute a p-value for each SNP. The p-values are contained in the first column of x_com$pvalues (x_com$pvalues[,1]).
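
The rescaling used by the package is not detailed here. As a generic illustration of a chi-square approximation, the sketch below relies on a standard fact: the sum of K squared independent standard Gaussian statistics follows a chi-square distribution with K degrees of freedom under the null. The simulated z-scores, the shifted "outlier" markers and the names stat and pval are assumptions.

```r
set.seed(5)
K <- 2
n_snp <- 1500

# Hypothetical standardized statistics, one column per principal component
z <- matrix(rnorm(n_snp * K), ncol = K)
z[1:20, ] <- z[1:20, ] + 4        # 20 simulated outlier markers

# One statistic per SNP, then upper-tail chi-square p-values with K degrees of freedom
stat <- rowSums(z^2)
pval <- pchisq(stat, df = K, lower.tail = FALSE)
```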

4. Control of False Discovery Rate (FDR)

The package provides routines to control the false discovery rate (FDR).

4.1. List of outliers

outlier returns a list of the outlier SNPs; the size of the list depends on the false discovery rate threshold. The arguments of the function outlier are the following:

  • y an object of class pcadapt.

  • K an integer specifying the principal component the user is interested in. Default value set to 1. NB: this field should be left empty in the case of a communality test.

  • threshold a real value between 0 and 1 indicating the false discovery rate.

  • list_selection this argument should only be provided in the case of simulated data. It is a vector of indices containing the list of all the SNPs under selection. A list of false discovery rates then is calculated for different thresholds. Default value is set to NULL.

y <- outlier(x,K=1,threshold=0.2)
summary(y)
##               Length Class      Mode   
## rankedSNPs    1500   -none-     numeric
## adjusted_pval 1500   -none-     numeric
## outliers        65   -none-     numeric
## pvalues          5   data.frame list

  • rankedSNPs contains the list providing the indices of the SNPs after ranking the p-values in increasing order.

  • adjusted_pval contains a vector of adjusted p-values computed with Benjamini-Hochberg correction.

  • outliers contains a vector of indices corresponding to the outlier SNPs after controlling the proportion of false discoveries.

  • pvalues contains a vector of p-values ranked in increasing order.
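
The quantities listed above can be reproduced on any vector of p-values with base R, using p.adjust for the Benjamini-Hochberg correction. The simulated p-values and the comparison "adjusted p-value at most threshold" are assumptions; the internal code of outlier may differ in its details.

```r
set.seed(6)
# Simulated p-values: 1400 null markers and 100 markers with small p-values
pval <- c(runif(1400), rbeta(100, 1, 200))
threshold <- 0.2

# Indices of the SNPs ranked by increasing p-value
rankedSNPs <- order(pval)

# Benjamini-Hochberg adjusted p-values
adjusted_pval <- p.adjust(pval, method = "BH")

# SNPs called outliers at the chosen false discovery rate
outliers <- which(adjusted_pval <= threshold)
print(length(outliers))
```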

The package also allows graphical displays such as Manhattan plot and Q-Q plot.

4.2. Manhattan Plot

By default, the Manhattan plot displays the p-values (on a log scale) of the genetic markers. The argument threshold indicates the false discovery rate. If the argument threshold is specified, a horizontal bar indicates the corresponding threshold for p-values. If it is not specified, no horizontal bar is shown.

K <- 2 # Index of the principal component the user is interested in
plot(x,option="manhattan",K=K,threshold = 0.20)

In the case of a communality test, the K field can be left empty. The components taken into account in the communality test are the ones specified when x_com has been computed (K=2 here).

plot(x_com,option="manhattan",threshold = 0.20)
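
The p-value threshold that corresponds to a given false discovery rate can be sketched as the largest p-value still called significant after a Benjamini-Hochberg correction. This is an assumption about how the horizontal bar could be positioned, shown on simulated p-values; the plot is only drawn in an interactive session.

```r
set.seed(9)
pval <- c(runif(1400), rbeta(100, 1, 200))   # simulated p-values
threshold <- 0.2

# Markers significant at the chosen false discovery rate
adjusted_pval <- p.adjust(pval, method = "BH")
significant <- adjusted_pval <= threshold

# Largest significant p-value: every marker at or below it is significant
p_cutoff <- if (any(significant)) max(pval[significant]) else NA

if (interactive()) {
  plot(-log10(pval), pch = 19, cex = 0.5,
       xlab = "SNP index", ylab = "-log10(p-value)")
  abline(h = -log10(p_cutoff), lty = 2)
}
```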

4.3. Q-Q Plot

The user can also check the distribution of the p-values using a Q-Q plot:

plot(x,option="qqplot",K=1,threshold=0.1)

The points to the right of the blue dotted line correspond to the 10% smallest p-values.
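
The coordinates of such a Q-Q plot can be computed by hand by comparing the sorted p-values with the quantiles expected under a uniform distribution. The simulated p-values and the position of the vertical line at -log10(threshold) are assumptions made for this sketch.

```r
set.seed(7)
pval <- c(runif(1350), rbeta(150, 1, 50))    # simulated p-values
n <- length(pval)
threshold <- 0.1

# Observed and expected -log10(p-values), both in decreasing order
observed <- -log10(sort(pval))
expected <- -log10(ppoints(n))

# Number of points lying to the right of the vertical line
n_right <- sum(expected > -log10(threshold))
print(n_right / n)
## [1] 0.1

if (interactive()) {
  plot(expected, observed, pch = 19, cex = 0.5,
       xlab = "Expected -log10(p-value)", ylab = "Observed -log10(p-value)")
  abline(0, 1)
  abline(v = -log10(threshold), lty = 3, col = "blue")
}
```

Points lying above the diagonal in the right part of the plot indicate an excess of small p-values compared with the uniform expectation.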

When considering p-values obtained with the communality test, the argument K is not needed.

5. Additional features

5.1. Estimation of standard deviation

To compute p-values for each PC, we assume that the loadings of the neutral SNPs follow a normal distribution. The standard deviation of the Gaussian distribution is estimated after removing a proportion of genetic markers with the largest loadings (in absolute values).

To check the Gaussian assumption, we can display the histogram of loadings and superimpose their expected distribution for neutral SNPs.

plot(x,option="neutral",K=2)

5.2. Simulations and False Discovery Rate

In the case of a simulated dataset where the true outliers are known, the user can compute the proportion of false discoveries for different levels of expected false discovery rate. To do so, the user should fill the list_selection field of the outlier function (it contains the indices of all the true outliers and is set to NULL by default):

y <- outlier(x_com,list_selection = 1:200)
plot(y,option="fdr")
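
The comparison between expected and observed false discovery rates can be sketched with base R. The simulated p-values, the grid of expected rates and the use of a Benjamini-Hochberg correction are assumptions; the sketch does not reproduce the internal computations of outlier.

```r
set.seed(8)
n_snp <- 1500
list_selection <- 1:200                      # indices of the true outliers

# Simulated p-values: uniform for neutral SNPs, small for SNPs under selection
pval <- runif(n_snp)
pval[list_selection] <- rbeta(200, 1, 100)
adjusted_pval <- p.adjust(pval, method = "BH")

# For each expected FDR, proportion of called SNPs that are not true outliers
expected_fdr <- seq(0.05, 0.5, by = 0.05)
observed_fdr <- sapply(expected_fdr, function(a) {
  called <- which(adjusted_pval <= a)
  if (length(called) == 0) return(NA_real_)
  mean(!(called %in% list_selection))
})
print(round(cbind(expected_fdr, observed_fdr), 3))
```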

5.3. Proportion of observations removed for standard deviation estimation

The following figure is obtained using the p and q data frames contained in an object of class pcadapt. They are used to estimate the standard deviations of the loadings distribution of the neutral SNPs. In this example, 3.99% of the SNPs with the largest loadings in absolute value are removed to estimate the standard deviation of SNPs under neutrality.

plot(x,option="kurtosis",K=1)