pcadapt has been developed to detect genetic markers involved in biological adaptation. pcadapt provides statistical tools for outlier detection based on Principal Component Analysis (PCA).
In the following, we show how the pcadapt package can perform genome scans for selection. The package contains one example dataset, geno3pops, which consists of simulated genotypes for 1,500 diploid markers. A total of 150 individuals from three different populations were genotyped. Simulations were performed with simuPOP (http://simupop.sourceforge.net/) using a divergence model, assuming that 200 SNPs confer a selective advantage.
To run the package, you need to install it and to load it using the following command lines:
install.packages("pcadapt")
library(pcadapt)
The geno3pops data can be loaded using the following command.
data <- read4pcadapt(x="geno3pops",option="example")
print(dim(data))
## [1] 150 1500
For your own dataset, also use the read4pcadapt function. The current version of read4pcadapt supports two formats. The first format assumes that the genotype data form a matrix with individuals in rows and genetic markers in columns. The second supported format is the standard .ped format. For a matrix file called “mydata” and available at “path_to_directory”, the command line should be:
data <- read4pcadapt(x="path_to_directory/mydata")
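Before calling read4pcadapt on a matrix file, it can help to confirm with base R that the file has the expected shape. The sketch below writes a tiny made-up genotype table to a temporary file and reads it back. The space-separated, header-free layout is an assumption made for illustration, not a statement about what read4pcadapt requires.

```r
# Toy genotype matrix: 3 individuals (rows) x 4 markers (columns), coded 0/1/2.
geno <- matrix(c(0, 1, 2, 0,
                 1, 1, 0, 2,
                 2, 0, 1, 1), nrow = 3, byrow = TRUE)

# Write it to a temporary file (hypothetical layout: space-separated, no header).
toy_file <- tempfile(fileext = ".txt")
write.table(geno, file = toy_file, row.names = FALSE, col.names = FALSE)

# Read it back with base R and check that rows are individuals and columns are markers.
geno_check <- as.matrix(read.table(toy_file))
dim(geno_check)
```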
If it is a .ped file, the argument x should contain only the name of the file (without the .ped extension) and the option “ped” should be specified (if you use this option, make sure the .map file is in the same directory). Assuming your file is called “mydata.ped” and is available at “path_to_directory”, the command line should be:
data <- read4pcadapt(x="path_to_directory/mydata",option="ped")
pcadapt
The pcadapt function performs two successive tasks. First, PCA is performed on the centered and scaled genotype matrix. Second, p-values are computed from the results of the PCA. Two kinds of genome scan are available. The first is based on the loadings, which are defined, up to a constant, as the correlations between the genetic markers and the principal components. The second uses the communality statistic, which is defined as the proportion of variance of a SNP explained by the first K PCs.
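The two quantities described above can be illustrated with base R alone. The sketch below runs a PCA on simulated genotypes through svd, computes the SNP-by-PC correlations, and derives the communality as the sum of squared correlations over the first K PCs. The simulated data and all variable names are assumptions for illustration, and this is not the package's internal code.

```r
set.seed(1)
n <- 60    # individuals
L <- 200   # genetic markers
K <- 2     # number of principal components retained

# Simulated 0/1/2 genotypes: column j is drawn with allele frequency freq[j].
freq <- runif(L, 0.2, 0.8)
G <- matrix(rbinom(n * L, size = 2, prob = rep(freq, each = n)), nrow = n)

# PCA on the centered and scaled genotype matrix.
Gs <- scale(G)
dec <- svd(Gs)
scores <- dec$u[, 1:K] %*% diag(dec$d[1:K])   # projections of the individuals
loadings <- dec$v[, 1:K]                      # one row per marker

# Correlation between each marker and each PC (L x K matrix).
r <- cor(Gs, scores)

# For each PC, the correlations equal the loadings times a constant.
max(abs(r - loadings %*% diag(dec$d[1:K]) / sqrt(n - 1)))

# Communality: proportion of variance of each SNP explained by the first K PCs.
communality <- rowSums(r^2)
range(communality)
```

The proportionality check shows why scanning on loadings and scanning on correlations rank the SNPs identically within a given PC.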
To run the function pcadapt, the user should specify the number K of principal components to work with. The parameter minmaf specifies a minor allele frequency threshold. P-values are not computed for SNPs whose minor allele frequency is below the threshold (NA is returned).
K <- 5
x <- pcadapt(data,K=K)
summary(x)
##                    Length Class      Mode
## scores              750   -none-     numeric
## loadings           7500   -none-     numeric
## correlations       7500   -none-     numeric
## singular_values       5   -none-     numeric
## pvalues               5   data.frame list
## maf                1500   -none-     numeric
## communality        1500   -none-     numeric
## neutral_sdev          5   -none-     numeric
## p                   100   -none-     numeric
## q                     5   data.frame list
## proportion_removed    5   -none-     numeric
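The effect of a minor allele frequency threshold can be sketched in base R as follows. The toy genotypes, the p-values, and the threshold value min_maf are made up for illustration; the code only mimics the behaviour described for minmaf and is not the package's implementation.

```r
# Toy 0/1/2 genotype matrix: 4 individuals (rows) x 3 SNPs (columns).
G <- matrix(c(0, 0, 0, 1,
              0, 1, 2, 1,
              2, 2, 2, 2), nrow = 4)

# Frequency of the counted allele, then minor allele frequency.
alt_freq <- colMeans(G) / 2
maf <- pmin(alt_freq, 1 - alt_freq)
maf

# Hypothetical threshold and made-up p-values: SNPs below the threshold get NA.
min_maf <- 0.2
pval <- c(0.03, 0.40, 0.01)
pval[maf < min_maf] <- NA
pval
```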
NB: another possibility is to read outputs from the software PCAdapt (http://membres-timc.imag.fr/Michael.Blum/PCAdapt.html), using the argument file instead of data. The argument file should be the name of the output provided by PCAdapt without any extension. This option should be used when the number of SNPs exceeds 100,000 because the PCAdapt software is much faster than the R package for larger datasets. Assuming your file is called “myoutput” and is available at “path_to_directory”, the command line should be:
x <- pcadapt(file="path_to_directory/myoutput",K=K)
The object x returned by the function pcadapt contains numerical quantities obtained after performing a PCA on the genotype matrix.
scores is a matrix corresponding to the projections of the individuals onto each PC.
loadings is a matrix containing the correlations between each genetic marker and each PC.
singular_values contains the ordered square roots of the proportions of variance explained by each PC.
pvalues is a data frame containing the p-values for the first K principal components.
maf is a vector containing the minor allele frequencies for each SNP.
communality, also denoted as the \(h^2\) statistic, contains the communality of each SNP, which corresponds to the proportion of its variance explained by the first K PCs. If there are K+1 populations, the ranking provided by the communality is the same as the ranking provided by the common \(F_{ST}\) statistic.
neutral_sdev contains the estimated standard deviations of the loadings for the neutral SNPs. For a given PC, if all the SNPs are neutral, the standard deviation should be equal to 1.
p is a vector that gives the proportions of the markers with the largest loadings that are removed. The kurtosis of the remaining loadings is computed for each value of p and each of the first K principal components.
q is a data frame with K columns. Each column of q contains the kurtosis of the truncated loading distribution for each cut-off provided by p.
proportion_removed is a list of size K corresponding to the proportions of markers to remove from the loading distributions to match the kurtosis of 3 expected for a Gaussian distribution.
All of these elements are accessible using the $ symbol.
K
An important parameter of the pcadapt function is the parameter K. A way to choose K is to use a scree plot, which displays the percentage of variance explained by each PC. The ideal pattern in a scree plot is a steep curve followed by a bend and an almost horizontal line. The recommended value of K corresponds to the largest value of K before the plateau is reached. In the provided example, K=2 is the optimal choice of K. By default, the number of principal components taken into account in the scree plot is set to K, but it can be reduced via the argument num_pc.
plot(x,option="screeplot",num_pc=4)
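The quantity shown in a scree plot can be reproduced with base R. The sketch below simulates three populations with diverged allele frequencies (a made-up scenario, unrelated to geno3pops) and computes the proportion of variance explained by each PC from the singular values. With three populations, the first two PCs are expected to stand out from the rest.

```r
set.seed(2)
n_per_pop <- 50
n_pop <- 3
L <- 300

# Each population gets its own allele frequencies around a shared baseline.
base_freq <- runif(L, 0.2, 0.8)
G <- do.call(rbind, lapply(seq_len(n_pop), function(k) {
  f <- pmin(pmax(base_freq + rnorm(L, sd = 0.15), 0.05), 0.95)
  matrix(rbinom(n_per_pop * L, size = 2, prob = rep(f, each = n_per_pop)),
         nrow = n_per_pop)
}))

# Drop monomorphic markers, which cannot be scaled.
G <- G[, apply(G, 2, sd) > 0]

# Proportion of variance explained by each PC.
d <- svd(scale(G))$d
pve <- d^2 / sum(d^2)
round(100 * pve[1:5], 1)

# Draw the scree plot only in an interactive session.
if (interactive()) {
  plot(pve[1:10], type = "b", xlab = "PC", ylab = "Proportion of explained variance")
}
```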
Another option of the plot.pcadapt method plots the projection of the individuals onto the specified principal components. When the subpopulations are known, individuals of the same subpopulation can be displayed with the same color by filling the subpop field with the list of indices of the different subpopulations. In our example, the first subpopulation is composed of the first 50 individuals, the second subpopulation of the next 50 individuals, and so on. If this field is left empty, the points will be displayed in black.
plot(x,option="scores",i=1,j=2,subpop=list(1:50,51:100,101:150))
By default, if the values of i and j are not specified, the projection is done onto the first two principal components.
The function getPopColors generates a list of indices of the different subpopulations based on a vector containing the population index of each individual. Assuming the file containing the population indices (as a single column) is called “mypop” and is available at “path_to_directory”, the command line should be:
mypop <- read.table(file="path_to_directory/mypop")
listpop <- getPopColors(mypop)
plot(x,option="scores",i=1,j=2,subpop=listpop)
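The kind of list expected by the subpop field can also be built with base R. The sketch below uses split on a made-up population vector. It is a stand-in that produces the same kind of structure as the list(1:50,51:100,101:150) example, not the implementation of getPopColors.

```r
# Made-up population labels, one per individual.
pop <- c(1, 1, 2, 3, 3, 2)

# One vector of individual indices per population.
listpop_base <- split(seq_along(pop), pop)
listpop_base
```

When the labels come from read.table, the result is a data frame, so the label vector would be its first column (for example mypop[[1]]).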
Two types of genome scan can be performed based on the loadings.
The first possibility (component-wise p-values) is to perform one genome scan for each principal component. The test statistics are the loadings, which correspond to the correlations between each PC and each SNP. The advantage of this approach is that each outlier is related to a particular PC, so outliers can, for instance, be compared across PCs. The main drawback is that it becomes cumbersome when the number of PCs is large. P-values are computed by making a Gaussian approximation for each PC and by estimating the standard deviation of the null distribution.
The second possibility is to use the communality statistic, which measures the proportion of variance explained by the first K PCs. It has the advantage of providing a unique genome scan. When there are K+1 populations, this approach is similar to the common \(F_{ST}\) statistic. P-values are computed by making a chi-square approximation.
By default, pcadapt computes the p-values for each principal component.
pval <- x$pvalues
summary(pval)
##        p1                  p2               p3
##  Min.   :0.0000459   Min.   :0.0000   Min.   :0.0001584
##  1st Qu.:0.2492524   1st Qu.:0.2375   1st Qu.:0.2507789
##  Median :0.4965266   Median :0.4888   Median :0.4998370
##  Mean   :0.4914543   Mean   :0.4888   Mean   :0.4993279
##  3rd Qu.:0.7403267   3rd Qu.:0.7482   3rd Qu.:0.7547545
##  Max.   :0.9989524   Max.   :1.0000   Max.   :0.9997539
##        p4                  p5
##  Min.   :0.0009528   Min.   :0.0002269
##  1st Qu.:0.2493065   1st Qu.:0.2529035
##  Median :0.5130839   Median :0.4942363
##  Mean   :0.5082326   Mean   :0.5038766
##  3rd Qu.:0.7718209   3rd Qu.:0.7618044
##  Max.   :0.9998370   Max.   :0.9997323
pcadapt returns K vectors of p-values (one for each principal component), all of them accessible using the $ symbol or the [] symbol. For example, typing pval$p3 or pval[,3] in the R console returns the p-values associated with the third principal component (provided that K is at least 3).
The p-values are computed based on the matrix of loadings. The loadings of the neutral markers are assumed to follow a centered Gaussian distribution. The standard deviation of the Gaussian distribution is estimated after removing a proportion of genetic markers with the largest loadings (in absolute value). The removal proportion is the smallest percentage such that the kurtosis of the truncated distribution of the loadings matches the kurtosis of a Gaussian distribution, which is equal to 3. The standard deviation of the loadings is finally estimated by maximum likelihood under a truncated Gaussian distribution.
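The procedure described above can be sketched in base R on simulated test statistics. Everything here is an assumption for illustration: the simulated values, the grid of removal proportions, the rule "first proportion at which the kurtosis drops to 3 or below", and the use of optimize for the truncated Gaussian likelihood. It is not the package's code.

```r
set.seed(3)
neutral <- rnorm(1400)                                           # neutral SNPs
selected <- sample(c(-1, 1), 100, replace = TRUE) * (6 + rexp(100))  # outliers
z <- c(neutral, selected)
n <- length(z)

kurtosis <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2

# Kurtosis after removing a proportion p of the largest |z|.
ord <- order(abs(z))                       # increasing absolute value
props <- seq(0, 0.5, by = 0.005)
kurt <- sapply(props, function(p) kurtosis(z[ord[seq_len(n - round(p * n))]]))

# Smallest removal proportion whose kurtosis is at most 3.
p_removed <- props[which(kurt <= 3)[1]]
kept <- z[ord[seq_len(n - round(p_removed * n))]]
cutoff <- max(abs(kept))

# Maximum likelihood for a centered Gaussian truncated to [-cutoff, cutoff].
loglik <- function(s) {
  sum(dnorm(kept, mean = 0, sd = s, log = TRUE)) -
    length(kept) * log(pnorm(cutoff / s) - pnorm(-cutoff / s))
}
sigma_hat <- optimize(loglik, interval = c(0.2, 5), maximum = TRUE)$maximum

# Two-sided Gaussian p-values for all SNPs.
pval_sketch <- 2 * pnorm(-abs(z) / sigma_hat)
c(p_removed = p_removed, sigma_hat = sigma_hat)
```

In this sketch, props and kurt play roles analogous to the p and q elements of a pcadapt object, and p_removed to one entry of proportion_removed.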
The test based on the communality statistic can be performed by setting communality_test to TRUE in the pcadapt function.
x_com <- pcadapt(data,K=2,communality_test = TRUE)
This function rescales the communality so that it follows a chi-square distribution, which makes it possible to compute a p-value for each SNP. The p-values are contained in the first column of the pvalues data frame (pval[,1], with pval taken from x_com).
The package provides routines to control the false discovery rate (FDR).
outlier returns a list of the outlier SNPs; the size of the list depends on the false discovery rate threshold. The arguments of the function outlier are the following:
y: an object of class pcadapt.
K: an integer specifying the principal component the user is interested in. Default value set to 1. NB: this field should be left empty in the case of a communality test.
threshold: a real value between 0 and 1 indicating the false discovery rate.
list_selection: this argument should only be provided in the case of simulated data. It is a vector of indices containing the list of all the SNPs under selection. A list of false discovery rates is then calculated for different thresholds. Default value is set to NULL.
y <- outlier(x,K=1,threshold=0.2)
summary(y)
##               Length Class      Mode
## rankedSNPs    1500   -none-     numeric
## adjusted_pval 1500   -none-     numeric
## outliers        65   -none-     numeric
## pvalues          5   data.frame list
rankedSNPs contains the list providing the indices of the SNPs after ranking the p-values in increasing order.
adjusted_pval contains a vector of adjusted p-values computed with the Benjamini-Hochberg correction.
outliers contains a vector of indices corresponding to the outlier SNPs after controlling the proportion of false discoveries.
pvalues contains a vector of p-values ranked in increasing order.
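The same kind of output can be obtained in base R with p.adjust. The sketch below applies the Benjamini-Hochberg correction to five made-up p-values and extracts the ranking and the outliers at a hypothetical threshold of 0.1. It illustrates the principle and is not the code of the outlier function.

```r
# Made-up p-values for five SNPs.
p <- c(0.001, 0.8, 0.01, 0.04, 0.5)
fdr_threshold <- 0.1   # hypothetical false discovery rate

# Indices of the SNPs ranked by increasing p-value.
ranked_snps <- order(p)

# Benjamini-Hochberg adjusted p-values, in the original SNP order.
adjusted <- p.adjust(p, method = "BH")

# SNPs declared outliers at the chosen false discovery rate.
outlier_idx <- which(adjusted < fdr_threshold)

ranked_snps
round(adjusted, 4)
outlier_idx
```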
The package also provides graphical displays such as Manhattan plots and Q-Q plots.
By default, the Manhattan plot displays the p-values of the genetic markers on a log scale. The argument threshold indicates the false discovery rate. If the argument threshold is specified, a horizontal line indicates the corresponding threshold for p-values. If it is not specified, no horizontal line is shown.
K <- 2 # Index of the principal component the user is interested in
plot(x,option="manhattan",K=K,threshold = 0.20)
In the case of a communality test, the K field can be left empty. The components taken into account in the communality test are the ones specified when x_com was computed (K=2 here).
plot(x_com,option="manhattan",threshold = 0.20)
The user can also check the distribution of the p-values using a Q-Q plot:
plot(x,option="qqplot",K=1,threshold=0.1)
The 10% smallest p-values lie to the right of the blue dotted line.
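The coordinates of such a Q-Q plot can be computed by hand in base R. The sketch below uses simulated uniform p-values and places the vertical line at the expected quantile for a proportion of 0.1. Both the plotting positions i/(n+1) and the position of the line are assumptions for illustration and may differ from what plot.pcadapt does.

```r
set.seed(4)
p_null <- runif(1000)   # simulated p-values under the null hypothesis
n <- length(p_null)

# Expected versus observed -log10(p-values), smallest p-value first.
expected <- -log10(seq_len(n) / (n + 1))
observed <- -log10(sort(p_null))

# Vertical line for the top 10% smallest p-values (assumed position).
top_fraction <- 0.1
cut <- -log10(top_fraction)
n_right <- sum(expected > cut)
n_right

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(expected, observed, xlab = "Expected -log10(p-values)",
       ylab = "Observed -log10(p-values)")
  abline(0, 1)
  abline(v = cut, col = "blue", lty = 3)
}
```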
When considering p-values obtained with the communality test, the argument K is not needed.
To compute p-values for each PC, we assume that the loadings of the neutral SNPs follow a normal distribution. The standard deviation of the Gaussian distribution is estimated after removing a proportion of genetic markers with the largest loadings (in absolute values).
To check the Gaussian assumption, we can display the histogram of loadings and superimpose their expected distribution for neutral SNPs.
plot(x,option="neutral",K=2)
In the case of a simulated dataset where the true outliers are known, the user can compute the proportion of false discoveries for different levels of expected false discovery rate. To do so, the user should fill the list_selection field (which contains the indices of all the true outliers) in the outlier function; it is set to NULL by default:
y <- outlier(x_com,list_selection = 1:200)
plot(y,option="fdr")
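The comparison between expected and observed false discovery rates can be sketched in base R. The helper observed_fdp below is hypothetical (it is not part of the package). It applies the Benjamini-Hochberg correction to made-up p-values and reports the fraction of discoveries that are not in the list of true outliers.

```r
# Hypothetical helper: observed proportion of false discoveries at a given
# expected false discovery rate, when the truly selected SNPs are known.
observed_fdp <- function(p, truth, fdr) {
  discoveries <- which(p.adjust(p, method = "BH") < fdr)
  if (length(discoveries) == 0) return(0)
  mean(!(discoveries %in% truth))
}

# Made-up p-values for ten SNPs; SNPs 1 to 3 are the true outliers.
p <- c(0.001, 0.002, 0.2, 0.003, 0.6, 0.9, 0.7, 0.8, 0.5, 0.95)
truth <- 1:3

observed_fdp(p, truth, fdr = 0.05)   # SNPs 1, 2 and 4 are discovered
observed_fdp(p, truth, fdr = 0.6)    # SNP 3 is discovered as well
```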
The following figure is obtained using the p and q elements contained in an object of class pcadapt. They are used to estimate the standard deviations of the loading distribution of the neutral SNPs. In this example, \(3.99\%\) of the SNPs with the largest loadings in absolute value are removed to estimate the standard deviation of SNPs under neutrality.
plot(x,option="kurtosis",K=1)