calc_ehh()
, calc_ehhs()
and scan_hh()
functionsihh2ihs()
, ies2rsb()
and ies2xpehh()
functionsbifurcation.diagram()
functionThis vignette describes how the R package REHH can be applied to perform whole genome scans for footprints of selection using statistics related to the Extended Haplotype Homozygosity (EHH) (Sabeti et al. 2002).
The current implementation of the package needs biallelic genetic markers as input. Typically, albeit not necessarily, these will be SNPs.
The rehh package is currently available for most platforms (Linux, MS Windows and MacOSX) from the CRAN repository https://cran.r-project.org/ and may be installed using a standard procedure. Once the package has been successfully installed on your system, it can be loaded by:
library(rehh)
The package rehh requires as input:
Important Note: For a given chromosome, SNPs are assumed to be ordered in the same way in the haplotype and SNP information files.
For illustration purposes, example files that originate from a previously published study on the Creole cattle breed from Guadeloupe (CGU) (Gautier and Naves 2011) are provided in the package and can be copied in the working directory with the command:
make.example.files()
Throughout this vignette, this command is assumed to have been run so that the example files are in the working directory.
Three haplotype input file formats are supported:
bta12_cgu.hap
containing 280 haplotypes (identifier 1 to 280) with 1424 SNPs each (see section 1.3.1).bta12_cgu.thap
. This format is similar to the one produced by the phasing program SHAPEIT2 (O’Connell et al. 2014) (see section 1.3.2).bta12_hapguess_switch.out
example file. Note that this file format allows to include haplotypes from several populations (i.e., if the -u fastPHASE option was used) (see section 1.3.3).By default alleles are assumed to be coded as 0 (missing data), 1 (ancestral allele) or 2 (derived allele). Other encodings can be recoded into this format, using the SNP information data file (see section 1.2) and the recode.allele
option of the function data2haplohh()
(see section 1.3).
The SNP information file should contain columns without header as in the map.inp
example:
head(read.table("map.inp"))
> V1 V2 V3 V4 V5
> 1 F0100190 1 113642 T A
> 2 F0100220 1 244699 C G
> 3 F0100250 1 369419 G C
> 4 F0100270 1 447278 A T
> 5 F0100280 1 487654 T A
> 6 F0100290 1 524507 C G
For each SNP the five columns correspond to:
The fourth and fifth columns (allele coding) should be always filled in, although the corresponding information is only of relevance when the recode.allele
option of the function data2haplohh()
is used (see section 1.3). In that case, the allele encoding in the haplotype file will be replaced by either 1 (ancestral) if the allele is found in the fourth column of the info file, 2 (derived) if it is found on the fifth column and 0 (missing data) otherwise. The information about ancestral or derived status is relevant only for within population tests (based on iHS). If one is interested exclusively in cross-population tests (based on Rsb or XP-EHH), assignment of the two alleles to the fourth and fifth column may be done at random.
The data2haplohh()
function converts data files into an R object of class haplohh
subsequently used by several functions of the rehh package. The following main options are available to recode alleles or select SNPs (based on Minor Allele Frequency or percentage of missing data) and haplotypes (based on percentage of missing data):
recode.allele=TRUE
and in this case recodes the haplotype. The allele coding of the haplotype file is replaced by 0 (missing data), 1 (ancestral allele) or 2 (derived allele) according to the whether the allele coding is given at the fourth or fifth column in the corresponding row of the SNP information file.min_perc_geno.hap
of SNPs genotyped. By default min_perc_geno.hap=100
meaning that only completely phased haplotypes are retained.min_perc_geno.snp
of haplotypes. By default min_perc_geno.snp=100
meaning that only fully genotyped SNPs are retained.min_maf
. This is generally not recommended and by default min_maf=0
meaning that all SNPs are retained.The arguments min_perc_geno.hap
, min_perc_geno.snp
and min_maf
are evaluated in this order. More details about the different arguments of the function are available in the documentation accessible using the command:
?data2haplohh
In this example, the haplotype input file bta12_cgu.hap
(in a standard haplotype format) and SNP information input file map.inp
are converted into an haplohh
object named hap
. Because the map file contains information for SNPs mapping to multiple chromosomes we have to specify that the haplotype input file is about chromosome 12 by setting the option chr.name=12
. Allele recoding is activated (recode.allele=TRUE
) to recode alleles given in the haplotype file as nucleotides to 0, 1 or 2.
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
If no value is specified for the chr.name
argument and more than one chromosome is detected in the map file, the function asks interactively which chromosome to choose:
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE)
> More than one chromosome name in Map file: map.inp
> Which chromosome should be considered among:
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
> 1:
12
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
Finally, as an example of an error message, the following message is prompted if the number of SNPs for the chromosome in the info file does not correspond to the one in the haplotype file (for instance when a wrong chromosome is specified):
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE,chr.name=18)
> Map file seems OK: 1123 SNPs declared for chromosome 18
> Standard rehh input file assumed
> The number of snp in the haplotypes 1424 is not equal
> to the number of snps declared in the map file 1123
> Error in data2haplohh(hap_file = "bta12_cgu.hap", map_file = "map.inp", : Conversion stopped
In this example, the haplotype input file bta12_cgu.thap
(transposed format) and the SNP information input file map.inp
are converted to an haplohh
object named hap
. Setting haplotype.in.columns=TRUE
informs the function that the haplotype file is in transposed format:
hap<-data2haplohh(hap_file="bta12_cgu.thap",map_file="map.inp",haplotype.in.columns=TRUE,
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Haplotype are in columns with no header
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
In this example, the fastPHASE output file bta12_hapguess_switch.out
and the SNP information input file map.inp
are
converted into a haplohh
object named hap
. As explained above we use the option chr.name=12
.
Because haplotypes originated here from several populations (the -u fastPHASE option was used), we specify the population of interest
(in our example the 280 haplotypes from the CGU population, see above) using the option popsel=7
(7 corresponding to the code of CGU in the example fastPHASE input file).
hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
recode.allele=TRUE,popsel=7,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from 8 different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
If no value is specified for the popsel
argument and more than one population is detected in the fastPHASE output file, the function asks interactively which population to chose:
hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from 8 different populations in the fastPhase output file
> Chosen pop. is not in the list of pop. number: 1 2 3 4 5 6 7 8
> Which population should be considered among: 1 2 3 4 5 6 7 8
> 1:
7
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from 8 different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
calc_ehh()
, calc_ehhs()
and scan_hh()
functionsFor a given core allele (either ancestral or derived) at a focal SNP, the (allele–specific) extended haplotype homozygosity (EHH) is defined as the probability that two randomly chosen chromosomes (carrying the core allele considered) are identical by descent (as assayed by homozygosity at all SNPs) over a given surrounding chromosomal region (Sabeti et al. 2002). The EHH aims at measuring to which extent an extended haplotype is transmitted without mutation and recombination. It is computed for a given core allele \(a\) (ancestral or derived) of a focal SNP \(s\) over the chromosomal stretch extending to some SNP \(t\): \[\begin{equation} \mathrm{EHH}_{s,t}^a=\frac{1}{n_{a}(n_a-1)}\sum\limits_{k=1}^{K^a_{s,t}}n_k(n_k-1) \tag{2.1} \end{equation}\] where \(n_a\) represents the number of chromosomes carrying the core allele \(a\), \(K^a_{s,t}\) represents the number of different extended haplotypes that can be discerned among these chromosomes from SNP \(s\) to SNP \(t\), and \(n_k\) refers to the number of chromosomes pertaining to the \(k\)-th such extended haplotype (it yields \(n_a=\sum\limits_{k=1}^{K^a_{s,t}}n_k\)).
By definition, irrespective of the allele considered, EHH starts at 1, and decays monotonically to 0 with increasing distance from the focal SNP. For a given core allele, the integrated EHH (iHH) is defined as the area under the EHH curve with respect to map position (Voight et al. 2006). In rehh, iHH is computed using the trapezoid method. In practice, the integral is often truncated if the value of EHH reaches a certain lower threshold, e.g. 0.05.
An extended homozygosity can be defined for the whole set of chromosomes of a sample. In this case, the quantity is aimed to reflect the probability that any two randomly chosen chromosomes are identical by descent over a given surrounding chromosomal region of a focal SNP. In contrast to the allele-specific EHH defined above, the chromosomes are not devided with respect to their allele at the focal SNP. In order to distinguish this quantity from that defined in the previous section, we adopt the naming by (Tang et al. 2007) as site–specific EHH, abbreviated by EHHS. Note, however that this quantity is sometimes referred to as EHH, too, and there is no agreed notation in the literature.
EHHS was used in genome scans in two versions: un-normalized by (Sabeti et al. 2007) and normalized by (Tang et al. 2007).
In line with (Sabeti et al. 2007) we define \[\begin{equation} \tag{2.2} \mathrm{EHHS}^{\text{Sab}}_{s,t}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{s,t}}n_k(n_k-1)\right) \end{equation}\] where we re-use notation from above and let \(n_s\) refer to the number of chromosomes at SNP \(s\). If there are no missing values at that SNP, this is simply the number of chromosomes in the sample.
(Tang et al. 2007) proposed an apparently different estimator for the normalized EHHS, namely \[\begin{equation} \tag{2.3} \mathrm{EHHS}^{\text{Tang}}_{s,t}=\frac{1-h_{s,t}}{1-h_s} \end{equation}\] where:
\(h_s=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,s}}n_k^2 \right )\right)=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(n_{a1}^2+n_{a2}^2\right) \right)\) is an estimator of the focal SNP heterozygosity with \(a1\) and \(a2\) referring to the numbers of the two alleles at SNP \(s\) (\(n_{a1}+n_{a2}=n_s\)).
\(h_{s,t}=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,t}}n_k^2 \right )\right)\) is an estimator of haplotype heterozygosity across the chromosome region extending from SNP \(s\) to SNP \(t\).
However both definitions are in fact equivalent, because it holds \(\mathrm{EHHS}^{\text{Sab}}_{s,t}=1-h_{s,t}\) and hence \[\begin{equation*} \mathrm{EHHS}^{\text{Tang}}_{s,t}=\frac{\mathrm{EHHS}^{\text{Sab}}_{s,t}}{\mathrm{EHHS}^{\text{Sab}}_{s,s}}\;. \end{equation*}\] Thus \(\mathrm{EHHS}^{\text{Tang}}_{s,t}\) is just normalized in order to yield 1 at the focal SNP \(s\). Note that the normalization factor depends on the frequency of the two alleles at the focal SNP and consequently is not constant over the whole data set.
Furthermore, we note that EHHS and EHH are related by \[\begin{equation*} \mathrm{EHHS}^{\text{Sab}}_{s,t}=\frac{n_{a1}(n_{a1}-1)}{n_s(n_s-1)}\mathrm{EHH}_{s,t}^{a1}+\frac{n_{a2}(n_{a2}-1)}{n_s(n_s-1)}\mathrm{EHH}^{a2}_{s,t}\;. \end{equation*}\] EHHS might hence be viewed as a linear combination of the EHH’s of the two alternative focal alleles, weighted by roughly the square of the focal allele frequencies.
As for the EHH (see section 2.1.2), \(EHHS^{\text{Tang}}\) starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP. For a given focal SNP, analogously to iHH, iES is defined as the integrated EHHS (Tang et al. 2007). Depending on wether un-normalized or normalized EHHS is used (respectively, \(\mathrm{EHHS}^{\text{Sab}}\) or \(\mathrm{EHHS}^{\text{Tang}}\)), we yield two different values for iES that we denote by \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) respectively. As for iHH, the iES integral is computed using the trapezoid method and is often computed only for the region where EHHS lies over a given threshold (e.g., EHHS>0.05).
Certain genomic regions such as centromeres are usually difficult to sequence and can give rise to large gaps between consecutive SNPs. If not accounted for, these will cause spuriously inflated values of iHH and iES. (Voight et al. 2006) applied two corrections for gaps. First, they introduced a penalty (proportional to physical distance) that resulted in any gap being greater than 20 kb to be re-scaled to this number. Second, they discarded the integration, if two consecutive SNPs with a distance greater than 200 kb were encountered. Both methods are implemented in rehh, yet turned off by default, since the corresponding numbers are likely to depend on the specific data set.
In the computation of both EHH and EHHS from a focal SNP \(s\) to a SNP \(t\), only extended haplotypes with no missing data are considered. As a consequence, the number of extended haplotypes retained to compute these two statistics might decrease with increasing distance of \(t\) from the focal SNP \(s\). If the number of available extended haplotypes falls below a threshold, computation of EHH and EHHS stops. Note however that most phasing programs (such as fastPHASE or SHAPEIT2) allow to impute missing genotypes resulting in phased haplotypes with no missing data.
calc_ehh()
The calc_ehh()
function computes EHH for both ancestral and derived alleles of a focal SNP \(s\) relative to any other SNP \(t\) upstream or downstream. The corresponding integral iHH of these EHH values is returned as well. The two options limehh
and limhaplo
allow to specify conditions to truncate computing EHH (see section 2.1.1). By default limehh
=0.05 and limhaplo
=2. If the border of the chromosome is reached, but EHH has not yet decayed below limehh
, calculation of iHH is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border
to FALSE
. To account for gaps between consecutive SNPs, two methods can be used: it is possible to set by scalegap
a distance such that any one greater is re-scaled to this number. Furthermore, the option maxgap
can be used to stop integration at gaps that are greater than the specified value. Again, if discard_integration_at_border
is set to TRUE
, no value is reported. Finally, if plotehh=TRUE
, the decay of EHH for both ancestral and derived allele is plotted against the SNP map position. More details are available in the R documentation by using the command:
?calc_ehh
In the following example, EHH is computed around the SNP with name “F1205400”. Note that the haplohh_cgu_bta12
object was generated using the data2haplohh()
function with the example input files (see section 1.3.1). For convenience, it is stored as an example object (accessible with the R function data
) as shown below:
#example haplohh object (280 haplotypes, 1424 SNPs) see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
#computing EHH statistics for the focal SNP with name "F1205400"
#which displays a strong signal of selection
res.ehh<-calc_ehh(haplohh_cgu_bta12,mrk="F1205400")
Figure 2.1: Graphical output of the calc_ehh() function
The output contained in res.ehh
is a list with four elements:
limhaplo
) or the minimal value of EHH (limehh
), is reached, no chromosomes are evaluated any more.res.ehh$ehh[1:2,454:458]
> F1205380 F1205390 F1205400 F1205420 F1205440
> Ancestral allele 0.2764706 0.5529412 1 0.8879552 0.6422969
> Derived allele 1.0000000 1.0000000 1 1.0000000 1.0000000
res.ehh$nhaplo_eval[1:2,454:458]
> F1205380 F1205390 F1205400 F1205420 F1205440
> Ancestral allele 85 85 85 85 85
> Derived allele 195 195 195 195 195
res.ehh$freq_all1
> [1] 0.3035714
res.ehh$ihh
> Ancestral allele Derived allele
> 284429.9 2057107.4
In addition, as plotehh=TRUE
by default, we obtain Figure 2.1.
calc_ehhs()
The calc_ehhs()
function computes \(\mathrm{EHHS}^{\text{Sab}}\) and \(\mathrm{EHHS}^{\text{Tang}}\) around the focal SNP \(s\) relative to each other SNP \(t\). This function also computes the corresponding integrals \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) respectively. The two options limehhs
and limhaplo
allow to specify conditions to truncate computing EHHS (see section 2.1.3). By default limehhs
=0.05 and limhaplo
=2. If the border of the chromosome is reached, but EHHS has not yet decayed below limehhs
, calculation of iES is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border
to FALSE
. To account for gaps between consecutive SNPs, two methods can be used: it is possible to set by scalegap
a distance such that any one greater is re-scaled to this number. Furthermore, the option maxgap
can be used to stop integration at gaps that are greater than the specified value. Again, if discard_integration_at_border
is set to TRUE
, no value is reported. Finally, if plotehhs=TRUE
, the decay of EHHS is plotted against SNP map position. More details are available in the R documentation by using the command:
?calc_ehhs
In the following example, the EHHS statistics are computed around the SNP with name “F1205400” on the haplohh_cgu_bta12
object already mentioned above.
data(haplohh_cgu_bta12)
res.ehhs<-calc_ehhs(haplohh_cgu_bta12,mrk="F1205400")
Figure 2.2: Graphical output of the calc_ehhs() function
The output contained in res.ehhs
is a list with five elements:
limhaplo
) or the minimal value
of EHHS (limehhs
), is reached, no chromosomes are evaluated any more.res.ehhs$EHHS_Sabeti_et_al_2007[453:459]
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450
> 0.5017153 0.5095238 0.5347926 0.5756528 0.5654122 0.5429595 0.5386841
res.ehhs$EHHS_Tang_et_al_2007[453:459]
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450
> 0.8715588 0.8851234 0.9290193 1.0000000 0.9822104 0.9432066 0.9357794
res.ehhs$nhaplo_eval[453:459]
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450
> 280 280 280 280 280 280 280
res.ehhs$IES_Tang_et_al_2007
> [1] 1760565
res.ehhs$IES_Sabeti_et_al_2007
> [1] 936407.6
In addition, as plotehh=TRUE
by default, we obtain Figure 2.2.
scan_hh()
The scan_hh()
function efficiently computes iHH for both the ancestral and derived alleles as well as \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) for all SNPs in the haplohh
object. The options limehh
, limehhs
and limhaplo
specify conditions to stop computing EHH and EHHS. By default limehh
=limehhs
=0.05 and limhaplo
=2. If the border of the chromosome is reached, but EHH(S) has not yet decayed below limehh(s)
, calculation of iHH resp. iES is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border
to FALSE
. Large “gaps” between consecutive SNPs can be caused by problems in sequencing or SNP calling and may lead to spuriously long extended haplotypes. To account for this, the option maxgap
can be used to stop integration at gaps that are greater than the specified size. Again, if discard_integration_at_border
is set to TRUE
, no value is reported. Finally, the option threads
, set by default to 1, allows to specify the number of threads to parallelize computation (parallelization being carried out over SNPs).
For instance in order to scan the haplohh_cgu_bta12
object (containing data on 1424 SNPs for 280 haplotypes), one may use the following command:
data(haplohh_cgu_bta12)
res.scan<-scan_hh(haplohh_cgu_bta12)
The resulting object res.scan
is a data frame with haplohh_cgu_bta12@nsnp
(the number of SNPs declared in the haplohh
object) rows and seven columns yielding for each focal SNP in turn:
As an example, the following R code provides the dimension and a segment of the res.scan
data frame obtained above:
dim(res.scan)
> [1] 1424 7
res.scan[453:459,]
> CHR POSITION freq_A iHH_A iHH_D iES_Tang_et_al_2007
> F1205370 12 28925117 0.06071429 765720.1 1121814.8 1119699
> F1205380 12 28947722 0.82500000 1477031.7 454891.9 1433143
> F1205390 12 28967990 0.90000000 1211909.5 574301.6 1204234
> F1205400 12 28993983 0.30357143 284429.9 2057107.4 1760565
> F1205420 12 29101326 0.01785714 269322.2 1086530.3 1086279
> F1205440 12 29147373 0.05714286 336600.7 1185898.6 1182946
> F1205450 12 29197279 0.04285714 528732.6 1149951.0 1148696
> iES_Sabeti_et_al_2007
> F1205370 971728.6
> F1205380 966416.3
> F1205390 954903.3
> F1205400 936407.6
> F1205420 1041943.5
> F1205440 1036274.2
> F1205450 1039850.3
Note that scan_hh()
is more efficient than calc_ehh()
and calc_ehhs()
applied consecutively for each SNP as can be seen by running the two code snippets below:
system.time(res.scan<-scan_hh(haplohh_cgu_bta12))
> user system elapsed
> 0.225 0.000 0.225
foo<-function(haplo){
res.ihh=res.ies=matrix(0,haplo@nsnp,2)
for(i in 1:haplo@nsnp){
res.ihh[i,]=calc_ehh(haplo,mrk=haplo@snp.name[i],plotehh=FALSE)$ihh
tmp=calc_ehhs(haplo,mrk=haplo@snp.name[i],plotehhs=FALSE)
res.ies[i,1]=tmp$IES_Tang_et_al_2007
res.ies[i,2]=tmp$IES_Sabeti_et_al_2007
}
list(res.ies=res.ies,res.ihh=res.ihh)
}
system.time(res.scan2<-foo(haplohh_cgu_bta12))
> user system elapsed
> 13.590 0.004 13.594
Nevertheless, results are the same:
identical(res.scan2$res.ihh[,1],res.scan[,4])
> [1] TRUE
identical(res.scan2$res.ihh[,2],res.scan[,5])
> [1] TRUE
identical(res.scan2$res.ies[,1],res.scan[,6])
> [1] TRUE
identical(res.scan2$res.ies[,2],res.scan[,7])
> [1] TRUE
ihh2ihs()
, ies2rsb()
and ies2xpehh()
functionsThe abbreviation iHS refers to “integrated haplotype homozygosity score”. Let \(\mathrm{uniHS}\) represent the un-standardized log-ratio of ancestral iHH\(_a\) to derived iHH\(_d\) of a certain focal SNP \(s\): \[\mathrm{uniHS}=\log\left(\frac{\mathrm{iHH}_a}{\mathrm{iHH}_d}\right)\] Following (Voight et al. 2006) we perform a standardization by setting: \[\mathrm{iHS}=\frac{\mathrm{uniHS} - \mu^{p_s}_\mathrm{uniHS}}{\sigma^{p_s}_\mathrm{uniHS}}\] where \(\mu^{p_s}_\mathrm{uniHS}\) and \(\sigma^{p_s}_\mathrm{uniHS}\) represent the average and standard deviation of the \(\mathrm{uniHS}\) computed over all the SNPs with a derived allele frequency \(p_s\) similar to that of the SNP \(s\). In practice, the derived allele frequencies are binned so that each bin contains a large enough number of SNPs (e.g., >10) to obtain reliable estimates of \(\mu^{p_s}_\mathrm{uniHS}\) and \(\sigma^{p_s}_\mathrm{uniHS}\).
Note that the iHS is constructed to have an approximately standard Gaussian distribution and to be comparable across SNPs regardless of their underlying allele frequencies. Hence, one may further transform iHS into \(p_\mathrm{iHS}\) (Gautier and Naves 2011): \[p_\mathrm{iHS}=-\log_{10}\left(1-2|\Phi\left(\mathrm{iHS}\right)-0.5|\right)\] where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical iHS distribution is a fair approximation of the neutral distribution), \(p_\mathrm{iHS}\) might thus be interpreted as a two-sided P-value (on a \(-\log_{10}\) scale) associated to the neutral hypothesis of no selection.
ihh2ihs()
The ihh2ihs()
function computes iHS using a matrix of iHH statistics (for both the ancestral and derived alleles) as obtained by the scan_hh()
function (see section 2.4). The argument minmaf
allows to remove SNPs according to their MAF (by default SNPs with a MAF<minmaf
=0.05 are discarded from the standardization). The argument freqbin
controls the size (or number) of the allele frequency bins used to perform standardization (see section 3.1.1). More precisely, allele frequency bins are built from minmaf
to 1-minmaf
in steps of size freqbin
(by default freqbin
=0.025). If instead an integer of 1 or greater is specified, a corresponding number of equally spaced bins is created. If freqbin
is set to 0, standardization is performed considering each observed frequency as a discrete frequency class, which is useful in case that there are only a few different haplotypes.
For instance, to perform a whole genome scan one might run scan_hh()
on haplotype data from each chromosome and concatenate the resulting matrices before standardization. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1
where the chromosome number \(i\) goes from 1 to 29 and the SNP information file is named snp.info
. The R code below then generates a matrix wg.res
with \(iHH_a\) and \(iHH_d\) estimates for all SNPs in an appropriate format to perform standardization with the ihh2ihs()
function:
for(i in 1:29){
hap_file=paste("hap_chr_",i,".pop1",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res<-res}else{wg.res<-rbind(wg.res,res)}
}
wg.ihs<-ihh2ihs(wg.res)
For illustration, \(iHH\) values of a whole genome scan (Gautier and Naves 2011) are provided as example data. The following R code computes the iHS for the CGU population:
data(wgscan.cgu)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
ihs.cgu<-ihh2ihs(wgscan.cgu)
The resulting object ihs.cgu
is a list with two elements:
ihs.cgu$iHS
data frame are displayed below using the R command:head(ihs.cgu$iHS)
> CHR POSITION iHS -log10(p-value)
> F0100190 1 113642 -0.5582992 0.2390952
> F0100220 1 244699 0.2723337 0.1049282
> F0100250 1 369419 0.4810736 0.2003396
> F0100270 1 447278 1.0618710 0.5401640
> F0100280 1 487654 0.8184060 0.3839181
> F0100290 1 524507 -0.3897024 0.1569189
ihs.cgu$frequency.class
data frame are displayed below:head(ihs.cgu$frequency.class)
> #mrk mean(log(iHHA/iHHD)) sd(log(iHHA/iHHD))
> 0.05 - 0.075 1635 0.7286087 0.6457742
> 0.075 - 0.1 1316 0.5804760 0.5556798
> 0.1 - 0.125 1478 0.4710504 0.5079392
> 0.125 - 0.15 1593 0.3720585 0.4708235
> 0.15 - 0.175 1078 0.3263215 0.4524270
> 0.175 - 0.2 1325 0.2721166 0.4533404
ihsplot()
The function ihsplot()
draws a Manhattan plot of the Whole Genome Scan results returned by the function ihh2ihs()
. Various options are available to modify the graphical display (see ?ihsplot
). Figure 3.1 was drawn using the following R code:
layout(matrix(1:2,2,1))
ihsplot(ihs.cgu,plot.pval=TRUE,ylim.scan=2,main="iHS (CGU cattle breed)")
Figure 3.1: Graphical output of the ihsplot() function
The abbreviation Rsb stands for “ratio of EHHS between populations”. For a given SNP \(s\), let \[\mathrm{LRiES}^{\text{Tang}}=\log\left(\frac{\mathrm{iES}_\text{pop1}^{\text{Tang}}}{\mathrm{iES}_\text{pop2}^{\text{Tang}}}\right)\] represent the log-ratio of the \(\mathrm{iES}^{\text{Tang}}\) values computed in the pop1 and pop2 populations (see section 2.1.4).
The Rsb for a given focal SNP is then defined as the standardized \(\mathrm{LRiES}^{\text{Tang}}\) (Tang et al. 2007):
\[\begin{equation*} \mathrm{Rsb}=\frac{\mathrm{LRiES}^{\text{Tang}} - \text{med}_{\mathrm{LRiES}^{\text{Tang}}}}{\sigma_{\mathrm{LRiES}^{\text{Tang}}}} \end{equation*}\] where \(\text{med}_{\mathrm{LRiES}^{\text{Tang}}}\) and \(\sigma_{\mathrm{LRiES}^{\text{Tang}}}\) represent the median and standard deviation of the \(\mathrm{LRiES}(s)^{\text{Tang}}\) computed over all analyzed SNPs. Note that we follow (Tang et al. 2007) in using the median instead of the mean (hence in contrast to the definitions of iHS and XP-EHH). They assume that this might increase the robustness against different demographic scenarios. It should be noticed, too, that the information about ancestral/derived status of alleles at the focal SNP does not figure in the formula. Furthermore, in contrast with \(iHS\), no binning is performed.
As iHS (see section 3.1.1), Rsb is constructed to have an approximately standard Gaussian distribution and may further be transformed into \(p_\mathrm{Rsb}\): \[\begin{equation*} p_\mathrm{Rsb}=-\log_{10}\left(1-2|\Phi\left(\mathrm{Rsb}\right)-0.5|\right) \end{equation*}\] where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical Rsb distribution is a fair approximation of their corresponding neutral distributions), \(p_\mathrm{Rsb}\) might thus be interpreted as a two-sided P-value (in a \(-\log_{10}\) scale) associated to the neutral hypothesis of no selection. Alternatively, \(p_\mathrm{Rsb}\) might also be computed (Gautier and Naves 2011): \[\begin{equation*} p\prime_\mathrm{Rsb}=-\log_{10}\left(|\Phi\left(\mathrm{Rsb}\right)|\right) \end{equation*}\] \(p\prime_\mathrm{Rsb}\) and \(p\prime_\mathrm{Rsb}\) might then be interpreted as a one-sided P-value (in a \(-\log_{10}\) scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population \(pop2\) (represented in the denominator of the corresponding \(\mathrm{LRiES}\)) relatively to the \(pop1\) reference population.
ies2rsb()
The ies2rsb()
function computes Rsb using two data frames containing the iES statistics for each of the two populations considered in the same format as obtained by running the scan_hh()
function (see section 2.4).
In order to perform a genome-wide scan one might first run for each population scan_hh()
on haplotype data from each chromosome and then concatenate the resulting matrices. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1
and hap_chr_i.pop2
where \(i\) is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin and the SNP information file is named snp.info
. The R code below then generates two data frames (wg.res.pop1
and wg.res.pop2
) containing the results from all SNPs in the appropriate format to compute Rsb with the ies2rsb()
function:
for(i in 1:29){
hap_file=paste("hap_chr_",i,".pop1",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
hap_file=paste("hap_chr_",i,".pop2",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
wg.rsb<-ies2rsb(wg.res.pop1,wg.res.pop2)
For illustration, we take \(iES\) values from a genome scan (Gautier and Naves 2011) provided as example data and compute for each SNP the Rsb between the CGU and EUT populations as follows:
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.rsb<-ies2rsb(wgscan.cgu,wgscan.eut,"CGU","EUT")
The resulting object cguVSeut.rsb
is a data frame which shows for each SNP its Rsb and corresponding P-Values assuming Rsb are normally distributed under the neutral hypothesis. The P-value might be either bilateral (default) or unilateral (specified by the method
argument). The first rows of the cguVSeut.rsb
data frame are displayed below using the following R command:
head(cguVSeut.rsb)
> CHR POSITION Rsb (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190 1 113642 -0.3398574 0.13432529
> F0100220 1 244699 -1.0566283 0.53658299
> F0100250 1 369419 -0.1468326 0.05390941
> F0100270 1 447278 -1.8191608 1.16186336
> F0100280 1 487654 -0.2193069 0.08280392
> F0100290 1 524507 -0.7941300 0.36945032
rsbplot()
The rsbplot()
function draws a Manhattan plot of the Whole Genome Scan results as obtained by the function ies2rsb()
. Various options are available to modify the graphical display (see ?rsbplot
). As an example, Figure 3.2 below provides the output of the function rsbplot()
for the Rsb computed above across the CGU and EUT populations (see section 3.2.1). It was drawn using the following R code:
layout(matrix(1:2,2,1))
rsbplot(cguVSeut.rsb,plot.pval=TRUE)
Figure 3.2: Graphical output of the rsbplot() function
The XP-EHH (cross-population EHH) statistic (Sabeti et al. 2007) is similar to Rsb except that it is based on \(\mathrm{iES}^{\text{Sab}}\) instead of \(\mathrm{iES}^{\text{Tang}}\) (see section 2.1.4). Hence, for or a given SNP \(s\), let \[\mathrm{LRiES}^{\text{Sab}}=\log\left(\frac{\mathrm{iES}_\text{pop1}^{\text{Sab}}}{\mathrm{iES}_\text{pop2}^{\text{Sab}}}\right)\] represent the log-ratio of the \(\mathrm{iES}^{\text{Sab}}\) values computed in the pop1 and pop2 populations (see section 2.1.4).
The XP-EHH for a given focal SNP is then defined as the standardized \(\mathrm{LRiES}^{\text{Sab}}\) (Sabeti et al. 2007):
\[\begin{equation*} \mathrm{XP-EHH}=\frac{\mathrm{LRiES}^{\text{Sab}} - \text{mean}_{\mathrm{LRiES}^{\text{Sab}}}}{\sigma_{\mathrm{LRiES}^{\text{Sab}}}} \end{equation*}\] where \(\text{mean}_{\mathrm{LRiES}^{\text{Sab}}}\) and \(\sigma_{\mathrm{LRiES}^{\text{Sab}}}\) represent the mean and standard deviation of \(\mathrm{LRiES}^{\text{Sab}}\) computed over all analyzed SNPs. As with Rsb, the information about the ancestral and derived status of alleles at the focal SNP does not figure in the formula and no binning is performed.
As with iHS (see section 3.1.1) and Rsb (see section 3.2.1), XP-EHH is constructed to have an approximately standard Gaussian distribution and may further be transformed into \(p_\mathrm{XP-EHH}\): \[\begin{equation*} p_\mathrm{XP-EHH}=-\log_{10}\left(1-2|\Phi\left(\mathrm{XP-EHH}\right)-0.5|\right) \end{equation*}\] where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical XP-EHH distribution is a fair approximation of their corresponding neutral distributions), \(p_\mathrm{XP-EHH}\) might thus be interpreted as a two-sided P-value (in a \(-\log_{10}\) scale) associated to the neutral hypothesis of no selection. Alternatively, \(p_\mathrm{XP-EHH}\) might also be computed (Gautier and Naves 2011): \[\begin{equation*} p\prime_\mathrm{XP-EHH}=-\log_{10}\left(|\Phi\left(\mathrm{XP-EHH}\right)|\right) \end{equation*}\] \(p\prime_\mathrm{XP-EHH}\) and \(p\prime_\mathrm{XP-EHH}\) might then be interpreted as a one-sided P-value (in a \(-\log_{10}\) scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population \(pop2\) (represented in the denominator of the corresponding \(\mathrm{LRiES}\)) relatively to the \(pop1\) reference population.
ies2xpehh()
The ies2xpehh()
function computes XP-EHH using two data frames containing the iES statistics for each of the two populations in the format as obtained by running the scan_hh()
function (see section 2.4).
For instance, to perform a genome scan one might first run for each population scan_hh()
in turn on haplotype data from each chromosome and concatenate the resulting matrices. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1
and hap_chr_i.pop2
where \(i\) is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin and the SNP information file is named snp.info
. The R code below then generates two data frames (wg.res.pop1
and wg.res.pop2
) containing the results from all SNPs in the appropriate format to compute Rsb with the ies2rsb()
function:
for(i in 1:29){
hap_file=paste("hap_chr_",i,".pop1",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
hap_file=paste("hap_chr_",i,".pop2",sep="")
data<-data2haplohh(hap_file="hap_file","snp.info",chr.name=i)
res<-scan_hh(data)
if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
wg.xpehh<-ies2xpehh(wg.res.pop1,wg.res.pop2)
For illustration, we consider the \(iES\) values of a genome scan (Gautier and Naves 2011) provided as example data and compute for each SNP the XP-EHH between the CGU and EUT populations as follows:
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.xpehh<-ies2xpehh(wgscan.cgu,wgscan.eut,"CGU","EUT")
The resulting object cguVSeut.xpehh
is a data frame containing for each SNP the XP-EHH and corresponding P-value assuming XP-EHH are normally distributed under the neutral hypothesis. The P-value might be either bilateral (default) or unilateral (as specified by the method
argument). The first rows of this data frame are displayed below:
head(cguVSeut.xpehh)
> CHR POSITION XPEHH (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190 1 113642 -0.5943673 0.2578513
> F0100220 1 244699 -0.7903997 0.3672448
> F0100250 1 369419 -0.9273568 0.4513142
> F0100270 1 447278 -0.3858354 0.1551387
> F0100280 1 487654 -0.9570604 0.4703941
> F0100290 1 524507 -0.7908863 0.3675322
xpehhplot()
The xpehhplot()
draws a Manhattan plot of the Whole Genome Scan results produced by the function ies2xpehh()
. Various options are available to modify the graphical display (see ?xpehhplot
).
As an example, Figure 3.3 provides the output of the function xpehhplot
for the XP-EHH computed above across the CGU and EUT populations (see section 3.3.2). It was drawn by:
layout(matrix(1:2,2,1))
xpehhplot(cguVSeut.xpehh,plot.pval=TRUE)
Figure 3.3: Graphical output of the xpehhplot() function
A plot of the Rsb against XP-EHH values across the CGU and EUT populations (see section 3.2.2 and 3.3.2 respectively) is represented in Figure 3.4. Marked in red is the SNP that was used repeatedly in the examples above. The plot was generated using the following R code:
plot(cguVSeut.rsb[,3],cguVSeut.xpehh[,3],xlab="Rsb",ylab="XP-EHH",pch=".",
xlim=c(-7.5,7.5),ylim=c(-7.5,7.5))
points(cguVSeut.rsb["F1205400",3],cguVSeut.xpehh["F1205400",3],col="red")
abline(a=0,b=1,lty=2)
Figure 3.4: Comparison of Rsb and XP-EHH values across the CGU and EUT populations
distribplot()
The distribplot()
function allows to visualize the distributions of the standardized scores (either iHS, Rsb or XP-EHH) and compare them to the standard Gaussian distribution. As an example, Figure 3.5 below provides the output the function distribplot()
when considering the iHS estimates obtained for the CGU population (see section 3.1.1) using the following R code:
layout(matrix(1:2,2,1))
distribplot(ihs.cgu$iHS[,3],xlab="iHS")
Figure 3.5: Graphical output of the distribplot() function
bifurcation.diagram()
functionThe function bifurcation.diagram()
draws haplotype bifurcation diagrams (Sabeti et al. 2002) that visualize the decay of EHH around a focal SNP. A stark contrast of ancestral and derived bifurcation diagrams should correspond to outlier values of iHS. Within the plot, the root (focal SNP) is identified by a vertical dashed line.
The diagram is bi-directional, portraying decay along both sides of the focal SNP. Moving in one direction, each marker is an opportunity for a bifurcation to occurr further differentiating between extended haplotypes. The thickness of the lines corresponds to the number of chromosomes with the same haplotype.
Several options are available to modify the aspect of the plots (see command ?bifurcation.diagram
).
As an illustration, Figure 4.1 shows the bifurcation diagrams for both the derived and ancestral alleles at the SNP with name “F1205400” on BTA12 CGU haplotypes. This SNP is associated with a strong signal of selection (using both iHS and Rsb statistics) and is located closely (<5kb) to a strong candidate gene involved in horn development (Gautier and Naves 2011). Figure 4.1 was obtained by the following R code:
data(haplohh_cgu_bta12)
layout(matrix(1:2,2,1))
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc="F1205400",all_foc=1,nmrk_l=20,nmrk_r=20,
main="Bifurcation diagram (RXFP2 SNP on BTA12): Ancestral Allele")
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc="F1205400",all_foc=2,nmrk_l=20,nmrk_r=20,
main="Bifurcation diagram (RXFP2 SNP on BTA12): Derived Allele")
Figure 4.1: Graphical output of the bifurcation.diagram() function
The C++ program hapbin (Maclean et al. 2015) is an alternative tool to calculate the statistics iHS and XP-EHH. The obtained values vary between hapbin and rehh. As far as we can tell, this is due to the differences in implementation listed below. The first 4 are hard-wired and cannot be remediated without changing the code while all further discrepancies can be largely eliminated by appropriate parameter settings. We refer to version 1.3.0 of hapbin.
Hapbin disregards the SNPs directly at the border of chromosomes.
If an allele of a focal SNP is present only on a single chromosome, hapbin assigns a EHH value of 1 at the focal SNP and zero otherwise. rehh assigns zero at the focal SNP, too.
For the standardization of iHS resp. XP-EHH, hapbin uses the estimator \(\sqrt{\frac{1}{n}\sum(x_i-\bar{x})^2}\) for the standard deviation while rehh uses \(\sqrt{\frac{1}{n-1}\sum(x_i-\bar{x})^2}\).
The bins used by hapbin for the standardization of iHS cover
the whole interval ]0,1], while in rehh they span the interval [minmaf
,1-minmaf
[. Hapbin includes the upper endpoint into each bin, while rehh includes the lower endpoint.
The default number of bins is 50 in hapbin, yielding a bin width of 0.02. The default width in rehh is 0.025. Setting the number of bins in hapbin to 40 with option --bin
or -b
yields the latter.
If run in default mode, hapbin calculates EHH by (notation as in section 2.1.1)
\[\begin{equation*}
\mathrm{EHH}^a_{s,t}=\sum_{k=1}^{K^a_{s,t}}\left(\frac{n_k}{n_a}\right)^2\;.
\end{equation*}\]
For a set of \(n\) chromosomes this estimator reaches its minimum value of \(\frac{1}{n}\) if all of them are distinct. Yet formula (2.1) used by rehh and applied by hapbin if run with option --binom
or -a
returns zero in this case. The difference reflects distinct sampling strategies, either with replacement or without. For increasing sample size both converge.
The same holds for EHHS.
Integration over EHH resp. EHHS is performed by hapbin on the area between the curve spanned by these quantities and the x-axis (y=0) while rehh integrates only over the part of that area that is above the threshold set by the parameters limehh
resp. limehhs
, i.e. the area between the curve and the line y=threshold. This is not to be confused with the condition for truncation at left and right ends of the curve which is (for all practical purposes) handled identically by both programs. Hence, setting limehh(s)
to zero in rehh and to a very small value in hapbin (zero seems to “hang” it) with parameter -c
or --cutoff
yields virtually identical integrations.
By default, the parameter discard_integration_at_border
is TRUE
in rehh. It has to be set to FALSE
in order to conform to hapbin.
Large differences can arise from different handling of gaps during the integration of EHH resp. EHHS yielding iHH resp. iES. Hapbin has a parameter -s
or --scale
to “down-weight” large gaps by capping them to the specified value. Its default value is 20000 while the corresponding option in rehh is turned off by default, but can be set by scalegap
.
The option maxgap
within rehh leads to a stop of the integration and if the parameter discard_integration_at_border
is set to TRUE
, then no value is reported. This has no counterpart in hapbin. Instead, hapbin allows to specify a maximum length of Extended Haplotypes (disabled by default) which is not possible in rehh.
Gautier M., Naves M., 2011 Footprints of selection in the ancestral admixture of a new world creole cattle breed. Mol Ecol 20: 3128–3143.
Maclean C. A., Hong N. P. C., Prendergast J. G. D., 2015 Hapbin: An efficient program for performing haplotype-based scans for positive selection in large genomic datasets. Mol Biol Evol 32: 3027–3029.
O’Connell J., Gurdasani D., Delaneau O., Pirastu N., Ulivi S., others, 2014 A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet 10: e1004234.
Sabeti P. C., Reich D. E., Higgins J. M., Levine H. Z. P., Richter D. J., others, 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
Sabeti P. C., Varilly P., Fry B., Lohmueller J., Hostetter E., others, 2007 Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918.
Scheet P., Stephens M., 2006 A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629–644.
Tang K., Thornton K. R., Stoneking M., 2007 A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol 5: e171.
Voight B. F., Kudaravalli S., Wen X., Pritchard J. K., 2006 A map of recent positive selection in the human genome. PLoS Biol 4: e72.