K. J. Galinsky1,2 ; G. Bhatia2,3 ; P. Loh2,3 ; S. Georgiev4 ; S. Mukherjee5 ; N. J. Patterson2 ; A. L. Price1,2,3
Population differentiation is a widely used approach to detect the action of natural selection. Existing methods search for unusual differentiation in allele frequencies across discrete populations, e.g. using FST. Loci that are unusually differentiated with respect to the genome-wide FST or with respect to a null distribution of FST are reported as signals of selection. These approaches are particularly powerful for closely related populations with large sample sizes.

However, population genetic data are often not naturally partitioned into discrete populations. We developed a test for selection that uses SNP loadings from principal components analysis (PCA). For a given PC reflecting geographic ancestry, under the null hypothesis of no selection, the square of the SNP loadings, rescaled by a scaling factor derived from the eigenvalue of the PC, follows a chi-square (1 d.o.f.) distribution. This statistic can detect selection at genome-wide significance, a key consideration in genome scans for selection. We confirmed via simulations that this statistic has correct null calibration under a wide range of demographies and is well-powered to detect selection at large sample sizes.

We applied the method to a cohort of 54,734 European Americans genotyped on genome-wide arrays. PCs were inferred using our FastPCA software (running time: 57 minutes). The top 4 PCs corresponded to clines of Irish, Eastern European, Northern European, Southeast European and Ashkenazi Jewish ancestry, validated via PCA projection of samples of known ancestry. We detected genome-wide significant signals of selection at 4 known selected loci (LCT, HLA, OCA2 and IRF4) and 3 novel loci: ADH1B, IGFBP3 and IGH. Two of the three novel loci could not be detected using discrete-population tests (or other existing tests).
The ADH1B gene is associated with alcoholism (via the same coding SNP rs1229984 producing a signal in our selection scan) and has been shown to be under recent selection in East Asians (via a haplotype-based test for recent selection); we show here that it is a rare example of independent evolution on two continents. The IGFBP3 gene and IGH locus have been implicated in breast cancer and multiple sclerosis, respectively. Our results show that application of our PC-based selection statistic to large data sets can infer novel, genome-wide significant signals of selection at loci linked to disease traits.
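
The PC-based statistic described above can be illustrated with a minimal sketch. This is not the authors' implementation and does not use FastPCA: it runs a plain NumPy SVD on a small simulated two-population data set, and the exact scaling (number of SNPs divided by the eigenvalue of the PC) is an assumption chosen so that the statistic averages 1 across SNPs. The function names `simulate_null` and `pc_selection_stats`, the Beta-drift simulation model, and all parameter values are hypothetical.

```python
import math
import numpy as np

def simulate_null(m=2000, n_per_pop=100, fst=0.01, seed=0):
    """Simulate neutral genotypes (m SNPs x 2*n_per_pop samples).

    Assumed toy model: two populations whose allele frequencies drift
    from a shared ancestral frequency via a Beta distribution.
    """
    rng = np.random.default_rng(seed)
    p_anc = rng.uniform(0.1, 0.9, size=m)
    a = p_anc * (1 - fst) / fst
    b = (1 - p_anc) * (1 - fst) / fst
    pops = [
        rng.binomial(2, rng.beta(a, b)[:, None], size=(m, n_per_pop))
        for _ in range(2)
    ]
    return np.hstack(pops)

def pc_selection_stats(genotypes, k=0):
    """Return (statistics, p-values) for PC number k (0-based).

    genotypes: array of 0/1/2 allele counts, shape (SNPs, samples).
    """
    g = np.asarray(genotypes, dtype=float)
    p_hat = g.mean(axis=1) / 2.0
    keep = (p_hat > 0) & (p_hat < 1)          # drop monomorphic SNPs
    g, p_hat = g[keep], p_hat[keep]

    # Normalize each SNP to mean 0 and (approximately) unit variance.
    x = (g - 2 * p_hat[:, None]) / np.sqrt(2 * p_hat * (1 - p_hat))[:, None]

    # Rows of vt are unit-length PCs over samples; s**2 are eigenvalues of x.T @ x.
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    loadings = x @ vt[k]                      # one loading per SNP
    eigenvalue = s[k] ** 2

    # Assumed scaling: squared loading times (number of SNPs / eigenvalue).
    m = x.shape[0]
    stats = m * loadings ** 2 / eigenvalue

    # Upper tail of chi-square with 1 d.o.f.: P(Z^2 > t) = erfc(sqrt(t / 2)).
    pvals = np.array([math.erfc(math.sqrt(t / 2.0)) for t in stats])
    return stats, pvals

stats, pvals = pc_selection_stats(simulate_null())
print("mean statistic:", stats.mean())
print("smallest p-value:", pvals.min())
```

In a real scan, the p-values would be compared against a genome-wide significance threshold, and the simulation step would be replaced by observed genotypes; both choices are outside the scope of this sketch.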