Selected abstracts from the 2010 meeting of the American Society of Human Genetics.
Accuracy of Ancestry Informative Markers (AIMs) for the estimation of individual ancestry in admixed populations. J. M. Galanter1, C. Gignoux1, M. Aldrich1, D. Torgerson1, J. G. Ford2, S. Nazarui2, J. R. Rodriguez-Santana3, J. Casal2, A. Torres-Palacios2, J. Salas4, R. Chapela4, H. Geoffrey Watson5, K. Meade6, M. LeNoir7, W. Rodriguez-Cintrón3, P. C. Avila8, A. Bigham9, M. Shriver9, E. González Burchard1,10 1) Department of Medicine, University of California, San Francisco, San Francisco, CA; 2) Veterans Caribbean Health Care System 10 Casia Street San Juan, PR 00921; 3) Centro de Neumologia Pediátrica Torre Medica Auxilio Mutuo Suite 215 Ave. Ponce de Léon No. 735 San Juan, Puerto Rico 00917; 4) Instituto Nacional de Enfermedades Respiratorias Mexico City, Mexico; 5) James A. Watson Wellness Center 5709 Market St Oakland, CA 94608; 6) Children's Hospital Oakland Research Institute 5700 Martin Luther King Jr Way Oakland, California 94609; 7) Bay Area Pediatrics Ste 1, 2940 Summit Street Oakland, CA 94609-3410; 8) Division of Allergy/Immunology Northwestern University M-316, McGaw Pavilion, 240 E. Huron, Chicago, IL 60611; 9) Department of Anthropology Penn State University 512 Carpenter Building State College, PA; 10) Department of Biopharmaceutical Sciences University of California, San Francisco, San Francisco, CA.
Introduction Ancestry informative markers (AIMs) have been used as a cost-effective way to estimate individual ancestral proportions in admixed populations such as African Americans and Latinos. We determined the accuracy of individual ancestry estimates derived from smaller AIMs panels compared to ancestry estimates using all genomewide data as the gold standard.
Methods Latino participants of Mexican (n = 271) and Puerto Rican (n = 324) origin with asthma were recruited from the San Francisco Bay Area, New York City, Puerto Rico, and Mexico City. Genotyping was performed using the Affymetrix 6.0 GeneChip Array; after applying standard QC filters 729,685 markers remained for analysis. We used the intersection of Illumina-550 and Affymetrix 6.0 as our set of potential SNPs to encourage universal applicability of our marker panels. Ancestry information for each SNP was measured via pairwise In (informativeness for assignment) calculations. AIMs panels of 18, 36, 75, 150, 300, 600, 1200, and 2400 unlinked markers were selected. Individual ancestry was estimated using the program ADMIXTURE, specifying a three-population model. Ancestral populations consisted of HapMap Yorubans and CEPH Europeans, as well as Maya and Nahua Native Americans. We compared ancestry estimated with AIMs panels of different sizes with ancestry estimated from genomewide markers. Mean and standard deviation of the difference in ancestry estimation between AIMs and genomewide data were calculated.
Results There was an inverse correlation between the number of AIMs used to estimate ancestry and the mean and standard deviation of the error in ancestry estimation. Using AIMs, African ancestry was consistently overestimated, while the major ancestral component (European in Puerto Ricans and Native American in Mexicans) was systematically underestimated. Using 300 or fewer AIMs consistently produced a standard deviation of ancestry estimation error of 10% or greater.
Discussion Our results illustrate significant error in the estimation of individual ancestry using AIMs. There is both systematic bias resulting in overestimation of African ancestry (and underestimation of other continental ancestry) and random error. Such error is inversely proportional to the number of AIMs used. These findings may have implications for genetic association studies where ancestry is used to control for population stratification as well as for studies examining associations of individual ancestry estimates with a phenotype.
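The error summary described in the Methods (mean and standard deviation of the difference between AIMs-based and genomewide ancestry estimates) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `estimation_error` and the five ancestry proportions are hypothetical.

```python
from statistics import mean, stdev


def estimation_error(aims_estimates, genomewide_estimates):
    """Return (mean, sd) of per-individual differences, AIMs minus genomewide.

    A positive mean indicates systematic overestimation by the AIMs panel;
    the standard deviation summarizes the random error.
    """
    diffs = [a - g for a, g in zip(aims_estimates, genomewide_estimates)]
    return mean(diffs), stdev(diffs)


# Hypothetical African-ancestry proportions for five individuals.
genomewide = [0.10, 0.20, 0.05, 0.15, 0.30]
aims_panel = [0.14, 0.22, 0.11, 0.15, 0.38]

bias, sd = estimation_error(aims_panel, genomewide)
print(f"mean error = {bias:.3f}, sd of error = {sd:.3f}")
```

Repeating the calculation for each panel size would show how both quantities change with the number of markers.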
Admixture in New World populations: an analysis of Y-chromosome, mtDNA, and genome-wide microarray data. W. S. Watkins1, J. Xing1, D. J. Witherspoon1, Y. Zhang1, S. R. Woodward2, L. B. Jorde1 1) Department of Human Genetics, University of Utah, Salt Lake City, UT; 2) Sorenson Molecular Genealogy Foundation, Salt Lake City Utah.
The first major interaction between Native Americans and Europeans is documented historically and occurred less than 550 years ago. This recent time frame provides an excellent opportunity to investigate the effects of admixture between two populations that were previously separated for hundreds of generations. To characterize European admixture in Native American populations, we sampled and analyzed a group of isolated Totonac agriculturists from tropical Mexico near Veracruz and a group of native Bolivians predominantly from the mountainous region near La Paz, Bolivia. Mitochondrial sequencing of HVS1 showed that all samples had pre-Columbian mtDNA haplogroups (A, B, C, and D). Using a panel of 48 STRs or 12 Y-chromosome SNPs, Totonac Y-chromosome lineages were all assigned to the pre-Columbian haplogroup Q1a3a, and Bolivian Y-chromosome lineages were assigned to haplogroups Q1a3a, R1, and J2. Haplogroups R1 and J2 are common in European populations. Principal components analysis (PCA) using >800K autosomal SNPs typed in 24 Totonacs and 23 Bolivians showed that all Totonacs and 14 Bolivians clustered distinctly from Eurasian individuals. Nine Bolivians, however, were positioned between the New World and European PCA clusters. Admixture analysis showed that these nine samples had 21-33% European admixture using a European reference population. All three observed Y-chromosome haplogroups, including the well-studied pre-Columbian haplogroup Q1a3a, occurred in the admixed individuals. Two of the nine admixed individuals had pre-Columbian mtDNA and Y-chromosome haplogroups but 21-23% European ancestry. This result demonstrates that Y-chromosome and mtDNA haplogroups are only partial indicators of an individual’s complete ancestry.
Using EuroAIMs to measure admixture proportions in atypical European populations: the case of Canary Islanders. C. Flores1,2, M. Pino-Yanes1,2, A. Corrales1,2, A. Hernandez3, S. Basaldua1, L. Guerra4, J. Villar2,5,6 1) Research Unit, Hospital Universitario N.S. de Candelaria, Tenerife, Spain; 2) CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain; 3) Instituto Nacional de Toxicologia y Ciencias Forenses, Delegación de Canarias, Tenerife, Spain; 4) Hematology Service, Hospital Universitario Dr. Negrin, Las Palmas de Gran Canaria, Spain; 5) Multidisciplinary Organ Dysfunction Evaluation Research Network, Research Unit, Hospital Universitario Dr. Negrin, Las Palmas de Gran Canaria, Spain; 6) Keenan Research Center, St. Michael's Hospital, Toronto, ON, Canada.
Using ancestry informative markers (AIMs) reduces the number of markers needed for population stratification adjustments in association studies. As few as 100 AIMs are sufficient to adjust for the largest European axis of differentiation (i.e. EuroAIMs). However, their use for ancestry inference and adjustment in association studies in atypical European populations such as the Canary Islanders, a recently African-admixed population from Spain, needs to be addressed. We aimed to explore whether EuroAIMs were suitable both for the inference of Spanish and Northwest African admixture proportions and for ancestry adjustments in association studies including samples from Canary Islanders. We analyzed samples from Canary Islanders, mainland Spanish (IBE) and Northwest Africans (NWA) for 93 EuroAIMs and compared the data with CEU and YRI from HapMap, Basques and Mozabite from HGDP, as well as with previously analyzed European samples. The major genetic difference was observed between NWA and all European populations, preserving the northwest-to-southeast differentiation of European populations in the second axis. Analyses revealed that Canary Islanders were intermediate between IBE and NWA, and that direct sub-Saharan African influences were negligible. Assessment of individual admixtures without prior population information clearly identified two subpopulations corresponding to NWA and IBE, while Canary Islanders were admixed with an average of 17.4% Northwest African contribution varying largely among individuals (range 0-95.7%). As few as 23 EuroAIMs correctly estimated population membership to IBE and NWA, while 69 EuroAIMs were required to accurately estimate individual admixture proportions in Canary Islanders. Ancestry estimates based on a subset of 69 EuroAIMs also controlled significant allele frequency differences between IBE and Canary Islanders.
These data suggest that a handful of EuroAIMs would be useful to control false-positives in association studies performed in Spanish populations. Supported by FUNCIS 23/07 and grants from the Spanish Ministry of Science and Innovation PI081383 and EMER07/001 to CF.
CoAIMs: A Cost-Effective Panel of Ancestry Informative Markers for Determining Continental Origins. E. R. Londin1, M. A. Keller1, C. Maista1, G. Smith1, L. A. Mamounas2, R. Zhang2, S. J. Madore1, K. Gwinn2, R. A. Corriveau1 1) NINDS Repository, Coriell Inst Med Res, Camden, NJ; 2) National Institute for Neurological Disorders and Stroke, Bethesda, MD.
Genetic ancestry is known to impact outcomes of genotype-phenotype studies that are designed to identify risk for common diseases in human populations. Failure to control for population stratification due to genetic ancestry is a significant confounder of disease association studies. Self-identified race is the most common method used to track and control for population stratification; however, social constructs of race are not necessarily informative for genetic applications. The use of ancestry informative markers (AIMs) is a more accurate method for determining genetic ancestry for the purposes of population stratification. Here we use a panel of 36 microsatellite (MSAT) AIMs to determine continental admixture proportions in the context of a biorepository collection. This panel, named CoAIMs, consists of MSAT AIMs chosen based upon their measure of genetic variance (Fst), allele frequencies and their suitability for efficient genotyping. Genotype analysis with a Bayesian clustering method (STRUCTURE) is able to discern continental origins including Europe/Middle East (Caucasians), East Asia, Africa, Native America, and Oceania in reference populations. In addition to determining continental ancestry for such individuals without significant admixture, we applied CoAIMs to ascertain admixture proportions of a large collection of individuals of self-declared race. CoAIMs was used to efficiently and effectively determine continental admixture proportions in a sample set from the NINDS Human Genetics DNA and Cell Line Repository. Individuals of self-declared Caucasian (N=92), African-American (N=200), and Hispanic (N=200) race were analyzed. African American individuals displayed admixture of both African and European ancestry, with 2/200 (1%) of the samples having nearly 100% Caucasian Ancestry, suggesting discordance between self-declared and genetic race. 
Caucasian and Native American represented the highest ancestral proportions in self-reported Hispanic populations. The determination of genetic ancestry in biorepository collections can increase the utility of these samples for gene discovery. The CoAIMs panel used here has potential for broad applicability as a cost effective tool for determining admixture proportions.
The History in our Genes: the Complex Structure of the South African Coloured Population. M. Möller1, L. Quintana-Murci2,3, E. de Wit1, W. Delport4,5, C. Harmant2,3, C. E. Rugamika4, H. Quach2,3, A. Meintjes4, O. Balanovsky6, V. Zaporozhchenko6, C. Bormans7, P. D. van Helden1, C. Seoighe8, D. M. Behar2,9, E. G. Hoal1 1) Molecular Biology and Human Genetics, MRC Centre for Molecular and Cellular Biology, DST/NRF Centre of Excellence for Biomedical TB Research, Stellenbosch University, Tygerberg, Western Cape, South Africa; 2) Institut Pasteur, Human Evolutionary Genetics, Department of Genomes and Genetics, Paris, France; 3) Centre National de la Recherche Scientifique, Paris, France; 4) Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, Western Cape, South Africa; 5) Department of Pathology, Antiviral Research Center, University of California, San Diego, USA; 6) Research Centre for Medical Genetics, Russian Academy of Medical Sciences, Moscow, Russia; 7) Genomics Research Center, Family Tree DNA, Houston, Texas, USA; 8) School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland; 9) Molecular Medicine Laboratory, Rambam Health Care Campus, Haifa, Israel.
The study of recently admixed populations provides unique tools for understanding recent population dynamics, socio-cultural factors associated with the founding of emerging populations, and the genetic basis of disease by means of admixture mapping. The geographical position and complex history of South Africa have led to the establishment of the unique admixed population known as the South African Coloured. We performed a genome-wide analysis of the genetic make-up of this population. We genotyped 959 self-identified individuals from the Western Cape area, using the Affymetrix 500k genotyping platform. This resulted in nearly 75,000 autosomal SNPs that could be compared with populations represented in the International HapMap Project and the Human Genome Diversity Project. Analysis in STRUCTURE revealed that the major ancestral components of this population are predominantly Khoesan (32-43%), Bantu-speaking Africans (20-36%), European (21-28%), and a smaller Asian contribution (9-11%), depending on the model used. However, the autosomal data provide little evidence about the mode in which this admixed population was founded. We went on to show, through detailed phylogeographic analyses of mitochondrial DNA and Y-chromosome variation in a large sample of South African Coloured individuals, that this population derives from at least five different parental populations (Khoisan, Bantus, Europeans, Indians and Southeast Asians), which have contributed differently to the foundation of the South African Coloured. Our analyses reveal extraordinarily unbalanced gender-specific contributions of the various population genetic components, the most striking being the massive maternal contribution of Khoisan peoples (more than 60%) and the almost negligible maternal contribution of Europeans with respect to their paternal counterparts.
The overall picture of gender-biased admixture depicted in this study indicates that the modern South African Coloured population results to a large degree from the early encounter of European and African males with autochthonous Khoisan females of the Western Cape of Good Hope hundreds of years ago.
Inferring recombination rates in recently admixed human populations. D. Wegmann1, K. Veeramah1, D. Kessner2, N. Freimer3, J. Novembre1,2 1) Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, CA; 2) Interdepartmental Program in Bioinformatics, University of California Los Angeles, Los Angeles, CA; 3) Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA.
To fully understand the evolution of recombination rates requires understanding how they vary among individuals and across populations. Here we focus on estimating recombination rates in recently admixed human populations. We take a novel approach based on identifying the ancestry of chromosomal segments. Wherever ancestry changes along a chromosome, a recombination event must have occurred in the history of the chromosome since admixture. Genome wide variation data, along with novel statistical methods, make it possible to infer the ancestry of local segments and identify ancestry switch points in large samples. Assuming a hidden Markov model, we compute the probability of ancestry switches between neighboring SNPs in the genomes of 3000 African Americans and compile a de-novo human recombination map. We account for the possibility that some observed recombination events may be identical-by-descent by a calibration based on simulations of African-American demography. Our recombination map allows us to characterize genomic regions with an excess or deficit in recombination when compared to existing maps. Finally, we aim to characterize the modulation of recombination rates by the genetic background of divergent populations, both at a genome-wide and local scale.
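The core idea, that every change of ancestry along a chromosome marks a recombination event since admixture, can be illustrated with a short sketch. It is a simplification under stated assumptions: the abstract computes switch probabilities with a hidden Markov model, whereas this example assumes hard local-ancestry calls are already available. The function names and the toy haplotypes ("A" for African, "E" for European) are hypothetical.

```python
from collections import Counter


def switch_intervals(labels):
    """Return indices i where ancestry differs between marker i-1 and marker i."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


def switch_counts(haplotypes):
    """Count, per marker interval, how many haplotypes switch ancestry there."""
    counts = Counter()
    for labels in haplotypes:
        counts.update(switch_intervals(labels))
    return counts


# Hypothetical hard ancestry calls at five markers on three haplotypes.
haplotypes = ["AAEEA", "AAAEE", "EEEEE"]
counts = switch_counts(haplotypes)
print(dict(counts))
```

Intervals with many switches across a large sample would correspond to regions of elevated recombination, subject to the identity-by-descent calibration the abstract describes.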
Capitalizing on Admixture in Genome-wide Association Studies: A Two-stage Testing Procedure and Application to Height in African-Americans. G. Kang1, G. Gao1, S. Shete2, D. Redden1, B. Chang3, T. Rebbeck3, J. Barnholtz-Sloan4, N. Pajewski1, D. Allison1 1) The University of Alabama at Birmingham, Birmingham, AL; 2) M. D. Anderson Cancer Center, University of Texas, Houston, TX; 3) School of Medicine, University of Pennsylvania, Philadelphia, PA; 4) Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio.
As genome-wide association studies expand beyond populations of European ancestry, the role of admixture will become increasingly important in the continued discovery and fine-mapping of variation influencing complex traits. Though admixture is commonly viewed as a confounding influence in association studies, approaches such as admixture mapping have demonstrated its ability to highlight disease susceptibility regions of the genome. In this study, we illustrate a powerful two-stage testing strategy designed to uncover trait-associated single nucleotide polymorphisms in the presence of ancestral allele frequency differentiation. In the first stage, we conduct an association scan using predicted genotypic values based on regional admixture estimates. We then select a subset of promising markers for inclusion in a second-stage analysis, where association is tested between the observed genotype and the phenotype conditional on the predicted genotype. We prove that, under the null hypothesis, the test statistics used in each stage are orthogonal and asymptotically independent. Using simulated data designed to mimic African-American populations in the case of a quantitative trait, we show that our two-stage procedure maintains appropriate control of the family-wise type I error rate (FWER), and has higher power under realistic effect sizes than the one-stage testing procedure in which all markers are tested for association simultaneously with control of admixture. We apply the proposed procedure to a study of height in 201 African-Americans genotyped at 108 ancestry informative markers. The two-stage procedure identified two statistically significant markers, rs1985080 (PTHB1/BBS9) and rs952718 (ABCA12). PTHB1/BBS9 is downregulated by parathyroid hormone in osteoblastic cells, and is thought to be involved in parathyroid hormone action in bones and may play a role in height.
ABCA12 is a member of the superfamily of ATP-binding cassette (ABC) transporters and its potential involvement in height is not clear.
Population genomics in the Americas: sub-continental ancestry and its implications for medical genomics. A. Moreno Estrada1, M. Via2, C. Gignoux2, B. Henn1, V. Acuña Alonzo3, K. Bryc4, H. Rangel Villalobos5, S. Cañizales Quinteros6, A. Ruiz Linares7, E. G. Burchard2, C. D. Bustamante1 1) Department of Genetics, Stanford University, Stanford, CA; 2) Institute for Human Genetics, University of California at San Francisco, San Francisco, CA; 3) Molecular Genetics Laboratory, Escuela Nacional de Antropología e Historia (ENAH), Mexico City, Mexico; 4) Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY; 5) Molecular Genetics Research Institute, Universidad de Guadalajara, Ocotlan, Mexico; 6) Molecular Biology and Genomic Medicine, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán (INCMNSZ), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico; 7) Department of Genetics, Evolution and Environment, University College London, London, UK.
Human populations from the Americas are often of admixed origin, with significant genetic contributions from Native American and European populations (primarily involving local indigenous populations and migrants from the Iberian peninsula and Southern Europe) as well as West Africans brought to the Americas through the trans-Atlantic slave trade. As a result present-day Hispanic/Latino populations exhibit complex population structure, which poses additional challenges for characterizing their genetic makeup. In an effort to contribute to a better understanding of the genetic diversity in the Americas we have generated Affymetrix 6.0 genome-wide genotype data for more than 2,250 individuals from a diverse panel of populations including Mexicans, Colombians, Ecuadorians, Puerto Ricans, and Dominicans, as well as nearly 500 individuals from several Native American populations. We show at a finer scale that the complex historical events have affected patterns of genetic and genomic variation within and among present-day Hispanic/Latino populations in a heterogeneous fashion, resulting in rich and varied ancestry within and among populations as well as marked differences in the contribution of European, Native American, and African ancestry to autosomal, X chromosome, and uniparentally inherited genomes. We also stress the importance of characterizing Native Americans both for identifying potential source populations and for providing comprehensive reference panels when applying methods such as local ancestry estimation or admixture mapping. Also, as part of the 1000 Genomes Project, valuable resequencing data is becoming available for at least 280 genomes from individuals of Mexican, Puerto Rican, Colombian, and Peruvian origin, as well as nearly 60 genomes from HGDP populations, including some Native American genomes, being sequenced at 4x coverage as part of an independent sequencing effort. 
We are making use of both high-density genotype data as well as resequence data to fine map ancestry break points and understand diversity that is missed by current catalogs of human genomic diversity, which ultimately will help to better design epidemiological studies involving populations from the Americas.
Chromosomal segments in admixed individuals and inference of local ancestry. A. L. Price Harvard Sch. of Publ. Hlth.
Admixed populations are of special interest in human genetics because of the potential to exploit admixture linkage disequilibrium to map disease genes, especially with regard to diseases with different frequencies across populations. Furthermore, other types of genetic studies, including linkage, GWAS and candidate gene studies are often conducted in admixed populations. This session will describe methods used to estimate ancestry proportions, infer local ancestry of chromosomal segments, and conduct admixture mapping of complex diseases and traits. Successes and failures of admixture mapping and implications of admixture for GWAS, including replication, fine-mapping, and imputation will be evaluated. Examples such as kidney diseases, diabetes, dyslipidemia, asthma, and prostate cancer will be considered. Opportunities presented by admixed populations in the mapping of complex diseases will be discussed. As the field enters the post-GWAS era, it is essential to identify situations where the uniqueness of admixed populations can be exploited to identify susceptibility genes.
A general model for admixture histories in hybrid populations. P. Verdu, N. A. Rosenberg Human Genetics, University of Michigan, Ann Arbor, MI.
Admixed human populations have previously been used for inferring human migrations, detecting natural selection, and finding disease genes. However, these applications often use a simple statistical model of admixture rather than a modeling perspective that incorporates the history of the admixture process. We have developed a general new model of admixture that mechanistically accounts for complex historical admixture processes. We consider M source populations contributing to the ancestry of a hybrid population, potentially with variable contributions across generations. For a random individual in the hybrid population at generation g, we study the fraction of genetic ancestry originating from one of the source populations by computing its moments as functions of time and of introgression parameters. We show how very different admixture processes can produce identical mean admixture proportions but different variances. In a case with two source populations, we also show that when introgression parameters for each source population are constant over time, the long-term limit of the expectation of admixture proportions depends only on the ratio of the introgression parameters. Further, in this constant admixture process, we show that the variance of the admixture proportion can reach a maximum before decreasing to its long-term limit. Our approach will facilitate the inference of admixture mechanisms, illustrating how higher moments of the distribution of admixture proportions can be informative about the admixture processes contributing to the genetic diversity of hybrid populations.
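The two-source, constant-contribution result can be illustrated numerically. The sketch below is not the authors' implementation. It assumes that each parent of an individual at generation g comes from source 1 with probability s1, from source 2 with probability s2, and from the previous hybrid generation with probability h = 1 - s1 - s2. Under that assumption the expected fraction of source-1 ancestry follows E[g] = s1 + h * E[g-1], with long-term limit s1 / (s1 + s2). The function name and parameter values are hypothetical.

```python
def expected_admixture(s1, s2, generations, founding_fraction=0.5):
    """Trajectory of the expected source-1 ancestry fraction in the hybrid population.

    s1, s2: constant per-generation contributions of the two source populations.
    founding_fraction: assumed source-1 fraction in the founding generation.
    """
    h = 1.0 - s1 - s2  # contribution of the previous hybrid generation
    expectation = founding_fraction
    trajectory = [expectation]
    for _ in range(generations):
        expectation = s1 + h * expectation
        trajectory.append(expectation)
    return trajectory


# Two processes with different magnitudes but the same ratio s1:s2 = 1:3.
weak = expected_admixture(0.02, 0.06, generations=2000)
strong = expected_admixture(0.10, 0.30, generations=2000)
print(weak[-1], strong[-1])
```

Both trajectories approach 0.25, which reflects the statement that the limiting expectation depends only on the ratio of the introgression parameters. The variance, which the abstract shows can distinguish such processes, is not computed here.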
A small number of candidate gene SNPs reveal geographical ancestry. N. KODAMAN1, J. R. SMITH2, L. B. SIGNORELLO2,3, K. BRADLEY2, J. BREYER2, S. COHEN2, J. LONG2, Q. CAI2, W. J. BLOT2,3, C. MATTHEWS2, S. M. WILLIAMS1 1) CENTER FOR HUMAN GENETICS RESEARCH, VANDERBILT UNIVERSITY, NASHVILLE, TN; 2) DEPARTMENT OF MEDICINE, VANDERBILT UNIVERSITY, NASHVILLE TN; 3) INTERNATIONAL EPIDEMIOLOGY INSTITUTE, ROCKVILLE, MD.
Ancestry Informative Markers (AIMs) are genetic variants that differ substantially in frequency among geographical populations. They are frequently used in conjunction with PCA or STRUCTURE to infer ancestry and subsequently to control for stratification in genetic epidemiological studies. The number of AIMs necessary to infer ancestry and provide adequate information to control for stratification is not clear. Original estimates suggested as many as 300 AIMs were necessary, but more recently, Allocco et al. (2007) found that as few as 50 SNPs chosen randomly from the HapMap database predicted ancestral continent of origin with an average accuracy of 95%. This observation raises the question of whether AIMs are necessary to estimate ancestry in candidate gene studies if the study has a sufficient number of markers. Here, using genomic data from an obesity-related candidate gene study on 2547 African-American and Caucasian participants, we assessed the proportion of African/European ancestry in individuals by running STRUCTURE (k=2) with 276 AIMs, and then compared the results to analyses using randomly chosen subsets of 100, 50, and 25 AIMs, as well as 100, 50, and 25 SNPs chosen randomly from 1144 SNPs in 44 obesity candidate genes. Each subset of AIMs and SNPs was chosen randomly 100 times, and each randomized sample was analyzed with STRUCTURE 25 times. We found that all of the subsets of AIMs and SNPs generated reliable estimates of ancestry. For example, the correlation between quantitative ancestry estimates using 276 AIMs and only 50 random SNPs chosen from among the candidate genes was approximately 0.95. Our results confirm Allocco et al.'s conclusion on the informativeness of random SNPs, while further showing that even SNPs randomly chosen from a small number of hand-picked genes can accurately place individuals to their continent of origin and be used to estimate the degree of admixture in mixed-race individuals.
Our findings suggest that future candidate gene studies on African-American and Caucasian populations could be conducted more cost-effectively by forgoing the use of AIMs to control for population stratification and just using SNPs from the candidate genes.
Defining population structure within Southern Africa to advance studies of human disease. V. M. Hayes1, D. C. Petersen2, A. J. Schork3, R.-A. Hardie2, R. Wilkinson4, P. Venter5, S. C. Schuster6, N. J. Schork3 1) Human Genomics, J. Craig Venter Institute, San Diego, CA; 2) Lowy Cancer Research Center and Children's Cancer Institute Australia, University of New South Wales, Randwick, NSW, Australia; 3) The Scripps Research Institute and The Scripps Translational Science Institute, San Diego, CA; 4) Blood Transfusion Services, Windhoek, Namibia; 5) Department of Health Sciences, University of Limpopo, South Africa; 6) Department of Biochemistry and Molecular Biology, Pennsylvania State University, State College, PA.
Although Africa is home to one-sixth of the world’s population and is the epicenter of many globally significant infectious diseases, medical research tailored to African populations has been limited. As a result, the benefits of genome-wide association studies for defining human disease susceptibility, resistance and drug response have not materialized for the continent. One of the main reasons for this limitation is that African representation in current DNA databases has largely been restricted to the Yoruba people of Nigeria. In an attempt to define the extent of genetic diversity within Africa, in 2010 the first Southern African genomes were sequenced, representing the diverse Niger-Congo B (Bantu) and Khoisan (click-speaking) linguistic groups. Whole genome and/or exome sequencing of five individuals resulted in the identification of 1.3 million novel DNA variants. In this study we use both current content genotyping arrays (>1 million SNPs) and the novel Southern African variant content (927,000 informative SNPs) to define the genetic diversity and population structure that exists within Southern Africa. It is clear from our studies that the Khoisan people not only represent an ancient divergent and broadly defined grouping with unique sub-population genetic structure, but are genetically distinct from the Bantu and Yoruba people. We also demonstrate a highly diverse Bantu population structure that is genetically distinct from the West African Yoruba. Ultimately, the smaller blocks of linkage disequilibrium in these older and diverse populations will provide an opportunity to overcome some of the limitations of current Eurocentric studies by facilitating disease-relevant causal variant identification. These data will drive host-genetic research efforts at the NIAID-sponsored Genome Sequencing Center for Infectious Diseases at the JCVI.