race/history/evolution notes: DNA analysis and small-scale admixture events

This paper ("Calculating expected DNA remnants from ancient founding events in human population genetics"; provisional PDF) concludes, based on a series of computer simulations:

while genetic data may be sensitive and powerful in large genetic studies, caution must be used when applying genetic information to small, recent admixture events. For some parameter sets, genetic data will not be adequate to detect historic admixture. In such cases, studies should consider anthropologic, archeological, and linguistic data where possible.

The authors point out that:

While genetic studies can provide considerable information, they are also accompanied by variation and stochasticity. Because of these limitations, even the most complete studies of human populations have been called “not unequivocal”[21] or “sobering”[22] by those conducting the research. Recent reports have also addressed the limited depth of current genetic studies[23], indicating that most studies make conclusions after sequencing less than 1% of subjects’ genomes, and sampling only small numbers of a population. Such methods can be especially problematic when dealing with historic admixture events that are very small. The difficulty is a function of the current architecture of genetic studies: researchers sample loci from a group of individuals and categorize individuals into groups based on which alleles they have at the loci tested[24, 25]. These categorizations are determined based on the most prevalent or probable genetic markers in an individual’s genome. The results of these studies, then, can overlook genetic markers that simply are not sampled, which is common in small admixture events. Additionally, stochastic events can lead to allele fixation and further complicate matters, particularly in small populations. It has been suggested that studies of even the largest migrations should couple genetic information with archeological, anthropological, and linguistic data[26].

The simulation results aren't too surprising:

The sizes of the migrant and native populations are fundamental for an understanding of expected allele frequency. With time since admixture as low as those we consider in our simulations, the most important factors are the sizes of the migrating and native populations. In our simulations, if the native population is large, changing the migrating population size results in a change of mean final allele frequency from .0243 to .0010. If the native population is small, those numbers change to .5016 and .0407. These are the most significant differences illustrated by our simulations and they attest to the important role of population sizes. Researchers should not expect to find many alleles from a small migratory group of 50 individuals in a large population today, even if sampling methods are exhaustive.

Sample size matters. Or, to beat a (hopefully) dead horse, why DNAprint is crap:

The average final allele frequency of the migrant allele in our population from the second simulation was 1.017%. We calculated the cumulative density function (CDF) for a genetic study that samples 50 loci for each individual and where the probability of detecting the migrant allele is equal to the probability found in our simulations. The CDF demonstrates that in 60% of individuals sequenced for 50 loci, we would not expect to find a single migrant allele (Figure 7a). Furthermore, we will only find more than one migrant allele in 9% of the subjects examined.

In the case of a large study with as many as 933 loci, based upon the expected migrant allele frequency of 1.017%, almost every subject would demonstrate at least one migrant allele (Figure 7b). In fact, most subjects would demonstrate more than 9 migrant alleles. However, while large studies would expect to succeed in finding more migrant alleles in today’s population, this alone cannot link the admixed population to the migrant population. The migrant alleles will still only represent, on average, 1% of every allele sequenced in the entire study. Therefore, although 9 migrant alleles may, on average, be found in each subject, it is hard to know if the migrant alleles will be redundant among loci and subjects or spread evenly throughout all the loci in the study. Additionally, these numbers could be considerably lower depending on the allele frequency in the migrating population.
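The 60% and 9% figures quoted above can be roughly reproduced with a plain binomial model. This is only a sketch under my own assumptions, not the paper's method: each locus is treated as an independent trial that shows the migrant allele with probability 0.01017 (the mean final frequency from the second simulation), and the function names are made up for illustration.

```python
from math import comb

# Mean final migrant allele frequency quoted from the paper's second simulation.
MIGRANT_FREQ = 0.01017


def binom_pmf(k, n, p):
    """Probability of exactly k migrant alleles among n independent loci."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of at most k migrant alleles among n independent loci."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def summarize(n_loci, p=MIGRANT_FREQ):
    """Per-subject detection summary (assumes independent loci, one draw each)."""
    return {
        "expected": n_loci * p,                          # mean migrant alleles per subject
        "p_none": binom_pmf(0, n_loci, p),               # subject shows no migrant allele
        "p_more_than_one": 1 - binom_cdf(1, n_loci, p),  # subject shows two or more
    }


if __name__ == "__main__":
    for n_loci in (50, 933):
        s = summarize(n_loci)
        print(
            f"{n_loci} loci: expected {s['expected']:.2f}, "
            f"P(none) {s['p_none']:.4f}, P(>1) {s['p_more_than_one']:.4f}"
        )
```

With 50 loci this gives about 0.60 for "no migrant allele" and about 0.09 for "more than one", in line with the quoted numbers. With 933 loci the expected count is about 9.5 per subject and the chance of seeing none is negligible.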

And:

As time increases, genetic drift causes the spread of final allele frequencies to increase, particularly when the population sizes are small. Thus, as the time since the admixture event increases, sample size for both loci and subjects becomes increasingly important.

In our second simulation, most of the migrant alleles are present in less than 2% of the population. In a study of a population where few subjects from many human populations are studied, alleles from a small-scale admixture will usually not be recovered at all. And these rare alleles could easily be ignored in favor of haplotypes that better categorize the population into clusters.
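The point about drift widening the spread of final allele frequencies can be illustrated with a toy Wright-Fisher model. This is a minimal sketch and not the authors' simulation: the starting frequency, population sizes, generation counts, and function names are all hypothetical values I picked for illustration, and the model ignores migration, selection, and mutation.

```python
import random
import statistics


def wright_fisher(freq, copies, generations, rng):
    """Drift one allele for some generations in a population of fixed size.

    `copies` is the number of gene copies resampled each generation.
    """
    for _ in range(generations):
        # Each copy in the next generation inherits the allele with probability `freq`.
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
    return freq


def final_spread(start_freq, copies, generations, replicates=100, seed=1):
    """Standard deviation of the final frequency across independent replicates."""
    rng = random.Random(seed)
    finals = [
        wright_fisher(start_freq, copies, generations, rng)
        for _ in range(replicates)
    ]
    return statistics.pstdev(finals)


if __name__ == "__main__":
    # Hypothetical settings: allele starts at 10% in every run.
    for copies, generations in ((100, 5), (100, 50), (1000, 50)):
        spread = final_spread(0.1, copies, generations)
        print(f"copies={copies:5d} generations={generations:3d} spread={spread:.3f}")
```

In this toy setup the spread grows with the number of generations and shrinks as the population gets larger, which is the qualitative pattern the authors describe.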

The authors conclude:

DNA data have been touted as a panacea for recovering information about the past, but their use depends so extensively on factors that are beyond our control that its applicability is not always appropriate. It is imperative, therefore, that researchers understand the implications of the variables we have presented and not rely solely on DNA sequence data when researching small, recent human migrations.

We can only hope to understand basic details of population history when quantifying genetic data and even valid results derived from genetic data may still be misleading if viewed unilaterally, as demonstrated by Harpending et al.[46, 47]

Our results, however, are not completely ominous. Carefully designed studies should be able to draw specific and valid conclusions from genetic data. One area for major improvement is the number of individuals and loci sampled. Our results indicate that a large sample size and large number of loci are needed to obtain robust results. Studies that are unable to sample sufficiently do not have the power to draw appropriate conclusions and should be interpreted with caution.

[. . .]

The random nature of admixed genetic data seen in these simulations demonstrates that the utility of genetic data is dependent on the context of each individual study. Increasing the number of loci and the number of individuals sampled will increase the probability of detecting small traces of signal, but other sources of evidence should always be considered where possible.
