Genome-wide inference of ancestral recombination graphs
As the title indicates, our paper is about the problem of inferring an “ancestral recombination graph,” or ARG, from sequence data. This is a topic that may strike many readers as impenetrably obscure and technical, so I will first try to explain, in plain language, what the ARG describes and why it has so much potential to be useful in many kinds of genetic analysis. Then, I will tell the story of how I and members of my research group have become increasingly fascinated by this problem over the years, how we have struggled with it, and how we finally achieved the conceptual breakthrough that is described in our paper. As will become evident, Matt Rasmussen, a former postdoc in the group and lead author of our paper, was central in this achievement.
What is the ARG?
The ARG is an elegantly simple yet superbly rich data structure that describes the complete evolutionary history of a collection of genetic sequences drawn from individuals in one or more populations. It was invented in the mid 1990s by the mathematicians Bob Griffiths and Paul Marjoram. The ARG captures essentially all evolutionary information relevant for genetic analysis of such sequences. Statisticians say that it fully defines the “correlation structure” of the sequences, meaning that it explains most similarities and differences among the sequences in terms of their patterns of shared ancestry.
The ARG is something like a family tree, only richer, because it not only defines the relationships among individuals, but it also traces the histories of specific segments of DNA sequences. For example, if you were to replace your family tree with an ARG, you could tell exactly which pieces of your genome came from your eccentric great grandmother and which pieces you share with your charming, intelligent, and handsome third cousin. [. . .]
The significance of recombinations and coalescences comes from the fact that these are the two ways in which lineages can join or split over time. The best way to understand them is to think about the behavior of lineages as one looks backward in time. The graph is typically laid out with time on the vertical axis, so that the bottom of the graph represents the present time and each node is assigned a height above this baseline indicating the time before the present at which the associated event occurred. Therefore, to look backward in time, we look upward in the graph. As we do so, we see that recombination events cause a single lineage to split into two ancestral lineages (representing the two sequence fragments that were joined together by the recombination in forward time), and coalescence events cause two lineages to join into one. Therefore, recombination nodes have one edge coming in and two going out, and coalescence nodes have two edges coming in and one going out. One way of thinking about it is that, given a fragment of modern DNA, recombinations have the effect of increasing its set of ancestors, while coalescences have the effect of decreasing its set of ancestors. [. . .]
It is worth emphasizing that this representation is very general. It can be used to describe the history of a particular gene of interest in individuals from a single well-defined population, or the history of whole genomes (with one ARG per chromosome) for individuals from many diverse populations. It can even be used to describe the histories of sequences from representatives of different species, such as humans, chimpanzees, and gorillas. As long as the sequences in question are orthologous and collinear—meaning, essentially, that they are derived from a common ancestral sequence in the absence of duplication and rearrangement events—then the coalescence and recombination events defined by the ARG are sufficient for describing precisely how the sequences derive from their common ancestor, and, hence, how they are correlated with each other. [. . .]
The problem with trees on population genetic time scales, however, is that they change along the sequence, due to recombination. As noted above, the ARG precisely describes these trees and the way they change. Therefore, it enables tree-thinking with population genetic data.
Viewing population genetics in terms of the ARG can clarify one’s thinking about many problems of interest. For instance, the ARG makes it clear that divergence times for genetically isolated populations can be estimated by looking across the ARG for the most recent coalescences that cross population boundaries. Similarly, given an estimated divergence time, the rate of gene flow or migration between populations can be estimated, in a fairly straightforward manner, in terms of the rates of inter-population coalescence events across the ARG. Ancestral effective population sizes can be estimated from the density of coalescence events in the ARG over time. Signatures of natural selection, including hitchhiking and background selection, can be detected by various kinds of local distortion of the ARG. In general, the ARG provides a unifying framework for the field, and many challenging statistical problems in population genetics can properly be seen as problems of revealing relevant features of the ARG.
What would a reconstructed ARG mean in practical terms? First, I should be clear that we have no intension of actually drawing an ARG for dozens of complete human genome sequences. Such a drawing would be far too large and complex to be useful. Rather, the value of a reconstructed ARG is as a rich data structure that could be interrogated for many features of interest, such as local trees, recombination events, mutation ages, or regions of identity by descent. Because these features would be derived from a unified description of the evolutionary history of the sample, they would be guaranteed to be internally consistent, unlike ones based on simpler estimators. In this way, the ARG would be useful in many problems of interest in statistical genetics, ranging from demography inference (e.g., estimation of population divergence times or rates of gene flow between populations), to the detection of regions influenced by natural selection, to the detection of genotype/phenotype associations.
Why is it so difficult to find a good ARG?
In practice, most population geneticists do not work with ARGs, but instead work with surrogates such as principle components, site frequency spectra, and spectra of identity by descent. The reason people work with these simpler, lower-dimensional summaries of genetic data, of course, is that explicit ARG reconstruction is forbiddingly difficult. From a statistical and computational perspective, there are two major issues in reconstructing the ARG. First, the problem of searching all possible ARGs for one that best fits the data is computationally intractable, even in a restricted, parsimony-based formulation of the problem (it belongs to the class of problems computer scientists call “NP-hard”). Second, and perhaps more importantly, in most cases of interest there is simply not enough information in the data to reconstruct a single ARG with high confidence. Rather, in general, a large family of ARGs will be more or less equally compatible with observed sequences.
For these reasons, it would be misleading to suggest that there is any hope of producing a magical computer program that will allow the user to input a collection of sequences and obtain the true ARG for those sequences as output. Instead, we must consider many possible ARGs, weighting them in some way by their plausibility. In other words, we must consider a statistical distribution of ARGs given the data.
Because of the awkwardness of the space of ARGs (each ARG is a complex, combinatorial object, difficult to summarize in terms of low-dimensional features), we and others have come to the conclusion that the best way to get at these distributions is by making use of statistical sampling methods. In our case, we use an approach, called Markov chain Monte Carlo (MCMC), that chooses samples that are guaranteed to be representative of the distribution of ARGs given the data and the model, provided the sampling program is run long enough. After collecting fairly large numbers of samples, we can make useful statements about general features of the ARG even if we have limited confidence in each individual sample. For example, the average of the times to most recent common ancestry (TMRCA) in the sampled ARGs at a particular position along the sequence can be used as an estimator of the true TMRCA at that position. We show that our methods can be used to summarize various useful features of this kind, including recombination and coalescence rates, and the ages of mutations that are polymorphic in the sample, as well as TMRCAs.
[. . .] The ability to perform explicit ARG inference on the scale of complete genomes opens up a wide range of possible applications, but the long running times required for these analyses and the unwieldy data structures they produce (large numbers of samples of ARGs) may be barriers to practical usefulness. One strategy for addressing this problem would be to precompute ARGs for data sets of particular interest and provide publicly available tools for data retrieval and visualization. For example, one could carefully analyze a particularly rich public data set, such as the highcoverage genome sequences currently being produced by the 1000 Genomes Project , and extract a modest number of samples (say, 1000) from a lengthy MCMC run. These samples could be stored in a database in a manner that allowed researchers to efficiently extract various features of interest, such as marginal genealogies, recombination events, regions of IBD, or times to most recent common ancestry. In this way, a single ARG sampling run could be used to enable a broad variety of downstream analyses. A related possibility would be to support on-the-fly “threading” of user-specified query sequences into precomputed ARGs. This operation would be analogous to local ancestry inference [47,70,71], but would reveal not only the population sources of query sequence segments, but also additional information about recombination events, coalescence times, approximate mutation ages, and other features. The same operation could be used to allow our sampling methods to scale to thousands of genomes: one could infer ARGs for, say, 100 genomes, then simply thread in hundreds more, without full MCMC sampling. Finally, precomputed ARG samples could also be used as the basis for various visualization tracks, perhaps including the tracks like the ones introduced in this paper, as well as complementary tracks describe features of the ARG such as population divergence times, migration rates, or mutation ages. In general, posterior samples of ARGs will be a rich resource for the interpretation of genetic data, but it will be critical to find efficient and effective ways to make these samples practically useful to the genomics community.