Great ape genetic diversity and population history
Most great ape genetic variation remains uncharacterized1,2; however, its study is critical for understanding population history3–6, recombination7, selection8, and susceptibility to disease9,10. Here, we sequence to high coverage a total of 79 wild- and captive-born individuals representing all six great ape species and seven subspecies and report ~88.8 million single nucleotide polymorphisms. Our analysis provides support for genetically distinct populations within each species, novel signals of gene flow, and the split of common chimpanzees into two distinct groups: Nigeria-Cameroon/Western and Central/Eastern populations. We find extensive inbreeding in almost all wild populations with Eastern gorillas being the most extreme. Inferred effective population sizes have varied radically over time in different lineages and this appears to have a profound effect on the genetic diversity at or close to genes in almost all species. We comprehensively discover and assign 1,982 loss-of-function variants throughout the human and great ape lineages, determining that the rate of gene loss has not been different in the human branch compared to other internal branches in the great ape phylogeny. This comprehensive catalog of great ape genome diversity provides a framework for understanding evolution and a resource for more effective management of wild and captive great ape populations.
We sequenced great ape genomes to a mean of 25-fold coverage per individual (Table 1, Supplementary Note, Table S1) sampling natural diversity by selecting captive individuals of known wild-born origin as well as individuals from protected areas in Africa (Figure 1a). We also included nine human genomes—three African and six non-African individuals11. Variants were called using the software package GATK12 (Methods), applying several quality filters, including conservative allele balance filters, and requiring that genomes showed <2% contamination between samples (Methods and Supplementary Note). In order to assess the quality of single nucleotide variant (SNV) calls, we performed three sets of independent validation experiments with concordance rates ranging from 86%–99% depending on allele frequency, the great ape population analyzed, and the species reference genome used (Supplementary Note, Table S2). In total, we discovered ~84.0 million fixed substitutions and ~88.8 million segregating sites of high quality (Table 1, Table S3) providing the most comprehensive catalog of great ape genetic diversity to date. From these variants we also constructed a list of potentially ancestry-informative markers (AIMs) for each of the surveyed populations, although a larger sampling of some subspecies is still required (Supplementary Note).
We initially explored the genetic relationships between individuals by constructing neighbor-joining phylogenetic trees from both autosomal and mitochondrial genomes (Supplementary Note). The autosomal tree identified separate monophyletic groupings for each species/subspecies designation (Suppl. Figure 8.5.1) and supports a split of extant chimpanzees into two groups. Nigeria-Cameroon and Western chimpanzees form a monophyletic clade (>97% of all autosomal trees) while Central and Eastern chimpanzees form a second group (72% of all autosomal trees).
Genome-wide patterns of heterozygosity (Figure 1b) reveal a threefold range in single nucleotide polymorphism (SNP) diversity. Non-African humans, Eastern lowland gorillas, bonobos, and Western chimpanzees show the lowest genetic diversity (~0.8 x 10−3 heterozygotes/bp). In contrast, Central chimpanzees, Western lowland gorillas, and both orangutan species show the greatest (1.6–2.4 x 10−3 heterozygotes/bp). These differences are also reflected by measures of inbreeding from runs of homozygosity13 (Figure 1c, Supplementary Note). Bonobos and Western lowland gorillas, for example, have similar distributions of tracts of homozygosity as human populations that have experienced strong genetic bottlenecks (Karitiana and Papuan). Eastern lowland gorillas appear to represent the most inbred population, with evidence that they have been subjected to both recent and ancient inbreeding.
To examine the level of genetic differentiation between individuals we performed a principal component analysis (PCA) of SNP genotypes (Supplementary Note). Chimpanzees were stratified between subspecies with PC1 separating Western and Nigeria-Cameroon chimpanzees from the Eastern and Central chimpanzees and PC2 separating Western and Nigeria-Cameroon chimpanzees. In gorillas, PC1 clearly separates Eastern and Western gorillas while the Western lowland gorillas are distributed along a gradient of PC2, with individuals from the Congo and Western Cameroon positioning in opposite directions along the axis. The isolated Cross River gorilla is genetically more similar to Cameroon Western lowland gorillas and can be clearly differentiated with PC3 (Suppl. Figure 8.2.9).
We explored the level of shared ancestry among individuals within each group14 using an admixture model (FRAPPE). In chimpanzees, the four known subspecies are clearly distinguished when fitting the model using four ancestry components (K=4) (Figure 1d). Additional substructure is identified among the Eastern chimpanzees Vincent and Andromeda (K=6), who hail from the most Eastern sample site (Gombe National Park, Tanzania). As in Gonder et al2 we have identified three Nigeria-Cameroon samples (Julie, Tobi and Banyo, K=3–5) with components of Central chimpanzee ancestry. However, taking Central chimpanzees and the remaining Nigeria-Cameroon chimpanzees as ancestral populations shows no evidence of gene flow by either the F3 statistic or HapMix. This suggests these three samples are not the result of a recent admixture and may represent a genetically distinct population (Supplementary Note).
In gorillas, following the separation of Eastern and Western lowland species (K=2), an increasing number of components further subdivide Western lowland populations distinguishing Congolese and Cameroonian gorillas—a pattern consistent with the structure observed in the PCA analysis (Suppl. Figure 8.2.9). One striking observation is the extent of admixed ancestry predicted for captive individuals when compared to wild-born. Our analysis suggests that most captive individuals included in this study are admixed from two or more genetically distinct wild-born populations leading to an erosion of phylogeographic signal. This finding is consistent with microsatellite analyses of captive gorillas15 and the fact that great ape breeding programs have not been managed at the subspecies level.
As great apes have been evolving on separate lineages since the middle Miocene, we attempted to reconstruct the history of these various species and subspecies by applying methods sensitive to branching processes, changes in effective population size (Ne), and gene flow occurring at different time scales. Using a combination of speciation times inferred from a haploid pairwise sequential Markovian coalescent (PSMC) analysis16, a coalescent hidden Markov model (CoalHMM)3, and incomplete lineage sorting approaches, we were able to estimate the most ancient split times and effective population sizes among the great ape species. By combining these estimates with an approximate Bayesian computation (ABC)17 analysis applied to the more complex chimpanzee phylogeny, we constructed a composite model of great ape population history over the last ~15 million years (Figure 2). This model presents a complete overview of great ape divergence and speciation events in the context of historical effective population sizes.
PSMC analyses of historical Ne (Figure 3) suggests that the ancestral Pan lineage had the largest effective population size of all lineages >3 million years ago (Mya), after which the ancestral bonobo-chimpanzee population experienced a dramatic decline. Both PSMC and ABC analyses support a model of subsequent increase in chimpanzee Ne starting ~1 Mya, prior to their divergence into separate subspecies. Following an Eastern chimpanzee increase in Ne (~500 thousand years ago, kya), the Central chimpanzees reached their zenith ~200–300 kya followed by the Western chimpanzee ~150 kya. Although the PSMC profiles of the two subspecies within each of the major chimpanzee clades (Eastern/Central and Nigeria-Cameroon/Western) closely shadow each other between 100 kya and 1 Mya, the Western chimpanzee PSMC profile is notable for its initial separation from that of the other chimpanzees, followed by its sudden rise and decline (Supplementary Note, Figure 3). The different gorilla species also show variable demographic histories over the past ~200 ky. Eastern lowland gorillas have the smallest historical Ne, consistent with smaller present-day populations and a history of inbreeding (Figure 1c). A comparison of effective population sizes with the ratio of non-synonymous to synonymous substitutions finds that selection has acted more efficiently in populations wit higher Ne, consistent with neutral theory (Supplementary Note).
Although the phylogeny of bonobos and Western, Central and Eastern common chimpanzees has been well established based on genetic data18, there is still uncertainty regarding their relationship to Nigeria-Cameroon chimpanzees2,19. Regional neighbor-joining trees and a maximum-likelihood tree estimated from allele frequencies both show that Nigeria-Cameroon and Western chimpanzees form a clade. A complex demographic history has been previously reported for chimpanzees with evidence of asymmetrical gene flow among different subspecies. For instance, Hey4 identified migration from Western into Eastern chimpanzees, two subspecies that are currently geographically isolated. We find support for this using the D-statistic, a model-free approach that tests whether unequal levels of allele sharing between an outgroup and two populations that have more recently diverged (D(H,W;E,C)>16SD). However, no previous genome-wide analysis that has examined gene flow included chimpanzees from the Nigeria-Cameroon subspecies and a comparison of them with Eastern chimpanzees results in a highly significant D-statistic (D(H,E;W,N)>25SD). Furthermore, TreeMix, a model-based approach that identifies gene flow events to explain allele frequency patterns not captured by a simple branching phylogeny, infers a signal of gene flow between Nigeria-Cameroon and Eastern chimpanzees (p=2x10300). A more detailed treatment of gene flow applying different models and methods may be found in the Supplementary Note.
Genetic diversity is depressed at or close to genes in almost all species (Suppl. Fig 11.1) with the effect less pronounced in subspecies with lower estimated Ne, consistent with population genetic theory. When we compare the relative level of X chromosome and autosomal (X/A) diversity across great apes as a function of genetic distance from genes, the Eastern lowland gorillas and Bornean orangutans are outliers, with substantially reduced X/A diversity compared to the neutral expectation of 0.75, regardless of the distance to genes. This pattern is consistent with a recent reduction in effective population size20, clearly visible in the PSMC analysis for both species (Figure 3). However, bonobos also demonstrate a relatively constant level of X/A diversity regardless of distance from genes, with values very much in line with neutral expectations. All other subspecies demonstrate a pattern consistent with previous studies in humans21 where X/A diversity is lower than 0.75 close to genes and higher farther away from genes.
It has been hypothesized that loss of gene function may represent a common evolutionary mechanism to facilitate adaptation to changes in an environment22. There has been speculation that the success of humans may have, in part, been catalyzed by an excess of beneficial loss-of-function mutations23. We, thus, characterized the distribution of fixed loss-of-function mutations among different species of great apes identifying nonsense and frameshift mutations resulting from SNVs (n=806) and indels (n=1080) in addition to gene deletion events (n=96) (Table S4). We assigned these events to the phylogeny and determined that the number of fixed loss-of-function mutations scales proportionally to the estimated branch lengths (R2=0.987 SNVs, R2=0.998 indels). In addition, we found no evidence of distortion on the terminal branches of the tree compared to point mutations based on a maximum likelihood analysis (Supplementary Note). Thus, the human branch in particular showed no excess of fixed loss-of-function mutations even after accounting for human-specific pseudogenes24 (Supplementary Note).
Our analysis provides one of the first genome-wide views of the major patterns of evolutionary diversification among great apes. We have generated the most comprehensive catalogue of SNPs for chimpanzees (27.2 million), bonobos (9.0 million), gorillas (19.2 million), and orangutans (24.3 million)(Table 1) to date and identified several thousand AIMs, which provides a useful resource for future analyses of ape populations. Humans, Western chimpanzees, and Eastern gorillas all show a remarkable dearth of genetic diversity when compared to other great apes. It is striking, for example, that sequencing of 79 great ape genomes identifies more than double the number of SNPs obtained from the recent sequencing of more than a thousand diverse humans25—a reflection of the unique out-of-Africa origin and nested phylogeny of our species.
We provide strong genetic support for distinct populations and subpopulations of great apes with evidence of additional substructure. The common chimpanzee shows the greatest population stratification when compared to all other lineages with multiple lines of evidence supporting two major groups: the Western and Nigeria-Cameroon and the Central and Eastern chimpanzees. The PSMC analysis indicates a temporal order to changes in ancestral effective population sizes over the last two million years, previous to which the Pan genus suffered a dramatic population collapse. Eastern chimpanzee populations reached their maximum size first, followed by the Central and Western chimpanzee. The Nigerian chimpanzee population size appears much more constant.
Despite their rich evolutionary history, great apes have experienced drastic declines in suitable habitat in recent years26, along with declines in local population sizes of up to 75%27. These observations highlight the urgency to sample from wild ape populations to more fully understand reservoirs of genetic diversity across the range of each species and to illuminate how basic demographic processes have affected it. The ~80 million SNPs we identified in this study may now be used to characterize patterns of genetic differentiation among great apes in sanctuaries and zoos and, thus, are of great importance for the conservation of these endangered species with regard to their original range. These efforts will greatly enhance conservation planning and management of apes by providing important information on how to maintain genetic diversity in wild populations for future generations.
We sequenced to a mean coverage of 25X (Illumina HiSeq 2000) a total of 79 great ape individuals, representing 10 subspecies and four genera of great apes from a variety of populations across the African continent and Southeast Asia. SNPs were called using GATK12 after BWA28 mapping to the human genome (NCBI Build 36) using relaxed mapping parameters. Samples combined by species were realigned around putative indels. SNP calling was then performed on the combined individuals for each species. For indels, we used the GATK Unified Genotyper to produce an initial set of indel candidates applying several quality filters and removing variants overlapping segmental duplications and tandem repeats. We also removed groups of indels clustering within 10 bp to eliminate possible artifacts in problematic regions. Conservative allelic imbalance filters were used to eliminate false heterozygotes that may affect demographic analyses, some of which are sensitive to low levels of contamination. We estimate that the application of this filter resulted in a 14% false negative rate for heterozygotes. Our multispecies study design facilitated this assessment of contamination, which may remain undetected in studies focused on assessing diversity within a single species. The amount of cross-species contamination was estimated from the amount of non-endogenous mitochondrial sequence present in an individual. Because we wished to compare patterns of variation between and within species, we report all variants with respect to coordinates of the human genome reference. For FRAPPE analyses, we used MAF0.06 (human, orangutan, and bonobo) and 0.05 (chimpanzee and gorilla) to remove singletons. For most of the analyses, we only used autosomal markers, except in the X/A analysis. To determine the amount of inbreeding, we calculated the heterozygosity genome-wide in windows of 1 Mbp with 200 kbp sliding windows. We then clustered together the neighboring regions to account for runs of homozygosity. For the PSMC analyses, we called the consensus bases using SAMtools29. Underlying raw sequence data is available through the SRA (PRJNA189439/SRP018689). Data generated in this work are available from http://biologiaevolutiva.org/greatape/. A complete description of the material and methods is provided in the Supplementary Note.
We thank the following funding agencies: ERC Starting Grant (260372) to T.M.-B.; NIH grants HG002385 to E.E.E., R01_HG005226 to K.R.V., A.E.W., M.F.H., L.S. and J.D.W., GM100233 and NSF HOMINID grant 1032255 to D.R. and He.Li.; MICINN (Spain) BFU2011-28549 to T.M.-B., BFU2010-19443 to Ja.Be., Spanish Government and FEDER for grants BFU2009-13409-C02-02 and BFU2012-38236 to A.N. and J.P.-M., Direcció General de Recerca, Generalitat de Catalunya (Grup de Recerca Consolidat 2009 SGR 1101) to Ja.Be., D.C., A.N. and T.M.-B.; ERC Advanced Grant (233297) and Max Planck Society to S. Paabo; Danish Council for Independent Research Natural Sciences to H.S.; Spanish Grant (CGL-2010-20170) and Zoo de Barcelona (Beca PRIC) to A.R.-H.; EUPRIM-Net to BPRC; DP1ES022577-04 NIH grant to S.A.T.; NSF Grant 0755823 to M.K.G.; P.G. is supported by the G. Harold and Leila Y. Mathers Foundation. A.N. and T.M.-B. are ICREA Research Investigators (Institut Catala d’Estudis i Recerca Avancats de la Generalitat de Catalunya). J.P.-M. is supported by the Zoo de Barcelona and l’Ajuntament de Barcelona. P.H.S. is supported by an HHMI International Student Fellowship. E.E.E. is an investigator with the Howard Hughes Medical Institute. We are especially grateful to all those who generously provided the samples for the project: O. Thalmann and H. Siedel from Limbe Sanctuary; R. Garriga from Tacugama Sanctuary; W. Schempp (University of Freiburg), Burgers' Zoo; Zoo of Antwerp; Wilhelma Zoo; Givskud Zoo; Ngamba Island Chimpanzee Sanctuary and Centre de Primatologie; Centre International de Recherches Médicales de Franceville; North Carolina Zoological Park; Zoo Atlanta; the Lincoln Park Zoo (Chicago); the Antwerp Zoo and the Limbe Wildlife Centre (Cameroon); D. Travis from University of Minnesota and M. Kinsel from University of Illinois Urbana-Champaign and S. Paabo and L. Vigilant, Max Planck Institute for Evolutionary Anthropology. We thank T. Brown for revising the manuscript, L. Capilla and E. Eyras for technical support, and M. Dierssen for comments on genes expressed in the brain.
We are especially grateful to all those who generously provided the samples for the project: O. Thalmann and H. Siedel from Limbe Sanctuary; Rosa Garriga from Tacugama Sanctuary; Professor Dr. Werner Schempp (University of Freiburg), Burgers’ Zoo; Zoo of Antwerp; Wilhelma Zoo; Givskud Zoo; Ngamba Island Chimpanzee Sanctuary and Centre de Primatologie; Centre International de Recherches Médicales de Franceville; North Carolina Zoological Park; Zoo Atlanta, the Lincoln Park Zoo (Chicago); the Antwerp Zoo and the Limbe Wildlife Centre (Cameroon); Dominic Travis from University of Minnesota and Michael Kinsel from University of Illinois Urbana-Champaign and Dr. Svante Paabo, Max Planck Institute for Evolutionary Anthropology. Finally, we thank Tonia Brown for revising the manuscript, Laia Capilla and Eduardo Eyras for technical support, and Mara Dierssen for comments on genes expressed in the brain.
The authors declare no competing financial interests.
AUTHOR CONTRIBUTIONSEEE and TM-B designed the study. JP-M, PHS, JMK, JLK, BL-G, MD, MF-C, JCM, CDB, EEE, and TM-B analyzed the raw data and performed the variant calling. JP-M, PHS, MM, JHG, IH, CB, LV, AR-H, and CC validated the different variants. JP-M, PHS, BL-G, CA, FH, EEE, and TM-B analyzed large variants. KV, AW, and MH analyzed the X/Autosome diversity. DT, GS, AC, CT, FC, HL, KP, MP, ML, NP, DC, JB, AN, and AMA performed selection analyses. JP-M, PHS, TDO, HL, DR, KM, AH, AEH, MHS, CH, JMA, TM, CDB, EEE, and TM-B analyzed different aspects of demography. MLW, LS, TA, IK, RWP, AP, FL, JK, EL, HS, MKG, SAT, RB, OAR, and BHH provided critical samples and participated in the discussion of phylogeny. LF, RKW, JB, EEE, MM, LA-C, MG, and IGG generated genome libraries and produced the genome sequence associated with this project. All authors contributed to data interpretation. JP-M, PHS, EEE and TM-B drafted the manuscript with input from all authors.