An integrated map of structural variation in 2,504 human genomes
SV discovery & genotyping: R.E.H., P.H.S., T.R., E.J.G., A.Ab., K.Y., F.H., K.C., G.D., K.W., M.H.-Y.F., S.K., C.A., S.A.M., R.E.M., M.B.G., S.E.D., E.E.E., J.O.K.; SV merging & haplotype-integration: T.R., R.E.H., M.H.-Y.F., E.G., A.Me., S.McC.; SV validation: R.E.H., A.Ab., G.J., M.H.-Y.F., A.M.S., M.K.K., A.Ma., S.K., M.M., M.J.P.C., S.M., P.C., S.E., J.M.K., B.R., J.A.W., F.Y., T.Z., M.A.B., R.E.M., A.B., C.L., E.E.E., J.O.K.; additional analyses: A.Au., C.M., E.C., E.D., E.-W.L., F.K., J.H., Y.Z., X.S., F.P.C., M.M., M.J.P.C., G.M., S.M., D.A., T.B., J.C., Z.C., L.D., X.F., M.G., J.M.K., H.Y.K.L., Y.K., X.J.M., B.J.N., A.N., R.A.G., M.P., M.R., R.S., D.M.M., M.W., N.F.P., A.Q., E.S., A.S., A.A.S., A.U., C.Z., J.Z., W.Z., J.S., O.S.; data management & archiving: L.C., X.Z.-B., P.F.; display items: P.H.S., T.R., E.J.G., A.A., Y.Z., J.H., M.H.-Y.F., K.Y., M.B.G., A.B., O.S., R.E.M., S.E.D., E.E.E., J.O.K.; organization of Supplementary Material: G.D., J.O.K., P.H.S., R.E.M.; SV Analysis group co-chairs: C.L., E.E.E., J.O.K.; manuscript writing: P.H.S., T.R., E.J.G., J.H., R.E.M., M.B.G., O.S., S.E.D., E.E.E., J.O.K.
Structural variants (SVs) are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight SV classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype-blocks in 26 human populations. Analyzing this set, we identify numerous gene-intersecting SVs exhibiting population stratification and describe naturally occurring homozygous gene knockouts suggesting the dispensability of a variety of human genes. We demonstrate that SVs are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of SV complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex SVs with multiple breakpoints likely formed through individual mutational events. Our catalog will enhance future studies into SV demography, functional impact and disease association.
SVs, including deletions, insertions, duplications and inversions, account for most varying base pairs (bp) among individual human genomes1. Numerous studies have implicated SVs in human health with associated phenotypes ranging from cognitive disabilities to predispositions to obesity, cancer and other maladies1,2. Discovery and genotyping of these variants remains challenging, however, since SVs are prone to arise in repetitive regions and internal SV structures can be complex3. This has created challenges for genome-wide association studies (GWAS)4,5. Despite recent methodological and technological advances6-9, efforts to perform discovery, genotyping, and statistical haplotype-block integration of all major SV classes have so far been lacking. Earlier SV surveys depended on microarrays10 as well as genomic and clone-based approaches limited to a small number of samples11-15. More recently, short-read DNA sequencing data from the initial phases of the 1000 Genomes Project8,9 enabled us to construct sets of SVs, genotyped across populations, with enhanced size and breakpoint resolution6,7. Previous 1000 Genomes Project SV set releases, however, encompassed fewer individuals and were largely6 or entirely8 limited to deletions, in spite of the relevance of other SV classes to human genetics1,2,4.
The objective of the Structural Variation Analysis Group has been to discover and genotype major classes of SVs (defined as DNA variants ≥50 bp) in diverse populations and to generate a statistically phased reference panel with these SVs. Here we report an integrated map of 68,818 SVs in unrelated individuals with ancestry from 26 populations (Supplementary Table 1). We constructed this resource by analyzing 1000 Genomes Project phase 3 whole-genome sequencing (WGS) data16 along with data from orthogonal techniques, including long-read single-molecule sequencing (Supplementary Table 2), to characterize hitherto unresolved SV classes. Our study emphasizes the population diversity of SVs, quantifies their functional impact, and highlights previously understudied SV classes including inversions exhibiting marked sequence complexity.
Construction of our phase 3 SV release
We mapped Illumina WGS data (~100 bp reads, mean 7.4-fold coverage) from 2,504 individuals onto an amended version8 of the GRCh37 reference assembly using two independent mapping algorithms, BWA17 and mrsFAST18, and performed SV discovery and genotyping using an ensemble of nine different algorithms (ED Figure 1 and Supplementary Notes). We applied several orthogonal experimental platforms for SV set assessment, refinement and characterization (Supplementary Table 2) and to calculate the false discovery rate (FDR) for each SV class (Table 1). Callset refinements facilitated through long-read sequencing enabled us to incorporate a number of additional SVs into our callset, including an additional 698 inversions and 9,132 small (<1 kbp) deletions, compared to the SV set released with the 1000 Genomes Project marker paper16. As a result, our callset differs slightly from the marker paper’s SV set16 (see Supplementary Table 2). We merged individual callsets to construct our unified release (Table 1), comprising 42,279 biallelic deletions, 6,025 biallelic duplications, 2,929 mCNVs (multi-allelic copy-number variants), 786 inversions, 168 nuclear mitochondrial insertions (NUMTs), and 16,631 mobile element insertions (MEIs, including 12,748, 3,048 and 835 insertions of Alu, L1 and SVA (SINE-R, VNTR and Alu composite) elements, respectively).
SV non-reference genotype concordance estimates ranged from ~98% for biallelic deletions and MEI classes to ~94% for biallelic duplications. Overall, 60% of SVs were novel with respect to the Database of Genomic Variants (DGV)19 (50% reciprocal overlap criterion, Figure 1a), and 71% of SVs (50% reciprocal overlap) and 60% of collapsed copy-number variable regions (CNVRs, 1 bp overlap) were novel compared to previous 1000 Genomes Project releases6,8, reflecting methodological improvements and inclusion of additional populations. Novel SVs showed enrichment for rare sites, which we detected down to an autosomal allele count of ‘1’. While variations in FDR estimates were evident with SV size and VAF (variant allele frequency), we consistently estimated the FDR at ≤5.4% when stratifying deletions and duplications by size and frequency, including for rare SVs with VAF<0.1% (ED Figures 1, 2). A comparison with deep-coverage Complete Genomics (CG) sequencing data indicated an overall sensitivity of 88% for deletions and 65% for duplications, with the false negatives driven largely by the relatively lowered sensitivity for ascertaining small SVs in Illumina sequencing data (Figure 1b, ED Figure 3). The average per-individual sensitivity was similar for deletions (89%) and lower for duplications (50%). For MEI classes, estimated sensitivities ranged from 83%–96% (Table 1) compared to the 1000 Genomes Project pilot phase where a different MEI detection tool was used20. For inversions, we estimated an overall sensitivity of 32% based on variants with a positive validation status recorded in the InvFEST database21, with an increased sensitivity of 67% for inversions <5 kbp in size.
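The 50% reciprocal overlap criterion used for the novelty comparisons can be illustrated with a minimal Python sketch. It assumes half-open (start, end) coordinates on a single chromosome; the function names `reciprocal_overlap` and `is_novel` are hypothetical and do not correspond to any tool used in this study.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions for half-open intervals [start, end)."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    # The overlap must cover the required fraction of BOTH intervals,
    # so the smaller of the two fractions is the one that matters.
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def is_novel(sv, known_svs, threshold=0.5):
    """Call an SV novel if no known SV reaches the reciprocal overlap threshold."""
    return all(
        reciprocal_overlap(sv[0], sv[1], k[0], k[1]) < threshold
        for k in known_svs
    )
```

Under this definition a small SV nested inside a much larger known SV still counts as novel, because the overlap covers only a small fraction of the larger interval.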
We performed breakpoint assembly using pooled Illumina WGS and Pacific Biosciences (PacBio) sequencing data22, and additionally performed split-read analysis23 of short reads, to resolve the fine-resolution breakpoint structure of 37,250 SVs (29,954 deletions, 357 tandem duplications, 6,919 MEIs, and 20 inversions; Supplementary Table 3). Breakpoint assemblies showed a mean boundary precision of 0–15 bp for all SV types with the exception of inversions and duplications for which we achieved mean precision estimates of 32 bp and 683 bp, respectively (Table 1, Figure 1c).
Population genetic properties of SVs
We explored the population genetic properties of SVs among five continental groups: Africa (AFR), the Americas (AMR), East Asia (EAS), Europe (EUR) and South Asia (SAS). The bulk of SVs occur at low frequency (65% exhibit VAF<0.2%), a pattern consistent among individual SV classes (ED Figures 2, 3). While rare SVs are typically specific to individual continental groups, at VAF≥2% nearly all SVs are shared across continents (Figure 1d, ED Figure 3). Notably, we identified 1,075 SVs with VAF>50% (889 biallelic deletions, 2 biallelic duplications, 90 mCNVs, 88 MEIs and 6 inversions) encompassing 5 Mbp; these are sites of interest for future updates to the human reference genome. We estimated the mutation rate for each SV class using Watterson’s estimator of θ, for example, ascertaining a mutation rate of 0.113 deletions per haploid genome per generation, a threefold higher estimate compared with previous reports10,24, likely due to our increased power for detecting variants <5 kbp (Supplementary Note).
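For orientation, Watterson’s estimator divides the number of segregating sites by the harmonic number over the sampled chromosomes, and θ = 4·Ne·μ relates it to a per-generation rate. The Python sketch below is only an illustration: the effective population size of 10,000 is an assumption made here, not a value stated in the main text, and the procedure actually used (Supplementary Note) may include corrections, so the result should not be expected to reproduce the reported estimate exactly.

```python
def watterson_theta(segregating_sites, n_chromosomes):
    """Watterson's estimator: S divided by a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return segregating_sites / a_n


def per_generation_rate(theta, effective_size):
    """Convert theta = 4 * Ne * mu into mu (events per haploid genome per generation)."""
    return theta / (4.0 * effective_size)


# 42,279 biallelic deletion sites across 2,504 diploid individuals (5,008 chromosomes).
theta_del = watterson_theta(42279, 2 * 2504)
# Ne = 10,000 is an ASSUMED value for illustration only.
mu_del = per_generation_rate(theta_del, effective_size=10_000)
```

Treating every deletion site as a single segregating site ignores recurrent mutation and imperfect sensitivity, which is one reason a naive calculation of this kind can differ from a fully corrected estimate.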
We found that 73% of SVs with >1% VAF and 68% of rarer SVs (VAF>0.1%) are in linkage disequilibrium (LD) with nearby single nucleotide polymorphisms (SNPs) (r2>0.6); however, the proportion of variants in LD highly depends on the SV class (Figure 1e, ED Figure 4). For example, only 44% of all biallelic duplications with VAF>0.1% were in LD with a nearby SNP (r2>0.6), in agreement with previous findings10,25,26. Notably, we observed a striking depletion of biallelic duplications amongst common SVs (P<2×10−16, KS-test; ED Figure 5) with most common duplications classified as multi-allelic SVs (i.e., mCNVs). This behavior suggests extensive recurrence of SVs at duplication sites consistent with what was recently observed in a smaller cohort of 849 individuals27. These LD characteristics suggest duplications are currently under-ascertained for disease associations using tag-SNP-based approaches.
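The LD statistic referred to here, the squared correlation between alleles at two biallelic sites, can be computed directly from phased haplotypes. The following Python sketch assumes each site is encoded as a list of 0/1 alleles, one entry per haplotype in the same order; the function name `r_squared` is illustrative.

```python
def r_squared(hap_sv, hap_snp):
    """Squared allelic correlation between two biallelic sites on phased haplotypes."""
    n = len(hap_sv)
    p_a = sum(hap_sv) / n    # alternate-allele frequency at the SV
    p_b = sum(hap_snp) / n   # alternate-allele frequency at the SNP
    # Frequency of haplotypes carrying the alternate allele at both sites.
    p_ab = sum(a * b for a, b in zip(hap_sv, hap_snp)) / n
    d = p_ab - p_a * p_b     # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0           # a monomorphic site carries no LD information
    return d * d / denom
```

A tag-SNP search would evaluate this quantity between an SV and each SNP in a surrounding window and retain the maximum; for multi-allelic sites such as mCNVs a different measure is needed, since this formulation assumes two alleles per site.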
Based on our haplotype-resolved SV catalog, we observed that individuals of African ancestry exhibit, on average, 27% more heterozygous deletions than individuals from other populations (mean of 1,705 vs. 1,342) consistent with SNPs28 (ED Figure 5). The relative proportion of deletion- versus SNP-affected sequence, however, showed a 13% excess in non-African compared to African populations (ratio 1.64 vs. 1.45). Principal component analyses with different SV classes generally recapitulated continental population structure and admixture (ED Figure 6 and Supplementary Note). Our analysis further allowed us to identify a catalog of 6,495 ancestry-informative MEI markers of potential value to population genetics history and forensics research (ED Figure 5, Supplementary Table 4).
Since population stratification can be used as a signature to detect adaptive selection, we additionally identified SVs varying in VAF amongst different populations. For each SV site we calculated a Vst statistic, a measure highly correlated with Fst (the fixation index)29 that can be applied to assess population stratification of biallelic and multi-allelic SVs29. We observed 1,434 highly stratified SVs (>0.2 Vst, corresponding to 2.9 standard deviations (s.d.) from the mean; Supplementary Table 5) among which 578 intersected gene coding sequences (CDSs). Among these were several SVs associated with regions previously reported to be under positive selection, such as KANSL1 mCNVs (ED Figure 6) that tag a European-enriched inversion polymorphism associated with increased fecundity30. Most of the population-stratified sites, however, have not been previously described and are, thus, potential targets for future investigation of SVs undergoing adaptive selection or genetic drift. These include, for example, a 14.5 kbp intronic duplication of HERC2 enriched in East Asians (Vst=0.62 EAS-EUR).
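As a reminder of how such a statistic is formed, Vst is commonly defined as (V_T − V_S)/V_T, where V_T is the variance of copy number across all individuals of the two populations compared and V_S is the size-weighted average of the within-population variances. The Python sketch below follows that common definition; the use of population (rather than sample) variance is an assumption made here, and the function name `vst` is illustrative.

```python
from statistics import pvariance


def vst(copy_numbers_pop1, copy_numbers_pop2):
    """Vst = (V_T - V_S) / V_T for two lists of per-individual copy numbers."""
    combined = list(copy_numbers_pop1) + list(copy_numbers_pop2)
    v_total = pvariance(combined)
    if v_total == 0:
        return 0.0  # no copy-number variation at this site
    # Within-population variance, weighted by population size.
    v_within = (
        pvariance(copy_numbers_pop1) * len(copy_numbers_pop1)
        + pvariance(copy_numbers_pop2) * len(copy_numbers_pop2)
    ) / len(combined)
    return (v_total - v_within) / v_total
```

Because the statistic operates on copy numbers rather than allele counts, the same calculation applies to biallelic and multi-allelic sites alike.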
Functional impact of SVs
We analyzed the intersection of deletions binned by VAF with various classes of genic and intergenic functional elements (Figure 2a, ED Figure 7). The CDSs, untranslated regions (UTRs) and introns of genes, in addition to ENCODE31 transcription factor binding sites and ultrasensitive noncoding regions, showed a significant depletion (P<0.001; permutation testing in each VAF bin) compared to a random background model. In general, these elements are more depleted (in terms of fold change) in common VAF bins compared to rarer deletion alleles in keeping with purifying10 (or in some cases background32) selection. Genes more intolerant to mutation (as measured from SNP diversity, residual variation intolerance score (RVIS)33 <20) exhibited the most pronounced depletion (P<0.001; permutation testing between pairs of RVIS-score categories). All other SV classes exhibited similar signatures of selection; when compared to deletions these depletions were, however, more attenuated (Figure 2b, ED Figure 7). Additional assessment of the site frequency spectrum showed that as deletion sizes increase these SVs become rarer (p<2.2e-16; linear model, F-test), evidence of purifying selection against events more likely intersecting functional elements. Duplications, by comparison, did not exhibit such a trend, consistent with reduced selective constraints (Supplementary Note).
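The logic of a permutation test for depletion can be sketched in Python as follows. This is a simplified illustration rather than the background model used in the study: it assumes a single linear toy genome, places each deletion uniformly at random while preserving its length, and asks how often the shuffled set overlaps functional elements no more often than the observed set. All function and parameter names are hypothetical.

```python
import random


def count_overlaps(svs, elements):
    """Number of SVs overlapping at least one element (half-open intervals)."""
    return sum(
        any(s_start < e_end and e_start < s_end for e_start, e_end in elements)
        for s_start, s_end in svs
    )


def depletion_p_value(svs, elements, genome_length, n_perm=1000, seed=0):
    """Empirical one-sided P value for depletion of SV-element overlaps."""
    rng = random.Random(seed)
    observed = count_overlaps(svs, elements)
    as_low = 0
    for _ in range(n_perm):
        shuffled = []
        for start, end in svs:
            length = end - start
            # Uniform random placement that preserves the SV length.
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_overlaps(shuffled, elements) <= observed:
            as_low += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_low + 1) / (n_perm + 1)
```

With 1,000 permutations the smallest attainable value is roughly 0.001, so the number of permutations sets the resolution of the reported P value. A realistic background model would also control for features such as mappability and chromosome, which this sketch ignores.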
We additionally analyzed 5,819 homozygous deletions to search for gene knockouts naturally occurring in human populations. Among these we identified 240 genes (corresponding to 204 individual deletion sites), which based on the observation of homozygous losses in normal individuals appear “dispensable” (Supplementary Table 6). Most of the underlying deletions were found in more than one human population, and for only one (0.5%) we observed evidence for the putative involvement of uniparental disomy in the homozygosity (Supplementary Note). The majority (>80%) of these homozygous gene losses were novel compared to a previous analysis based on DGV variants19, or recent clinical genomics studies (Supplementary Note). As expected, genes affected by homozygous loss were not highly conserved and were relatively tolerant to other forms of genetic variation (RVIS=0.74 compared to OMIM disease genes showing RVIS=0.43; p=9.4e-25; Mann-Whitney test). Moreover, the set was functionally enriched for glycoproteins (Benjamini Hochberg corrected p-value=1.6×10−3, EASE [Expression Analysis Systematic Explorer] score) and genes harboring immunoglobulin domains (Benjamini Hochberg corrected p-value=1.0×10−5, EASE score).
We next quantified the functional impact of SVs using expression quantitative trait loci (eQTL) associations as a surrogate34,35. Based on transcriptome data from lymphoblastoid cell lines derived from 462 individuals36 (the gEUVADIS consortium), we tested 18,969 expressed protein-coding genes for cis-eQTL associations, considering 1 Mbp candidate regions upstream and downstream of CDSs. A joint eQTL analysis using SNPs, InDels and SVs with VAF>1% identified 54 eQTLs with a lead SV association (denoted SV-eQTL) and 9,537 eQTLs with a lead SNP/InDel association (10% FDR). For an additional 166 eQTLs with lead associations to SNPs or InDels, we observed SVs in LD (r2>0.5) seven times more than when using random variants matched for LD structure, distance to the transcription start site, and VAF, suggesting that a larger number of eQTLs are likely impacted by SVs (ED Figure 8, Supplementary Table 7). In proportion to the number of variants tested, SV classes were up to ~50-fold enriched for SV-eQTLs (p=2.84×10−39, one-sided Fisher’s exact test; Supplementary Table 8). Large SVs were associated with increased effect size, for example, a twofold increase in effect size for genic SVs >10 kbp versus variants <1 kbp (P=0.0004; t-test; ED Figure 8). Taken together, although SNPs contribute more eQTLs overall, our results suggest SVs have a disproportionate impact on gene expression relative to their number.
Among those 220 eQTLs having either an SV-eQTL or an SV in LD with the lead SNP/InDel, most were due to deletions (55% of associations) followed by mCNVs (19%) (Supplementary Table 8). Although SV-eQTLs with the largest effect sizes tended to overlap with CDSs, such as for the dual specificity phosphatase 22 (DUSP22) gene (Figure 2c), we also observed several expression-associated SVs strictly intersecting upstream noncoding sequences, including an mCNV upstream of ZNF43 (Figure 2d) possibly mediated through variation of a cis-regulatory element. We additionally considered the impact of accounting for SVs when constructing personalized reference genomes for transcriptome analysis. To illustrate this, we considered RNA read alignments for the sample NA12878, comparing the standard reference genome with GRCh37-derived personalized references constructed using NA12878 SNPs, or using NA12878 SNPs and SVs. Using such an approach, we observed marked changes in expression for 525 exons (±10 reads, ≥1-fold change relative to the standard reference), 24 of which could be attributed to the inclusion of SVs into the personalized reference (Supplementary Table 9).
The relevance of SVs to eQTLs suggests that a number of disease associations previously detected by GWAS may be attributable to SVs, which are difficult to assess directly in GWAS. To test this hypothesis we compared 12,892 previously reported SNP-based GWAS hits to SVs identified in our dataset, identifying 136 candidate SVs in strong LD (r2>0.8) with GWAS variants, which represents a 1.5-fold enrichment when compared to a VAF and haplotype size-matched background set and a 3-fold enrichment for deletions >20 kbp (P=0.004) (Figure 2e and Supplementary Note). Approximately a third of these candidate GWAS associations (39) were novel, impacting phenotypes such as colorectal cancer and bone mineral density (Supplementary Table 10). Interestingly, 64% of these novel associations were mediated by deletions <1 kbp, a size-range for which our study has improved power over previous surveys, which more than doubles (from 18 to 40) the number of SVs <1 kbp in strong LD with a GWAS lead SNP. Thus, our SV resource could facilitate discovery of numerous additional disease-linked SVs.
SV clustering and complexity
Advances in Illumina sequencing towards longer read lengths (~100 bp vs. 36 bp)6 in conjunction with the population-level data allowed us to perform an in-depth investigation of SV complexity and clustering. We identified 3,163 regions where SVs appeared to cluster (>2 SVs mapping within 500 bp; Supplementary Table 11). To reduce redundancy caused by multiple overlapping calls per sample, we calculated distinct CNVRs per cluster by merging calls per sample and haplotype and then counting the distinct CNVRs produced across samples (average 6.4 ± 7.2 CNVRs per cluster). We identified 30 genomic regions with an excess of CNVRs (>4 s.d. or >36 CNVRs per cluster). This clustering effect was not correlated with segmental duplications (r=0.02) and only partially explained by SNP diversity (r=0.15; ED Figure 9). CNVR clusters showed enrichment near late-replicating origins (p=0.013, permutation test) and at cytogenetically defined ‘fragile’ sites (p=0.0017; permutation test). Although the proportion of gene content in regions exhibiting excessive SV clustering was significantly reduced when compared to a null distribution (p<0.000001, permutation test), 1,881 of 3,163 such regions (59%) intersected one or more genes (Supplementary Table 11). This includes a region comprising 47 SVs (ranking 2nd out of the 30 genomic regions with >4 s.d.) encompassing the pregnancy-specific glycoprotein gene family (Figure 3a), a set of genes thought to be critically important for maintenance of pregnancy37. Other SV clusters associated with genes (e.g., IMMP2L, CHL1 and GRID2) have been implicated as potential risk factors for disease, including neurodevelopmental disorders38.
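One way to read the clustering criterion (>2 SVs mapping within 500 bp) is as single-linkage grouping of position-sorted SVs, where an SV joins the current group if it starts within 500 bp of the furthest end seen so far. The Python sketch below implements that reading for one chromosome; it is an assumption about the procedure, not a description of the exact implementation, and the function name `cluster_svs` is hypothetical.

```python
def cluster_svs(svs, max_gap=500, min_svs=3):
    """Group (start, end) SVs whose gaps are <= max_gap; keep groups with >= min_svs members."""
    clusters, current, current_end = [], [], None
    for start, end in sorted(svs):
        # Close the running group once the next SV starts too far away.
        if current and start - current_end > max_gap:
            clusters.append(current)
            current, current_end = [], None
        current.append((start, end))
        # Track the furthest end so SVs nested in a large call stay grouped.
        current_end = end if current_end is None else max(current_end, end)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_svs]
```

Overlapping SVs have a negative gap and are therefore always grouped together. Counting distinct CNVRs per cluster, as described above, would be a separate step applied to the per-sample, per-haplotype calls within each group.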
We additionally specifically assessed the complexity of the 29,954 deletions with resolved breakpoints and found that 6% (1,822) intersected another deletion with distinct breakpoints. A larger fraction (16% or 4,813 of assembled deletion sites) showed the presence of additional inserted sequence at deletion breakpoints. We grouped 1,651 deletions with mean size of 3.1 kbp and at least 10 bp of additional DNA sequence between the original SV site boundaries into five broad classes (Figure 3b, Supplementary Table 12). The most common class (N=501, 30.3%), termed Ins with Dup and Del, comprised deletions exhibiting a recognizable duplicated sequence interval within the respective inserted sequence. Notably, in many cases (N=191) the inserted sequences comprised two or more apparent sequence duplications at the deletion boundaries (a class denoted Ins with MultiDup and Del). Additional classes commonly observed include Inv and Del (inversion with adjacent deletion; N=9) and MultiDel—a class where two or more adjacent deletions are separated by at least one sequence “spacer” of up to ~204 bp in length (N=370). However, not all complex SVs fit into these classes, with 214 sites forming distinct patterns corresponding to multiple classes or exhibiting increased complexity. Template-switching mechanisms could explain the notable complexity of these SVs3. Indeed, microhomology patterns were typically present between the breakpoints of deletions and the respective boundaries of insertion templates at these sites (ED Figure 9) consistent with formation through single mutational events (Supplementary Note). Across the complex sites assessed, 871 (53%) showed evidence for a local template (≥10 bp match, within 10 kbp), whereas for 41 the insertion was presumably templated from a distal region (≥22 bp match, >10 kbp away), including 17 sites where the DNA stretch was likely derived from RNA templates (Supplementary Table 13).
To further characterize SV breakpoint complexity, we employed two alternative approaches that do not rely on low-coverage Illumina read assembly. We first examined 7,804 small deletions for breakpoint complexity using split-read analysis23 (Figure 3c) and identified 664 (median size: 67 bp) exhibiting complexity, 64 of which contained insertions ≥3 bp that may be derived from a nearby template (Supplementary Table 14, ED Figure 9). We additionally realigned long DNA reads from a single individual (NA12878)22 sequenced by high-coverage PacBio (median read length=3.0 kbp) and Moleculo (median=3.2 kbp) single-molecule WGS around deletions from our release set (Figure 3d). Out of 766 deletions in NA12878 investigated with this approach, 62 exhibited complexity showing three to six breakpoints (Supplementary Table 12). A deletion of exon 3 of the serine protease inhibitor SPINK14, for example, was accompanied by an inversion of an internal segment of the SV sequence (Figure 3d panel i). In contrast to the smaller proportion of deletions showing breakpoint complexity, the majority of inversions assessed in NA12878 (19/28) exhibited multiple breakpoints.
To further explore inversion sequence complexity, we performed a battery of targeted analyses, leveraging PacBio resequencing of fosmids (targeting 34 loci), sequencing by Oxford Nanopore MinION (60 loci) and PacBio (206 loci) of long-range PCR amplicons, and data for 13 loci from another sample (CHM1) sequenced by high-coverage PacBio WGS14. Altogether we verified and further characterized 229 inversion sites, 208 using long-read data and 21 by PCR (Supplementary Table 15), increasing the number of known validated inversions21 by >2.5-fold. Remarkably, only 20% of all sequenced inversions characterized in this manner were “simple” (termed Simple Inv), exhibiting two breakpoints (Figure 3e), including a 2 kbp inversion on chromosome 4 intersecting a regulatory exon of the Ras homolog family member RHOH (Figure 3d panel iv). The majority of inversions (54%) corresponded to inverted duplications (Inverted Dup; Figure 3d panel iii). In nearly all cases, these involved duplicated stretches <1 kbp inserted within 5 kbp of the alternate copy, suggesting a common mechanism of SV formation (ED Figure 10). The remaining inversions comprised Inv and Del events (14%), MultiDel events exhibiting inverted spacers (7%), and more highly complex sites (5%; Figure 3d panel ii). The appreciable inversion complexity uncovered here is most likely due to a mutational process forming complex SVs, potentially involving DNA replication errors3, rather than due to recurrent rearrangement, as our analyses failed to detect corresponding intermediate events in 1000 Genomes Project samples.
We present the most comprehensive set of human SVs to date as an integrated resource for future disease and population genetics studies. We estimate individuals harbor a median of 18.4 Mbp of SVs per diploid genome, an excess contributed to a large extent by mCNVs (11.3 Mbp) and biallelic deletions (5.6 Mbp; Table 1). When collapsing mCNV sites carrying multiple copies as well as homozygous SVs onto the haploid reference assembly, a median of 8.9 Mbp of sequence are affected by SVs, compared to 3.6 Mbp for SNPs. Furthermore, 37,250 SVs have mapped breakpoints amounting to >113 Mbp of SV sequence resolved at the nucleotide-level. By mining homozygous deletions we identified over two hundred nonessential human genes, a set enriched for immunoglobulin domains that hence may reflect variation in the immune repertoire underlying inter-individual differences in disease susceptibility.
We demonstrate that SV classes are disproportionally enriched (by up to ~50-fold) for SV-eQTLs, although only 220 SVs were either found as lead eQTL association or in high LD with the respective lead SNP. While this corresponds to proportionally fewer associations relative to SNPs compared to a prior estimate based on array technology34, this may be explained by the reliance of this prior estimate on bacterial artificial chromosome arrays, which ascertain large SVs (>50 kbp) that associate with strong effect size, as well as by the relative scarcity of SNPs tested in an earlier study34 (HapMap Phase I)39. We further expand the number of candidate SVs in strong LD with GWAS hits by ~30% (39/136 novel associations implicating SVs as candidates) and find that GWAS haplotypes are enriched up to threefold for common SVs, which emphasizes the relevance of ascertaining SVs in disease studies. The large number of novel SVs smaller than 1 kbp in length associated with previously reported GWAS hits highlights the importance of increasing sensitivity for SV detection and genotyping at this size range. Additionally, the large number of rare SVs captured by our resource may be of value for disease association studies investigating rare variants.
Our deep population survey has identified hotspots of SV mutation that cannot be accounted for by deep coalescence or segmental duplication content. We report previously undescribed patterns of SV complexity, particularly for inversions. These patterns indicate that other more complex mutational processes outside of non-allelic homologous recombination, retrotransposition, and non-homologous end-joining played an important role in shaping our genome. In spite of this, it remains difficult to fully disentangle the contributions of SV mutation rates and selective forces to the observed variant clustering. The findings presented here leveraged substantial recent technological advances, including increases in Illumina read length and developments in long-read DNA technologies. SV discovery remains a challenge nonetheless, and the full complexity and spectrum of SVs are not yet understood. Our analyses, for example, are largely based on 7.4-fold Illumina WGS and, thus, are underpowered to capture much of the complexity of variation, including SVs in repetitive regions, non-reference insertions, and short SVs at the boundaries of the detection limits of read-depth and paired-end-based SV discovery4. Furthermore, while many SVs in our callset are statistically phased, the diploid nature of the genome is non-optimally captured by current analysis approaches, which mostly rely on mapping to a haploid reference. We envision that in the future, the use of technology allowing substantial increases in read lengths over the current state-of-the-art will enable genomic analyses of truly diploid sequences to facilitate targeting these additional layers of genomic complexity. Until this is realized, our SV set represents an invaluable resource for the construction and analysis of personalized genomes.
We thank Matthew Hurles, Richard Durbin and David Reich for valuable comments during the preparation of this work, Stephen Scherer for providing PCR-based inversion genotyping data for the initial calibration of our inversion caller, Ben Nelson and Vladimir Benes for technical assistance, and Tonia Brown and Nina Habermann for critical review of the manuscript. The following people are acknowledged for contributing to PacBio sequencing or analysis: Ekta Patel, Sandra Lee, Harsha Doddapaneni, Lora Lewis, Robert Ruth, Qingchang Meng, Vanesa Vee, Yi Han, Joy Jayaseelan, Adam English, Jonas Korlach, Mike Hunkapiller, Bruno Hüttel and Richard Reinhardt. We acknowledge the Yale University Biomedical High-Performance Computing Center and high-performance compute infrastructure made available through the EMBL and EMBL-EBI IT facilities. We thank the people generously contributing samples to the 1000 Genomes Project. Funding for this research project came from the following grants: NIH U41HG007497 (to C.L., E.E.E., J.O.K., M.A.B., M.G., S.A.M., R.E.M. and J.S.), RO1GM59290 (M.A.B.), R01HG002898 (S.E.D.) and R01CA166661 (S.E.D.), P01HG007497 (to E.E.E.), R01HG007068 (to R.E.M.), RR19895 and RR029676-01 (to M.B.G.), Wellcome Trust WT085532/Z/08/Z and WT104947/Z/14/Z (to P.F.), an Emmy Noether Grant from the German Research Foundation (KO4037/1-1, to J.O.K.) and the European Molecular Biology Laboratory. C.L. is on the scientific advisory board (SAB) of BioNano Genomics. E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program. P.F. is on the SAB of Omicia, Inc. C.L. is an Ewha Womans University Distinguished Professor. E.E.E. is an investigator of the Howard Hughes Medical Institute. J.O.K. is a European Research Council investigator.
Data deposition statement: Sequencing data, archive accessions and supporting datasets including GRCh37 variant call files comprising the extended SV Analysis Group release set, a “readme” describing differences to the phase 3 marker paper variant release16, and a GRCh38 version of our callset, are available at www.1000genomes.org/phase-3-structural-variant-dataset. DGV archive accession: estd219.
Competing financial interests statement: E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program. P.F. is on the SAB of Omicia, Inc.