Description

Structural variants (SVs) are large changes in DNA — deletions, duplications, inversions, insertions of mobile elements, and translocations — that are at least 50 base pairs in size. They are a major source of genetic variation between individuals and can affect gene dosage, disrupt coding sequence, or rearrange regulatory elements. Because SVs are harder to detect than small variants, population-scale SV maps lag behind single-nucleotide variant resources.

This track displays site-frequency data for 738,624 SVs identified in 17,795 deeply sequenced human genomes (mean coverage > 20×) by Abel et al., Nature 2020. The samples were sequenced by the four sequencing centers of the NHGRI Centers for Common Disease Genomics (CCDG) program, supplemented with ancestrally diverse samples from the PAGE consortium and the Simons Genome Diversity Project. The composition includes roughly 24% African, 16% Latino, 11% Finnish, 39% non-Finnish European, and 9% other ancestries.

Two public callsets, which partially overlap in their samples, are displayed as a single track:

Important: the B38 and B37 callsets share 5,245 samples. When inspecting a variant present in both callsets, users should not sum the allele counts, because the shared samples would be counted twice. The allele count (AC) and allele number (AN) reported for each callset reflect only that callset's sample set. The callset filter can be used to restrict display to one source.
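As a minimal sketch of the point above, the following Python snippet computes an allele frequency separately for each callset instead of pooling the counts. The AC/AN values and the dictionary layout are made up for illustration and are not taken from the track.

```python
# Hypothetical AC/AN values for one variant seen in both callsets.
# The numbers are illustrative only, not taken from the track.
records = {
    "B38": {"AC": 30, "AN": 20000},
    "B37": {"AC": 12, "AN": 10000},
}


def allele_frequency(ac, an):
    """Return AC/AN, or None when no alleles were genotyped."""
    return ac / an if an > 0 else None


# Report each callset on its own: the callsets share samples, so
# pooled counts would count the shared genomes twice.
per_callset_af = {
    name: allele_frequency(rec["AC"], rec["AN"]) for name, rec in records.items()
}

for name, af in per_callset_af.items():
    print(f"{name}: AF = {af:.5f}")
```

Keeping the two frequencies side by side also makes it easy to see when the callsets disagree about a variant.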

Display conventions

Items are colored by SV type:

Deletions, duplications, inversions, and mobile-element variants are drawn as intervals spanning from the variant start to its end. Breakend (BND) records are drawn as single-base items at the variant breakpoint; the mate chromosome and position are shown on the details page for each BND. Each BND pair from LUMPY is shown only once (the secondary mate record is suppressed).

Filters

The following filters are available from the track configuration page:

Per-population allele counts (AC) and allele numbers (AN) are shown on the details page for eight ancestry groups: AFR (African), AMR (Latino/Admixed-American), NFE (non-Finnish European), FE (Finnish European), EAS (East-Asian), SAS (South-Asian), PI (Pacific Islander), and Other.

Methods

The authors used their open-source svtools pipeline to jointly call SVs across all samples. Per-sample calls were produced with LUMPY (v0.2.13), CNVnator (v0.3.3), and svtyper (v0.1.4); calls were merged across samples and refined with svtools. Low- and high-confidence variants were distinguished using a Mendelian-error cutoff on mean sample quality, calibrated against a set of 409 CEPH trios. Per-sample validation was performed against a PacBio long-read truth set derived from three HGSVC samples.

For this UCSC track, VCF INFO fields were parsed and converted to BED9+ format. Variants originally called on GRCh37 (B37 callset) were lifted to GRCh38 using the UCSC hg19ToHg38.over.chain.gz chain. See the track build documentation for full details.
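To illustrate the kind of parsing involved, here is a minimal Python sketch that splits a VCF INFO string into key/value pairs. It is not the code used to build this track. The example INFO string is hypothetical, and its keys (SVTYPE, END, SVLEN, AC, AN, IMPRECISE) follow common SV VCF conventions rather than being confirmed for the authors' files.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag keys map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Entries without "=" are flags (presence means True).
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO string, for illustration only.
info = "SVTYPE=DEL;END=1204500;SVLEN=-4500;AC=12;AN=29000;IMPRECISE"
parsed = parse_info(info)

sv_type = parsed["SVTYPE"]
end = int(parsed["END"])  # values are strings until converted
print(sv_type, end, parsed["IMPRECISE"])
```

Values are kept as strings so that the caller decides which fields to convert to numbers.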

Data Access

The data can be explored interactively in table format with the Table Browser or the Data Integrator and exported from there to spreadsheet or tab-separated tables. From scripts, the data can be accessed through our API using the track name abelSv.
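A script-based query might look like the following Python sketch. It assumes the public UCSC REST API endpoint at api.genome.ucsc.edu and the hg38 assembly. The region and the helper names build_track_url and fetch_track are illustrative, and the live request is left commented out.

```python
import json
import urllib.request

# Assumed endpoint of the UCSC REST API for track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom, start, end):
    """Build a UCSC REST API URL for one track over one region."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    # The API accepts semicolon-separated parameters.
    query = ";".join(f"{key}={value}" for key, value in params.items())
    return f"{API_BASE}?{query}"


def fetch_track(url):
    """Download and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


url = build_track_url("hg38", "abelSv", "chr21", 0, 100000)
print(url)
# data = fetch_track(url)  # uncomment to query the live API
```

Requesting a bounded region keeps the response small. Very large regions may be truncated by the API's item limit.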

For automated download and analysis, the annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called abelSv.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain features within a given range, e.g.

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/abelSv/abelSv.bb -chrom=chr21 -start=0 -end=100000000 stdout
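The same command can be driven from a script. The Python sketch below builds the bigBedToBed argument list shown above and parses the leading BED columns of its output. It assumes bigBedToBed is installed and on the PATH. The helper names are illustrative, and the columns beyond chrom, start, and end are passed through unparsed because their layout is not described here.

```python
import subprocess

BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/abelSv/abelSv.bb"


def bigbed_command(chrom, start, end, source=BIGBED_URL):
    """Return the bigBedToBed argument list for one region, writing to stdout."""
    return [
        "bigBedToBed",
        source,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        "stdout",
    ]


def parse_bed_line(line):
    """Split one BED line into chrom, start, end, and the remaining columns."""
    cols = line.rstrip("\n").split("\t")
    return cols[0], int(cols[1]), int(cols[2]), cols[3:]


def fetch_region(chrom, start, end):
    """Run bigBedToBed (must be on PATH) and return the parsed rows."""
    result = subprocess.run(
        bigbed_command(chrom, start, end),
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_bed_line(line) for line in result.stdout.splitlines() if line]


cmd = bigbed_command("chr21", 0, 100000000)
print(" ".join(cmd))
# rows = fetch_region("chr21", 0, 100000000)  # requires bigBedToBed and network
```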

The original site-frequency VCF and BEDPE files are distributed by the authors from their supplementary-data GitHub repository.

Credits

Thanks to Haley J. Abel, David E. Larson, Ira M. Hall and colleagues at the McDonnell Genome Institute (Washington University in St. Louis), the Broad Institute, Baylor College of Medicine, the New York Genome Center, and the University of Washington for producing this resource and making the site-frequency callsets publicly available.

References

Abel HJ, Larson DE, Regier AA, Chiang C, Das I, Kanchi KL, Layer RM, Neale BM, Salerno WJ, Reeves C et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature. 2020 Jul;583(7814):83-89. PMID: 32460305; PMC: PMC7547914