Description

This track shows structural variants (SVs) identified by PacBio HiFi long-read sequencing of probands and their families enrolled in the Genomic Answers for Kids (GA4K) program at Children's Mercy Research Institute. GA4K is a longitudinal pediatric genomics initiative that aims to enroll 30,000 children with suspected rare genetic disorders, together with their parents, to build a large-scale resource of clinical and genomic data.

The callset contains 115,554 SVs (52,564 deletions, 58,219 insertions, 4,408 duplications, 363 inversions) from 502 sequenced samples. Variants are site-level (no per-sample genotypes), and each SV has been replicated: it was either observed in two or more unrelated GA4K individuals or matched an SV from an external long-read reference set (deCODE or the Human Pangenome Reference Consortium).

Display Conventions and Configuration

Items are colored by SV type:

Insertions are placed at the insertion site with a width of 1 bp; deletions, duplications and inversions span the affected interval. Filters are available for SV type, SV length, carrier-sample count and allele frequency. The detail page also shows the total number of samples genotyped at each site.
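The placement rule above can be sketched in a few lines. This is only an illustration: the type labels (INS, DEL, DUP, INV), the function name and the choice of 0-based, half-open coordinates are assumptions, and the track may anchor the 1 bp insertion item differently.

```python
def display_interval(sv_type: str, start: int, end: int) -> tuple[int, int]:
    """Return the (chromStart, chromEnd) interval drawn for one SV.

    Coordinates are assumed to be 0-based and half-open, as in BED.
    """
    if sv_type == "INS":
        # Insertions add sequence at a single site, so the item is 1 bp wide.
        return (start, start + 1)
    if sv_type in ("DEL", "DUP", "INV"):
        # These types affect a reference interval, so the item spans all of it.
        return (start, end)
    raise ValueError(f"unexpected SV type: {sv_type!r}")
```

Under these assumptions, an insertion at position 1000 is drawn as the interval (1000, 1001) regardless of the inserted length, whereas a 500 bp deletion starting there is drawn as (1000, 1500).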

Methods

Samples were sequenced on PacBio Revio and Sequel II instruments with HiFi chemistry. Single-sample SV callsets were produced with pbsv and then merged across the cohort with Jasmine v1.1.4 (jasmine --output-genotypes), which clusters equivalent SVs across samples and writes a site-level multi-sample VCF.

To reduce false positives, the merged VCF was filtered to retain only SVs supported by at least two independent observations. An SV qualified if it either (1) fell in the same Jasmine cluster as an SV from a second, unrelated Children's Mercy (CMH) individual, or (2) matched an SV in the deCODE Icelandic or Human Pangenome Reference Consortium (HPRC) callsets using svpack match with default settings.
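As a rough illustration of the replication rule, the sketch below decides whether one merged SV would be kept. It is not the pipeline's code: the MergedSV record, the family_of lookup and the use of family identifiers to decide that two individuals are unrelated are all assumptions made for the example, and the external match is reduced to a precomputed flag rather than an actual svpack run.

```python
from dataclasses import dataclass


@dataclass
class MergedSV:
    """One site from the merged callset (hypothetical record layout)."""
    sv_id: str
    carriers: list[str]           # sample IDs in the same Jasmine cluster
    external_match: bool = False  # matched a deCODE or HPRC SV


def is_replicated(sv: MergedSV, family_of: dict[str, str]) -> bool:
    """Keep an SV seen in two unrelated individuals or matched externally."""
    # Samples from the same family count as a single observation here.
    families = {family_of[sample] for sample in sv.carriers}
    if len(families) >= 2:
        return True
    # One cohort observation plus an external match also counts as two.
    return len(families) >= 1 and sv.external_match
```

With this rule, an SV carried only by a proband and that proband's mother is dropped unless it also matches an external callset.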

Carrier counts (SVC), total sample counts (SVN) and allele frequencies (SVF = SVC/SVN) were recomputed on the replicated callset.
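The frequency definition is simple enough to state as code. The function below is a minimal sketch of SVF = SVC/SVN; the function name and the input checks are additions for the example. Because both counts are numbers of samples, the value is the fraction of genotyped samples that carry the SV.

```python
def sv_frequency(svc: int, svn: int) -> float:
    """Return SVF = SVC / SVN for one site.

    svc: number of carrier samples (SVC)
    svn: total number of samples counted at the site (SVN)
    """
    if svn <= 0:
        raise ValueError("SVN must be positive")
    if not 0 <= svc <= svn:
        raise ValueError("SVC must lie between 0 and SVN")
    return svc / svn
```

For example, an SV carried by 5 of 500 samples has SVF = 0.01.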

Data Access

The data can be explored interactively in table format with the Table Browser or the Data Integrator and exported from there to spreadsheets or tab-separated tables. From scripts, the data can be accessed through our API with the parameter track=ga4kSv.
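For scripted access, a request URL can be assembled as in the sketch below. It assumes the public endpoint https://api.genome.ucsc.edu/getData/track and the hg38 assembly (inferred from the location of the bigBed file); the function name and the example region are illustrative only.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_url(chrom: str, start: int, end: int,
              genome: str = "hg38", track: str = "ga4kSv") -> str:
    """Build an API request URL for one region of the track."""
    query = urlencode({
        "genome": genome,  # assumed assembly
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    })
    return f"{API_BASE}?{query}"
```

The resulting URL returns JSON and can be fetched with urllib.request.urlopen or any HTTP client.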

For automated download and analysis, the annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called ga4kSv.bb. Individual regions or the whole annotation can be obtained using the bigBedToBed utility, available as a precompiled binary or from source as described on our utilities page. For example, the following command writes all items on chr21 to standard output: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/ga4kSv.bb -chrom=chr21 -start=0 -end=100000000 stdout
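The same bigBedToBed call can be driven from a script. The sketch below assumes the bigBedToBed binary is installed and on the PATH; the helper names are hypothetical, and only the first three BED columns are interpreted because the remaining fields of this track are not described here.

```python
import subprocess

BB_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/ga4kSv.bb"


def bigbed_args(chrom: str, start: int, end: int, url: str = BB_URL) -> list[str]:
    """Build the bigBedToBed argument list for one region."""
    return ["bigBedToBed", url, f"-chrom={chrom}",
            f"-start={start}", f"-end={end}", "stdout"]


def parse_bed_line(line: str) -> tuple[str, int, int, list[str]]:
    """Split one BED line into chrom, start, end and the untouched extra fields."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    return chrom, int(start), int(end), rest


def fetch_region(chrom: str, start: int, end: int) -> list[tuple[str, int, int, list[str]]]:
    """Run bigBedToBed (requires network access) and parse its output."""
    result = subprocess.run(bigbed_args(chrom, start, end),
                            capture_output=True, text=True, check=True)
    return [parse_bed_line(line) for line in result.stdout.splitlines() if line]
```

Calling fetch_region("chr21", 0, 100000000) would then return the chr21 items as tuples, with any additional columns left as strings.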

The original VCF is available from the Children's Mercy Research Institute GA4K data release at github.com/ChildrensMercyResearchInstitute/GA4K.

Credits

Thanks to the Children's Mercy Research Institute and the Genomic Answers for Kids participants and their families for making this dataset publicly available.

References

Cohen ASA, Farrow EG, Abdelmoity AT, Alaimo JT, Amudhavalli SM, Anderson JT, Bansal L, Bartik L, Baybayan P, Belden B et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet Med. 2022 Jun;24(6):1336-1348. PMID: 35305867