The Korean Variant Archive (KOVA) contains whole genome sequencing (WGS) data from 1,896 individuals and whole exome sequencing (WES) data from 3,409 individuals, all healthy and of Korean ethnicity. Most of the samples originated from normal tissue of cancer patients (40.16%), healthy parents of rare disease patients (28.4%), or healthy volunteers (31.44%). Korean ancestry is not broken down further in the INFO field. Sequencing coverage is 100x for WES and 30x for WGS. Structural variants (SVs) called with Manta are also available.
Due to license restrictions, the data for this track cannot be downloaded from the UCSC Genome Browser. The Table Browser, Data Integrator, and download server are not available for this track.
TSV data can be requested on the KOVA Downloads website. Our GitHub repo contains a script that converts this format to VCF.
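As a rough illustration of what such a TSV-to-VCF conversion involves, the following is a minimal sketch and not the script from the GitHub repo. The column names (chrom, pos, ref, alt, ac, an) and the function name are hypothetical; the actual KOVA TSV layout may differ and should be checked against the file header.

```python
import csv
import io


def tsv_to_vcf_lines(tsv_text):
    """Convert tab-separated variant rows into minimal VCF 4.2 lines.

    Assumption: the input has a header row with the hypothetical columns
    chrom, pos, ref, alt, ac (alt allele count) and an (allele number).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate allele count">',
        '##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles">',
        '##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate allele frequency">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for row in reader:
        ac = int(row["ac"])
        an = int(row["an"])
        # Guard against rows with no called alleles.
        af = ac / an if an else 0.0
        info = f"AC={ac};AN={an};AF={af:.6g}"
        # ID, QUAL and FILTER are unknown here, so they are written as ".".
        lines.append(
            "\t".join([row["chrom"], row["pos"], ".", row["ref"], row["alt"], ".", ".", info])
        )
    return lines


if __name__ == "__main__":
    example = "chrom\tpos\tref\talt\tac\tan\nchr1\t12345\tA\tG\t5\t100\n"
    print("\n".join(tsv_to_vcf_lines(example)))
```

A real conversion would also need contig header lines and coordinate-sorted output before the file can be compressed and indexed.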
Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1). Downstream analyses followed a modified version of the gnomAD quality-control framework and were primarily conducted using Hail; after merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality <20, read depth <10, allelic balance <0.2, or overlapping low-complexity regions were excluded.
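The exclusion thresholds listed above can be sketched as a simple predicate. This is plain Python for illustration only, not the Hail code used by KOVA. The record layout and names are hypothetical, allelic balance is assumed to mean alternate-allele reads divided by total depth, and the description does not state whether the thresholds were applied per genotype or per variant.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical per-call record; field names are illustrative only."""
    gq: int                  # genotype quality
    dp: int                  # read depth
    alt_reads: int           # reads supporting the alternate allele
    n_alt_alleles: int       # number of alternate alleles at the site
    in_low_complexity: bool  # overlaps a low-complexity region


def passes_filters(call, min_gq=20, min_dp=10, min_ab=0.2):
    """Return True if the call survives the exclusion criteria in the text."""
    if call.n_alt_alleles > 1:  # multiallelic variants are excluded
        return False
    if call.in_low_complexity:  # low-complexity regions are excluded
        return False
    if call.gq < min_gq or call.dp < min_dp:
        return False
    if call.dp == 0:  # avoid division by zero if min_dp is lowered to 0
        return False
    # Assumption: allelic balance = alt reads / total depth.
    return call.alt_reads / call.dp >= min_ab


if __name__ == "__main__":
    print(passes_filters(Call(gq=30, dp=20, alt_reads=10, n_alt_alleles=1, in_low_complexity=False)))
    print(passes_filters(Call(gq=30, dp=20, alt_reads=3, n_alt_alleles=1, in_low_complexity=False)))
```

The thresholds are keyword arguments so that the effect of a different cutoff can be checked without editing the function.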
At UCSC, version 7 (V7) of the TSV.gz file was obtained from the KOVA staff by email and converted to VCF. The file is not available for download from our site but can be requested from the KOVA website. The makeDoc file of the varFreqs track documents how all of its source files were converted. For some tracks, Python scripts were necessary; these are also available from GitHub.
Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV format.
Lee J, Lee J, Jeon S, Lee J, Jang I, Yang JO, Park S, Lee B, Choi J, Choi BO et al. A database of 5305 healthy Korean individuals reveals genetic and clinical implications for an East Asian population. Exp Mol Med. 2022 Nov;54(11):1862-1871. PMID: 36323850; PMC: PMC9628380