This track shows allele frequencies for 78.6 million variants from 4,480 whole-genome-sequenced Chinese individuals released by the Westlake BioBank for Chinese (WBBC) pilot project. The WBBC is a population study of about 35,000 Chinese volunteers across 31 provinces; about 15,000 have been deeply phenotyped and a subset have been whole-genome sequenced. The frequencies are also broken down into four Han Chinese regional groups (North, Central, South, Lingnan) defined by recruitment province in the WBBC paper.
The pilot project has been folded into the larger China Precision BioBank (CPBB) initiative, which is collecting up to 100,000 samples nationwide. The variant frequencies on this track are from the original WBBC Phase I release (v20210103) and are unchanged by the rebranding.
The track uses the standard UCSC VCF display. Hovering over a variant shows the cohort allele frequency, the four regional frequencies, the sequencing depth, the GATK VQSR log-odds score, and the number of samples with each genotype (hom-ref, het, hom-alt), as reported by WBBC.
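The values shown on hover are read from the INFO column of each VCF record. A minimal Python sketch of splitting that column into key/value pairs follows. The record is made up for illustration and uses only the generic VCF tags AC, AN, AF and DP; the tag names that WBBC uses for the regional frequencies and genotype counts are not listed here and should be taken from the VCF header.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a dict; flag tags (no '=') map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Hypothetical record, not taken from the WBBC file.
example_line = "chr21\t10000\t.\tA\tG\t.\tPASS\tAC=12;AN=8960;AF=0.00134;DP=61234"
columns = example_line.split("\t")
info = parse_info(columns[7])  # INFO is the eighth VCF column
print(info["AF"])
```

Values are kept as strings; convert them with int() or float() as needed once the tag types are known from the header.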
The WBBC pilot retained whole-genome sequences from 4,535 individuals after dropping samples that failed standard QC; sequencing was done at a mean depth of 13.9x on Illumina HiSeq X10 platforms. Reads were aligned to GRCh38 with BWA-MEM, variants were jointly called with GATK 4.0 HaplotypeCaller, and the callset was filtered with VQSR. The 4,480 unrelated samples released for download were stratified into four Han Chinese regional groups (North, Central, South and Lingnan), which together cover 27 of the administrative divisions the pilot reached. Allele counts and frequencies are reported overall and per region. See Cong et al. 2022 (in References below) for full sample-selection and pipeline details.
The per-chromosome WGS sites VCFs (chr1-22) were downloaded from https://wbbc.westlake.edu.cn/ (URL pattern: WBBC.chr<N>.GRCh38.vcf.gz). We concatenated the 22 files with bcftools concat and re-headered the result to add the standard hg38 contig lines and proper INFO definitions. We then dropped variants with a cohort allele count of zero (about 1.9% of rows, mostly multi-allelic split sites that no WBBC sample carries), and sorted, bgzipped and tabix-indexed the result. No coordinate liftover was needed: the upstream files are already on GRCh38 with chr-prefixed chromosomes. The pipeline is recorded in the makeDoc file of the track.
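The allele-count filter can be illustrated with a short Python sketch. It is not the command used to build the track (that is recorded in the makeDoc file); it only shows the rule, assuming the cohort allele count is stored under the standard VCF INFO tag AC. The example records are made up.

```python
import re

# Match the cohort AC tag only, not tags that merely end in "AC=".
AC_PATTERN = re.compile(r"(?:^|;)AC=([^;]+)")


def keep_record(line: str) -> bool:
    """Return True for header lines and for records with any ALT allele count > 0."""
    if line.startswith("#"):
        return True
    info = line.rstrip("\n").split("\t")[7]
    match = AC_PATTERN.search(info)
    if match is None:
        return True  # no AC tag present: keep the record unchanged
    counts = [int(c) for c in match.group(1).split(",")]
    return any(c > 0 for c in counts)


# Hypothetical input lines, not taken from the WBBC file.
lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAC=0;AN=8960",
    "chr1\t100\t.\tA\tT\t.\tPASS\tAC=3;AN=8960",
]
kept = [line for line in lines if keep_record(line)]
print(len(kept))
```

Anchoring the pattern on the start of the field or a semicolon keeps a hypothetical per-region tag with a name ending in AC from being mistaken for the cohort count.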
Only autosomes (chr1-22) are present; chrX, chrY and chrM are not in the WBBC download. Variants reported with AC=0 in the WBBC release (about 1.9% of rows, mostly multi-allelic split sites that no WBBC individual carries) have been removed from this track.
The variant frequencies can be explored interactively using the Table Browser or the Data Integrator, and exported to spreadsheet or tab-separated tables. From scripts, the data can be accessed via our REST API with track=wbbc.
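As a rough illustration of scripted access, the Python sketch below builds a request URL for the REST API getData/track endpoint with track=wbbc. The region is an arbitrary example, the start coordinate is assumed to be 0-based, and the layout of the returned JSON is not described here, so inspect a response before relying on particular keys.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(chrom, start, end, genome="hg38", track="wbbc"):
    """Build a getData/track URL for one region (start assumed 0-based)."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return API_BASE + "?" + urllib.parse.urlencode(params)


def fetch_track(chrom, start, end):
    """Download and decode the JSON response for one region (needs network access)."""
    url = build_track_url(chrom, start, end)
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


# Arbitrary example region on chr21.
url = build_track_url("chr21", 25000000, 25010000)
print(url)
```

Calling fetch_track("chr21", 25000000, 25010000) would issue the request; keeping regions small avoids very large responses.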
The VCF file is also available from our download server as wbbc.vcf.gz. Individual regions can be extracted with tabix, for example tabix http://hgdownload.soe.ucsc.edu/gbdb/hg38/varFreqs/wbbc/wbbc.vcf.gz chr21:1-100000000. The original per-chromosome WBBC release is distributed at https://wbbc.westlake.edu.cn/.
Thanks to the WBBC participants and to the Westlake University team (Pei-Kuan Cong, Hou-Feng Zheng and colleagues) for making the pilot sites-only VCFs publicly available.
Cong PK, Bai WY, Li JC, Yang MY, Khederzadeh S, Gai SR, Li N, Liu YH, Yu SH, Zhao WW et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun. 2022 May 26;13(1):2939. PMID: 35618720; PMC: PMC9135724