The Genome of the Netherlands (GoNL) is a whole-genome sequencing project covering the Dutch population. The cohort was drawn from five Dutch biobanks and includes 250 parent-offspring families (231 trios and 19 quartets) from 11 of the 12 Dutch provinces. Samples were not selected by phenotype or disease status. This track shows allele counts and frequencies from the GRCh38 re-analysis of GoNL, restricted to the 498 unrelated parents (250 fathers and 248 mothers; two mothers failed QC in the original release).
The data can be explored interactively with the Table Browser or the Data Integrator. For programmatic access, use our REST API with the track name gonl. For bulk download, the VCF is available from our download server. The original file is also available from the GoNL download directory at MOLGENIS.
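As a rough illustration of programmatic access, the sketch below builds a request URL for the UCSC REST API /getData/track endpoint using the track name gonl. The region (chr1:1,000,000-1,001,000) and the helper names gonl_track_url and fetch_gonl are hypothetical and chosen only for this example. The layout of the returned JSON is not assumed here, so inspect the response keys before relying on any field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"


def gonl_track_url(chrom, start, end, genome="hg38", track="gonl"):
    """Build a /getData/track URL for one region.

    The API takes a 0-based start and an exclusive end coordinate.
    """
    query = urlencode(
        {"genome": genome, "track": track, "chrom": chrom, "start": start, "end": end}
    )
    return f"{API_BASE}/getData/track?{query}"


def fetch_gonl(chrom, start, end):
    """Fetch the region and return the decoded JSON response (needs network access)."""
    with urlopen(gonl_track_url(chrom, start, end), timeout=30) as response:
        return json.load(response)
```

Calling fetch_gonl("chr1", 1000000, 1001000) would return the decoded JSON for that region. Keep requested regions small, and use the bulk VCF download for genome-wide analyses.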
The track shown here uses the GRCh38 re-analysis (version 1.0). All samples were re-aligned from raw reads to a GRCh38 analysis set (GRCh38_no_alt_plus_hs38d1 with PhiX as decoy). The processing pipeline is documented in the README accompanying the data and differs from the original Nature Genetics pipeline (reference below). Per-library reads were trimmed with cutadapt 1.13, aligned with bwa mem 0.7.15, sorted with Picard SortSam 2.9.0, and base-quality-recalibrated with GATK BaseRecalibrator 3.7. Per-sample files were merged and deduplicated with sambamba 0.6.6, and variants were called per sample with GATK HaplotypeCaller 3.7. Per-family GVCFs were merged with GATK CombineGVCFs, and all families were jointly genotyped with GATK GenotypeGVCFs 3.7. The GRCh38 callset has not been filtered with VQSR, and missing genotypes have not been imputed, so it is a less refined callset than the original GRCh37 release.
The file multisample.parents_only.info_only.vcf.gz was downloaded from https://download.molgeniscloud.org/downloads/gonl_public/variants/GoNL_GRCh38_1.0/. Of the 31,114,481 records in the source file, 30,904,161 were kept after dropping calls on the GRCh38 decoy contigs (chrUn_JTFH01* and similar) and the EBV contig, which are not part of the UCSC hg38 assembly. The original chromosome naming already uses the UCSC chr prefix, so no renaming was needed. The 2,629,361 multiallelic sites were then split with bcftools norm -m-any, with indels left-aligned against the hg38 reference, yielding 36,363,474 biallelic records (3,559,402 indels realigned). The maximum observed allele number (AN) is 996, which matches the 498 diploid parents in the cohort. Loading documentation is in the varFreqs makeDoc file; helper scripts for the broader varFreqs collection are in our GitHub scripts directory.
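The contig filtering and multiallelic splitting described above can be sketched in plain Python. This is an illustration of the idea only, not the actual loading code from the makeDoc, and it does not perform the indel left-alignment that bcftools norm does. The "_decoy" suffix and the name chrEBV are assumptions about how the analysis set names those contigs, and the example record (position, alleles, counts) is made up.

```python
# Illustrative sketch only; not the loading code from the varFreqs makeDoc.

N_PARENTS = 498
MAX_AN = 2 * N_PARENTS  # 996 alleles when every diploid parent is genotyped


def is_hg38_contig(chrom):
    """Return False for decoy and EBV contigs, which are absent from UCSC hg38.

    Assumption: decoy contigs end in "_decoy" and EBV is named "chrEBV".
    """
    return not (chrom.endswith("_decoy") or chrom == "chrEBV")


def split_multiallelic(chrom, pos, ref, alts, acs, an):
    """Split one multiallelic site into one biallelic record per ALT allele.

    Each ALT keeps its own allele count (AC); the allele number (AN) is
    shared by all records from the same site. AF is computed as AC / AN.
    """
    records = []
    for alt, ac in zip(alts, acs):
        af = ac / an if an > 0 else 0.0
        records.append(
            {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt,
             "ac": ac, "an": an, "af": af}
        )
    return records


# Hypothetical triallelic site with full genotyping (AN = 996).
example = split_multiallelic("chr1", 12345, "G", ["A", "C"], [3, 5], MAX_AN)
```

Here the single triallelic site becomes two records, G>A with AC=3 and G>C with AC=5, both with AN=996. Sites with missing genotypes have a smaller AN, so frequencies should always be computed against the per-site AN rather than the cohort maximum.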
The data were generated by the Genome of the Netherlands Consortium and distributed via the MOLGENIS infrastructure at the University Medical Center Groningen. Thanks to the participants who donated samples and to the BBMRI-NL biobanks: LifeLines, Leiden Longevity Study, Netherlands Twin Registry, Rotterdam Study, and Rucphen Study.
Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014 Aug;46(8):818-25. PMID: 24974849