The GenomeIndia project is a national initiative that coordinates academic and medical institutions across India to characterize the genetic diversity of the Indian subcontinent. The release used by this track is whole-genome sequencing of 9,768 healthy adults sampled from 83 anthropologically defined endogamous populations across India's ethnolinguistic and biogeographic range (Indo-European, Dravidian, Austroasiatic, and Tibeto-Burman language families, plus a continentally admixed outgroup). After joint genotyping and quality filtering, 129,938,889 high-confidence biallelic variants (~121M SNVs and ~8M indels) were reported, of which roughly one third are absent from gnomAD, 1000 Genomes, and GenomeAsia. This track shows the alternate allele frequency in that 9,768-sample autosomal call set.
Indian populations are underrepresented in global variant databases, and many alleles that are rare globally occur at much higher frequencies in specific endogamous groups. The release ships only the cohort-wide alternate allele frequency (no per-population breakdown), so this track shows the overall GenomeIndia AF; AC is derived from AF (see Methods).
Variants are shown as a VCF dense track. Each row reports the genomic position, ref/alt alleles, the GenomeIndia alternate allele frequency, and a synthesized allele count. The track only includes autosomal variants (chr1–chr22); chrX, chrY, and chrM are not in the current release.
The data can be explored interactively with the Table Browser or the Data Integrator. For programmatic access, our REST API can be used; the track name is genomeindia. For bulk download, the VCF file can be obtained from our download server.
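As a rough illustration of programmatic access, the sketch below builds a request to the UCSC REST API getData/track endpoint for the genomeindia track. The assembly name hg38 is an assumption (the reads were aligned to GRCh38), the helper names build_track_url and fetch_track are hypothetical, and the example region is arbitrary. The shape of the returned JSON is not assumed here; inspect it before relying on particular keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public UCSC REST API endpoint for track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(chrom, start, end, genome="hg38", track="genomeindia"):
    """Build a getData/track URL (start is 0-based, end is 1-based).

    genome="hg38" is an assumption based on the GRCh38 alignment;
    the track name comes from the track description.
    """
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return API_BASE + "?" + urlencode(params)


def fetch_track(chrom, start, end):
    """Fetch the region and return the decoded JSON (requires network access)."""
    with urlopen(build_track_url(chrom, start, end), timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Arbitrary example region; prints the URL without contacting the server.
    print(build_track_url("chr1", 1000000, 1001000))
```

Calling fetch_track("chr1", 1000000, 1001000) would issue the request; keeping regions small avoids hitting the API's per-request item limits.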
The original per-chromosome TSV summary statistics can be downloaded directly from the GenomeIndia Data Centre at ibdc.dbtindia.gov.in (the 9768GI_SummaryStats.tar.gz bundle). Use of the data is subject to the GenomeIndia data-access policy listed on that page.
PCR-free whole-genome sequencing libraries were prepared from blood-derived DNA and sequenced on Illumina NovaSeq 6000 to a per-sample average depth of at least 23×. Reads were processed with the Illumina DRAGEN v4.0.3 germline pipeline against GRCh38. The resulting per-sample gVCFs were then joint-genotyped with the Illumina gVCF genotyper. Site-level filters retained only PASS variants with QUAL ≥ 30, posterior genotype probability ≥ 99.9%, GQ > 20 at every site (GQ > 40 for singletons and doubletons), heterozygous allele balance ≥ 0.2, call rate ≥ 98%, and Hardy–Weinberg equilibrium p > 1×10⁻¹¹; sites with an inbreeding coefficient of 1 were also excluded as technical artefacts. Variants were annotated for protein impact with Ensembl VEP v113 plus LOFTEE; details are in the published methods (Bhattacharyya et al. 2025, see References).
The release was downloaded from ibdc.dbtindia.gov.in as 9768GI_SummaryStats.tar.gz, which contains 22 per-chromosome TSV files of CHROM, POS, ID, REF, ALT, AF (no header). The TSV files were converted to a single sorted, bgzipped, tabix-indexed VCF by the script genomeindiaToVcf.py. The release ships only AF; AC and AN are synthesized as AN = 2 × 9768 = 19536 and AC = round(AF × AN). Variants were kept only when called in ≥98% of samples, so AN slightly overstates the true called allele count for some sites (worst case ~2%); the AC field is a close approximation, not the exact observed count. The processing steps are documented in the makeDoc file.
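The AC/AN synthesis described above can be sketched as follows. This is not the actual genomeindiaToVcf.py, only a minimal illustration of the arithmetic on one headerless TSV row; the function names, the "." QUAL value, and the PASS filter value are assumptions made for the example.

```python
N_SAMPLES = 9768
AN = 2 * N_SAMPLES  # diploid autosomes: 19536 alleles if every sample is called


def synth_ac(af):
    """Approximate allele count from the cohort-wide allele frequency.

    Python's round() rounds exact halves to the nearest even integer;
    that only matters if af * AN lands exactly on x.5.
    """
    return round(af * AN)


def tsv_row_to_vcf(line):
    """Convert one CHROM, POS, ID, REF, ALT, AF row into a VCF data line.

    QUAL "." and FILTER "PASS" are placeholders assumed for this sketch.
    """
    chrom, pos, var_id, ref, alt, af = line.rstrip("\n").split("\t")
    info = f"AF={af};AC={synth_ac(float(af))};AN={AN}"
    return "\t".join([chrom, pos, var_id, ref, alt, ".", "PASS", info])


if __name__ == "__main__":
    # Made-up row for illustration only, not taken from the release.
    print(tsv_row_to_vcf("chr1\t12345\t.\tA\tG\t0.25"))
```

Because AN is fixed at 19536, a site called in only 98% of samples would have a true allele number near 19145, which is the source of the roughly 2% worst-case overstatement.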
We thank the GenomeIndia consortium for making the 9,768-sample summary statistics publicly available. The track was built at UCSC by Max Haeussler.
Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi KV, Maitra A, Nagabandi T et al. Mapping genetic diversity with the GenomeIndia project. Nat Genet. 2025 Apr;57(4):767-773. PMID: 40200122
Subramanian K, Bhattacharyya C, Machha P, Mukherjee A, Tripathi D, Chakraborty S, Majumdar SS, Sengupta S, Singh P, More V et al; GenomeIndia Consortium. An Atlas of Indian Genetic Diversity. medRxiv. 2026 Mar 20;2026.03.20.26348801 (preprint).