Description

This track shows which genome regions are more or less accessible to next-generation sequencing methods that use short, paired-end reads. It summarizes whole genome sequencing data from Phase 3 of the 1000 Genomes Project at two levels of stringency: "pilot" stringency regions (see below) cover 94.5% of non-N bases in the genome (excluding alternate haplotype sequences and unplaced contigs), and "strict" regions cover 75.5%. Every site that meets the "strict" criteria also passes the "pilot" criteria.

This track does not show a mask of regions in which variant calls can or cannot be made. Some 1000 Genomes Phase 3 variant calls are in regions that do not meet the "strict" criteria. Phase 3 variant calls are filtered using the Variant Quality Score Recalibrator (VQSR) method, as implemented in the Genome Analysis Toolkit (GATK), without regard to the thresholds applied here. VQSR assesses the strength of the evidence only at sites where some evidence for variation exists; it says nothing about the remaining sites.

These regions will be useful for (a) comparing accessibility using current technologies to accessibility in the 1000 Genomes Pilot Project, and (b) population genetic analyses (such as estimates of mutation rate) that must focus on genomic regions with very low false positive and false negative rates.

Methods

The total depth of mapped sequence reads, the average mapping quality score and the fraction of reads with mapping quality zero (meaning that the read maps equally well to more than one location in the genome) are tabulated from 1000 Genomes Project Phase 3 .bam files. This combines low-coverage whole genome sequence information from 2,504 individuals, giving a genome-wide average total depth of coverage of 17,920 reads. Both "pilot" and "strict" tracks are .bed file conversions of the "pass" regions from .fasta mask files. See the README file in the directory containing the mask files and Supplementary Information (section 9.2) of the Phase 3 paper (1000 Genomes Project Consortium, 2015) for more details.
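The conversion from a .fasta mask to .bed "pass" regions can be sketched as follows. This is a minimal illustration, not the pipeline that produced the track: it assumes each base of the mask is encoded as a single character and that passing bases are marked with the letter P (check the README for the actual encoding). The function names and the chromosome label are hypothetical.

```python
from itertools import groupby


def pass_intervals(mask_sequence, pass_char="P"):
    """Yield 0-based, half-open (start, end) runs of pass_char.

    pass_char="P" is an assumption about the mask encoding.
    """
    position = 0
    for char, run in groupby(mask_sequence):
        length = sum(1 for _ in run)
        if char == pass_char:
            yield position, position + length
        position += length


def to_bed_lines(chrom, mask_sequence, pass_char="P"):
    """Format the pass runs of one mask sequence as tab-separated BED lines."""
    return [
        f"{chrom}\t{start}\t{end}"
        for start, end in pass_intervals(mask_sequence, pass_char)
    ]


if __name__ == "__main__":
    # Toy mask: two N bases, three passing, two failing, one passing, one failing.
    for line in to_bed_lines("chr1", "NNPPPLLPH"):
        print(line)
```

BED coordinates are 0-based and half-open, so a run of passing bases at mask offsets 2, 3 and 4 is written as start 2, end 5.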

The "pilot" criteria require a depth of coverage between 8,960 and 35,840 inclusive (between one-half and twice the average depth) and that no more than 20% of covering reads have mapping quality zero. These are equivalent to the criteria used for analyses in the 1000 Genomes Pilot paper (2010). The "strict" criteria require a depth of coverage between 8,960 and 26,880 inclusive (between one-half and 1.5 times the average depth), no more than 0.1% of reads with mapping quality zero, and an average mapping quality of 56 or greater. This definition is quite stringent and focuses on the most uniquely mappable regions of the genome. Because approximately one-half of 1000 Genomes Project individuals are male, the depth of coverage is generally lower on the X chromosome. Coverage thresholds on the X chromosome were adjusted by a factor of 3/4 and on the Y chromosome by a factor of 1/2.
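The thresholds above can be expressed as a small per-site classifier. This is a sketch under stated assumptions, not the code used to build the mask files: the class and function names are hypothetical, and applying the chromosome factor to both the lower and the upper depth bound is an assumption about how the adjustment was made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criteria:
    min_depth: float
    max_depth: float
    max_mq0_fraction: float
    min_mean_mapq: Optional[float] = None  # None: no average mapping quality floor

    def scaled(self, factor):
        """Return a copy with both depth bounds multiplied by factor (assumption)."""
        return Criteria(
            self.min_depth * factor,
            self.max_depth * factor,
            self.max_mq0_fraction,
            self.min_mean_mapq,
        )

    def accepts(self, depth, mq0_fraction, mean_mapq):
        """Check one site against the inclusive thresholds."""
        if not (self.min_depth <= depth <= self.max_depth):
            return False
        if mq0_fraction > self.max_mq0_fraction:
            return False
        if self.min_mean_mapq is not None and mean_mapq < self.min_mean_mapq:
            return False
        return True


# Threshold values taken from the Methods text.
PILOT = Criteria(8_960, 35_840, 0.20)
STRICT = Criteria(8_960, 26_880, 0.001, 56.0)


def classify(depth, mq0_fraction, mean_mapq, depth_factor=1.0):
    """Return "strict", "pilot" or "fail" for one site.

    depth_factor is 1.0 for autosomes, 0.75 for X and 0.5 for Y.
    """
    if STRICT.scaled(depth_factor).accepts(depth, mq0_fraction, mean_mapq):
        return "strict"
    if PILOT.scaled(depth_factor).accepts(depth, mq0_fraction, mean_mapq):
        return "pilot"
    return "fail"
```

For example, classify(17_920, 0.0, 60.0) returns "strict", while a site with depth 30,000, 10% mapping quality zero reads and average mapping quality 40 is "pilot" only. With depth_factor=0.75 the strict depth window becomes 6,720 to 20,160.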

Credits

Mary Kate Wing at the University of Michigan Center for Statistical Genetics provided the track data files. Tom Blackwell and Mary Kate Wing at UM edited the description and methods.

References

1000 Genomes Pilot Project:
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73.

Phase 3 of the 1000 Genomes Project:
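1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015 Oct 1;526(7571):68-74. PMID: 26432245 http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html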