Description

This container track highlights sections of the genome that often cause problems or confusion during analysis. The hg19 assembly has a track with the same name but with more subtracks, because the GeT-RM and Genome-in-a-Bottle artifact variants do not exist for hg38.

Problematic Regions

The Problematic Regions track contains the following subtracks:

Highly Reproducible Regions (HighRepro)

The Highly Reproducible Regions track highlights regions and variants from eight samples that can be used to assess variant detection pipelines. The "Highly Reproducible Regions" subtrack comprises the intersection of the reproducible regions across all eight samples, while the "Variants" subtracks contain the reproducible variants from each assayed sample. Both tracks contain data from the following samples:

Please refer to the Pan et al reference for more information on how these regions were defined.
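
As an illustration of what "intersection of the reproducible regions across all eight samples" means, the following minimal Python sketch intersects per-sample interval lists on a single chromosome. It is not the procedure from Pan et al; the function names (intersect_two, intersect_all) and the toy coordinates are hypothetical, and each input list is assumed to be sorted, non-overlapping, and in BED-style half-open coordinates.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance the list whose current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(samples):
    """Keep only the positions covered in every sample's interval list."""
    return reduce(intersect_two, samples)


# Toy data for three hypothetical samples on one chromosome.
toy_samples = [
    [(0, 100), (200, 300)],
    [(50, 250)],
    [(60, 80), (210, 400)],
]
shared = intersect_all(toy_samples)
```

A position survives only if every sample lists it, so adding samples can shrink the result but never enlarge it.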

GIAB Problematic Regions

The Genome in a Bottle (GIAB) Problematic Regions tracks provide stratifications of the genome for evaluating variant calls in complex regions. They are designed for use with Global Alliance for Genomics and Health (GA4GH) benchmarking tools such as hap.py and include regions with low complexity, segmental duplications, functional regions, and difficult-to-sequence areas. Developed in collaboration with GA4GH, the GIAB consortium, and the Telomere-to-Telomere (T2T) Consortium, the dataset aims to standardize the analysis of genetic variation by offering predefined BED files for stratifying true and false positives, which facilitates accurate assessment in complex areas of the genome.

The GIAB Problematic Regions tracks are generated by a pipeline and configuration that produce stratification BED files. These files categorize genomic regions by specific challenges, such as low complexity or difficult mapping, to facilitate accurate benchmarking of variant calls. For more information on the pipeline and configuration used, please visit the following webpage: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/README.md. If you have questions or comments, please write to Justin Zook (jzook@nist.gov).

Panmask Easy 151b Regions

The Panmask Easy 151b Regions subtrack contains a set of sample-agnostic easy regions: genomic intervals where general short-read variant callers achieve high accuracy without sophisticated filtering. The regions are derived for variant filtration that is agnostic to individual samples.

A set of easy regions for ancient DNA variant filtering was generated by selecting 35-mers that could not be mapped elsewhere within one mismatch or gap. Read alignments from multiple samples were inspected to exclude regions with excessively high or low coverage or those enriched with low mapping quality alignments. The easy regions generated through this k-mer uniqueness procedure are referred to as pm151:lenient, where "pm" stands for panmask. In addition, low complexity regions identified by SDUST were removed.

The pm151 regions are used to filter spurious variant calls in centromeres, long repeats, and other genomic regions where short-read mapping is often problematic. They cover 88.2% of hg38, 92.2% of coding regions, and 96.3% of ClinVar pathogenic variants. The track can be used to filter variant calls for clinical or research human samples. Like the HighRepro track in this container (see above), it shows regions that are easy to sequence, not those that are problematic. The data was derived from the HPRC assemblies, and this track presents the 151b-easy panmask set.
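
Filtering variant calls against the easy regions can be sketched in Python as follows. This is a minimal illustration, not the method used to build the track: the helper names (build_index, in_easy_region, filter_variants) and the toy coordinates are hypothetical, and the regions are assumed to be non-overlapping within each chromosome. BED intervals are 0-based and half-open, while VCF positions are 1-based, so the position is shifted by one before the lookup.

```python
from bisect import bisect_right


def build_index(regions):
    """Group BED-style (chrom, start, end) regions by chromosome and sort them.

    Assumes regions on the same chromosome do not overlap.
    """
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_easy_region(index, chrom, pos):
    """Return True if the 1-based position pos falls inside an indexed region."""
    intervals = index.get(chrom, [])
    zero_based = pos - 1
    # Find the last interval whose start is <= the query position.
    k = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return k >= 0 and intervals[k][0] <= zero_based < intervals[k][1]


def filter_variants(index, variants):
    """Keep only (chrom, pos) variant calls that lie in an easy region."""
    return [(c, p) for c, p in variants if in_easy_region(index, c, p)]


# Toy regions and calls; real input would come from the track's BED data.
toy_index = build_index([("chr1", 100, 200), ("chr1", 500, 600)])
toy_calls = [("chr1", 100), ("chr1", 101), ("chr1", 200), ("chr1", 201), ("chr2", 150)]
kept = filter_variants(toy_index, toy_calls)
```

Binary search keeps each lookup logarithmic in the number of regions per chromosome, which matters for whole-genome call sets.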

Display Conventions and Configuration

Each track contains a set of regions of varying length with no special configuration options. The UCSC Unusual Regions track has a mouse-over description; all other tracks have at most a name field, which can be shown in pack mode. The tracks are usually kept in dense mode.

The Hide empty subtracks control hides subtracks with no data in the browser window. Changing the browser window by zooming or scrolling may result in the display of a different selection of tracks.

Data access

The raw data can be explored interactively with the Table Browser or the Data Integrator.

For automated download and analysis, the genome annotation is stored in bigBed files that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g.:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb -chrom=chr21 -start=0 -end=100000000 stdout
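
For scripted analysis, the same command can be wrapped in Python. The sketch below assumes that the bigBedToBed binary is on your PATH and that the first four tab-separated output columns are chrom, start, end, and name. The helper names (build_command, parse_bed, fetch_regions) are hypothetical and not part of any UCSC tool.

```python
import subprocess

# URL taken from the command-line example above.
URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb"


def build_command(url, chrom, start, end):
    """Build the bigBedToBed argument list for one genomic range."""
    return [
        "bigBedToBed",
        url,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        "stdout",
    ]


def parse_bed(text):
    """Parse tab-separated BED text into (chrom, start, end, name) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        records.append((fields[0], int(fields[1]), int(fields[2]), name))
    return records


def fetch_regions(url, chrom, start, end):
    """Run bigBedToBed (assumed to be installed) and parse its output."""
    result = subprocess.run(
        build_command(url, chrom, start, end),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_bed(result.stdout)


# Example call (requires network access and the bigBedToBed binary):
# regions = fetch_regions(URL, "chr21", 0, 100000000)
```

Passing the arguments as a list rather than a shell string avoids quoting problems if the URL or chromosome name ever contains unusual characters.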

Methods

Files were downloaded from the respective databases and converted to bigBed format. The procedure is documented in our hg38 makeDoc file.

Credits

Thanks to Anna Benet-Pagès, Max Haeussler, Angie Hinrichs, Daniel Schmelter, and Jairo Navarro at the UCSC Genome Browser for planning, building, and testing these tracks. The underlying data comes from the ENCODE Blacklist and some parts were copied manually from the HGNC and NCBI RefSeq tracks.

References

Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019 Jun 27;9(1):9354. PMID: 31249361; PMC: PMC6597582

Dwarshuis N, Kalra D, McDaniel J, Sanio P, Alvarez Jerez P, Jadhav B, Huang WE, Mondal R, Busby B, Olson ND et al. The GIAB genomic stratifications resource for human reference genomes. Nat Commun. 2024 Oct 19;15(1):9029. PMID: 39424793; PMC: PMC11489684

Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, Gonzalez-Porta M, Eberle MA, Tezak Z, Lababidi S et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019 May;37(5):555-560. PMID: 30858580; PMC: PMC6699627

Li H. Finding easy regions for short-read variant calling from pangenome data. ArXiv. 2025 Aug 8. PMID: 40799803; PMC: PMC12340882

Pan B, Ren L, Onuchic V, Guan M, Kusko R, Bruinsma S, Trigg L, Scherer A, Ning B, Zhang C et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biol. 2022 Jan 3;23(1):2. PMID: 34980216; PMC: PMC8722114