Description

This track shows about 44,000 upstream open reading frames (uORFs) in the 5' UTRs of human genes. They were curated from ribosome profiling data by the UTRannotator project and annotated by UCSC with Kozak strength and translational efficiency.

uORFs are small open reading frames located in the 5' UTR of mRNAs, upstream of the main protein-coding sequence. They play an important role in translational regulation: ribosomes scanning from the 5' cap may translate a uORF first, which can reduce translation of the downstream main ORF. Genetic variants that create or disrupt uORFs can therefore alter protein expression and contribute to disease.

UTRannotator is a plugin for the Ensembl Variant Effect Predictor (VEP) that annotates 5' UTR variants with respect to uORFs. It detects five types of uORF-perturbing events: AUG gained, AUG lost, stop gained, stop lost, and frameshift. The plugin needs a reference set of known uORFs to annotate against, so the authors compiled a curated set of translated small ORFs in human 5' UTRs, derived from ribosome profiling data in the sorfs.org database. This reference set is what is displayed in this track. Almost all of these ORFs are annotated as 5' UTR uORFs; only a small fraction, 270 of them, are annotated as 5'UTR+3'UTR uORFs, which occurs when transcripts overlap.

Display Conventions and Configuration

Items are displayed in bigGenePred format. Each item is labeled with the gene symbol of the host transcript. Color reflects the categorical Kozak consensus strength:

Strong – A/G at position −3 and G at position +4
Moderate – only one of those positions matches
Weak – neither position matches
non-ATG – near-cognate start codon; the Kozak rule does not apply
no context – chromosome edge or context unavailable
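
The categories above can be sketched as a small Python function. This is only an illustration of the rule as listed, not the code UCSC used: the function name, the input convention (a transcript-orientation sequence plus the index of the start codon), and the order in which the "non-ATG" and "no context" cases are checked are all assumptions.

```python
def kozak_strength(context, start_index):
    """Classify the Kozak context of a start codon (illustrative sketch).

    context     -- sequence in transcript orientation around the start codon
    start_index -- 0-based index of the first base of the start codon
    The A of ATG is position +1, so -3 is start_index - 3 and +4 is
    start_index + 3 (there is no position 0).
    """
    seq = context.upper().replace("U", "T")
    codon = seq[start_index:start_index + 3]
    if len(codon) < 3:
        return "no context"
    # Assumption: near-cognate start codons are reported before checking flanks.
    if codon != "ATG":
        return "non-ATG"
    minus3 = start_index - 3
    plus4 = start_index + 3
    if minus3 < 0 or plus4 >= len(seq):
        return "no context"
    matches = (seq[minus3] in "AG") + (seq[plus4] == "G")
    return {2: "Strong", 1: "Moderate", 0: "Weak"}[matches]


print(kozak_strength("GCCACCATGG", 6))  # A at -3 and G at +4
```
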

The UTRannotator source data has no exon/intron structure, so each uORF is projected onto a same-strand host transcript whose coordinates overlap the uORF range. The host's exons are clipped to the uORF range, so any host intron inside the overlap becomes an intron of the displayed feature; a uORF that extends past either end of the host gets a single bridging block for the orphan portion. The primary donor pool is the MANE Select / MANE Plus Clinical set; if every MANE candidate is rejected (e.g. the original UTRannotator transcript had a different UTR exon boundary), the full GENCODE comprehensive set is consulted as a fallback. The chosen donor transcript ID is stored in intronsSource (none if no host was found in either pool).
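
The clipping step described above can be sketched roughly as follows. This is a minimal illustration, not the UCSC pipeline: the function name and the 0-based half-open coordinate convention are assumptions, the host is assumed to be already chosen and to overlap the uORF, and merging a bridging block with the abutting clipped exon is a choice made here for a tidy result.

```python
def project_uorf(uorf_start, uorf_end, host_exons):
    """Project a uORF range onto a host transcript's exons (sketch).

    host_exons -- sorted list of (start, end) tuples, 0-based half-open,
                  from a same-strand transcript overlapping the uORF.
    Returns the list of blocks for the displayed feature.
    """
    host_start = host_exons[0][0]
    host_end = host_exons[-1][1]
    blocks = []

    # Orphan portion upstream of the host: one bridging block.
    if uorf_start < host_start:
        blocks.append((uorf_start, min(uorf_end, host_start)))

    # Clip every host exon to the uORF range; gaps between the
    # surviving pieces become introns of the displayed feature.
    for exon_start, exon_end in host_exons:
        clipped_start = max(exon_start, uorf_start)
        clipped_end = min(exon_end, uorf_end)
        if clipped_start < clipped_end:
            blocks.append((clipped_start, clipped_end))

    # Orphan portion downstream of the host: one bridging block.
    if uorf_end > host_end:
        blocks.append((max(uorf_start, host_end), uorf_end))

    # Assumption: blocks that touch are merged into one.
    merged = []
    for start, end in blocks:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up coordinates: a host with one intron inside the uORF range.
print(project_uorf(150, 350, [(100, 200), (300, 400)]))
```
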

Mouseover shows the gene symbol, uORF type, start codon, Kozak strength and translational efficiency, and the host transcript whose exons supplied the intron structure.

The track offers the following filters: start codon, Kozak strength, Kozak translational efficiency (TE, as a range), and uORF type (5'UTR-only vs. spanning into the 3'UTR).

Data Access

The raw data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API; the track name is "utrAnnotUorfs".
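
As a rough sketch of scripted access, the Python snippet below builds a request URL for the `getData/track` endpoint of the UCSC REST API and fetches the JSON response. The helper names are hypothetical, the genome is assumed to be hg38 (as in the download path for this track), and the region is an arbitrary example; consult the API documentation for the authoritative parameter list.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom, start, end):
    """Build a getData/track URL (parameters separated by semicolons)."""
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    return API_BASE + "?" + ";".join(f"{key}={value}" for key, value in params)


def fetch_track(genome, track, chrom, start, end):
    """Fetch items for a region and return the decoded JSON (needs network)."""
    with urlopen(build_track_url(genome, track, chrom, start, end)) as response:
        return json.load(response)


# Example region chosen arbitrarily for illustration.
print(build_track_url("hg38", "utrAnnotUorfs", "chr21", 0, 100000000))
```

Calling `fetch_track("hg38", "utrAnnotUorfs", "chr21", 0, 100000000)` would perform the actual request; the structure of the returned JSON is not shown here and should be inspected before relying on specific keys.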

For automated download and analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g.

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/ncOrfs/utrAnnotUorfs.kozak.bb -chrom=chr21 -start=0 -end=100000000 stdout
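
The output of the command above is tab-separated text, and because bigGenePred extends BED12, the first twelve columns follow the standard BED layout. The sketch below shows one way to recover the exon blocks from a single output line in Python. The function name is hypothetical, the example line is made up, and the extra bigGenePred columns after the twelfth are simply ignored.

```python
def parse_bed12_blocks(line):
    """Parse the first 12 BED columns of one line into exon blocks (sketch)."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    chrom_start = int(fields[1])
    chrom_end = int(fields[2])
    name = fields[3]
    strand = fields[5]
    # blockSizes and chromStarts are comma-separated, often with a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    # Block starts are offsets relative to chromStart.
    blocks = [
        (chrom_start + offset, chrom_start + offset + size)
        for size, offset in zip(sizes, offsets)
    ]
    return {
        "chrom": chrom,
        "start": chrom_start,
        "end": chrom_end,
        "name": name,
        "strand": strand,
        "blocks": blocks,
    }


# Made-up line for illustration only; real lines carry further columns.
example = "chr21\t1000\t2000\tGENE1\t0\t+\t1000\t2000\t0,0,0\t2\t100,200,\t0,800,"
print(parse_bed12_blocks(example)["blocks"])
```
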

Methods

The uORF reference data was downloaded from the UTRannotator GitHub repository (file uORF_5UTR_GRCh38_PUBLIC.txt) and converted to bigBed format at UCSC. Coordinates for reverse-strand uORFs were swapped to genomic orientation. Four entries with invalid coordinates were excluded. Host transcripts were annotated as described above.

Credits

Thanks to Xiaolei Zhang, Nicola Whiffin, and the UTRannotator team at the Imperial College London Cardiovascular Genetics group for making this data publicly available.

References

Whiffin N, Karczewski KJ, Zhang X, Chothani S, Smith MJ, Evans DG, Roberts AM, Quaife NM, Schafer S, Rackham O et al. Characterising the loss-of-function impact of 5' untranslated region variants in 15,708 individuals. Nat Commun. 2020 May 27;11(1):2523. PMID: 32461616; PMC: PMC7253449

Zhang X, Wakeling M, Ware J, Whiffin N. Annotating high-impact 5'untranslated region variants with the UTRannotator. Bioinformatics. 2021 May 23;37(8):1171-1173. PMID: 32926138; PMC: PMC8150139