Description

The non-canonical ORFs supertrack contains tracks that display open reading frames (ORFs) found outside of annotated protein-coding sequences. While the human genome has approximately 20,000 annotated protein-coding genes, recent advances in ribosome profiling (Ribo-seq) and proteomics have revealed widespread translation of ORFs that do not correspond to known protein-coding genes. These non-canonical ORFs are found in regions previously considered non-coding, including 5' and 3' UTRs, long non-coding RNAs, pseudogenes, and alternative reading frames of known genes.

Several subtypes of non-canonical ORFs are commonly distinguished. Upstream ORFs (uORFs) are located in 5' UTRs and can regulate translation of the downstream main coding sequence; ribosomes that translate a uORF may fail to reinitiate at the main start codon, reducing protein output. Small ORFs (sORFs), generally defined as encoding fewer than 100 amino acids, have been systematically overlooked by gene annotation pipelines due to their short length, but many produce functional micropeptides involved in signaling, metabolism, and development. Other types include downstream ORFs (dORFs) in 3' UTRs, out-of-frame ORFs that overlap known coding sequences in an alternative reading frame, and ORFs in transcripts annotated as non-coding RNAs or pseudogenes.

This track collection imports ORFs from several databases, annotates each ORF with its Kozak strength, and colors the features accordingly.

Click any of the track names below to show their configuration/documentation page:

| Track | Description | Items | Genome coverage | Exon coverage | ATG | non-ATG | Strong | Moderate | Weak |
|---|---|---|---|---|---|---|---|---|---|
| UTRannotator uORFs | Upstream ORFs in 5' UTRs from UTRannotator | 44,435 | 1.15% | 1.15% | 6,236 | 38,199 | 1,307 | 3,054 | 1,875 |
| GENCODE ncORFs | GENCODE non-canonical ORFs supported by Ribo-seq | 7,264 | 1.02% | 0.03% | 7,263 | 1 | 1,571 | 3,705 | 1,987 |
| GENCODE ncORFs primary | GENCODE non-canonical ORFs – primary set | 10,127 | 0.45% | 0.02% | 6,183 | 3,944 | 1,746 | 3,300 | 1,137 |
| GENCODE ncORFs comprehensive | GENCODE non-canonical ORFs – comprehensive set | 28,359 | 2.24% | 0.06% | 13,776 | 14,583 | 3,133 | 7,168 | 3,475 |
| nuORFdb | Non-canonical ORFs from nuORFdb v1.2 | 229,251 | 22.14% | 0.83% | 51,080 | 178,171 | 10,905 | 25,539 | 14,636 |
| MetamORF | Meta-database of small ORFs (sORFs) | 664,558 | 33.53% | 1.19% | 147,490 | 517,068 | 33,481 | 74,267 | 39,742 |
| OpenProt | Alternative and reference proteins from OpenProt v2.2 | 921,170 | 49.85% | 3.36% | 906,942 | 14,228 | 202,199 | 446,288 | 258,455 |
| OpenProt (MS>=2) | OpenProt proteins with mass spectrometry evidence (≥2 peptides) | 377,916 | 40.29% | 1.85% | 367,257 | 10,659 | 106,148 | 181,817 | 79,292 |

ATG and non-ATG are counts by start codon type; Strong, Moderate, and Weak are Kozak strength counts for ATG starts only.

GENCODE ncORFs (Phase I and Phase II)

The three GENCODE ncORF tracks display non-canonical translated open reading frames identified from ribosome profiling (Ribo-seq) data and mapped to the GENCODE annotation by the GENCODE / TransCODE consortium.

See the GENCODE ncORFs Phase I subtrack page or the Phase II primary / comprehensive pages for download URLs, methods, and references.

UTRannotator uORFs

Created by the Whiffin lab, UTRannotator is a VEP plugin for annotating 5' UTR variants with respect to upstream open reading frames (uORFs). As part of the project, the authors compiled a curated reference set of uORFs in human 5' UTRs from sorfs.org, which contains ORFs supported by Ribo-seq. See the UTRannotator uORFs subtrack page for more details. Data from sorfs.org is also part of the MetamORF track (see below). This track is useful if you have a prediction from the VEP plugin and want to see its genomic context.

The UTRannotator source data is distributed as single-span features with no exon/intron structure, so the uORFs would appear as continuous blocks even across introns of their host transcripts. To recover the splicing structure we look up, for each uORF, a same-strand MANE Select / MANE Plus Clinical transcript whose coordinates overlap the uORF range. The host transcript's exons are clipped to the uORF range so that any MANE intron inside the overlap is preserved as an intron of the displayed bed12 record. A uORF that extends past either end of the MANE transcript keeps the MANE introns inside the overlap and gets a single bridging block for the orphan portion. If a uORF endpoint falls inside a MANE intron (i.e. UTRannotator originally used a transcript whose UTR exon boundaries differ from MANE's), we fall back to the full GENCODE comprehensive set and apply the same projection. If no donor in either pool can host the uORF, it stays single-block. The chosen donor transcript ID is recorded in the intronsSource field (or none if no host was found).
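The projection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in addIntrons.py: the function names are hypothetical, coordinates are assumed to be 0-based half-open as in BED, exons are assumed sorted and non-overlapping, and the "bridging block" for an orphan portion is modeled here by extending the first or last clipped exon out to the uORF boundary.

```python
def project_blocks(orf_start, orf_end, exons):
    """Clip host-transcript exons to the uORF range [orf_start, orf_end).

    exons: sorted, non-overlapping (start, end) tuples, 0-based half-open.
    Returns a list of (start, end) blocks, or None when this host cannot be
    used (an endpoint lies in an intron, or there is no exonic overlap).
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]

    def in_exon(pos):
        return any(s <= pos < e for s, e in exons)

    # An endpoint inside the transcript span but outside every exon is intronic.
    for pos in (orf_start, orf_end - 1):
        if tx_start <= pos < tx_end and not in_exon(pos):
            return None

    # Keep only the exonic parts that fall inside the uORF range.
    blocks = [(max(s, orf_start), min(e, orf_end))
              for s, e in exons if s < orf_end and e > orf_start]
    if not blocks:
        return None

    # Assumption: the orphan portion beyond the transcript is bridged by
    # extending the outermost block to the uORF boundary.
    if orf_start < tx_start:
        blocks[0] = (orf_start, blocks[0][1])
    if orf_end > tx_end:
        blocks[-1] = (blocks[-1][0], orf_end)
    return blocks


def bed12_block_fields(orf_start, blocks):
    """Return (blockCount, blockSizes, blockStarts) for a bed12 record."""
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - orf_start) for s, _ in blocks)
    return len(blocks), sizes, starts
```

A caller would try the MANE host first and, if `project_blocks` returns None, repeat the call with candidate transcripts from the fallback pool before leaving the uORF as a single block.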

nuORFdb

nuORFdb (novel unannotated ORF database) is a Broad Institute database of non-canonical open reading frames with evidence of translation from ribosome profiling (Ribo-seq). ORF types include uORFs, dORFs, out-of-frame ORFs, pseudogene ORFs, lincRNA ORFs, and others. See the nuORFdb subtrack page for more details. nuORFdb is an internally consistent dataset described in a well-known publication.

MetamORF

MetamORF is a repository of small ORFs (sORFs) in the human genome, consolidated from several primary data sources and many individual ribosome profiling datasets. It integrates bioinformatic predictions, ribosome profiling experiments, and mass spectrometry studies into a unified format. See the MetamORF subtrack page for more details. MetamORF contains many predictions, not all of which may be relevant, but it serves as an example of a database that includes as many ORF models as possible, and it is the only complete archive of sorfs.org that we are aware of.

OpenProt

OpenProt is a comprehensive annotation of all possible protein-coding ORFs in eukaryotic genomes. It distinguishes RefProts (the known canonical proteins), Isoforms (alternative products of canonical genes), and AltProts (predicted from alternative reading frames in UTRs, frameshifted CDS overlaps, and non-coding RNAs). Each ORF is annotated with mass spectrometry and ribosome profiling evidence; a pre-filtered subset supported by mass spectrometry (≥2 unique peptides) is also available. See the OpenProt subtrack page for more details. OpenProt is widely known and has by far the most predictions, even more than MetamORF, which is why a separate subtrack shows only the more reliable ORFs with mass spectrometry support.

Kozak Strength Annotation

Every ORF in every subtrack carries three additional annotation fields derived from the genomic sequence around its start codon:

Features in every subtrack are colored by the categorical kozakStrength field. The same legend applies to all subtracks:

Strong – A/G at position −3 and G at position +4
Moderate – only one of those two positions matches
Weak – neither position matches
non-ATG – near-cognate start codon; the Kozak rule does not apply
no context – chromosome edge or other case where the 11-base context could not be read
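The legend above can be expressed as a small classifier. The following is a minimal sketch, not the logic in colorByKozak.py: the function name is hypothetical, and it assumes the 11-base context is laid out as six upstream bases (positions −6 to −1), the three-base start codon (+1 to +3), and two downstream bases (+4, +5), so that position −3 is index 3 and position +4 is index 9.

```python
def kozak_strength(context):
    """Classify an 11-base start-codon context.

    Assumed layout (an illustration, not confirmed by the source):
    indices 0-5 = positions -6..-1, indices 6-8 = start codon,
    indices 9-10 = positions +4 and +5.
    """
    ctx = context.upper()
    # Missing or ambiguous bases: the context could not be read.
    if len(ctx) != 11 or any(base not in "ACGT" for base in ctx):
        return "no context"
    # Near-cognate start codons are not scored by the Kozak rule.
    if ctx[6:9] != "ATG":
        return "non-ATG"
    # Count how many of the two key positions match.
    matches = (ctx[3] in "AG") + (ctx[9] == "G")
    return ("Weak", "Moderate", "Strong")[matches]
```

For example, under this layout the context GCCACCATGGC has A at −3 and G at +4 and is classified as Strong, while GCCTCCATGTC matches neither position and is classified as Weak.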

Per-subtrack counts of ATG vs. non-ATG starts and the Strong / Moderate / Weak breakdown are shown in the table at the top of this page. The high non-ATG fraction in UTRannotator, MetamORF, and nuORFdb is inherent to those catalogs — they explicitly include non-canonical CTG/GTG/TTG starts. GENCODE Phase I restricted itself to ATG-only starts.

The 11-base Kozak context is fetched directly from the genome at the position of the start codon. For multi-exon ORFs with an intron immediately upstream of the start codon, the upstream bases of the context come from the genome (that is, from intronic sequence) rather than from the host transcript's spliced 5' UTR; in that uncommon case the computed Kozak value may be inaccurate.

The Kozak strength annotation and color coding are added by the script colorByKozak.py in the kent source tree (src/hg/makeDb/scripts/ncOrfs/), along with the per-track autoSql files, the cached Noderer 2014 TE table, and a helper script (addIntrons.py) that recovers exon/intron structure for the UTRannotator uORFs. The Kozak strength logic is a Python port of the corresponding R routines in the VuTR pipeline (Whiffin lab / Computational Rare-Disease Genomics, WHG Oxford); credit and thanks to the VuTR authors for the original implementation. Full build steps are recorded in the makedoc.

Data Access

The raw data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API. See the individual track pages for more details.
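As a rough illustration of scripted access, the sketch below builds a request URL for the UCSC REST API's getData/track endpoint and fetches the JSON response. The function names are hypothetical, and the track name passed in the usage comment is a placeholder only; the real track names are listed on the individual subtrack pages.

```python
import json
import urllib.request

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_url(genome, track, chrom, start, end):
    """Build a getData/track URL for one region (parameters joined by ';')."""
    params = [("genome", genome), ("track", track),
              ("chrom", chrom), ("start", start), ("end", end)]
    return API_BASE + "?" + ";".join(f"{key}={value}" for key, value in params)


def fetch_track(genome, track, chrom, start, end):
    """Fetch the region and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(track_url(genome, track, chrom, start, end)) as resp:
        return json.load(resp)


# Usage (requires network access; "someNcOrfTrack" is a placeholder name):
# data = fetch_track("hg38", "someNcOrfTrack", "chr21", 0, 100000000)
```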

For automated download and analysis, each subtrack is stored as a bigBed file that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain features within a given range, e.g. for the GENCODE Phase I ncORF subtrack:

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/ncOrfs/gencNcOrf/Ribo-seq_ORFs.kozak.bb -chrom=chr21 -start=0 -end=100000000 stdout

File names for all eight subtracks (under /gbdb/hg38/ncOrfs/):

References

Please refer to each subtrack's description page for references.

References for the Kozak / TE methodology:

Kozak M. An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987 Oct 26;15(20):8125-48. DOI: 10.1093/nar/15.20.8125; PMID: 3313277; PMC: PMC306349

Noderer WL, Flockhart RJ, Bhaduri A, Diaz de Arce AJ, Zhang J, Khavari PA, Wang CL. Quantitative analysis of mammalian translation initiation sites by FACS-seq. Mol Syst Biol. 2014 Aug 28;10(8):748. DOI: 10.15252/msb.20145136; PMID: 25170020; PMC: PMC4299517