Description

This track displays 229,251 non-canonical open reading frames (ORFs) from nuORFdb v1.2 (novel unannotated ORF database), a database of ORFs with evidence of translation detected by ribosome profiling (Ribo-seq). nuORFdb was developed by the Bhatt lab at the Broad Institute of MIT and Harvard as a resource for identifying non-canonical peptides in immunopeptidomic mass spectrometry datasets.

The ORFs were predicted using a hierarchical pipeline that aggregates ribosome profiling signal across 29 primary healthy and cancer tissue samples and cell lines. The pipeline makes predictions at three levels (individual samples, tissues, and all samples combined), which allows it to detect lowly translated ORFs while maintaining sensitivity for tissue-specific variants. All ORFs have a minimum length of 8 amino acids.

Display Conventions and Configuration

Items are displayed in bigGenePred format, which draws each item like a gene model: thick regions represent the predicted ORF and thin regions represent the flanking transcript structure.

Mouseover on items shows the simplified ORF category (plotType). Items are labeled with the nuORFdb ORF identifier, which encodes the source Ensembl transcript and ORF number (e.g. ENST00000488147.1_1_1).
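
For scripts that need the source transcript of an item, the identifier can be split on underscores. The following is a minimal sketch that assumes every identifier follows the pattern of the example above (an Ensembl transcript ID followed by underscore-separated integers). The names OrfId and parse_orf_id are hypothetical, and the meaning of each trailing number is not specified here, so they are kept as an uninterpreted tuple.

```python
from typing import NamedTuple, Tuple


class OrfId(NamedTuple):
    transcript: str           # versioned Ensembl transcript ID, e.g. "ENST00000488147.1"
    numbers: Tuple[int, ...]  # trailing numeric parts, left uninterpreted


def parse_orf_id(orf_id: str) -> OrfId:
    """Split an identifier such as 'ENST00000488147.1_1_1' into its parts.

    Assumption: the transcript ID contains no underscore and every
    remaining part is a non-negative integer.
    """
    transcript, *numbers = orf_id.split("_")
    if not transcript or not numbers or not all(n.isdigit() for n in numbers):
        raise ValueError(f"unexpected ORF identifier format: {orf_id!r}")
    return OrfId(transcript, tuple(int(n) for n in numbers))
```

Identifiers that do not match the assumed pattern raise a ValueError rather than being silently mis-parsed.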

The track includes the following ORF categories (by mergeType):

Each item also includes the predicted protein sequence and additional classification fields (predictorType, plotType, geneType) from the nuORFdb annotations.

Data Access

The raw data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API; the track name is "nuorfdb".
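
As an illustration of scripted access, the sketch below builds a request URL for the /getData/track endpoint of the UCSC REST API and, optionally, fetches the JSON response. The helper names track_data_url and fetch_track_data are hypothetical; the genome name hg38 is taken from the download path of this track, and the region is only an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(genome, track, chrom=None, start=None, end=None):
    """Build a /getData/track URL; chrom, start and end are optional."""
    params = {"genome": genome, "track": track}
    if chrom is not None:
        params["chrom"] = chrom
    # A start/end range is only added when both coordinates are given.
    if start is not None and end is not None:
        params["start"] = start
        params["end"] = end
    return f"{API_BASE}/getData/track?{urlencode(params)}"


def fetch_track_data(url, timeout=30):
    """Download the URL and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling track_data_url("hg38", "nuorfdb", "chr21", 0, 100000000) yields a URL that can be passed to fetch_track_data or opened in a web browser.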

For automated download and analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g.

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/ncOrfs/nuorfdb/nuorfdb.bb -chrom=chr21 -start=0 -end=100000000 stdout

The original data files can be downloaded from the nuORFdb website at the Broad Institute.

Methods

The nuORFdb v1.2 data files (BED12 coordinates, Excel annotations, and protein FASTA sequences) were downloaded from the Broad Institute. The BED12 file was combined with the annotation spreadsheet (keyed on ORF_ID_hg38) and protein FASTA (keyed on sequence header ID) to produce a bigGenePred+ format file with 23 fields (12 standard BED fields, 8 bigGenePred fields, and 3 extended fields: predictorType, plotType, and proteinSequence).
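
To make the 23-field layout concrete, the following sketch parses one tab-separated line, as written by bigBedToBed, into a dictionary. The first 20 field names follow the standard bigGenePred definition; the order of the three extended fields is assumed from the description above, not confirmed against the track's autoSql file. The function name parse_line is hypothetical, and EXAMPLE_LINE contains made-up values rather than a real nuORFdb record.

```python
FIELDS = [
    # 12 standard BED fields
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "reserved", "blockCount", "blockSizes", "chromStarts",
    # 8 bigGenePred fields
    "name2", "cdsStartStat", "cdsEndStat", "exonFrames", "type",
    "geneName", "geneName2", "geneType",
    # 3 extended fields (order is an assumption)
    "predictorType", "plotType", "proteinSequence",
]

INT_FIELDS = ("chromStart", "chromEnd", "thickStart", "thickEnd", "blockCount")
INT_LIST_FIELDS = ("blockSizes", "chromStarts")


def parse_line(line):
    """Turn one tab-separated bigGenePred+ line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for key in INT_LIST_FIELDS:
        # BED block lists are comma-separated and may end with a trailing comma.
        record[key] = [int(x) for x in record[key].split(",") if x]
    return record


# Synthetic line for illustration only; every value is invented.
EXAMPLE_LINE = "\t".join([
    "chr21", "1000", "1100", "ENST00000000000.1_1_1", "0", "+",
    "1010", "1040", "0", "1", "100,", "0,",
    "", "cmpl", "cmpl", "0,", "", "", "", "",
    "", "", "MKLSPRTAV",
])
```

With this layout, parse_line(EXAMPLE_LINE)["proteinSequence"] gives the predicted protein sequence, and the thickStart and thickEnd values delimit the ORF within the transcript.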

A small number of entries (176 out of 229,251) used chromosome names that do not follow UCSC conventions (e.g. chrGL000008.2, chrMT). These were mapped to the corresponding UCSC names (e.g. chr4_GL000008v2_random, chrM).
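
A renaming step of this kind can be sketched as a lookup on the first BED column. The table below contains only the two examples given above; a complete mapping would have to come from an alias resource such as the UCSC chromAlias table for hg38. The names CHROM_ALIASES and rename_bed_line are hypothetical, and this is not necessarily how the track was actually built.

```python
# Only the two example mappings from the text; a real run needs the full table.
CHROM_ALIASES = {
    "chrMT": "chrM",
    "chrGL000008.2": "chr4_GL000008v2_random",
}


def rename_bed_line(line, aliases=CHROM_ALIASES):
    """Replace the chromosome name in the first column of a BED line.

    Names without an entry in the alias table are passed through unchanged.
    """
    chrom, sep, rest = line.partition("\t")
    return aliases.get(chrom, chrom) + sep + rest
```

Names such as chrGL000008.2 cannot be converted by string manipulation alone, because the UCSC name also encodes the chromosome on which the scaffold is placed, so a lookup table is required.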

Credits

Thanks to Tatyana Ouspenskaia, Toph Law, Karl Clauser, and Steven Bhatt at the Broad Institute of MIT and Harvard for creating nuORFdb and making the data publicly available. Thanks to Eric Malekos, UCSC, for suggesting this database.

References

Ouspenskaia T, Law T, Clauser KR, Klaeger S, Sarkizova S, Aguet F, Li B, Christian E, Knisbacher BA, Le PM et al. Unannotated proteins expand the MHC-I-restricted immunopeptidome in cancer. Nat Biotechnol. 2022 Feb;40(2):209-217. DOI: 10.1038/s41587-021-01072-6. PMID: 34663921