Description

This track displays 921,170 protein-coding ORFs from OpenProt v2.2, a database that provides a comprehensive annotation of all possible protein-coding ORFs in the human genome. In addition to currently annotated coding sequences (CDSs) and their reference proteins (RefProts), OpenProt predicts alternative ORFs (AltORFs) and their corresponding alternative proteins (AltProts) that are hidden within transcripts previously considered to encode only a single protein.

A pre-filtered subtrack (OpenProt MS>=2) is also available. It contains only the 377,916 ORFs with at least 2 unique mass spectrometry peptides detected across studies, which matches the MS-evidence threshold that OpenProt uses for its curated downloads.

OpenProt classifies proteins into three types: reference proteins (RefProts), alternative proteins (AltProts), and novel isoforms.

AltORFs are further classified by their localization relative to the annotated CDS.

Display Conventions and Configuration

Items are displayed in bigGenePred format. Mouseover shows the ORF localization. Items are labeled with the protein accession number: IDs starting with IP_ are predicted AltProts, II_ are novel isoforms, and other IDs (e.g. NP_, ENSP) are RefProts from existing annotations.
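The labeling convention above can be expressed as a small lookup. The following is a minimal sketch, assuming the prefix rules are exactly as described in this section; the function name and the example accessions in the usage note are hypothetical and are not part of the track or of OpenProt.

```python
def classify_accession(accession: str) -> str:
    """Guess the OpenProt protein type from an item label.

    Assumption: the prefix rules are the ones stated in this description
    (IP_ = predicted AltProt, II_ = novel isoform, anything else = RefProt).
    """
    if accession.startswith("IP_"):
        return "AltProt"
    if accession.startswith("II_"):
        return "Isoform"
    # Other IDs (e.g. NP_, ENSP) come from existing annotations.
    return "RefProt"
```

For example, classify_accession("IP_000001") returns "AltProt", while an Ensembl-style ID such as "ENSP00000000001" falls through to "RefProt".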

The track includes several filter options.

Data Access

The raw data can be explored interactively with the Table Browser or the Data Integrator. The data can be accessed from scripts through our API; the track name is "openprot" (all ORFs) or "openprotMs" (MS-filtered).
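As an illustration of scripted access, the sketch below builds a request URL for the /getData/track endpoint of the REST API and decodes the JSON response. The genome name hg38 is inferred from the download path used elsewhere on this page, and the helper names are hypothetical. The structure of the returned JSON is not described here, so the code returns it unparsed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(track, chrom, start, end, genome="hg38"):
    """Build a /getData/track URL for one region of one track."""
    params = {
        "genome": genome,  # assumption: this track lives on hg38
        "track": track,    # "openprot" or "openprotMs"
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"


def fetch_track(track, chrom, start, end, genome="hg38"):
    """Fetch one region and return the decoded JSON (needs network access)."""
    with urlopen(build_track_url(track, chrom, start, end, genome)) as response:
        return json.load(response)
```

Calling fetch_track("openprotMs", "chr21", 0, 100000000) would request the MS-filtered items on chr21. Only the URL construction can be checked offline.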

For automated download and analysis, the genome annotations are stored in bigBed files that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g.

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/ncOrfs/openprot/openprot.bb -chrom=chr21 -start=0 -end=100000000 stdout
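The same command can be driven from a script. This is a rough sketch, assuming bigBedToBed is installed and on the PATH. It parses only the twelve standard BED columns and keeps the remaining fields as an unnamed list, because the order of the extended fields is not given here. The function names are hypothetical.

```python
import subprocess

BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/ncOrfs/openprot/openprot.bb"


def parse_bed12_line(line):
    """Parse the standard BED12 columns of one tab-separated output line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "name": fields[3],
        "strand": fields[5],
        # BED block lists are comma-separated and may end with a comma.
        "block_sizes": [int(x) for x in fields[10].split(",") if x],
        "block_starts": [int(x) for x in fields[11].split(",") if x],
        # Extended fields are kept as-is; their order is not assumed.
        "extra": fields[12:],
    }


def fetch_region(chrom, start, end, url=BIGBED_URL):
    """Run bigBedToBed for one region and return the parsed records."""
    cmd = ["bigBedToBed", url, f"-chrom={chrom}",
           f"-start={start}", f"-end={end}", "stdout"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [parse_bed12_line(l) for l in result.stdout.splitlines() if l]
```

fetch_region("chr21", 0, 100000000) corresponds to the command shown above.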

The original data files can be downloaded from the OpenProt download page.

Methods

The OpenProt v2.2 BED12 and TSV annotation files were downloaded from the OpenProt API. The BED file (2,846,289 rows) contains genomic coordinates for all predicted ORFs. Because the same protein can be mapped through multiple transcripts to identical genomic coordinates, the rows were deduplicated, which reduced them to 921,170 unique genomic features. Three entries with overlapping BED blocks were excluded.
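The deduplication step can be sketched as follows. This is not the script used to build the track. It assumes that two rows are duplicates when they share chromosome, coordinates, strand, protein accession and block structure, and that a row is excluded when any two of its blocks overlap. The exact key and the order of the two checks in the real pipeline are not stated here.

```python
def parse_int_list(field):
    """Parse a BED comma-separated integer list such as '50,100,'."""
    return [int(x) for x in field.split(",") if x]


def blocks_overlap(block_sizes, block_starts):
    """Return True if any two blocks of one BED12 item overlap."""
    blocks = sorted(zip(block_starts, block_sizes))
    return any(s1 + size1 > s2
               for (s1, size1), (s2, _) in zip(blocks, blocks[1:]))


def deduplicate_bed12(rows):
    """Drop rows with overlapping blocks, then keep one row per feature.

    Each row is a list of the 12 BED12 fields as strings.
    Returns (unique_rows, number_of_excluded_rows).
    """
    seen = set()
    unique = []
    excluded = 0
    for row in rows:
        chrom, start, end, name, _score, strand = row[:6]
        sizes = tuple(parse_int_list(row[10]))
        starts = tuple(parse_int_list(row[11]))
        if blocks_overlap(sizes, starts):
            excluded += 1
            continue
        # Assumed duplicate key: same protein at identical coordinates.
        key = (chrom, start, end, name, strand, sizes, starts)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique, excluded
```

A set of keys keeps the pass linear in the number of rows, which matters for an input of several million lines.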

Each BED entry was annotated with metadata from the TSV file by joining on protein accession. For proteins with multiple transcript entries in the TSV, the annotation with the highest MS score was retained. Extended fields include protein type (AltProt/RefProt/Isoform), ORF localization, MS score, TE (Translation Event) score, Kozak motif status, InterPro domain count, and reading frame.
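The join described above might look like the sketch below. The column names ("accession", "ms_score") are hypothetical stand-ins for the real TSV headers. Two details are assumptions of this sketch, since the text does not specify them: ties on MS score keep the first row seen, and BED entries without a TSV match are kept unannotated.

```python
def best_annotation_by_accession(tsv_rows):
    """Keep, for each protein accession, the TSV row with the highest MS score."""
    best = {}
    for row in tsv_rows:
        acc = row["accession"]
        # Strict '>' means the first row wins when scores are tied.
        if acc not in best or row["ms_score"] > best[acc]["ms_score"]:
            best[acc] = row
    return best


def annotate(bed_rows, tsv_rows):
    """Attach the best TSV annotation to each BED entry by protein accession."""
    best = best_annotation_by_accession(tsv_rows)
    # Entries with no matching accession are returned without extra fields.
    return [{**bed, **best.get(bed["name"], {})} for bed in bed_rows]
```

Reducing the TSV to one row per accession first makes the join a single dictionary lookup per BED entry.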

The annotation is based on GRCh38.p13, Ensembl release 106, and UniProt release 2022_06_01.

Credits

Thanks to Xavier Roucou and the OpenProt team at the Université de Sherbrooke for creating OpenProt and making the data publicly available.

References

Brunet MA, Leblanc S, Bhatt P, Bhatt R, Bhatt S, Brunelle M, Bhatt V, Bhatt M, Roucou X et al. OpenProt: a more comprehensive guide to explore eukaryotic coding potential and proteomes. Nucleic Acids Res. 2021 Jan 8;49(D1):D1175-D1180. DOI: 10.1093/nar/gkaa1036. PMID: 33196842

Leblanc S, Bhatt P, Bhatt R, Bhatt S, Bhatt V, Brunet MA, Bhatt M, Roucou X et al. OpenProt: a database for a comprehensive listing of all predicted ORFs and their associated proteins. Nucleic Acids Res. 2018 Jan 4;46(D1):D529-D533. DOI: 10.1093/nar/gkx1109. PMID: 29140464