Overview
^^^^^^^^

This directory contains the Feb. 2009 GRCh37 assembly of the human genome
in various formats, along with some related files.  The UCSC release name is
"hg19".  This directory also includes versions of these files for a patch
release after 2009, "hg19.p13.plusMT".  The subdirectory "genes/" contains
selected gene transcript sets in GFF format.

Most users looking at this directory want to download the file latest/hg19.fa.gz.
If you need a file for a genome aligner such as BWA, bowtie2 or hisat2,
please read the section "Analysis set" below and look at the directory analysisSet/.

The main chromosome sequences of hg19.fa.gz are identical to those of the
assembly released by NCBI as GRCh37, Genome Reference Consortium Human
Reference 37 (GCA_000001405.1).

An expanded version of hg19 is also available that includes new sequences
from GRC patch release GRCh37.p13 (GCA_000001405.14) plus the revised
Cambridge Reference Sequence (rCRS) mitochondrial sequence. See the section
"Patches to hg19" below.

GRCh37 was produced and is updated by the Genome Reference Consortium:
	https://www.ncbi.nlm.nih.gov/grc

Differences from the NCBI files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two main differences compared to the NCBI files:

- the mitochondrial genome: since the release of the UCSC hg19
assembly, the Homo sapiens mitochondrion sequence (represented as "chrM" in the
Genome Browser) has been replaced in GenBank with the record NC_012920, the
revised Cambridge Reference Sequence (rCRS).  We have not replaced the original
sequence, NC_001807, as chrM in the hg19 Genome Browser.  However, files in the
subdirectory p13.plusMT include NC_012920 as "chrMT", in addition to the original
"chrM".

- also, the FASTA files of NCBI's GCA_000001405.1 distributed at
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
have different sequence identifiers ("NC_000001.10" for NCBI instead of "chr1"
for UCSC) and the repeatmasking, expressed by lowercasing letters, was done
with different RepeatMasker settings.
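
The identifier difference can be bridged with a small header-renaming
filter.  The following is a minimal sketch, not a UCSC tool: the function
name rename_fasta_headers and the two-column, tab-separated mapping file
(NCBI accession, then UCSC name) are assumptions made for illustration, and
you would have to supply the mapping yourself.  It does not address the
repeatmasking difference.

```shell
# rename_fasta_headers MAP < in.fa > out.fa
# MAP is a hypothetical tab-separated file: NCBI accession, UCSC name,
# e.g. a line "NC_000001.10<TAB>chr1". MAP must not be empty.
rename_fasta_headers() {
    awk -F'\t' '
        # First file (MAP): load the accession -> UCSC name table.
        NR == FNR { ucsc[$1] = $2; next }
        # FASTA header: look up the first word after ">".
        /^>/ {
            split(substr($0, 2), words, " ")
            if (words[1] in ucsc)
                print ">" ucsc[words[1]]   # known accession: description is dropped
            else
                print $0                   # unknown accession: left unchanged
            next
        }
        # Sequence lines pass through untouched.
        { print }
    ' "$1" -
}
```

Example use, with file names chosen only for illustration:
    gunzip -c ncbi.fa.gz | rename_fasta_headers ncbi2ucsc.tsv > renamed.fa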

Please also read the notes on our hg19 overview page at:
   http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
The page explains the naming scheme of unplaced contigs and haplotypes, 
e.g. HSCHR6_MHC_APD_CTG1 = GL000250.1 => "chr6_apd_hap1"
and the placement of the pseudo-autosomal (PAR) regions on chrX and chrY.

Analysis set
^^^^^^^^^^^^

The GRCh37/hg19 patch 13 assembly contains not only the chromosome
sequences but also a mitochondrial genome, unplaced sequences, alternate
haplotypes and fix patches.  Some of these sequences can confuse modern
aligners.

The subdirectory analysisSet/ contains files with optimized versions of the
genome for these aligners or similar high-throughput analysis programs. The
README.txt file in that directory provides more details.

Patches to hg19
^^^^^^^^^^^^^^^

The Genome Reference Consortium has been adding (short) sequences
since the initial release.  We added these patches in 2020 but keep the
updated releases in separate directories:

- The initial/ subdirectory contains files for the initial release of GRCh37,
without any patch release sequences.

- The p13.plusMT/ subdirectory contains files for GRCh37.p13 (patch release 13)
plus the rCRS mitochondrion sequence (NC_012920) as "chrMT".
GRC patch releases do not change any previously existing sequences; they
simply add new sequences for fix patches or alternate haplotypes that
correspond to specific regions of the main chromosome sequences.
The Genome Browser displays this expanded set of assembly sequences.

- The latest/ subdirectory contains files that do not include version indicators
in their names, but are symbolic links to files in the most recent version
subdirectory, i.e. p13.plusMT.

- Data files in the current directory are the same as files in the initial/
subdirectory, i.e. they are from the initial GRCh37 release and do not
include the patch sequences that are included in the Genome Browser.
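
To see which sequences a patch release adds, one option is to compare the
sequence names of two chrom.sizes files.  The sketch below assumes that the
latest/ subdirectory holds a file named hg19.chrom.sizes in the same
two-column format as the one in this directory; the function name
patch_only_names is made up for this example.

```shell
# patch_only_names INITIAL_SIZES LATEST_SIZES
# Prints names present in LATEST_SIZES but absent from INITIAL_SIZES.
# INITIAL_SIZES must not be empty.
patch_only_names() {
    awk -F'\t' '
        # First file: remember every sequence name of the initial release.
        NR == FNR { seen[$1] = 1; next }
        # Second file: print names that the initial release does not have.
        !($1 in seen) { print $1 }
    ' "$1" "$2"
}
```

Example use (the second path is the assumption noted above):
    patch_only_names hg19.chrom.sizes latest/hg19.chrom.sizes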

Sequence names
^^^^^^^^^^^^^^

During genome assembly, reads are assembled into "contigs" (a few kbp long),
which are then joined into longer "scaffolds" of a few hundred kbp. These are
finally placed, often manually e.g. with FISH assays, onto chromosomes.
The .agp file below describes how these were placed onto chromosomes.

The alternate haplotype (_hap) sequences were released with the initial assembly;
subsequent patches introduced fix sequences (_fix) and novel sequences (_alt).
For more information on patches see: http://genome.ucsc.edu/blog/patches/
The following list covers all the types of sequences found in the hg19 genome:

Chromosomes:
- made from scaffolds placed onto chromosome locations, 95% of the genome file
- format: chr{chromosome number or name}
- e.g. chr1 or chrX, chrM for the (non-rCRS) mitochondrial genome.

Unlocalized scaffolds:
- a sequence found in an assembly that is associated with a specific
chromosome but cannot be ordered or oriented on that chromosome.
- format: chr{chromosome number or name}_{sequence_accession}_random
- e.g. chr17_gl000205_random

Unplaced scaffolds:
- a sequence found in an assembly that is not associated with any chromosome.
- format: chrUn_{sequence_accession}
- e.g. chrUn_gl000223

Alternative haplotypes in initial GRCh37 release:
- a sequence that provides an alternate representation of a locus found
  in the primary assembly. These sequences were present in the initial hg19
  assembly release. They do not represent complete chromosome sequences. 
  There are 9 present in the initial hg19 assembly.
  For more information on the 7 chr6 alternate haplotypes see the MHC Haplotype
  Project website: http://www.ucl.ac.uk/cancer/medical-genomics/mhc
- format: chr{chromosome number or name}_{haplotype_name}_hap{haplotype_number_in_chromosome}
- e.g. chr6_cox_hap2

Alternate loci scaffolds from patch releases:
- a scaffold that provides an alternate representation of a locus found
  in the primary assembly. These sequences do not represent a complete
  chromosome sequence although there is no hard limit on the size of the
  alternate locus; currently most are less than 1 Mb. In the context of 
  hg19, all these sequences have been added through patch releases.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_alt
- e.g. chr12_gl877876_alt

Fix loci scaffolds:
- a patch that corrects sequence or reduces an assembly gap in a given
  major release. FIX patch sequences are meant to be incorporated into
  the primary or existing alt-loci assembly units at the next major
  release.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_fix
- e.g. chrX_kb021648_fix
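
The naming scheme above can be turned into a simple classifier.  This is a
sketch based only on the suffix patterns and examples listed in this
section; the function name classify_seq and the category labels it prints
are made up for illustration.

```shell
# classify_seq NAME
# Prints the sequence type implied by an hg19 sequence name.
classify_seq() {
    case "$1" in
        chrUn_*)        echo "unplaced" ;;     # e.g. chrUn_gl000223
        chr*_random)    echo "unlocalized" ;;  # e.g. chr17_gl000205_random
        chr*_hap[0-9]*) echo "haplotype" ;;    # e.g. chr6_cox_hap2
        chr*_alt)       echo "alt" ;;          # e.g. chr12_gl877876_alt
        chr*_fix)       echo "fix" ;;          # e.g. chrX_kb021648_fix
        chr*_*)         echo "unknown" ;;      # underscore but no known suffix
        chr*)           echo "chromosome" ;;   # e.g. chr1, chrX, chrM
        *)              echo "unknown" ;;
    esac
}
```

For example, to count the sequences of each type in a chrom.sizes file:
    cut -f1 hg19.chrom.sizes | while read -r name; do
        classify_seq "$name"
    done | sort | uniq -c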

Files
^^^^^

Files included in this directory are from the initial 2009 release of the genome.
Files for the most current patch version of the genome are in the "latest/" subdirectory:

hg19.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case. Again, the most current version of this
    file is latest/hg19.fa.gz.
    For many types of analysis that include sequence comparisons,
    the files in the directory analysisSet/ are recommended, as these
    include fewer duplicates.

hg19.fa.masked.gz - based on hg19.fa.gz, "hard-masked" assembly sequence in 
    one file. Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

hg19.fa.out.gz - RepeatMasker .out file.  RepeatMasker was run with the
    -s (sensitive) setting.
    Jan 29 2009 (open-3-2-7) version of RepeatMasker
    RepBase library: RELEASE 20090120

hg19.fa.align.gz - RepeatMasker .align file.  RepeatMasker was run with the
    -s (sensitive) setting.
    Jan 29 2009 (open-3-2-7) version of RepeatMasker
    RepBase library: RELEASE 20090120

hg19.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED
    format.

hg19.2bit - contains the complete human/hg19/GRCh37 genome sequence
    in the 2bit file format.  Repeats from RepeatMasker and Tandem Repeats
    Finder (with period of 12 or less) are shown in lower case; non-repeating
    sequence is shown in upper case.  The utility program, twoBitToFa (available
    from the kent src tree), can be used to extract .fa file(s) from
    this file.  A pre-compiled version of the command line tool can be
    found at:
        http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
    See also:
        http://genome.ucsc.edu/admin/git.html
        https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/userApps/README

hg19.agp.gz - Description of how the assembly was generated from
    fragments.

chromAgp.tar.gz - Description of how the assembly was generated from
    fragments, unpacking to one file per chromosome.

chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
    RepeatMasker was run with the -s (sensitive) setting.
    Using: Jan 29 2009 (open-3-2-7) version of RepeatMasker and
    RELEASE 20090120 of library RepeatMaskerLib.embl

chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED 5+
    format (one file per chromosome).

est.fa.gz - Human ESTs in GenBank. This sequence data is updated 
    regularly via automatic GenBank updates.

md5sum.txt - checksums of files in this directory

mrna.fa.gz - Human mRNA from GenBank. This sequence data is updated
    regularly via automatic GenBank updates.

refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
    This sequence data is updated regularly via automatic GenBank
    updates.

upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
    transcription starts of RefSeq genes with annotated 5' UTRs.
    This file is updated weekly so it might be slightly out of sync with
    the RefSeq data which is updated daily for most assemblies.

upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

xenoMrna.fa.gz - GenBank mRNAs from species other than that of 
    the genome. 

hg19.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.

hg19.gc5Base.wigVarStep.gz - ASCII wiggle variable-step data values used
    to construct the GC Percent track.

hg19.gc5Base.wig.gz - wiggle database table for the GC Percent track.
    This is an older alternative to the current bigWig format of the
    track, sometimes useful for analysis.

hg19.gc5Base.wib - binary data corresponding to the gc5Base.wig file.
    See also:
        http://genome.ucsc.edu/goldenPath/help/wiggle.html
        http://genomewiki.ucsc.edu/index.php/Using_hgWiggle_without_a_database
    for a discussion of how to use the wig.gz and .wib files to
    interact with the GC percent data values.
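
The relation between the soft-masked and hard-masked FASTA files listed
above can be illustrated with a one-line filter that replaces every
lower-case base with N while leaving header lines alone.  This is a sketch
of the idea only; the function name hard_mask is made up, and we have not
checked that its output is byte-for-byte identical to hg19.fa.masked.gz.

```shell
# hard_mask < soft_masked.fa > hard_masked.fa
# Replaces lower-case (repeat-masked) bases with N on sequence lines.
# Header lines (starting with ">") are passed through unchanged.
hard_mask() {
    sed '/^>/!s/[[:lower:]]/N/g'
}
```

Example use, with an output file name chosen for illustration:
    gunzip -c hg19.fa.gz | hard_mask | gzip > my.hg19.masked.fa.gz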

How to download
^^^^^^^^^^^^^^^

If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the
files via our website. To do so, ftp to hgdownload.soe.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/hg19/bigZips. To download multiple files, use
the "mget" command:

    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)

Alternatives to ftp access:

Using an rsync command to download the entire directory:
    rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ .
For a single file, e.g. chromFa.tar.gz:
    rsync -avzP \
        rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz .

Or with wget, all files:
    wget --timestamping \
        'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/*'
With wget, a single file:
    wget --timestamping \
        'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz' \
        -O chromFa.tar.gz

To unpack the *.tar.gz files:
    tar xvzf <file>.tar.gz
To uncompress the fa.gz files:
    gunzip <file>.fa.gz
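
After downloading, and before uncompressing, it can be worth checking the
files against md5sum.txt.  The sketch below assumes GNU coreutils md5sum
(version 8.25 or later, for --ignore-missing) and assumes that md5sum.txt
uses the usual "checksum  filename" layout; the function name
verify_downloads is made up for this example.

```shell
# verify_downloads DIR
# Checks every file in DIR that is listed in DIR/md5sum.txt.
# Files listed but not downloaded are skipped (--ignore-missing).
# Returns non-zero if any checked file does not match its checksum.
verify_downloads() {
    ( cd "$1" && md5sum --check --ignore-missing --quiet md5sum.txt )
}
```

Example use, from the directory that holds the downloaded files:
    verify_downloads . && echo "all downloaded files match"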

All the files in this directory are freely available for public use.

Name                          Last modified       Size
hg19.chrom.sizes              08-Mar-2009 14:56   1.9K
hg19.trf.bed.gz               08-Mar-2009 15:00   7.6M
hg19.2bit                     08-Mar-2009 15:29   778M
hg19.fa.out.gz                08-Mar-2009 21:55   163M
hg19.fa.align.gz              08-Mar-2009 22:08   2.2G
chromAgp.tar.gz               20-Mar-2009 09:02   538K
chromOut.tar.gz               20-Mar-2009 09:03   163M
chromFa.tar.gz                20-Mar-2009 09:21   905M
chromFaMasked.tar.gz          20-Mar-2009 09:30   477M
chromTrf.tar.gz               20-Mar-2009 09:30   7.6M
hg19.agp.gz                   06-May-2009 15:22   532K
hg19.fa.gz                    21-Aug-2018 12:56   905M
hg19.fa.masked.gz             12-Sep-2018 10:33   477M
hg19.gc5Base.wigVarStep.gz    28-Sep-2018 15:21   1.5G
hg19.gc5Base.wib              17-Jan-2019 14:49   571M
hg19.gc5Base.wig.gz           17-Jan-2019 14:49   11M
md5sum.txt                    17-Jan-2019 15:55   967
initial/                      28-Mar-2019 14:44   -
mrna.fa.gz                    14-Oct-2019 14:50   370M
mrna.fa.gz.md5                14-Oct-2019 14:50   45
xenoMrna.fa.gz                14-Oct-2019 15:00   6.4G
xenoMrna.fa.gz.md5            14-Oct-2019 15:00   49
est.fa.gz                     14-Oct-2019 15:08   1.5G
est.fa.gz.md5                 14-Oct-2019 15:08   44
xenoRefMrna.fa.gz             14-Oct-2019 15:08   250M
xenoRefMrna.fa.gz.md5         14-Oct-2019 15:08   52
refMrna.fa.gz                 14-Oct-2019 15:08   80M
refMrna.fa.gz.md5             14-Oct-2019 15:08   48
upstream1000.fa.gz            14-Oct-2019 15:09   9.7M
upstream1000.fa.gz.md5        14-Oct-2019 15:09   53
upstream2000.fa.gz            14-Oct-2019 15:10   18M
upstream2000.fa.gz.md5        14-Oct-2019 15:10   53
upstream5000.fa.gz            14-Oct-2019 15:10   47M
upstream5000.fa.gz.md5        14-Oct-2019 15:10   53
p13.plusMT/                   17-Jan-2020 17:47   -
genes/                        05-Feb-2020 13:47   -
analysisSet/                  13-Mar-2020 17:39   -
latest/                       25-Mar-2020 13:33   -