This file is from:

This directory contains compressed multiple alignments of
119 virus sequences.

The 'reference' sequence for this collection is the sequence:

  NC_045512v2 - 2019-12-30 - Wuhan-Hu-1

These 119 unique sequences were obtained from 3 sources:
   1. NCBI Entrez search term: "SARS-CoV-2" produced 106 sequences
      as of 2020-03-06.  40 of these sequences were exact duplicates
      of other sequences in this set, and 21 of these sequences were fragments
      of gene sequences.  The duplicates and fragments are not included
      in the list of 119 sequences.
   2. "coronaviridae" sequences (55 sequences) obtained from:
      selected to show "RefSeq nucleotides"
   3. 12 additional unique sequences were obtained from:
      all the other sequences available here were copies of the NCBI/genbank
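
As an illustration of the "exact duplicates" filtering mentioned in source 1,
the following is a minimal sketch of one way to spot identical sequences in a
FASTA file.  It is not the procedure used to build this collection; the
function name list_duplicate_fasta is hypothetical, and the comparison is
assumed to be case-insensitive over the full sequence.

```shell
# Hypothetical helper: print the names of FASTA records whose sequence
# exactly matches an earlier record (case-insensitive comparison).
# Reads the named files, or stdin when no file is given.
list_duplicate_fasta() {
    awk '
        # Record the finished sequence, or report it if already seen.
        function emit() {
            if (name == "") return
            if (seq in seen) print name "\tduplicate of\t" seen[seq]
            else seen[seq] = name
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }
        { seq = seq toupper($0) }
        END { emit() }
    ' "$@"
}
```

For example, "zcat someFile.fa.gz | list_duplicate_fasta" prints one line
for each record that repeats an earlier sequence.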

Description files in this directory:

  md5sums.txt - md5 sums to verify copied files
  wuhCor1.119way.nameList.txt - relates each accession name to its
                              sequence name and sample collection date

  wuhCor1.119way.nh - Phylogenetic tree used for multiz alignment.
           The tree was calculated from a distance matrix of 31-mer frequency
           similarity, joined by neighbor joining with the 'neighbor' command
           of the PHYLIP toolset:

  wuhCor1.multiz119way.maf.gz - multiple alignments with gap annotation,
                                sequences named by accession identifier

  sequences/ - directory with files:

  sequences/dnaFasta119.tgz - gzipped tar file for the DNA fasta, 119 sequences

  sequences/proteinFasta119.tgz - gzipped tar file for the proteins as obtained
                     from the genbank records, for example in:
                     one .faa.gz for each sequence
         (not all protein sequences are available, only 107 sequences present)

  sequences/proteinTab119.tgz - the same proteins arranged in single lines
                     of the format:

  sequenceName.proteinName<tab>amino acids . . .

  There is one file for each of the sequences.  Not all of the 119 sequences
  have these protein records; there are 107 protein records here.

  This format is convenient for extracting proteins of similar length
  from all the sequences.  For example, the longest protein
  (the ORF1ab replicase polyprotein) is over 6000 AAs.  After unpacking
  the tgz file into a directory:

  zcat * | awk -F$'\t' 'length($2) > 6000' \
     | awk -F$'\t' '{printf ">%s\n%s\n", $1, $2}' > orf1abProtein.faa

  You can drop that orf1abProtein.faa file into a multiple aligner such
  as 'COBALT' to obtain a multiple alignment of that protein for 99 of
  these sequences.
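
  Before choosing a length cutoff such as the 6000 used above, it may help
  to see which protein lengths are actually present.  The following is a
  minimal sketch, assuming the unpacked files are gzip-compressed and follow
  the sequenceName.proteinName<tab>amino acids layout described above; the
  function name protein_lengths is hypothetical.

```shell
# Hypothetical helper: read "sequenceName.proteinName<tab>amino acids"
# lines on stdin and print "length<tab>name", longest protein first.
protein_lengths() {
    awk -F'\t' 'NF >= 2 { print length($2) "\t" $1 }' | sort -k1,1nr
}

# Usage, from the directory where proteinTab119.tgz was unpacked:
#   zcat * | protein_lengths | head
```

  Proteins of the same kind tend to cluster at similar lengths, so the
  listing suggests where to place the cutoff for the awk filter.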

For a description of multiple alignment format (MAF), see

PhastCons conservation scores for these alignments are available at:

PhyloP conservation scores for these alignments are available at:

To download a large file or multiple files from this directory, we recommend
that you use rsync or ftp rather than downloading the files via our website.

Via rsync:
rsync -avz --progress \
        rsync:// ./

Via FTP:
    user name: anonymous
    password: <your email address>
    go to the directory goldenPath/wuhCor1/multiz119way

To download multiple files from the UNIX command line, use the "mget" command.
    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)
Use the "prompt" command to toggle the interactive mode if you do not want
to be prompted for each file that you download.
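
After downloading, the copied files can be checked against the md5 sums
file (md5sums.txt in the directory listing).  The following is a minimal
sketch, assuming GNU coreutils md5sum is installed and that the sums file
is in the "checksum  filename" format that md5sum itself writes; the
function name verify_downloads is hypothetical.

```shell
# Hypothetical helper: check downloaded files against md5sums.txt.
# Takes the download directory as its argument (default: current directory).
verify_downloads() {
    dir="${1:-.}"
    # --ignore-missing skips files listed in md5sums.txt that were not downloaded
    ( cd "$dir" && md5sum --ignore-missing -c md5sums.txt )
}

# Usage:
#   verify_downloads ./multiz119way
```

The function returns a non-zero exit status if any downloaded file does not
match its recorded checksum.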

All the files in this directory are freely usable for any
purpose. For data use restrictions regarding the individual
genome assemblies, see

      Name                                Last modified      Size  Description
      Parent Directory                                         -
      wuhCor1.strainName119way.maf.gz     2020-05-01 13:48  1.0M
      wuhCor1.119way.phyloDistance.txt    2020-03-13 12:06  5.5K
      wuhCor1.119way.nh                   2020-03-13 10:03  6.4K
      wuhCor1.119way.nameList.txt         2020-03-11 14:55  5.5K
      wuhCor1.119way.descriptiveName.nh   2020-03-13 10:52  9.6K
      sequences/                          2020-05-01 14:55    -
      md5sums.txt                         2020-03-13 12:16   497