This directory contains data from the December 2007 ENCODE Multi-Species
Sequence Analysis (MSA) sequence freeze, along with multiple sequence
alignments based on these sequences.  The freeze consists of sequences
from regions orthologous to the human ENCODE regions in 36 vertebrate
species.  The sequences are based on comparative sequence data generated
at the NHGRI Intramural Sequencing Center (NISC) for the ENCODE project,
as well as whole-genome assemblies residing at UCSC, as listed below.

New species in this freeze are:
        orangutan
Species from previous freezes not present in this freeze are:
        xenopus, fugu, zebrafish, tetraodon
NISC sequences are present in additional regions, and all WGS genome
assemblies have been updated to the most current versions.

	* human (March 2006, hg18)
	* armadillo (NISC)
	* baboon (NISC)
	* cat (NISC)
	* chicken (galGal3)
	* chimp (Mar 2006, panTro2)
	* colobus_monkey (NISC)
	* cow (Aug 2006, bosTau3)
	* dog (May 2005, canFam2)
	* dusky_titi (NISC)
	* elephant (NISC)
	* flying_fox (NISC)
	* galago (NISC)
	* gibbon (NISC)
	* guinea_pig (NISC)
	* hedgehog (NISC)
	* horse (NISC)
	* macaque (Jan 2006, rheMac2)
	* marmoset (NISC)
	* monodelphis (Jan 2006, monDom4)
	* mouse (Jul 2007, mm9)
	* mouse_lemur (NISC)
	* orangutan (Jul 2007, ponAbe2)
	* owl_monkey (NISC)
	* platypus (NISC)
	* rabbit (NISC)
	* rat (Nov 2004, rn4)
	* rfbat (NISC)
	* rock_hyrax (NISC)
	* sbbat (NISC)
	* shrew (NISC)
	* squirrel_monkey (NISC)
	* st_squirrel (NISC)
	* tenrec (NISC)
	* tree_shrew (NISC)
	* vervet (NISC)

DIRECTORY STRUCTURE:

  sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa
  sequences/metadata.txt        description of all of the sequences; same as header lines
  DEC-2007.tar.gz               tarfile of the contents of the sequences directory
  encode_36way.gif              phylogenetic tree image
  species36.nh                  phylogeny in newick tree format
  tree_4d.tba.nh                phylogeny with branch lengths, based on 4-fold degenerate sites
  alignments/                   multiple sequence alignments
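
As a minimal sketch of the naming scheme above, the path to one
species/region FASTA file could be built like this.  The function name
fasta_path and its root argument are illustrative assumptions, not part
of this distribution.

```python
from pathlib import PurePosixPath


def fasta_path(common_name, encode_region, root="sequences"):
    """Build <root>/<region>/<name>.<region>.fa as a POSIX-style path.

    The helper name and the 'root' default are assumptions made for
    illustration; only the path pattern comes from this README.
    """
    file_name = f"{common_name}.{encode_region}.fa"
    return PurePosixPath(root) / encode_region / file_name


print(fasta_path("baboon", "ENm001"))
```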


Each FASTA file contains all the sequence entries for a given
species/region.
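
Because one file may hold several entries, a reader has to split on
header lines.  The following is a minimal sketch of such a reader; the
function name read_fasta is an assumption, and the in-memory demo data
is made up rather than taken from the freeze.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle.

    The header is returned without its leading '>'; sequence lines are
    joined into one string with case preserved.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-entry input standing in for a real file from sequences/.
demo = io.StringIO(">entry_one\nACGTacgt\nNNNN\n>entry_two\nGGCC\n")
entries = list(read_fasta(demo))
print(len(entries))
```

With a real file, the same function would be called on an opened file
handle instead of the StringIO object.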

Description of the FASTA Header lines and the metadata.txt file:

>${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID}|${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME}|${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROMOSOME_LENGTH}|${STRAND}|${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM}|${TOTAL_NUM_CONTIGS}|${COMMENT}

Where:

	${COMMON_NAME}		like 'baboon' or 'dusky_titi'
	${ENCODE_REGION}	like 'ENm001' or 'ENr223'
	${FREEZE_DATE}		like 'AUG-2004'; latest date for inclusion in this freeze of the set of sequences encompassing the ENCODE regions
	${NCBI_TAXON_ID}	like '9555' or '9523'
	${ASSEMBLY_PROVIDER}	like 'NISC' or 'RGSC'
	${ASSEMBLY_DATE}	like 'NOV-2003' or '21-JUN-2003'; Date associated with the specific sequence assembly represented in this ENCODE freeze
	${ASSEMBLY_ID}		like 'rn4' or 'panTro2'
	${CHROMOSOME}		like 'chr1' or 'chr19_random'
	${CHROMOSOME_START}	[1 based]
	${CHROMOSOME_END}	[1 based]
	${CHROMOSOME_LENGTH}	length of entire ${CHROMOSOME}
	${STRAND}		as in '+' or '-' indicating whether the sequence came from the top or bottom DNA strand
	${ACCESSION}.${VERSION}	like 'NT_107546.1', or an internal identifier for assemblies that have not been accessioned yet
	${NUM_BASES}		total number of called bases in the sequence entry, including N's
	${NUM_N}		total number of N's in the sequence entry
	${THIS_CONTIG_NUM}	ID of this sequence contig (see next variable)
	${TOTAL_NUM_CONTIGS}	total number of sequence contigs syntenic to a human region
	${COMMENT}		free-text comment (currently '.' for all entries)

Example header line:

>rat|ENm001|May-2005|10116|Baylor HGSC v. 3.1|01-Jun-2003|rn3|chr4|42742602|44711183|187371129|+|NT_107460.3|1968582|143786|1|1|.
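
A header line of this form can be split on '|' into its 18 fields.  The
sketch below is one possible way to do that; the lowercase field names
and the function name parse_header are assumptions chosen to mirror the
variables described above.  Since metadata.txt is described as the same
as the header lines, the same approach may apply there, but that has
not been checked here.

```python
# Field names mirror the ${...} variables in this README (lowercased).
FIELDS = [
    "common_name", "encode_region", "freeze_date", "ncbi_taxon_id",
    "assembly_provider", "assembly_date", "assembly_id", "chromosome",
    "chromosome_start", "chromosome_end", "chromosome_length", "strand",
    "accession_version", "num_bases", "num_n", "this_contig_num",
    "total_num_contigs", "comment",
]


def parse_header(line):
    """Split one FASTA header line into a dict of string-valued fields."""
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    values = text.split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


example = (">rat|ENm001|May-2005|10116|Baylor HGSC v. 3.1|01-Jun-2003|rn3|"
           "chr4|42742602|44711183|187371129|+|NT_107460.3|1968582|143786|"
           "1|1|.")
record = parse_header(example)
print(record["assembly_id"], record["chromosome"], record["num_bases"])
```

All values are kept as strings so that placeholder values survive
unchanged; numeric conversion is left to the caller.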

Some fields are optional.  For example, when ${ASSEMBLY_PROVIDER} is
NISC, there is no ${ASSEMBLY_ID} and there are no chrom:start-stop
coordinates.  Unused fields are filled with a period ('.') or zero ('0')
for ease of parsing.
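
The placeholder convention can be handled with a small helper, sketched
below under two assumptions that this README does not state outright.
First, ${CHROMOSOME_START} and ${CHROMOSOME_END} are treated as an
inclusive 1-based interval, which is inferred from the example header,
where end - start + 1 equals ${NUM_BASES} (1968582).  Second, the
0-based half-open output is just one commonly used target convention.
The function names are illustrative.

```python
def optional_int(value):
    """Return None for the '.' or '0' placeholders, else int(value).

    Intended for coordinate fields only: for count fields such as
    NUM_N, a literal 0 may be a real value rather than a placeholder.
    """
    if value in (".", "0"):
        return None
    return int(value)


def to_zero_based(start, end):
    """Convert 1-based inclusive coordinates to 0-based half-open.

    Returns None when either coordinate is an unused placeholder.
    The inclusive-end reading is an assumption (see the note above).
    """
    first, last = optional_int(start), optional_int(end)
    if first is None or last is None:
        return None
    return first - 1, last


print(to_zero_based("42742602", "44711183"))
print(to_zero_based(".", "."))
```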

The FASTA sequences have been repeat-masked with default RepeatMasker
options and with Tandem Repeats Finder.  Repeat sequences are
indicated in lowercase, while non-repeat sequences are in uppercase.
These are the RepeatMasker library options that were used:

	armadillo        =>  mammal
	baboon           =>  mammal
	cat              =>  cat
	chicken          =>  chicken
	chimp            =>  mammal
	colobus_monkey   =>  mammal
	cow              =>  cow
	dog              =>  dog
	dusky_titi       =>  cow
	elephant         =>  mammal
	flying_fox       =>  mammal
	galago           =>  mammal
	gibbon           =>  mammal
	guinea_pig       =>  mammal
	hedgehog         =>  mammal
	horse            =>  mammal
	human            =>  human
	macaque          =>  mammal
	marmoset         =>  mammal
	monodelphis      =>  mammal
	mouse            =>  mus
	mouse_lemur      =>  mammal
	orangutan        =>  mammal
	owl_monkey       =>  mammal
	platypus         =>  mammal
	rabbit           =>  rodentia
	rat              =>  rat
	rfbat            =>  mammal
	rock_hyrax       =>  mammal
	sbbat            =>  mammal
	shrew            =>  mammal
	squirrel_monkey
	st_squirrel      =>  mammal
	tenrec           =>  mammal
	tree_shrew       =>  mammal
	vervet           =>  mammal
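
Since repeats are marked by lowercase letters, the masked fraction of a
sequence can be estimated by counting lowercase characters.  The sketch
below shows one way to do this and also counts N's, which could be
compared against ${NUM_N} in the header.  The function name mask_stats
and the demo sequence are made up for illustration.

```python
def mask_stats(sequence):
    """Return (masked_fraction, n_count) for a soft-masked sequence.

    masked_fraction is the share of lowercase characters; n_count
    counts 'N' in either case.  An empty sequence gives (0.0, 0).
    """
    total = len(sequence)
    masked = sum(1 for base in sequence if base.islower())
    n_count = sum(1 for base in sequence if base in ("N", "n"))
    fraction = masked / total if total else 0.0
    return fraction, n_count


# Made-up sequence: 4 uppercase bases, 4 lowercase bases, 2 N's.
frac, n_count = mask_stats("ACGTacgtNN")
print(frac, n_count)
```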


There is also a set of RECON libraries prepared by Damian Keefe at EBI:

      ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/repeat_libraries/

Data Release Terms
------------------
All data in this directory and any subdirectories are subject to the terms 
of the ENCODE Project Data Release Policy of the National Human Genome Research
Institute.  This policy is posted at:

http://www.genome.gov/12513440
http://genome.ucsc.edu/encode/terms.html

