DIRECTORY STRUCTURE:

  README.txt -- this file
  alignments/${ALIGNER}/*.maf
  alignments/${ALIGNER}/${CONSERVATION}/
  metadata.txt -- description of all of the sequences; same as the FASTA header lines
  sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa

Each FASTA file contains all of the sequence entries for a given
species/region combination.
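
As a rough illustration of the layout above, the following Python sketch
builds the expected location of one FASTA file. The function name
fasta_path and its root parameter are hypothetical helpers introduced
here; only the path pattern itself comes from this README.

```python
from pathlib import PurePosixPath


def fasta_path(common_name, encode_region, root="."):
    """Return the expected FASTA location for one species/region pair.

    Follows sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa;
    'root' (hypothetical) is wherever this freeze directory was copied.
    """
    filename = f"{common_name}.{encode_region}.fa"
    return PurePosixPath(root) / "sequences" / encode_region / filename


print(fasta_path("baboon", "ENm001"))
```

The species and region names used here are the ones quoted as examples
in this README; whether a given combination exists in the freeze has to
be checked against the actual directory contents.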

Description of the FASTA Header lines and the metadata.txt file:

>${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID}|${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME}|${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROMOSOME_LENGTH}|${STRAND}|${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM}|${TOTAL_NUM_CONTIGS}|${COMMENT}

Where:

        ${COMMON_NAME}           like 'baboon' or 'dusky_titi'
        ${ENCODE_REGION}         like 'ENm001' or 'ENr223'
        ${FREEZE_DATE}           like 'AUG-2004'; latest date for inclusion in
                                 this freeze of the set of sequences
                                 encompassing the ENCODE regions
        ${NCBI_TAXON_ID}         like '9555' or '9523'
        ${ASSEMBLY_PROVIDER}     like 'NISC' or 'RGSC'
        ${ASSEMBLY_DATE}         like 'NOV-2003' or '21-JUN-2003'; date
                                 associated with the specific sequence
                                 assembly represented in this ENCODE freeze
        ${ASSEMBLY_ID}           like 'rn3' or 'panTro1'
        ${CHROMOSOME}            like 'chr1' or 'chr19_random'
        ${CHROMOSOME_START}      [1-based]
        ${CHROMOSOME_END}        [1-based]
        ${CHROMOSOME_LENGTH}     length of entire ${CHROMOSOME}
        ${STRAND}                '+' or '-', indicating whether the sequence
                                 came from the top or bottom DNA strand
        ${ACCESSION}.${VERSION}  like 'NT_107546.1', or an internal identifier
                                 for assemblies that have not been
                                 accessioned yet
        ${NUM_BASES}             total number of called bases in the sequence
                                 entry, including N's
        ${NUM_N}                 total number of N's in the sequence entry
        ${THIS_CONTIG_NUM}       ID of sequence contig (see next variable)
        ${TOTAL_NUM_CONTIGS}     total number of sequence contigs syntenic to
                                 a human region
        ${COMMENT}               This is an example I hope we all agree on.
                                 (Currently '.' for all entries.)

>rat|ENm001|May-2005|10116|Baylor HGSC v. 3.1|01-Jun-2003|rn3|chr4|42742602|44711183|187371129|+|NT_107460.3|1968582|143786|1|1|.
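
A header line like the one above can be split on the '|' character into
the 18 fields listed earlier. The following is a minimal Python sketch,
not an official parser: the names FIELDS and parse_header are
hypothetical, and ${ACCESSION}.${VERSION} is kept as one field called
ACCESSION_VERSION because it occupies a single '|'-delimited slot.

```python
# Field names in header order. ACCESSION_VERSION is a hypothetical name
# for the combined ${ACCESSION}.${VERSION} slot.
FIELDS = [
    "COMMON_NAME", "ENCODE_REGION", "FREEZE_DATE", "NCBI_TAXON_ID",
    "ASSEMBLY_PROVIDER", "ASSEMBLY_DATE", "ASSEMBLY_ID", "CHROMOSOME",
    "CHROMOSOME_START", "CHROMOSOME_END", "CHROMOSOME_LENGTH", "STRAND",
    "ACCESSION_VERSION", "NUM_BASES", "NUM_N", "THIS_CONTIG_NUM",
    "TOTAL_NUM_CONTIGS", "COMMENT",
]


def parse_header(line):
    """Split one FASTA header line into a dict of string values."""
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    values = text.split("|")
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields, found {len(values)}"
        )
    return dict(zip(FIELDS, values))


example = (
    ">rat|ENm001|May-2005|10116|Baylor HGSC v. 3.1|01-Jun-2003|rn3|chr4|"
    "42742602|44711183|187371129|+|NT_107460.3|1968582|143786|1|1|."
)
record = parse_header(example)
print(record["ASSEMBLY_ID"], record["CHROMOSOME"], record["NUM_BASES"])
```

All values are left as strings so that placeholder entries survive
unchanged. For this particular example, CHROMOSOME_END minus
CHROMOSOME_START plus 1 equals NUM_BASES (1968582), which is consistent
with 1-based, inclusive coordinates; this has not been verified for
every entry.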

Some fields are optional.  For example, when ${ASSEMBLY_PROVIDER} is
NISC, there is no ${ASSEMBLY_ID} and there are no chrom:start-stop
coordinates.  Unused fields are filled with a period ('.') or zero ('0')
for ease of parsing.
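
One way to handle these placeholders after parsing is sketched below in
Python. The names PLACEHOLDERS, OPTIONAL_FIELDS and blank_unused are
hypothetical, the choice of which fields count as optional is an
assumption based only on the NISC case described above, and the sample
record is made up for illustration rather than taken from metadata.txt.

```python
PLACEHOLDERS = {".", "0"}

# Assumption: only the assembly ID and the chrom:start-stop coordinates
# are treated as optional, following the NISC case in this README.
OPTIONAL_FIELDS = (
    "ASSEMBLY_ID", "CHROMOSOME", "CHROMOSOME_START", "CHROMOSOME_END",
)


def blank_unused(record, optional=OPTIONAL_FIELDS):
    """Return a copy of record with unused optional fields set to None."""
    cleaned = dict(record)
    for name in optional:
        if cleaned.get(name) in PLACEHOLDERS:
            cleaned[name] = None
    return cleaned


# Hypothetical NISC-style record (illustrative values only).
sample = {
    "COMMON_NAME": "baboon",
    "ASSEMBLY_PROVIDER": "NISC",
    "ASSEMBLY_ID": ".",
    "CHROMOSOME": ".",
    "CHROMOSOME_START": "0",
    "CHROMOSOME_END": "0",
    "NUM_N": "0",
}
print(blank_unused(sample))
```

The placeholder test is applied only to the optional fields because a
value of '0' can be a real count elsewhere, for instance in ${NUM_N}.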

The FASTA sequences have been repeat-masked with default RepeatMasker
options and with Tandem Repeats Finder.  Repeat sequences are
indicated in lowercase, while non-repeat sequences are in uppercase.
These are the RepeatMasker library options that were used:

%libOptions =
    (
     "armadillo"   => "-mam    ",
     "baboon"      => "-mam    ",
     "elephant"    => "-mam    ",
     "galago"      => "-mam    ",
     "marmoset"    => "-mam    ",
     "monodelphis" => "-mam    ",
     "platypus"    => "-mam    ",
     "rfbat"       => "-mam    ",
     "chicken"     => "-chicken",
     "chimp"       => "        ",
     "cow"         => "-cow    ",
     "dog"         => "-dog    ",
     "fugu"        => "-fugu   ",
     "hedgehog"    => "-mam    ",
     "human"       => "        ",
     "macaque"     => "        ",
     "mouse"       => "-mus    ",
     "rabbit"      => "-mam    ",
     "rat"         => "-rat    ",
     "tenrec"      => "-mam    ",
     "tetraodon"   => "-danio  ",
     "xenopus"     => "        ",
     "zebrafish"   => "-danio  "
     );
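
Because repeats are marked only by letter case, as described above,
masked regions can be recovered directly from the sequence text. The
Python sketch below shows one way to do this; the function names and the
toy sequence are hypothetical and are not part of this data set.

```python
import re


def masked_intervals(sequence):
    """Return 0-based, half-open (start, end) spans of lowercase runs."""
    return [m.span() for m in re.finditer(r"[a-z]+", sequence)]


def masked_fraction(sequence):
    """Return the fraction of characters that are lowercase (masked)."""
    if not sequence:
        return 0.0
    lowercase = sum(1 for base in sequence if base.islower())
    return lowercase / len(sequence)


toy = "ACGTacgtacACGTttACGT"  # made-up sequence for illustration
print(masked_intervals(toy))
print(masked_fraction(toy))
```

The spans returned here are 0-based offsets into the sequence string,
unlike the 1-based ${CHROMOSOME_START} and ${CHROMOSOME_END} values in
the header lines.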

There is also a set of RECON libraries that have been prepared by
Damian Keefe at EBI.  We have not tested these yet, but plan to do
so before the July meeting.  They are available here:

      ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/repeat_libraries/

Data Release Terms
------------------
All data in this directory and any subdirectories is subject to the terms 
of the ENCODE Project Data Release Policy of the National Human Genome Research
Institute.  This policy is posted at:

http://www.genome.gov/12513440
http://genome.ucsc.edu/encode/terms.html


Name                 Last modified       Size
Parent Directory                         -
sequences/           2005-05-26 22:07    -
metadata.txt         2005-05-26 22:02    1.0M
alignments/          2005-05-22 14:04    -
ENr334/              2005-05-24 20:51    -
ENr333/              2005-05-24 20:51    -
ENr332/              2005-05-24 20:51    -
ENr331/              2005-05-24 20:51    -
ENr324/              2005-05-24 20:51    -
ENr323/              2005-05-24 20:51    -
ENr322/              2005-05-24 20:51    -
ENr321/              2005-05-24 20:51    -
ENr313/              2005-05-24 20:51    -
ENr312/              2005-05-24 20:50    -
ENr311/              2005-05-24 20:51    -
ENr233/              2005-05-24 20:51    -
ENr232/              2005-05-24 20:51    -
ENr231/              2005-05-24 20:51    -
ENr223/              2005-05-24 20:51    -
ENr222/              2005-05-24 20:51    -
ENr221/              2005-05-24 20:51    -
ENr213/              2005-05-24 20:51    -
ENr212/              2005-05-24 20:51    -
ENr211/              2005-05-24 20:51    -
ENr133/              2005-05-24 20:51    -
ENr132/              2005-05-24 20:51    -
ENr131/              2005-05-24 20:51    -
ENr123/              2005-05-24 20:51    -
ENr122/              2005-05-24 20:51    -
ENr121/              2005-05-24 20:51    -
ENr114/              2005-05-24 20:51    -
ENr113/              2005-05-24 20:51    -
ENr112/              2005-05-24 20:51    -
ENr111/              2005-05-24 20:51    -
ENm014/              2005-05-24 20:51    -
ENm013/              2005-05-24 20:51    -
ENm012/              2005-05-24 20:51    -
ENm011/              2005-05-24 20:51    -
ENm010/              2005-05-24 20:50    -
ENm009/              2005-05-24 20:50    -
ENm008/              2005-05-24 20:50    -
ENm007/              2005-05-24 20:50    -
ENm006/              2005-05-24 20:50    -
ENm005/              2005-05-24 20:50    -
ENm004/              2005-05-24 20:50    -
ENm003/              2005-05-24 20:50    -
ENm002/              2005-05-24 20:50    -
ENm001/              2005-05-24 20:50    -