Human Genome Project Working Draft Caveats
- This is a work in progress. Data content will change and data format may change in future versions. Expect
that any code written to interpret this data may have to be changed.
- When data changes, all sequence positions change. The only static sequence positions are references in the
.agp files to ranges in the original GenBank accessions used to build this view of the working draft.
- An attempt has been made to place all the data from all the GenBank accessions in the given GenBank freeze,
after reasonable cleaning to remove contamination. Contamination
removal was performed by Greg Schuler at NCBI.
Some clone contamination still remains, including contamination by nonhuman sequence, and contamination of one
clone with the sequence from another human clone. We do not currently know the extent of the contamination, but
some of the smaller sequence fragments placed on the working draft represent these forms of contamination,
especially those that end up being placed far from most other fragments in their clone.
- Some fragments in GenBank or EMBL submissions represent misassemblies, so that the sequence in these fragments is not
actually all contiguous within the given clone.
No attempt has been made to detect or break up these misassembled fragments.
- Some fragments have no sequence overlap, cDNA connection, or BAC end connections to help order or orient them
relative to the other fragments in a clone layout. In such cases the order and orientation of the fragment are derived
from the original GenBank or EMBL submissions. In some cases BAC end links or cDNA links between fragments may be wrong.
This may cause the fragment to be oriented or ordered incorrectly. For freezes before Sept. 5,
our best guess is that about 75% of the fragments are in the correct orientation. Estimating
the accuracy of the ordering is more involved.
Starting with the Sept. 5 freeze, plasmid end read pairs were also used as additional
evidence for order and orientation (o+o). More were added for the Oct. 7 freeze.
This appears to have improved the o+o, and our guess is that starting
with the Oct. 7 freeze, order and orientation are about 85% accurate.
Estimates of order and orientation accuracy vary widely from region to region and are still uncertain at this time.
- The construction of the working draft in regions where the sequence has been finished is
not 100% reliable in earlier assemblies,
due to occasional inconsistencies between information about finished sequence contigs, taken
from the NT finished contigs at NCBI via the Wash. U. accession map, and
other constraints on the assembly from additional draft clones and auxiliary data.
The treatment of finished regions improved in the Oct. 7 freeze final assembly. Also,
starting with the Sept. 5 freeze, we use special .fa and .agp files for the finished
chromosomes 21 and 22 that have been constructed by Greg Schuler at NCBI using data
generated by the centers that sequenced these chromosomes. The original data on
chromosome 21 is available from the
Max Planck Institute and
RIKEN, and on chromosome 22 from the
Sanger Centre.
- True overlaps between fragments that are short or that involve a lot of repetitive DNA are not detected. These
fragments will be included separately as not overlapping.
- Some clone overlaps and larger fragment overlaps may be missed because
there is inconsistent information that GigAssembler cannot reconcile, causing
the overlap not to be made. In addition, some clones may not be placed
correctly in the map, causing missed overlaps. Both these processes
can cause artifactual duplication of regions, usually smaller than 50 kb.
We estimate that about 3 percent of the sequence is represented by such
artifactual duplication.
- Some of the joins made between fragments in building the working draft may be misassemblies. It is believed
the rate of such misassemblies is quite low, but there are no independent tests of this yet.
- Other corrections to the data are given in news archives.
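
Because the only static positions are the .agp references to ranges in the original GenBank accessions, one way to keep an annotation stable across freezes is to translate a working draft position into accession coordinates. The following is a minimal sketch, not part of the original release. It assumes the conventional nine-column .agp layout (object, object start, object end, part number, component type, component ID, component start, component end, orientation) with 1-based inclusive coordinates, and it skips gap lines. The function names `parse_agp` and `to_accession` and the accession IDs in the usage note are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Component(NamedTuple):
    """One sequence (non-gap) line of an .agp file, assumed nine-column layout."""
    obj: str          # assembled object, e.g. a chromosome name
    obj_beg: int      # 1-based start on the object
    obj_end: int      # 1-based inclusive end on the object
    comp_id: str      # accession of the contributing fragment
    comp_beg: int     # 1-based start within the accession
    comp_end: int     # 1-based inclusive end within the accession
    orientation: str  # "+" or "-"


def parse_agp(lines: Iterable[str]) -> List[Component]:
    """Collect sequence lines; skip comments, blanks, and gap lines."""
    comps = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        # Gap lines carry type N (or U) in column 5 and may have fewer columns.
        if len(fields) < 9 or fields[4] in ("N", "U"):
            continue
        comps.append(Component(fields[0], int(fields[1]), int(fields[2]),
                               fields[5], int(fields[6]), int(fields[7]),
                               fields[8]))
    return comps


def to_accession(comps: List[Component], obj: str,
                 pos: int) -> Optional[Tuple[str, int]]:
    """Map a 1-based object position to (accession, position in accession).

    Returns None when the position falls in a gap or outside the object.
    """
    for c in comps:
        if c.obj == obj and c.obj_beg <= pos <= c.obj_end:
            offset = pos - c.obj_beg
            if c.orientation == "-":
                # Reverse-oriented fragments count down from the component end.
                return c.comp_id, c.comp_end - offset
            return c.comp_id, c.comp_beg + offset
    return None
```

For example, given a line placing bases 51 to 150 of a hypothetical accession `AC000002.1` in reverse orientation at `chr1` positions 201 to 300, `to_accession(comps, "chr1", 201)` yields `("AC000002.1", 150)`. A linear scan is used for brevity; a sorted index would be a better fit for whole-chromosome files.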