Human Genome Working Draft: Human Genome Browser Gateway

Human Genome Project Working Draft Caveats

This is a work in progress. Data content will change and data format may change in future versions. Expect that any code written to interpret this data may have to be changed.
When data changes, all sequence positions change. The only static sequence positions are references in the .agp files to ranges in the original GenBank accessions used to build this view of the working draft.
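As a rough illustration of why the accession ranges are the safer thing to store, the sketch below reads one .agp line and extracts the accession range as a key that should survive a new freeze. It assumes the common tab-separated AGP column layout (object, start, end, part number, type, then either component fields or gap fields); the exact layout of the files for a given freeze may differ. The names `AgpRow`, `parse_agp_line`, and `stable_key`, and the accession in the sample line, are made up for this example.

```python
from typing import NamedTuple, Optional, Tuple


class AgpRow(NamedTuple):
    chrom: str
    chrom_start: int              # position in the draft; changes between freezes
    chrom_end: int
    part_number: int
    kind: str                     # component type code; "N" or "U" marks a gap
    accession: Optional[str]      # None for gap lines
    acc_start: Optional[int]
    acc_end: Optional[int]
    strand: Optional[str]


def parse_agp_line(line: str) -> Optional[AgpRow]:
    """Parse one line, assuming the tab-separated AGP layout described above."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # skip blank lines and comments
    f = line.split("\t")
    chrom, start, end, part, kind = f[0], int(f[1]), int(f[2]), int(f[3]), f[4]
    if kind in ("N", "U"):
        # Gap line: no accession range to report.
        return AgpRow(chrom, start, end, part, kind, None, None, None, None)
    return AgpRow(chrom, start, end, part, kind, f[5], int(f[6]), int(f[7]), f[8])


def stable_key(row: AgpRow) -> Optional[Tuple[str, int, int]]:
    """Return the accession range, the only part expected to stay fixed."""
    if row.accession is None:
        return None
    return (row.accession, row.acc_start, row.acc_end)


# Hypothetical sample line; the accession is invented for illustration.
sample = "chr1\t1\t5000\t1\tD\tAC000001.1\t1\t5000\t+"
row = parse_agp_line(sample)
key = stable_key(row)
```

Code that records `stable_key(row)` instead of `chrom_start` and `chrom_end` would not need its stored positions rewritten when the draft coordinates shift.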
An attempt has been made to place all the data from all the GenBank accessions in the given GenBank freeze, after reasonable cleaning to remove contamination. Contamination removal was performed by Greg Schuler at NCBI. Some clone contamination still remains, including contamination by nonhuman sequence, and contamination of one clone with the sequence from another human clone. We do not currently know the extent of the contamination, but some of the smaller sequence fragments placed on the working draft represent these forms of contamination, especially those that end up being placed far from most other fragments in their clone.
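The hint above (small fragments placed far from the rest of their clone) can be turned into a simple screening heuristic. The sketch below is only an illustration and is not the method used to build the working draft: the function name `flag_suspect_fragments`, the input tuple layout, and both thresholds (`max_size`, `min_distance`) are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import median


def flag_suspect_fragments(fragments, max_size=2000, min_distance=1_000_000):
    """Flag small fragments placed far from the bulk of their clone.

    fragments: iterable of (frag_id, clone_id, start, end) in draft coordinates.
    max_size and min_distance are arbitrary illustrative thresholds.
    """
    by_clone = defaultdict(list)
    for frag_id, clone_id, start, end in fragments:
        by_clone[clone_id].append((frag_id, start, end))

    suspects = []
    for clone_id, frags in by_clone.items():
        # Use the median midpoint as a stand-in for "where most of the clone is".
        centre = median((s + e) / 2 for _, s, e in frags)
        for frag_id, s, e in frags:
            small = (e - s) <= max_size
            far = abs((s + e) / 2 - centre) >= min_distance
            if small and far:
                suspects.append(frag_id)
    return suspects


# Made-up placements: three large fragments close together, one small stray.
example = [
    ("a1", "A", 100_000, 150_000),
    ("a2", "A", 150_000, 200_000),
    ("a3", "A", 200_000, 260_000),
    ("a4", "A", 5_000_000, 5_001_000),
]
suspects = flag_suspect_fragments(example)
```

A fragment flagged this way is only a candidate for closer inspection; a distant placement can also result from a clone that is itself misplaced in the map.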
Some fragments in GenBank or EMBL submissions represent misassemblies, so that the sequence in these fragments is not actually all contiguous within the given clone. No attempt has been made to detect or break up these misassembled fragments.
Some fragments have no sequence overlap, cDNA connection, or BAC end connection to help order or orient them relative to the other fragments in a clone layout. In such cases the order and orientation of the fragment are derived from the original GenBank or EMBL submissions. In some cases BAC end links or cDNA links between fragments may be wrong, which may cause a fragment to be oriented or ordered incorrectly. For freezes before Sept. 5, our best guess is that about 75% of the fragments are in the correct orientation; estimating the accuracy of the ordering is more involved. Starting with the Sept. 5 freeze, plasmid end read pairs were also used as additional order and orientation information, and more were added for the Oct. 7 freeze. This appears to have improved order and orientation (o+o), and our guess is that starting with the Oct. 7 freeze, order and orientation are about 85% accurate. Estimates of order and orientation accuracy are highly variable from region to region and are still uncertain at this time.
The construction of the working draft in regions where the sequence has been finished is not 100% reliable in earlier assemblies, due to occasional inconsistencies between information about finished sequence contigs (taken from the NT finished contigs at NCBI, via the Wash. U. accession map) and other constraints on the assembly from additional draft clones and auxiliary data. The treatment of finished regions improved in the final assembly of the Oct. 7 freeze. Also, starting with the Sept. 5 freeze, we use special .fa and .agp files for the finished chromosomes 21 and 22 that have been constructed by Greg Schuler at NCBI using data generated by the centers that sequenced these chromosomes. The original data on chromosome 21 is available from the Max Planck Institute and RIKEN, and on chromosome 22 from the Sanger Centre.
True overlaps between fragments that are short or that involve a lot of repetitive DNA are not detected. Such fragments are included separately, as if they did not overlap.
Some clone overlaps and larger fragment overlaps may be missed because there is inconsistent information that GigAssembler cannot reconcile, causing the overlap not to be made. In addition, some clones may not be placed correctly in the map, causing missed overlaps. Both of these processes can cause artifactual duplication of regions, usually less than 50 kb in size. We estimate that about 3 percent of the sequence is represented by such artifactual duplication.
Some of the joins made between fragments in building the working draft may be misassemblies. It is believed the rate of such misassemblies is quite low, but there are no independent tests of this yet.
Other corrections to the data are given in news archives.