This directory contains the Blat application for stand-alone use.

Please note that the Blat source and executables are freely available for
academic, nonprofit and personal use. Commercial licensing information is
available on the Kent Informatics website (http://www.kentinformatics.com/).

For help installing and running Blat please see:

- Documentation: http://genome.ucsc.edu/goldenPath/help/blatSpec.html

- FAQs: http://genome.ucsc.edu/FAQ/FAQblat.html

      Name                     Last modified      Size
      FOOTER.txt               2024-06-12 14:03    29K
      blat                     2024-12-18 11:45   5.3M
      blatHuge                 2024-12-18 11:45   5.3M
      gfClient                 2024-12-18 11:45   5.3M
      gfPcr                    2024-12-18 11:45   5.2M
      gfServer                 2024-12-18 11:45   5.1M
      gfServerHuge             2024-12-18 11:45   5.1M
      isPcr                    2024-12-18 11:45   5.2M

================================================================
========   blat   ====================================
================================================================

blat - Standalone BLAT v. 39x1 fast sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa, .nib or .2bit file,
      or a list of these files with one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
      an external file.  This will increase the speed
      by a factor of 40 in many cases, but is not required.
   output.psl is the name of the output file.
   Subranges of .nib and .2bit files may be specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the last form, which omits the seqid, a sequence id of
   file:start-end will be used.
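
The subrange syntax above can be split into its parts with a few lines of Python. This is only an illustrative sketch: the names Subrange and parse_subrange are made up for this example, and it assumes that neither the file path nor the seqid contains a colon. For the form without a seqid it returns None and leaves the file:start-end default to blat.

```python
from typing import NamedTuple, Optional


class Subrange(NamedTuple):
    path: str
    seq_id: Optional[str]  # None when the spec has no seqid
    start: int
    end: int


def parse_subrange(spec: str) -> Subrange:
    """Split 'file:seqid:start-end' or 'file:start-end' into its parts."""
    # The coordinates are always the text after the last colon.
    head, _, coords = spec.rpartition(":")
    start_text, dash, end_text = coords.partition("-")
    if not head or not dash:
        raise ValueError(f"not a subrange spec: {spec!r}")
    start, end = int(start_text), int(end_text)
    # Assumption: paths and seqids do not themselves contain ':'.
    if ":" in head:
        path, seq_id = head.rsplit(":", 1)
        return Subrange(path, seq_id, start, end)
    return Subrange(head, None, start, end)


if __name__ == "__main__":
    print(parse_subrange("/path/file.2bit:seqid:100-200"))
    print(parse_subrange("/path/file.nib:100-200"))
```

A plain file name with no subrange raises ValueError, so callers can fall back to treating the argument as a whole file.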
options:
   -t=type        Database type.  Type is one of:
                    dna - DNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                  The default is dna.
   -q=type        Query type.  Type is one of:
                    dna - DNA sequence
                    rna - RNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                    rnax - DNA sequence translated in three frames to protein
                  The default is dna.
   -prot          Synonymous with -t=prot -q=prot.
   -ooc=N.ooc     Use overused tile file N.ooc.  N should correspond to 
                  the tileSize.
   -tileSize=N    Sets the size of match that triggers an alignment.  
                  Usually between 8 and 12.
                  Default is 11 for DNA and 5 for protein.
   -stepSize=N    Spacing between tiles. Default is tileSize.
   -oneOff=N      If set to 1, this allows one mismatch in tile and still
                  triggers an alignment.  Default is 0.
   -minMatch=N    Sets the number of tile matches.  Usually set from 2 to 4.
                  Default is 2 for nucleotide, 1 for protein.
   -minScore=N    Sets minimum score.  This is the matches minus the 
                  mismatches minus some sort of gap penalty.  Default is 30.
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
                  90 for nucleotide searches, 25 for protein or translated
                  protein searches.
   -maxGap=N      Sets the size of maximum gap between tiles in a clump.  Usually
                  set from 0 to 3.  Default is 2. Only relevant for minMatch > 1.
   -noHead        Suppresses .psl header (so it's just a tab-separated file).
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N    Sets the number of repetitions of a tile allowed before
                  it is marked as overused.  Typically this is 256 for tileSize
                  12, 1024 for tile size 11, 4096 for tile size 10.
                  Default is 1024.  Typically comes into play only with makeOoc.
                  Also affected by stepSize: when stepSize is halved, repMatch is
                  doubled to compensate.
   -noSimpRepMask Suppresses simple repeat masking.
   -mask=type     Mask out repeats.  Alignments won't be started in masked region
                  but may extend through it in nucleotide searches.  Masked areas
                  are ignored entirely in protein or translated searches. Types are:
                    lower - mask out lower-cased sequence
                    upper - mask out upper-cased sequence
                    out   - mask according to database.out RepeatMasker .out file
                    file.out - mask database according to RepeatMasker file.out
   -qMask=type    Mask out repeats in query sequence.  Similar to -mask above, but
                  for query rather than target sequence.
   -repeats=type  Type is same as mask types above.  Repeat bases will not be
                  masked in any way, but matches in repeat areas will be reported
                  separately from matches in other areas in the psl output.
   -minRepDivergence=NN   Minimum percent divergence of repeats to allow 
                  them to be unmasked.  Default is 15.  Only relevant for 
                  masking using RepeatMasker .out files.
   -dots=N        Output dot every N sequences to show program's progress.
   -trimT         Trim leading poly-T.
   -noTrimA       Don't trim trailing poly-A.
   -trimHardA     Remove poly-A tail from qSize as well as alignments in 
                  psl output.
   -fastMap       Run for fast DNA/DNA remapping - not allowing introns, 
                  requiring high %ID. Query sizes must not exceed 5000.
   -out=type      Controls output file format.  Type is one of:
                    psl - Default.  Tab-separated format, no sequence
                    pslx - Tab-separated format with sequence
                    axt - blastz-associated axt format
                    maf - multiz-associated maf format
                    sim4 - similar to sim4 format
                    wublast - similar to wublast format
                    blast - similar to NCBI blast format
                    blast8 - NCBI blast tabular format
                    blast9 - NCBI blast tabular format with comments
   -fine          For high-quality mRNAs, look harder for small initial and
                  terminal exons.  Not recommended for ESTs.
   -maxIntron=N  Sets maximum intron size. Default is 750000.
   -extendThroughN  Allows extension of alignment through large blocks of Ns.
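
The -repMatch guidance above (256 for tileSize 12, 1024 for 11, 4096 for 10, doubled when stepSize is halved) can be written as a rule of thumb. The sketch below is an extrapolation from those three values, not the calculation blat performs internally: the function name suggested_rep_match is hypothetical, and applying the same factor of 4 per tile size outside the 10 to 12 range, or scaling proportionally for step sizes other than half the tile size, is an assumption.

```python
from typing import Optional


def suggested_rep_match(tile_size: int = 11, step_size: Optional[int] = None) -> int:
    """Rule-of-thumb repMatch inferred from the help text (not blat's own code)."""
    if step_size is None:
        step_size = tile_size  # documented default: stepSize equals tileSize
    if tile_size < 1 or step_size < 1:
        raise ValueError("tile_size and step_size must be positive")
    # Help text: 256 for tileSize 12, 1024 for 11, 4096 for 10,
    # i.e. a factor of 4 for each step down in tile size.
    base = 1024 * 4.0 ** (11 - tile_size)
    # Help text: halving stepSize doubles repMatch. Proportional scaling
    # for other step sizes is an assumption.
    return round(base * tile_size / step_size)


if __name__ == "__main__":
    for tile, step in [(12, None), (11, None), (10, None), (12, 6)]:
        print(tile, step, suggested_rep_match(tile, step))
```

The result would be passed as -repMatch=N together with -makeOoc=N.ooc when building an overused tile file.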
================================================================
========   blatHuge   ====================================
================================================================

blat - Standalone BLAT v. 39x1 fast sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa, .nib or .2bit file,
      or a list of these files with one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
      an external file.  This will increase the speed
      by a factor of 40 in many cases, but is not required.
   output.psl is the name of the output file.
   Subranges of .nib and .2bit files may be specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the last form, which omits the seqid, a sequence id of
   file:start-end will be used.
options:
   -t=type        Database type.  Type is one of:
                    dna - DNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                  The default is dna.
   -q=type        Query type.  Type is one of:
                    dna - DNA sequence
                    rna - RNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                    rnax - DNA sequence translated in three frames to protein
                  The default is dna.
   -prot          Synonymous with -t=prot -q=prot.
   -ooc=N.ooc     Use overused tile file N.ooc.  N should correspond to 
                  the tileSize.
   -tileSize=N    Sets the size of match that triggers an alignment.  
                  Usually between 8 and 12.
                  Default is 11 for DNA and 5 for protein.
   -stepSize=N    Spacing between tiles. Default is tileSize.
   -oneOff=N      If set to 1, this allows one mismatch in tile and still
                  triggers an alignment.  Default is 0.
   -minMatch=N    Sets the number of tile matches.  Usually set from 2 to 4.
                  Default is 2 for nucleotide, 1 for protein.
   -minScore=N    Sets minimum score.  This is the matches minus the 
                  mismatches minus some sort of gap penalty.  Default is 30.
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
                  90 for nucleotide searches, 25 for protein or translated
                  protein searches.
   -maxGap=N      Sets the size of maximum gap between tiles in a clump.  Usually
                  set from 0 to 3.  Default is 2. Only relevant for minMatch > 1.
   -noHead        Suppresses .psl header (so it's just a tab-separated file).
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N    Sets the number of repetitions of a tile allowed before
                  it is marked as overused.  Typically this is 256 for tileSize
                  12, 1024 for tile size 11, 4096 for tile size 10.
                  Default is 1024.  Typically comes into play only with makeOoc.
                  Also affected by stepSize: when stepSize is halved, repMatch is
                  doubled to compensate.
   -noSimpRepMask Suppresses simple repeat masking.
   -mask=type     Mask out repeats.  Alignments won't be started in masked region
                  but may extend through it in nucleotide searches.  Masked areas
                  are ignored entirely in protein or translated searches. Types are:
                    lower - mask out lower-cased sequence
                    upper - mask out upper-cased sequence
                    out   - mask according to database.out RepeatMasker .out file
                    file.out - mask database according to RepeatMasker file.out
   -qMask=type    Mask out repeats in query sequence.  Similar to -mask above, but
                  for query rather than target sequence.
   -repeats=type  Type is same as mask types above.  Repeat bases will not be
                  masked in any way, but matches in repeat areas will be reported
                  separately from matches in other areas in the psl output.
   -minRepDivergence=NN   Minimum percent divergence of repeats to allow 
                  them to be unmasked.  Default is 15.  Only relevant for 
                  masking using RepeatMasker .out files.
   -dots=N        Output dot every N sequences to show program's progress.
   -trimT         Trim leading poly-T.
   -noTrimA       Don't trim trailing poly-A.
   -trimHardA     Remove poly-A tail from qSize as well as alignments in 
                  psl output.
   -fastMap       Run for fast DNA/DNA remapping - not allowing introns, 
                  requiring high %ID. Query sizes must not exceed 5000.
   -out=type      Controls output file format.  Type is one of:
                    psl - Default.  Tab-separated format, no sequence
                    pslx - Tab-separated format with sequence
                    axt - blastz-associated axt format
                    maf - multiz-associated maf format
                    sim4 - similar to sim4 format
                    wublast - similar to wublast format
                    blast - similar to NCBI blast format
                    blast8 - NCBI blast tabular format
                    blast9 - NCBI blast tabular format with comments
   -fine          For high-quality mRNAs, look harder for small initial and
                  terminal exons.  Not recommended for ESTs.
   -maxIntron=N  Sets maximum intron size. Default is 750000.
   -extendThroughN  Allows extension of alignment through large blocks of Ns.
================================================================
========   gfClient   ====================================
================================================================

gfClient v. 39x1 - A client for the genomic finding program that produces a .psl file
usage:
   gfClient host port seqDir in.fa out.psl
where
   host is the name of the machine running the gfServer
   port is the same port that you started the gfServer with
   seqDir is the path of the .2bit or .nib files relative to the current dir
       (note these are needed by the client as well as the server)
   in.fa is a fasta format file.  May contain multiple records
   out.psl is where to put the output
options:
   -t=type       Database type. Type is one of:
                   dna - DNA sequence
                   prot - protein sequence
                   dnax - DNA sequence translated in six frames to protein
                 The default is dna.
   -q=type       Query type. Type is one of:
                   dna - DNA sequence
                   rna - RNA sequence
                   prot - protein sequence
                   dnax - DNA sequence translated in six frames to protein
                   rnax - DNA sequence translated in three frames to protein
   -prot         Synonymous with -t=prot -q=prot.
   -dots=N       Output a dot every N query sequences.
   -nohead       Suppresses 5-line psl header.
   -minScore=N   Sets minimum score.  This is twice the matches minus the 
                 mismatches minus some sort of gap penalty.  Default is 30.
   -minIdentity=N   Sets minimum sequence identity (in percent).  Default is
                 90 for nucleotide searches, 25 for protein or translated
                 protein searches.
   -out=type     Controls output file format.  Type is one of:
                   psl - Default.  Tab-separated format without actual sequence
                   pslx - Tab-separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8 - NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -maxIntron=N   Sets maximum intron size. Default is 750000.
   -genome=name  When using a dynamic gfServer, the genome name is used to
                 find the data files relative to the dynamic gfServer root, named 
                 in the form $genome.2bit, $genome.untrans.gfidx, and $genome.trans.gfidx
   -genomeDataDir=path
                 When using a dynamic gfServer, this is the directory, relative to the
                 dynamic gfServer root, that contains the genome data files.  Defaults to
                 the root directory.
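
As an illustration of how the gfClient arguments and options above fit together, the following sketch assembles an argument list in Python. The helper name gfclient_argv is hypothetical, the host, port and file names in the example are placeholders, and placing the options before the positional arguments is an assumption about what the tool accepts. Only options listed above are used.

```python
from typing import List, Optional

# Type names copied from the -t and -q option lists above.
DB_TYPES = {"dna", "prot", "dnax"}
QUERY_TYPES = {"dna", "rna", "prot", "dnax", "rnax"}


def gfclient_argv(host: str, port: int, seq_dir: str, query_fa: str,
                  out_psl: str, db_type: str = "dna", query_type: str = "dna",
                  min_identity: Optional[int] = None) -> List[str]:
    """Build (but do not run) a gfClient command line."""
    if db_type not in DB_TYPES:
        raise ValueError(f"unknown database type: {db_type}")
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query type: {query_type}")
    argv = ["gfClient", f"-t={db_type}", f"-q={query_type}"]
    if min_identity is not None:
        argv.append(f"-minIdentity={min_identity}")
    # Positional arguments in the order given by the usage line.
    argv += [host, str(port), seq_dir, query_fa, out_psl]
    return argv


if __name__ == "__main__":
    # Placeholder values; adjust to the running gfServer.
    print(" ".join(gfclient_argv("localhost", 17779, ".", "in.fa", "out.psl",
                                 min_identity=95)))
```

The returned list can be handed to subprocess.run(argv, check=True) on a machine where gfClient is installed and a gfServer is listening on the given port.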
                
================================================================
========   gfPcr   ====================================
================================================================

gfPcr - In silico PCR version 39x1 using gfServer index.
usage:
   gfPcr host port seqDir fPrimer rPrimer output
or
   gfPcr host port seqDir batch output
Where:
   host is the name of the machine running the gfServer
   port is the gfServer port (usually 17779)
   seqDir is where the nib or 2bit files for the genome database are
   fPrimer is the forward strand primer
   rPrimer is the reverse strand primer
   output is the output file.  Use 'stdout' for output to standard output
   batch is a space or tab delimited file with the following fields on each line
       name/fPrimer/rPrimer/maxProductSize
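
A batch file in the layout described above can be generated rather than typed by hand. The sketch below writes one tab-delimited line per primer pair. The function name format_pcr_batch and the primer sequences in the example are invented for illustration, and rejecting fields that contain whitespace is a precaution assumed here because the file is space or tab delimited.

```python
from typing import Iterable, Tuple


def format_pcr_batch(rows: Iterable[Tuple[str, str, str, int]]) -> str:
    """Format (name, fPrimer, rPrimer, maxProductSize) rows as batch text."""
    lines = []
    for name, f_primer, r_primer, max_size in rows:
        for field in (name, f_primer, r_primer):
            # Whitespace inside a field would be read as a column break.
            if not field or any(ch.isspace() for ch in field):
                raise ValueError(f"invalid batch field: {field!r}")
        lines.append("\t".join([name, f_primer, r_primer, str(max_size)]))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Made-up primers, for illustration only.
    text = format_pcr_batch([
        ("pair1", "ACGTACGTACGTACGT", "TTGGCCAATTGGCCAA", 4000),
    ])
    print(text, end="")
```

Writing the returned string to a file gives something that can be passed as the batch argument in the second usage form.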
options:
   -maxSize=N - Maximum size of PCR product (default 4000)
   -minPerfect=N - Minimum size of perfect match at 3' end of primer (default 15)
   -minGood=N - Minimum size where there must be 2 matches for each mismatch (default 18)
   -out=XXX - Output format.  Either
      fa - fasta with position, primers in header (default)
      bed - tab delimited format. Fields: chrom/start/end/name/score/strand
      psl - blat format.
   -name=XXX - Name to use in bed output.
   -genome=name  When using a dynamic gfServer, the genome name is used to
                 find the data files relative to the dynamic gfServer root, named 
                 in the form $genome.2bit, and $genome.untrans.gfidx.
   -genomeDataDir=path
                 When using a dynamic gfServer, this is the directory, relative to the
                 dynamic gfServer root, that contains the genome data files.  Defaults to
                 the root directory.
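
With -out=bed, each output line carries the fields chrom/start/end/name/score/strand listed above. A minimal reader for such lines might look like the sketch below. The names BedHit and parse_bed_line are hypothetical, the sample line is invented, and treating the score as an integer is an assumption based on the general BED convention.

```python
from typing import NamedTuple


class BedHit(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str


def parse_bed_line(line: str) -> BedHit:
    """Parse one tab-delimited chrom/start/end/name/score/strand line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    return BedHit(chrom, int(start), int(end), name, int(score), strand)


if __name__ == "__main__":
    # Invented example line.
    hit = parse_bed_line("chr1\t1000\t1250\tpair1\t1000\t+\n")
    # BED intervals are zero-based and half-open, so the length is end - start.
    print(hit.name, hit.end - hit.start)
```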
                

================================================================
========   gfServer   ====================================
================================================================

gfServer v 39x1 - Make a server to quickly find where DNA occurs in genome (32-bit index)
   To set up a server:
      gfServer start host port file(s)
      where the files are .2bit or .nib format files specified relative to the current directory
   To remove a server:
      gfServer stop host port
   To query a server with DNA sequence:
      gfServer query host port probe.fa
   To query a server with protein sequence:
      gfServer protQuery host port probe.fa
   To query a server with translated DNA sequence:
      gfServer transQuery host port probe.fa
   To query server with PCR primers:
      gfServer pcr host port fPrimer rPrimer maxDistance
   To process one probe fa file against a .2bit format genome (not starting server):
      gfServer direct probe.fa file(s).2bit
   To test PCR without starting server:
      gfServer pcrDirect fPrimer rPrimer file(s).2bit
   To check whether the server is alive (static instances also report usage statistics):
      gfServer status host port
     For dynamic gfServer instances, specify -genome and optionally -genomeDataDir
     to get information about an untranslated genome index. Include -trans to get
     information about a translated genome index.
   To get input file list:
      gfServer files host port
   To generate a precomputed index:
      gfServer index gfidx file(s)
     where the files are .2bit or .nib format files.  Separate indexes are
     created for untranslated and translated queries.  These can be used
     with a persistent server, as with 'start -indexFile', or with a dynamic server.
     They must follow the naming convention for dynamic servers.
   To run a dynamic server (usually called by xinetd):
      gfServer dynserver rootdir
     Data files for genomes are found relative to the root directory.
     Queries are made using the prefix of the file path relative to the root
     directory.  The files $genome.2bit, $genome.untrans.gfidx, and
     $genome.trans.gfidx are required. Typically the structure will be in
     the form:
         $rootdir/$genomeDataDir/$genome.2bit
         $rootdir/$genomeDataDir/$genome.untrans.gfidx
         $rootdir/$genomeDataDir/$genome.trans.gfidx
     In this case, one would call gfClient with
         -genome=$genome -genomeDataDir=$genomeDataDir
     Often $genomeDataDir will be the same name as $genome, however it
     can be a multi-level path. For instance:
          GCA/902/686/455/GCA_902686455.1_mSciVul1.1/
     The translated or untranslated index may be omitted if there is no
     need to handle that type of request.
     The -perSeqMax functionality can be implemented by creating a file
         $rootdir/$genomeDataDir/$genome.perseqmax
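
The dynamic server layout described above maps a root directory, a genome name and an optional data directory to three file paths and a pair of client flags. The sketch below spells out that mapping. The function name dynamic_genome_layout and the example values /data/blat and hg38 are hypothetical, and nothing here checks that the files actually exist.

```python
import posixpath
from typing import Dict, List, Tuple


def dynamic_genome_layout(root_dir: str, genome: str,
                          genome_data_dir: str = "") -> Tuple[Dict[str, str], List[str]]:
    """Return the expected data file paths and the matching gfClient flags."""
    # Without a data directory the files sit directly in the root directory.
    base = posixpath.join(root_dir, genome_data_dir) if genome_data_dir else root_dir
    files = {
        suffix: posixpath.join(base, f"{genome}.{suffix}")
        for suffix in ("2bit", "untrans.gfidx", "trans.gfidx")
    }
    flags = [f"-genome={genome}"]
    if genome_data_dir:
        flags.append(f"-genomeDataDir={genome_data_dir}")
    return files, flags


if __name__ == "__main__":
    # Hypothetical root directory and genome name.
    files, flags = dynamic_genome_layout("/data/blat", "hg38", "hg38")
    for path in files.values():
        print(path)
    print(" ".join(flags))
```

A multi-level data directory such as the GCA example above works the same way, since it is simply joined onto the root.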

options:
   -tileSize=N     Size of n-mers to index.  Default is 11 for nucleotides, 4 for
                   proteins (or translated nucleotides).
   -stepSize=N     Spacing between tiles. Default is tileSize.
   -minMatch=N     Number of n-mer matches that trigger detailed alignment.
                   Default is 2 for nucleotides, 3 for proteins.
   -maxGap=N       Number of insertions or deletions allowed between n-mers.
                   Default is 2 for nucleotides, 0 for proteins.
   -trans          Translate database to protein in 6 frames.  Note: it is best
                   to run this on RepeatMasked data in this case.
   -log=logFile    Keep a log file that records server requests.
   -seqLog         Include sequences in log file (not logged with -syslog).
   -ipLog          Include user's IP in log file (not logged with -syslog).
   -debugLog       Include debugging info in log file.
   -syslog         Log to syslog.
   -logFacility=facility  Log to the specified syslog facility - default local0.
   -mask           Use masking from .2bit file.
   -repMatch=N     Number of occurrences of a tile (n-mer) that triggers repeat masking the
                   tile. Default is 1024.
   -noSimpRepMask  Suppresses simple repeat masking.
   -maxDnaHits=N   Maximum number of hits for a DNA query that are sent from the server.
                   Default is 100.
   -maxTransHits=N Maximum number of hits for a translated query that are sent from the server.
                   Default is 200.
   -maxNtSize=N    Maximum size of untranslated DNA query sequence.
                   Default is 40000.
   -maxAaSize=N    Maximum size of protein or translated DNA queries.
                   Default is 8000.
   -perSeqMax=file File contains one seq filename (possibly with ':seq' suffix) per line.
                   -maxDnaHits will be applied to each filename[:seq] separately: each may
                   have at most maxDnaHits/2 hits.  The filename MUST not include the directory.
                   Useful for assemblies with many alternate/patch sequences.
   -canStop        If set, a quit message will actually take down the server.
   -indexFile      Index file created by `gfServer index'. Saving index can speed up
                   gfServer startup by two orders of magnitude.  The parameters must
                   exactly match the parameters when the file is written or bad things
                   will happen.
   -timeout=N      Timeout in seconds.
                   Default is 90.

================================================================
========   gfServerHuge   ====================================
================================================================

gfServer v 39x1 - Make a server to quickly find where DNA occurs in genome (64-bit index)
   To set up a server:
      gfServer start host port file(s)
      where the files are .2bit or .nib format files specified relative to the current directory
   To remove a server:
      gfServer stop host port
   To query a server with DNA sequence:
      gfServer query host port probe.fa
   To query a server with protein sequence:
      gfServer protQuery host port probe.fa
   To query a server with translated DNA sequence:
      gfServer transQuery host port probe.fa
   To query server with PCR primers:
      gfServer pcr host port fPrimer rPrimer maxDistance
   To process one probe fa file against a .2bit format genome (not starting server):
      gfServer direct probe.fa file(s).2bit
   To test PCR without starting server:
      gfServer pcrDirect fPrimer rPrimer file(s).2bit
   To check whether the server is alive (static instances also report usage statistics):
      gfServer status host port
     For dynamic gfServer instances, specify -genome and optionally -genomeDataDir
     to get information about an untranslated genome index. Include -trans to get
     information about a translated genome index.
   To get input file list:
      gfServer files host port
   To generate a precomputed index:
      gfServer index gfidx file(s)
     where the files are .2bit or .nib format files.  Separate indexes are
     created for untranslated and translated queries.  These can be used
     with a persistent server, as with 'start -indexFile', or with a dynamic server.
     They must follow the naming convention for dynamic servers.
   To run a dynamic server (usually called by xinetd):
      gfServer dynserver rootdir
     Data files for genomes are found relative to the root directory.
     Queries are made using the prefix of the file path relative to the root
     directory.  The files $genome.2bit, $genome.untrans.gfidx, and
     $genome.trans.gfidx are required. Typically the structure will be in
     the form:
         $rootdir/$genomeDataDir/$genome.2bit
         $rootdir/$genomeDataDir/$genome.untrans.gfidx
         $rootdir/$genomeDataDir/$genome.trans.gfidx
     In this case, one would call gfClient with
         -genome=$genome -genomeDataDir=$genomeDataDir
     Often $genomeDataDir will be the same name as $genome, however it
     can be a multi-level path. For instance:
          GCA/902/686/455/GCA_902686455.1_mSciVul1.1/
     The translated or untranslated index may be omitted if there is no
     need to handle that type of request.
     The -perSeqMax functionality can be implemented by creating a file
         $rootdir/$genomeDataDir/$genome.perseqmax

options:
   -tileSize=N     Size of n-mers to index.  Default is 11 for nucleotides, 4 for
                   proteins (or translated nucleotides).
   -stepSize=N     Spacing between tiles. Default is tileSize.
   -minMatch=N     Number of n-mer matches that trigger detailed alignment.
                   Default is 2 for nucleotides, 3 for proteins.
   -maxGap=N       Number of insertions or deletions allowed between n-mers.
                   Default is 2 for nucleotides, 0 for proteins.
   -trans          Translate database to protein in 6 frames.  Note: it is best
                   to run this on RepeatMasked data in this case.
   -log=logFile    Keep a log file that records server requests.
   -seqLog         Include sequences in log file (not logged with -syslog).
   -ipLog          Include user's IP in log file (not logged with -syslog).
   -debugLog       Include debugging info in log file.
   -syslog         Log to syslog.
   -logFacility=facility  Log to the specified syslog facility - default local0.
   -mask           Use masking from .2bit file.
   -repMatch=N     Number of occurrences of a tile (n-mer) that triggers repeat masking the
                   tile. Default is 1024.
   -noSimpRepMask  Suppresses simple repeat masking.
   -maxDnaHits=N   Maximum number of hits for a DNA query that are sent from the server.
                   Default is 100.
   -maxTransHits=N Maximum number of hits for a translated query that are sent from the server.
                   Default is 200.
   -maxNtSize=N    Maximum size of untranslated DNA query sequence.
                   Default is 40000.
   -maxAaSize=N    Maximum size of protein or translated DNA queries.
                   Default is 8000.
   -perSeqMax=file File contains one seq filename (possibly with ':seq' suffix) per line.
                   -maxDnaHits will be applied to each filename[:seq] separately: each may
                   have at most maxDnaHits/2 hits.  The filename MUST not include the directory.
                   Useful for assemblies with many alternate/patch sequences.
   -canStop        If set, a quit message will actually take down the server.
   -indexFile      Index file created by `gfServer index'. Saving index can speed up
                   gfServer startup by two orders of magnitude.  The parameters must
                   exactly match the parameters when the file is written or bad things
                   will happen.
   -timeout=N      Timeout in seconds.
                   Default is 90.

================================================================
========   isPcr   ====================================
================================================================

isPcr - Standalone v 39x1 In-Situ PCR Program
usage:
   isPcr database query output
where database is a fasta, nib, or twoBit file or a text file containing
a list of these files, query is a text file containing three columns (name,
forward primer, and reverse primer), and output is where the results go.
The names 'stdin' and 'stdout' can be used as file names to make using the
program in pipes easier.
options:
   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to 
               the tileSize
   -tileSize=N The size of match that triggers an alignment.
               Default is 11.
   -stepSize=N Spacing between tiles. Default is 5.
   -maxSize=N - Maximum size of PCR product (default 4000)
   -minSize=N - Minimum size of PCR product (default 0)
   -minPerfect=N - Minimum size of perfect match at 3' end of primer (default 15)
   -minGood=N - Minimum size where there must be 2 matches for each mismatch (default 15)
   -mask=type  Mask out repeats.  Alignments won't be started in masked region
               but may extend through it in nucleotide searches.  Masked areas
               are ignored entirely in protein or translated searches. Types are
                 lower - mask out lower cased sequence
                 upper - mask out upper cased sequence
                 out   - mask according to database.out RepeatMasker .out file
                 file.out - mask database according to RepeatMasker file.out
   -makeOoc=N.ooc Make overused tile file. Database needs to be complete genome.
   -repMatch=N sets the number of repetitions of a tile allowed before
               it is marked as overused.  Typically this is 256 for tileSize
               12, 1024 for tile size 11, 4096 for tile size 10.
               Default is 1024.  Only comes into play with makeOoc
   -noSimpRepMask Suppresses simple repeat masking.
   -flipReverse Reverse complement reverse (second) primer before using
   -out=XXX - Output format.  Either
      fa - fasta with position, primers in header (default)
      bed - tab delimited format. Fields: chrom/start/end/name/score/strand
      psl - blat format.
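
The -flipReverse option above reverse complements the second primer before use. For readers who want to see what that operation does to a primer, here is a minimal sketch. The function name reverse_complement is chosen for this example, and leaving characters other than A, C, G and T unchanged is an assumption, not a statement about how isPcr treats them.

```python
# Translation table for the four DNA bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(primer: str) -> str:
    """Complement each base, then reverse the sequence."""
    # Characters outside ACGT/acgt (for example N) pass through unchanged.
    return primer.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGCT"))
```

Applying the function twice returns the original primer, which is a quick sanity check when preparing query files.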