================================================================ To download all of the files from one of these admin/exe/ directories, for example admin/exe/linux.x86_64/, use the rsync command to copy them to your current directory: rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./ ================================================================ ======== addCols ==================================== ================================================================ ### kent source version 491 ### addCols - Sum columns in a text file. usage: addCols file adds all columns in the given file, outputs the sum of each column. The file can be the name: stdin to accept input from stdin. Options: -maxCols=N - maximum number of columns (defaults to 16) ================================================================ ======== ameme ==================================== ================================================================ ameme - find common patterns in DNA usage: ameme good=goodIn.fa [bad=badIn.fa] [numMotifs=2] [background=m1] [maxOcc=2] [motifOutput=fileName] [html=output.html] [gif=output.gif] [rcToo=on] [controlRun=on] [startScanLimit=20] [outputLogo] [constrainer=1] where goodIn.fa is a multi-sequence fa file containing instances of the motif you want to find, badIn.fa is a file containing similar sequences but lacking the motif, numMotifs is the number of motifs to scan for, background is m0, m1, or m2 for various levels of Markov models, maxOcc is the maximum occurrences of the motif you expect to find in a single sequence and motifOutput is the name of a file to store just the motifs in. rcToo=on searches both strands. If you include controlRun=on in the command line, a random set of sequences will be generated that match your foreground data set in size, and your background data set in nucleotide probabilities. The program will then look for motifs in this random set. If the scores you get in a real run are about the same as those you get in a control run, then the motifs Improbizer has found are probably not significant. ================================================================ ======== autoDtd ==================================== ================================================================ ### kent source version 491 ### autoDtd - Give this an XML document to look at and it will come up with a DTD to describe it. usage: autoDtd in.xml out.dtd out.stats options: -tree=out.tree - Output tag tree. -atree=out.atree - Output attributed tag tree. ================================================================ ======== autoSql ==================================== ================================================================ ### kent source version 491 ### autoSql - create SQL and C code for permanently storing a structure in database and loading it back into memory based on a specification file usage: autoSql specFile outRoot {optional: -dbLink -withNull -json} This will create outRoot.sql outRoot.c and outRoot.h based on the contents of specFile. options: -dbLink - optionally generates code to execute queries and updates of the table. -addBin - Add an initial bin field and index it as (chrom,bin) -withNull - optionally generates code and .sql to enable applications to accept and load data into objects with potential 'missing data' (NULL in SQL) situations. -defaultZeros - will put zero and/or empty string as default value -django - generate method to output object as django model Python code -json - generate method to output the object in JSON (JavaScript) format.
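A minimal sketch of a typical autoSql run (the spec and file names below are hypothetical examples, not part of this distribution). A specification file such as addressBook.as might contain:
table addressBook
"A simple address book entry"
    (
    string name;  "Person's name"
    string email; "Email address"
    uint age;     "Age in years"
    )
Running: autoSql addressBook.as addressBook should then produce addressBook.sql, addressBook.c and addressBook.h as described above.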
================================================================ ======== autoXml ==================================== ================================================================ autoXml - Generate structures code and parser for XML file from DTD-like spec usage: autoXml file.dtdx root This will generate root.c, root.h options: -textField=xxx what to name text between start/end tags. Default 'text' -comment=xxx Comment to appear at top of generated code files -picky Generate parser that rejects stuff it doesn't understand -main Put in a main routine that's a test harness -prefix=xxx Prefix to add to structure names. By default same as root -positive Don't write out optional attributes with negative values ================================================================ ======== ave ==================================== ================================================================ ave - Compute average and basic stats usage: ave file options: -col=N Which column to use. Default 1 -tableOut - output by columns (default output in rows) -noQuartiles - only calculate min,max,mean,standard deviation - for large data sets that will not fit in memory. ================================================================ ======== aveCols ==================================== ================================================================ aveCols - average together columns usage: aveCols file adds all columns (up to 16 columns) in the given file, outputs the average (sum/#ofRows) of each column. The file can be the name: stdin to accept input from stdin. ================================================================ ======== bamToPsl ==================================== ================================================================ ### kent source version 491 ### bamToPsl - Convert a bam file to a psl and optionally also a fasta file that contains the reads. usage: bamToPsl [options] in.bam out.psl options: -fasta=output.fa - output query sequences to specified file -chromAlias=file - specify a two-column file: 1: alias, 2: other name for target name translation from column 1 name to column 2 name names not found are passed through intact -nohead - do not output the PSL header, default has header output -dots=N - output progress dot(.) every N alignments processed Note: a chromAlias file can be obtained from a UCSC database, e.g.: hgsql -N -e 'select alias,chrom from chromAlias;' hg38 > hg38.chromAlias.tab Or from the downloads server: wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromAlias.txt.gz See also our tool chromToUcsc ================================================================ ======== barChartMaxLimit ==================================== ================================================================ ================================================================ ======== bedClip ==================================== ================================================================ ### kent source version 491 ### bedClip - Remove lines from bed file that refer to off-chromosome locations. usage: bedClip [options] input.bed chrom.sizes output.bed chrom.sizes is a two-column file/URL: <chromName><TAB><size> If the assembly is hosted by UCSC, chrom.sizes can be a URL like http://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chrom.sizes or you may use the script fetchChromSizes to download the chrom.sizes file. If not hosted by UCSC, a chrom.sizes file can be generated by running twoBitInfo on the assembly .2bit file.
options: -truncate - truncate items that span ends of chrom instead of the default of dropping the items -verbose=2 - set to get list of lines clipped and why ================================================================ ======== bedCommonRegions ==================================== ================================================================ ### kent source version 491 ### bedCommonRegions - Create a bed file (just bed3) that contains the regions common to all inputs. Regions are common only if they have exactly the same chromosome, start, and end. Overlap is not enough. Each region must be in each input at most once. Output is stdout. usage: bedCommonRegions file1 file2 file3 ... fileN ================================================================ ======== bedExtendRanges ==================================== ================================================================ ### kent source version 491 ### bedExtendRanges - extend length of entries in bed 6+ data to be at least the given length, taking strand directionality into account. usage: bedExtendRanges database length file(s) options: -host mysql host -user mysql user -password mysql password -tab Separate by tabs rather than space -verbose=N - verbose level for extra information to STDERR example: bedExtendRanges hg18 250 stdin bedExtendRanges -user=genome -host=genome-mysql.soe.ucsc.edu hg18 250 stdin will transform: chr1 500 525 . 100 + chr1 1000 1025 . 100 - to: chr1 500 750 . 100 + chr1 775 1025 . 100 - ================================================================ ======== bedGeneParts ==================================== ================================================================ ### kent source version 491 ### bedGeneParts - Given a bed, spit out promoter, first exon, or all introns. usage: bedGeneParts part in.bed out.bed Where part is either 'exons' or 'firstExon' or 'introns' or 'promoter' or 'firstCodingSplice' or 'secondCodingSplice' options: -proStart=NN - start of promoter relative to txStart, default -100 -proEnd=NN - end of promoter relative to txStart, default 50 ================================================================ ======== bedGraphPack ==================================== ================================================================ ### kent source version 491 ### bedGraphPack v1 - Pack together adjacent records representing same value. usage: bedGraphPack in.bedGraph out.bedGraph The input needs to be sorted by chrom and this is checked. To put in a pipe use stdin and stdout in the command line in place of file names. ================================================================ ======== bedGraphToBigWig ==================================== ================================================================ ### kent source version 491 ### bedGraphToBigWig v 2.10 - Convert a bedGraph file to bigWig format (bbi version: 4). usage: bedGraphToBigWig in.bedGraph chrom.sizes out.bw where in.bedGraph is a four column file in the format: <chrom> <start> <end> <value> and chrom.sizes is a two-column file/URL: <chromName><TAB><size> and out.bw is the output indexed big wig file. If the assembly is hosted by UCSC, chrom.sizes can be a URL like http://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chrom.sizes or you may use the script fetchChromSizes to download the chrom.sizes file. If not hosted by UCSC, a chrom.sizes file can be generated by running twoBitInfo on the assembly .2bit file.
The input bedGraph file must be sorted, use the unix sort command: LC_ALL=C sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph Setting LC_ALL=C makes sort use byte-order (case-sensitive) comparison. options: -blockSize=N - Number of items to bundle in r-tree. Default 256 -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024 -sizesIsBb -- If set, the chrom.sizes file is assumed to be a bigBed file. -unc - If set, do not use compression. ================================================================ ======== bedJoinTabOffset ==================================== ================================================================ bedJoinTabOffset - Add the file offset and line length of the line in a text file whose key matches the BED name to each row of the BED. usage: bedJoinTabOffset inTabFile inBedFile outBedFile Given a bed file and tab file where each have a column with matching values: 1. first get the value of column0, the offset and line length from inTabFile. 2. Then go over the bed file, use the -bedKey (defaults to the name field) field and append its offset and length to the bed file as two separate fields. Write the new bed file to outBed. options: -bedKey=integer 0-based index key of the bed file to use to match up with the tab file. Default is 3 for the name field. ================================================================ ======== bedJoinTabOffset.py ==================================== ================================================================ ================================================================ ======== bedMergeAdjacent ==================================== ================================================================ ### kent source version 491 ### bedMergeAdjacent - merge adjacent blocks in a BED 12 usage: bedMergeAdjacent inBed outBed options: ================================================================ ======== bedPartition ==================================== ================================================================ ### kent source version 491 ### bedPartition - split BED ranges into non-overlapping ranges usage: bedPartition [options] bedFile rangesBed Split ranges in a BED into non-overlapping sets for use in cluster jobs. Output is a BED 4 of the ranges and a generated name. The bedFile may be compressed and no ordering is assumed. options: -verbose=1 - print statistics if >= 1 -minPartitionItems=0 - minimum number of input items in a partition. Partitions with fewer items will be merged into subsequent partitions -partMergeDist=0 - will combine adjacent non-overlapping partitions that are separated by no more than this number of bases. -partMergeSize is an obsolete name for this option. -parallel=n - use this many cores for parallel sorting notes: - The generated name is useful for identifying the partition ================================================================ ======== bedPileUps ==================================== ================================================================ ### kent source version 491 ### bedPileUps - Find (exact) overlaps if any in bed input usage: bedPileUps in.bed Where in.bed is in one of the ascii bed formats.
The in.bed file must be sorted by chromosome,start. To sort a bed file, use the unix sort command: sort -k1,1 -k2,2n unsorted.bed > sorted.bed Options: -name - include BED name field 4 when evaluating uniqueness -tab - use tabs to parse fields -verbose=2 - show the location and size of each pileUp ================================================================ ======== bedRemoveOverlap ==================================== ================================================================ ### kent source version 491 ### bedRemoveOverlap - Remove overlapping records from a (sorted) bed file. Gets rid of the smaller of overlapping records. usage: bedRemoveOverlap in.bed out.bed options: -xxx=XXX ================================================================ ======== bedRestrictToPositions ==================================== ================================================================ ### kent source version 491 ### bedRestrictToPositions - Filter bed file, restricting to only ones that match chrom/start/ends specified in restrict.bed file. usage: bedRestrictToPositions in.bed restrict.bed out.bed options: -xxx=XXX ================================================================ ======== bedSort ==================================== ================================================================ bedSort - Sort a .bed file by chrom,chromStart usage: bedSort in.bed out.bed in.bed and out.bed may be the same. ================================================================ ======== bedToBigBed ==================================== ================================================================ ### kent source version 491 ### bedToBigBed v. 2.10 - Convert bed file to bigBed. (bbi version: 4) usage: bedToBigBed in.bed chrom.sizes out.bb Where in.bed is in one of the ascii bed formats, but not including track lines and chrom.sizes is a two-column file/URL: <chromName><TAB><size> and out.bb is the output indexed big bed file. If the assembly is hosted by UCSC, chrom.sizes can be a URL like http://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chrom.sizes or you may use the script fetchChromSizes to download the chrom.sizes file. If you have bed annotations on patch sequences from NCBI, a more inclusive chrom.sizes file can be found using a URL like http://hgdownload.soe.ucsc.edu/goldenPath/<db>/database/chromInfo.txt.gz If not hosted by UCSC, a chrom.sizes file can be generated by running twoBitInfo on the assembly .2bit file, or the 2bit file can be used directly if the -sizesIs2Bit option is specified. The chrom.sizes file may also be a chromAlias bigBed file, or a URL to such a file, by specifying the -sizesIsChromAliasBb option. When using a chromAlias bigBed file, the input BED file may have chromosome names matching any of the sequence name aliases in the chromAlias file. For UCSC provided genomes, the chromAlias files can be found under: https://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chromAlias.bb For UCSC GenArk assembly hubs, the chrom aliases are named in the form: https://hgdownload.soe.ucsc.edu/hubs/GCF/006/542/625/GCF_006542625.1/GCF_006542625.1.chromAlias.bb For a description of generating chromAlias files for your own assembly hub, see: http://genomewiki.ucsc.edu/index.php/Chrom_Alias Without the -sort option, the in.bed file must be sorted by the chromosome and start fields.
To sort a BED file, you can use bedSort or the following Unix command: sort -k1,1 -k2,2n unsorted.bed > sorted.bed Sequences must be sorted by name so all sequences with the same name are collected together, but they don't need to be in any particular order. options: -type=bedN[+[P]] : N is between 3 and 15, optional (+) if extra "bedPlus" fields, optional P specifies the number of extra fields. Not required, but preferred. Examples: -type=bed6 or -type=bed6+ or -type=bed6+3 (see http://genome.ucsc.edu/FAQ/FAQformat.html#format1) -as=fields.as - If you have non-standard "bedPlus" fields, it's great to put a definition of each field in a row in AutoSql format here. -blockSize=N - Number of items to bundle in r-tree. Default 256 -itemsPerSlot=N - Number of data points bundled at lowest level. Default 512 -unc - If set, do not use compression. -tab - If set, expect fields to be tab separated, normally expects white space separator. -extraIndex=fieldList - If set, make an index on each field in a comma separated list extraIndex=name and extraIndex=name,id are commonly used. -sizesIs2Bit -- If set, the chrom.sizes file is assumed to be a 2bit file. -sizesIsChromAliasBb -- If set, then chrom.sizes file is assumed to be a chromAlias bigBed file or a URL to a such a file (see above). -sizesIsBb -- Obsolete name for -sizesIsChromAliasBb. -udcDir=/path/to/udcCacheDir -- sets the UDC cache dir for caching of remote files. -allow1bpOverlap -- allow exons to overlap by at most one base pair -fixScores -- change non-integer scores to 0 and force integer scores into the range 0..1000 -maxAlloc=N -- Set the maximum memory allocation size to N bytes -sort -- sort the input file ================================================================ ======== bedToPsl ==================================== ================================================================ ### kent source version 491 ### bedToPsl - convert bed format files to psl format usage: bedToPsl [options] chromSizes bedFile pslFile Convert a BED file to a PSL file. This the result is an alignment. It is intended to allow processing by tools that operate on PSL. If the BED has at least 12 columns, then a PSL with blocks is created. Otherwise single-exon PSLs are created. Options: -tabs - use tab as a separator -keepQuery - instead of creating a fake query, create PSL with identical query and target specs. Useful if bed features are to be lifted with pslMap and one wants to keep the source location in the lift result. ================================================================ ======== bedWeedOverlapping ==================================== ================================================================ ### kent source version 491 ### bedWeedOverlapping - Filter out beds that overlap a 'weed.bed' file. usage: bedWeedOverlapping weeds.bed input.bed output.bed options: -maxOverlap=0.N - maximum overlapping ratio, default 0 (any overlap) -invert - keep the overlapping and get rid of everything else ================================================================ ======== bigBedInfo ==================================== ================================================================ ### kent source version 491 ### bigBedInfo - Show information about a bigBed file. 
usage: bigBedInfo file.bb options: -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs -chroms - list all chromosomes and their sizes -zooms - list all zoom levels and their sizes -as - get autoSql spec -asOut - output only autoSql spec -extraIndex - list all the extra indexes ================================================================ ======== bigBedNamedItems ==================================== ================================================================ ### kent source version 491 ### bigBedNamedItems - Extract item of given name from bigBed usage: bigBedNamedItems file.bb name output.bed options: -nameFile - if set, treat name parameter as file full of space delimited names -field=fieldName - use index on field name, default is "name" -header - output a autoSql-style header (starts with '#'). ================================================================ ======== bigBedSummary ==================================== ================================================================ ### kent source version 491 ### bigBedSummary - Extract summary information from a bigBed file. usage: bigBedSummary file.bb chrom start end dataPoints Get summary data from bigBed for indicated region, broken into dataPoints equal parts. (Use dataPoints=1 for simple summary.) options: -type=X where X is one of: coverage - % of region that is covered (default) mean - average depth of covered regions min - minimum depth of covered regions max - maximum depth of covered regions -fields - print out information on fields in file. If fields option is used, the chrom, start, end, dataPoints parameters may be omitted -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== bigBedToBed ==================================== ================================================================ ### kent source version 491 ### bigBedToBed v1 - Convert from bigBed to ascii bed format. usage: bigBedToBed input.bb output.bed options: -chrom=chr1 - if set restrict output to given chromosome -start=N - if set, restrict output to only that over start -end=N - if set, restrict output to only that under end -bed=in.bed - restrict output to all regions in a BED file -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs -header - output a autoSql-style header (starts with '#'). -tsv - output a TSV header (without '#'). 
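A short sketch of how the bigBed query tools above are commonly combined; the file name, item name and region below are hypothetical: bigBedInfo -chroms annotations.bb bigBedNamedItems annotations.bb myGene oneItem.bed bigBedSummary -type=coverage annotations.bb chr21 33031597 33041570 10 bigBedToBed -chrom=chr21 -start=33031597 -end=33041570 annotations.bb region.bed The first command lists chromosomes and sizes, the second pulls a single named item, the third reports coverage over ten windows, and the last extracts all records in the region as plain bed.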
================================================================ ======== bigChainBreaks ==================================== ================================================================ ### kent source version 491 ### bigChainBreaks - output a set of rearrangement breakpoints usage: bigChainBreaks bigChain.bb label breaks.txt options: -xxx=XXX ================================================================ ======== bigChainToChain ==================================== ================================================================ ### kent source version 491 ### bigChainToChain - convert bigChain files back into a chain file usage: bigChainToChain bigChain.bb bigLinks.bb output.chain options: -xxx=XXX ================================================================ ======== bigGenePredToGenePred ==================================== ================================================================ ### kent source version 491 ### bigGenePredToGenePred - convert bigGenePred file to genePred. usage: bigGenePredToGenePred bigGenePred.bb genePred.gp ================================================================ ======== bigGuessDb ==================================== ================================================================ Usage: bigGuessDb [options] inFile - given a bigBed or bigWig file or URL, guess the assembly based on the chrom names and sizes. Must have bigBedInfo and bigWigInfo in PATH. Also requires a bigGuessDb.txt.gz, an alpha version of which can be downloaded at https://hgwdev.gi.ucsc.edu/~max/bigGuessDb/bigGuessDb.txt.gz Example run: $ wget https://hgwdev.gi.ucsc.edu/~max/bigGuessDb/bigGuessDb.txt.gz $ bigGuessDb --best https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1014nnn/GSM1014177/suppl/GSM1014177_mm9_wgEncodeUwDnaseNih3t3NihsMImmortalSigRep2.bigWig mm9 ================================================================ ======== bigHeat ==================================== ================================================================ Usage: bigHeat [options] locationBed locationMatrixFnames chromSizes outDir - create one feature per matrix column: duplicate the BED features and color them by the values in locationMatrix. Creates new bigBed files in outDir and creates a basic trackDb.ra file there. BED file looks like this: chr1 1 1000 myGene 0 + 1 1000 0,0,0 chr2 1 1000 myGene2 0 + 1 1000 0,0,0 locationMatrix looks like this: gene sample1 sample2 sample3 myGene 1 2 3 myGene2 0.1 3 10 myGene2_probe2 0.1 3 10 This will create a composite with three subtracks (sample1, sample2, sample3). Each subtrack will have myGene, colored in intensity with sample3 more intense than sample1 and sample2. Same for myGene2.
It can also add a bigWig with a summary of all these values, one per nucleotide. ================================================================ ======== bigMafToMaf ==================================== ================================================================ ### kent source version 491 ### bigMafToMaf - convert bigMaf to maf file usage: bigMafToMaf bigMaf.bb file.maf options: ================================================================ ======== bigPslToPsl ==================================== ================================================================ ### kent source version 491 ### bigPslToPsl - convert bigPsl file to psl usage: bigPslToPsl bigPsl.bb output.psl options: -collapseStrand if target strand is '+', don't output it ================================================================ ======== bigWigAverageOverBed ==================================== ================================================================ ### kent source version 491 ### bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns. usage: bigWigAverageOverBed in.bw in.bed out.tab The output columns are: name - name field from bed, which should be unique size - size of bed (sum of exon sizes) covered - # bases within exons covered by bigWig sum - sum of values over all bases covered mean0 - average over bases with non-covered bases counting as zeroes mean - average over just covered bases Options: -stats=stats.ra - Output a collection of overall statistics to stat.ra file -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather than the usual sample in the bed item. -minMax - include two additional columns containing the min and max observed in the area. -tsv - include a TSV header for input to other tools. ================================================================ ======== bigWigCat ==================================== ================================================================ ### kent source version 491 ### bigWigCat v 4 - merge non-overlapping bigWig files directly into bigWig format usage: bigWigCat out.bw in1.bw in2.bw ... Where in*.bw is in big wig format and out.bw is the output indexed big wig file. options: -itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024 Note: must use wigToBigWig -fixedSummaries -keepAllChromosomes (perhaps in parallel cluster jobs) to create the input files. Note: By non-overlapping we mean the entire span of each file, from first data point to last data point, must not overlap with that of other files.
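To illustrate the bigWigCat notes above, a sketch of the intended workflow (chromosome and file names are hypothetical): wigToBigWig -fixedSummaries -keepAllChromosomes chr1.wig hg38.chrom.sizes chr1.bw wigToBigWig -fixedSummaries -keepAllChromosomes chr2.wig hg38.chrom.sizes chr2.bw bigWigCat all.bw chr1.bw chr2.bw Each input bigWig spans a non-overlapping region (here, separate chromosomes), which is the condition bigWigCat requires.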
================================================================ ======== bigWigCluster ==================================== ================================================================ ### kent source version 491 ### bigWigCluster - Cluster bigWigs using a hacTree usage: bigWigCluster input.list chrom.sizes output.json output.tab where: input.list is a list of bigWig file names chrom.sizes is a tab-separated chromosome sizes file for the assembly the bigWigs are on output.json is json formatted output suitable for graphing with D3 output.tab is a tab-separated file of items ordered by tree with the fields label - label from -labels option or from file name with no dir or extension pos - number from 0-1 representing position according to tree and distance red - number from 0-255 representing recommended red component of color green - number from 0-255 representing recommended green component of color blue - number from 0-255 representing recommended blue component of color path - file name from input.list including directory and extension options: -labels=fileName - label files from tabSeparated file with fields path - path to bigWig file label - a string with no tabs -precalc=precalc.tab - tab separated file with columns. -threads=N - number of threads to use, default 10 -tmpDir=/tmp/path - place to put temp files, default current dir ================================================================ ======== bigWigCorrelate ==================================== ================================================================ ### kent source version 491 ### bigWigCorrelate - Correlate bigWig files, optionally only on target regions. usage: bigWigCorrelate a.bigWig b.bigWig or bigWigCorrelate listOfFiles options: -restrict=restrict.bigBed - restrict correlation to parts covered by this file -threshold=N.N - clip values to this threshold -rootNames - if set just report the root (minus directory and suffix) of file names when using listOfFiles -ignoreMissing - if set do not correlate where either side is missing data Normally missing data is treated as zeros ================================================================ ======== bigWigInfo ==================================== ================================================================ ### kent source version 491 ### bigWigInfo - Print out information about bigWig file. usage: bigWigInfo file.bw options: -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs -chroms - list all chromosomes and their sizes -zooms - list all zoom levels and their sizes -minMax - list the min and max on a single line ================================================================ ======== bigWigMerge ==================================== ================================================================ ### kent source version 491 ### bigWigMerge v2 - Merge together multiple bigWigs into a single output bedGraph. You'll have to run bedGraphToBigWig to make the output bigWig. The signal values are just added together to merge them usage: bigWigMerge in1.bw in2.bw .. inN.bw out.bedGraph options: -threshold=0.N - don't output values at or below this threshold.
Default is 0.0 -adjust=0.N - add adjustment to each value -clip=NNN.N - values higher than this are clipped to this value -inList - input files are lists of file names of bigWigs -max - merged value is maximum from input files rather than sum ================================================================ ======== bigWigSummary ==================================== ================================================================ ### kent source version 491 ### bigWigSummary - Extract summary information from a bigWig file. usage: bigWigSummary file.bigWig chrom start end dataPoints Get summary data from bigWig for indicated region, broken into dataPoints equal parts. (Use dataPoints=1 for simple summary.) NOTE: start and end coordinates are in BED format (0-based) options: -type=X where X is one of: mean - average value in region (default) min - minimum value in region max - maximum value in region std - standard deviation in region coverage - % of region that is covered -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== bigWigToBedGraph ==================================== ================================================================ ### kent source version 491 ### bigWigToBedGraph - Convert from bigWig to bedGraph format. usage: bigWigToBedGraph in.bigWig out.bedGraph options: -chrom=chr1 - if set restrict output to given chromosome -start=N - if set, restrict output to only that over start -end=N - if set, restrict output to only that under end -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== bigWigToWig ==================================== ================================================================ ### kent source version 491 ### bigWigToWig - Convert bigWig to wig. This will keep more of the same structure of the original wig than bigWigToBedGraph does, but still will break up large stepped sections into smaller ones. usage: bigWigToWig in.bigWig out.wig options: -chrom=chr1 - if set restrict output to given chromosome -start=N - if set, restrict output to only that over start -end=N - if set, restrict output to only that under end -bed=input.bed Extract values for all ranges specified by input.bed. If bed4, will also print the bed name. -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs ================================================================ ======== binFromRange ==================================== ================================================================ ### kent source version 491 ### binFromRange - Translate a 0-based half open start and end into a bin range sql expression. usage: binFromRange start end ================================================================ ======== blat ==================================== ================================================================ ### kent source version 491 ### blat - Standalone BLAT v. 39x1 fast sequence search command line tool usage: blat database query [-ooc=11.ooc] output.psl where: database and query are each either a .fa, .nib or .2bit file, or a list of these files with one file name per line. -ooc=11.ooc tells the program to load over-occurring 11-mers from an external file. This will increase the speed by a factor of 40 in many cases, but is not required. output.psl is the name of the output file.
Subranges of .nib and .2bit files may be specified using the syntax: /path/file.nib:seqid:start-end or /path/file.2bit:seqid:start-end or /path/file.nib:start-end With the second form, a sequence id of file:start-end will be used. options: -t=type Database type. Type is one of: dna - DNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein The default is dna. -q=type Query type. Type is one of: dna - DNA sequence rna - RNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein rnax - DNA sequence translated in three frames to protein The default is dna. -prot Synonymous with -t=prot -q=prot. -ooc=N.ooc Use overused tile file N.ooc. N should correspond to the tileSize. -tileSize=N Sets the size of match that triggers an alignment. Usually between 8 and 12. Default is 11 for DNA and 5 for protein. -stepSize=N Spacing between tiles. Default is tileSize. -oneOff=N If set to 1, this allows one mismatch in tile and still triggers an alignment. Default is 0. -minMatch=N Sets the number of tile matches. Usually set from 2 to 4. Default is 2 for nucleotide, 1 for protein. -minScore=N Sets minimum score. This is the matches minus the mismatches minus some sort of gap penalty. Default is 30. -minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide searches, 25 for protein or translated protein searches. -maxGap=N Sets the size of maximum gap between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevant for minMatch > 1. -noHead Suppresses .psl header (so it's just a tab-separated file). -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome. -repMatch=N Sets the number of repetitions of a tile allowed before it is marked as overused. Typically this is 256 for tileSize 12, 1024 for tile size 11, 4096 for tile size 10. Default is 1024. Typically comes into play only with makeOoc. Also affected by stepSize: when stepSize is halved, repMatch is doubled to compensate. -noSimpRepMask Suppresses simple repeat masking. -mask=type Mask out repeats. Alignments won't be started in masked region but may extend through it in nucleotide searches. Masked areas are ignored entirely in protein or translated searches. Types are: lower - mask out lower-cased sequence upper - mask out upper-cased sequence out - mask according to database.out RepeatMasker .out file file.out - mask database according to RepeatMasker file.out -qMask=type Mask out repeats in query sequence. Similar to -mask above, but for query rather than target sequence. -repeats=type Type is same as mask types above. Repeat bases will not be masked in any way, but matches in repeat areas will be reported separately from matches in other areas in the psl output. -minRepDivergence=NN Minimum percent divergence of repeats to allow them to be unmasked. Default is 15. Only relevant for masking using RepeatMasker .out files. -dots=N Output dot every N sequences to show program's progress. -trimT Trim leading poly-T. -noTrimA Don't trim trailing poly-A. -trimHardA Remove poly-A tail from qSize as well as alignments in psl output. -fastMap Run for fast DNA/DNA remapping - not allowing introns, requiring high %ID. Query sizes must not exceed 5000. -out=type Controls output file format. Type is one of: psl - Default. 
Tab-separated format, no sequence pslx - Tab-separated format with sequence axt - blastz-associated axt format maf - multiz-associated maf format sim4 - similar to sim4 format wublast - similar to wublast format blast - similar to NCBI blast format blast8- NCBI blast tabular format blast9 - NCBI blast tabular format with comments -fine For high-quality mRNAs, look harder for small initial and terminal exons. Not recommended for ESTs. -maxIntron=N Sets maximum intron size. Default is 750000. -extendThroughN Allows extension of alignment through large blocks of Ns. To filter PSL files to the best hits (e.g. minimum ID > 90% or 'only best match'), you can use the commands pslReps, pslCDnaFilter or pslUniq. ================================================================ ======== blatHuge ==================================== ================================================================ ### kent source version 491 ### blat - Standalone BLAT v. 39x1 fast sequence search command line tool usage: blat database query [-ooc=11.ooc] output.psl where: database and query are each either a .fa, .nib or .2bit file, or a list of these files with one file name per line. -ooc=11.ooc tells the program to load over-occurring 11-mers from an external file. This will increase the speed by a factor of 40 in many cases, but is not required. output.psl is the name of the output file. Subranges of .nib and .2bit files may be specified using the syntax: /path/file.nib:seqid:start-end or /path/file.2bit:seqid:start-end or /path/file.nib:start-end With the second form, a sequence id of file:start-end will be used. options: -t=type Database type. Type is one of: dna - DNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein The default is dna. -q=type Query type. Type is one of: dna - DNA sequence rna - RNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein rnax - DNA sequence translated in three frames to protein The default is dna. -prot Synonymous with -t=prot -q=prot. -ooc=N.ooc Use overused tile file N.ooc. N should correspond to the tileSize. -tileSize=N Sets the size of match that triggers an alignment. Usually between 8 and 12. Default is 11 for DNA and 5 for protein. -stepSize=N Spacing between tiles. Default is tileSize. -oneOff=N If set to 1, this allows one mismatch in tile and still triggers an alignment. Default is 0. -minMatch=N Sets the number of tile matches. Usually set from 2 to 4. Default is 2 for nucleotide, 1 for protein. -minScore=N Sets minimum score. This is the matches minus the mismatches minus some sort of gap penalty. Default is 30. -minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide searches, 25 for protein or translated protein searches. -maxGap=N Sets the size of maximum gap between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevant for minMatch > 1. -noHead Suppresses .psl header (so it's just a tab-separated file). -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome. -repMatch=N Sets the number of repetitions of a tile allowed before it is marked as overused. Typically this is 256 for tileSize 12, 1024 for tile size 11, 4096 for tile size 10. Default is 1024. Typically comes into play only with makeOoc. Also affected by stepSize: when stepSize is halved, repMatch is doubled to compensate. -noSimpRepMask Suppresses simple repeat masking. -mask=type Mask out repeats. 
Alignments won't be started in masked region but may extend through it in nucleotide searches. Masked areas are ignored entirely in protein or translated searches. Types are: lower - mask out lower-cased sequence upper - mask out upper-cased sequence out - mask according to database.out RepeatMasker .out file file.out - mask database according to RepeatMasker file.out -qMask=type Mask out repeats in query sequence. Similar to -mask above, but for query rather than target sequence. -repeats=type Type is same as mask types above. Repeat bases will not be masked in any way, but matches in repeat areas will be reported separately from matches in other areas in the psl output. -minRepDivergence=NN Minimum percent divergence of repeats to allow them to be unmasked. Default is 15. Only relevant for masking using RepeatMasker .out files. -dots=N Output dot every N sequences to show program's progress. -trimT Trim leading poly-T. -noTrimA Don't trim trailing poly-A. -trimHardA Remove poly-A tail from qSize as well as alignments in psl output. -fastMap Run for fast DNA/DNA remapping - not allowing introns, requiring high %ID. Query sizes must not exceed 5000. -out=type Controls output file format. Type is one of: psl - Default. Tab-separated format, no sequence pslx - Tab-separated format with sequence axt - blastz-associated axt format maf - multiz-associated maf format sim4 - similar to sim4 format wublast - similar to wublast format blast - similar to NCBI blast format blast8- NCBI blast tabular format blast9 - NCBI blast tabular format with comments -fine For high-quality mRNAs, look harder for small initial and terminal exons. Not recommended for ESTs. -maxIntron=N Sets maximum intron size. Default is 750000. -extendThroughN Allows extension of alignment through large blocks of Ns. To filter PSL files to the best hits (e.g. minimum ID > 90% or 'only best match'), you can use the commands pslReps, pslCDnaFilter or pslUniq. ================================================================ ======== calc ==================================== ================================================================ ### kent source version 491 ### calc - Little command line calculator usage: calc this + that * theOther / (a + b) Options: -h - output result as a human-readable integer numbers, with k/m/g/t suffix ================================================================ ======== catDir ==================================== ================================================================ catDir - concatenate files in directory to stdout. For those times when too many files for cat to handle. usage: catDir dir(s) options: -r Recurse into subdirectories -suffix=.suf This will restrict things to files ending in .suf '-wild=*.???' This will match wildcards. 
-nonz Prints file name of non-zero length files ================================================================ ======== catUncomment ==================================== ================================================================ catUncomment - Concatenate input removing lines that start with '#' Output goes to stdout usage: catUncomment file(s) ================================================================ ======== chainToBigChain ==================================== ================================================================ ### kent source version 491 ### chainToBigChain - converts chain to bigChain input (bed format with extra fields) usage: chainToBigChain chainIn bigChainOut bigLinkOut Output will be sorted To build bigBed files: bedToBigBed -type=bed6+6 -as=bigChain.as -tab data.bigChain hg38.chrom.sizes data.bb bedToBigBed -type=bed4+1 -as=bigLink.as -tab data.bigLink hg38.chrom.sizes data.link.bb ================================================================ ======== chopFaLines ==================================== ================================================================ chopFaLines - Read in FA file with long lines and rewrite it with shorter lines usage: chopFaLines in.fa out.fa ================================================================ ======== chromGraphFromBin ==================================== ================================================================ ./chromGraphFromBin: /lib64/libcurl.so.4: no version information available (required by ./chromGraphFromBin) ### kent source version 491 ### chromGraphFromBin - Convert chromGraph binary to ascii format. usage: chromGraphFromBin in.chromGraph out.tab options: -chrom=chrX - restrict output to single chromosome ================================================================ ======== chromGraphToBin ==================================== ================================================================ ./chromGraphToBin: /lib64/libcurl.so.4: no version information available (required by ./chromGraphToBin) ### kent source version 491 ### chromGraphToBin - Make binary version of chromGraph. usage: chromGraphToBin in.tab out.chromGraph options: -xxx=XXX ================================================================ ======== chromToUcsc ==================================== ================================================================ Usage: chromToUcsc [options] filename - change NCBI or Ensembl chromosome names to UCSC names in tabular or wiggle files, using a chromAlias table. Supports these UCSC file formats: BED, genePred, PSL, wiggle (all formats), bedGraph, VCF, SAM, GTF, Chain ... or any other csv or tsv format where the sequence (chromosome) name is a separate field. Requires a .chromAlias.tsv file which can be downloaded like this: chromToUcsc --get hg19 # download the file hg19.chromAlias.tsv into current directory Which also works for GenArk assemblies and can take an output directory: chromToUcsc --get GCF_000001735.3 -o /tmp/ # for GenArk assemblies, will translate to NCBI sequence names (accessions) If you do not want to use the --get option to retrieve the mapping tables, you can also download the alias mapping files yourself, e.g. 
for mm10 with 'wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/chromAlias.txt.gz' Then the script can be run like this: chromToUcsc -i in.bed -o out.bed -a hg19.chromAlias.tsv chromToUcsc -i in.bed -o out.bed -a https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/chromAlias.txt.gz Or in pipes, like this: cat test.bed | chromToUcsc -a mm10.chromAlias.tsv > test.ucsc.bed For BAM files use this program in a pipe with samtools: samtools view -h in.bam | ./chromToUcsc -a mm10.chromAlias.tsv | samtools -bS > out.bam By default, this script expects the chromosome name in the first field. The default works for BED, bedGraph, GTF, wiggle, VCF. For the following file formats, you will need to set the -k option to these values manually: genePred: 2 -- PSL: 10 (query) or 14 (target) -- chain: 2 (target) or 7 (query) -- SAM: 2 (If a line starts with @ (SAM format), -k is automatically set to 2.) Options: -h, --help show this help message and exit --get=DOWNLOADDB download a chrom alias table from UCSC for the genomeDb into the current directory or directory provided by -o and exit -a ALIASFNAME, --chromAlias=ALIASFNAME a UCSC chromAlias file in tab-sep format or the http/https URL to one -i INFNAME, --in=INFNAME input filename, default: /dev/stdin -o OUTFNAME, --out=OUTFNAME output filename, default: /dev/stdout -d, --debug show debug messages -s, --skipUnknown skip unknown sequence rather than generate an error. -k FIELDNO, --field=FIELDNO Index of field (1-based) that contains the chromosome name. No other field is touched by this program, unless the SAM format is detected. Default is 1 (first field). ================================================================ ======== clusterMatrixToBarChartBed ==================================== ================================================================ ### kent source version 491 ### clusterMatrixToBarChartBed - Compute a barchart bed file from a gene matrix and a gene bed file and a way to cluster samples. NOTE: consider using matrixClusterColumns and matrixToBarChartBed instead usage: clusterMatrixToBarChartBed sampleClusters.tsv geneMatrix.tsv geneset.bed output.bed where: sampleClusters.tsv is a two column tab separated file with sampleId and clusterId geneMatrix.tsv has a row for each gene. The first row uses the same sampleId as above geneset.bed maps the genes in the matrix (from its first column) to the genome geneset.bed needs 6 standard bed fields. Unless -name2 is set it also needs a name2 field as the last field output.bed is the resulting bar chart, with one column per cluster options: -simple - don't store the position of gene in geneMatrix.tsv file in output -median - use median (instead of mean) -name2=twoColFile.tsv - get name2 from file where first col is same as geneset.bed's name ================================================================ ======== colTransform ==================================== ================================================================ colTransform - Add and/or multiply column by constant. usage: colTransform column input.tab addFactor mulFactor output.tab where: column is the column to transform, starting with 1 input.tab is the tab delimited input file addFactor is what to add. Use 0 here not to change anything mulFactor is what to multiply by.
Use 1 here not to change anything output.tab is the tab delimited output file ================================================================ ======== countChars ==================================== ================================================================ countChars - Count the number of occurrences of a particular char usage: countChars char file(s) Char can either be a two digit hexadecimal value or a single letter literal character ================================================================ ======== cpg_lh ==================================== ================================================================ cpg_lh - calculate CpG Island data for cpgIslandExt tracks usage: cpg_lh sequence.fa where sequence.fa is fasta sequence, must be more than 200 bases of legitimate sequence, not all N's To process the output into the UCSC bed file format: cpglh fastaInput.fa \ | awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \ | sort -k1,1 -k2,2n > output.bed The original cpg.c was written by Gos Miklem from the Sanger Center. LaDeana Hillier added some modifications --> cpg_lh.c, and UCSC has added some further modifications to cpg_lh.c, so that its expected number of CpGs in an island is calculated as described in Gardiner-Garden, M. and M. Frommer, 1987 CpG islands in vertebrate genomes. J. Mol. Biol. 196:261-282 Expected = (Number of C's * Number of G's) / Length Instead of a sliding-window search for CpG islands, this cpg program uses a running-sum score where a 'C' followed by a 'G' increases the score by 17 and anything else decreases the score by 1. When the score transitions from positive to 0 (and at the end of the sequence), the sequence in the current span is evaluated to see if it qualifies as a CpG island (>200 bp length, >50% GC, >0.6 ratio of observed CpG to expected). Then the search recurses on the span from the position with the max running score up to the current position. ================================================================ ======== crTreeIndexBed ==================================== ================================================================ ### kent source version 491 ### crTreeIndexBed - Create an index for a bed file. usage: crTreeIndexBed in.bed out.cr options: -blockSize=N - number of children per node in index tree. Default 1024 -itemsPerSlot=N - number of items per index slot. Default is half block size -noCheckSort - Don't check sorting order of in.tab ================================================================ ======== crTreeSearchBed ==================================== ================================================================ ### kent source version 491 ### crTreeSearchBed - Search a crTree indexed bed file and print all items that overlap query.
usage: crTreeSearchBed file.bed index.cr chrom start end ================================================================ ======== dbDbToHubTxt ==================================== ================================================================ ./dbDbToHubTxt: /lib64/libcurl.so.4: no version information available (required by ./dbDbToHubTxt) ### kent source version 491 ### dbDbToHubTxt - Reformat dbDb line to hub and genome stanzas for assembly hubs usage: dbDbToHubTxt database email groups hubAndGenome.txt options: -xxx=XXX ================================================================ ======== endsInLf ==================================== ================================================================ endsInLf - Check that last letter in files is end of line usage: endsInLf file(s) options: -zeroOk ================================================================ ======== expMatrixToBarchartBed ==================================== ================================================================ usage: expMatrixToBarchartBed [-h] [--autoSql AUTOSQL] [--groupOrderFile GROUPORDERFILE] [--useMean] [--verbose] sampleFile matrixFile bedFile outputFile Generate a barChart bed6+5 file from a matrix, meta data, and coordinates. positional arguments: sampleFile Two column no header, the first column is the samples which should match the matrix, the second is the grouping (cell type, tissue, etc) matrixFile The input matrix file. The samples in the first row should exactly match the ones in the sampleFile. The labels (ex ENST*****) in the first column should exactly match the ones in the bed file. bedFile Bed6+1 format. File that maps the column labels from the matrix to coordinates. Tab separated; chr, start coord, end coord, label, score, strand, gene name. The score column is ignored. outputFile The output file, bed 6+5 format. See the schema in kent/src/hg/lib/barChartBed.as. optional arguments: -h, --help show this help message and exit --autoSql AUTOSQL Optional autoSql description of extra fields in the input bed. --groupOrderFile GROUPORDERFILE Optional file to define the group order, list the groups in a single column in the order desired. The default ordering is alphabetical. --useMean Calculate the group values using mean rather than median. --verbose Show runtime messages. ================================================================ ======== faAlign ==================================== ================================================================ ### kent source version 491 ### faAlign - Align two fasta files usage: faAlign target.fa query.fa output.axt options: -dna - use DNA scoring scheme ================================================================ ======== faCmp ==================================== ================================================================ ### kent source version 491 ### faCmp - Compare two .fa files usage: faCmp [options] a.fa b.fa options: -softMask - use the soft masking information during the compare Differences will be noted if the masking is different. -sortName - sort input files by name before comparing -peptide - read as peptide sequences default: no masking information is used during compare. It is as if both sequences were not masked. 
Exit codes: - 0 if files are the same - 1 if files differ - 255 on an error ================================================================ ======== faCount ==================================== ================================================================ ### kent source version 491 ### faCount - count base statistics and CpGs in FA files. usage: faCount file(s).fa -summary show only summary statistics -dinuc include statistics on dinucleotide frequencies -strands count bases on both strands ================================================================ ======== faFilter ==================================== ================================================================ ### kent source version 491 ### faFilter - Filter fa records, selecting ones that match the specified conditions usage: faFilter [options] in.fa out.fa Options: -name=wildCard - Only pass records where name matches wildcard * matches any string or no character. ? matches any single character. anything else must match the character exactly (these will need to be quoted for the shell) -namePatList=filename - A list of regular expressions, one per line, that will be applied to the fasta name the same as -name -v - invert match, select non-matching records. -minSize=N - Only pass sequences at least this big. -maxSize=N - Only pass sequences this size or smaller. -maxN=N Only pass sequences with fewer than this number of N's -uniq - Removes duplicate sequence ids, keeping the first. -i - make -uniq ignore case so sequence IDs ABC and abc count as dupes. All specified conditions must pass to pass a sequence. If no conditions are specified, all records will be passed. ================================================================ ======== faFilterN ==================================== ================================================================ faFilterN - Get rid of sequences with too many N's usage: faFilterN in.fa out.fa maxPercentN options: -out=in.fa.out -uniq=self.psl ================================================================ ======== faFrag ==================================== ================================================================ faFrag - Extract a piece of DNA from a .fa file.
usage: faFrag in.fa start end out.fa options: -mixed - preserve mixed-case in FASTA file ================================================================ ======== faNoise ==================================== ================================================================ faNoise - Add noise to .fa file usage: faNoise inName outName transitionPpt transversionPpt insertPpt deletePpt chimeraPpt options: -upper - output in upper case ================================================================ ======== faOneRecord ==================================== ================================================================ faOneRecord - Extract a single record from a .FA file usage: faOneRecord in.fa recordName ================================================================ ======== faPolyASizes ==================================== ================================================================ ### kent source version 491 ### faPolyASizes - get poly A sizes usage: faPolyASizes in.fa out.tab output file has four columns: id seqSize tailPolyASize headPolyTSize options: ================================================================ ======== faRandomize ==================================== ================================================================ ### kent source version 491 ### faRandomize - Program to create random fasta records usage: faRandomize [-seed=N] in.fa randomized.fa Use optional -seed argument to specify seed (integer) for random number generator (rand). Generated sequence has the same base frequency as seen in original fasta records. ================================================================ ======== faRc ==================================== ================================================================ faRc - Reverse complement a FA file usage: faRc in.fa out.fa In.fa and out.fa may be the same file. options: -keepName - keep name identical (don't prepend RC) -keepCase - works well for ACGTUN in either case. bizarre for other letters. without it bases are turned to lower, all else to n's -justReverse - prepends R unless asked to keep name -justComplement - prepends C unless asked to keep name (cannot appear together with -justReverse) ================================================================ ======== faSize ==================================== ================================================================ ### kent source version 491 ### faSize - print total base count in fa files. usage: faSize file(s).fa Command flags -detailed outputs name and size of each record has the side effect of printing nothing else -tab output statistics in a tab separated format -veryDetailed outputs name, size, #Ns, #real, #upper, #lower of each record ================================================================ ======== faSomeRecords ==================================== ================================================================ ### kent source version 491 ### faSomeRecords - Extract multiple fa records usage: faSomeRecords in.fa listFile out.fa options: -exclude - output sequences not in the list file. ================================================================ ======== faSplit ==================================== ================================================================ ### kent source version 491 ### faSplit - Split an fa file into several files. usage: faSplit how input.fa count outRoot where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'. Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base. Files broken by size will be broken every count bases. Examples: faSplit sequence estAll.fa 100 est This will break up estAll.fa into 100 files (numbered est001.fa est002.fa, ... est100.fa) Files will only be broken at fa record boundaries faSplit base chr1.fa 10 1_ This will break up chr1.fa into 10 files faSplit size input.fa 2000 outRoot This breaks up input.fa into 2000 base chunks faSplit about est.fa 20000 outRoot This will break up est.fa into files of about 20000 bytes each by record. faSplit byname scaffolds.fa outRoot/ This breaks up scaffolds.fa using sequence names as file names. Use the terminating / on the outRoot to get it to work correctly. faSplit gap chrN.fa 20000 outRoot This breaks up chrN.fa into files of at most 20000 bases each, at gap boundaries if possible. If the sequence ends in N's, the last piece, if larger than 20000, will be all one piece. Options: -verbose=2 - Write names of each file created (=3 more details) -maxN=N - Suppress pieces with more than maxN n's. Only used with size. default is size-1 (only suppresses pieces that are all N). -oneFile - Put output in one file. Only used with size -extra=N - Add N extra bytes at the end to form overlapping pieces. Only used with size. -out=outFile Get masking from outfile. Only used with size. -lift=file.lft Put info on how to reconstruct sequence from pieces in file.lft. Only used with size and gap. -minGapSize=X Consider a block of Ns to be a gap if block size >= X. Default value 1000. Only used with gap. -noGapDrops - include all N's when splitting by gap. -outDirDepth=N Create N levels of output directory under current dir. This helps prevent NFS problems with a large number of files in a directory. Using -outDirDepth=3 would produce ./1/2/3/outRoot123.fa. -prefixLength=N - used with byname option. Create a separate output file for each group of sequence names with the same prefix of length N. ================================================================ ======== faToFastq ==================================== ================================================================ ### kent source version 491 ### faToFastq - Convert fa to fastq format, just faking quality values. usage: faToFastq in.fa out.fastq options: -qual=X quality letter to use. Default is '<' which is good I think.... ================================================================ ======== faToTab ==================================== ================================================================ faToTab - convert fa file to tab separated file usage: faToTab infileName outFileName options: -type=seqType sequence type, dna or protein, default is dna -keepAccSuffix - don't strip dot version off of sequence id, keep as is ================================================================ ======== faToTwoBit ==================================== ================================================================ ### kent source version 491 ### faToTwoBit - Convert DNA from fasta to 2bit format usage: faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit options: -long use 64-bit offsets for index. Allow for twoBit to contain more than 4Gb of sequence. NOT COMPATIBLE WITH OLDER CODE. -noMask Ignore lower-case masking in fa file. -stripVersion Strip off version number after '.' for GenBank accessions. -ignoreDups Convert first sequence only if there are duplicate sequence names. Use 'twoBitDup' to find duplicate sequences. -namePrefix=XX. add XX. to start of sequence name in 2bit.
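As a quick illustration of faSplit and faToTwoBit together (a minimal sketch; myAssembly.fa, myAssembly.2bit and the chroms/ output directory are hypothetical names, and chroms/ is assumed to already exist):
  faToTwoBit myAssembly.fa myAssembly.2bit
  faSplit byname myAssembly.fa chroms/
The first command packs the FASTA into a single 2bit file; the second writes one FASTA file per sequence into chroms/, using the sequence names as file names (note the terminating / on the outRoot, as the byname mode described above requires).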
================================================================ ======== faToVcf ==================================== ================================================================ ### kent source version 491 ### faToVcf - Convert a FASTA alignment file to Variant Call Format (VCF) single-nucleotide diffs usage: faToVcf in.fa out.vcf options: -ambiguousToN Treat all IUPAC ambiguous bases (N, R, V etc) as N (no call). -excludeFile=file Exclude sequences named in file which has one sequence name per line -includeNoAltN Include base positions with no alternate alleles observed, but at least one N (missing base / no-call) -includeRef Include the reference in the genotype columns (default: omitted as redundant) -maskSites=file Exclude variants in positions recommended for masking in file (typically https://github.com/W-L/ProblematicSites_SARS-CoV2/raw/master/problematic_sites_sarsCov2.vcf) -maxDiff=N Exclude sequences with more than N mismatches with the reference (if -windowSize is used, sequences are masked accordingly first) -minAc=N Ignore alternate alleles observed fewer than N times -minAf=F Ignore alternate alleles observed in less than F of non-N bases -minAmbigInWindow=N When -windowSize is provided, mask any base for which there are at least this many N, ambiguous or gap characters within the window. (default: 2) -noGenotypes Output 8-column VCF, without the sample genotype columns -ref=seqName Use seqName as the reference sequence; must be present in faFile (default: first sequence in faFile) -resolveAmbiguous For IUPAC ambiguous characters like R (A or G), if the character represents two bases and one is the reference base, convert it to the non-reference base. Otherwise convert it to N. -startOffset=N Add N bases to each position (for trimmed alignments) -vcfChrom=seqName Use seqName for the CHROM column in VCF (default: ref sequence) -windowSize=N Mask any base for which there are at least -minAmbigInWindow bases in a window of +-N bases around the base. Masking approach adapted from https://github.com/roblanf/sarscov2phylo/ file scripts/mask_seq.py Use -windowSize=7 for the same results. in.fa must contain a series of sequences with different names and the same length. Both N and - are treated as missing information. ================================================================ ======== faTrans ==================================== ================================================================ ### kent source version 491 ### faTrans - Translate DNA .fa file to peptide usage: faTrans in.fa out.fa options: -stop stop at first stop codon (otherwise puts in Z for stop codons) -offset=N start at a particular offset. -cdsUpper - cds is in upper case ================================================================ ======== fastqStatsAndSubsample ==================================== ================================================================ ### kent source version 491 ### fastqStatsAndSubsample v2 - Go through a fastq file doing sanity checks and collecting stats and also producing a smaller fastq out of a sample of the data. The fastq input may be compressed with gzip or bzip2. Paired-end samples: run on both files, the seed is fixed so it will choose the paired reads usage: fastqStatsAndSubsample in.fastq out.stats out.fastq options: -sampleSize=N - default 100000 -seed=N - Use given seed for random number generator. Default 0. -smallOk - Not an error if less than sampleSize reads.
out.fastq will be entire in.fastq -json - out.stats will be in json rather than text format Use /dev/null for out.fastq and/or out.stats if not interested in these outputs ================================================================ ======== fastqToFa ==================================== ================================================================ ### kent source version 491 ### # no name checks will be made on lines beginning with @ # ignore quality scores # using default Phred quality score algorithm # all errors will cause exit fastqToFa - Convert from fastq to fasta format. usage: fastqToFa [options] in.fastq out.fa options: -nameVerify='string' - for multi-line fastq files, 'string' must match somewhere in the sequence names in order to correctly identify the next sequence block (e.g.: -nameVerify='Supercontig_') -qual=file.qual.fa - output quality scores to specified file (default: quality scores are ignored) -qualSizes=qual.sizes - write sizes file for the quality scores -noErrors - warn only on problems, do not error out (specify -verbose=3 to see warnings) -solexa - use Solexa/Illumina quality score algorithm (instead of Phred quality) -verbose=2 - set warning level to get some stats output during processing ================================================================ ======== fetchChromSizes ==================================== ================================================================ fetchChromSizes - script to grab chrom.sizes from UCSC via either of: mysql, wget or ftp usage: fetchChromSizes <db> > <db>.chrom.sizes used to fetch chrom.sizes information from UCSC for the given <db> - name of UCSC database, e.g.: hg38, hg18, mm9, etc ... This script expects to find one of the following commands: wget, mysql, or ftp in order to fetch information from UCSC. Route the output to the file <db>.chrom.sizes as indicated above. This data is available at the URL: http://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chrom.sizes Example: fetchChromSizes hg38 > hg38.chrom.sizes ================================================================ ======== findMotif ==================================== ================================================================ ### kent source version 491 ### findMotif - find specified motif in sequence usage: findMotif [options] -motif=<acgt...> sequence where: sequence is a .fa, .nib or .2bit file or a file which is a list of sequence files. options: -motif=<acgt...> - search for this specified motif (case ignored, [acgt] only) NOTE: motif must be at least 4 characters, less than 32 -chr=<chrN> - process only this one chrN from the sequence -strand=<+|-> - limit to only one strand. Default is both. -bedOutput - output bed format (this is the default) -wigOutput - output wiggle data format instead of bed file -misMatch=N - allow N mismatches (0 default == perfect match) -verbose=N - set information level [1-4] -verbose=4 - will display gaps as bed file data lines to stderr
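For example (a minimal sketch; hg38.2bit and ecoRI.bed are hypothetical names, and the bed output, the default, is assumed here to be written to stdout):
  findMotif -motif=gaattc hg38.2bit > ecoRI.bed
This scans both strands of the assembly for the EcoRI site gaattc with no mismatches; the motif satisfies the stated constraints (at least 4 and fewer than 32 characters, [acgt] only), and -misMatch=1 could be added to tolerate a single mismatch.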
================================================================ ======== fixStepToBedGraph.pl ==================================== ================================================================ fixStepToBedGraph.pl - read fixedStep wiggle input data, output four column bedGraph format data usage: fixStepToBedGraph.pl run in a pipeline like this: zcat fixedStepData.gz | fixStepToBedGraph.pl | gzip > bedGraph.gz ================================================================ ======== gapToLift ==================================== ================================================================ ### kent source version 491 ### gapToLift - create lift file from gap table(s) usage: gapToLift [options] db liftFile.lft uses gap table(s) from specified db. Writes to liftFile.lft generates lift file segments separated by non-bridged gaps. options: -chr=chrN - work only on given chrom -minGap=M - examine only gaps >= M -insane - do *not* perform coordinate sanity checks on gaps -bedFile=fileName.bed - output segments to fileName.bed -allowBridged - consider any type of gap not just the non-bridged gaps -verbose=N - N > 1 see more information about procedure ================================================================ ======== gencodeVersionForGenes ==================================== ================================================================ ### kent source version 491 ### gencodeVersionForGenes - Figure out which version of a gencode gene set a set of gene identifiers best fits usage: gencodeVersionForGenes genes.txt geneSymVer.tsv where: genes.txt is a list of gene symbols or identifiers, one per line geneSymVer.tsv is output of gencodeGeneSymVer, usually /hive/data/inside/geneSymVerTx.tsv options: -bed=output.bed - Create bed file for mapping genes to genome via best gencode fit -upperCase - Force genes to be upper case -allBed=outputDir - Output beds for all versions in geneSymVer.tsv -geneToId=geneToId.tsv - Output two column file with the symbol from genes.txt and the gencode gene name as the second column. Symbols with no gene found are omitted -miss=output.txt - unassigned genes are put here, one per line -target=ucscDb - something like hg38 or hg19.
If set, this will use the most recent version of each gene that exists for the assembly in symbol mode ================================================================ ======== genePredFilter ==================================== ================================================================ ### kent source version 491 ### genePredFilter - filter a genePred file usage: genePredFilter [options] genePredIn genePredOut Filter a genePredFile, dropping invalid entries options: -db=db - If specified, then this database is used to get chromosome sizes. -chromSizes=file.chrom.sizes - use chrom sizes from tab separated file (name,size) instead of from chromInfo table in specified db. -verbose=2 - level >= 2 prints out errors for each problem found. ================================================================ ======== genePredToBigGenePred ==================================== ================================================================ ### kent source version 491 ### genePredToBigGenePred - converts genePred or genePredExt to bigGenePred input (bed format with extra fields) usage: genePredToBigGenePred [-known] [-score=scores] [-geneNames=geneNames] [-colors=colors] file.gp stdout | sort -k1,1 -k2,2n > file.bgpInput NOTE: In order to visualize on Genome Browser, the bigGenePred file needs to be converted to a bigBed such as the following: wget https://genome.ucsc.edu/goldenpath/help/examples/bigGenePred.as bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as file.bgpInput chrom.sizes output.bb options: -known input file is a genePred in knownGene format -score=scores scores is a two column file with id's mapping to scores -geneNames=geneNames geneNames is a three column file with id's mapping to two gene names -colors=colors colors is a four column file with id's mapping to r,g,b -cds=cds cds is a five column file with id's mapping to cds status codes and exonFrames (see knownCds.as) -geneType=geneType geneType is a two column file with id's mapping to geneType ================================================================ ======== genePredToProt ==================================== ================================================================ ### kent source version 491 ### genePredToProt - create protein sequences by translating gene annotations usage: genePredToProt genePredFile genomeSeqs protFa This honors frame if genePred has frames, dropping partial codons. genomeSeqs is a 2bit or directory of nib files. options: -cdsFa=fasta - output FASTA with CDS that was used to generate protein. This will not include dropped partial codons. -protIdSuffix=str - add this string to the end of the name for protein FASTA -cdsIdSuffix=str - add this string to the end of the name for CDS FASTA -translateSeleno - assume internal TGA code for selenocysteine and translate to `U'. -includeStop - If the CDS ends with a stop codon, represent it as a `*' -starForInframeStops - use `*' instead of `X' for in-frame stop codons. This will result in selenocysteines being `*', with only codons containing `N' being translated to `X'.
This doesn't include terminal stop ================================================================ ======== gensub2 ==================================== ================================================================ gensub2 - version 12.20 Generate condor submission file from template and two file lists. Usage: gensub2