BBTools changelog and todo list. V38. 38.00 Moved ByteBuilder to Structures. Added some formatting and comments to SuperLongList. JsonObject printing now has an inArray state that prevents newlines from arrays of JsonObjects. Improved JsonParser handling of booleans. Added a JsonParser validate command. Wrote TaxClient for internally doing tax lookups from the TaxServer. Added post mode to TaxClient and TaxServer, for URLs over 2000 characters. Moved StringNum to Structures. Accession loader now sorts files in ascending order of size and can load some before others. Fixed a flaw in the hash function for accession numbers that may have allowed collisions. TaxTree.parseNodeFromHeader will now try harder for headers with certain formatting. Fixed potential overflows by changing Integer.MAX_VALUE to Shared.MAX_ARRAY_LEN. SketchTool now has a custom, low-garbage loader instead of relying on ByteFile. RQCFilter2 now uses half as many threads for pigz as logical cores. Wrote BloomFilter and BloomFilterWrapper. Added BloomFilter support into BBMap and RQCFilter. Wrote a better available memory estimation function for BloomFilter. Accelerated BloomFilter lookup when minConsecutiveMatches>1. Fixed logging of BBSplit vs BBMap in RQCFilter2. Bloom filter creation from BBMap index now uses multiple threads per chunk. Fixed a null pointer in TextStringWriter. Fixed a static variable (ef) persisting in RQCFilter, which slowed human removal. 38.01 Added support for lowercase letters in accessions. gi2ncbi now supports streaming and some other options like shrinknames in server mode. Sketch can now return json format from a curl call. Sketch server no longer crashes from invalid symbols in sequence in local mode. SketchMaker now has a local cache of SketchHeaps per thread in per-taxa mode, allowing a 6x speedup by reducing synchronization and rework. RefSeq now uses a 250-species blacklist limit with sizemult=2 instead of 300. 
Wrote MergeSorted and mergesorted.sh to resume SortByName runs that crashed or were killed during merging. Removed DumpCount from SortByName and CrisContainer. It was too confusing. To shuffle large datasets, they can be merged round-robin. Fixed an error message when autodetecting quality encoding. Refseq sketch server is now double the normal resolution (sizemult=2). SendSketch defaults to sizemult=2 for RefSeq. Sketch server startup script now sets sizemult=2 for refseq. Added logscale peak calling. Added peaks file GC annotation. Fixed an array out of bounds in EntropyTracker. CallVariants now ignores duplicates by default (0x400 bit). StatsWrapper will now append to the gc output if there are multiple assemblies. Wrote AnalyzeAccession and analyzeaccession.sh to reduce the memory footprint of accessions in the tax server. Added entropy filter flag to RQCFilter2. BloomFilter can now act as a highpass filter. 38.02 BloomFilter can now do error correction, using the Tadpole algorithm. Added merge and unmerge to Tadpole and BloomFilter for dramatic error correction improvements. Improved BloomFilter error correction defaults and added smoothing. Improved BloomFilter's memory management and added a memfraction flag. Fixed tuc not working. Tadpole.BloomFilter ECC_ROLLBACK will now roll back merges also (but not ecco currently). Wrote Rollback object to simplify rollbacks during error correction. Spun BloomFilterCorrectorWrapper off from BloomFilterWrapper. Spun bbcms.sh off of bloomfilter.sh. Fixed a bug in msa.sh handling of reverse-complements. Improved msa.sh to fully expand undefined bases, accept fasta files, and name the output such that it is clear whether an alignment was forward or reverse. msa.sh now allows a cutoff for min identity. Improved bbcms smoothing. bbcms now allows a minimum fraction of kmers above a certain count to be specified. bbcms now prints more statistics about the loaded bloom filter. 38.03 Fixed broken interleaving in bbcms output.
Added seed flag to bbcms and bloomfilter. Added BBMerge vstrict and ustrict flags to bbcms. Added mergeOK and testmerge flags to BBMerge. Added BloomFilter support to BBMerge. BBMerge now automatically writes both mergeable and unmergeable pairs to out if ecco=t and mix is unset. testmerge flag now works with ecco. Fixed indentation for Tadpole/bbcms results. 38.04 bbcms and bloom filter now allow random seeds. Changed version printing to not repeat arguments. Eliminated redundant copies of mergeOK functions. Fixed bbcms testmerge flag. Fixed trim/qtrim flag in BBSplit help. Added relative error threshold for mergeOK. TODO: Does not seem to help in my test; try on single cell data. Added variable smooth width to bbcms. Changed bbcms default bits to 4 after testing. Fixed bbcms extra flag. 38.05 Fixed interleaving detection in SortByName. Changed interleaving detection in FileFormat to autodetect more aggressively. Fixed a bug with RQCFilter2 interleaving settings carrying over from BBMerge to FilterByTaxa. 38.06 Changed KmerArray to collide all possible kmer extensions into the same cell. Wrote FillFast to grab all 4 possible kmer extensions with a single modulo operation. Simplified some of BBDuk pair-tracking and discarding logic. Added trimfailures bbduk flag. Fixed a division by zero bug in SortByName.mergeRecursive. Fixed an array-out-of-bounds in CallPeaks. Made dual-kmer ANI estimation from Sketch more accurate. Added loglog support to BBMerge and Seal. Added loglogout support to BBMerge, BBDuk, and Seal. RQCFilter2 status.log now tracks kmers. Removed RQCFilter and pointed rqcfilter.sh to rqcfilter2.sh. 38.07 Changed KmerTable increment functions to require an incr value. Added sortbuffer flag to Tadpole, but speed was barely improved on high-depth Clumpified data. Migrated coremask and fillfast to tadpole2, but they make it slower for some reason.
Migrated shave and rinse improvements to Tadpole2; these can make those steps dramatically faster in metagenomes. Added BloomFilter serialization. Increased default k and minhits of Bloom filter in RQCFilter2 and added serialized filters. Reduced RandomReads default quality. Made gaussian insert size distribution default for RandomReads. Wrote FastaShredInputStream for faster Bloom filter loading with lower memory consumption. Fixed number of threads allocated to Bloom filter loading from index. 38.08 FilterByTaxa and RQCFilter no longer crash if a header cannot be parsed and the accession tables are not loaded. 38.09 bbcms default bits changed from 1 to 2. Improved bbcms tossjunk function. Added documentation to bbcms and Tadpole. Added fixextensions flag, and enabled it for CallVariants, BBDuk, Reformat, RQCFilter, BBNorm, BBMerge, BBMap, Tadpole, and bbcms. RQCFilter now extends reads prior to merging if there is enough memory. This means the insert size histogram will take longer, but allow non-overlapping inserts. BBMap now tracks statistics correctly when Bloom filter is enabled. Fixed Children flag in TaxServer. Shave and rinse no longer checks owner for initial high kmers. Shave and rinse now ignores initial high kmers above the isJunction trigger for extension in some cases, for a large speedup in isolates (uses shaveFast flag). Changed RandomReads default insert size distribution to more closely match JGI fragment library targets. Multithreaded KmerCountArray/KmerCountArrayU ownership array allocation via OwnershipThread for a large speed increase in assembly. Added 2passresize flag to Tadpole but it didn't seem to speed things up. Added Constellation-like output option for CompareSketch. Major changes to Kmer table sizing - a premade resize schedule is now used. Only for Kmer so far not UKmer. 38.10 Merged dev python changes. 38.11 Ported schedule to UKmer. Fixed a bytesPerKmer bug in KmerCountExact for k>31. Accelerated kmer lookups for k>31. 
Condensed code for shave/rinse, but no speed increase. Changed default exploredist from 100 to 300. 38.12 Stats now omits the first size bracket if it is less than minscaf. Fixed problems with extended stats in format 4-6. Fixed a bug in reporting amount of spike-in removed in RQCFilter. Multithreaded kmer frequency histogram generation using kmer and ukmer packages. mutate.sh now outputs vcf files. Fixed processing of sam files with M, =, and X in cigar string. Fixed a bloom filter BBMap bug in counting reads. Updated some pipelines shell scripts. Started writing a new KCountArray class, but abandoned it as the current one looks as efficient as possible. 38.13 Fixed a casting exception in Shared.sort. Fixed missing column from mutate.sh vcf output. Addslash for RandomReads now works with the illuminanames flag. Fixed mutate.sh VCF files. Wrote Contig and Edge classes. Wrote ContigLengthComparator. Transitioned Tadpole from building Reads to building Contigs. Wrote ProcessContigThread. Tadpole now writes additional information about contig ends to headers. Tadpole now strictly uses F_BRANCH and B_BRANCH instead of just BRANCH (TODO: D_BRANCH). Tadpole output should now have canonical orientation, order, and names (apart from circular contigs). Tadpole1 now has a preliminary contig graph processing phase (in progress). Tadpole now supports preliminary dot output (not yet correct). Added appendln to some ByteBuilder methods. Added print(Contig) to bsw. 38.14-38.15 Integrated dev Python changes; merging Git branches. 38.16 Ported Tadpole1 ProcessContigThread to Tadpole2. Added perfile flag to CompareSketch, which allows multithreaded loading. Added prealloc flag to CompareSketch. Revised TaxServer to use Sketch index, and typically run 1 thread per sketch. Added outsketch flag to CompareSketch. Modified RandomGenome to be faster and more flexible, and added a shell script. 38.17 Added Sketch minLevelExtended flag.
Fixed bbcms loglog using quality scores from the wrong read. Wrote MergeSketch and mergesketch.sh. Fixed a major bug in TaxTree.getNodeAtLevel and restarted all servers. Wrote KmerLimit and kmerlimit.sh. Wrote Shuffle2 and shuffle2.sh. Changed blacklist_nt_species_1000.sketch to blacklist_nt_species_500.sketch. 38.18 Modified RQCFilter and BBMap to correctly track and report unmapped reads and bases when using the Bloom filter. Wrote RQCFilterStats for tracking relevant RQCFilter stats. This is printed to filterStats2.txt. Added some columns to BBMap scafstats/refstats where a read is assigned to at most a single reference. All classes that used ThreadLocalRandom now call Shared.threadLocalRandom() to comply with Java 6. Wrote KmerLimit and kmerlimit.sh to restrict a randomly-ordered file to a specific number of unique kmers. Wrote KmerLimit2 and kmerlimit2.sh to restrict an arbitrarily-ordered file to a specific number of unique kmers via subsampling. Updated /pipelines/ scripts for fetching and sketching. 38.19 Updated RQCFilterData tar. Updated wrapper shellscripts to handle Cori error messages. Fixed a bug in tracking duplicate reads in RQCFilter. 38.20 Added logsum and powsum to stats.sh gc output format 5. Fixed a bug in tracking reads in RQCFilter. Fixed a basic to extended taxonomy translation routine in TaxTree. Added JSON (format 8) to stats.sh. Fixed(?) BBMap tracking of trimmed/untrimmed bases for mapped and unmapped reads. Fixed bugs in RQCFilter tracking of trim/untrimmed mapped bases. 38.21 Wrote JsonLiteral and modified Stats to not put quotes around formatted floats. Added support for accession, gi, and header lookups to RenameGiToNcbi. --help or --version now exit with status 0 rather than 1. Updated some documentation. Added BBDuk trimpolyg flag. FlowCell MicroTiles now track more data and have more methods. Wrote PlotFlowCell and plotflowcell.sh, to look at the distribution of polyG in NovaSeq runs.
Fixed a broken if-else in AccessionToTaxId that was causing TaxServer to start with prealloc false. Fixed a bug in verifying other mapped stats in RQCFilter2. 38.22 Added getters for sketch.Comparison and sketch.CompareBuffer, and made fields private. Fixed bug causing Sketch unique count to display incorrectly - bitsetbits had been changed from 2 to 1. It should be 2; made static final. Fixed an array size bug in Tadpole caused by increasing the range of termination codes. Fixed a problem of Kmers being appended to ByteBuilders reverse-complemented. This impacted Shaver2. Fixed a static variable (MASK_CORE) hangover from Tadpole1 into Tadpole2 with TadWrapper. Added more BBDuk polyG options. Added polyG options and tracking to RQCFilter. Fixed an incident where a new KmerComparator was created unnecessarily. Clumpify now correctly counts the number of reads when a temp file is streamed without being clumped. 38.23 Wrote hiseq.CycleTracker. Fixed a parse error in AnalyzeFlowCell. Added preliminary G-bubble-detection and elimination to AnalyzeFlowCell, but it is not clear if it is working correctly. Wrote hiseq.IlluminaHeaderParser. Revised A_Sample, A_SampleMT, and A_SampleByteFile with additional submethods to reduce the length of long methods. Removed JNI path flag from BBMerge, BBMap, and RQCFilter shell scripts. Fixed a bug in reading adaptersOut.fa from RQCFilter2. Changed the way path is appended to output files in RQCFilter2. Added poly-C flags to BBDuk. Wrote PolymerTracker. Added polymer count tracking to BBDuk and RQCFilter. Added clipfilter to Reformat. 38.24 Skipped this version. 38.25 Added maxcov flag to Tadpole. Seal now supports filenames without the ref= flag to allow wildcard expansion. Removed calcmem.sh perl dependency on Genepool, since Genepool is gone. Fixed a logging bug in RQCFilter. Added optical alias to RQCFilter. Modified mergesorted.sh. SortByName and MergeSorted buffer-resizing logic made safer. 
Fixed leftRatio calculation in Tadpole for printing in contig headers. Fixed an unwanted print statement in Tadpole dot generation. Fixed a crash in Clumpify when handling Ns. BBMap bloomserial now defaults to true. Deleted normandcorrectwrapper.sh. Updated removehuman, removehuman2, etc. to use Bloom filters and clarified that the scripts are for NERSC. Wrote PercentEncoding for translating URLs, and made it more efficient by removing String functions. 38.26 Improved Blacklist name translation. Data internmap is now faster and takes less memory. Made prok package for prok gene-calling. Moved LOGICAL_PROCESSORS to Shared to avoid an initialization order problem. Fixed a bug in FastaReadInputStream with buffer resizing logic. Disabled some assertions in BBIndex that do not appear to be valid with a long maxindel and many short contigs. Added nl() and tab() to ByteBuilder. Reduced memory prealloc request for kmer tables on high memory (>120G) nodes. Fixed CallVariants reporting of deletion count. Clarified CallVariants SamStreamer flag, and capped it at Shared.threads(). Clarified callvariants2.sh purpose and function. Wrote AnalyzeGenes, CallGenes, and CompareGff. Added amino acid output to CallGenes. 38.27 Bugfixes and improvements to gene calling. Began adding RNA models to gene calling. Refactored gene-caller to allow more flexibility with models; pgm format changed. Adjusted default gene model. 38.28 Multithreaded AnalyzeGenes. Wrote FloatList. Fixed a bug in Tools.reverseInPlace for partial arrays. Added trimcircular flag to Tadpole to trim ends of loop-loop contigs, which are presumably circular. Finished tRNA and rRNA models and calling functions. 38.28 Fixed a bug in 3-column Sketch colors. 38.29 Calibration of gene models. Fixed a bug with chloroOutFile/fbtOutFile name in RQCFilter2. Sketch now allows integrated gene-calling for nucleotide to protein translation. Added minsize and maxsize to RepresentativeSet. 38.30 More calibration of gene models. 
Fixed some misassumptions in percent encoding. Modified GatherKapaStats to output raw data. Generated a minimal representation of RefSeq Microbial... achieved 80% size reduction. Changed the way pileup calculates coverage from soft-clipped bases; they are now ignored. Changed the way samtools/sambamba exclusion flags are processed to be more flexible and faster. Pileup now uses samtools to parse the header and sambamba to parse the reads, since sambamba is slow at reading headers. Added key=value pair output to pileup. Wrote ScoreTracker to track scores of accepted and rejected ORFs when calling genes. 38.31 Added long kmer support to RNA calling in CallGenes. Added BBMerge flags maxmismatches and forcemerge. Added Tadpole flag filtermem. 38.32 Tadpole now refuses to run with no input files. BBMerge now supports filtermemory flag. Wrote KmerFilterSetMaker and kmerfilterset.sh to generate small covering sets of kmers for use with BBDuk. Added silent flags to suppress screen messages from BBDuk, Reformat, and KmerTableSet-related classes. Added reformat padding flags. 38.33 Shred now validates input files. Reformat now has options for padding sequences. ****KmerFilterSet now accepts an initial kmer set. Wrote IntList3. Wrote HashArrayHybridFast. Changed HashArray bulk add contract. Back-ported HashArrayHybridFast changes to KmerNode2D. Seal now uses HashArrayHybrid; indexing Silva became >100x faster. Sketch now uses HashArrayHybrid; indexing speed increased somewhat. Added amino support to BBDuk. Added amino support to KmerCountExact. Added amino support to EntropyTracker. Modified entropy defaults for amino acid mode with Sketch(?) and BBDuk(?) Fixed tracking of PercentOfPairs for insert size statistics. CompareSketch now automatically sets the protein, fungi, or mito path on NERSC. Mutate.sh now works on amino acid sequences. Validated CompareSketch on raw reads in protein space; it works amazingly well. 
38.34 Wrote MetagenomeDataWriter to produce some stats for Brian Foster. Modified PreParser and Shared to deal with determining the original command line. TODO: (Brian Foster) report base and read counts exactly, not rounded to the nearest million. Refactored and commented IntList classes. Added merge and ecco to CallGenes. Wrote MetadataWriter to allow unified reads in and out nomenclature for certain programs. Removed a constructor from PreParser. Fixed MetadataWriter for AssemblyStats. Added support for protein Sketch server. Fixed some printing errors in CallGenes. Added recode and retranslate to CallGenes. Increased SendSketch default sizemult for RefSeq and proteins to 2.2. 38.35 Added sketchonly flag to CompareSketch, allowing it to just sketch and write files but not actually run comparisons. Protein sketch server is now active. Added TaxTree.descendsFrom(child, parent). TaxTree now classifies species-attached no-rank archaeal nodes as strains in addition to bacteria. pigz --version is now recorded to determine whether -11 and -I flags are supported. Added sketch sixframes flag, for dealing with indels. This works surprisingly well but bloats the genome size. Probably the size should be divided by 6. Added prokprot sketch to RQCFilter. Sketch now ignores AA kmers spanning stop codons in sixframes mode. Fixed a flaw in rkmer generation following Ns, in many classes. Added Sketch toValue2 function to process dual kmers in an unbiased manner. This yields more accurate ANI. Added comparison logic for tracking k1 and k2 matches independently. toValue2 now handles aminos as well. Changed default kmer lengths from 31,0 to 32,23, and 10,7 to 11,7. Simplified some parts of Sketch, like removing aniFromWkid flag. Changed an assertion in TaxTree to a warning, because the latest version of NCBI taxdump contains errors. Validation of K and hash version between sketches is now more robust.
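With the default nucleotide kmer length raised to 32, a 2-bit-per-base kmer occupies all 64 bits of a long, which is an edge case when building a bitmask in Java. The sketch below is an illustration only and is not taken from the BBTools source; the class and method names are hypothetical. It shows why a shift-and-subtract mask collapses to zero at k=32, and one formulation that stays valid for k from 1 to 32.

```java
// Hypothetical illustration; not the BBTools implementation.
class KmerMaskDemo {

    // Common idiom that is only valid for k <= 31.
    // In Java the shift distance of a long is taken modulo 64, so for k=32
    // (1L << 64) evaluates to 1L and the resulting mask is 0.
    static long naiveMask(int k) {
        return (1L << (2 * k)) - 1;
    }

    // Valid for 1 <= k <= 32: start with all bits set and drop the unused
    // high bits with an unsigned right shift.
    static long safeMask(int k) {
        if (k < 1 || k > 32) {
            throw new IllegalArgumentException("k must be in [1, 32]: " + k);
        }
        return -1L >>> (64 - 2 * k);
    }

    public static void main(String[] args) {
        System.out.println("naive k=31: " + Long.toHexString(naiveMask(31)));
        System.out.println("naive k=32: " + Long.toHexString(naiveMask(32)));
        System.out.println("safe  k=32: " + Long.toHexString(safeMask(32)));
    }
}
```

Both methods agree for k up to 31; they differ only at k=32, where the naive form silently masks away every base.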
Fixed all instances of kmer bitmasks to work correctly with k=32; prior limit was k=31. Added 1-bit antialiasing to Sketch hashcodes. Bumped hash version to 2. Increased amino default kmer length to 12,8 to increase specificity. Fixed an assertion failure in comparesketch perfile mode. Increased size of prokprot blacklist. Added Sketch refhits flag, to indicate the number of references sharing kmers with keys hitting a reference. Remade prokprot blacklists at a higher taxonomic level to deal with high conservation. Fixed an assertion with regards to sketchonly mode in comparesketch. avgrefhits is now weakly factored into score. Modified some rqcfilter2 sketch flags such as minprob. 38.36 Increased Sketch minprob to 0.0008. Q7 (80% accurate) areas will be used but Q6 (75%) will be ignored; before it was 0.0001 (Q6.1). This slightly increases accuracy with raw reads. Trimrname now works on sam headers. Trimrname is now automatically set to the same as trd unless explicitly overridden with the trimrname flag. Added small RNA adapters to adapters.fa (thanks to Daniel N.). Sketch now reports the number of unique kmers indexed. BBTools can now read embl and gbk formats. Added support for subcohort taxonomic level. 38.37 Fixed a bug in BBDuk JSON readsOut reporting. BBSketch format 3 now prints taxID. Fixed broken qin flag (was being overridden by autodetection). Improved quality autodetection for out-of-range quality scores. FastqReadInputStream now correctly inherits interleaving from FileFormat rather than running internal tests. Added JsonParser.parseJsonObjectStatic. Added Blacklist.toBlacklist. Added SendSketch.toAddress, .setFromAddress, and .sendSketch (static). Simplified SendSketch parsing. TestFormat now automatically tries to detect organism with SendSketch. ReadStats bhist is now faster by formatting with ByteBuilder. Added TestFormat bhistlen flag to disable gigantic bhists. 38.38 Fixed a parsing error in SendSketch.
Wrote docs/RestartingServers.txt. Fixed CallGenes load failure with under 9 threads. Added a 100k limit to SendSketch queries per instance, and added reference tars to the website. Increased buffer sizes of SendSketch. Reduced number of threads per session for Sketch servers. Added trackers for number of Sketches processed, bytes received, and bytes sent to Sketch server. 38.39 Fixed a bug in phist (required polysymbol to be set). Fixed a bug in BBDuk amino mode (failure to support k=12). Fixed a bug in bhist (no newlines!). Sketch and Tax servers now track single versus bulk queries. Converted several ReadStats histograms from TextStreamWriter to ByteStreamWriter. 38.40 Replaced some obsolete StringBuilder methods (mainly for read printing) with ByteBuilder. Deleted obsolete classes ReadStreamStringWriter and SortByMapping. Replaced many instances of StringBuilder with ByteBuilder. Moved some fields from Gene to Shared. Made Header class. Fixed a float-to-int rounding-down problem making BBMerge not strictly obey the maxmismatches flag. Redid RandomReads naming format to be pair-capable in sam format. Converted all known header-parsing functions to use the new format. Wrote SuperLongList.toString. Added Reformat prioritizelength flag for subsampling variable-length reads. Fixed trailing whitespace in bhist. 38.41 Fixed a compile error. 38.42 Wrote SubSketch and subsketch.sh to pull partial sketches out of larger sketches (e.g. to shrink RefSeq). Added stats handler to TaxServer, with version and quantity tracking. Added bbversion field to sendsketch header. Fixed SendSketch address parsing. Added p and q suffixes to parseKMG. Added PacBio read length modelling to RandomReads. Fixed a CallVariants assertion with SamLine.RNAME_AS_BYTES. Fixed major bug in vcf line reading, misinterpreting variant types, preventing BBDuk from parsing vcf properly. Wrote SamStreamerMF, a multifile SamStreamer. Integrated SamStreamerMF into CallVariants.
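The minprob thresholds quoted in 38.36 above can be reproduced approximately with a little arithmetic. The sketch below assumes that a kmer's probability of being correct is the product of its per-base accuracies, that every base in the kmer has the same Phred quality, and that k=32 (the default set in 38.35). These are assumptions made for illustration, and the class and method names are hypothetical rather than BBTools APIs.

```java
// Hypothetical illustration of the minprob arithmetic; not BBTools code.
class MinProbDemo {

    // Per-base accuracy implied by a Phred score: 1 - 10^(-Q/10).
    static double baseAccuracy(double phred) {
        return 1.0 - Math.pow(10.0, -phred / 10.0);
    }

    // Probability that all k bases are correct, assuming independent
    // errors and a uniform quality score across the kmer.
    static double kmerProb(double phred, int k) {
        return Math.pow(baseAccuracy(phred), k);
    }

    // True if a uniform-quality kmer would pass the given threshold.
    static boolean passes(double phred, int k, double minProb) {
        return kmerProb(phred, k) >= minProb;
    }

    public static void main(String[] args) {
        int k = 32;
        for (double q : new double[] {6, 7}) {
            System.out.printf("Q%.0f: accuracy=%.4f, kmerProb=%.6f, passes 0.0008: %b%n",
                    q, baseAccuracy(q), kmerProb(q, k), passes(q, k, 0.0008));
        }
    }
}
```

Under these assumptions a uniform Q7 32-mer comes out at about 0.0008 and just clears the new threshold, while a uniform Q6 32-mer comes out a little under 0.0001 and is rejected.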
Now, with 8 sam.gz files, CallVariants is about 5x as fast on a 32-core node. Fixed CallVariants vcf output MCOV reporting -1 when out= is set instead of vcf=. Fixed ihist not working in BBDuk. 38.43 Wrote var2.VarKey for hashing. May not use it. Added indel processing to fixVars, and Read.containsVars(). Fixed bugs in reading insertions from VCF files. TaxServer usage no longer displays stats (stats are on the /stats page). Added ref flag to CompareVCF. Added shist to FilterVCF (for vars passing filter). FilterVCF no longer requires a reference (in most cases) if the VCF has a correct header. CallVariants modified to reduce negative impact of strand bias and read bias on score, in cases that otherwise appear fine. Demuxbyname can now do 1 file per sequence header, but it does not close the streams as soon as a sequence is written. This would be better as a custom program. Removed a mysterious automatic newline from Read.toSam(bb). Wrote CoverageArray3A, Atomic version. Added atomic flag to CallVariants, which increases speed by up to 300 percent. Increased speed of multithreaded coverage calculation even without atomic flag. Fixed stranded coverage default to false. Added CoverageArray.incrementRangeSynchronized. CallVariants trackstrand now correctly defaults to false, which disables the DP4 field. CalcTrueQuality should now ignore indels declared in a VCF. 38.44 Fixed a bug in Tools.parseKMG. Added qualhist to CallVariants. Added code in CallVariants to deal with recalibrated base quality. CallVariants no longer needs ref= prefix before fasta reference. FilterVCF can now split alleles. Modified mutate.sh to allow variable-length indels, and not put them too close together (to allow better grading). Major: Fixed BBDuk/Seal/Clumpify issue in failure to correctly reverse-complement some kmers. 38.45 Last restarted timestamp fixed for TaxServer stats page. Clarified randomreads.sh description of generating twin files versus interleaved. 
Added Read.countVars, CallVariants.findUniqueVars. Added support for indels and border to FilterSam. CallVariants can now force calls of specific alleles with an input vcf. VarMap is now iterable over values. Modified ShrinkAccession to optionally retain GI numbers. Fixed VCF genotype call of 1 for haploids failing filters. Updated GiToNcbi to read gi numbers from accession files since gi files will disappear soon. Clarified bbduk.sh comment on maxlength. Added unzip.sh script. Split Sketch displayfname into rfname and qfname. Fixed file column being enabled by default for sendsketch. Changed VarMap WAYS to 8, allowing 16 billion variants. Short match strings no longer generate consecutive symbols like mm because it is hard to parse. MSA.score() now accepts short or long match strings. CallVariants no longer generates long match strings prior to trimming, for perfect matches; 5-10% faster. FilterVCF can now split long substitutions into SNPs with the splitsubs flag. Fixed CalcTrueQuality ploidy unset warning. Add ls to testfilesystem. May be inaccurate due to cache effects. Added amino acid codes B and Z, mapped to ANY (same as X). CallVariants now integrated into FilterSam. BBCMS now supports sam files, if error-correction is disabled (depth filtering is allowed). Added some columns to CallVariants screen output for average allele depth. Added taxonomic levels series and section. Added RenameGiToTaxid badheaders flag for logging. Added RenameGiToTaxid maxbadheaders flag for early termination when exceeded, and included it in the download scripts (at 5000 since recent nt contains 2440 headers with no TaxID). Removed sharedVarMap from CallVariants2; replace with forcedVars1 or forcedVars2 for the two passes. FungalRelease agp generation now uses ByteStreamWriter over tsw and Read.breakAtGaps uses ByteBuilder over sb to save memory. Fully commented MSA11ts fullUnlimited. 38.46 Added Unzip.java and fixed unzip.sh. 
It is pretty resource-intensive, though, for a program that does nothing. This is possible to improve. Added KID and WKID to Sketch format 3, and flags to disable them. CompareVCF now prints results to screen correctly when there is no output file. TaxServer now defaults to 200k max reads in local mode. In local mode, TaxServer no longer reads files with pigz. FilterVCF now correctly observes del and ins flags. Added Var.COMPOUND type for multiallelic variations. Added VCFLine.trimPrefix() and trimSuffix(). Fixed bugs in trimToCanonical handling of compound variations. VCFLines split by allele now split INFO fields as well. Wrote demuxbyname2, to support massively multiplexed Novaseq runs. Splitting alleles now also splits the info field of VCFLines. 38.47 Added demuxbyname2 hamming distance support. Renamed Var.COMPOUND to Var.MULTI and added Var.COMPLEX. Modified demuxbyname2.sh to use pigz. Increased compiler error level (@Override, shadowing) and fixed resulting errors. Wrote MultiCros3, which supports concurrent streams; this makes DemuxByName2 faster. Made BufferedMultiCross an abstract superclass of MultiCros2 and MultiCros3. 38.48 Added samline field to Read. obj field is no longer used for SamLines. Caused substantial refactoring; may have introduced bugs when processing sam files (they will not be subtle if present). BufferedMultiCross now offers a threaded mode, but this has not improved performance. BufferedMultiCross now supports minReadsToDump and puts residual reads into unknown. Fixed DemuxByName2 hamming distance code, and improved it to only remove colliding keys. 38.49 Fully commented DemuxByName2, BufferedMultiCros, MultiCros2, and MultiCros3. Fixed a bug in MultiCros3 that created some duplicate reads. Speed is now >950MB/s for twin files. 38.50 Added bgzip control flags and version parsing. .vcf.gz files now default to being written and read by bgzip. All gzip files now default to being read with bgzip over pigz. 
Non-vcf files will only be written with bgzip if the bgzip flag is added (for now). Added alternate Sketch addresses via vm flag. minProb and minQual moved from SketchObject to DisplayParams, requiring the modification of many methods. Simplified some Sketch method signatures by allowing DisplayParams to substitute for multiple parameters. Added Locale to all String formatting without it. Refactored DemuxByName2. Improved commenting of DemuxByName2 and related classes. Added PacBio subread support to PartitionReads (partition.sh). Disabled ByteFile1 being forced outside of JGI. ByteFile2 caused some problems, but those should be resolved now, I think... Added loglog and barcode flags to DemuxByName2. Fixed order of SendSketch setting server address to allow alternate (VM) server use. Fixed DemuxByName2 order of parsing parser args, allowing the barcode flag to trigger. Unified DemuxByName2 modes under a single mode field. Fixed maxrecords not being observed in Sketch JSON format. TaxServer sketch handler now does full parsing of URL arguments. Added D3 support to Sketch results. 38.51 Changed handling of same-name JSON keys; by default they are now replaced. Improved Sketch D3 output - added more keys, fixed depth handling. Subprocess testing now returns false for exit codes 126 and higher (missing libraries yield 127). Turned bgzip and pigz on by default for all programs. Made bgzip the default for RQCFilter. Modified TaxServer sketch portion to prevent carryover of parameters from subsequent queries. Fixed Sketch header reporting observed depth as actual depth. Wrote IceCreamFinder, IceCreamAligner, and icecreamfinder.sh. Wrote A_Sample_Generator, IceCreamMaker, and icecreammaker.sh. Moved A_Sample classes to new templates package. Changed some new Random() calls to Shared.threadLocalRandom(). Added jsonarrays flag to Sketch. Wrote IceCreamGrader and icecreamgrader.sh. Renamed demuxbyname2.sh to demuxbyname.sh.
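The exit-code rule from 38.51 can be sketched as follows. This is an illustration under stated assumptions, not the BBTools implementation: the class and method names are hypothetical, and treating every exit code below 126 as usable (including nonzero codes such as a usage error) is inferred only from the changelog wording. It requires Java 9 or later for Redirect.DISCARD.

```java
import java.io.IOException;

// Hypothetical illustration; not the BBTools implementation.
class SubprocessProbe {

    // Rule from the changelog: exit codes of 126 and higher count as failure.
    // Shells conventionally use 126 for "found but not executable" and 127
    // for "not found"; the changelog notes missing libraries also yield 127.
    static boolean exitCodeIndicatesUsable(int exitCode) {
        return exitCode < 126;
    }

    // Runs the command, discards its output, and applies the rule above.
    // A launch failure (for example, no such executable) counts as unusable.
    static boolean canRun(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return exitCodeIndicatesUsable(p.waitFor());
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("pigz usable: " + canRun("pigz", "--version"));
        System.out.println("bgzip usable: " + canRun("bgzip", "--version"));
    }
}
```

Keeping the exit-code decision in its own small method makes the threshold easy to test without launching a process.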
38.52 Made IceCreamFinder ~50% faster by debranching loops and optimizing cache footprint. Added IceCreamFinder junction output. Simplified shell scripts by centralizing path-setting commands. Moved JNI library loading to Shared. Wrote IceCreamAligner JNI version. 38.53 Made IceCreamAligner JNI faster by adding functions for all alignments and adding 16-bit versions. Fixed bugs in calcmem.sh path setting and module loading on Cori. Automated jni library path setting (-Djava.library.path flag is no longer required). Disabled BBMerge attempt to load JNI libraries. Added magic number detection for .gz files. Disabled bgzip reading of non-bgzip .gz files, awaiting new bgzip release, because current bgzip breaks on concatenated gzip files (supposedly addressed after v1.9). 38.54 Changed a method to avoid a Java 11 dependency. Added ZMW stats to IceCreamFinder. Added preliminary adapter detection to IceCreamFinder. 38.55 Fixed a JNI bug in RQCFilter with BBMerge. 38.56 Improved and accelerated IceCreamFinder adapter detection. Reduced discarding of reads with adapters only at the tips. 38.57 Greatly improved IceCreamFinder adapter detection sensitivity by aligning to more reads. Increased speed of adapter aligner. Added less-specific adapter-screening phases to reduce calls to the adapter aligner. Added ambig output stream and changed the logic for determining ambiguous inverted repeats. Adapter-containing inverted repeats no longer go to junctions output. Improved timeless adapter aligner and made it default. Added start location to low bits of timeless aligner score, but it does not seem to work. 38.58 Fixed PreParser failure when encountering a standalone equals sign. Fixed a bug in automatically setting Sketch blacklists for known databases. Updated server-starting shellscripts to point to the new URLs. Renamed Missing Adapter as Absent Adapter. Changed ambiguity logic to better classify reads when there are 2 passes. 
Adapter alignment is slightly more lenient when an inverted repeat is detected. Slightly accelerated adapter detection by changing conditionals to array lookups in the inner loop. SendSketch can now load TaxTree. Increased Sketch number of comparisons returned, to compensate for potential losses during TaxFilter. 38.59 Added json output and stats redirection to IceCreamFinder. Added preliminary SamStreamer support to IceCreamFinder. SamStreamer now supports a limited number of reads. Added libbbtools.dylib (Mac version) to jni folder. Thanks Jie Wang for compiling it! Updated makefile.osx and jni readme. CoveragePileup now detects and aborts when a scaffold is specified multiple times with different lengths. Added ByteBuilder.print(float x, int decimals). Added asrhist and irsrhist to IceCreamFinder. Fixed an unnecessary array copy in adapter detection; X is now properly added to reads with adapters detected. Added trim support to IceCreamFinder. 38.60 maxReads is now a required parameter for SamStreamer; this allows acceleration of some other tools when reads are limited. Redid Sketch taxfilter. Now there are two different taxfilters, white (include) and black (exclude). The flags have changed. Organism names are now acceptable for TaxFilter. JNI mode for IceCreamFinder and BBMap is now automatic on NERSC or Mac/Linux AMD64 systems. Moved OS/CPU environment detection from Data to Shared. Restarted Sketch servers; they will no longer handle the old taxonomy filtering flags. Added reformat complement flag. Fixed spelling of complement in some cases. Added Sketch taxID to Sketch lookup table. Added Sketch server reference mode. Sketch taxonomy and metadata filtering are now handled by DisplayParams, and done prior to comparison, exactly once, and in threads. 38.61 Added Sketch KID, WKID, and hits comparators. Revised TaxonomyGuide.txt. Wrote ThreadWaiter and simplified A_SampleMT. Fixed an accidental use of bgzip for decompression. 
Fixed an erroneous error message (header with no bases) from splitting reads of target length in FastaReadInputStream. Added fixsra and addpairnum flags to RenameReads. Modified and moved ncbi and sketch scripts to pipelines/fetch and pipelines/server. 38.62 Added ACGT count tracking and printgc to Sketch. Sketch JSON format now caps decimals places of some numbers. MergeSorted can now use subprocess for decompression. Added linear sketch sizing via the density flag. Added polyploid support for MutateGenome (ploidy and hetrate flags). Added nohomopolymers flag to MutateGenome. Updated calcmem.sh with path for pigz. Revised fetch pipeline scripts again. Added plasmids to prokprot; removed viroids which no longer exist. Sambamba should no longer print the banner. Wrote FetchProks and fetchproks.sh, for downloading one genome assembly and gff per prokaryotic genus. Updated model.pgm with all archaea and one bacteria per genus. Deleted spurious copy of GffLine. Split VcfToGff off of GffLine. Moved Gff-related classes to gff package. Wrote GbffFile, GbffLocus, and GbffFeature. Fixed equals method in StringNum. Rewrote CompareGff to take sequence name and type into account. Generated pgms for plastid and plasmid, but they made bacterial calling worse. Enabled 5S long kmer support (9-mers) for CallGenes. Might be worthwhile ignoring the 1-count kmers. 38.63 Gene-calling long kmers are now uncompressed. tRNA and 5S now use 10-mers instead of 9-mers; plastid, plasmid, and viral sources are included. Fixed some remaining crash bugs from adding GC content to Sketches. Updated RefSeq protein sketching pipeline. 38.64 Added TaxTree methods for determining if a node descends from unclassified or environmental samples. Added banUnclassified and banVirus Sketch flags. Reduced TaxServer startup time by around 60% by multithreading AccessionToTaxid per-file reading. Wrote GlocalAligner to perform flat alignments for SSU identity calculation, and integrated it into sketch.Comparison. 
Wrote AddSSU and addssu.sh. Wrote AlignmentJob and AlignmentThreadPool to maintain a growable, limited pool of shared threads for aligning SSUs. Added GeneModel length restrictions for RNAs based on empirical data. IceCreamFinder trim flag now adjusts read coordinates in headers. Fixed a bug where IceCreamFinder trimming SamLines corrupted their quality. CallVariants now automatically converts IUPAC symbols to N; before it crashed when encountering them. iupacton flag now works on all programs and happens during read validation. 38.65 Updated FindPrimers to use a different, faster aligner and perform reverse alignments. 38.66 Removed a debug print statement. Fixed a crash bug in RepresentativeSet. 38.67 Fixed some ssu-related issues in Sketch. CompareSketch and SendSketch no longer try to grab SSUs if printssu=f. 38.68 Added subkmer support for BloomFilter via kbig. Added subkmer support for BloomFilterCorrector via ksmall. BBCMS subkmer support improves accuracy at very high load. Improved FetchProks to prefer reference assemblies when available. FetchProks now retries when connections time out. Added maxidfilter to Reformat. Wrote A_SampleSamStreamer. Added comma() and under() to ByteBuilder. Multithreaded AnalyzeAccession; now roughly 4x faster (190 -> 42 seconds). Speed limited by largest file. Wrote consensus package (BaseGraphPart, BaseNode, BaseGraph, ConsensusMaker) and consensus.sh. Added SamLine.calcIdentity(). Added SamFilter minid and maxid flags. Added SamLine leftmost() method. Wrote FixScaffoldGaps and fixgaps.sh for resizing scaffold gaps based on mapped pair insert sizes. IceCreamFinder now discards short reads after trimming. Added ConsensusMaker mindepth, mafsub, mafdel, mafins, mafn, and usemapq flags. Wrote Lilypad and lilypad.sh (scaffolder). Tested ConsensusMaker with Quast and changed default parameters. Added ConsensusMaker nonly and noindels flags. Fixed deletions sometimes being counted as bases in pileup.sh when delcov=f. 
Fixed a depth-0 assertion error in Tadpole. Added popbubbles to Tadpole. 38.69 Improved Tadpole bubble-popping and fixed some assertions. Multithreaded FindPrimers (msa.sh). Improved speed of SingleStateAligner by removing mode tracking (maybe 15% impact). But it does not produce correct tracebacks. Fixed a bug in FindPrimers with not swapping bases in the output. FindPrimers in swap mode now produces sam headers. Added multiple debubbling passes; may incur small misassemblies (default off). Added dead-end debranching; may incur small misassemblies (default off). PopBubbles now resets LOOP end conditions if new loops are created. BBSketch now automatically disables minprob if 0-quality reads are detected (due to PacBio). Fixed some scenarios in which Sketch would not spawn an adequate number of threads, particularly for long reads. TestFormat (testformat2.sh) now counts and makes a histogram of ZMWs, and provides better taxonomic identification for PacBio. 38.70 Fixed redundant comparisons in FindPrimers. Fixed incorrect reverse-complement in FindPrimers. Added Reformat flipsam flag, to disable flipping sam records into the correct orientation upon loading. Fixed a consensus bug in BaseNode. Added trimdepthfraction and trimns to ConsensusMaker. Added alternate aligner (msa2 flag) to FindPrimers. Fixed a bug in FindPrimers regarding max columns needed in swap mode. Wrote scripts for automatically determining consensus sequences of ribosomal components. SingleStateAlignerFlat now correctly produces N/m/S or M/=/X symbols in match/cigar strings. SingleStateAlignerFlat2 now works correctly. Added flags controlling RNA consensus alignment for CallGenes. Wrote 1D version of FlatAligner. Fixed SketchMaker running out of memory with SSU alignments: Prevented rRNA candidates from being created at over 8x the expected length (retries with an increased bias). Prevented rRNA alignments from being performed at over 15x expected length. 
Seems to mainly affect 16S search in plants. Adjusted RNA score cutoffs and adjusted algorithm for RNA gene-calling; more rRNAs and tRNAs will be called now. MergePGM now supports a per-file multiplication factor for asymmetric merges (e.g. in=bact.pgm@1,arch.pgm@5). Began adding 18S support to CallGenes. AnalyzeGenes now supports the alignribo flag, enabling it to discard Refactored Prok package, adding ProkObject with statics. LSU and SSU start and stop slop are all independently configurable, and optimized. Rewrote CutGff: CutGff is now multithreaded per file. CutGff now supports alignment. CutGff can rename by taxID. FetchProks now attempts to find the best assembly of each species (longest scaffold). Connection retries for FetchProks now wait for an increasing amount of time after each failure. Many other changes; to be documented. 38.71 Fixed an issue with containments of paired reads in Clumpify. Started AWS servers for Taxonomy and Sketch, and added code to enable them (in SendSketch, etc) via the aws flag. Some default resource paths are now set automatically based on the presence of the environment variable EC2_HOME. Added seed flag to AnalyzeFlowCell (filterbytile.sh). 38.72 SortByName now deletes recursive temp files after a full pass rather than incrementally, for easier resuming. Added BBDuk entropytrim flag. Increased buffer limits for SortByName; now it should rely on the data limit rather than sequence limit. Improved available memory estimation. Modified SortByName memory management to hopefully avoid merging too many files simultaneously. Added Unite support for taxonomy header parsing. Reduced BlacklistMaker memory consumption to prevent crashes with long sequences, by reducing buffers and threads. This may reduce speed slightly. Reduced default Sketch keyfraction from 0.2 to 0.16. This should only impact tiny partial ribo sequences which are not very useful, but make blacklists more efficient. 
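The MergePGM example above (in=bact.pgm@1,arch.pgm@5) implies a comma-separated list in which each file may carry an @multiplier suffix. A minimal parsing sketch follows. The class name is hypothetical, the default multiplier of 1 for a file without a suffix is an assumption, and how MergePGM actually applies the factor is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for "file@multiplier" lists; not the MergePGM code.
final class WeightedInputs {

    /** Splits "a.pgm@1,b.pgm@5" into an ordered map of file name to multiplier. */
    static Map<String, Double> parse(String arg) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (String term : arg.split(",")) {
            int at = term.lastIndexOf('@');
            if (at < 0) {
                out.put(term, 1.0); // assumption: no suffix means a factor of 1
            } else {
                out.put(term.substring(0, at), Double.parseDouble(term.substring(at + 1)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("bact.pgm@1,arch.pgm@5"));
    }
}
```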
Moved BBTool_ST to templates and changed access modifiers of some methods. Fixed EntropyTracker failing at k=1. 38.73 Halved MSA2PBA aligner weights to allow longer sequences (like ITS). Added Sketch requiressu flag. Moved IntListCompressor to its own file in structures. Wrote BlacklistMaker2, which makes blacklists from sketches rather than sequences. Modified SubSketch to do bulk operations (on # symbol in filenames), apply a blacklist, and use autosize. More Sketch output columns use KMG for big numbers. Modified Sketch scripts to use sketchblacklist2.sh and subsketch.sh, plus merged genus/family blacklists. Blacklists may be a bit too strict now. 38.74 Trimmed ~300ms off Sketch startup by eliminating redundant hash mask generation and antialiasing. Added Sketch printcommonancestor and printcommonancestorlevel flags. Fixed display of decimal numbers in json. Fixed json array output form for multiple query sketches. Fixed SendSketch perfile mode losing input file order. Added DoubleList and related static functions. Wrote AnalyzeSketchResults.java. Added Sketch recordsperlevel flag. Added multiple Sketch output file option in perfile mode. Refactored AnalyzeSketchResults by spinning off ResultLineParser, Record, and RecordSet into files. Added AnalyzeSketchResults shrinkonly mode to remove unnecessary records. MergeRibo now makes a consensus per taxID and selects the best sequence on the basis of alignment to that consensus. Added align() to BaseGraph. Wrote CompareSSU for all-to-all SSU alignment. Multithreaded AnalyzeSketchResults alignment phase. Threads now accepts floats like 0.5 to use half of the logical processors. FetchProks now puts the taxID in the filename, which AnalyzeSketchResults now expects (but does not require). ukmer package now observes rcomp, thus KmerCountExact now works for k>31 with rcomp=f. TaxTree will now hash names on demand, allowing flags like exclude=Thermus for CompareSketch. 
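The 38.74 note that threads accepts floats like 0.5 to use half of the logical processors can be sketched as below. The class and method names are hypothetical. Treating only values strictly between 0 and 1 as fractions, rounding to the nearest thread, and enforcing a floor of one thread are all assumptions rather than documented behavior.

```java
// Hypothetical resolver for a threads=<value> argument; not the BBTools parser.
final class ThreadSpec {

    /** Converts "8" to 8 threads and "0.5" to half of the logical processors. */
    static int resolve(String value, int logicalProcessors) {
        double x = Double.parseDouble(value);
        if (x > 0 && x < 1) {
            // Assumption: fractional values scale the logical processor count.
            return Math.max(1, (int) Math.round(x * logicalProcessors));
        }
        // Assumption: anything else is an absolute count, with a floor of 1.
        return Math.max(1, (int) x);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        System.out.println("threads=0.5 resolves to " + resolve("0.5", logical));
    }
}
```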
RenameGiToTaxid should now produce proper statistics for gff files. gi2taxid.sh now reports the cause when it aborts due to bad headers. 38.75 DemuxByName should be fixed now, via disabling the closefast option. Added CompareSketch refseqbig and prokprotbig flags for internal use. Adjusted size and sensitivity of nt, RefSeq, and prokprot blacklists. Rewrote Sketch Comparison score calculation to give more weight to hit count, particularly for low hit counts. Added printssulen flag. SubSketch no longer sets the filename field to an input sketch file. Implemented comparison SSU printing in json format (previously it was just for the query). 38.76 Wrote prok.SplitRibo and splitribo.sh for splitting mixed sequence types (e.g. Silva) into individual files. Added an exception to reporting an error when terminating a subprocess with exit code 141 (sigpipe). Split Sketch SSU tracking into 16S and 18S; 18S is chosen when available, and displayed with a * symbol. Added MergeRibo 16S and 18S flags, and made them mutually exclusive. stdin and .sketch should no longer appear in the sketch filename field. SubSketch now retains counts. SketchMaker now calls CompareSketch for ONE_SKETCH mode in addition to PER_FILE. sketch.sh now supports multiple files in persequence mode. Removed/disabled perheader mode as it was not clear how it differed from persequence, and it was not documented anywhere. Fixed missing tabs in stats.sh format=4. Added ConsensusMaker auto-loading of rRNA subunit consensuses. Moved non-BBMap alignment classes to aligner package, and IceCream-related classes to icecream package; redirected/recompiled related C code. Fixed IceCreamFinder modification of synthetic headers with whitespace-delimited extra content. Added SplitRibo support for mitochondrial and plastid rRNA output. Updated Silva script to remove mito and chloroplast sequences by name (chloroplast 16S cannot be distinguished from cyanobacteria). Added iupacton flag to Silva formatting script. 
Modified AddSSU to allow SSU deletions and more precise control over choices between new and preexisting sequences. Wrote FilterSilva and filtersilva.sh to eliminate some troublesome sequences with ambiguous names that were getting misclassified. SketchMaker now allows 16S calling to be skipped for Eukaryotes or when not desired. Changed MergeRibo formula to value sequences closer to the consensus length rather than the longest. SketchHeap now also favors SSUs closer to the consensus length. Possibly accelerated mergeRibo by sorting lists by length, descending. Requiressu flag now properly supports both 16S and 18S. 38.77 AddSSU now accepts legacy files with #SSU instead of #16S/#18S. Updated variantPipeline.sh. Wrote BaseGraph.score for grading alignments using a model. Enabled BaseGraph serialization via ConsensusMaker outmodel flag. Fixed a bug in Aligner classes that replaced I with C at match position 0 or 1. ConsensusMaker now accepts unaligned fasta/fastq input if there is only a single reference sequence. 38.78 Made a floating point aligner version allowing positional weights. Fixed json null pointer exception on addLiteral. Added json output for CallGenes. stderr can now be specified multiple times for file output. Added length histogram support to CallGenes. Removed unnecessary assertions from BaseGraph.score, which assumed a consensus was being used as the reference (therefore, indels would be under 50% probability). Fixed a bug with underscores and no prefix in rename.sh. Loading consensus sequence will now try fq first then fa. Added and removed indel weights from the float aligner, as they didn't seem to help. Split ITS by clade and made consensus for fungi, plant, animal, and other. Implemented restrictleft/restrictright in Seal. Added BBMap readgroup autonaming based on input filename. Added BBWrap fofn support. Modified Dedupe to allow concatenation of headers of absorbed sequences. Made some changes to consensus model scoring to improve separation. 
38.79 Fixed a weird compiler static issue. 38.80 Extensive changes; unfortunately, log is currently incomplete as a result of COVID-related workplace changes. This will be updated. Improvements to cardinality estimation (LogLog-related structures and programs). Fixes and additions to variant-calling-related programs for the purpose of COVID genotyping. Fixed trimclip not working (BBDuk). 38.81 Add pileup tip ignore and qtrimming to match coverage from CallVariants. Wrote summarizecoverage.sh to summarize coverage basecov files for multiple samples. Fixed a testExecute bug causing bzip2 to hang. 38.82 CallVariants and Pileup no apply border to reads mapped to the ends of a chromosome. Filtersam no longer discards sam headers. Wrote A_SampleBasic to act as a template for programs that do not do stream processing. Tweaked some variant-calling scripts. 38.83 Fixed fastq interleaving detection for a rare failure with PacBio reads. Added entropy histogram support to BBDuk. Extended EntropyTracker to always track monomer counts in a window. Added EntropyTracker longestLowEntropyBlock function. Added entropy/monomer fraction filter to IceCreamFinder. 38.84 Removed some IceCreamFinder debug code causing a crash. 38.85 Fixed fastq interleaving detection for a rare failure with PacBio reads. Added entropy histogram support to BBDuk. Pileup modifications and a bugfix. Wrote ReformatPacBio, reformatpb.sh, and various support classes (ZMW, ZMWStreamer, PBHeader). Modified BaseGraph to do alignment. Modified BaseGraph to do piecewise alignment and handle overlaps correctly. Added CCS support to ReformatPacBio, but it currently only works well on synthetic data. Added artic3 primers to resources. Wrote FlatAligner2. Added hmm package and classes for parsing hmmsearch results. Added jasper package for intern. Added KmerPosition, KmerPosition3, and kmerposition.sh for positional kmer counts. These were written by Jasper Toscani Field. 
Added FlatAligner2 with flatter weights than FlatAligner. Added ApplyVariants support for renaming, excluding certain indels, and better handling of variations in low-coverage regions. Updated Covid scripts. Added A_SampleSummary template. Added BBDuk entropy histogram. Fixed IceCreamMaker reference loading. Fixed Tadpole1 ownership reinitialization bug. Added total sub/var count to FilterSam. Added CallVariants/VCF support for NearbyVarCount and Flag fields. Added seed to RandomGenome. Added TrimRead handling of aligned reads without attached SamLines. 38.86 Bump due to git glitch. TODO: Allow banning indels near poly-N in ApplyVariants. TODO: NVC, failnearby, and flagnearby seem to work for CallVariants but not CallVariants2. TODO: Add all reformat filters to FilterSam. TODO: Sketch redlist/whitelist, applied upon kmer eviction. TODO: BBDuk anomalies when left-trimming with mink. ***TODO: BBMerge should trim terminal poly-G or any homopolymer from adapter sequence. TODO: CallVariants2 is giving one extra var call and 2 failing vars that pass every sample... for the command: /global/projectb/sandbox/gaag/bbtools/callvarstest/synth2> callvariants2.sh in=deduped_trimclip.sam.gz,mapped2.sam.gz ref=ref.fa out=vars_fail_multi.vcf -Xmx1g ow strandedcov flagnearby TODO: Fix Pileup to deal with unpaired reads, or filter out reads with mate removed due to junk filter. TODO: Fix DedupeByPosition to deal with unpaired reads that say they are paired. TODO: Add minor allele support. TODO: ApplyVariants should optionally examine AF to give IUPAC codes for mixed alleles. TODO: Dedupebymapping has trouble with reads marked paired whose mate is missing. TODO: Examine interplay of minedist and chromosome ends. TODO: Add CallNs to CallVariants. There is a flag for it but the purpose is currently to call N alleles from Ns in reads. TODO: Add CallVariants coverage output. TODO: Add trimclip to CallVariants implicitly. TODO: Fix rqcfilterdata tax data. 
TODO: Fix size-zero SendSketch ribosomal bug. TODO: Add amplicon flag to CallVariants. ***TODO: SSAs appear to never perform clipping, and the score function is slightly different than traceback. TODO: Add CallVariants option to only produce the top allele at multiallelic sites. TODO: sketch.sh does not support multiple input files in onesketch or pertaxa modes. TODO: comparesketch.sh does not support refid flag, which is pretty useful... TODO: Update sketch-creation and download scripts... TODO: Different sizes of output files for 1rpl and 9999. TODO: CompareSketch outputs results in an arbitrary order, for per-file mode. TODO: Rename rdp data by taxa, and figure out how to pull out just 16S, etc. TODO: Consider adding gene caller support for ITS. TODO: Rename Sketch SSU column as Ribo and use ITS/16S depending on clade (or best for organisms that seem to have both... maybe 16S and ITS Sketch fields, but only one column.) TODO: Document AnalyzeGenes new flags. TODO: Filter and map reads to universal 16S in realtime to make a consensus. TODO: Why does SketchMaker run at ~10 cores when t=40? Check typical CPU utilization. TODO: Tool to isolate all 16S sequences from RefSeq based on gene-calling. Then this can be merged with RefSeq, etc and used as a resource when sketching RefSeq. TODO: Make KeepBestCopy multithreaded and align to 16S sequences. TODO: Option to rename CallGenes output by taxID (or add to description field). Try minimizing the number of active long sequences to reduce RefSeq memory consumption. Probably a problem with euks. Or, maybe unsorted would be better. Seems to run out of memory when it gets to mitos: #SZ:420 CD:ADC K:32,24 H:2 GS:16646 GK:16615 GE:14747 GQ:1 BC:4780,4432,3022,4412 ID:86930 NM:Richardsonius balteatus NM0:tid|86930|NC_033945.1 Richardsonius balteatus mitochondrial DNA, complete genome TODO: 5000s for prefilter phase - bgzip is at 150% but java is only 100%. 
78.64 vs 78.63 (it varies) for command: /global/projectb/sandbox/gaag/bbtools/prok/auto> time nice msa.sh ref=16S_consensus_sequence.fa in=16S_flipped.fa t=32 -Xmx8g Document refid for sendsketch and add it to comparesketch. Fix NCBI annotations of 16S. Fix NCBI orientation of 16S within the tool that pulls them from gffs, via alignment (make it multithreaded). *** TODO: Some loop-creating bubble poppings are banned. *****TODO: Fix destMap pointers when nodes are merged and so forth. TODO: All bubble-popping with midnodes shorter than 2k-1 via truncation. TODO: Inflate collapsed repeats in BubblePopper by duplicating a single fbranch/fbranch midnode. This requires linkage information for the endnode pairs. Does not need to be fully resolved as long as any pair of endnodes can be uniquely associated. Should be done before popping bubbles. TODO: Try a second pass of bubble popping. TODO: Ultrafast aligner. Could be BBMap with alignment disabled, or could use a new data structure... TODO: HashArrayHybridLong, associating a kmer with multiple longs. These can be used to encode scafnum in high bits and pos in low bits. TODO: LilyPad should not require mapped reads; alternatively, make a map of contig end kmers (eg. 31-mers) then scan reads and pairs for those kmers to make edges. TODO: Consensus should pad Ns on ref start and stop, then optionally truncate at the 50th percentile depth (for 16S). TODO: Potentially make 3 kinds of edges in Tadpole: A) built edges, B) captured (in a read) edges, C) and captured (in a pair, unknown length) edges. A and B could perhaps be lumped together. TODO: Lilypad could track overhanging reads to fill in gap Ns. TODO: Tadpole simple tandem resolution. TODO: ByteBuilder toText and append interface. TODO: CallVariants should track strand ratio and ignore strandedness if it is highly biased. TODO: On Amazon, Rob had to specify path= for BBMap as otherwise it tries to write to /ref/ which is not allowed. TODO: Flag to Swap stats N/L50. 
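The HashArrayHybridLong TODO above mentions encoding scafnum in the high bits and pos in the low bits of a long. A minimal sketch of that packing follows. The 40-bit position width and all names are assumptions chosen for illustration, and the sign bit is left unused so that encoded values stay non-negative.

```java
// Hypothetical packing of (scafnum, pos) into one long; bit widths are assumptions.
final class ScafPos {

    static final int POS_BITS = 40;                     // low bits hold the position
    static final long POS_MASK = (1L << POS_BITS) - 1;
    static final int MAX_SCAF = 1 << (63 - POS_BITS);   // high bits, sign bit unused

    /** Packs scafnum into the high bits and pos into the low bits. */
    static long encode(int scafnum, long pos) {
        if (scafnum < 0 || scafnum >= MAX_SCAF) {
            throw new IllegalArgumentException("scafnum out of range: " + scafnum);
        }
        if (pos < 0 || pos > POS_MASK) {
            throw new IllegalArgumentException("pos out of range: " + pos);
        }
        return ((long) scafnum << POS_BITS) | pos;
    }

    /** Recovers the scaffold number from the high bits. */
    static int scafnum(long code) {
        return (int) (code >>> POS_BITS);
    }

    /** Recovers the position from the low bits. */
    static long pos(long code) {
        return code & POS_MASK;
    }
}
```

A structure associating a kmer with multiple longs could then store one such code per occurrence. The split between the two fields would need tuning to the largest scaffold and the number of scaffolds actually expected.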
*TODO: Provide a column for ssu comparison results. TODO: SendSketch on a 16S sequence makes a sketch that is too small. TODO: Enable auto tax server access for all programs. TODO: Enable server access for taxonomy.sh. TODO: Change taxonomy.sh to use server-style header/name parsing. TODO: Add more checks when parsing for invalid parameters, e.g. k>31 for bbcms. TODO: Rename archaeal and bacterial per-genus downloads with taxID. TODO: Write test harness for Sketch. TODO: analyzegenes is singlethreaded with 1 input file. TODO: Cori memory autodetection detects too little memory. TODO: Default to less memory on Cori head nodes. TODO: fixvcf for left or right justify of indels, etc. TODO: Shred input? TODO: Java gzip decompressor does not seem to work with multipart files streamed from stdin, for example, RefSeq. TODO: Convert other multithreaded programs to the ThreadWaiter pattern. TODO: Integrate tax server into everything as a replacement for downloading accessions (?). TODO: Allow partial tax tree queries from the tax server. Or even complete tax tree downloading? That would be convenient... TODO: Put taxonomy update pipeline scripts in version control. TODO: Update taxonomy sub-scripts. TODO: Per-organism frame ACTG composition; may allow 2-pass better gene-calling. TODO: tRNA folding TODO: Way to bulk download and organize bacteria. TODO: Indel bit for alignments. If this bit gets set, there was an indel. Perfect for accelerating alignments using dual arrays with no traceback. TODO: Enable silent (and json) flag for more shellscripts. TODO: If JNI init fails, give a useful error message. TODO: Compile JNI for Windows. TODO: SamStreamer does not support ordered output. TODO: Look for inverted repeat around suspected PacBio adapters, if there is only 1 adapter. TODO: Generate histograms for IceCreamFinder - ratios, lengths, and subread count. TODO: Re-test increasing entropy cutoff for euk sketches. 
TODO: Download all bacterial assemblies and automatically select a representative set for gene-caller training. TODO: SendSketch depth estimation is too low (e.g. 0.5x data yields 0.435x estimate). Compensate for kmer depth based on number of bases and number of sequences. **TODO: SendSketch crashes and hangs with in=nonexistent file. TODO: Clarify comparesketch single vs perfile flag. TODO: Reduce impact of quality scores on BBMerge. TODO: BBSketch proportional-size mode (linear-size). TODO: Sketch callribo flag (instead of whitelist - only for proks? And only assemblies, not reads.) TODO: Trim terminal adapter sequence on PacBio reads prior to looking for inverted repeats and central adapters. Adjust read names as needed. TODO: Test improved RC aligner with reduced gap extension penalties? TODO: Redo ambig classification of 2-pass based on proximity of junction to the other read (inner terminus). TODO: Combine scores of adapter and inverted repeat. e.g., use a lower adapter threshold when inverted repeat is detected. TODO: Debug method of storing start loc in low bits of adapter aligner score, which does not seem to work properly. TODO: 2-array, JNI version of adapter aligner. TODO: Consider effect of only filling last querylength cells in BBMap with DEL penalty. TODO: Write a local aligner to refine the junction location of IceCream reads. TODO: Vary queue size TODO: Break alignment into a few columns TODO: More precisely detect junction by recording highest value in addition to last value TODO: Read-based quantification for BBSketch (basically, assign each read to a reference). TODO: Make Depth a better indicator of abundance. TODO: Make big fastas faster to sketch. TODO: /global/projectb/sandbox/rqc/syao/anaconda2/envs/aligners/bin/bgzip TODO: Include taxlevel as a key in D3 mode TODO: Custom fast banded aligner for PacBio triangle read detection. TODO: Examine align2.Block.allowSubprocess (for writing index). Also find similar flag for reading index. 
TODO: Sketch json format - option to print stuff in arrays. TODO: Base limit (as opposed to read limit) would be nice for sketching PacBio data. TODO: BBTools Mark Duplicates. ***TODO: Put bgzip support in all tool shellscripts. **TODO: Link to latest nt and restart server. *TODO: Add option for full taxonomy in sketch output (JSON). *TODO: Added order option for TaxServer (JSON). TODO: Benchmark DemuxByName with various buffering settings on a shuffled file. TODO: test bgz/gz size on clumped/non-clumped files TODO: Flag for generating bgzf indexes TODO: bgz compress/decompress speed/mem test at various levels with various numbers of threads TODO: Command-line zipthreads setting TODO: Ensure bgz is always preferred for temp files; add a FileFormat temp flag. TODO: Add FileFormat support to ReadWrite. TODO: bgzip BBMap index; allow compression level specification for BBMap index. TODO: Allow streaming refseq to concatenate lines under Xbp long TODO: Sortbyname barcode mode/comparator (can add it to obj field) TODO: Clumpify.main is not called by RQCFilter2 so statics are not caught and restored. Easy solution - have a second main function that returns the constructed object? TODO: Consider adding tax info to demuxer. TODO: Test bgzip decompression speed. TODO: DemuxByName2 should also split sam files by contig. **TODO: Flag to skip synth contam removal in RQCFilter. TODO: Test to ensure refactoring did not break anything. TODO: DemuxByName2 is way faster without output; not clear why. TODO: Consider splitting extra fields of VCFLines. TODO: BBMap kfilter float. TODO: Consider making unzip more efficient. TODO: Modify testfilesystem to test ls on a random directory. TODO: gi2taxid should accept wildcards (shrunk.*.accession2taxid.gz). **TODO: CallVariants3 - multisample variant-caller without writing any intermediate VCF files. *TODO: Write CallVariants usage example that merges independent samples across nodes. TODO: CallVariants2 without merging VCFs. 
TODO: Update shells with new flags. *TODO: Add meta path flag to TestFilesystem. *TODO: MergeVCF shell wrapper. Also improve it to use hashmaps rather than requiring identical input files. TODO: Var - add an optimal 63-bit hashcode for Var, using max scaffold length and num scaffolds for bit widths. TODO: FilterSam - add capability to run without VCF file, both via Bloom filter and calling variants internally. TODO: FilterSam - consider soft-clipping and ignoring soft-clipped areas. TODO: Error-correct indel-free sam reads with kmers (and ref provided)... as long as a read has only mismatches, it does not need realignment. ***TODO: Sketch multi-JSON fix (Adam R). TODO: BBDuk var-based filtering does not support indels or border like filtersam. TODO: Consider scanning the index for only long kmers, at least optionally. Or only indexing long kmers. TODO: CallVariants stranded and 32bit should default to auto and be enabled based on memory. TODO: Investigate variant-calling after removing trash reads. TODO: CompareVCF should be able to process and produce var files, or else, CompareVar should be written. TODO: In strandBiasScore, correct for bias of all reads in addition to just mapped reads, when mcov is being tracked. TODO: CallVariants first base is capital, later bases may be lower case, for deletion ref calls. TODO: Bed output for pileup? TODO: Tool to split multiallelic variants. Or, ignore multiallelic in comparevcf. TODO: size= flag does not work sometimes: comparesketch.sh in=mruber.fa.gz ref=protein blacklist=null index=f translate size=10000 TODO: .fa files should cause a warning when processed in amino mode (rather than translate). TODO: Mhist seems to only get 0.5 for an indel instead of 1. TODO: Adapter-trimming grading. TODO: BBDuk auto adapters. TODO: Test CallVariants speed with ssmf and raw sam; possibly reduce concurrent files. TODO: Replace Integer.parseInt with Tools.parseIntKMG. TODO: BBMap ecco (for adapter trimming). 
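The TODO about replacing Integer.parseInt with Tools.parseIntKMG concerns accepting values such as 10k or 2m wherever an integer flag is parsed. The sketch below shows the general idea with a hypothetical class. It is not the BBTools Tools class, and the decimal multipliers (k=1000, m=1000000, g=1000000000) are an assumption.

```java
// Hypothetical K/M/G-aware number parsing; not the BBTools Tools class.
final class KmgParser {

    /** Parses "42", "10k", "2M", or "3g" using decimal multipliers (an assumption). */
    static long parseLongKMG(String s) {
        String t = s.trim();
        if (t.isEmpty()) { throw new NumberFormatException("empty value"); }
        long mult = 1;
        char last = Character.toLowerCase(t.charAt(t.length() - 1));
        if (last == 'k') { mult = 1_000L; }
        else if (last == 'm') { mult = 1_000_000L; }
        else if (last == 'g') { mult = 1_000_000_000L; }
        if (mult > 1) { t = t.substring(0, t.length() - 1); }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(Long.parseLong(t), mult);
    }

    /** Same as parseLongKMG, but fails if the result does not fit in an int. */
    static int parseIntKMG(String s) {
        return Math.toIntExact(parseLongKMG(s));
    }

    public static void main(String[] args) {
        System.out.println(parseIntKMG("10k"));
    }
}
```

Fractional inputs such as 1.5k are rejected by this sketch with a NumberFormatException. Whether they should be accepted is a design choice the TODO does not settle.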
TODO: RandomReads default insert sizes should probably not have adapter sequences. TODO: RandomReads should add match string to reads. TO DELETE?: assemble toText methods, ErrorCorrect, countFastqSplit, TODO: Slow speed of singlethreaded sketching. Largely caused by entropy filtering. TODO: Consider 1-bit encoding for CallGenes, with 10-mers recording only AT vs GC or AC vs GT. TODO: Allow CallGenes to report gain over coding regions with a gff; or, write a new program to do that. TODO: Fix CompareGFF to be robust with multifastas. *TODO: SendSketch is not robust against size-0 sketches. On the server side, they just get reported as Error rather than 0 hits. *TODO: 2 or 3 connection threads for SendSketch. *TODO: SendSketch/CompareSketch only load sketches singlethreaded per file in persequence mode. TODO: Convert ReadStats formatting fully to ByteBuilder. TODO: Link Seal and BBSketch, and improve Seal per-file ID assignment or per-TaxID. *TODO: Collision-free, reversible hashcodes can be implemented. The reversibility requires masking the selection bits in each mask. May require single-kmer mode, particularly if k>31. NOTE: Entropy is disabled in Sketch amino acid mode; might be worth checking the entropy of common amino kmers. TODO: There may be no value in indexing protein sketches, due to high conservation. TODO: e-value - track range. TODO: more efficient blacklists and test blacklist efficiency. TODO: It is possible to count the average number of ref sketches sharing hit kmers. A lower average is more specific. TODO: Make sure sizemult flag works with servers. (checked, and it does) TODO: index=f did not work: comparesketch.sh in=c.fa.gz silva whitelist tree=auto index=f TODO: KmerCountExact can no longer output sketches correctly for the default dual kmer lengths. ***TODO: BBNorm does not like R#.fq notation in 2-pass mode. TODO: Call adjacent tRNAs. TODO: Add sixframes flag to Sketch for instances of frameshifts, e.g. in raw PacBio data. 
***TODO: Make sure commonAncestor works correctly for no-rank nodes (if it is important). ***TODO: Tag Sketch hash codes with the lowest bit to indicate whether they are from long or short kmers. Then calculate genome fraction as in notebook. ***TODO: Figure out better dual-kmer ANI estimate. For example, if 24-mers and 31-mers have similar KID, this implies that the differences are not randomly distributed and therefore the ANI is an overestimate. *TODO: Sizemult does not work with sendsketch local flag. *TODO: Default query sketch size of ribo sketch server seems to be too low. ***TODO: Figure out how to filter mito and chloro from RefSeq... TODO: tRNAs are often densely packed (30bp apart), but CallGenes will only call one of them. TODO: Amino Acid cardinality, and a handful of other BBDuk trivia... TODO: KmerfilterSet for amino? TODO: Check defaults for amino mode size, ANI, entropy, etc calculations in Sketch. TODO: HashArrayHybridFast could be ported to ukmer, but that would be a pain since HashArrayUHybrid is not currently used. ****TODO: Initial kmer set does not appear to work correctly after the first pass... or something. Be sure to clear it and retain the whole thing. It should get copied. TODO: Set operations of GFF files. TODO: Cut sequences from gff file (check if already exists). TODO: MakeKmerSet with a blacklist set (do kmer tables support set subtraction?) TODO: Rename MakeKmerSet. TODO: Better documentation for rqcfilter.sh for remote users. TODO: Modify removehuman and removecatdogmousehuman to use the rqcfilterdata flag. TODO: Require multi-hit in BBDuk (remove current set from table before picking new kmers?) TODO: Easy kmer set operations. TODO: Make 5S and tRNA datasets. TODO: Test long kmer sensitivity with various lengths; consider increasing to 16-mers. *****TODO: Test RNA-calling long kmer support. TODO: How to use rqcfilter.sh and removehuman.sh externally TODO: rRNA score need to be higher, possibly doubled, to compete with CDS. 
TODO: sketch.sh perfile, or comparesketch.sh ignore ata flag (just outsketch should be fine). TODO: maxcount is now a supported flag for kmercountexact, but it does not actually work. TODO: MDWalker breaks when there are reads with both N and D in cigar. TODO: retain longest isoform and all high-scoring isoforms, but not low-scoring ones. TODO: Try multiplicative model for start probs, not additive. TODO: Score operons via window, and add operon scores to orf scores. TODO: Once optimal path is chosen, refine it by adding and removing orfs. TODO: Apply minscore filter before and after choosing optimal path. TODO: Homopolymer density test. This may affect ease of sequencing and assembly. TODO: Correlation between Tadpole and Spades stats. TODO: Vasanth: /global/projectb/sandbox/rnaseq/projects/Golovinomyces_orontii_MGH1_Metatranscriptome_1196471/multimap shifter --image=bryce911/bbtools bbmap.sh nodisk=t nzo=f ambig=all deterministic=t maxindel=100000 ref=/global/dna/projectdirs/RD/rnaseq_store/genomes/Golovinomyces_orontii_MGH1/Golor3_AssemblyScaffolds.fasta rpkm=CTWOX_counts.txt in=read1.fq.gz in2=read2.fq.gz machineout=t out=stdout.sam statsfile=CTWOX_counts.txt.summary | shifter --image=rmonti/samtools samtools view -Sb - | shifter --image=rmonti/samtools samtools sort - -o CTWOX_hits.bam TODO: Entropyout for BBDuk (Alex Copeland email). TODO: Percolate A_SampleMT changes to other A_Samples. TODO: Note that FilterByTile can be run at a very small tile size with perhaps 50 reads each to detect bubbles at high resolution. This is more effective than trying to detect a high G rate in a large tile that extends out of the bubble. TODO: Verify that gton is working in FilterByTile. It does seem to reduce homopolymers in some cases, but in other cases it increases the rate by selectively discarding non-G reads while not fixing the remaining reads... perhaps? The overall rate of homopolymers and Gs does not change much. 
It may also be prudent to calc stdev from cycles rather than tile averages; it's currently way too small to be useful. TODO: Optical deduplication is pretty slow if there are a massive number of duplicates, though it seems to be linear with the size of the file. Not a priority. TODO: minLevelExtended flag would be more useful if it could be used to identify hits where there is a higher identity to a different clade than to the same clade. TODO: Add an option to trim depth-1 contig ends in Tadpole (probably just for dead ends). TODO: Add fofn support to FileFormat. *TODO: Seems like last element of a sketch has count lower than it should by 1 or more. TODO: Look at Silva names of removed things. Figure out how to deal with them. TODO: see how big a merged Silva sketch is. TODO: Report (from Donovan) of a site not reported when a pair maps perfectly to two locations with different insert sizes. TODO: Note: srd=2 seems to improve scores of metagenome assemblies despite removing fewer kmers. TODO: Currently impossible to set ScheduleMaker memRatio except through prefilterFraction flag. There could be a default static memratio flag, in ScheduleMaker, for example. TODO: Custom memory settings for Oracle versus Open JDK. TODO: Consider not shaving on a case-by-case basis after looking at extensions (use the passed branchMult2 and minCountExtend fields). TODO: BBMerge extend (rsem) mode with 2-bit bloom filter, or maybe even 1-bit. TODO: bloom filter with k>31. TODO: RQCFilter2 produce a file similar to status.log but column-delimited. TODO: Modify RQCFilter2 to do entropy filtering in a discrete step. TODO: Update RQCFilter2 reproduce.sh to reflect what actually happens. TODO: Retain read names in bbfakereads. *TODO: Sanity check for paired reads being under 1mbp. TODO: Sketch kmer frequency histograms. TODO: Test consect. TODO: Refactor BBMerge? TODO: Tadpole reassemble is creating new ByteBuilders instead of re-using them. TODO: Write custom Fastq parser. 
TODO: Test reading kmers, then locking, then writing, to prepopulate cache. TODO: Consider allowing HashBuffers to deposit kmers per way, and only hash them if the deposit is too big. Thus, each way would have 2 swappable LongLists and you'd need to sync on the way to swap them. TODO: Write a nonatomic Bloom filter. V37. 37.01 Fixed crash with Seal qhdist. 37.02 Added ReadComparatorRandom and shuffling support to SortByName. Compared trimming tile edges before removing duplicates. Added support for taxonomy headers in Silva or comma-delimited format. Added simple mode to PrintTaxonomy. Fixed a bug with stdout stream name detection. Added subsampling for CompareSketch/SendSketch. Improved distribution of sketch file sizes. Wrote MergeSam for concatenating sam files. Bam streaming from the bamscript is now done uncompressed. Added preliminary support for flex-size Sketches via LongList. Fixed an assertion error in Clumpify with consensus. Dedupe now uses pigz by default. Added output ordering to TexStreamWriter and CompareSketch. Changed the way names of uncultured organisms are parsed. Fixed a regex bug when setting tmpdir. Fixed a bug in RandomReads with a print statement. Added trimnonoverlapping flag to BBMerge to produce consensus sequence only. Clumpify should now automatically create extra groups when expected reads exceed 2 billion. Sketch now supports blacklists. 37.03 Bump. 37.04 Wrote SamFilter. Added sam positional filtering capabilities to SamStreamerWrapper and CallVariants. SamStreamer now optionally retains sam headers. Wrote Sketch guide. Added VCFLine filtering to SamFilter. Wrote FilterVCF and filtervcf.sh. Added max filters to variants (maxscore, etc). Added sam line mapq filters to SamFilter (SamStreamer and CallVariants). Removed some shellscript module loads for specific versions of samtools. Added quality-trimming to variant-calling. Reference alleles now always use uppercase letters. 37.05 TaxServer now prints initialization time and memory. 
Reduced Sketch memory usage for construction in taxa mode with prefilter. Sketch shellscripts now load pigz if not loaded. ReadWrite.readObject now correctly uses the allowSubprocess flag; TaxTree now loads much faster. Changed a couple array allocations in Sketch to use safe allocation. Split SketchHeap genome size into genomeSizeKmers and genomeSizeBases, to be more clear. Fixed some issues with TaxFilter; it was not working properly with default taxlevel. Added taxa sorting to SortByName. Fixed a bug with PrintTaxonomy accession=auto flag. Taxa parsing now supports tid as well as ncbi in sequence headers. RenameGiToNcbi now allows custom prefixes for the taxid number; default is tid. RenameGiToNcbi now supports accessions. Taxa sorting changed a bit. Promoting everything to direct descendants of the common ancestor did not work, so they are now promoted to the same level. Changed BBDuk.RQC_MAP to use Long values instead of Strings; it is now additive. Seal now uses the BBDuk RQC_MAP (for RQCFilter). Added spikein removal and mtst to RQCFilter. 37.06 Wrote ServerTools to house some functions from TaxServer. Shortened TaxServer functions by breaking off blocks into functions. Added comments to TaxServer. Added kill-old-instance flag to TaxServer. Added ability to print all sequence headers of empty Fasta sequences. 37.07 Sketch blacklist now supports comma-delimited lists of files. Refactored sketch code to unify location and parsing of shared fields such as k. Added capacity() method to SketchHeap. Implemented graduated sketch size via size=auto. Added a lower cutoff for hashcode values, to reduce blacklist size and increase speed. Codes are now checked against the heap prior to the blacklist, which is faster. BBDuk now supports entropy masking like BBMask, but uses less memory. CompareSketch now supports whitelisting. SketchHeap can now automatically apply the blacklist and whitelist. Fixed bloom filter crashing on unicode symbols in sequence. 
37.08 Fixed a bug in FastaReadInputStream with long headers containing multiple greater-than symbols. 37.09 Added SketchObject keyFraction flag and changed default to 0.2. Wrote SketchIndex to contain Sketch indexing methods. Changed Blacklist ways to 1. Changed Bloom filters to by default keep duplicate kmers within reads, and rewrote that method to be more efficient. Bloom filter can now apply Sketch hash function to exclude kmers. Made some IntList allocation methods safer. Deprecated SortByTaxa (functionality moved to SortByName). Wrote BlacklistMaker and sketchblacklist.sh. sketch.sh now sets -Xms. PrintTaxonomy now ignores the cellular organisms node. Added auto passes to BlacklistMaker. BlacklistMaker no longer requires a gi table. RenameGiToNcbi now handles lines that have already been renamed. Wrote ShrinkAccession. Accession loading will now default to trying a shrunk prefix, then revert to the normal filename. Removed calcmem.sh perl dependency outside of Genepool. Fixed a bug in sketch autosize. Changed sketch comparison sort order to use WKID. Sketch autosize now defaults to true. Added default blacklists. Wrote DisplayParams, which handles parsing of Sketch display parameters. Curl calls can now pass parameters to a sketch server. Tax server now returns an error message if no sketches are loaded. 37.10 Added utot to Reformat help. Sketch taxlevel now supports strings like species in addition to numbers. Disabled a print statement in CrisContainer. Removed some redundant static fields from SketchObject and CompareSketch to avoid duplication with DisplayOptions. Improved CompareSketch multithreading for small numbers of input sketches. 37.11 Changed the word reads to pairs in BBMap output header for pairing report. Moved KillSwitch from stream to shared. 37.12 Added Tools.contains(String a, String b, int start). Fixed an error in Clumpify allduplicates mode; the last in an odd-sized set of duplicates was not detected. Added Clumpify renamebycount mode. 
SendSketch no longer requires in= before filename. ShrinkAccession now discards lines with no TaxID. 37.14 Added chloro and mito removal to rqcfilter. Updated tax data and renamed Refseq Microbial records. Single-sketch mode sketches are now named after filename rather than sequence name. TaxServer now does gc() before killing the old server. Sketch Comparison genome size is now estimated from the sketch. 37.15 Sketch size now accepts kmg symbols. Added aliases for autosizefactor. printname0 has alias pn0 and defaults to false. Shared read buffer settings now use getters and setters. SendSketch and CompareSketch use a larger default read buffer length. Fixed an RQCFilter slowdown due to SendSketch changing read buffer length. Set RQCFilter final ziplevel to 9 from 8. Temporarily (?) set ScafMap.defaultScafMap from CallVariants, for ref-allele Var testing. 37.16 Fixed Seal clearing outu flag. Fixed RQCFilter misreporting number of input reads. Modified shellscripts to load samtools 1.4. 37.17 Added rqcfilter.sh to public distribution for Dockerization. Fixed clipping/trimming bug in CallVariants leading to incorrect variant calls. 37.18 Wrote standalone realigner using Realigner class. Wrote bbrealign.sh. Fixed a bug in sam output when loading rname as a String. Reads that would be fully quality-trimmed are no longer used for calling variations. Fixed a realigner bug in which length-1 reads had no cigar string. 37.19 Added zygosity histogram output to CallVariants. Wrote ProbShared for calculating the chance of two sequences sharing a kmer. 37.20 TaxServer URL parsing now correctly handles all reserved symbols encoded as percent codes, and many common non-reserved symbols. Added addunderscore flag to RenameReads. Added shrinknames flag to RenameGiToNcbi. RenameGiToNcbi now tests input files before loading taxonomy data. Changed sketch Comparison function to incorporate genomic kmer size. Changed removesmartbell to split reads by default rather than masking adapters. 
Sketch blacklist maker now correctly sets rcomp=f in amino mode. 37.21 Sketch blacklist maker yields oddly few keys in amino mode. Modified KmerCount7MTA to correctly observe the rcomp flag, in most cases. Modified KmerCount7MTA to support amino acids. Clumpify Parser.parse moved to end of block. Fixed a false warning for reads that were a multiple of fastareadlen. Added amino8, a reduced representation coding of amino acids. Improved sketch amino acid constants in postParse. Added amino and kmer length tags to sketch headers. This will require server restarts. Fixed a bug causing sketch.sh to write the file twice in single-sketch mode. Increased small sketch sizing for amino acids. Sketch size is now based on genome size estimate rather than genomic kmers. Suppressed writing of length-0 sketches. Fixed overflow when running over 2 billion comparisons. Added bbversion.sh. Sketch servers will now return an error message if incompatible settings were used for SendSketch. Added BBMap deterministic mode. Added SendSketch local mode. SendSketch local mode no longer loads a blacklist. Restarted servers with support for new flags. 37.22 DisplayParams now supports a reads flag. samplerate and maxreads removed from SketchObject statics list. Servers now report and continue in the case when there is an error while trying to kill the old server. UseSizeEstimate flag now fully enables or disables size estimate for both scoring and sketch-sizing. Slightly increased sketch size for large genomes over 200Mbp. 37.23 Fixed module load samtools line in bs.sh to reflect the current version of samtools. Added minlen flag to SortByName (for use with nt/sketch). Added cohort to TaxTree. Improved organization of TaxTree extended level names and synonyms. Wrote a script for fetching and sketching nt. BBDuk now prints an error message when invalid settings of ktrim are used. BBTools now crash rather than hang when quality-score autodetection fails. 
Clumpify now additionally sorts by lane and tile, making optical duplicate detection much faster for huge clumps. Added soft-clip trimming to BBDuk. Added pipelines directory with scripts for fetching and processing NCBI files. 37.24 Added support for alapy compression. Added getters and setters for private static errorState fields. Sketch kmer length field made optional, only if k!=31. Added optional sketch HASH_VERSION field. Partially addressed overlapping parameter names in SketchObject/DisplayParams. Added clump.KmerComparator2, X, and Y for axial sorting. Added axial sort to clump.Clump. Added Clumpify sortx and sorty flags to facilitate testing of axial sorting. Added additional XY sorting flags. Improved XY sorting and made it default to true for all optical duplicate removal. Clumpify now additionally sorts by sequence by default, yielding a slight compression improvement. Updated RQCFilter with new Clumpify flags (spany adjacent). Split Clumpify spantiles into spanx and spany. Added adjacent flag to Clumpify for removal of only optical duplicates on adjacent tiles (tile-edge duplicates). SendSketch now allows KMG for number of reads. CalcTrueQuality now supports the CallVariants prefilter flag. Fixed a bug in BBDuk r2 entropy masking. Added BBDuk poly-A trimming. 37.25 Fixed a bug in tax server processing headers with spaces. Added IMG support to tax server. Finished revised img sketch support on a per-IMG-id basis. Adapted SketchBlacklist for IMG. Tested memory consumption of nt and Silva servers and reduced -Xmx flag. Added kapatags.L40.fasta and blacklist_img_species_300.sketch to resources. Fixed a bug in which BBDuk was not applying the minlength cutoff when no trimming was performed. Removed ecc.sh. 37.26 Fixed an inequality when checking read length in BBMap. Fixed error reporting number of sketches loaded in all-to-all mode. Replaced int[] with CompareBuffer object. Added Sketch completeness calculation. Added Sketch contamination calculation. 
Revised Sketch KID calculation to compensate for variable sketch sizes. 37.27 Added ANI calculation and ANI flag. Added new flags to SendSketch and doubleheaders. Contamination can now be calculated without an index. Changed the way contamination was calculated. Fixed an indexing bug related to autosize mode. 37.28 Added printscore flag. Wrote AtomicBitSet. Sketch now uses AtomicBitSet for contam tracking, fixing a cache-coherency bug. Wrote RawBitSet. Sketch now uses one RawBitSet per thread to avoid atomic communication. Rewrote Sketch threading; multithreading is now possible with an index. Moved Parser to shared. Moved a few data structures to structures. BBDuk now allows user-specified poly-A length. Added Parser parallelsort flag. Reduced default sketch records to 20. Updated BBSketch guide. 37.29 Fixed a bug in mutate.sh when giving indel and sub rates using the 0-1 scale. Wrote IntHashSet from LongHashSet and added increment. Wrote IntHashMap. Integrated IntHashMap into SketchIndex and made it the default path (can be disabled with intmap=f flag). Wrote IntHashMapBinary to avoid modulo operation when hashing. Added processIMG.sh to pipelines directory. Added some new assertions and messages to Clumpify's FetchThread loop to diagnose an odd crash. TaxServer query count now ignores usage queries. 37.30 Increased the default number of Clumpify groups by 50% (with 2 fetch threads), and made it scale with the number of fetch threads. Sketch comparison raw fields can now be printed. Wrote SketchResults object, for managing comparison printing methods. Fixed a bug in displaying Sketch hits to ref sketches with no TaxID. Fixed a name0 display bug. Added flowcell and sequence modes to SortByName. 37.31 Wrote SplitSam6Way. Removed obsolete tryAllExtensions option from TextFile/ByteFile. Added histbefore flag to BBDuk, and option to generate histograms after processing. Added fname metadata to Sketch header. 
Changed Sketch results query formatting to include more metadata. 37.32 Added parsing for comment. Clumpify with groups>1 now works with paired fasta files, though interleaved fasta files need interleaved to be explicitly set. Wrote MultiBitSet and improved AbstractBitSet. Refactored comparison formatting into DisplayParams. TaxServer no longer dies when receiving an unexpected parameter. TaxServer no longer terminates when failing to kill an old instance. SendSketch now passes printRefDivisor and so forth. Added Unique, uContam, and noHit Sketch results columns. Added taxonomy-based Sketch results coloring. Added TaxTree.extendedToLevel for reverse lookup. 37.33 Added counters for tracking TaxServer statistics. Improved server help messages; added Sketch usage info. Added Tadpole extra flag and clarified the documentation. 37.34 Added some scripts to Pipelines. Wrote SummarizeSketchStats and summarizesketch.sh for making tables of multiple Sketch results files. Added optional hard-coded path flags for CompareSketch to use silva, img, nt, and refseq sketches. Verified that external queries are tracked properly, even though none have been recorded. SendSketch and CompareSketch default to colors=f if outputting to a file rather than stdout. Fixed issue of FungalRelease not writing an AGP file if contig output file was not specified. Fixed a bug when specifying a SendSketch address with a terminal slash. Updated DisplayParams to only send hk and hamino. Improved SendSketch handling of automatic blacklists. Added support for dual kmer lengths in Sketch. gSize calculation now supports dual kmers. ANI now supports dual kmers, but uses linear interpolation of an exponential function. Seems to work, though. Fixed CompareSketch all-to-all including self for contamination detection. Accelerated dual-kmer sketching by choosing a random hashcode rather than the larger one. 37.35 Dual kmers are now supported in TaxServer's error message for incompatible sketches. 
Fixed a display bug in LoadReads. Added quality-score binning detection to fastq file memory use estimation. Added lowcomplexity flag to fastq file memory use estimation. 37.36 Clarified TaxServer error message for incompatible settings. Added deleteinput flag for Reformat and Clumpify. Updated BBSketchGuide. 37.37 Fixed a read orientation bug in CalcTrueQuality when using a VCF file. Simplified some calls to short and long match string conversion. Added a variant-calling script to /pipelines. Fixed a null pointer exception in Sketch when using sam files. Investigated recalibration of R2. Turns out the graph just looks odd because of low-quality unmapped reads. BBDuk can now accept ref=phix or adapters or artifacts, and automatically locates the file in /resources. Read identity calculation was crashing with fixvariants (from a VCF file). Removed bbduk2.sh as deprecated; only BBDuk is maintained. 37.38 Adjusted Sketch hash function; cycleMask is now a constant. Made Sketch hashing variables private. Made Sketch hashCycles variable; speeds up shorter kmer lengths and makes k2 codes compatible with k codes of same length. SortByName now uses compression level 2 for temp files. RenameImg now also reports the number of files, sequences, bases, and TaxIDs used. IMG naming is now in the old NCBI style, e.g. tid|1234|img|56789 IMG header parsing methods and lookup table moved from TaxServer and ImgRecord2 to TaxTree. IMG header parsing is now automatic. Updated some descriptions in commonMicrobes filter directory. RQCFilter now by default queries nt, RefSeq, and Silva when Sketching. Wrote TestFilesystem and testfilesystem.sh to monitor filesystem performance. SendSketch now automatically sets k and k2 for nt, silva, and refseq. Changed RefSeq and nt sketch servers and scripts to k=31,24 (needs restart). Modified KmerCount7MTA increment routine slightly; it can now store hashed kmers. gi2taxid now runs in silva mode without requiring a gi or accession file. 
Altered BlacklistMaker to fix an issue of redundant hash codes. Fixed DisplayParams clone method. Fixed order of parsing imghq and setting the default img file. Fixed a bug in taxa coloring using parent instead of current node. Added dark purple to Colors. Taxa coloring now underlines records with the same color but different taxa compared to above. Updated SketchGuide to explain underlining. Added a second genome repeat content estimation method. Genome repeat content now considers one copy of a repeat to be non-repeat. For example, a genome with 1% duplicated would be considered 1% repeat instead of 2%. Added pipelines/assemblyPipeline.sh. Increased maximum samtools compression threads to 64. Clarified descriptions of outm and BBDuk kmer-matching modes. Revised Reformat trimrname handling to include all whitespace, and clarified description to include bam files. Restarted RefSeq and nt servers with k=31,24. CallVariants and FilterVCF can now enable/disable SNPs, insertions, deletions. ReadStats histogram lengths can now be adjusted with the maxhistlen flag. 37.39 CompareSketch now allows first parameter to be a file name without in=. Wrote LongHashMap and LongHeapMap. Refactored SketchHeap to support LongHashMap when minkeycount>1. SketchHeap can now be temporarily longer than the desired sketch length when minkeycount>1. 37.40 Added usage query tracking to TaxServer. Added correct sketch blacklists to public distribution. Fixed incorrect insert size with renamebyinsert flag in RandomReads when reads are longer than insert size. RQCFilter now resets Sketch statics prior to subsequent SendSketch runs. SketchObject minkKeyCount moved to DisplayParams. SketchObject minCount field replaced. DisplayParams.minCount renamed minHits. BlacklistMaker.minCount renamed to MinTaxCount. RQCFilter now uses minkeycount=2 for Silva. Changed SketchObject.size to targetSketchSize. TaxServer now makes a new SketchTool as needed when minKeyCount is different in local mode. 
Made some improvements to assemblyPipeline.sh. 37.41 Fixed a tiny bug in parsing Sketch single kmer lengths of under 31. 37.42 Updated BBSketch guide. 37.43 Changed default IMG path to the k=31,24 version. Renamed minID to minWKID. MutateGenome can now output a smaller genome fraction of the original genome. Fixed a missing newline in Sketch server help info. BBSketch now supports non-multiples-of-4 for k2. Revised assemblyPipeline.sh. Added assembleMito to /pipelines. Increased hashing speed by 4-8% by switching from 2D to 1D matrix. Increased Sketch max kmer length to 32. Enabled pn0 (printseqname) flag for query. Fixed CompareSketch ignoring read limit when loading input files; this was caused by parse order. 37.44 Reworded code description of maq to indicate it happens after trimming. Added mbq to BBDuk. Added Sketch ANI bisection, enabled by exactani flag. But it made the results less accurate at low ANI than linear interpolation. Fixed a bug in which old 2D matrix was sometimes used instead of 1D matrix. Discovered current K=31,24 server sketches were generated with a bug; regenerating. Updated alapy compression support; speed flags are now enabled. Updated TaxonomyGuide.txt. Added testPlatformQuality.sh. Updated callInsertions.sh. Updated assemblyPipeline.sh. Made a MapPacBio assertion error more explicit, for debugging. TaxServer no longer logs usage queries. 37.45 Clumpify spanx was controlling both spanx and spany due to a parse error; fixed. Added full range of delimiters to demuxbyname and clarified shellscript help. Added demuxbyname column mode (e.g. column=2 to demux by the 2nd column). Demuxbyname default compression level changed to 1 to cope with slow compression speed. Improved CompareSketch parsing of flags shared by Parser and DisplayParams. Added 3-column Sketch results. Restarted servers with new format support. 37.46 Fixed a null pointer exception in Sketch format 3. 37.47 Sketch now supports minANI flag. 
Added Sketch spid field and allowed spid and imgID to be set from SketchMaker command line. 37.48 taxid, imgid, spid, name, name0, and fname can now all be set or overridden from the command line of Sketch, CompareSketch, and SendSketch. Fixed a bug in assigning spid to query sketches in the TaxServer. Restarted servers again. 37.49 For clarity, taxname is now an alias of name for Sketch. Changed flags like useimgname to useImgAsName following feedback. Added invert flag to Reformat. Rewrote ReadStats addToIndelHistogram to increase speed and fix bugs. 37.50 Documented autosizefactor in sketch shellscripts. Updated BBSketchGuide.txt with information about sketch sizing. Wrote fetchSilva.sh. Improved commenting of many /pipelines/ scripts. Modified RenameIMG to handle dual IMG files. Fixed img name parsing when no taxID is present. Fixed a failure to increment in TaxTree.parseDelimitedNumber. Sketch Amino mode now autosets k2, and a message is suppressed. Fixed a bug with Sketch amino mode parsing (it is parsed in 3 places). Really deleted ecc.sh from public BBTools distribution. Started MakeContaminatedGenomes. 37.51 Sketch ref= flag can now accept # wildcard. BBDuk Poly-A trimming now occurs before entropy-masking. Documented BBDuk internal order of operations in BBDukGuide. Wrote MakeContaminatedGenomes and makecontaminatedgenomes.sh. Removed a couple references to nonexistent changelogs in shellscripts. Fixed a bug in ConcurrentReadInputStream.getReads (failure to call start). ImgID sketch results header now is padded by spaces. MDWalker can now handle cigar N symbol. 37.52 Fixed CompareSketch replacing original header filenames with sketch filenames. Fixed a bug in FilterByTile by forcing IntList initial size to at least 1. Added eoom (ExitOnOutOfMemoryException) shellscript support. Added shellscript parsing for degenerate terms like xmx= and ea. Added DisplayParams taxFilter field, and SketchResults autoremoval of nonpassing results. 
Added TaxTree.parseLevel and called it in many parsing routines. Made shellscript formatting slightly more standardized. Added some error checking to SendSketch; it now uses a nonzero exit code when the connection fails. Updated shell scripts with references to guides. Sketch now supports taxonomic filtering. Deleted an obsolete redundant guide. 37.53 Sketches now scale heap size with sizemult, and default heap size is doubled. Fixed Refseq server; it was using a whitelist. KillSwitch kill and print methods are now synchronized. Moved parse location of Sketch db names to Searcher. Searcher.refFiles is now a LinkedHashSet. Added Tools.isDigitOrSign and toString(Throwable). TaxServer now returns error messages from doubleheader parsing. TaxFilter now always adds specified nodes regardless of tax level, and stops promoting as soon as the target level is reached. Fixed some taxonomy server issues with tax filtering. Added IntHashSetList for creating concise sets. Added blacklists to Silva and RefSeq server invocations; they were missing. All shellscripts now load oracle-jdk/1.8_144_64bit on Genepool. Sketches now have a count array. Sketch reading and writing now supports the count array. Sketch spid parsing fixed. No more spurious warnings about missing blacklists when they are not being used. Accelerated Sketch writing by around 20% by debranching. 37.54 Changed TaxServer back to prior Java version because 8u144 is not installed on gpwebs. Changed default startSilvaServer.sh to old style since the silva keyword is conflated. Added Sketch minQual and minProb processing. RefSeq is now the default sketch server, since bigger references are more accurate. Added support for NCBI merged.dmp file in TaxTree (now mandatory). This necessitates a coordinated push since the format changed. TaxServer no longer crashes when there are missing TaxNodes. taxpath now works better with printtaxonomy.sh. 
Sketch unique and nohit counts are now calculated correctly when printcontam is disabled. BBDuk now correctly removes reads that fail maxlen even when no trimming is performed. TaxServer now correctly tracks external query counts through a proxy (at NERSC). 37.55 TaxServer now reports average and most recent query time. Making a Sketch from a Heap moved from SketchTool to SketchHeap. Sketch construction now adds counts when available. SketchMaker now parses display params. Fixed an array out of bounds in LongHeapMap. PrintDepth is now working! Swapped minProb and minQual in SketchObject; parsing was bugged. Added #-symbol support for dual fastq files in Sketch. Added contains(key) to LongHeapSet and LongHeapMap. SendSketch now loads fastq files multithreaded. This is up to 6x as fast though slightly less efficient. Reformat now can upsample via samplereadstarget/samplebasestarget. 37.56 SendSketch now does read validation in-thread and achieves up to 9x the speed of the singlethreaded version and better efficiency. CompareSketch had bufferlen cap removed when processing fastq. SketchMaker has a new fast path for onesketch of a single fastq file, and default bufferlen changed from 1 to 2 to better deal with short sequences. For fastq, speed was quadrupled. SketchMaker no longer prints an error message if there were no output sketches; instead, there is a warning. Sketch now allows internal merging of paired reads. RQCFilter defaults to merging reads and using minprob=0.75. Added Sketch arbitrary metadata tags. Added Sketch depth2 (repeat-compensated depth). Regenerated nt and RefSeq reference sketches with coverage information and restarted the servers. Added Sketch volume column. Added IntHashSetList.toPackedArray. SketchIndex now returns SketchResults with taxHits instead of a raw ArrayList. contam2 now appears to work. SketchMaker now obeys read limit. Sketch results are now sortable by depth and volume. 
RQCFilter now uses some additional sketch flags like volume. 37.57 Changed default printOptions content. Wrote MakePolymers. Added period flag to MutateGenome. Tested: Homopolymer blacklisting up to k=9 does not obviously improve sketch depth accuracy. calcmem.sh now supports RQCMEM override flag (in megabytes). BBSketch now supports intersection and printing sketch intersections. Wrote Sketch.invertKey. Note that this requires the reference. Fixed an issue of not including ref= with # flag in SketchSearcher loading. Fixed a bug stemming from a null return in SketchIndex when there are no matches. Fixed an infinite loop in Sketch comparebydepth and volume. Sketch score moved to a field to make sorting faster. Deleted BBMask_noSam.java 37.58 Added exception handlers for AssertionErrors in cris. Added nucleotide support to sketch files. Added Var.noPassDotGenotype. Wrote EntropyTracker. Modified BBDuk to use EntropyTracker. Modified BBMask to use EntropyTracker. Note that entropy calculation was slightly off prior to EntropyTracker. BBSketch now supports entropy filtering. BBMap now supports sambamba for bam input. 37.59 Increased memory for RefSeq sketches by 1g. Set default Sketch entropy filter to 0.66. Set default Sketch minprob to 0.0001, which is sufficient for 31bp at 74% (Q5.9). Added EntropyTracker fast, slow, and superslow modes. Added command-line flags for EntropyTracker speed, verify, and Sketch entropyK. out=/dev/null no longer prompts you to delete it in most cases. RQCFilter sketchminprob flag added, and default changed to 0.2 (95%; Q12.9). Fixed a bug in EntropyTracker ns calculation and added it to verify(). RQCFilter now suppresses error messages when SendSketch fails due to connectivity issues. Fully commented EntropyTracker. Suppressed a race-condition-induced error message from closing the input stream early in Reformat and BBDuk. 37.60 Brought back UnicodeToAscii and changed it slightly. 
Still does not work as intended, but may work in most cases. 37.61 Made some slight changes to EntropyTracker. Added ribomap flag to RQCFilter. Added default adapter sequences to RandomReads. Suppressed printing some unnecessary verbose stuff from CoveragePileup. Added kmersIn tracking to kmer counters. KmerCountExact now prints average depth. Added Tools.observedToActualCoverage(). This allows conversion of observed kmer counts to average kmer depth. BBMap now has printstats and printsettings flags to suppress verbose output. Revised observedToActualCoverage with a more precise estimate with a reverse curve-fit. Added observedToActualCoverage to BBNorm. Updated BBSketchGuide.txt with entropy, depth, and merging. 37.62 Average kmer quality is now tracked in Sketch and stored in the header. Fixed a place in SketchTool where genomeSequences was not being reset to 0 (should have no effect). Added synonyms for onesketch and so forth so that the prefix mode= is no longer required. CompareSketch can now use # notation for paired reads. Added unique2 and unique3 flags. Comparesketch now automatically generates an index if required by some columns. Wrote TaxFilter.reviseByBestEffort(file) to allow closest available ancestors as output. Added FilterByTaxa besteffort flag. Improved FilterByTaxa output formatting. TaxTree constructor became private. TaxTree gained a static sharedTree which is used by default. RQCFilter ribomap, chloromap, and mitomap automatically widen filter thresholds when nothing is found. RQCFilter disables chloromap when the organism is not a plant (Viridiplantae), unless no taxa is given. 37.63 Fixed a bug when using Sketch constructor to pass average kmer quality and restarted servers. Added anifromwkid flag to alternate between calculating ani from kid. Added minbases to filter results, ignoring small reference sequences. Added minsizeratio to filter results. Intended mainly for all-to-all comparisons. Added Strain and Substrain to TaxTree. 
Added RepresentativeSet and representative.sh for condensing sets of genomes by all-to-all ANI. 37.64 Fixed a bug in determining which levels to print in PrintTaxonomy. 37.65 TaxServer now caps sketch load threads at 2 for local files. Added numChildren, minParentLevelExtended, and maxChildLevelExtended fields to TaxNode. Added printChildren and printRange to taxonomy server URL parsing. 37.66 Changed tax server error response codes from 200 to 400. Rewrote tax server URL parsing to be more flexible; /tax/ is no longer needed (though /sketch/ is). Broke down server timing reports by local, remote, and usage. Added TaxTree.getChildren() using a hashtable. Depth and merge flags now work in sketch server local mode. TaxServer has now enabled multithreaded local fastq sketch loading and capped the threads at 4 instead of 2 by default. TaxServer handlers are now multithreaded, fixing poor response time when loading data in local mode. RQCFilter now adds the original filename and organism name (if known) to sketch query headers. RQCFilter now reports which microbes were used in filtering. 37.67 Fixed an extended/normal level bug when widening TaxFilter. Updated CoveragePileup (pileup.sh) to give a more detailed summary, and import scaffold names from the reference sequences (default true) or reads (default false). Fixed crash when SamTools version string contains letters. RQCFilter now gathers chloro, ribo, mito references for mapping at the species level by default, rather than order. This dramatically speeds up mapping, by 20x in some cases. Pileup now calculates kmer coverage. BBMap can now output coverage statistics with the cov flag even if there are no coverage files specified. Reformat can now calculate kmer statistics via the k flag. Reformat now ties loglog k to counting k. Setting loglogk now automatically enables loglog. Fixed order of the conditional last column (name0) in Sketch output. Sketch format 3 now prints qsize and rsize instead of size ratio. 
RepresentativeSet now expects potentially 5 columns, with qsize and rsize. Clarified an assertion error in Seal. Added taxonomic filtering to RepresentativeSet. RepresentativeSet now prints the size of genomes retained and discarded. Strain can now be assigned to children of subspecies. TaxServer now prints children for life node. JsonObject now ignores attempts to add null values, preventing TaxServer from crashing. Comparison.taxID() and imgID() now return -1 rather than 0 if the number is undefined. Tweaked RepresentativeSet sorting to favor larger genomes; yields a slightly smaller output. Added pJET and lambda to BBMap resources. remote_files now additionally lists cat, dog, mouse, and microbial references. Sketch format 3 now prints out query size in bases, to avoid including massive sets of E.coli all listed under the same taxID. Added DedupeProtein, via the amino flag in dedupe.sh. Fixed a bug in Dedupe in which sequences could subsume each other if both contained the other. This mainly happened when they were the same length but differed by substitutions. 37.68 Added Clumpify allowNs flag. Clumpify can now process containments and affixes. clumpify.sh no longer prints out the java version. Clumpify now supports a dupesubrate flag. Clarified some steps in variantPipeline.sh. TaxFilter can now parse organism names if a tree is loaded. 37.69 Renamed kapatags.L40.fasta to kapatags.L40.fa and pJET1.2.fasta to pJET1.2.fa. Added kapa support to RQCFilter. Added pjet, lambda, mtst, and kapa keywords to BBDuk. Added pjet, lambda, mtst, kapa, adapters, artifacts, and phix keywords to Seal to mirror BBDuk. Moved breakReads from Reformat to Tools. Wrote PreParser to allow output stream redirection. Converted most classes to using PreParser. Removed MakeCoverageHistogram. Deprecated NormAndCorrectWrapper. Generally got rid of printOptions(); help is in shellscripts, not code. This is handled in PreParser now. 
37.70 Tightened project error and warning levels for compilation; modified a large amount of code to comply. Deleted a redundant copy of KillSwitch. Deleted redundant copies of safe array allocators. 37.71 Eliminated hyphen-stripping, java flag parsing, and null flag replacement from PreParser classes. outstreams are now always closed in main, except in rare cases like TaxServer. Added outstream to a few classes like BBMerge. Moved some TaxServer parsing to ServerTools. 37.72 TaxServer no longer allows external file access by default. TaxServer logs ip addresses of malformed queries. Rewrote ServerTools.sendAndRecieve to be more robust. Changed URLConnection to HttpURLConnection to allow error stream access. Fixed a bug not displaying help in RemoveHuman. calcmem.sh now supports SLURM_MEM_PER_NODE. However this is only set when the --mem= flag is specified for job submission. Sketch metadata is now set in SketchMaker for per-taxa and per-sequence modes. Sketch results can now be filtered by optional metadata fields. 37.73 Re-added libbbtoolsjni.so, which had somehow been removed. Wrote DiskBench.java and diskbench.sh for comparing multithreaded I/O on local and networked disks. Added RQCFilter flags for Clumpify groups and tmpdir. 37.74 Sketch servers now log the first 3 lines of the body of malformed queries to help diagnose the problem. mouseCatDogHumanPath added to RQCFilter. Changed parse order of silva flag in TaxServer. Added RQCFilter dryrun flag. Split RQCFilter aggressive flag into aggressivehuman and aggressivemicrobe. Sketch servers no longer return error messages when query sketches are size 0. Fixed a parse bug allowing minkeycount to be 0 for sketch processing. Sketch k2 can now only be set via k. Sketch k2 can no longer be set to k. Enabled verbose output from SketchTool (for debugging). 37.75 Fixed AssemblyStats default outstream and printing Executing... message. 
37.76 Added Shared.threadLocalRandom() to produce a ThreadLocalRandom when supported, and otherwise a Random. Converted some programs to use Shared.threadLocalRandom(), but not BBNorm since it uses .nextLong(long). DiskBench is now much faster in generating random text. TestFilesystem now supports multiple sequential files and is probably generating correct data. ReadWrite can now getRawOutputStream for /dev/null/* and will remove the * portion. This is much faster than writing to /dev/shm/*. Removed an invalid assertion from RepresentativeSet. 37.77 Wrote ExplodeTree and explodetree.sh, to create a directory structure mirroring the tax tree. Rearranged parse order in A_SampleByteFile and A_SampleD. Added some convenience methods to TaxTree and TaxNode. Wrote LongLongHashMap. Added sequence path lookup to TaxServer. LongHashMap and LongLongHashMap no longer include invalid entries in toArray(). Wrote IntLongHashMap. Wrote TaxSize and taxsize.sh to generate the size of tax nodes. Added Silva header parsing to TaxServer. Added size lookup to TaxServer and created a RefSeq size file. ByteStreamWriter.print methods now return this, to allow chaining. Rewrote Read.validate() to be faster, simpler, and more modular. Read MIN_ and MAX_CALLED_QUALITY are now private, and generally replaced by a remapping array. Read validation no longer turns . - X to N by default. Fixed toSemicolon method in TaxTree. Increased TaxServer default memory to 52G in response to frequent GC during high query volume. ByteFile1 mode is no longer forced on Denovo or Cori. Added Parser.validateStdio() to ensure interleaving and file formats are specified when piping. Currently only enabled for BBDuk, BBMap, and Reformat. Added header and more columns to RepresentativeSet. 37.78 Updated citation guidelines. Added validatebranchless flag and code path. Improved validatebranchless to use binary instead of boolean or. Removed invalid sequence cre_lox_lib_yadapt1 from reference collections. 
Changed JsonObject handling of null values to be compliant. Added JsonObject handling for floating-point types. Added Json output for Sketch results. 37.79 RenameGiToNcbi now accepts multiple input files. TaxServer now handles favicon.ico requests. Modified SortByName to better handle large numbers of temp files with long sequences, by reducing buffers and adding a mem mult. Redid JsonObject to remove name field. Wrote JsonParser. Added stopcov option to Pileup. Fixed a bug with reporting invalid bases in Read. Regenerated RefSeq and nt sketches from the latest versions. 37.80 Fixed hidden compile errors. 37.81 Fixed a Json display error for duplicate names. Added Json parsing and printing support for escape characters and exponent numbers. Added Json parsing and printing support for arrays. Fixed a bug in ReadWrite failing to strip path correctly. 37.82 Fixed BBMap producing X8 (insert size) tag for improper pairs (on different contigs). Added an early test for BBMap invalid input files. Added a triple switch to shellscripts for genepool/cori/denovo. 37.83 Merged a branch. 37.84 Bump. 37.85 Removed an obsolete module name from shellscripts. Fixed BBMap bug in which files with uppercase letters were erroneously not found. Modified TetramerFrequencies to comply with stricter compilation rules. 37.86 Modified TetramerFrequencies to make k a variable. Changed TetramerFrequencies printing to use ByteBuilder. Wrote TestFormat and testformat2.sh. Undefined amino acids are now assigned X instead of . Fixed a race condition in ByteFile2 via a defensive copy. 37.87 Fixed a ByteBuilder overflow bug in append(long). Changed TetramerFrequencies to use ByteBuilder. Fixed missing else in CalcTrueQuality parser. Added a new switch case to shellscripts to handle Shifter environment variables on Cori/Denovo. Wrote multithreaded version of TestFormat. Added merge and trim to TestFormat. Better error message for ByteStreamWriter to read-only file. 
TestFormat no longer crashes when trying to write to a read-only directory. 37.88 SummarizeSketch now supports colors. Wrote CallVariants.findUniqueSubs to help locate bad NovaSeq reads. Added variant-based read filtering to BBDuk. Read.countSubs now supports shortmatch. Fixed Read.countMatchSymbols(). Fixed clearfilters flag not clearing SamFilter, only VarFilter. Var now parses depth, minusdepth, r1p, r2p, r1m, and r2m from VCF. Added AD field to primary fields of VCF output for ease of parsing. Wrote VcfLoader, a multithreaded VCF or var-format loader. 37.89 Wrote ByteBuilder.appendFast(double, int). Changed Var to perform calculations with doubles instead of floats. Fixed nondeterminism in RevisedAlleleFraction calculation. This was not due to the use of floats vs doubles so the doubles can be changed back. VCF/Var files are now written much faster, at around 55 MB/s up from 10 MB/s. ByteStreamWriter now supports multithreaded input. FileFormat now detects VCF and Var files. Added some information to Var header. Wrote VcfWriter class to write VCF/Var files multithreaded, at up to 630 MB/s. Wrote Tools.isDigit, isLetter, toUpperCase, etc. Character.isDigit is slow. ByteBuilder now implements CharSequence, allowing it to be used with TextStreamWriter. Changed several instances of StringBuilder and String.Format to ByteBuilder. 37.90 Multithreaded TetramerFrequencies. Fixed some printing errors. Multithreaded var2.MergeSamples. 37.91 Multithreaded FilterVCF. Poor speedup with vcfline.toVar, for reasons that are hard to diagnose. Fixed a bug in ScafMap.loadVcfHeader. Wrote Tools.parseDelimited. Var.fromVCF now optionally imports extended information. Added maxReads, minCov, and maxCov to VarFilter. Reordered VCF info fields for faster parsing. Added code to convince compiler some possible null pointer accesses were safe. Added ConcurrentReadInputStream.returnList(ListNum) with internal null check. Added an assertion to most parsing statements expecting a non-null b. 
Fixed several other potential null accesses. Made AccessionToTaxid/RenameGiToNcbi somewhat faster; running multiple concurrent unpigz processes makes it slow. Fixed taxpath setting failure in RenameGiToNcbi and other programs. Added G.species name format support for TaxServer and taxonomy in general. PreParser now supports printexecuting flag for command-line suppression of repeating the parameters. Wrote SuperLongList. Column needed for percent of library in sketch output, something like depth * genome size. TestFormat2 now works better with negative numbers for quality and broken quality scores. TestFormat2 supports additional fields like length mode and stddev. 37.92 Bump for Jenkins. 37.93 RQCFilter now defaults to auto for taxTreeFile. Fixed BBSplit crashing when parsing flags without an = symbol. Fixed some missing accession numbers in TaxServer. TaxServer now timestamps queries and displays the number of NotFound queries. 37.94 Found and replaced some instances of z2=Xmx with z2=Xms in shells. Reimplemented ByteFile.pushBack(line) to sidestep a NERSC slowdown in multithreaded java reading. Fixed VcfLine.type(). Wrote GffLine and vcf2gff.sh. Added CallVariants gff output. Fixed pairLength() and pairCount() swap. Fixed the way sambamba was being called. Re-tested bcftools 1.7 and BBMap 37.94. CallVariants is 14x more efficient and 180x faster. *It is now difficult to replicate the memory/timing bug in 37.94 with CompareSketch bf1, but partially replicates with bf2. TaxTree now checks for the auto keyword just before tree load. Moved TaxNode size tracking from TaxServer to TaxTree. Wrote SummarizeContamReport and summarizecontam.sh. Fixed an off-by-one error in Var to GFF translation. Added match generation from cigar, bases, and reference with no MDTags. Fixed bug in MDWalker for substitutions immediately after deletions. 37.95 Reformat is now able to generate match strings from a reference instead of an MDTag. 
Default SamStreamer threads increased to 6, to deal with match string generation from sam 1.3. ref added as a flag for various programs to enable MD-free sam line processing. Fixed an assertion preventing # replacement for BBMap input. Fixed handling of assertion errors during fastq quality encoding autodetection during initialization, for paired files in which file 2 has corrupted quality scores. Program now prints a warning instead of terminating when quality format is specified but it seems wrong, in at least one case. Failed an attempt to accelerate FASTQ.quadToRead. 37.96 Wrote FindJiCJunctions and processhi-c.sh for identifying and trimming junctions. Added formatting functions in Tools to handle printing reads and bases processed. Fixed a crash bug in CallVariants realign mode. Fixed missing sample names in CallVariants multisample mode. Fastawrap now supports kmg extensions. Fixed assemblers trying to get stats from stdout.fa. Fuse now allows length limits of fused output. Wrote preliminary junction detection for CallVariants. Made new RiboKmers files from Silva 132, and made a script for replicating the creation process (in /pipelines/). Wrote var2.SoftClipper. 37.97 Added FilterByTile to RQCFilter. Fixed a Clumpify crash-hang with low memory. Made a Clumpify KmerSort superclass to reduce code redundancy between KmerSort versions. Changed an exception handler in FastaReadInputStream to handle null-pointer exceptions as well. Wrote RQCFilter2, with dependencies in a single path set by the rqcfilterdata flag. 37.98 Fixed a bug in RQCFilter2 mousecatdoghuman mode with read-only files. Added ksplit to BBDuk. 37.99 Improved error message when processing sam files with no MD tag in Reformat. Possibly fixed a crash-hang during OutOfMemory exception handling in ConcurrentGenericReadInputStream. Merged DNA and RNA artifact files for RQCFilter2; modified the primary artifact files, and removed redundancies. 
Adapters are no longer present in Illumina.artifacts, only in adapters.fa. Nextera linkers are no longer present in Illumina.artifacts. PolyA is now a flag. Created a second RQCFilterData - RQCFilterData_Local, identical but with unmasked sequence names. Added ploidy flag to CallPeaks documentation. Added polyA.fa.gz to resources. Modified resources/sequencing_artifacts.fa.gz to remove adapter sequences and Nextera linkers. Changed Read constructors to ensure amino acid flag is passed correctly. Fixed an array length overflow in ByteBuilder. TODO: Counting Cuckoo filter as Bloom filter replacement. TODO: SamToRoc is broken. TODO: Rob example of NM tag not counting deletions adjacent to soft-clipping. TODO: Put new RiboKmers on gdrive or something. TODO: Fasta input, when auto-split, misses sequences of exactly the split length. Remove fasta auto-split. TODO: CallVariants line 861 and 883 assertion. TODO: Anomaly with soft-clipped reads at junctions for BBMap in local mode - analyze left-clipped reads. TODO: N rate reported by BBMap seems odd in local mode. TODO: Since move to floating-point trimming, Reformat does not speed up with bf2 when trimming. TODO: Move all packages with RQCFilter dep TODO: Change RQCFilter to use a config file for paths. TODO: BBMap: Scaffold ends have suspiciously low coverage, with few reads mapping that extend past the ends. TODO: Local mode does not increase mapping rate. TODO: Clumpify pick pivot from both reads in a pair. TODO: Move all hard-coded references to a single umbrella directory that can be tarred. *TODO: TaxServer Children does not work if there are no children. TODO: Column needed for percent of library in sketch output, something like depth * genome size. TODO: For some reason, java after 1.7 release 51 is slow in multithreaded reading gzipped files with pigz. Submit a bug report. TODO: TestFormat overrepresented kmers? Adapter sequence? Probable organisms? 
TODO: ExplodeTree does not, strictly speaking, NEED to have a directory-making phase... TODO: Add variable for BBNorm RAM to cell ratio modifier. TODO: Better document server update and restart. ******TODO: Stats N90/L90 is broken in some cases with huge genomes. TODO: Modify RepresentativeSet to allow Strings as keys, or somehow add numbers to sequence names. TODO: For mapping to ribo (and for filtering in general) try reducing BBMap's sites2 and test resulting speed. TODO: ab indicates failed connections when concurrently accessing help. TODO: Add taxa-level barriers to edges in RepresentativeSet. TODO: Document changes due to tax tree adjustments TODO: DBDate TaxServer field. TODO: Sketching nt is slow and has poor threading. Reason is unclear but may relate to huge numbers of tiny sequences sharing taxIDs. TODO: File sketching for duplicate detection on filesystem. ***TODO: Race condition causes spurious error message when reads= is set in some cases. E.g. reformat.sh in=P1.fastq.gz out=stdout.fa reads=5. But this cannot be replicated on Genepool. TODO: Sketch nucleotide encoding support? Would require sorting after hashing... TODO: Allow sketch number field in output. For this purpose sketch number would need to be deterministic. TODO: Try euk ribo blacklisting for depth accuracy. TODO: Some RefSeq TaxIDs such as M.Ruber are duplicated which messes up depth2. TODO: KmerCountExact and KCompress guides. TODO: Autosize becomes unnaturally small in conjunction with whitelist. TODO: Sketch should autodetect k from input sketches. TODO: Write a new program for per-file sketches. ***TODO: Incorrect dual kmer ANI is probably due to short kmer not being rcomp'ed. TODO: Figure out how to parse amino in only 1 place, and add amino8 DisplayParams support. TODO: For some reason DemuxByName is slow with compressed output files. They seem to be written serially. TODO: Circular realigner. TODO: Move more scripts to pipelines. TODO: Explanation on BBMap stats output. 
TODO: Multiple input file support for Reformat. TODO: CompareSketch should try to parse the name of input fasta sequences, and include it by default in the Query header. *Actually it does - maybe this is only for IMG? TODO: Sketch minID setting instead of/in addition to current minWKID setting. Needs server reboot and new distro, and DisplayParams changes. TODO: Add BBMap gff output for spliced genes? Or maybe from a sam file makes more sense. TODO: Move targetSketchSize to displayParams. TODO: Dedupe ignores minoverlap for merging. TODO: Shuffle is being slow and using a lot of memory. TODO: Clumpify consensus should warn/quit when used with paired reads or deduplication. TODO: Investigate "rescued" counts. R1 seems abnormally high with NovaSeq data. TODO: Relaunch sketch servers with support for hk and hamino in doubleheader. TODO: RQCFilter filterbytile. TODO: BBMap trd default for sam output TODO: Chongle request TODO: IMG/NR servers TODO: BBMap samtools in container TODO: More informative TaxServer message when file is not present in URL. TODO: Classify tax nodes under species as subspecies when possible. TODO: Modify consect to print list of errors fixed/not fixed. *TODO: BBMask should remove fully-masked sequences, and trim sequences after masking (on the right side, at least) to eliminate masked ends. TODO: Modify Clump.removeDuplicates_inner to make a consensus of duplicates. TODO: Add CallVariants SNP output format (?) V36. 36.00 BBDuk now prints the exact number of reads removed. Refactored BBDuk and Seal to consolidate some shared code. Removed more colorspace-related code and comments. Added renamebyinsert flag to RandomReads. Fixed a custom parsing bug in BBMerge; it was being skipped. Added outputonlyincorrect (ooi) flag to BBMerge. Added TAG_CUSTOM field to BBMerge, to allow annotating reads with a feature vector. Wrote ProcessBBMergeHeaders for converting feature vectors to tsv format. Added minsecondratio flag to BBMerge. 
Slightly increased default pfilter value of BBMerge. Added quality-trimming to Dedupe to deal with contigs that have leading or trailing Ns. Dedupe now defaults to ignoring kmers containing Ns to prevent extreme slowdown. Noted by Shijie. Added tossjunk flag to Tadpole. Requested by Matt N. Added tossdepth flag to Tadpole. Requested by Torbin N. Disabled BBMerge JNI mode until the new code is ported. 36.01 Added bf1 flag (and parseCommonStatic) support to Pileup. Temporarily defaulted bf1 flag to false until race condition is resolved. This does not manifest on Genepool, only external computers. Added rem/rsem flags to BBMerge, for extra stringency. Large improvement in false-positives. Added outd flag to Tadpole for discarded reads. Modified tadpole.sh to remove references to ine/oute since they are confusing and no longer matched the documentation. Modified order of table allocation and memory messages in AbstractKmerTable to better understand out-of-memory states. Reduced available memory estimation for kmer tables from 0.75 to 0.72 when Xms is unset to avoid overallocation. RQCFilter now specifies -Xms flag to prevent khist running out of memory. Noted by Seung-jin. 36.02 Disabled a debugging message and closed output stream for Tadpole tossjunk/outd flags. 36.03 Added support for RQCFilter microbial detection without removal. Stats are identical to removal. Changed Tadpole tossdepth flag; it now discards reads with depth at or below the setting, rather than just below. Fixed a bug with Tadpole2 tossdepth flag; it was not working properly with k>31. Sped up BBMerge in rsem mode. 36.04 Added adapter search to BBMerge. Changed efilter to allow a value of zero (perfect overlaps); -1 now disables it. Modified BBMerge ecct such that when used in conjunction with rem/rsem, it only activates when extension fails. Updated BBMerge guide. 36.05 Removed a condition that forced Tadpole2 to be used over Tadpole1. Increased BBMerge adapter detection sensitivity. 
36.06 BBMask now defaults to using all available memory instead of half. Fixed BBMerge read2 adapter output. Added poly-A removal to BBMerge adapter output, since that implies no signal. Added renamebyinsert flag to BBMap. Re-enabled bf2 mode on Genepool, where it does not appear to cause problems. Changed BBSplit default output format to fastq. Fixed parsecustom toggle deactivating in BBMerge. Changed naming format of renamebyinsert to omit the slash. 36.07 Wrote MutateGenome and mutate.sh to create genome clones with a specified percent identity. Added -Xms flag to bbmask.sh. 36.08 Added opfn flag to CrossBlock. 36.09 Wrote ParseCrossblockResults. Wrote SummarizeCrossblock and summarizecrossblock.sh. Analyzed Crossblock specificity - it appears that very little useful sequence is discarded, even with 20 same-species libraries. Added minprob flag to CalcUniqueness. Requested by Alex S. 36.10 Added support for unlimited kmer lengths to CrossBlock. SummarizeCrossblock can now complete despite missing files. FilterByName now supports whitespace trimming. Tax package can now handle missing IDs without crashing, if assertions are disabled. Requested by Jeff F. DemuxByName now looks for longer affixes first. This allows for a longest-affix match, so it will prioritize e.g. sample10 over sample1. Added keephumanreads flag to RQCFilter. Wrote removecatdogmousehuman.sh. 36.11 Added alignment-quality filtering for Reformat. This includes idfilter, subfilter, etc. Moved some functions for indel counting from BBMap to Read. Tested Crossblock specificity with a longer kmer of 48. Specificity improves substantially. 36.12 Validated CrossBlock's sensitivity up to K=62 with benchmark data. At K=75 some contamination slips through (0.04%). Wrote PartitionReads and partition.sh. Only supports round-robin currently. Added Tadpole lowdepthfraction, requirebothbad, and tossuncorrectable flags for filtering. 
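The longest-affix behavior noted for DemuxByName under 36.10 can be illustrated with a short Python sketch. This is not the DemuxByName code; the function names build_affix_index and match_affix are hypothetical, and the only idea taken from the entry is that longer names are tried before shorter ones.

```python
def build_affix_index(names):
    """Order candidate names longest-first (ties broken alphabetically)."""
    return sorted(set(names), key=lambda n: (-len(n), n))


def match_affix(header, names_longest_first, prefix_mode=True):
    """Return the first (and therefore longest) name that is a prefix
    (or suffix, when prefix_mode is False) of header, else None."""
    for name in names_longest_first:
        if prefix_mode and header.startswith(name):
            return name
        if not prefix_mode and header.endswith(name):
            return name
    return None


# Hypothetical sample names; "sample1" is itself a prefix of "sample10".
names = build_affix_index(["sample1", "sample10", "sample2"])
print(match_affix("sample10_read7", names))
print(match_affix("sample1_read7", names))
```

Trying the names in arbitrary order could assign a sample10 read to sample1, which is the ambiguity that checking longer affixes first avoids.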
Added Tadpole deadzone flag for controlling the amount of read tip left uncorrected. Split Tadpole aggressive error correction flag into aggressive (use aggressive parameters) and eccfull (use tail correction on the entire read). Rewrote SummarizeSealStats to use objects and calculate the total contamination rate. Wrote RenameAndMerge and muxbyname.sh, a standalone implementation of the mux phase of Crossblock. Multithreaded RenameAndMerge. It now scales with the number of input files. Changed RandomReads illuminanames to use space-num-colon rather than slash-num for pair number indicator. Changed RenameAndMerge to use space-num-colon also. Fixed LogLog not clearing Kmer objects between re-use for K>31. Modified BBMap to track last known numbers of mapped and unmapped reads in a static field. RQCFilter now prints additional messages to stderr and log indicating the number of reads removed or remaining after each stage. RQCFilter final log line changed to indicate what completed. Fixed array-out-of-bounds parsing single-digit gi numbers and taxids in TaxTree. Added correctfirst option to normAndCorrectWrapper. Changed BBDuk to print 0.00% instead of NaN% when there are no input reads. Changed RQCFilter stderr message format. Added insert size histogram to Reformat from sam/bam files. Changed RQCFilter pre-human qtrim from r to rl to match microbes. Note that the base count will be slightly off. 36.13 Added Tadpole conservative error correction flag. Added minhash sketch creation to KmerCountExact. Wrote MinHashSketch and comparesketch.sh to manipulate and compare sketches. Fixed CrossBlock making Tadpole default to assembly rather than correction. Multithreaded sketch creation. Added sketch delta compression. Added FileFormat support for int1d, long1d, and bitset extensions. Improved hexadecimal parsing. Added substring mode to demuxbyname; still needs improvement. Added filterbyname prefix mode; also needs improvement. 
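As background for the sketch delta compression mentioned under 36.13, here is a generic Python illustration of delta encoding a sorted list of hash codes. The changelog does not describe the actual on-disk scheme, so this shows the general technique only; delta_encode and delta_decode are hypothetical names, and the assumption that the values are sorted and non-negative is mine.

```python
def delta_encode(sorted_hashes):
    """Store each value as its difference from the previous one.
    Assumes the input is sorted ascending and non-negative."""
    previous = 0
    deltas = []
    for value in sorted_hashes:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


hashes = [1000000, 1000007, 1000050]  # made-up values for illustration
encoded = delta_encode(hashes)
print(encoded)
print(delta_decode(encoded) == hashes)
```

When the sorted values are close together, the differences are much smaller than the values themselves and therefore take fewer characters to write in a text encoding.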
*Paused work on Clumpify after modifying methods to allow return of multiple consensus reads from a single clump, in order to commit. BBNorm prints more informative error messages when crashing due to invalid characters. ConcurrentGenericReadOutputStream now exits successfully with an error state rather than crashing when closed prematurely (in some cases). *Temporarily disabled JNI mode for BandedAligner due to differences in results versus Java version. Increased max length of BBMap key buffers from 128 to 256 to prevent crashes when using very short kmers with vslow and long reads. 36.14 LogLog changed to allow first argument to be an input filename without in= flag. Fixed a reversed condition causing some streams to indicate they finished unsuccessfully. The output was in fact valid. Fixed an issue in which BBMap, with very short K and short reads, could throw an assertion error. 36.15 Made sketch package. Renamed MinHashSketch to SketchTool. Wrote Sketch. Wrote SketchMaker and sketch.sh. 36.18 BBMerge now checks input adapter sequences to ensure they are valid nucleotides. Capped BBMerge input adapter length at 21bp. 36.19 Fixed a couple of bugs in Sketch implementation. Added better Sketch name support in single-sketch mode. Sketches made by KmerCountExact now avoid duplicate hashcodes via LongHeapSet. BBMerge no longer loads terminal Ns in adapter sequences. BBMerge outadapter now trims trailing Ns. Added an early exit condition to BBMergeOverlapper. 36.20 Temporarily disabled MPI to allow easier compilation. 36.21 Fixed a condition in config file parsing. Added PrintTaxonomy support for sequence files, and generally improved robustness. Updated sketch package to handle IMG headers. Sketches can now be toggled between hex and ASCII-48 with the a48 flag. Sketch delta-compression is controlled with the delta flag. Sketches are now written in ASCII-48 by default. Sketch loading now uses a ConcurrentLinkedQueue to decouple the number of files and threads. 
36.22 Verified that Phix and E.coli K-12 do not share any 27-mers. Renamed BBMerge normalmode to flatmode. Fixed a few issues in BBMerge related to combinations of flags for error-correction, extension, and quality processing. Added BBMerge minapproxoverlap flag. Added RQCFilter removelambda flag. Moved a TaxTree construction wrapper into TaxTree class. Wrote CompareSketch, a more formal tool for comparing sketches. Wrote Comparison class for tracking comparisons. Added rollover capability to Heap. Added sketch genome size field. Made LongHeapSet subclass, SketchHeap with additional fields, including genome size. Added raw mode to Sketch output (instead of hex or ASCII-48). SketchMaker now names sketches by NCBI name if a TaxTree is loaded. CalcUniqueness now defaults to 25-mers instead of 20-mers. Sketch headers now contain an encoding field (CD). Sketch headers support 2-character abbreviations. Added multithreading and taxonomy support to CompareSketch. 36.23 FileFormat now supports sketch detection. 36.24 Removed a debugging assertion in BBMerge. Fixed ConcurrentHashMap use in Sketch package. Finished IMG taxonomy support for Sketch package. 36.25 Added Bisulfite adapters to adapters.fa. Added name0 (original name) field to sketches. Name is now for official taxonomic name. Fixed some locations where ByteFile2 was being used off of Genepool. Removed a debugging assertion in BBMerge that somehow did not get removed in 36.24. 36.26 Split RQCFilter kmer filtering into a short (k=19) and long (k=31) section, to increase accuracy. 36.27 Added N/L90 to stats output. Wrote TadpolePipe and tadpipe.sh, for optimal Tadpole assemblies. Revised TadpoleWrapper. Added many new options including early exit, search-space bisection, and search-space expansion. 36.28 Fixed Reformat ihist output; it was using reads classified as improper pairs. Fixed BBMerge bug, it was outputting extended reads with extend+ecco. Removed an unnecessary array copy from BBMerge. 
Added a new error-correction method to Tadpole (reassemble). BBMerge now passes read limit to Tadpole if a read limit is used. Added countErrors() to Tadpole to skip error-correction on error-free reads. Made ErrorTracker class for Tadpole to use instead of an array. Fixed a bug in error-tracking statistics. BBMerge now supports tail, pincer, and reassemble flags. BBMerge and Tadpole both default to reassemble for error correction, instead of pincer and tail. Changed Tadpole default deadzone to 0. Added Tadpole flags controlling reassemble window limits. Removed mpi package from all versions. 36.29 Added -Xms support to bbmerge.sh. Fixed a bug with extra flag in bbmerge. Disabled read validation and increased buffers for LogLog speedup. Added LogLog non-atomic mode support, but speed difference is minor. Added Tadpole rollback capability for reads that cause problems during error correction. Increased Tadpole ability to avoid correcting indels. 36.30 KmerTable regenerate now removes kmers less than or equal to a specified limit. Tadpole now removes all low-depth kmers missed by the prefilter. Added quality-related window flags for Tadpole. Fixed some incorrect assertions regarding degenerate bases in BBNorm. Revised Tadpole error-correction defaults to be more aggressive. Included quality value in Tadpole error detection. Accelerated Tadpole reassemble count regeneration. Accelerated Tadpole initial error correction count filling, leading to a 50% speedup. Added a missing consensus flag to Clumpify. readlength.sh now defaults to nzo=t. Added a default adapter list to BBMerge, usable with the adapters=default flag. Added probabilityErrorFree() to Read class. Added CalcUniqueness columns for average quality and probability of error free reads. Added fixspikes flag to CalcUniqueness to enforce monotonicity. Wrote AnalyzeFlowCell and filterbytile.sh to remove low-quality reads using positional information. Added minprob flag to LogLog. 
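Note on probabilityErrorFree(): a minimal sketch of how such a value could be computed, assuming the standard Phred relation P(error) = 10^(-Q/10) and independent per-base errors. The class and method names below are hypothetical and are not the actual Read class; the real implementation may treat low scores or no-calls differently.

```java
/** Hypothetical helper for illustration; not the BBTools Read class. */
final class QualityMath {

    /** Standard Phred relation: P(error) = 10^(-Q/10). */
    static double errorProbability(int q) {
        return Math.pow(10.0, -q / 10.0);
    }

    /**
     * Probability that every base in a read is correct, assuming
     * (as a simplification) that base errors are independent.
     */
    static double probabilityErrorFree(byte[] quals) {
        double p = 1.0;
        for (byte q : quals) {
            p *= 1.0 - errorProbability(q);
        }
        return p;
    }

    public static void main(String[] args) {
        byte[] quals = {30, 30, 20, 10};
        System.out.println(probabilityErrorFree(quals));
    }
}
```

Under these assumptions a single Q10 base caps the whole read at 0.9, which is why a minprob-style threshold tends to reject reads with even a few low-quality bases.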
Fixed a bug in FilterByName in which the parse order prevented minlen from being applied. Added random mode to Shred. Requested by Shijie. Fixed pileup ignoring nzo flag. 36.31 Clumpify now uses low compression for creation of temp files. KmerSort now uses all threads for compression instead of half. Clumpify now supports ecco flag. Added qfout support to BBDuk, but only for primary output stream. Added oneline format to FileFormat. Added oneline output support to most tools using the .oneline extension. Added bisulfite flag to RQCFilter. Modified TadPipe to use adapters=default with BBMerge. Modified TadPipe to use Clumpify for speed. 36.32 Tools other than BBMerge using the ecco flag now default to using a static adapter list. Added contamination references to public distribution. Fixed issue with Ns in error-correction mode in BBNorm. Added trimming option to filterbytile. Sketch now uses ByteStreamWriter instead of TextStreamWriter. Fixed a bug in reporting the number of sketches created by SketchMaker. Changed hash function of Clumpify and Sketch. Clumpify now reports the number of clumps. Interestingly, unhashed kmers result in fewer clumps and better compression. 36.33 Finished Clumpify pivot-split function. Added Clump.toStringStaggered(). Improved Clumpify consensus generation. Fixed a crash for Tadpole error correcting fasta reads. Changed Clumpify comparator to reverse the order of sorted clumps. Implemented preliminary error correction in Clumpify. It works well but requires multiple passes. Added Clumpify bloom filter and fixed mincount functionality. Added Clumpify minprob for pivots. Does not help much. Added Clumpify border for restricting pivots to near the middle of reads. Very slight increase in clumps, decent increase in error correction. Tested and noted that 0 hashes increases correction efficiency in the first pass. Accelerated in-memory multipass error-correction. Rewrote paired read name test to be faster. 
Added unpair and repair options to Clumpify. Repair currently only works with 1 group. Fixed a bug in KCountArray hashing when threads=1. Added Clumpify multipass depth-filter regeneration. Clumpify can now retain pairing and clumping for free in single-group mode, with unpair and repair flags. Wrote SortByName and sortbyname.sh, an out-of-memory sort program. Wrote MergeSorted as a wrapper to merge multiple name-sorted files into a single name-sorted file. Wrote CrisContainer. Clumpify now has all functions enabled (such as restoring read pairing) when groups>1, but single-group is faster. Clumpify will now autodetect the number of groups needed based on the input file size. 36.34 Added parallel sort option to Clumpify. This uses Reflection to verify whether Arrays has a parallel sort method. Fixed a bug with Clumpify assigning 0 quality scores to corrected bases. Parallelized clump formation. clumpify.sh now loads Java 1.8 for parallel sorting. Moved Clumpify read verification to worker threads for faster loading. Wrote Splitter for Clump-splitting methods. Updated Genepool build scripts to use Java 8. Updated Genepool shellscripts to load Java 8. 36.35 Added some new parallel sort wrappers to Shared. 36.36 Fixed granularity in gc plot downsampling. Changed pileup coverage variable column order. Requested by Jasmine. Fixed some issues in TadPipe, and added error correction to the Clumpify phase. Added several flags to Clumpify such as minci and minqi. 36.37 Fixed a sort bug. 36.38 Fixed a clumpify pivot bug. 36.39-36.44 Refactored and accelerated Clumpify. 36.45 Added sort package for holding sort programs and comparators. Added verbose2 flag to SortByName for better verbosity control. Wrote KmerSort3, which does fetching asynchronously in multiple threads. Made SortByName slightly more efficient. Fixed a bug in which SortByName was resetting interleaved status when merging. Fixed a bug in interleaved testing. 
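Note on the Reflection check from 36.34: the idea of probing for a parallel sort method at runtime, so the same code still runs on a Java 7 JVM, might look roughly like the sketch below. The class name is hypothetical and this is not the actual Clumpify or Shared code; only Arrays.parallelSort and Arrays.sort are real library methods.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

/** Hypothetical illustration of a runtime probe for Arrays.parallelSort. */
final class ParallelSortProbe {

    /** Returns Arrays.parallelSort(int[]) if this JVM provides it, else null. */
    static Method findParallelSort() {
        try {
            return Arrays.class.getMethod("parallelSort", int[].class);
        } catch (NoSuchMethodException e) {
            return null; // pre-Java 8 runtime
        }
    }

    /** Sorts in place, preferring the parallel method when it exists. */
    static void sort(int[] data) {
        Method m = findParallelSort();
        if (m == null) {
            Arrays.sort(data);
            return;
        }
        try {
            // Cast to Object so the array is passed as one argument, not as varargs.
            m.invoke(null, (Object) data);
        } catch (ReflectiveOperationException e) {
            Arrays.sort(data); // fall back if the reflective call fails
        }
    }
}
```

Looking the method up by name avoids a compile-time dependency on Java 8, at the cost of a reflective call per sort.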
Fixed bug in Pileup: Bases on zero-depth contigs were being excluded from histogram. Made BBMerge mix flag nonstatic. 36.48 Fixed a bug in randomreads; it was not parsing the out flag. Improved FilterByTile read name parsing to support more Illumina software versions. 36.49 RQCFilter now writes intermediate files in ASCII-33. Read.failsBarcode should now work properly for old Illumina reads with no barcodes. Added Clumpify minratioqmult flag and quality-sensitive error ratio formula. Clumpify error correction uses a more conservative minidentity on the first pass. Fixed an RQCFilter bug in which a log file (KmerStats2) was not being generated. 36.50 Clumpify now stores per-read pivot information in new ReadKey class instead of a long[]. Clumpify sorting now takes strand into account, which slightly increases compression (~1% for paired reads). BBTools no longer use multithreaded sorting when threads are set to 1. Clumps are now hashable and comparable. Clumpify can now temporarily merge reads for pivot calculation with the flag mergefirst. This increases compression slightly but gives odd results for correction. Wrote SummarizeQuast for combining Quast reports into a box plot. Added BBMerge option to not change quality scores. Clumpify now uses more groups by default, to prevent running out of memory with highly unbalanced groups. 36.51 Added ordered flag to RQCFilter. BBMerge now tracks and displays the number of errors corrected in ecco mode. Tweaked clump.Splitter to better handle polymorphism, reducing chimeric corrections. KmerComparator now resolves ties using read name. Wrote Consect, a tool for making a consensus of multiple error-correction tools. Wrote consect.sh. Added Clumpify allele correlation calculator to Splitter. Finished Clumpify biallelic split function. Clumpify can now sort clumps for increased compression with the resort flag. Clumpify can also sort clumps of paired reads for even more compression with the resortpaired flag. 
Added support for pigz level 11 compression. Clumpify now accepts the flag changequality (cq) and defaults to false. RQCFilter now uses zl6 for intermediate and chaff files and zl8 at the end. Removed obsolete code for tracking of validReadsWritten and validBasesWritten. Wrote LoadReads and loadreads.sh to test the predicted and actual memory usage of compressed files. Revised some constants to improve memory usage prediction. Improved KillSwitch memory-kill functionality, and added more protected memory allocation points in read streams. Clumpify now takes into account per-read and per-Clump overhead when predicting memory usage. All calls to Arrays.copyOfRange now go through KillSwitch. Many calls to Arrays.copyOf now go through KillSwitch. 36.52 Added Reformat input file tests. Reformat now tests qual files as well. Wrote StreamToOutput; when a Clumpify group is too big, it can stream the group to the output file(s) with no processing. KmerSort3 now tests group size and streams overly-large groups to output. KmerSplit and KmerSort3 now track memory bytes read to help estimate memory requirements. Calls to Sort are now wrapped by KillSwitch because they use additional memory. Fixed a bug in which SortByName did not work for external sorts of empty files. Fixed a bug in quality-format detection in which N with quality 2 was not correctly being flagged as ASCII-64. Improved some quality-format warning messages. Clumpify now parses quality flags directly rather than passing them down. KmerCountExact and Tadpole now support a GC column in the kmer frequency histogram. 36.53 Clumpify error-correction is now done in conservative mode for the first half of passes. Fixed a bug with indels in CalcTrueQuality. 36.54 Fixed MDTag parsing, which was completely broken with indels. 36.55 Wrote new half-open Var class for masking heterozygous variants when recalibrating quality scores. Wrote CallVariants for use with the new Var class. 
Added ByteBuilder.append(ByteBuilder), which was missing. Wrote SamStreamer, which doubles the speed of reading sam files. Wrote SamLineStreamer for streaming SamLines rather than Reads. CallVariants now calculates coverage. Added SamLineStreamer support to Pileup, which doubles the speed. Added SamStreamer support to CalcTrueQuality, which increases the speed moderately. Added SamLine.PARSE_OPTIONAL_MD_ONLY flag to accelerate parsing of sam lines. SamLineStreamer and SamReadStreamer are now subclasses of SamStreamer. 36.56 Fixed failure to remove bam processes from the process table. Added ByteFile.pushBack() to make it easier to process sam headers in a separate function. Disabled some assertions in FastaReadInputStream that fired in race conditions. 36.57 Cleared some additional static fields after BBMap/BBSplit termination. These interfered with RQCFilter. Fixed some invalid assertions and masked an exception in FastaReadInputStream; these are harmless and due to a known race condition. 36.58 Slightly improved Clumpify allele-pair selection by factoring in distance. Fixed a crash due to unexpected whitespace in RQCFilter. Added coverage target and metagenome mode to RandomReads. 36.59 Adjusted RandomReads coverage flag for paired reads. 36.60 Fixed BBMap issue in which secondary alignments of read 2 would not have their names changed to match read 1. Added addcolon flag to Reformat and RandomReads, to add 1: and 2: to read names. SplitPairsAndSingles (repair.sh) now identifies pair numbers from sam files. Moved Shuffle to sort package. Added ReadQualityComparator. Added setAscending to some read comparators. SortByName can now sort ascending or descending, by name, length, or quality. Rewrote the calctruequality.sh usage information. Added qap matrix to CalcTrueQuality. Works pretty well for 1-pass. Moved quality score quantization to Quantizer. Changed Quantizer to never assign 0 to non-zero scores. Added quantization option to Clumpify. 
Changed the default quant matrix to match current NextSeq bins. Added slash option for Quantizer; e.g. quantize=/2 will quantize to even numbers only. Fixed a bug in RandomReads - errors added from quality scores were biased toward certain bases and had a 33% chance of remaining as the original base. Added sticky quality score quantization. Slightly increases compression. Wrote Var functions for quality scores based on various metrics. Added CallVariants variant filtering flags. Increased default number of compression threads to all available threads. Q0 and Q1 error probabilities are now assigned fixed values of 0.75 and 0.7. Quality score position histogram now better reflects no-calls. Variant-calling is now integrated into CalcTrueQuality (for substitutions). BBDuk can now accept a varfile and ignore those variants when making quality-related histograms from sam files. CalcTrueQuality can now use a varfile also. Increased default filter thresholds for CallVariants. 36.61 Added VCF input support. Added VCF output support. Wrote ScafMap for storing scaffold information. Wrote VarMap for concurrent variation processing. Wrote VarFilter for variant filtering. Accelerated variant calling and filtering. Made coverage a mandatory column for variants. Ploidy and pairingRate are now required header lines for variant files. Modified Clumpify to support sam headers. However, it is unusable because Clumpify needs the obj field. Added positional sorting and sam support to SortByName. May not correspond to sam recommendations for pairs. Needs to alter sam header to indicate sorted. ScafMap and CallVariants can now load a fasta reference. VarMap variant filtering is now multithreaded. Added identity tracking to Var. Improved functionality and correctness of Read.identitySkewed. Fixed a parse error in TaxFilter. 36.62 Deleted CalcTrueQuality_single. Eliminated a difference between sam parsing via CallVariants and CalcTrueQuality. 
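Note on the Quantizer slash option from 36.60: a rough sketch of what quantize=/2 could mean, combined with the rule that non-zero scores never become 0. The rounding direction (down to the nearest multiple) and the class name are assumptions made for illustration; the changelog does not say how the real Quantizer rounds.

```java
/** Hypothetical illustration of divisor-based quality quantization. */
final class SlashQuantizer {

    /**
     * Maps q to a multiple of divisor. Assumes rounding down, and assumes
     * that a non-zero score which would round to 0 is lifted to the
     * smallest non-zero bin instead.
     */
    static int quantize(int q, int divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        if (q <= 0) {
            return 0;
        }
        int binned = (q / divisor) * divisor; // integer division rounds down
        return binned == 0 ? divisor : binned;
    }

    public static void main(String[] args) {
        for (int q = 0; q <= 8; q++) {
            System.out.println(q + " -> " + quantize(q, 2));
        }
    }
}
```

With divisor 2 every output is even, and Q1 maps to 2 rather than 0, so a called base is never reported as a no-call quality.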
Increased CalcTrueQuality default memory allocation to handle variant calling. Implemented trimming for mapped reads with match strings, for CallVariants. CallVariants now has a border flag, default 5. Reformat can now trim sam files, though the optional fields may become incorrect. 36.63 Moved sam streamers to stream package. Wrote SamStreamerWrapper for fast sam -> fastq conversion. Fixed compression level getting reset in RQCFilter. Changed max compression threads for ziplevel 6 from 16 to 24. Fixed RQCFilter failure to parse null. Added clumpify option to RQCFilter. Changed default length sort to descending (except for SortByName which defaults to ascending). Reformat can now change cigar strings to 1.4 format. Improved Reformat sam=1.3 converter; it was allowing adjacent M operations. Added pairedonly and unpairedonly flags to Reformat. Fixed pairing count in CallVariants. Added support for pigz blocksize and iterations parameters. 36.64 Added qdhist and qfhist as aliases for qchist. Added an optional bloom filter to CallVariants. 36.65 Added a CallVariants scoring function for substitutions in homopolymers. Variant quality score is now further penalized for being below average base quality. Added some columns to variant output files. 36.66 Fixed a bug in Clumpify noted by WDC: when called (non-N) bases had quality scores of 0, Clumpify with the reorder option failed an assertion. Added bzip2 and pbzip2 toggles for enabling/disabling those subprocesses. 36.67 Verified that bzip2 works correctly for either bzip2 or pbzip2. bzip2 now always uses compression level 9, since lower is not faster. Added allowziplevelchange flag. Increased default number of pigz threads allowed per compression level. Bzip compression now defaults to threads, not threads-1. Fixed a bug in SamStreamer; it did not work for long headers. Clumpify reorder was renaming reads improperly if reads were single-ended. 36.68 Fixed an issue in which FilterByTaxa was capping RQCFilter buffers at 4. 
Fixed var file header bug. 36.69 Added ordered flag to Dedupe and Dedupe2. Made Colors class for colored text. Made changes to VCF output to increase compatibility. RQCFilter now pulls output read and bases numbers from BBMerge, the final step. Improved Dedupe sorting. 36.70 Modified BBMap bs option to work with both samtools 1.x and 0.x. Fixed a crash in Dedupe sorting. Possibly a Java 1.8 bug since it is not clear what the problem is. Wrote Realigner for realigning reads during variant calling. Added fqz support to BBTools. Added delimiter support to DemuxByName. DemuxByName now disables pigz if the number of output streams grows beyond 8. Added soft-clipping to realigner when it goes out of bounds. Added optional unclipping as well. Fixed Read.calcMatchLength, which was incorrect for reads with Y match symbols. Paired read percent is now tracked independently of proper pair percent. 36.72 Added TaxServer class, with help from Shijie. Added JsonObject for Json formatting. 36.73 Made improvements to TaxServer. Upgraded tax server to support accessions. Fixed a bug in BBMap/MapPacBio related to Y symbols during realignment. Started CallVariants2, capable of multisample pileup. 36.74 Finished CallVariants2. 36.75 Added input file list support to CallVariants and CallVariants2. Removed a print statement from TaxServer. Added trimrname flag to Reformat. Added covpenalty and rarity flags to CallVariants. 36.76 Added CallVariants score and pass/fail per sample. 36.77 Wrote CallVariants guide. Increased speed of samtools processing mostly unmapped bam files with CallVariants. Fixed a bug in ScafMap.getScaffold handling of whitespace. Assorted changes to CallVariants handling of multiple sample names. 36.78 Sketch now works with nt if the prefilter flag is used. 36.79 Added Clump.removeDuplicates. Wrote class hiseq.FlowcellCoordinate. Added KmerComparator compareSequence and compareQuality. Added Clumpify deduplication modes. 
Inter-cluster distance calculation may now span tiles. Added keepsingletons flag to DedupeByMapping. Added multipass support to Clumpify deduplication. Wrote class MultiLogLog. Wrote class MultiKmerCounter. Wrote multiloglog.sh. Fixed some text in loglog.sh. Changed Read.expectedErrors to allow reads with incorrect quality values. Made a sketch package superclass, SketchObject, for static fields. Wrote SketchMakerMini for lightweight sketch generation. CompareSketch can now accept fasta files. Removed duplicate version strings from BBWrap output. Changed RenameReads pair identifier from /1 to 1:. Fixed logic causing DedupeByMapping to not detect reverse-complementary duplicates in ipo mode. Added flowcell coordinate filtering to BBDuk. Created package shared and moved some utility classes there. 36.80 CallVariants rarity flag will now automatically reduce minAlleleFraction if rarity is lower. Fixed a bug in CallVariants multisample mode; samples missing a variant would get assigned the sum of all samples rather than 0. Added minallelefraction to VCF header. 36.81 Fixed some VCF header issues. Changed VCF QUAL column to 2 decimal places. 36.82 Fixed a parse error for allduplicates in Clumpify. 36.83 Fixed a bug in inter-tile distance calculation in Clumpify. 36.84 Fixed Reformat failure to correctly generate insert size histogram from sam files. 36.85 Fixed N contribution to error rate in qhist; it was underestimated by 50%. Clumpify now supports twin files. Added Histogram to the names of some histogram-statistics methods. Fixed a bug in weighted-average calculation by KmerCountMulti. Reduced contribution of paired score to overall variant score. Reduced effect of paired score near contig ends. Multisample VCF info field is now the sum of samples rather than the best sample. Added bgzip support. Added variant score histogram output to CallVariants. Fixed variant sorting issue for insertions after subs. Fixed a bug in clipTipIndels() related to variant realignment. 
Added homopolymer score calculation for indels. Wrote VCFFile. Added VCFLine compare and hashing. Added VCFLine String caching. Wrote CompareVCF and comparevcf.sh. 36.86 Fixed a bug in Var initialization from vcf; alleles were not being canonicalized. Added kill flag to TaxServer. Fixed CalcTrueQuality crash with multiple sam files. FlowCell now retains total read count when written to disk. Added average read length tracking to CallVariants. Added insertion-length allele-frequency adjustment. Substantially improves insertion calling. Fixed some issues with RandomReads adding /1 to read headers that interfere with SamToRoc parsing. Added contig end dist field in variant files. Fixed a bug in ScafMap initialization from CompareVCF. 36.87 Fixed a Sketch bug adding a leading colon to sequence names. Removed an assertion preventing sketch loading without delta compression. Long insertions now reduce nearby SNPs that would be implied by misalignment. Fixed major bug in Sketch with very small genomes. 36.88 TaxTree now automatically parses headers for name and accession, but only if they contain pipe symbols (rather than replacing with underscore), since underscores may be present in accessions. Adjusted insertion-induced substitution revised frequency reduction down by 50%, as roughly half occur on either side. Added SketchMaker taxlevel, tossjunk, and accession flags. Accession processing now outputs a list of accession symbols. Added ftl, ftr, ftr2 flags to RQCFilter. Added support of - symbols in accession strings. Tested maskmiddle for Sketch; does not change sensitivity. Fixed CompareSketch output formatting. Added support for uppercase taxa level names. Wrote Query class for pulling data from TaxServer. Fixed vcf score at a constant 2 decimal places. Fixed windows-style slashes in taxpath argument. Added accession file input support to SketchMaker and various taxonomy classes. 36.89 Fixed negative coverage in VCF output when coverage overflows. 
Added suggested resolution to warning when coverage overflows. Changed Clumpify spantiles default to false. 36.90 Added Tadpole mincontig=auto, which sets mincontig at max(124, 2*k). Added Tadpole trimends=auto flag. Fixed VCF sorting, again. Fixed some VCF header bugs. Reduced native var format precision to 4 decimals maximum. VCF lines are now right-trimmed to a canonical representation from text. Changed TaxTree to use the full taxonomic tree. 36.92 Added Sketch support to TaxServer. 36.93 Added Locale.ROOT to String formatting. Fixed module being printed in bamscript output. BBMap no longer prints information about ambig mode when just indexing. Parallel sorting is no longer used when threads are set to 1. BBDuk now automatically turns on findBestMatch in rename mode. HashBuffer can now support HashArray2D and HashArrayHybrid. Improved taxonomy name parsing. SendSketch now sends multiple sketches in a single transaction rather than opening a new connection each time. 36.94 Revised CompareSketch to use indexed sketches. Regenerated nt sketches and restarted TaxServer with better parsing of names. 36.95 Fixed tar -xvzf typo in documentation. Wrote ProcessGC for printing gc content by interval. Added BBMap nfilter. 36.96 Version bump. 36.97 Minor changes to AccessionToTaxid. Changed default location of taxonomy files to a symlinked directory. Updated taxonomy files. 36.98 Fixed a crash in read header parsing for flowcell coordinates. Fixed some VCF header lines (FORMAT should have been FILTER). Added RAF (revised allele frequency) and SB (strand bias) columns to VCF. 36.99 Clumpify now assigns short kmers to short reads or reads with too many Ns to have a legit kmer. Should reduce the issue of large clumps. Reversed clump order to keep low-quality reads at the end rather than beginning of the file. Added 1-based offset option for PlotGC. Removed binary mode from Sketches. Documented some BBMap flags. Added amino acid support to Sketch. 
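Note on the Locale.ROOT change from 36.93: the likely motivation is that String.format with the default locale prints a decimal comma on some systems, which breaks numeric columns in tabular output. The sketch below shows the difference; the class and method names are hypothetical and not taken from BBTools.

```java
import java.util.Locale;

/** Hypothetical illustration of locale-independent number formatting. */
final class LocaleFormatDemo {

    /** Formats a score with two decimals and a '.' separator on every system. */
    static String formatScore(double score) {
        return String.format(Locale.ROOT, "%.2f", score);
    }

    public static void main(String[] args) {
        // A German locale uses a decimal comma, which most parsers reject.
        System.out.println(String.format(Locale.GERMANY, "%.2f", 12.5));
        // Locale.ROOT always yields a decimal point.
        System.out.println(formatScore(12.5));
    }
}
```

Passing the locale explicitly at each call keeps output stable without changing the JVM-wide default locale.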
TODO: Examine impact of duplicates for both de novo genome assembly and for single cell applications TODO: More verbose output from FilterByTaxa in RQCFilter pipeline. TODO: Call peaks from raw count, not unique kmers. TODO: Test garbage collectors for throughput. TODO: Bisect seems to not work when highest K is best... TODO: Add taxpath to everything in tax. TODO: minscore2 minallelefrequency2, etc. For "fail" but retain. TODO: S-curves for variant scores rather than linear curves. TODO: "Warning: Zero reads processed.". Possibly, tiles don't get widened with stored flowcells? *TODO: ins homopolymer score always 1... TODO: Tadpole extender TODO: *Some variants in groups are still overall fail. hiseq clump miseq clump test assembly w/dedupe server for sketch compare how sketches work vs BLAST Write monthly plan *TODO: Add CallVariants assertion for ref allele = alt allele. TODO: CompareSketch quit early in fast mode. TODO: CompareSketch index by first key. *TODO: Tadpole needs an "extra" flag TODO: TRD flag for Reformat/BBMap - allow independent versions for fasta headers and sam rname/qname fields. TODO: TranslateSixFrames silent mode (?). TODO: add the date/version of the tax dump file you're using to the doc page TODO: Generate match strings from cigar strings with no MD tag when ref is present. TODO: Parse multi-allelic VCF lines. TODO: Consider tracking insert size when calling variants. TODO: Do quick allele-splitting on clumps where one allele is less than 90% identity to consensus, for compression. TODO: BBMap output chimerically-mapped reads to a different file. Or, at least, annotate them in some way. TODO: Make a SamHeader class. TODO: Add a SamLine field to Read (?). REGRESSION: CallVariants is faster under Java 7 than Java 8. TODO: Shijiefilter with taxa and environmental TODO: Streamer should support maxreads like cris does. TODO: Investigate TadPipe misassembly rate factors. TODO: Annotate peaks.txt with average gc content of kmers in each peak. 
TODO: Have BBMap quit after X mapped reads. TODO: Fix java 8 metaspace problem with something like "-XX:CompressedClassSpaceSize=64m -XX:MaxMetaspaceSize=128m". TODO: BBTools homebrew TODO: Add ability to depth-filter during multipass Clumpify ecc. TODO: Microbe filter is printing out "chr14" and etc for scafstats. TODO: Test Clumpify parameters. An empirical metric for false-positives would be handy. TODO: Write KmerSplitSort which operates on multiple input and output files in multiple passes. TODO: Consider binning reads by modulo while reading input for quicker sort and easier threading. TODO: Use quality scores to determine whether to correct. TODO: A high raw quality score should yield a lower corrected quality score, and vice-versa. Nice to get some data on this. TODO: MPI version in which data is split across nodes instead of files. TODO: Clumpify pipeline multithreading. ****TODO: Move all instances of Parser to the end of the parsing block. *TODO: Tadpole rollback if mincountcorrect is violated. TODO: Check other places where extra flag occurs ***TODO: Add ability to add ftr2=1 to RQCFilter (possibly for only reads that have been trimmed...). (Vasanth) TODO: BBMap 500 Mbp max chrom length causes trouble with Wheat 850 Mbp chromosome. ***TODO: Growable LongHeapSet (?). Useful for nt. TODO: Kmer tables quality vector to determine average quality of stored kmers. TODO: CompareSketch taxonomy trace, starting at hit with highest identity. TODO: CompareSketch raw and adjusted identity. TODO: SketchMaker min and max taxlevel. TODO: Add adapter detection to RQCFilter. TODO: Allow quality recalibration Q-change limits. TODO: Test Tadpole assembly with recalibrated Q-scores using gentler minprob. TODO: Filter out reads causing uniqueness spikes. TODO: 20-mer uniqueness -> 25-mer. TODO: sketch loading does not *really* need 1 object per line. 1 object per sketch is better. TODO: Kmer -> OldKmer. New Kmer should be packed 64-bit. 
TODO: New headers: LN: NM: ID: CD:DH, D33-6, etc TODO: KmerCountExact - ensure it supports prefilter. TODO: BBMerge adapter trimming *TODO: Genomax question. ***TODO: config= overrides tree file for Taxonomy.sh. TODO: BBMerge should not do adapter-detection on extended reads. TODO: KmerCountExact sketchname flag. TODO: Ensure fastq and fasta input generate identical sketches. TODO: Multithread sketch compare. TODO: BBMerge adapter-matching based on stringency. TODO: BBDuk adapter detection. TODO: Outa for BBDuk. TODO: LogLog minprob. ***TODO: IDMatrix gives wrong answers in some cases (?) probably involving different starting/ending locations. TODO: It gives even worse answers with JNI enabled. TODO: Reformat outu stream, particularly for gc splitting. TODO: BBNorm does not use extra files in second pass. *TODO: LongHashSet TODO: Move IntList and so forth into their own package. *TODO: Progressively square major/minor allele ratio for low-depth error-correction up to the max ratio. TODO: Filterbyname prefix, suffix, or even fixed-length substring hashing for super speed. TODO: Kmer sketch comparison. TODO: Tadpole diploid mode. TODO: Crossblock remove reads under a certain length during multiplexing or normalization. TODO: Increase multiplexing threads? TODO: Reduce demultiplexing pigz threads? TODO: BBDuk allows input and output files to have same name (?) TODO: Tadpole full-pass correction TODO: Tadpole count reduction check TODO: De-Partition and target-size serial mode TODO: Seal K>31 true support. TODO: Option to redirect all BBMap stderr output to a file. TODO: Rewrite BBMask so that when sam files are not used, memory needs are reduced. Or, support sorted sam. Requested by Shijie. TODO: Make Tadpole outu accept discarded contigs as well. TODO: Find the best 16s copy. Then use only the ~2 copies with highest identity to that for consensus. TODO: Use Java library unique filename generator. 
TODO: More accurate LogLog with multiple hash functions (basically, xor masks). TODO: Tadpole valid extension byte array. Return as a long to include count. TODO: Write BBSplit guide. TODO: BBMap's scafstats are different from Seal's stats format. TODO: Dedupe should not hang when it runs out of memory. (Shijie). TODO: crc (checkrcomp) flag for BBDuk adapter trimming. TODO: Clumpify error correction. TODO: Standard way of determining whether a program crashed, or finished, or logging. TODO: Tadpole read exclusion for reads that won't assemble or have bad kmers. V35. 35.00 Changed Gene.toChromosome to return an int rather than a byte. Changed gitable.int2d name to gitable.int1d since it is a 1D array. Added taxa support for ArrayListSet. Added % support for output in Reformat. Requested by Alex Copeland. Added gitable.sh script for generating gitable.int1d.gz tax translator. 35.01 Fixed BBDuk crash when K>31 and stats output was enabled. Noted by Alex Spunde. Fixed repair.sh failure on fint flag. Fixed SplitPairsAndSingles not working on interleaved input anymore. Split Tadpole's mincount flag into mincountseed and mincountextend (mcs and mce). Added rcomp flag to BBMap. Requested by Bryce Foster. Added merge flag to KmerCountExact. 35.02 maq flag now accepts 2 arguments: maq=Q,B. If second argument is specified, only the initial B bases will be used to calculate the quality. Added minprob and maq flags to Tadpole and KmerCountExact. Fixed memory detection in calcmem.sh not working when ulimit=unlimited. Thanks for the debugging help from Jason S! Added some getters to KmerForest and KmerNode. Enabled Tadpole kmer harvesting from victim buffer. Greatly accelerated Tadpole by allowing threads to compete for tables, rather than using fixed allocation. Accelerated Tadpole by increasing default number of tables per thread. 35.03 Wrote Shred and shred.sh. BBMap can now output mapping stats to a file with the statsfile= flag. Requested by Vasanth. 
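Note on maq=Q,B from 35.02: the sketch below illustrates filtering on the average quality of only the initial B bases. It assumes a plain arithmetic mean of Phred scores and treats a non-positive B as "use the whole read"; both choices are assumptions for illustration, and the actual tool may average differently (for example, by error probability). The class and method names are hypothetical.

```java
/** Hypothetical illustration of a minimum-average-quality filter. */
final class MaqFilter {

    /**
     * Arithmetic mean of the first 'bases' quality scores, or of all
     * scores when 'bases' is not positive (assumed convention).
     */
    static double averageQuality(byte[] quals, int bases) {
        int n = (bases > 0) ? Math.min(bases, quals.length) : quals.length;
        if (n == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += quals[i];
        }
        return sum / (double) n;
    }

    /** True if the read's average quality meets the threshold Q. */
    static boolean passes(byte[] quals, int minAvgQuality, int bases) {
        return averageQuality(quals, bases) >= minAvgQuality;
    }
}
```

Restricting the average to the first B bases lets a read with a degraded 3' tail pass while still rejecting reads that are poor from the start.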
35.04 Integrated extension into BBMerge (extend= flag). BBNorm now does ecc after deciding whether to discard a read, not before. 35.05 Fixed FilterByCoverage ignoring minCoverage if pre-normalization covstats not given. 35.06 Added BBMap lengthtag. Requested by Esther Singer. 35.07 Fixed rbb flag in BBNorm not working (conflated with parser flag). Integrated a shellscript modification that allows shellscripts to be symlinked and still find the correct classpath. Thanks Elmar Pruess! 35.08 Fixed rcompmate flag; it was triggering an assertion error. Added BBMap showprogress2 flag. Got rid of ReadInputStream.preferBlocks and associated methods. Simplified how Reformat works with in1 vs ffin1, and sam files. Fixed bug in which Reformat was dropping header lines. Reported by Gloria F. Fixed bug in BBMergeOverlapper pfilter for reads of different length. Fixed bug in BBMergeOverlapper for reads of different length. 35.09 Removed BBMerge hist2 and hist3 which were redundant; added showhiststats flag. Added BBMerge prealloc and prefilter flags. Removed some old BBMerge functionality (outinsert, perfectonly, etc). Changed extend to extend1 and extend2. Completely rewrote BBMerge's code path to break it into small modular functions. Memory allocation exceptions in HashArray are now handled gracefully. BBMerge now uses tadpole for kfilter. BBMerge can now extend before or after merge attempts. Tadpole can now do error correction via pincer. Tadpole can now do error correction via tail also. Added genome size estimation to KmerCountExact (via CallPeaks). This will be printed in the peaks output header. Fixed BBMap slowdown caused by rescue in LMP libraries. Thanks Marc Strous and Xiaoli Dong (Metawatt team) for helping me track it down! 35.10 Removed a debugging line from Tadpole that made it crash when extending reads. 35.11 Removed a debug assertion from SamReadInputStream. Found by Kurt LaButti. Improved descriptions in kmercountexact.sh.
Centralized memory statistics printing in Shared. Separated Tadpole's load phase (KmerTableSet) from build phase (Tadpole). Added catch blocks for memory exceptions when reading objects from disk. Added catch blocks for memory exceptions when indexing. Added catch blocks for memory exceptions in ChromosomeArrays and CoverageArrays. RandomReads now correctly outputs names in fasta format. RandomReads now has simple names without custom BBMap coordinates. KmerCountExact now uses KmerTableSet. Parsing is more robust for Tadpole, KmerCountExact, and KmerTableSet. Coverage estimate (based on first peak) now in KmerCountExact. Requested by Vasanth. Added ihist PercentOfPairs header line. Added trim support to KmerTableSet. Added triangle filter for smoothing histograms before peak calling. Vastly improves results. Requested by Alex Copeland. 35.12 Updated shellscripts to have consistent formatting, and fixed various typos. Reimplemented outinsert for BBMerge. Requested by Matt Nolan. 35.13 Wrote Tadpole.explore. Removed debugging parameters (rid, pos) from Tadpole/KmerTableSet ownership functions. Fixed massive performance problem in KmerArray - victim buffer was being searched for nonexistent kmers. Wrote function to clear and regenerate tables after shaving. Shaving now seems to work correctly. Reduced mcs (minclustersize) in Dedupe from 2 to 1, to match the documentation. Added Shaver bubble-removal and improved statistics tracking. Added Tadpole markBadBases (mbb) flag for turning bases covered by low-count kmers into N. Added Tadpole mode=correct/ecc for correction without extension. Tadpole now uses in/out when in extend/ecc mode and ine/oute are not specified. 35.14 Added iterative seeding with decreasing depth to Tadpole, via contigpasses and contigpassmult flags. Added Tadpole mdo (markdeltaonly) flag; default true. Tadpole can now do error marking (mbb) without error correction (ecc). Tadpole ownership is now automatic.
Added driver.FilterLines and filterlines.sh for filtering text lines. Verified that an issue with transcriptome mapping is due to an incorrect transcriptome rather than a bug. Fixed bug in which BBMap subfilter passed sites with no cigar string. Noted by lankage (SeqAnswers). mapq of reads with primary site filtered out is now very low (under 4). BBDuk can now stop after X outm or outu bases. Requested by R. Westerman. 35.15 Fixed minor bug in Seal in which unmatched reads were not being incremented, causing unmatched read rate to be displayed as 0. Fixed a bug in parseKMG for decimal values. BBMerge now supports error-correction with Tadpole. BBMerge now supports iterative extension. BBMerge will now always output the original read sequence for reads that don't get merged, rather than the extended or error-corrected sequence. BBMerge minor output formatting bugs fixed. BBMerge now calls Tadpole rather than Tadpole_old. Wrote a shellscript for TaxTree. Wrote Postfilter and postfilter.sh, a wrapper for BBMap and FilterByCoverage to postfilter SPAdes assemblies. 35.16 Fixed a bug in FilterByCoverage that was filtering everything if cov0 was not specified. Found and fixed some small bugs in Tadpole, such as not adding the very last base of a contig. Fixed non-determinism in Tadpole by looking for hidden back branches and using extension return codes. Created ukmer package and Tadpole2, which supports unlimited kmer lengths. Tadpole2 will be automatically called by Tadpole if K>31. 35.17 Made Tadpole an abstract superclass of Tadpole1 and Tadpole2, with massive reduction in duplicate code. BBMerge now supports unlimited kmer lengths. Made AbstractKmerTableSet superclass of KmerTableSet and KmerTableSetU. KmerCountExact now supports unlimited kmer lengths. 35.18 Made Shaver an abstract superclass for Shaver1 and Shaver2. KmerCountExact now supports shave and rinse operations. Prefilter now works optimally with K>31, thanks to new hash routine.
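The parseKMG fix noted under 35.15 concerns values such as 1.5k, where a decimal number carries a K/M/G suffix. The following is a minimal sketch of that kind of parsing, not the BBTools implementation: the class and method names are hypothetical, and the decimal multipliers (k=1000, m=1000000, g=1000000000) are an assumption that should be checked against the real parseKMG.

```java
import java.util.Locale;

class KmgSketch {

    /**
     * Parses strings like "150", "2m", or "1.5k" into a long.
     * Hypothetical helper; assumes decimal (power-of-1000) multipliers.
     */
    static long parseSuffixed(String s) {
        s = s.trim().toLowerCase(Locale.ROOT);
        char last = s.charAt(s.length() - 1);
        long mult = 1;
        if (last == 'k') { mult = 1_000L; }
        else if (last == 'm') { mult = 1_000_000L; }
        else if (last == 'g') { mult = 1_000_000_000L; }
        if (mult > 1) { s = s.substring(0, s.length() - 1); } // drop the suffix
        if (s.indexOf('.') < 0) {
            // Integer path: no floating point involved
            return Long.parseLong(s) * mult;
        }
        // Decimal path: scale first, then round to the nearest whole number
        return Math.round(Double.parseDouble(s) * mult);
    }

    public static void main(String[] args) {
        System.out.println(parseSuffixed("1.5k"));
        System.out.println(parseSuffixed("2m"));
        System.out.println(parseSuffixed("150"));
    }
}
```

Keeping a separate integer path avoids any floating-point rounding for the common case where no decimal point is present.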
Fixed KmerCountExact not writing peaks without khist set. 35.19 Fixed a crash in read extension with K>31. 35.20 Re-added lines for unmerged read to BBMerge outinsert stream. Requested by Matt Nolan. Added some new header lines to KmerCountExact peaks output - ploidy, het rate, repeat content, etc. 35.21 Fixed Tadpole contig coverage estimation. Added Tadpole mincoverage flag. Fixed crash in ReadWrite when attempting to create filenames containing the pipe symbol. Fixed an invalid assertion in HashArrayHybrid resize(). Added IntList.contains(). Fixed a tricky bug with Seal qhdist not looking for matches with substitutions if it first found a match without substitutions. Noted by sdriscoll. Dedupe for some reason had interleaved name detection disabled. This is now enabled. Noted by Bede Constantinides. Fixed new BBMerge crash bug with outinsert. Noted by Matt Nolan. 35.22 Added truncateheadersymbol flag to filterbyname. Added some postfilter flags with defaults optimized to increase speed. Fixed Seal qhdist again; I had forgotten to sort a list. Noted by sdriscoll. Enabled pigz by default in BBNorm. Made some improvements to peak-calling. 35.23 Fixed a Tadpole2 assertion error when error-correcting with K>31 and variable-length reads. Added minconsecutivebases flag to Reformat/BBDuk. Added locking and lock testing to HashBuffer; unclear whether speed increased. 35.24 Added BBDuk maskfullycovered flag. Added SummarizeSealStats ignoresametaxa and ignoresamebarcode flags. Wrote ReduceSilva and reducesilva.sh for shrinking Silva database by removing entries with redundant taxonomy. 35.25 Added SummarizeSealStats ignoresamelocation flag. 35.26 Fixed ignoresamelocation pulling from incorrect field. Documented mlf flag. Requested by Alex Copeland. Added kmg support to minlength and maxlength. Requested by Bill A. Fixed a bug in BBMap when handling subfilter on multimapped reads. Noted by vout. Improved BBMap fixXY() function. 
Fixed major bug in BBMerge; outu read2 was reverse-complemented. Added a function to soft-clip reads with a long indel anchored by very few bases. Added ftm, ftl, ftr, ftr2 flags to BBMerge. Added qtrim2 flag to BBMerge (trim on overlap failure). Fixed implementation of shave and rinse to properly handle backward branches. Fixed shave mindepth at 1 instead of variable. 35.27 Fixed a crash bug in BBMap tip indel clipping during fast mode. 35.28 Checksites now verifies correct site ordering. ensureMatchStringOnPrimary now ensures correct ss ordering if it changes anything. ensureMatchStringsOnSiteScores now ensures correct ss ordering if it changes anything. These changes resolved an assertion crash bug. 35.29 Fixed an assertion bug in tip indel clipping. Noted by Bryce Foster. Fixed a couple places where clipped bases were not counted when calculating match position. Fixed another bug with fast match strings related to clipping tip indels. 35.30 Fixed another bug related to clipping tip indels and resorting. Noted by Bryce Foster and Matt Nolan. Clipped tip indels are now replaced with matches or mismatches. Fixed a missing else in SamLine. 35.31 Fixed a bug in toLocalAlignment() when a read has zero matches to reference. 35.32 Fixed an instance where alignments exceeding window size yielded ss with inconsistent lengths. Improved calculation of amount of padding needed when alignments exceed window. 35.33 Removed kmersamplerate and readsamplerate from bloom package to simplify code. Corrected handling of minprob in bloom package. Added Tadpole minprobprefilter and minprobmain flags. KmerCountExact now disables minProbMain when prefilter is enabled. Tadpole can now do multipass prefiltering. Tadpole can prefilter for an automatic number of passes. Tadpole now supports 1-bit final-pass prefiltering. Fixed a bug in fixLimitsXY() - only Y needs adjustment, not X. Fixed a bug in generateMatchString in which sorting was not redone when results changed. 
Added functionality to Bloom prefilter to allow arbitrary cutoffs, rather than just using the filter's max value. 35.34 Fixed a compile error due to Bloom filter changes. 35.35 Added MergeBigelow for combining custom Bigelow text files. Updated Shred to add equal flag, to shred reads into equal lengths rather than a fixed length. Tracked down a few bugs regarding ss score-setting order. Temporarily disabled CHECKSITES and SiteScore.setScore() assertions, which are mainly of interest for efficiency, not correctness. 35.36 Added mlf flag to BBDuk. Fixed qhdist crash with values over 1 (params were reversed). Stripped qfin/qfout support from rqcfilter since nobody will ever use it. Made files of the common kmers found in ribosomes (/global/projectb/sandbox/gaag/bbtools/ribo) using ReduceSilva, Dedupe, and KmerCountExact. Added ribo filtering to RQCFilter. 35.37 Fixed a ribo filtering flag for RQCFilter. Added RQCFilter ribodb, ribohdist, riboedist flags. Added RQCFilter extend flag (allows BBMerge read extension). Fixed path in file-list (directory was being prepended). 35.38 Improved Tadpole help info. Added Pileup coverage standard deviation calculation. Requested by Bill A. Fixed one last (?) assertion error in BBMap. Reported by Vasanth and Shijie. 35.39 Postfilter now unloads Data after mapping. Added trim flag to postfilter and filterbycoverage. 35.40 Fixed a null pointer in Read.validate(). Fixed read extension in Tadpole when K<=31; the wrong method was called, causing a crash. Noted by Westerman. Added Tadpole contig trimming flag. Fixed colossal BBMerge bug - read 2 was being merged as a reverse complement. Not sure when that started... 35.41 Added spaceslash flag to RandomReads to allow space to be omitted from read names prior to slash pairnum. Requested by Rob Egan. 35.42 Slightly altered Tadpole1 to allow condensed assembly of kmer sets; added flag ibo (ignore bad owner). Made KmerCompressor and kcompress.sh. 
Generates a concise fasta representation of the set of kmers occurring at least N times. 35.43 Fixed crash in BBDuk when using MinKmerFraction (mkf) flag on single-ended reads. Added fuse flag to KmerCompressor. Fixed a crash in BBMapPacBio versions, caused by not percolating over a change in normal BBMap. Noted by Teshome. 35.44 KmerCountExact khist was overflowing if there were more than 2 billion kmers of a given depth. Converted counts to long array. Wrote AbstractRemoveThread for removing kmers with counts outside of a certain range. Added mincr and maxcr (min count to retain and max count to retain) flags to Tadpole. Fixed incorrect haploid_fold_coverage in peaks.txt. Noted by Kurt. Fixed KmerCountExact not writing peaks file if no khist was specified. Fixed Tadpole differentiation between in/out and ine/oute. Now only in and out are needed. Added Reads and Bases columns to Dedupe output. Requested by Esther S. 35.45 Added driver.CountSharedLines and countsharedlines.sh. Requested by Esther S. 35.46 Added smoothing control flags for KmerCountExact. Caught invalid values of K in BBMap. Added some additional header lines in peaks output. 35.47 Fixed BBMap incorrect NM tags for reads with soft-clipping. Noted by Rob Egan. 35.48 Disabled second parameter being automatically interpreted as an output file when = is not specified, in most cases. This is ambiguous as the second parameter might be a file for input read 2. Fixed a new bug in newly fixed NM tag gen ^^;. Also noted by Rob Egan. Identity calculations no longer penalize regions skipped as introns if the intronlen flag is set. Suggested by Rob Egan. 35.49 Clarified error messages for reads failing barcode filter. Added cutprimers flag to include flanking primer sequence. BBMerge trimq can now be an array for multiple attempts. Multithreaded memory allocation for bloom filters; moderate speed increase. Added mingc/maxgc to BBDuk and BBDuk2. Added BBDuk mcf (min covered fraction) flag.
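The khist overflow fixed in 35.44 is ordinary 32-bit wraparound: a Java int bin cannot hold more than 2147483647 kmers at one depth. The sketch below only illustrates that failure mode and the long-array remedy; the class and method names are hypothetical and are not taken from KmerCountExact.

```java
class KhistOverflowSketch {

    /** Adds n kmers to the bin for the given depth, using 32-bit counters. */
    static void addInt(int[] hist, int depth, int n) {
        hist[depth] += n; // silently wraps past Integer.MAX_VALUE
    }

    /** Same operation with 64-bit counters, which do not wrap at this scale. */
    static void addLong(long[] hist, int depth, int n) {
        hist[depth] += n;
    }

    public static void main(String[] args) {
        int[] intHist = new int[4];
        long[] longHist = new long[4];
        // Simulate 3 billion kmers at depth 1, added in chunks of 1 billion
        for (int i = 0; i < 3; i++) {
            addInt(intHist, 1, 1_000_000_000);
            addLong(longHist, 1, 1_000_000_000);
        }
        System.out.println("int bin:  " + intHist[1]);  // prints -1294967296
        System.out.println("long bin: " + longHist[1]); // prints 3000000000
    }
}
```

A long array doubles the memory of the histogram itself, but the histogram is small relative to the kmer tables, so the trade is cheap.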
35.50 Added KmerCompressor max flag. Clarified crossblock help regarding input file lists. Enclosed all iterations of Dedupe overlapLists with a null check. KmerCompressor is not deterministic when multithreaded (kmers may be used more than once); reduced buildthreads to 1. Added LogLog class for cardinality estimation, and loglog.sh. Enabled loglog flag for Reformat and BBDuk. 35.51 Added X bit to bamscript generated by BBMap. Loglog can now accept multiple files. Changed settings of removehuman in rqcfilter to be faster (requires 2 hits now). Fixed a null pointer exception in BBMerge with quality. Fixed a bug with BBMerge ecco flag being ignored. Upgraded Seal to allow requiring full containments of ref sequences. Requested by Ernst O. 35.52 Fixed issue with "ignorebadquality" flag being ignored in some cases. Noted by Alicia C. 35.53 Added mouse to RQCFilter. Switched RQCFilter and RemoveHuman to k=14 for a 4x speedup. Modified rqcfilter.sh to allow mouse, cat, dog, and human removal concurrently on 40GB nodes. Disabled test for too-high quality scores because it was annoying when dealing with PacBio reads. 35.54 Added TadpoleWrapper and tadwrapper.sh, which runs Tadpole multiple times with different kmer lengths and recommends the best length. Requested by Alex Copeland. Added normandcorrectwrapper.sh, which runs BBNorm then Tadpole. Requested by Stephan Trong. Added clear() operation to KmerTableSets. 35.55 Changed Character.isAlphabetic() calls to Character.isLetter(). Modified CountBarcodes to add more flags (for dual barcodes). RQCFilter pigz and unpigz now default to true. Moved parsing of threads and recalpairnum from parseCommon to parseCommonStatic. Increased sensitivity of ribo removal (96.6% to 98.8%) by using a larger kmer set. Adjusted BBMap default per-thread memory usage estimate after a crash. Noted by Matt Nolan. 35.57 Fixed a change to removehuman.sh default memory allocation. 35.58 Fixed SynthMDA handling of minlen flag. 
35.59 Added sam 1.4 -> 1.3 support to reformat, via sam=1.3 flag. Added RQCFilter filterqhdist flag. Requested by Adam Rivers. Slightly reduced default mininsert in BBDuk from 50 to 40. Added some additional comments to BBDuk. Added # support for BBMap output files. Requested by Adrian Pelin. Fixed rqcfilter.sh not grabbing enough memory on slot-scheduled Mendel nodes... hopefully. Added name flag to FuseSequence. 35.60 Data.clear() now also clears scaffold information in BBSplitter. Added removemicrobes flag to RQCFilter. Added removehuman2.sh for aggressive human contaminant removal versus an unmasked reference. Requested by Alicia Clum. Modified RQCFilter to support unmasked mouse, cat, dog, and human references. Requested by Alicia Clum. Allowed entropyfilter bbduk flag instead of just entropy. Unpigz is now used in certain cases where it was prevented before, like reading lists of names in filterbyname. 35.61 Corrected some names in RQCFilter file-list.txt. 35.62 Split writeReproduceHeader off from writeReproduceFile. Added BBTools version and RQCFilter command to RQCFilter reproduce.sh log. Added humanpath, catpath, dogpath, mousepath flags to RQCFilter and clarified them in the documentation. Improved documentation of bbduk2.sh. Fixed BBMergeOverlapper.c to match java version. 35.63 Fixed some BBMergeOverlapper.c syntax errors. 35.64 Fixed more BBMergeOverlapper.c syntax errors. 35.65 Fixed a BBMergeOverlapper.c runtime error in quality-free mode. Finally working again! 35.66 Changed testQuality() to assign ASCII-64 to the specific case of N bases with quality B. Added preliminary support for dsrc compression. However, the program does not appear to work correctly in Windows. Added header output (.header extension). Requested by Matt Nolan. Added coverage calculation ignoring deletions. AssemblyStats (stats.sh) now has fastq support; requested by several people. Added file type and extension documentation as readme_filetypes.txt.
Removed obsolete changelogs for BBDuk and Reformat. 35.67 Added "_part" suffix before the part number to names of automatic-split fasta reads. This fixes a problem with underscore-number-named sequences in BBEst. Noted by Kurt L. Fixed a corner-case in filterbycoverage's handling of trimmed reads that drop below the length cutoff. Noted by Stephan Trong. 35.68 Wrote KmerComparator and KmerComparator2 for comparing reads by pivot kmers. Wrote KmerSort for sorting reads by pivots. Wrote KmerSplit for binning reads by pivots. Wrote Clump class for storing ordered overlapping reads. Wrote Clumpify to wrap KmerSplit and KmerSort. Wrote ClumpList to turn a list of clumped reads into a list of clumps. Added preliminary consensus operations to Clump and KmerSort. Added KmerReduce to produce the set of pivot kmers. Fixed an out-of-bounds error in CutPrimers. Noted by vmikk (SeqAnswers). Moved UnicodeToAscii (which did not work), TableLoaderLockFreeU, and TableReaderU to z_old, since they cause compilation problems. Noted by Martin M. Fixed prefilter onepass mode causing a crash. Added clump package. Added kmer-count restrictions to Clump pivot selection. Not clear whether it is useful. Added local maximum capability to KmerComparator. Fixed BBNorm to work with kmers>31, for normalization (not error-correction yet). Not fully tested, though. Noted by Kurt L. KmerTableSet read loading now does read validation per thread. This allows better multithreaded scaling. BBDuk and Seal also now do validation per thread. 35.69 Multithreaded kmer table dumping by KmerCountExact and Tadpole; over 4x speedup. 35.70 Tadpole now does validation per-thread when error-correcting. Slight speed increase. Fixed a bug in KmerCountExact in which prefilter did not work with K>31, due to using key() instead of xor(). Noted by Kurt L. 35.71 Added A_SampleMT, with full line-by-line comments. Improved A_Sample's comments. Added kmer histogram generation to rqcfilter (khist flag). 
Reorganized rqcfilter.sh documentation. Added a_sample_mt.sh. Fixed documentation in shuffle.sh. Running any program with -version, -help, etc now prints a useful message. 35.72 LogLog now retains last cardinality estimate in a static field. RQCFilter now chooses BBNorm or KmerCountExact for the khist depending on the estimated cardinality. Kmer histograms now have a header by default. 35.73 HashBufferU now only tries to acquire a lock every 16th time, like HashBuffer. Removed some checks for the literal string null. Fixed some else-if fallthroughs where else was missing. Addressed some compiler warnings in kmer, ukmer packages. Wrote FilterByTaxa for filtering of sequences labelled with their taxonomy (gi number or ncbi taxID). Wrote PrintTaxonomy. Wrote taxonomy.sh and filterbytaxa.sh. 35.74 Added peaks output to rqcfilter. Made TaxFilter class and revised FilterByTaxa to use it. Added FilterByTaxa support to RQCFilter for microbial decontamination. Fixed a bug in which empty files had their format misdetected. Fixed a couple array-out-of-bounds errors with unicode characters in genetic sequence. They are now converted to N. 35.75 Enabled worker thread read validation in A_SampleMT. Wrote SplitByTaxa and splitbytaxa.sh. 35.76 Changed documentation structure. There is now changelog.txt, readme.txt, UsageGuide.txt, and ToolDescriptions.txt. Fixed IntList resize overflow bug. Noted by jazz710 (SeqAnswers). Removed unicode2ascii.sh since it does not work. Wrote FungalRelease and fungalrelease.sh. Requested by Kurt and Jasmyn. 35.77 BBDuk can now call CalcTrueQuality to generate matrices if given a sam file. Added scaffold name remapping legend to FungalRelease. Requested by Kurt and Jasmyn. Wrote BBDukGuide. Wrote BBMergeGuide. Wrote TadpoleGuide. 35.78 Fixed infinite recursion when setting threadcount. Found by Matt Nolan. 35.79 Changed BBNorm defaults to target=100 min=5. Wrote Reformat guide. Wrote Seal guide. Wrote Taxonomy guide. 
Added Tools.startsWith(byte[], String). Revised GiToNcbi and TaxTree functions to allow gi_ as well as gi|, to avoid pipe symbol. 35.79 Standardized syntax of gitable and taxtree flags, and added "auto" option. 35.80 Added FilterBySequence and filterbysequence.sh. Wrote PreprocessingGuide. Wrote DedupeGuide. Wrote BBNormGuide. Removed bbmap20.sh, bbnorm20.sh, bbsplit20.sh, and khist20.sh since they can now have memory set explicitly. 35.81 Unified shellscripts between private and public release - module load commands now only run if NERSC_HOST==genepool, and "-l" removed from /bin/bash header. JNI mode is now enabled by default if NERSC_HOST==genepool. 35.82 Fixed a double-print of BBMap version number. Updated projectb pre-deploy version of BBTools compiled jni code and disabled auto-JNI-enable when NERSC_HOST==genepool. Wrote guides for A_Sample, BBMask, Stats, CalcUniqueness, Repair, SplitNextera, Clumpify, and AddAdapters. Wrote BBMapGuide. 35.83 Added "banns" to RandomReads. 35.84 Fixed an unnecessary assertion for negative values of pairedScore in Tools.removeLowQualitySitesUnpaired2. Noted by Jason S. Fixed a bug in SamLine.makeMdTag for handling deletions called off the end of a scaffold. Noted by Jason S. 35.85 Fixed a bug in which BBMap was verifying the presence of the wrong input file. Noted by Adrian P. Added a check for a memory environment variable to calcmem.sh (only affects jobs run at JGI on Genepool). 35.87 Changed version in shared.java to 35.87, fixed bug in calcmem.sh (was calling itself). 35.90 Added ProcessSpeed and ProcessFragMerging for collating output in a BBMerge comparison. ByteFile1 had an error related to Windows-formatted (\r\n newline) files. This was fixed by dropping support for legacy Mac (\r) newlines. So now, valid newlines are \n (Unix/MacOS X) or \r\n (Windows) but not very old Mac (\r) which I have never seen anyway. This change also increased ByteFile1 speed by 10%.
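The newline rule described for 35.90 (accept \n and \r\n, do not treat a lone \r as a terminator) can be sketched as a single scan for \n that strips one preceding \r. This is an illustration under that stated rule, not the ByteFile1 code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class NewlineSketch {

    /** Splits a byte buffer into lines terminated by \n or \r\n. */
    static List<String> splitLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                int end = i;
                // Strip the \r of a Windows \r\n pair
                if (end > start && data[end - 1] == '\r') { end--; }
                lines.add(new String(data, start, end - start, StandardCharsets.US_ASCII));
                start = i + 1;
            }
        }
        if (start < data.length) {
            // Final line with no terminator
            lines.add(new String(data, start, data.length - start, StandardCharsets.US_ASCII));
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] mixed = "A\r\nB\nC".getBytes(StandardCharsets.US_ASCII);
        System.out.println(splitLines(mixed)); // prints [A, B, C]
        // A lone \r is not a terminator, so this stays one line
        byte[] oldMac = "A\rB\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(splitLines(oldMac).size()); // prints 1
    }
}
```

Searching for only one terminator byte keeps the inner loop to a single comparison per byte, which is one plausible reason such a change could also help speed.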
Added GC-filter mode toggle between individual reads and pair averages for paired reads with the "pairgc" flag. Requested by Torben. GC histogram mode now also obeys the pairgc flag. Default is true; previously, it was false. So, now the filter and histogram defaults match. 35.91 Fixed an assertion error that fired on fasta files containing blank lines. Noted by Vasanth S. 35.92 Fixed interleaving detection when BBDuk generates calibration matrices. Fixed a bug in calcmem that ignored used memory when an SGI flag is set. Removed debug code from quality trimming. 35.93 Rewrote coverage stats line parsing to be header-defined. 35.94 Wrote MDWalker to help parse MD tags. Modified sam line parsing to use MD tags and base calls to differentiate between substitutions and nocalls. 35.95 Added commonAncestor methods to TaxTree. Added positional startsWith method to Tools. Wrote jgi.A_SampleByteFile as a text-stream processing template. Wrote tax.FindAncestor and gi2ancestors.sh for converting sets of GI numbers into a single taxonomy. Wrote driver.ProcessWebcheck and webcheck.sh to calculate statistics on the RQC site uptime. Requested by Bryce. Removed an assertion from KmerCompressor that fired with very long contigs. Noted by Eugene H. 35.96 Fixed quality trimming of fasta reads. Terminal Ns were supposed to be removed, but it was not happening. Fixed fake quality score generation of N-containing fasta reads. Changed tax package so that tax queries by name return a list of names when there are multiple hits (try bacteria, for example). Modified PrintTaxonomy to accommodate multiple hits. GiToNcbi now capable of accepting a comma-delimited list of dump files, since NCBI keeps nucleotide and protein sequences in different files. Improved parsing in TaxTree. Changed do loops to while loops in TaxFilter to allow filtering by tree root (life). 35.97 Added TaxFilter ability to require specific ancestor nodes to be defined, such as phylum.
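For the commonAncestor methods mentioned under 35.95, a minimal sketch of one common way to find the lowest shared ancestor in a parent-pointer tree follows. It is not the TaxTree code: the array layout, class name, and toy node ids are all hypothetical, and the root is assumed to be its own parent.

```java
class CommonAncestorSketch {

    /** Number of steps from node up to the root; parent[root] == root. */
    static int depth(int[] parent, int node) {
        int d = 0;
        while (parent[node] != node) {
            node = parent[node];
            d++;
        }
        return d;
    }

    /** Lowest node that is an ancestor of (or equal to) both a and b. */
    static int commonAncestor(int[] parent, int a, int b) {
        int da = depth(parent, a);
        int db = depth(parent, b);
        // Lift the deeper node until both sit at the same depth
        while (da > db) { a = parent[a]; da--; }
        while (db > da) { b = parent[b]; db--; }
        // Climb in lockstep until the paths meet
        while (a != b) {
            a = parent[a];
            b = parent[b];
        }
        return a;
    }

    public static void main(String[] args) {
        // Toy tree: 0 is the root; 1 and 2 are children of 0;
        // 3 and 4 are children of 1; 5 is a child of 3.
        int[] parent = {0, 0, 0, 1, 1, 3};
        System.out.println(commonAncestor(parent, 5, 4)); // prints 1
        System.out.println(commonAncestor(parent, 5, 2)); // prints 0
        System.out.println(commonAncestor(parent, 5, 3)); // prints 3
    }
}
```

Folding a set of ids into a single taxonomy, as gi2ancestors.sh is described as doing, could be done by applying such a pairwise function repeatedly across the set.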
FileFormat now recognizes gzip extension (in addition to gz). Added BBMap excludefraction (ef) flag to manually override the fraction of the kmers discarded as low-information. Added BBMap ignorefrequentkmers (ifk) flag to determine whether to ignore low-information kmers. Added BBMap greedy flag to manually override whether to discard the least informative kmers on a per-read basis. FilterReadsByName can now accept position to only output a fraction of a sequence. Wrote FilterAssemblySummary for processing NCBI assembly_summary_refseq.txt and assembly_summary_genbank.txt files using TaxTree. Wrote filterassemblysummary.sh. 35.98 Capped Output buffer became full messages at one. Modified KCompress to assemble forward kmers only (rcomp=f flag). 35.99 Wrote RenameByHeader to rename files based on their headers. This is designed for NCBI RefSeq genomes. Re-added missing statsfile flag in BBMap. Noted by Vasanth S. Wrote HeaderInputStream to support input of header files with no sequence. Wrote ReplaceHeaders and replaceheaders.sh to insert headers into a sequence file from a different file. Default FileFormat detection type for files without fasta- or fastq-specific symbols is now DEFAULT. Wrote removemicrobes.sh to use the new common microbe filter. Added removemicrobes build support to RQCFilter. BBMask now supports wildcards for sam file paths. TODO: Read magic number of potentially gzipped files? TODO: String indicating exit failure. TODO: BBMap does not correctly track semiperfect sites and N rate when read Ns align with reference Ns. TODO: RQCFilter runs out of memory during khist for metagenomes. TODO: Add minprob to LogLog. TODO: Heejung encountered a random null-pointer exception in ByteFile2.run() at "list[loc]=s;". However, I manually examined the code and this state appears to be impossible. Perhaps it is a JVM bug? TODO: Autoset bits and prefilter for khist based on cardinality. 
TODO: KmerSet prefilter=1 onepass does not work (prefilter=2 onepass does work). TODO: Validate BBNorm results with k>31. TODO: Add summary of how many reads got removed to BBDuk. (Hemant). TODO: Add Tadpole ability to screen reads with kmers only occurring at most N times, or having errors/Ns after correction. (Torben). TODO: Program that can demux a sequence file into multiple sequence files randomly. TODO: SynthMDA with a short reference outputs lots of reads with Ns (Alex Copeland). TODO: Parser.parse should go at the end, not beginning, of parse blocks for all programs. TODO: Tadpole should keep nodes with only outward branches. TODO: Print kmer coverage information after Tadpole assemblies (Alex Copeland). ***TODO: Replace QuadHeap with a heap of longs. The current implementation is very slow on NUMA machines. **TODO: Compare Seal performance and correctness with countvector flag. One may be faster for large numbers of ref sequences. TODO: Seal mcf flag. TODO: Represent covariant depth as a vector with 1.0 for max depth for binning. *TODO: Allow kcompress direct set subtraction and intersection. *TODO: Add outu support to filterbyname *TODO: Speed up BBMap indexing. TODO: Print information about which reference sequences hit which locations in which reads, for Seal. TODO: Second extra base for BBDuk edit distance...? TODO: Thoroughly vet the assertions in CHECKSITES and SiteScore.setScore() to ensure they do not incur false-positive error messages. TODO: BBMap shave and rinse are reducing contig length at level 2. TODO: bbcountunique should use a longer K and have an offset rather than just looking at the first K bases. TODO: Pincer could handle arbitrary problems - indels, error bursts, etc. TODO: Tail can handle bursts if it simply continues until X bases concur. TODO: Port pincer/tail over to BBNorm. TODO: Use entropy to determine how many bases to extend past errors. TODO: BBMap is not handling pairing when ambig=all.
Pairing should be done at a SS level. (Elmar P). TODO: Tadpole multipass prefilter, and auto prefilter passes. TODO: BBMap MPI mode. TODO: Seal needs behavior with qhdist to be toggleable between searching or not searching for mutant kmers if a nonmutant kmer is found. TODO: BBDuk with hdist should reprocess the reference multiple times, first with hdist=0, then hdist=1, etc. That will improve specificity. *TODO: qtrim=r trimq=6 (or even 3) improves BBMerge rate for 2x250, 422 insert library. TODO: BBMerge - Track number of errors detected/corrected and error locations. TODO: Use small heap to reorder HashArray1D to optimize it. TODO: Dump kmers to text by way when max size is exceeded, then reload by way and re-dump low count kmers. TODO: Tadpole degenerate contig output. TODO: Locked/managed HashArray expansion. TODO: Fractional (1/4) way allocation per build thread. *TODO: extendToRight should return an exit code, not just true or false. May not need to be released. *TODO: Tadpole - first, classify all kmers as junction or non-junction (via ownership). ***TODO: Always verify that left max yields prev kmer/evicted base. If not, that is a hidden branch. *TODO: Allocation schedule for HashArrays. *TODO: Optional synchronized resize on final schedule slot to minimize memory use. TODO: MS state for MSA, always, for M1 state. TODO: extin and extout flags for BBMap. TODO: FastaReadInputStream asserts false for headers with no sequence. TODO: Speed up shaving (exploration) where possible. TODO: Seal kmer rank promotion with 1D arrays. TODO: Partition program, round-robin with equal number of bp or equal number of sequences per output. TODO: msa.sh should accept a file instead of literal. TODO: BBMap bed format (Alex C). TODO: BBMap fix for crash in filterbyname on sam file - SamLine 1490, assert(start_<=stop_). TODO: Reformat lhist and readlength.sh should have equivalent information. 
"I prefer readlength.sh info" TODO: Tadpole/KCE double-lock and double-buffer with LongLists for loading. TODO: xmx=auto or percentage TODO: reformat: multithread? TODO: (write scaffolder) TODO: (write polishing/consensus tool) TODO: (write breaker) V34. 34.00 Fixed a bug in BandedAlignerConcrete related to width being allowed to be even. 34.01 IdentityMatrix is now much faster for ghigh-identity sequences, and allows the 'width' flag to increase speed. Updated FilterReadsByName to allow "names=", supporting fastq, fasta, and sam. So, one file will be filtered according to the names of reads in a second file. "names=" where the file is just a list of names is still supported. 34.02 Fixed a couple errors in ConcurrentReadInputStreamD. Added fetching of a dummy list from "empty" for crisD, both master and slave. Added A_SampleD, which uses crisD. It now works correctly for master. Renamed various ConcurrentReadStreamInterface classes. Added an abstract superclass for all ConcurrentReadInputStreams, which extends Thread. Now, cris can be started directly without making a new thread. Changed all instances of wrapping cirs in a thread to just use start directly. These are mostly commented with "//4567" to find if something was missed (like starting the cris twice). Increased cris stability by removing "returnList(ListNum, boolean)" and replacing it with "returnList(long, boolean)". Lists may no longer be recycled. 34.03 Added scaffoldstats to BBQC and RQCFilter fileList logs. Requested by Bryce F. Fixed a strange deadlock in Dedupe/ConcurrentCollectionReadInputStream caused by making CRIS a Thread subclass. This will still occur if CRIS goes back to being a Thread. Noted by Shoudan. 34.04 Removed hitCount tracking from Seal. "qtrim=" is now allowed for all classes using Parser.parseTrim(). Parser.parseZip, parseInterleaved, parseQuality, parseTrim, parseFasta, and parseCommonStatic were integrated into most classes; reduced code size by almost 200kb. 
Parser.parseTrim got some extra functionality, like maxNs. Made an abstract superclass for KmerCount* classes, allowing removal of some code. Removed all KmerCount.countFasta() methods; they must now use a CRIS. Retired ErrorCorrectMT (superseded by KmerNormalize). Fixed bug in BBDuk, Seal, and ReformatReads - when quality trimming and force-trimming, count of trimmed reads could go over 100%. Now these counts are independent. Noted by ysnapus (SeqAnswers). Removed "minscaf" and "mincontig" flags from Parser.parseFasta() because they were conflated. Determined cause of Kurt's error message in Dedupe - lower-case letters can trigger a failure. Dedupe now defaults to "tuc=t" (all input is made upper-case). Moved CRIS factory from CGRIS to CRIS. Copied cc2-cc5 to /global/projectb/sandbox/gaag/TestData/SingleCell/SimMockCommunity/plate*/. These are simulated cross-contaminated single cell plates. Removed conflated "qual" flag from RandomReads; "q" should be used instead to set all read quality values to a single number. Fixed conflated "renamebymapping" flag in RenameReads. "tbr" flag is conflated in KmerNormalize; adjusted so that it now controls both "tossBadReads" (reads with errors) and "tossBrokenReads" (reads with the wrong number of quality scores). Conflated "gzip" flag in ChromArrayMaker/FastaToChromArrays changed to "gz". Handled conflated "ziplevel" flag in AbstractMapper. Conflated "fakequality" flag resolved by moving from BBMap to Parser and renaming "fakefastaquality"/"ffq". Added hdist2 and edist2 to BBDuk. These allow independently specifying hdist/edist for full-length kmers and short kmers when using mink. Added trimhdist2 to RQCFilter/BBQC. *** Added path and mapref flags to RQCFilter/BBQC; they can now map to an arbitrary genome instead of just human. Added Shared.USE_MPI field (parsed by Parser.parseCommonStatic; "mpi" or "usempi"). Added Shared.MPI_RANK field (should be set automatically). Added Shared.MPI_KEEP_ALL field. 
This controls whether CRISD objects retain all reads, or just some of them. CRIS now automatically returns a CRISD when USE_MPI is enabled, as a slave or master depending on whether rank==0. ListNum is now Serializable. CRISD now transmits ListNum objects rather than ArrayLists, so that the number is preserved. Added Maxns flag to reformat. Fixed BBQC and RQCFilter's unnecessary addition of "usejni" to BBMap phase, since it is now already parsed by parseCommonStatic. BBQC now defaults to normalization and ecc off, but can be enabled with the "ecc" and "norm" flags, and supports cecc flag. Added notes on compiling JNI version suggested by sdriscoll. 34.05 Commented out a reference to ErrorCorrectMT in MateReadsMT. 34.06 FindPrimers (msa.sh) now accepts multiple queries (primers) and will use the best-matching of them. Added a BBMap flag to disable score penalty due to ambiguous alignments ("penalizeambiguous" or "pambig"). Requested by Matthias. Fixed failure to start CRIS in A_SampleD. Fixed some incorrect division in CRISD. Added MPI_NUM_RANKS to Shared. This is parsed by parser via e.g. "mpi=4". Added BBMap flags subfilter, insfilter, delfilter, inslenfilter, dellenfilter, indelfilter, editfilter. These function similarly to idfilter. Requested by sdriscoll. 34.07 Dedupe now automatically calls Dedupe2 if more than 2 affixes are requested. Added "subset" (sst) and "subsetcount" (sstc) flags to Dedupe. Added "printLengthInEdges" (ple) flag to Dedupe. 34.08 Finished Dedupe subset processing for graph file generation. 34.09 Fixed bug where 'k' was not added to filename in RQCFilter. Noted by Vasanth. 34.10 Documented "ordered" and "trd" flags for BBDuk/Seal. Added crismpi flag to allow switching between crisd and crismpi. Added shared.mpi package, containing MPIWrapper and ConcurrentReadInputStreamMPI. 34.11 Added detection of read length, interleaving, and quality coding to FileFormat objects, but these fields are not currently read. 
FileFormat.main() now outputs read length, if in fastq format. Reformat will now allow sam -> sam conversion; not useful in practice, but maybe useful in testing. Added flag "mpikeepall", default true. Fixed deadlock when mpikeepall=false. Noted by Jon Rood. 34.12 Added 'auto' option to gcbins and idbins flags. Requested by Seung-jin. Added dedupe "addpairnum" flag to control whether ".1" is appended to numbered graph nodes. Added real quality to qhist plot, when mhist is being generated. Moved maxns and maq to AFTER quality trimming in RQCFilter and BBDuk. Added "ftm" (forcetrimmodulo) flag to BBDuk/Reformat/RQCFilter/BBQC. Default 5 for RQCFilter/BBQC, 0 otherwise. 34.13 Fixed a missing "else" in RQCFilter/BBQC. Noted by Kurt LaButti. 34.14 Added .size() to ListNum. CrisD gained "unicast" method. Also, unicast and listen now have mandatory toRank parameter. Made CrisD MPI methods protected rather than private, so they can be overridden. Refactored RTextOutputStream3. 34.15 Added Shared.LOW_MEMORY: Disables multithreaded index gen. Disables multithreaded ReadWrite writeObjectInThread method. Disables ByteFile2. For some reason it does not really seem to reduce memory consumption... Added BBMap qfin1 and qfin2 flags. Updated BBMap to use more modern input stream initialization. Added mapnt.sh for mapping to nt on a 120g node. 34.16 Changed RQCFilter "t" to mean "trimmed"; "k" was removed. Added parser noheadersequences (nhs) flag for sam files with millions of ref sequences. Documented "ambig" flag in Seal. Fixed issue where Shared.READ_BUFFER_NUM_BUFFERS was not getting changed when THREADS was changed. Now both are private and get set together. Verified that mapnt.sh works on 120G nodes. 34.17 RTextOutputStream3 renamed to ConcurrentReadOutputStream. ReadStreamByteWriter refactored to be cleaner. Merged MPI dev branch into master. 34.18 Moved Seal's maxns/maq to after trimming. Added chastity filter to bbduk and reformat (reads containing " 1:Y:" or " 2:Y:"). 
Requested by Lynn A. Dedupe outd stream now produces correctly interleaved reads. Requested by Lynn A. Replaced Dedupe TextStreamWriters with ByteStreamWriters, for read output. 34.19 Added parseCommon() to BBDuk, allowing samplerate flag. 34.20 FASTA_WRAP moved to Shared. Numeric qual output is now wrapped after the same number of bases as fasta output. "Low quality discards:" line is now triggered by chastity filter. SPLIT_READS and MIN_READ_LEN are now disabled when processing reference in BBDuk/Seal. Seal gained parseCommon and parseQuality. 34.21 Fixed MIN_READ_LEN bug (set to 0; should have been set to 1) 34.22 Added qfin (qual file) flags to BBDuk/Seal. Applied BBDuk restrictleft and restrictright to filtering and masking; before, it was only valid for trimming. Added calcCigarBases. Required includeHardClip parameter for all calls to calcCigarLength(), start(), or stop(). Fixed bug in pileup caused by hard-clipped reads. Noted by Casey B. 34.23 DecontaminateByNormalization was excluding contigs with length under 50bp, which caused an assertion error. Fixed a crash in BBDuk2 when not using a reference. Noted by Dave O. Added entropy filter to BBDuk/BBDuk2. Set "entropy=X" where X is 0 to 1 to filter reads with less entropy than X. 34.24 Added maxreads flag to readlength.sh. Fixed bug in BBMap - when directly outputting coverage, secondary alignments were never being used. BBMap now uses the "ambig" and "secondary" flags to determine whether to include secondary site coverage. Specifically, "ambig=all" will use secondary sites, while other modes will not unless "secondary=t". In other words, use of secondary sites in coverage will be exactly the same as use of them in a sam output file. Removed "uscov=t Include secondary alignments when calculating coverage." from shellscript. Fixed minid trumping minratio when both were specified. Now, the last one specified will be used. 
Added pileup support for reads with asterisks instead of bases, as long as they have a cigar string. Also sped up calculation of read stop position. Cigar string 'M' symbols are now converted to match string 'N' symbols if there is no reference. 34.25 BBMerge initialization order bug fixed; it was preventing jni from being used with the "loose" or "vloose" flags. Noted by sarvidsson (SeqAnswers). 34.26 Fixed semiperfect mode allowing non-semiperfect rescued alignments. Noted by Dave O. Fixed ReadStats columns header for qhist when mhist was also generated. Fixed an inequality in BBMergeOverlapper that favored shorter overlaps with an equal number of mismatches, in some cases. Had no impact on a normal 1M read benchmark except when margin=0, where it tripled the false-positive rate. 34.27 Enabled verbose mode in BBMergeOverlapper. 34.28 Added "align2." to sam header command line of BBMap. Fixed bug in BBMap that could cause "=" to be printed for "rnext" even when pairs were on different scaffolds. Noted by rkanwar (SeqAnswers). 34.30 Reformat can now produce indelhist from sam files prior to v1.4. Fixed a crash bug in BBMap caused by an improper assertion. Noted by Rob Egan. 34.31 BBDuk/Seal now recognize "scafstats" flag as equivalent to "stats". Seal now defaults to 5 stats columns (includes #bp). Wrote BBTool_ST, an abstract superclass for singlethreaded BBTools. Clarified documentation of "trimq=X" as meaning "regions with average quality under X will be trimmed". Fixed major bug in RQCFilter/BBQC: "forcetrimmod" was being set to same value as "ktrim". Noted by Brian Foster. 34.32 Changed the way BBMerge handles qualities to make it 40% faster (in java mode). Reduced size of jni matrix accordingly. Fixed lack of readgroup tags for unmapped reads in sam format. Noted by Rahul (SeqAnswers). Ensured Read.CHANGE_QUALITY affects both lower (<0) and upper (>41) values. 34.33 Pushed BBMergeOverlapper.c to commit. 
34.34 Documented trimfragadapter and removehuman in RQCFilter. Added Parser flag for Shared.READ_BUFFER_LENGTH (readbufferlength). Added Parser flag for Shared.READ_BUFFER_MAX_DATA (readbufferdata). Added Parser flag for Shared.READ_BUFFER_NUM_BUFFERS (readbuffers). RQCFilter now accepts multiple references for decontamination by mapping. Added FuseSequence (the first BBTool_ST subclass) and fuse.sh, for gluing contigs together with Ns. Reformatted many scripts' help info to remove echo statements. Fixed bugs in stats and countgc; they were not including undefined bases when printing the length in gcformat=1 and gcformat=4. Replaced all instances of .bases.length with .length(), to prevent null pointer exceptions (for example in sam lines with no bases). Added cat and dog flags to rqcfilter. Changed defaults of BBMask to reduce amount masked in cat and dog to ~1% of genome. This still masks all of the coincidental low-complexity hits from fungi. Determined that dog is contaminated with fungus, particularly chr7 and chr13. 34.35 Fixed a bug in which data was retained from the prior index when indexing a second fasta file in nodisk mode. 34.36 Disabled an assertion in BBMerge that the input is paired; it crashes if the input file is empty. 34.37 NSLOTS is now ignored if at least 16, to account for new 20-core nodes. ReadWrite.getOutputStream now creates the directory structure if it does not already exist. Problem discovered by Brian Foster. BBQC and RQCFilter now strip directory names before writing temp files. BBDuk now correctly reports number of reads quality-filtered. Added "unmappedonly" flag to Reformat. RQCFilter now defaults to using TMPDIR. 34.38 BBMap now prints reads/second correctly. Before, it actually displayed pairs/second with paired data. Added maxq flag to BBMerge, which allows quality values over 41 where reads overlap. Requested by Eric J. Changed CoveragePileup from TextFiles to ByteFiles; increased read speed by 3.65x. 
Changed CoveragePileup from TextStreamWriters to ByteStreamWriters; increased write speed by 1.46x. Fixed a bug in BBQC/RQCFilter: paired input and interleaved output was getting its paired status lost. Noted by Simon P. Reformat, when in "mappedonly" or "unmappedonly" mode, now excludes reads with no bases or secondary alignments. 34.39 Human contaminant removal is now optional in BBQC. 34.40 ConcurrentReadOutputStream made abstract superclass. Added ConcurrentGenericReadInputStream, the default implementation. Added ConcurrentReadOutputStreamD, distributed template. Merged some duplicate methods in MPIWrapper/ConcurrentReadInputStreamMPI. 34.41 Added some features to CoveragePileup, FilterByCoverage, and DecontaminateByNormalization to quantify low-coverage regions on otherwise high-coverage contigs. Added parser fastadump flag to toggle dumping of kmers as fasta vs 2-columns. Fixed a couple bugs in RQCFilter which mixed up names of stats files for trimming and filtering. RQCFilter will now map to cat, dog, and human together with BBSplit if all three are specified, and produce "refstats.txt". BBDuk/Seal now support ambiguous IUPAC codes in reference sequences. 34.42 ByteFile now returns empty lines as byte[0] instead of null. This allows processing of fastq files with 0-length reads. Noted by lankage (SeqAnswers). Fixed a bug in FastaToChromArrays2 - blank lines in fasta files were interpreted as breaks between sequences. Noted by Alex Spunde. Fixed "unmappedonly" flag in reformat.sh - it was providing inverted output. Noted by Kristin T. 34.43 Improved MD tag generation. Reference Ns were not being counted, and unnecessary zeros were appearing between adjacent substitutions. Noted by Jason S. expectedErrors() and averageQuality() both now require a boolean parameter, includeUndefined. Fixed a bug in BBQC's output directory - primary output was going to scratch. Noted by Simon P. Added path to BBSplit's help menu. Noted by Ed K. 
34.44 TranslateSixFrames now can accept AA input and produce NT output. Merged dev branch into master. Enabled CrosMPI to be created when CRIS_MPI is set to true. BBDuk and Seal now use MPI streams correctly for reading the reference (when MPI is enabled). Added truseq_rna.fa.gz to resources. 34.45 Added BBDuk skipr1/skipr2 flags. Requested by Stephanie H. Fixed a null pointer in ConcurrentReadOutputStreamD. 34.46 Added SamLine.parseFlagOnly(byte[]) for rapid classification of sam lines. Revised SplitSamFile and added splitsam.sh to the public distribution. It's now fast (~540MB/s). Added a table of contents to /resources/. 34.47 Fixed a bug around parens in FindTipDeletions; it was sometimes running when it should have been disabled. Added swap flag to Reformat, for substituting one base for another (as in bisulfite treatment). Added underscore flag to Reformat. Fixed threads flag in BBDuk; it was getting parsed in 2 places and never set. Added qhdist/qhdist2 flags to BBDuk/Seal for mutating query kmers. Suggested by sdriscoll (SeqAnswers). Corrected mkh flag in Seal. Noted by Vasanth S. FastaReadInputStream now has a mandatory amino field in constructor. 34.48 Bitsets and coverage arrays can now both be disabled in pileup. Reorganized buffer lengths in BBIndexPacBio to reduce memory usage and support long (6000bp) reads with shorter kmers, down to 9bp. Added small rna adapter path to BBQC/RQCFilter. Fixed FilterReadsByName processing of sam files; bug found by Marissa Miller. Accelerated and reduced memory usage of FilterReadsByName; moved name parsing over to Tools. Added ReadStreamWriter.USE_ATTACHED_SAMLINE. 34.49 Fixed qin/qout flags in many classes; they were being ignored. Noted by Jason H. Added Nextera LMP adapter sequences. 34.50 Added AssemblyStats format=7. Requested by Andrew Tritt. 34.51 Added physical (aka fragment) coverage flag to Pileup and BBMap. Added rpkm/fpkm output to Pileup and BBMap. Requested by Vasanth. 
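For reference, the rpkm/fpkm columns in 34.51 use the standard expression units: count per kilobase of feature per million mapped reads (RPKM) or mapped fragments (FPKM). The sketch below only illustrates that arithmetic. The class and method names are hypothetical and are not the BBTools implementation; the important point is that the numerator and the denominator must use the same unit (reads with reads, fragments with fragments).

```java
// Illustrative sketch only; names are hypothetical, not BBTools code.
final class ExpressionUnits {

    /**
     * Generic "per kilobase per million" value:
     *   count * 1e9 / (featureLengthBp * totalMapped)
     * Pass reads and total mapped reads for RPKM,
     * or fragments and total mapped fragments for FPKM.
     * Returns 0 when the length or total is not positive (an assumption of this sketch).
     */
    static double perKilobaseMillion(long count, long featureLengthBp, long totalMapped) {
        if (featureLengthBp <= 0 || totalMapped <= 0) {
            return 0.0;
        }
        return (count * 1e9) / ((double) featureLengthBp * totalMapped);
    }

    public static void main(String[] args) {
        // A 2000 bp scaffold hit by 500 reads (250 pairs),
        // out of 1,000,000 mapped reads (500,000 mapped pairs).
        double rpkm = perKilobaseMillion(500, 2000, 1_000_000);
        double fpkm = perKilobaseMillion(250, 2000, 500_000);
        // Mixing units (fragments over mapped reads) gives a different value.
        double mixed = perKilobaseMillion(250, 2000, 1_000_000);
        System.out.println("RPKM=" + rpkm + " FPKM=" + fpkm + " mixed=" + mixed);
    }
}
```

With fully paired data the consistent RPKM and FPKM values agree, while the mixed-unit value is half as large.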
Changed Seal FPKM calculation; it was dividing by number of mapped reads rather than number of mapped fragments. 34.52 Added SmallKmerFrequency and commonkmers.sh. Requested by Bill A. Fixed a bug in ReadStreamByteWriter; "attachment" mode was printing a period instead of newline. 34.53 Added graphical display of GC level to gchist. The gcplot flag works with all programs that use gchist. Requested by Kecia D. Added reparse to BBTool_ST. This allows parsing of subclass fields which are otherwise overwritten by their defaults. Added count output to SmallKmerFrequency. 34.54 Added cumulative column to gchist. This is also enabled by the gcplot flag. Requested by Seung-Jin. Added BBMap normcov and normcovo flags. Requested by Vasanth. Added support for out= to stats and statswrapper. Requested by Brian Foster. Fixed a bug in which stdout was being closed by closing a PrintWriter that wrapped it. Disabled a message about read pairing for sam input. Finished DedupeByMapping and created dedupebymapping.sh. 34.55 Fixed a bug in BBMap's coverage flags; normcovo was called normcovOverall. Noted by Matt Nolan. 34.56 Fixed qin flag being ignored by BBMap. Noted by Adrian P. Removed obsolete classes ReformatFasta and ReverseComplement (both handled by ReformatReads now). 34.57 Added BBMap timetag flag and thist output. Fixed bug in AssemblyStats GC output. Noted by Jasmyn P. Added format=0 to stats (no output). 34.58 Version bump. 34.59 BBSplit now supports # operator in filenames. Requested by Vicente G. BBMap now prevents cross-scaffold alignments if any output file is sam or bam, not just the primary one. Reformat now has a primaryonly flag to prevent output of secondary alignments. Added KillSwitch class. This will kill the process after X seconds with under Y CPU utilization. It is invoked by the command line argument "monitor" for most programs. 34.60 BBDuk can now remove reads with less than X% of any single base. Requested by Alicia C. 
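A minimal sketch of what a "less than X% of any single base" filter could look like. The class and method names are hypothetical, and the details (case-insensitive counting, using full read length including Ns as the denominator, rejecting empty reads) are assumptions for illustration rather than BBDuk's actual behavior.

```java
// Illustrative sketch only; names and edge-case handling are assumptions, not BBDuk code.
final class BaseFractionFilter {

    /**
     * Returns true if each of A, C, G, and T makes up at least minFraction
     * of the read. Symbols other than ACGT (e.g. N) count toward the read
     * length but not toward any base. Empty reads are rejected.
     */
    static boolean passes(String bases, double minFraction) {
        if (bases.isEmpty()) {
            return false;
        }
        int[] counts = new int[4]; // A, C, G, T
        for (int i = 0; i < bases.length(); i++) {
            switch (Character.toUpperCase(bases.charAt(i))) {
                case 'A': counts[0]++; break;
                case 'C': counts[1]++; break;
                case 'G': counts[2]++; break;
                case 'T': counts[3]++; break;
                default: break; // undefined or ambiguous symbol
            }
        }
        int rarest = Math.min(Math.min(counts[0], counts[1]),
                              Math.min(counts[2], counts[3]));
        return rarest >= minFraction * bases.length();
    }

    public static void main(String[] args) {
        System.out.println(passes("ACGTACGT", 0.20)); // balanced read
        System.out.println(passes("AAAAAAAC", 0.05)); // no G or T at all
    }
}
```

A filter of this kind mainly catches low-complexity reads that happen to have acceptable quality scores, such as homopolymer runs.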
Added reformat 'filterbits' and 'requiredbits' flags. Removed obsolete colorspace-specific fields Read.expectedErrors and Read.mapLength. Wrote SplitNexteraLMP.java and splitnextera.sh. Added Read.subRead(from, to). Added BBMap call to ReadStats.checkFiles() to force a crash before running, rather than after running, if there are problems. Changed nextera_LMP_linker.fa.gz to a double linker after examining real data. Modified BBTool_ST for greater flexibility with additional IO streams. Wrote MultiStateAligner9XFlat.java. For testing. A flatter, faster MSA. 34.61 Completely removed all support for Solid colorspace. Removed ChromosomeArrayCompressed. Removed MultiStateAligner9fs. Removed FastaStream, QualStream, FastqReadInputStream_old. Renamed FastaQualReadInputStream3 to FastaQualReadInputStream. Added kmer.TableLoaderLockfree. This unifies the load portion of BBDuk and Seal for filling AbstractKmerTables. Added kmer.TableReader. This makes it easy to read data from kmer tables. Added error message for KmerCountExact if k<1 or k>31. Added warning to BBDuk if no kmers are loaded but a kmer operation is specified. Added kmask flag to BBDuk.sh help and clarified ktrim flag. Timer now automatically self-starts upon creation. Added NexteraLMP support to rqcfilter. Added versions of Illumina contaminant files without Nextera adapter junctions. 34.62 Fixed null pointer in BBDukF. Found by Alicia C. 34.63 Created tax package for processing NCBI taxonomy data. Added IntList.getUniqueCount() Added TaxNode and TaxTree, for accumulating taxa counts. Added GiToNcbi, for translating gi numbers to taxa ids. Added RenameGiToNcbi, for renaming sequences (e.g. nt) with their taxa id. Added SortByTaxa, for sorting sequences based on taxonomy for better compression. TaxTree and GiToNcbi now support serialized input; much smaller. Integrated taxonomy support into Seal. Seal now uses 9 ways, and uses pigz when loading the reference. 
Added ftr2 (forcetrimright2) flag, which allows trimming a fixed number of bases on the rightmost end. Added BBMap parsing logic to prevent bad values of maxsites and maxsites2. Fixed BBTools failure to find primes.txt.gz if there is a space in the classpath (Matt Kearse). All calls to average quality now require a max number of bases to process. Added maqb flag (min average quality bases); maq calculation will be restricted to that many leading bases. (Shoudan) Retired FilterReads (superseded by Reformat and BBNorm). Added reformat "tossjunk", "fixjunk", and "aminoin" flags. 34.64 Fixed off-by-one error in forcetrimright2. 34.65 Made gi2taxid.sh, which calls RenameGiToNcbi. RenameGiToNcbi updated to split input into valid and invalid output, where invalid gets anything with no taxid. Added FileFormat.isFasta(String) method. Improved SortByTaxa. Now does preorder traversal of tree, and supports dummy nodes, fusing, and promotion. Added sortbytaxa.sh. BBSplit ref= can now point to directories. Requested by Manuel K. 34.66 Fixed an uncaught overflow in ByteBuilder.expand(). Added SortByTaxa max fusion length. Added barcode filtering to Reformat and BBDuk. Sped up chastity filtering and allowed it to process reads with / before read number. Added chastity filtering and barcode filtering to RQCFilter. 34.67 Fixed BBDuk double-counting chastity/barcode-filtered reads. Fixed BBDuk overcounting of reads that were trimmed by overlap. BBMap nzo flag now affects refstats and scafstats in addition to covstats (Vasanth). Added BBMap sortstats flag. Added BBMap rebuild flag (Vasanth). 34.68 Added qtrim=window flag (Alicia). Added slashspace flag to Reformat (disable space when adding /1 and /2 to read names). Added clearzone to Seal (Vasanth). Added stoptag to Reformat. Added boundstag to Parser (Shoudan). 34.69 Slight fix for samline inbounds detection. Fixed a corrupt Truseq RNA adapter sequence. Added Truseq RNA adapters to adapters.fa. 
34.70 Added BBMerge useratio mode (enabled by default in vloose mode). Added BBMerge adapter processing. Added BBDuk kmask=lowercase. 34.71 Added BBMerge uloose and vstrict modes. Added BBMerge requireratiomatch, ratiominoverlapreduction, and ratiooffset flags. 34.72 Adjusted BBMerge uloose settings. 34.73 Added RandomReads path flag. Increased BBMerge defaults to -Xmx1000m and readbufferlen=400 to improve scaling. baseToComplementExtended array now maps lowercase letters to lowercase letters. Changed BBMerge to use floating-point probabilities. Fixed missing quote mark in bbsplit.sh. Noted by Manuel K. 34.74 Added Reformat quantize flag (Alicia Clum). 34.75 RandomReads ignored q=x in perfect mode, and perfect flag did not work without a number. Added Reformat skipreads flag. Added DualCris, for dual input files of unequal length. Repair.sh now works if r1 and r2 files are unequal length. Added flags for BBMerge normalmode and ratiomode. Made BBMerge default to ratiomode-only for default and loose stringencies. BBMerge efilter now enabled by default in all modes. Improved BBMerge efilter (now occurs after both merge modes). Updated BBMerge efilter to examine only the trailing X bases depending on mininsert setting. Made normalmode and ratiomode more independent, so requireratiomatch flag is more efficient. Added ratiomode settings for strict, vstrict and ustrict. Ustrict has requireratiomatch enabled. Increased readlength.sh default memory and reduced number of reads in buffer. Added BBMerge ordered flag; disabled by default. 34.76 Accelerated BBMerge by using a buffer for quality values translated to probabilities. Added early exits before mininsert0 and minoverlap, and added mininsert0 flag. Added ecc mode, for correction only and no merging. Tested removal of runtime division; speed unchanged. 34.77 Added BBMerge pfilter flag; discards overlaps with low probability mismatches. Fixed BBMergeOverlapper.expectedMismatches() and probability(). 
Both were considering the wrong bases. Increased mateByOverlapRatioJava speed with altBadlimt. Slightly increases false-positive rate. Added reformat itn flag (convert iupac symbols to N). Simplified mateByOverlap and mateByOverlapRatio - removed no-quality loop. BBMerge uloose settings tweaked; no longer uses normalmode. 34.78 Fixed BBDuk compatibility with new BBMergeOverlapper float array requirement. 34.79 Integrated new BBMergeOverlapper ratiomode call into BBDuk. Uses settings similar to strict mode. Adjusted BBMerge defaults some more. 34.80 Added outu flag to demuxbyname. Added overlapWithoutQuality (owq) and overlapUsingQuality (ouq) flags to BBMerge; default is ouq=f owq=t. Made a quality-free loop in BBMergeOverlapper now that quality is disabled by default; 20% faster. Removed minoi flag and some legacy fields from Overlapper that dealt with mapping information. 34.81 Split read verification into a different function outside of constructor. Added a mode for worker threads to verify reads, instead of at construction time. Added input file check to BBMerge. Greatly increased speed of overlapper with findBestRatio function. 34.82 Minor changes to BBMerge entropy calculations. Added BBMerge static function errorCorrect(). Changed fast from normalmode to ratiomode. Added ecc flag to BBDuk and Seal. 34.83 Fixed DualCris and SplitPairsAndSingles (repair.sh). Crash bug found by GenoMax. 34.84 Added overlap flag to BBNorm to regulate whether overlapping is used for error correction. BBMerge now has static functions mergeableFraction() and makeInsertHistogram(). Added ecc flag to kmercountexact. Added normbins flag to BBMap and fixed normcovo flag. 34.85 Finished ArrayListSet. Created MultiCros and a functional reference implementation in main(). Upgraded Seal handling of refnames mode; now all sequences for a ref file get the same ID (in that mode). Added BBSplit-style output for Seal refstats. Requested by Vasanth. Removed outsingle from Seal. 
Added multiple output streams to Seal with the pattern flag; acts like basename in BBSplit. DemuxByName now supports arbitrary output streams without needing a name list. Moved directory parsing from BBSplit to Tools.getFileOrFiles(). 34.86 Fixed text string for output stats of FilterByName. Noted by Alex C. Tiny change to BBMergeOverlapper regarding Ns. Added mkf (minkmerfraction) flag to Seal. Requested by Vasanth S. 34.87 Added /ref/qual/ directory for recalibration matrices. Added recalibrate() function to CalcTrueQuality. recalibrate (recal) is now a parser flag and has been enabled for BBMerge, Reformat, and BBDuk. Added fixheader flag to parser. Requested by Shijie. Changed parseInt to parseLong for matrix loading. Added qb123 matrix. Added estimate by max observed error rate rather than average. 34.89 Added unicode2ascii.sh for fixing files with strange symbols. 34.90 Fixed FastaReadInputStream's ability to process extended ascii characters (128-255). Fixed BBDuk, Reformat, and Seal sometimes ignoring the ftr2 flag. Added adapter sequence detection to BBMerge (outa flag). Made CalcTrueQuality multithreaded. Fixed ROOT_QUALITY not getting set with the path flag. Added path flag to BBDuk and BBMerge. Added notags flag to BBMap. 34.91 CalcTrueQuality now tracks paired reads independently. BBDuk and various programs no longer deadlock waiting for sam header to be read. Added BBMerge iupacton (itn) flag. Seal can now write sam output from sam input. BBDuk can write sam output from sam input, but only for quality recalibration, not other operations. Made RemapQuality to test recalibration. Internal 2-pass recalibration is now working. Changed $CMD to eval $CMD in all shellscripts, which allows escaping spaces in filenames using backslash. Thanks Jon! Added qchist (qualityCountHistogram). Gives counts of bases with each quality score. Fixed BBDuk not testing to see if qahist was set. 
Changed CalcTrueQuality default observationcutoff to 100, because higher settings cause odd-looking graphs. 34.92 Added RenameReads prefixonly flag. Added driver.SummarizeCoverage to summarize cross-contamination scafstats files. Changed calcmem.sh to work correctly even if ulimit fails. Noted by Tomas B. Thanks to vladr (stackoverflow)! Changed bbduk.sh/bbduk2.sh default ram to 1400m from 2000m so that they should work on 32-bit MacOS systems without setting the -Xmx flag. Removed some redundancy from TextFile. DecontaminateByNormalization now named CrossBlock. Added Read.uToT and parser utot flag for converting uracil to thymine in reads. BBMap now converts U to T when generating the reference, and all degenerate bases to N. Changed SummarizeCoverage to be memory-efficient with large files (e.g. coverage vs nt). Removed colorspace from ChromosomeArray. Fixed a bug in which cigar strings were sometimes not printed for secondary alignments or when using filters. Noted by Jason S. Added qb12 matrix to CalcTrueQuality. Added support for adjusting read quality score limits beyond 2~41 with the mincalledquality and maxcalledquality flags. Recalibration matrices may be extended above Q41 with the recalqmax flag, for processing consensus reads. 34.93 Fixed version flag in parser; it was being ignored if there were arguments. Fixed calcmem.sh behavior on a Mac (or other system without /proc/mem) when ulimit=unlimited. Fixed issue where BBTool_ST subclasses were having params overridden by defaults. ReadStats now correctly tracks read1 and read2 for qhist using sam input, instead of lumping them together. Fixed CalcTrueQuality's recalibration of sam input; it was applying the read1 profile to both reads. Finally works perfectly. Added BBSplit force-rebuild logic. Coverage histogram now goes to 1 million in 32bit mode, instead of 64k. Added pjet to rqcfilter/bbqc. Fixed a division-by-zero bug in ReadStats. Noted by Seung-Jin. 
Removed confusing readme line stating that BBMap is free for noncommercial use. This is true, but it is also free for commercial use. Added stdev to covstats output (as long as arrays are enabled). Added driver.SummarizeSealStats and summarizeseal.sh for analyzing cross-contamination results using Seal stats output. Added summarizescafstats.sh script for driver.SummarizeCoverage. Added jgi.FilterReadsWithSubs to select only those reads with substitution errors for bases in a specified quality range. Added phix_adapters.fa.gz to /resources and updated contents. Added qahist average deviation header line. 34.94 BBMap now produces an error message if indexing fastq files rather than just crashing. Added config file support to all BBTools via the config= flag. Added some sample config files and a config file readme. Moved bloom filter and count-min sketch data structures to bloom package. 34.95 Added trd flag to reformat.sh help. Noted missing by Esther Singer. Added merge support to SplitNexteraLMP. Currently unknown which is better, merge=t or f. Added better output names to RQCFilter. Requested by Bryce F. Added kmer ownership to AbstractKmerTable. KmerLink is now an AbstractKmerTable subclass. Wrote Tadpole. Fixed bug when reading empty fasta files. Noted by Matt K. Added SamLine.countTrailingClip, and modified countLeadingClip. Now both have soft/hard clipping toggles. Removed static SamLine.SUBTRACT_LEADING_SOFT_CLIP and replaced with required parameter. Reduced default initial CoverageArray size from 16m to 500. Added mincov, maxcov, and delcov flags to BBMask. Updated BBMask readme. Split IntList/LongList toString method into SetView and ListView. Added Pileup toggle for including soft-clipped bases; default false. Made Tadpole shell script. Fixed Seal stats header indicating over 100% matched sequences with ambig=all. Noted by Esther Singer. AbstractKmerTable classes now correctly return -1 instead of 0 if a key is not present. 
Added AbstractKmerTable.clearOwner() to clean up ownership trails of abandoned contigs. Fixed various bugs in Tadpole. Added substring and case flags to FilterByName. Requested by Esther Singer. Tadpole now works correctly multithreaded. 34.96 Fixed a bug in FastaReadInputStream not shutting down subprocesses when done. Added LMP insert-size detection mode to Tadpole. Added ByteBuilder.appendKmer(kmer, k). Added Tadpole read-extension mode. Tadpole now builds contigs from kmer seeds rather than contig seeds. Slower but more consistent. Tadpole is now a complete assembler. 34.97 Added length, coverage, and GC to Tadpole contig names. Added directional substrings for filterbyname. Requested by Esther Singer. Disabled module lines in reformat.sh. Noted by Xiaoli D. Renamed summarizecoverage to summarizescafstats. Added ambig and kfilter flags to CrossBlock. Requested by Ken H. 34.98 BBDuk/BBDuk2 default maxrskip set to 1 (disabled), to reduce confusion. Fixed Seal generating ArrayListSets even when pattern output was not specified. Fixed Seal bug classifying both read1 and read2 as matched when only one matched in kpt=f mode (this IS the intended behavior in default kpt=t mode). Noted by Alex Spunde. Fixed string compare bug in FilterReadsByName making substring=header and substring=name fail. Noted by Esther Singer. 34.99 Fixed BBMap not printing coverage statistics when machineout=t. Noted by Vasanth Singan. Added minlen flag to filterbyname. Changed Dedupe primary structure from HashMap to LinkedHashMap to (somewhat) preserve input order. TODO: BBDuk crashes with K>31 (Alex Spunde). TODO: Memory autodetection does not work on Amazon. TODO: BBMap machineout to file (Vasanth). TODO: chrombits and CHROMS_PER_BLOCK may be obsolete and ready to remove. TODO: out=stdout.bam does not work. TODO: Include deletions toggle for Pileup. TODO: Soft-clipping coverage flag. TODO: Add match/cigar/SamLine trimming to TrimRead. TODO: Write Hollow. TODO: Multithread splitnextera. 
TODO: config flag in Parser TODO: Normalize CalcTrueQuality on 50% GC by tracking GC rates (etc) observed in reads. TODO: Make Recalibrate class and recalibrate.sh to automate everything. TODO: Track quality-score accuracy per base location. TODO: Track quality-score accuracy per base letter. TODO: Tool to extract reads mapped to a specific locus. TODO: Make it easy to test a decontam tool on the synth datasets. TODO: Map unknowns in 48-sample-plate. TODO: BBMerge return codes. -1 no solution, -2 ambig, -3 too long (short overlap), -4 too short. TODO: Seal speed and mkf flags should work together. TODO: Apply Seal refnames upgrade to taxonomy handling, if not already done. TODO: BBNorm histout with 1pass/ecc does not seem to generate anything. TODO: randomreads does not name reads by origin in fasta format. TODO: Hamming distance for demuxbyname. TODO: MultiCros wrapper and hash-based multi-listnum object. TODO: Reformat should be able to trim mapped sam files (Aldo J). TODO: Mask bases overlapping from Dedupe graph (Shoudan). TODO: RQCFilter - dynamically switch between $TMPDIR or /dev/shm depending on input size and available disk space. TODO: BBMerge - trim adapters for unmerged reads (?) TODO: Fungal pipeline: FindErrors? *TODO: BBMap calls calcCorrectness even when data is not synthetic. TODO: BBMap File containing all reads/pairs that are not completely contained within a single contig. (Shoudan) TODO: BBDuk/Seal - enable tracking of kmers by reference file rather than reference sequence. TODO: Batch setting for BBDuk to operate on multiple files and auto-name output. TODO: Get data from Chris B, count mismatched pairs, send to E. TODO: Stats does not accurately estimate BBMap RAM usage for K=15. TODO: Accelerate maxindel=0 mode for BBMap by banning MSA usage. TODO: Redo DedupeByMapping so that it can handle sorted input using a heap. TODO: MSA Flat - remove states to increase speed. TODO: Dedupe does not work with sam input. (Lynn A.) 
TODO: Change all instances of "remove bases with quality below minq" to "...trimq" in shellscripts. TODO: Parse extra part of sam lines into a byte array (optionally). *TODO: Dedupe crash on input in C:\temp\dd1\bad.fa (Shoudan). TODO: Tile-based statistics and filtering for BBMap, BBDuk, etc. TODO: Pileup could calculate ref/nonref coverage. TODO: Marcel wants a program to essentially sort reads and remove duplicates that are at least X identity. TODO: Move parsing of "threads" to parseCommonStatic and adjust all relevant classes. TODO: Add 'remap' from Reformat flag to BBMap. *TODO: BBMerge won't go below 17bp in normal mode or 26bp in loose mode, regardless of minoi flag. TODO: BBMerge dynamic mode - test to determine best overlap limits. TODO: Bed output of masked regions by BBMask, or regions with Ns. TODO: Bed output of regions with coverage above or below X (Bob Bower). TODO: Document append in shellscripts. TODO: Genbank format parser (Sam D). Looks confusing. TODO: Decontam should break at (or N-mask) low-coverage areas rather than discarding the whole contig. TODO: BED support for pileup. And make Pileup faster by ignoring irrelevant sam fields. TODO: CrossMask. Accept set of files; for each, mask using BBDuk with all others as ref. TODO: Study bisulfite data on BBMap. Possibly use multiple reference copies with different transforms (C->T, A-G, both, neither). TODO: Shellscripts are not able to handle paths containing spaces. TODO: Add mininsert flag to BBMap. And maybe maxinsert. TODO: Parse MD tag when available. TODO: CC rates for all 3 platforms in one chart; ignore R1/R2 differences. *TODO: Dedupe loses reads when using paired data and run multithreaded. TODO: document nhs flag. TODO: Filter cross-contam plates with only depth and length, test cc rates. TODO: Fix dedupe crash when minclustersize=1. TODO: Clarify or fix what minid does in Dedupe. TODO: Add ribosomal filtering to rqc. TODO: Update BandedAlignerJNI for quicker width reset. 
TODO: Optional penalty when seq ends before ref in banded. TODO: Make sure AddAdapters is adding them correctly, i.e., reverse-complemented (or not). TODO: Make list of proposed higher stringency adapter trimming changes and send to Vasanth/Erika. TODO: Retire ErrorCorrect, and move the functionality over to another class. TODO: Implement ErrorCorrectBulk in KmerNormalize. It is used in MateReadsMT. TODO: BBMerge should allow optional inline error-correction for reads that fail to merge, and revert if they still fail. TODO: Retire KmerCount7MT (non-atomic version). *TODO: It appears that timeslip is being correctly applied by fillLimited (etc), but not by calcDelScore() or calcAffineScore(). TODO: Dedupe should warn if lowercase letters are present. (Kurt) v33. Added "usemodulo" flag to BBMap. Allows throwing away 80% of reference kmers to save memory. Slight reduction in sensitivity. Requested by Rob Egan. Moved GetReads back to jgi package and fixed shellscript. Fixed rare crash when using "local" mode on paired-end data on highly-repetitive genomes (Creinhardtii). Found by Vasanth S. Improved "usemodulo" mode - it was biased against minus-strand hits. Now, it keeps kmers where (kmer%5==rkmer%5). Result is virtually no reduction in sensitivity (zero in error-free reads, and less than 0.01% in reads with 8% error). BBMap will now discard reads shorter than "minlen". Added "idhistbins" or "idbins" flag to BBMap; allows setting the number of bins used in the idhist. Rescaled BBMap's MAPQ to be lower. It is now 0 for unmapped, 1-3 for ambiguous, and roughly 4-45 otherwise, with higher values allowed for longer reads. Added a much flatter MSA version, "MultiStateAligner9Flat", requested by JJ Chai. Fixed SNR output formatting. Added "forcesectionname" flag; fasta reads will always get an "_1" at the end, even if they are not broken into multiple pieces. 
(requested by Shoudan) Changed "fastareadlen" suffixes to only be appended when read is > maxlen rather than >=. Reorganized SamLine and created SamHeader class. Modified CountBarcodes to append sub distance from expected barcodes and 'valid' for valid barcodes. Fixed null pointer exception related to "qhist", "aqhist", and "qahist". Noted by Harald (seqanswers). Fixed issue of readlength.sh breaking up reads when processing fasta files without a fasta extension. Updated BBDuk documentation. Added "maxlength" and qahist support to BBDuk. Added "minoverlap" and "mininsert" to BBDuk. Added "maxlength" to BBMerge. Created countbarcodes.sh. Added edit distance column to CountBarcodes output. Added raw mapping score tag, YS:i:, controlled by "scoretag" flag and disabled by default. Added 'cq' (changequality) flag to reformat. Default: true. Fixed mhist being generated from sam files. Added readgroup support; a readgroup field "xx" can be specified with the flag "rgxx=value". Updated 'usemodulo' flag to use (kmer%9==0 || rkmer%9==0). Requiring the remainders to be equal unevenly affected palindromes and thus even kmer lengths. Updated RemoveHuman to use 'usemodulo' flag and reduced RAM allotment from 23g to 10g. Updated index location of HG19 masked. Added "idfilter" to BBMap. Made BandedAligner abstract superclass and created BandedAlignerConcrete for the Java implementation, and BandedAlignerJNI for the C version. Made file extension detection more robust against capitalization. Added outsingle to BBDuk. Replaced FastaToChromArrays with ChromArrayMaker. Now, indexing can be done from fastq files instead of just fasta. Fixed MAJOR bug in which reference was split up into pieces (as of 33.12). Reverted to old version of reference loader (as of 33.13) as there was still a bug (skipping every other scaffold). BBDuk (and BBDuk2) now better support kmer masking! Every occurrence of a kmer is individually masked. Added parseQuality (qin, qout, etc) to Dedupe. 
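The two 'usemodulo' retention rules mentioned in the v33 notes, (kmer%5==rkmer%5) and the later (kmer%9==0 || rkmer%9==0), can be sketched roughly as below. This is only an illustration, not BBMap's actual code: the class name, method names, and the 2-bit kmer encoding (A=0, C=1, G=2, T=3) are assumptions made for the example.

```java
public class UseModuloSketch {

    // Hypothetical helper: reverse-complement of a 2-bit encoded kmer
    // (assumed encoding: A=0, C=1, G=2, T=3; first base in the high bits).
    static long reverseComplement(long kmer, int k) {
        long rc = 0;
        for (int i = 0; i < k; i++) {
            // Complement the lowest base and append it to the result.
            rc = (rc << 2) | (3 - (kmer & 3));
            kmer >>>= 2;
        }
        return rc;
    }

    // Earlier rule quoted in the changelog: keep a kmer when both remainders match.
    static boolean keepEqualRemainders(long kmer, long rkmer) {
        return kmer % 5 == rkmer % 5;
    }

    // Later rule quoted in the changelog: keep when either orientation is divisible by 9.
    // The test is symmetric in kmer and rkmer, so both strands give the same answer.
    static boolean keepEitherDivisible(long kmer, long rkmer) {
        return kmer % 9 == 0 || rkmer % 9 == 0;
    }
}
```

Because both rules depend only on the unordered pair (kmer, rkmer), a read and its reverse complement select the same kmers, which is what lets the index drop most keys without favoring one strand.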
Changed Dedupe default cluster stats cutoff to 2 (from 10), min cluster size to 2, and by default these values are linked. Added 'outbest' to Dedupe, writing the representative read per cluster (regardless of 'pbr' flag). This is mainly for 16s clustering. Fixed sorting of depths in pileup.sh. Noted by Alicia Clum. Fixed 'outbest' of Dedupe (was writing to wrong stream). Slightly accelerated read trimming. Added read/base count tracking to ConcurrentReadStreamInterface. Added display of exact number of input and output bases and reads to reformat.sh (requested by Seung-Jin). Fixed capital letters changing to lower-case in output filenames when using the "basename" flag with BBSplit. Noted by Shoudan Liang. Added Tools.condenseStrict(array). Fixed fast/slow flags with BBSplit. Noted by Shoudan Liang. Added 3-frames option to TranslateSixFrames by adding the flag "frames=3". Requested by Anne M. TranslateSixFrames now defaults to fasta format when the file extension is unclear. Added "estherfilter.sh" for filtering blastall queries. Added option of getting an input stream from a process with null file argument. Wrote FastaToChromArrays2 based on ByteFile/ByteBuilder for slightly better indexing speed and lower memory use. Modified ChromosomeArray to work with ByteBuilder. Fixed reformat displaying wrong number of input reads when run interleaved (due to recent changes). Added minratio, maxindel, minhits, and fast flags to BBQC, for controlling BBMap. Fixed "assert(false)" statement accidentally left in SamPileup from testing. Noted by Brian Foster. Added kfilter and local flags to BBQC. Fixed "bs" (bamscript) flag with BBSplit. Previously, it did not include the per-reference output streams. Added Jonathan Rood's C code and JNI class for Dedupe. Modified dedupe shellscripts to allow JNI code. BBSplit was not outputting any reads when reference files had uppercase letters (as a result of the recent case-sensitivity change). This has been fixed. Noted by Shoudan Liang. 
BBMap can now output fastq files with reads renamed to indicate mapping location, using the flags "rbm" and "don" (renamebymapping and deleteoldname). FastaQualInputStream replaced by FastaQualInputStream3. At least 2.5x faster, and correctly reads input in which fasta and qual lines are wrapped at different lengths. Bug noted by Kurt LaButti. Added bqhist, which allows box plots of read quality-per-base-location. Fixed a slowdown when making quality histograms due to recalculating probability rather than using cached value. Default sam format is now 1.4. RemoveHuman/BBQC/RQCFilter now default to minhits=1 because 'usemodulo' reduces the number of valid keys. Programs no longer default to outputting to stdout when "out=" is not specified because it's annoying. To write to stdout set "out=stdout.fq" (for example). AssemblyStats now counts IUPAC and invalid characters separately. X and N now denote gaps between contigs, but no other symbols do. The code was also cleaned somewhat. The output formatting changed slightly. Preliminarily integrated Jon Rood's JNI versions of BandedAligner and MultiStateAligner into both Java code and shellscripts to test Genepool deployment. C code is now in /jni/ folder, at same level as /resources/ and /docs/. Clarified documentation of BBMap, BBSplit, and BBWrap to differentiate some parameters. For example, "refstats" only works with BBSplit. Added LW and RW (whisker values) columns to bqhist output, set at the 2nd and 98th percentiles. Requested by Seung-Jin Sul. BBQC will now compress intermediate files to level 2 instead of level 4, to save time. Fixed incompatibility of dot graph output and other output in Dedupe. Reverted to default "minhits=2" for RemoveHuman, because minhits=1 took 5x as long. Added median, mean, and stdev to gchist. Requested by Seung-Jin. Added obqhist (overall base quality histogram). Requested by Seung-Jin. 
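For readers wondering how whisker values such as the bqhist LW and RW columns (2nd and 98th percentiles) can be derived from a per-position quality histogram, here is a minimal sketch. It is not the BBTools implementation; the class name, the method name, and the "first quality whose cumulative count reaches the target" convention are assumptions for illustration.

```java
public class PercentileSketch {

    // counts[q] = number of bases observed with quality score q at one read position.
    // Returns the smallest quality whose cumulative count reaches fraction * total.
    static int percentile(long[] counts, double fraction) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0; // empty histogram: nothing to report
        }
        long target = Math.max(1, (long) Math.ceil(fraction * total));
        long seen = 0;
        for (int q = 0; q < counts.length; q++) {
            seen += counts[q];
            if (seen >= target) {
                return q;
            }
        }
        return counts.length - 1;
    }
}
```

With this helper, a left whisker would be percentile(counts, 0.02) and a right whisker percentile(counts, 0.98), computed once per base position.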
Fixed various places, such as BBDuk, where the "int=true" flag caused references to be loaded interleaved. Noted by Jessica Jarett. Added some parser flags to allow dynamically enabling verbose mode and assertions specifically for certain classes. Fixed a bug in BBMap that made secondary alignments sometimes not get cigar strings. Added "addprefix" mode to rename reads, which simply prepends a prefix to the existing name. Clarified documentation of different histogram outputs in shellscripts. Ported BBMapThread changes over to BBMap variants. Restructured SamPileup and renamed it to CoveragePileup. Now supports Read objects (instead of just SamLines). Integrated CoveragePileup with BBMap and documented new flags. CoveragePileup: Added a concise coverage output, stranded coverage, and read-start-only coverage. Removed obsolete Java classes and some shellscripts. Increased robustness of BBDuk's detection of invalid file arguments, and clarified the error messages. Noted by Scott D. Fixed a problem with interleaving not being forced on fasta input. Paired output files will now force BBDuk input to be treated as interleaved. BBDuk now tracks statistics on which reference sequences were trimmed or masked - previously, it just tracked what was filtered. Reverse-complemented Nextera adapters and added them to official release (/resources/nextera.fa.gz). Added Illumina adapter sequence legal disclaimer to /docs/Legal_Illumina.txt. Implemented GC calculation from index, for generating coverage stats while mapping. Tracked down strangeness with BBDuk. It is possible for "rcomp=f" to slightly reduce sensitivity when "mm=t" using an even kmer length, due to asymmetry. This appears to be correct. Merged in revised JNI Dedupe version that should be working correctly. Verified that it returns same answer as non-JNI version. Tests indicate roughly triple speed, when working with PacBio reads of insert. BBMap JNI version now seems roughly 30% faster than Java version. 
Added insert size quartiles to BBMap and BBMerge. Requested by Alex Copeland. Fixed rare bug related to SiteScore.fixXY(), caused by aligning reads with insufficient padding, fixing the tips, but not changing the start/stop positions. Found by Brian Foster. Fixed a race condition in TextStreamWriter that could randomly cause a deadlock in numerous different programs. Found by Shoudan Liang. Added "maxsites2" flag to allow over 800 alignments for a given read. Fixed bounds of kmer masking in BBDuk; they were off by 2 (too big). Fixed unintended debug print line. Noted by Shoudan Liang. Updated RandomReadInputStream to work with the newer RandomReads3 class. ConcurrentGenericReadInputStream now supports RandomReadInputStream3 as a producer. Fixed kmer dumping from CountKmersExact. Fixed length of vector created in BBMergeOverlapper (4->5). Noted by Jon Rood. Changed default kmer length in BBDuk to 27 so that the 'maskmiddle' base will be in the middle for both forward and reverse kmers. "pairlen" flag accidentally deleted from BBMap; restored. Noted by HGV (seqanswers). BBMerge now has a JNI version from Jonathan Rood - 60% faster than pure Java. Requires compiling the C code; details are in /jni/README.txt. Wrapped BBMerge JNI initializer in a conditional, so it will not try to load unless "usejni" is specified. Added "parseCommonStatic" to BBMerge and BBDuk (to allow JNI flag parsing). Commented out "module load" and "module unload" statements in public version. Added 'printlastbin' or 'plb' flag to countunique to produce a final bin smaller than binsize. Suggested for use in cumulative mode. Requested by Andrew Tritt. Added support for bzip2 and pbzip2 compression and decompression. The programs must be installed to use bz2 format. Eliminated use of "sh" when launching subprocesses. This also allows pigz compression support in Windows. Files were not being closed after "testInterleaved()". Fixed. Improved error messages when improper quality values are observed. 
Updated hard-coded adapter path to include Nextera adapters. This affects BBQC and RQCFilter. Improved file format detection. Now FileFormat (testformat.sh) will print a warning when the contents and extension don't match, and it can differentiate between sam and fastq. Problem noted by Vasanth Singan. Fixed issue where "scafstats" output was printing inflated numbers with chimeric paired reads, or pairs with only one mapped read. Noted by HGV (seqanswers). Closed stream after reading in FileFormat. Unrolled, debranched, and removed assertion function calls from BBMerge inner loop. Fixed a bug in which findTipDeletions was not changing the bounds of the gap array. Added getters and setters for SiteScores that enforce gap correctness. Improved GapTools to test for and fix non-ascending points. Forced use of setters in TranslateColorspaceRead, AbstractMapThread, and BBIndex* classes; this caught some inconsistencies that should increase stability and correctness. Enabled jni-mode alignment by default for BBQC and removehuman. Added a BBMap output line indicating how many reads survived for use with, e.g., removehuman. Requested by Brian Foster. Added messages to BBQC to indicate which phase is executing. Requested by Brian Foster. SiteScore start and stop are exclusively set by methods now. Fixed a bug with local flag noted by Vasanth Singan. Added MaximumSpanningTree generation to Dedupe (mst flag). Merged in faster BBMerge overlapper JNI version; now 90% faster than Java with fastq and 70% faster with fasta. Improved Dedupe's support for paired reads: fixed an assertion, and added "in1" and "in2". Fixed an assertion involving semiperfect alignments of repetitive reads that go out of the alignment window. Found by Alicia Clum. Fixed idhist mean calculation. Added mode, median, stdev, both by read count and base count. Better documented ConcurrentReadStreamInterface. Fixed a crash in CoveragePileup when using 32-bit mode. 
Fixed a couple instances in which the first two arguments being unrecognized would not be noticed. Fixed a bug in pileup causing coverage fraction to be reported incorrectly, if arrays were not being used. Noted by Vasanth Singan. Fixed a twocolumn mode in pileup; it was generating no output. Added additional parse flags to pileup, such as "stats" and "outcov". Added additional output fields to coverage stats - total number of covered bases, and number of reads mapped to plus and minus strands. CountKmersExact: Added preallocation (faster, less memory) and a one-pass-mode for the prefilter (faster, but nondeterministic). Replaced most instances of "Long.parseLong" with "Tools.parseKMG" to support kilo, mega, and giga abbreviated suffixes. Added jgi.PhylipToFasta and phylip2fasta.sh, for converting interleaved phylip files to fasta. Requested by Esther Singer. v33.58 Began listing point-version numbers in this readme. Added jgi.A_Sample2, a simpler template for a concurrent pipe-filter stage. Added jgi.MakeChimeras, a tool for making chimeric PacBio reads from input non-chimeric reads. Also, makechimeras.sh. Requested by Esther Singer. Added support for normalized binning to CoveragePileup. Requested by Vasanth Singan. v33.59 Fixed pileup's normalized scaling when dealing with 0-coverage scaffolds. v33.60 Added driver.FilterReadsByName.java and filterbyname.sh. Allows inclusion or exclusion of reads by name. Added midpad flag to RandomReads (allows defining inter-scaffold padding). v33.61 Added ConcurrentReadInputStreamD, prototype for MPI-version of input stream. Made Read and all classes that might be attached to reads Serializable. Added DemuxByName and demuxbyname.sh which allows a single file to be split into multiple files based on read names. v33.62 Added FilterByCoverage and filterbycoverage.sh to filter assemblies based on contig coverage stats (from Pileup). Added CovStatsLine, an object representation of Pileup's coverage stats. 
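As an illustration of the kilo/mega/giga suffix parsing mentioned above for Tools.parseKMG, here is a stand-alone sketch. It is not the real Tools.parseKMG; the class name KmgSketch, the method name parseKmg, the use of decimal multipliers (1000-based), and the handling of fractional values are all assumptions.

```java
public class KmgSketch {

    // Parses values such as "500", "2k", "3m", "1g" or "1.5k" into a long.
    // Assumption: suffixes are decimal (k=10^3, m=10^6, g=10^9) and case-insensitive.
    static long parseKmg(String s) {
        String t = s.trim().toLowerCase();
        if (t.isEmpty()) {
            throw new NumberFormatException("empty value");
        }
        long mult = 1;
        char last = t.charAt(t.length() - 1);
        if (last == 'k') {
            mult = 1_000L;
        } else if (last == 'm') {
            mult = 1_000_000L;
        } else if (last == 'g') {
            mult = 1_000_000_000L;
        }
        if (mult > 1) {
            t = t.substring(0, t.length() - 1); // strip the suffix
        }
        if (t.indexOf('.') >= 0) {
            // Fractional input: scale in floating point, then truncate.
            return (long) (Double.parseDouble(t) * mult);
        }
        // Integer input: fail loudly on overflow instead of wrapping.
        return Math.multiplyExact(Long.parseLong(t), mult);
    }
}
```

A parser of this shape lets a numeric flag accept either "2000000" or "2m", which is the convenience the changelog entry describes.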
Added '#' symbol to coverage stats header. v33.63 Fixed path in filterbycoverage.sh v33.64 Added custom scripts driver.MergeCoverageOTU and mergeOTUs.sh for Esther. Added DecontaminateByNormalization, for automating SAG plate decontamination. Fixed legacy code that set KmerNormalize to use 8 threads in some cases. Added "fixquality" for capping quality scores at 41. Requested by Bryce Foster. Added fasta output to kmercountexact. Requested by Alex Copeland. Added kmer histogram to kmercountexact (2-column and 3-column). Requested by Alex Copeland. Added multiple memory-related and output formatting flags to kmercountexact. Made KmerNode a subclass of AbstractKmerTable. Improved Data's "unloadall" to also clear scaffold-related data. Removed obsolete class CoverageArray1. v33.65 Reduced preallocated memory in kmercountexact to avoid a crash on high memory machines. Also reduced total number of threads. v33.66 "CountKmersExact.java" renamed to "KmerCountExact.java". kmercountexact now writes histogram and kmer dump simultaneously in separate threads. kmercountexact.sh now specifies both -Xms and -Xmx. CountKmersExact will no longer run out of memory if -Xms is not specified; instead, it will preallocate a smaller table. v33.67 Messed with MDA amp in RandomReads a bit. Added parser "ztd" ("zipthreaddivisor") flag. Defaults to 2 for removehuman.sh. Added BBMerge flags "maq" (minaveragequality) and "mee" (maxexpectederrors). Reads violating these will not be attempted to merge. Added BBMerge "efilter" flag, to allow disabling of the efilter. Efilter bans merges of reads that have more than the expected number of errors, based on quality scores. Closed A_Sample2 I/O streams after completion. Noted by Jon Rood. Created SynthMDA, a program to make a synthetic MDA'd single cell genome. This genome would be used as a reference for RandomReads. Added Reformat "vpair" or (verifypairing) flag, which allows validation of pair names. Before, it was just interleaved reads. 
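The "mee" (expected errors) idea above rests on the standard Phred relationship, where a quality score q corresponds to an error probability of 10^(-q/10). The sketch below shows that calculation and a simple threshold check. It is not BBMerge's code; the class and method names are hypothetical, and quality values are assumed to be already decoded to numeric Phred scores rather than ASCII characters.

```java
public class ExpectedErrorsSketch {

    // Sum of per-base error probabilities, using the Phred definition p = 10^(-q/10).
    static double expectedErrors(byte[] quals) {
        double sum = 0;
        for (byte q : quals) {
            sum += Math.pow(10.0, -q / 10.0);
        }
        return sum;
    }

    // Hypothetical filter: a read passes if its expected error count is within the limit.
    static boolean passes(byte[] quals, double maxExpectedErrors) {
        return expectedErrors(quals) <= maxExpectedErrors;
    }
}
```

For example, a read with two bases at Q10 and Q20 has roughly 0.1 + 0.01 = 0.11 expected errors, so it would pass a limit of 1 but fail a limit of 0.1.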
Pair name validation will now accept identical names, if the "ain" (allowidenticalnames) flag is set. Updated reformat.sh, repair.sh, bbsplitpairs.sh with new flags. Removed FastaReadInputStream_old.java. Added "forcelength" flag to MakeChimeras. v33.68 Added "ihist" flag to rqcfilter, default "ihist.txt". Unless this is set to null, BBMerge will run to generate the insert size histogram after filtering completes. AbstractKmerTable preallocation is now multithreaded. Unfortunately, this did not result in a speedup. Added ByteBuilder-related methods to certain Read output formats. Added ByteStreamWriter. This is a threaded writer with low overhead, and is substantially faster than TextStreamWriter (perhaps 2x speed). Fixed a bug in KmerNode (traversing wrong branch during dump). All AbstractKmerTable subclasses now dump kmers using bsw/ByteBuilder instead of tsw/StringBuilder. Added ForceTrimLeft/ForceTrimRight flags to Dedupe (requested by Bryce/Seung-Jin). v33.69 FilterByCoverage (and thus DecontaminatebyNormalization) now produce a log file indicating which contigs were removed. FilterByCoverage and DecontaminatebyNormalization can now optionally process coverage before and after normalization, and not remove contigs unless the coverage changes by at least some ratio (default 2). Enable with "mapraw" and optionally "minratio" flag. Added ihist to file-list.txt. TODO: Verify success. Reads longer than 200bp are now detected as ASCII-33 regardless of their quality values. This helps with handling PacBio CCS/ROI data. Added support in FixPairsAndSingles (repair.sh) for reads with names that do not contain whitespace, but still end with "/1" and "/2". Added qout flag to RandomReads3. Refactored TextStreamWriter to be more like ByteStreamWriter. Added gcformat 0 (no base content info printed) to AssemblyStats2 (stats.sh). v33.70 Updated RQCFilter and BBQC to bring them closer together and improve some of their defaults. 
RQCFilter now has more parameters such as k for filtering and trimming. RQCFilter now correctly produces the insert size histogram. v33.71 Fixed a bug in Dedupe preventing overlap detection when 'absorb match' and 'absorb containment' were both disabled. Noted by Shoudan Liang. Optimized synthetic MDA procedure. v33.72 Fixed a bug in SynthMDA.java. Further tweaked parameters. Added synthmda.sh. v33.73 Further tweaked SynthMDA defaults to better match some real data sent to me by Shoudan and Alex. Fixed a bug in BBDuk's mask mode in which all bases in a masked read were assigned quality 0. Noted by luc (SeqAnswers). Fixed a small error in KmerCountExact's preallocation calculation. Added preallocation to BBDuk/BBDuk2. Not recommended for BBDuk2 because the tables may need unequal sizes. Added "restrictleft" and "restrictright" flags to BBDuk (not BBDuk2). These allow only looking for kmer matches in the leftmost or rightmost X bases. Requested by lankage (SeqAnswers). v33.74 Added jgi.Shuffle.java to input a read set and output it in random order. It can also sort by various things (coordinates, sequence, name, and numericID). Added CallPeaks, which can call peaks from a histogram. Requested by Kurt LaButti. Integrated peak calling into BBNorm and KmerCountExact. BBNorm now has a "histogramcolumns" flag, so it can produce Jellyfish-compatible output. Added callpeaks.sh. v33.75 CallPeaks now calls by raw kmer count rather than unique kmer count. This better detects higher-order peaks. Finished CrossContaminate.java and added crosscontaminate.sh. Added "header" and "headerpound" to pileup.sh, to control header presence and whether they start with "#". Added "prefix" flag to SynthMDA and RandomReads3, to better track origin of reads during cross-contamination trials. RQCFilter and BBQC now parse 'usejni' flag; rqcfilter.sh and bbqc.sh default to this being enabled. Added "uselowerdepth" flag to BBNorm (default true). 
Allows normalization by depth of higher or lower read. Set to false by DecontaminateByNormalization. v33.76 Fixed a bug in synthmda.sh command line. Fixed build number not being parsed by SynthMDA. Added some error handling to CrossContaminate, so it shouldn't hang as a result of missing files. v33.77 SynthMDA now nullifies reference in memory prior to generating reads. Parser was not correctly setting the number of compression threads when exactly 1 was requested. Shuffle is now multithreaded, and CrossContaminate defaults to shufflethreads=3. Shuffle now removes reads as they are printed, reducing memory usage. Created shellscript templates for generating and assembling full plates of synth MDA data, and ran successfully. *SamLine was fixed when generating pnext from clipped reads. Still needs work; pos1 and pos2 need to be recalculated considering clipping. BBDuk now tracks #contaminant bases as well as #contaminant reads per scaffold for stats. Additional flag "columns=5" enables this output. BBDuk stats are now sorted by #bases, not #reads. BBDuk counting arrays changed from int to long to handle potential overflow. v33.78 Modified DemuxByName to handle affixes of variable length (though it's less efficient with multiple lengths). v33.79 Changed the way "pos" and "pnext" are calculated for paired reads to be consistent. Bug had been noted with soft-clipped reads by Rob Egan. Changed LOCAL_ALIGN_TIP_LENGTH from 8 to 1. Previously, soft-clipping would only occur if at least 8 bases would be clipped; not sure why I did that. Changed the way "tlen" is calculated to compensate for clipping. v33.80 Changed default decontaminate minratio from 2 to 0 (disabling it) because of false negatives. Changed default decontaminate mincov from 4 to 5 due to a false negative. Changed default decontaminate kfilter from 63 to 55 to better reflect Spades defaults. Fixed a bug in filterbycoverage which was outputting contaminant contigs instead of clean contigs. 
Added outd (outdirty) flag to FilterByCoverage. v33.81 Changed decontaminate normalization target from 100 to 50, and minlength from 0 to 500. Changed decontaminate minc and minp flags from int to float. v33.82 Changed cross contaminate probability root from 2 to 3 (increasing amount of lower-level contamination). Fixed a crash bug in sam file generation caused by the change in the way pos was calculated. v33.83 Added aecc=f, cecc=f, minprob=0.5, depthpercentile=0.8 flags to DecontaminateByNormalization. Defaults are as listed. Dropped mindepth to 3 and maxdepth to target; target default changed to 20. Changed the way mindepth is handled in normalization; now it is based on the depth of the higher read. v33.84 Added BBNorm prebits flag for setting prefilter cell size (default 2). Added Decontaminate filterbits and prefilterbits flags, default 32 and 4. 4 was chosen because MDA data has high error kmer counts. v33.85 Fixed parsing of decontaminate minc and minp (parsed as ints; should have been floats). Changed default minc to 3.5. Changed default ratio to 1.2. v33.86 Changed decontaminate default dp to 0.75. Changed decontaminate default prebits to 2. Changed decontaminate default minr (min reads) to 20. Some tiny (~500bp) low-coverage contigs were getting through. Changed decontaminate mindepth to 2. Decontaminate results now prints extra columns for read counts and pre-norm coverage. v33.87 Added "covminscaf" flag to BBMap and Pileup, to suppress output of really short contigs. Default 0. Changed CrossContaminate coverage distribution from cubic to geometric. v33.88 Shuffle removing reads caused incredible slowness; it should have set reads to null. Fixed. v33.89 Added HashArrayA, HashForestA, KmerNodeA and updated AbstractKmerTable to allow sets of values per kmer. Refactored all AbstractKmerTable subclasses. Added scaffold length tracking to BBDuk (for RPKM). Added RPKM output to BBDuk (enable with "rpkm" flag). 
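RPKM (reads per kilobase per million reads) is why scaffold lengths need to be tracked: the read count for a scaffold is normalized by both its length and the total number of reads. The sketch below uses the standard definition; it is not BBDuk's code, the names are hypothetical, and whether BBDuk's denominator is all input reads or only matched reads is not stated in this changelog, so that choice is left to the caller.

```java
public class RpkmSketch {

    // Standard RPKM: reads / (length in kilobases * total reads in millions),
    // which simplifies to reads * 1e9 / (length in bases * total reads).
    static double rpkm(long readsOnScaffold, long scaffoldLength, long totalReads) {
        if (scaffoldLength <= 0 || totalReads <= 0) {
            return 0; // avoid division by zero for empty inputs
        }
        return readsOnScaffold * 1e9 / ((double) scaffoldLength * totalReads);
    }
}
```

For example, 10 reads hitting a 1000 bp scaffold out of 1,000,000 total reads gives an RPKM of 10.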
BBDuk now unloads kmers after finishing processing reads. v33.90 BBDuk counter arrays are now local per-thread, to prevent cache-thrashing. Added IntList.toString(). Created Seal class, based on BBDuk with values stored in arrays. Adjusted auto skip settings of BBDuk (increased size threshold for longer skips). Added BBDuk skip flag (controls minskip and maxskip). Fixed a bug in DemuxByName/DecontaminateByNormalization/CrossContaminate: attempt to read directories as files. v33.91 Fixed a bug in BBDuk related to clearing data too early. Noted by Brian Foster. v33.92 Added per-reference-file stats counting to BBDuk/Seal, and "refstats" flag. Added returnList(boolean) to ConcurrentReadStreamInterface. Removed an extra listen() call from ConcurrentReadInputStreamD. Documented "addname" flag for stats.sh. Implemented restrictleft and restrictright for BBDuk2. Added "nzo" flag for BBDuk/Seal. Added sdriscoll's reformatted shellscript help for BBDuk and BBMap. Thanks! Added more documentation to bbmap.sh (usequality flag). Added maq (minaveragequality) flag to BBMap, at request of sdriscoll. Added rename flag to BBDuk/Seal - renames reads based on what sequences they matched. Added userefnames flag to BBDuk/Seal - the names of reference files are used, rather than scaffold IDs. v33.93 maxindel flag now allows KMG suffix. Added "speed" flag to BBDuk/Seal. Added read processing time to BBDuk/Seal output. BBDuk "fbm" (findbestmatch) mode is now much faster, using variable rather than fixed-length counters. Fixed BBDuk2 not working when using the "ref" flag rather than "filterref". Changed AbstractKmerTable subclass names to *1D and *2D. Made KmerNode a superclass of KmerNode1D and KmerNode2D and eliminated redundant methods. Eliminated 2D version of HashForest; it now works with 1D and 2D nodes. Made HashArray a superclass of HashArray1D and HashArray2D. Created HashArrayHybrid. 
Added slow debugging methods to AbstractKmerTable classes, to verify that values were present after being added. Fixed bug in KmerNode1D; was never changing its value on 'set'. Probably only affected Seal. Seal 1D now appears to produce identical output for prealloc and non-prealloc. Finished debugging KmerNode2D, KmerForest, HashArray2D, HashArrayHybrid, and Seal. Added "fbm" and "fum" to Seal. Seal now defaults to 7 ways. Adjusted Seal's memory preallocation. Added -Xms flag to BBMergeGapped and BBNorm shellscripts. v33.94 Added -Xms flag to BBDuk and Seal. Added qskip flag to BBDuk and Seal (for skipping query kmers). v33.95 Seal now defaults to HashArrayHybrid rather than HashArrayArray2D. v33.96 Fixed a slowdown in Seal and BBDuk caused by sorting list of ID hits. v33.97 Wrote driver.CorrelateIdentity and matrixtocolumns.sh for identity correlations between 16S and V4. Wrote jgi.IdentityMatrix and idmatrix.sh for all-to-all alignment. Added BandedAligner.alignQuadruple() to check all orientations. BandedAligner now does not clear the full arrays, only the used portion, which can vary depending on read length. v33.98 No change - build failure. v33.99 Changed BandedAligner.PenalizeOffCenter(). Indels were getting double-penalized when they led to length mismatches between query and ref. Added AlignDouble(), but it looks like AlignQuadruple is the only viable method for calculating full identity when the sequences do not start or stop at the same place. Added test method to ReadStats to ensure the files are safe to write (ReadStats.testFiles()). Fixed a bug in bqhist output giving read 1 and read 2 same values. Noted by Shoudan/Bryce. Fixed a bug in BBDuk initialization when no kmer input supplied. Noted by Bill A. Fixed a bug in BBDuk/Seal giving a spurious warning. Detected race condition in ByteFile2 triggered by closing early. Not very important. Added jni path flags to BBDuk shellscript command line. Wrote FindPrimers and msa.sh to locate primer sites. 
Uses MultiStateAligner; outputs in sam format. Wrote CutPrimers and cutprimers.sh to cut regions flanked by mapped primer locations from sequences, e.g. V4. TODO: Plot correlation of V4 and 16s. TODO: Add length into edges of Dedupe output. (Ted) TODO: Benchmark Seal. Speed seems inconsistent. TODO: Locking version of Seal. TODO: HashArray resize - grow fast up to a limit, then resize to exactly the max allowable. TODO: Alicia BBMap PacBio slowdown (try an older version...) TODO: BBMerge rename mode with insert sizes. TODO: Dump info about Seal kmer copy histogram. TODO: Dedupe crash bug. (Kurt) TODO: CallPeaks minwidth should be a subsumption threshold, not creation threshold. TODO: CallPeaks should not subsume peaks with valleys in between that are very low. *TODO: Make TextStreamWriter an abstract superclass. TODO: BBDuk split mode TODO: Add option for BBMap to convert U to T. (Asaf Levy) TODO: Add dedupe support for graphing containments and matches. TODO: Log normalization. TODO: Prefilterpasses (prepasses) TODO: Test forcing msa.scoreNoIndels to always run bidirectionally. TODO: Message for BBNorm indicating pairing (this is nontrivial) TODO: Average quality for pileup.sh TODO: Fix ChromArrayMaker which may skip every other scaffold (for now I have reverted to old, correct version). ***Possibly fixed by disabling interleaving; TODO: Test. TODO: Consider changing ConcurrentGenericReadInputStream to put read/base statistics into incrementGenerated(), or at least in a function. TODO: BBSplit produces alignments to the wrong reference in the output for a specific reference. (Shoudan) TODO: Change the way Ns are handled in cigar strings, both input and output. TODO: Add #clipped reads/bases to BBMap output. TODO: Add method for counting number of clipped bases in a read and unclipped length. TODO: Orientation statistics for BBMap ihist. TODO: Clarify documentation of 'reads' flag to note that it means reads OR pairs. 
TODO: bs flag does not work with BBWrap (Shoudan). TODO: Fasta input sometimes tries to keep reading from the file when a limited number of reads is specified. Gives error message but output is fine. TODO: 'saa' flag sometimes does not work (Shoudan). TODO: Kmer transition probabilities for binning. TODO: One coverage file per scaffold; abort if over X scaffolds. (Andrew Tritt) TODO: Enable JNI by default for BBMap and Dedupe on Genepool. TODO: Disable cigar string generation when dumping coverage only (?). This will disable stats, though. TODO: Pipethread spawned when decompressing from standard in with an external process. TODO: FileFormat should test interleaving and quality individually on files rather than relying on a static field. TODO: Refstats (BBSplit) still reports inflated rates for pairs that don't map to the same reference. This behavior is difficult to change because it is conflated with BBSplit's output streams. v32. Revised all shellscripts to better detect memory in Linux. This should massively increase reliability and ease of use. Added append flag. Allows appending to output files instead of overwriting. Append flag now should work with BBWrap, with sam files, and with gzipped files. All statistics are now stored in longs, rather than ints. Added statistics tracking of # bases as well as # reads. Updated human-readable output to show 4 columns. Split bbmerge into gapped (split kmer) and ungapped (overlap only) versions. bbmerge.sh calls the ungapped version. Added "qahist" to bbmap - match/sub/ins/del histogram by quality score. Fixed "pairlen" flag; it was only being used if greater than the default. (Noted by Harald on seqanswers) Added insert size median and standard deviation to output stats. The 'ihist=' flag must be set to enable this, otherwise the data won't be tracked. (Requested by Harald on seqanswers) Fixed bug in which non-ACGTN IUPAC symbols were not being converted to N.
(Noted by Leanne on seqanswers) Changed shellscripts from DOS to Unix EOL encoding. Added support for "-h" and "--help" in shellscripts (before it was just in java files). Created Dedupe2 - faster, and supports 1-cluster-per-file output. Created Dedupe3 - supports more than 2 affix tables. Uses slightly more memory. BBMap now generates "sort" shellscripts even if the output is in bam format. pileup.sh now prints a coverage summary to standard out. Added 'split' flag to BBMask. Fixed bug in randomreads allowing paired reads to come from 'nearby' scaffolds. Documented randomreads.sh. Added gaussian insert size distribution to randomreads. Fixed a bug in calcmem.sh that prevented requesting memory that Linux considered 'cached'. TODO: Penalize score of sites with errors near read tips, and long deletions. Added "Median_fold" column to pileup. You need to set 'bitset= Changed default quality-filtering mode to average probability rather than average quality score. Default number of threads now takes the environment variable NSLOTS into consideration. However, because Mendel nodes have hyperthreading enabled, if NSLOTS>8 and (# processors)==NSLOTS*2, then #processors will be used instead. So it is still recommended that you set threads manually if you don't have exclusive access to a node. Fixed bbmerge, which was crashing on fasta input. Fixed gaussian insert size distribution in randomreads (it was causing a crash). Enabled unpigz support in Windows (decompression only). TODO: BBNorm needs in1/in2/out1/out2 support. Added mingc and maxgc to reformat. Added 'passes' flag to BBQC and reduced default passes to 1 if normalization is disabled. Swapped FileFormat's method signature "allowFileRead" and "allowSubprocess" parms for some functions, as they were inconsistent. This may have unknown effects. TODO: unclear if fasta files are currently checked for interleaving. Method added to "FASTQ". TODO: FileFormat should perhaps test for quality format and interleaving.
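Note on the NSLOTS rule above: a minimal sketch of the described decision follows. The class and method names (DefaultThreads, defaultThreads, parseSlots) are hypothetical. Capping at the processor count when NSLOTS is set is an assumption; the entry only states that NSLOTS is "taken into consideration" and that the processor count wins when NSLOTS>8 and processors==NSLOTS*2.

```java
/** Hypothetical illustration of the NSLOTS rule; not the BBTools source. */
public class DefaultThreads {

    /**
     * @param processors logical processors reported by the JVM
     * @param nslots     value of NSLOTS, or a value below 1 if it is unset
     * @return the default thread count under the rule described in the changelog
     */
    static int defaultThreads(int processors, int nslots) {
        if (nslots < 1) {
            return processors; // NSLOTS unset: use every logical processor
        }
        if (nslots > 8 && processors == nslots * 2) {
            return processors; // hyperthreaded node: use all logical processors
        }
        // Assumption: otherwise honor NSLOTS, capped at the processor count.
        return Math.min(nslots, processors);
    }

    /** Parses NSLOTS, returning -1 when the variable is missing or malformed. */
    static int parseSlots(String value) {
        if (value == null) {
            return -1;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int processors = Runtime.getRuntime().availableProcessors();
        int nslots = parseSlots(System.getenv("NSLOTS"));
        System.out.println("threads=" + defaultThreads(processors, nslots));
    }
}
```

As the entry says, setting threads manually remains the safer choice on a shared node.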
Fixed reversed variables in "machineout" stats for %mapped and %unambiguous. Found by Michael Barton. Added "testformat.sh". Fixed dedupe "csf" output to work even when no other outputs specified. Fixed dedupe erroneous assumption that "bandwidth" had not been custom-specified. Changed MakeLengthHistogram (readlength.sh) default behavior to place reads in lower bins rather than closest bins. Toggle with "round" flag. Added "repair" flag to SplitPairsAndSingles. Created "repair.sh". Fixed a bug in which tabs were not allowed in fasta headers. Improved BBMerge: default minqo 7->8, made margin a parameter, added 'strict' macro that reduces false positive rate. Added "samestrand" flag to RandomReads. Fixed a dedupe bug with "pto" and paired reads; read2 was not getting a UnitID. Fixed a bug in which the BBMap stats for insertion rate was sometimes higher than the true value. Fixed bugs in BBMerge; increased speed slightly. Created grademerge.sh to grade merged reads. Added 'variance' flag to randomreads; used to make qualities less uniform between reads. BBDuk now has overwrite=true by default. calcmem.sh now sets -Xmx and -Xms from each other if only one was specified. Fixed bug with "ambig=all" and "stoptag" flags being used together. Found by WhatSoEver (seqanswers). Added 'findbestmatch'/'fbm' flag to BBDuk; reports the reference sequence sharing the greatest number of kmers with the read. Shellscripts no longer try to calculate memory before displaying help (noted by Kjiersten Fagnan). -ea and -da are now valid parameters for all shellscripts. Improved documentation of Dedupe. Added "loose" and "vloose" modes to BBMerge. Added novel-kmer-filtering to BBMerge - bans merged reads that create a novel kmer. Does not seem to help. Added entropy-detection to BBMerge - minimum allowed overlap is determined by entropy rather than a constant. Moderate improvement. Fixed bug causing "repair.sh" script to not work. Noted by SES (seqanswers). Added "fast" mode to BBMerge. 
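Note on the readlength.sh binning change above: the difference between "lower bin" and "closest bin" placement can be sketched as follows. This is an illustration under assumptions, not the MakeLengthHistogram source; the names LengthBins and binStart are hypothetical, lengths are assumed non-negative, and rounding ties upward in the closest-bin case is an assumption.

```java
/** Hypothetical illustration of lower-bin versus closest-bin placement. */
public class LengthBins {

    /**
     * @param length  read length, assumed non-negative
     * @param binSize width of each histogram bin, assumed positive
     * @param round   false: place in the lower bin; true: place in the closest bin
     * @return the start coordinate of the bin the read falls into
     */
    static int binStart(int length, int binSize, boolean round) {
        // Adding half a bin before truncating turns floor into round-to-nearest.
        int shifted = round ? length + binSize / 2 : length;
        return (shifted / binSize) * binSize;
    }

    public static void main(String[] args) {
        System.out.println(binStart(17, 10, false)); // lower bin
        System.out.println(binStart(17, 10, true));  // closest bin
    }
}
```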
Fixed a rounding problem in RandomReads that caused gaussian distribution to have 2x frequency of intended reads at exactly insert size of double read length. Added exponential decay insert size distribution to RandomReads, for use in LMP libraries. TODO: Track different paired read orientation rates (innie, outie, same direction, etc) with BBMap. Added sssr (secondarysitescoreratio) and ssao (secondarysiteasambiguousonly) flags. Response to WhatSoEver (seqanswers). Ambiguously-mapped reads that print a primary site now print a minimum of 1 secondary site, and all sites with the same score as the top secondary site. Improved error message for paired reads with unequal number of read 1 vs read 2. Response to Salvatore (seqanswers). Updated bbcountunique.sh help message. Changed AddAdapters default to "arc=f" (no reverse-complement adapters). Added "addpaired" flag (adds adapter to same location of both reads). Added BBDuk/BBDuk2 "tbo" (trimbyoverlap) flag. Vastly reduces false-negatives with no increase in false-positives. Adding "fragadapter" flag to RandomReads. Also added ability to handle multiple different adapters for both read 1 and read 2. Adapters are added to paired reads with insert size shorter than read length. Added "ordered" flag to BBDuk/BBDuk2. Added "tpe" (trimpairsevenly) flag to BBDuk/BBDuk2. This works in conjunction with kmer-trimming to the right. Slightly decreases false negatives and doubles false positives. Updated rqcfilter and bbqc with 'tbo' and 'tpe' flags. TODO: Migrate RQCFilter to BBDuk2. Improved addadapters to better handle reads annotated by renamereads. BBMap's fillLimited routine is now affected by 'sssr' flag, if secondary sites are enabled. This will make things slightly slower when secondary sites are enabled, if sssr uses a low value (default is 0.95). statswrapper now allows comma-delimited files. Added standard deviation to BBMerge (requested by Bryce F). 
Added "tbo" (trimbyoverlap) flag to BBMerge, as an alternative to joining. Updated help for 'ambig' in bbmap.sh to remove the obsolete information that 'ambig=all' did not support sam output. Updated BBMapSkimmer and its shellscript to default to 'ambig=all', which is its intended mode. BBDuk no longer defaults to "out=stdout.fq" because that was incredibly annoying. Now it defaults to "out=null". Changed BBDuk default mink from 4 to 6. Changed BBDuk, Reformat, SplitPairsAndSingles default trimq from 4 to 6. Added "ftr"/"ftl" flags to BBDuk. Added "bbmapskimmer" to the list of options parsed by BBWrap. (Noted by JJ Chai) Corrected documentation of idtag and stoptag - both default to false, not true. (Noted by JJ Chai) Added "mappedonly" flag to reformat. (Requested by Kristen T) Added "rmn" (requirematchingnames) flag to Dedupe. Requested by Alex Copeland. Added ehist, indelhist, idhist, gchist, lhist flags to BBMap, BBDuk, and Reformat. Added removesmartbell.sh wrapper for pacbio.RemoveAdapters2. Fixed instance in KmerCoverage where input stream was being started twice. Noted by Alicia Clum. Added "ngn" (NumberGraphNodes) flag to dedupe; default true. Allows toggling of labelling graph nodes with read number or read name. "slow" flag now disables a heuristic that skipped mapping reads containing only kmers that are highly overrepresented in the reference. Problem noted by Shoudan Liang. Added MergeBarcodes and mergebarcodes.sh Identity is now calculated neutrally by default. Added "qin" and "qout" documentation to bbnorm shellscripts. Noted by muol (seqanswers). Changed qhist to output additional columns - both linear averages and logarithmic averages. Added mode to BBMerge output. Added mode, min, max, median, and standard deviation to ReadLength output. The mode and std dev are affected by bin size, so will only be exactly correct when bin size is 1. Added "nzo" (nonzeroonly) flag to ReadLength.
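Note on the ReadLength statistics above: a minimal sketch of deriving min, max, mode, median, mean, and standard deviation from a length histogram with bin size 1 is given below. It is not the ReadLength source; the class name LengthStats is hypothetical, counts are assumed non-negative, the median is taken as the first length at which the cumulative count reaches half the total, and the standard deviation is the population form. With wider bins each index would stand for a range of lengths, which is why the entry warns that mode and standard deviation are exact only at bin size 1.

```java
/** Hypothetical illustration: summary statistics from a length histogram. */
public class LengthStats {
    final int min, max, mode, median;
    final double mean, stdev;

    /** @param counts counts[len] = number of reads of exactly that length */
    LengthStats(long[] counts) {
        long total = 0, sum = 0, bestCount = 0;
        int lo = -1, hi = -1, best = -1;
        for (int len = 0; len < counts.length; len++) {
            long c = counts[len];
            if (c <= 0) { continue; }
            if (lo < 0) { lo = len; }      // first non-empty bin
            hi = len;                      // last non-empty bin seen so far
            if (c > bestCount) { bestCount = c; best = len; }
            total += c;
            sum += c * len;
        }
        double avg = total > 0 ? sum / (double) total : 0;
        long half = (total + 1) / 2, seen = 0;
        int med = -1;
        double squares = 0;
        for (int len = 0; len < counts.length && total > 0; len++) {
            seen += counts[len];
            if (med < 0 && seen >= half) { med = len; }
            double d = len - avg;
            squares += counts[len] * d * d; // weighted squared deviation
        }
        min = lo; max = hi; mode = best; median = med; mean = avg;
        stdev = total > 0 ? Math.sqrt(squares / total) : 0;
    }

    public static void main(String[] args) {
        // One read of length 2, three of length 3, one of length 5.
        LengthStats s = new LengthStats(new long[] {0, 0, 1, 3, 0, 1});
        System.out.println(s.min + " " + s.max + " " + s.mode + " " + s.median);
    }
}
```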
Created "A_Sample", a template for programs that input reads, perform some function, and output reads. BBNorm now works correctly with dual input and output files. Noted by Olaf (seqanswers). Added mode to BBMap insert size statistics. Added CorrelateBarcodes and filterbarcodes.sh, for analyzing and filtering reads by barcode quality. Added "aqhist" (average quality histogram) to ReadStats - can be used by BBMap, BBDuk, Reformat. v31. TODO: Change pipethreads to redirects (where possible), and hash pipethreads by process, not by filename. TODO: Improve scoring function by using Gumbel distribution and/or accounting for read length. TextStreamWriter was improperly testing for output format 'other'. Noted by Brian Foster. Fixed bug for read stream 2 in RTextOutputStream3. Found by Brian Foster. Fixed bug in MateReadsMT creating an unwanted read stream 2. Found by Brian Foster. TrimRead.testOptimal() mode added, and made default when quality trimming is performed; old mode can be used with 'otf=f' flag. Fixed a couple cases where output file format was set to "ordered" even though the process was singlethreaded; this had caused an out-of-memory crash noted by Bill A. Changed shellscripts of MapPacBio classes to remove "interleaved=false" term. Reduced Shared.READ_BUFFER_LENGTH from 500 to 200 and Shared.READ_BUFFER_MAX_DATA from 1m to 500k, to reduce ram usage of buffers. Noticed small bug in trimming; somehow a read had a 'T' with quality 0, which triggered assertion error. I disabled the assertion but I'm not sure how it happened. Fixed bug in which pigz was not used to decompress fasta files. All program message information now defaults to stderr. Added "ignorebadquality" (ibq) flag for reads with out-of-range quality. TODO: mask by information content Added "mtl"/"mintrimlength" flag (default 60). Reads will not be trimmed shorter than that. Made 'tuc' (to uppercase) default to true for bbmap, to prevent assertion errors.
Reads MUST be uppercase to match reference. Added new tool, BBMask. Reads and SamLines can now be created with null bases. SamLines to Read is now faster, skipping colorspace check. Added deprecated 'SOH' symbol support to FastaInputStream. This will be replaced with a '>'. Needed to process NCBI's NT database. Added "sampad" or "sp" flag to BBMask, to allow masking beyond bounds of mapped reads. TODO: %reads with ins, del, splice TODO: #bases mapped/unmapped, avg read length mapped/unmapped Dedupe now tracks and prints scaffolds that were duplicates with "outd=". (request by Andrew Tritt) Updated all shellscripts to support the -h and --help flags. (suggested by Westerman) RAM detection is now skipped if user supplies -Xmx flag, preventing a false warning. (noted by Westerman) Created AddAdapters.java. Capable of adding adapter sequence to a fastq file, and grading the trimmed file for correctness. Removed some debug code from FileFormat causing a crash on "stdin" with no extension. Noted by Matt Nolan. Added BBWrap and bbwrap.sh. Wraps BBMap to allow multiple input/output files without reloading the reference. Added support for breaking long fastq reads into shorter reads (maxlength and minlength flags). Requested by James Han. Added Pileup support for residual bins smaller than binsize. Flag "ksb", "keepshortbins". Requested by Kurt LaButti. Fixed support for breaking long reads; was failing on the last read in the set. Noted by James Han. Improved accuracy slightly by better detecting when padding is needed. Improved verbose output from MSA. Created TranslateSixFrames, first step toward amino acid mapping. Improved RandomReads ability to simulate PacBio error profile. Fixed crash when using BBSplit in PacBio mode. (Noted by Esther Singer) May have improved ability to read relatively-pathed files if "." is not in $PATH. (nope, seems not) Fixed crash when using "usequality=f" flag with fasta input reads. 
(Noted by Esther Singer) Corrected behaviour of minlength with regards to trimming; it was not always working correctly. Added "bhist" (base composition histogram) flag. v30. Disabled compression/decompression subprocesses when total system threads allowed is less than 3. Fixed assertion error in calcCorrectness in which SiteScores are not necessarily sorted if AMBIGUOUS_RANDOM=true. Noted by Brian Foster. Fixed bug in toLocalAlignment with respect to considering XY as insertions, not subs. TODO: XY should be standardized as substitutions. Added scarf input support. Requested by Alex Copeland. TODO: Allow sam input with interleaved flag. TODO: Make pigz a module dependency or script load. Fixed bug with nodisk mode dropping the name of the first scaffold of every 500MB chunk after the first. Noted by Vasanth Singan. Overhaul of I/O channel creation. Sequence files are now initialized with a FileFormat object which contains information about the format, permission to overwrite, etc. Increased limit of number of index threads in Windows in nodisk mode (since disk fragmentation is no longer relevant). Renamed Read.list to sites; added Read.topSite() and Read.numSites(); replaced many instances of things like "r.sites!=null && !r.sites.isEmpty()" Refactored to put Read and all read-streaming I/O classes in 'stream' package. Moved kmer hashing and indexing classes to kmer package. Moved Variation, subclasses, and related classes to var package. Moved FastaToChrom and ChromToFasta to dna package. Moved pacbio error correction classes to pacbio package. Removed stack, stats, primes, and other packages; prefixed all unused packages with z_. TODO: Sites failing Data.isSingleScaffold() test should be clipped, not discarded. RandomReads3 no longer adds /1 and /2 to paired fastq read names by default (can be enabled with 'addpairnum' flag). Added "inserttag" flag; adds the insert size to sam output. Fixed insert size histogram anomaly.
There was a blip at insert==(read1.length+read2.length) because the algorithm used to calculate insert size was different for reads that overlap and reads that don't overlap. Skimmer now defaults to cigar=true. Added maxindel1 and maxindel2 (or maxindelsum) flags. Removed OUTER_DIST_MULT2 because it caused assertion errors when different from OUTER_DIST_MULT; changed OUTER_DIST_MULT from 15 to 14. Added shellscript for skimmer, bbmapskimmer.sh TODO: Document above changes to parameters. v29. New version since major refactoring. Added FRACTION_GENOME_TO_EXCLUDE flag (fgte). Setting this lower increases sensitivity at expense of speed. Range is 0-1 and default is around 0.03. Added setFractionGenometoExclude() to Skimmer index. LMP libraries were not being paired correctly. Now "rcs=f" may be used to ignore orientation when pairing. Noted by Kurt LaButti. Allocating memory to alignment score matrices caused uncaught out-of-memory error on low-memory machines, resulting in a hang. This is now caught and results in an exit. Noted by Alicia Clum. GPINT machines are now detected and restricted to 4 threads max. This helps prevent out-of-memory errors with PacBio mode. Fixed sam output bug in which an unmapped read would get pnext of 0 rather than 1 when its mate mapped off the beginning of a scaffold. Noted by Rob Egan. Added memory test prior to allocating mapping threads. Thread count will be reduced if there is not enough memory. This is to address the issue noted by James Han, in which the PacBio versions would crash after running out of memory on low-memory nodes. TODO: Detect and prevent low-memory crashes while loading the index by aborting. Fixed assertion error caused by strictmaxindel mode (noted by James Han). Added flag "trd" (trimreaddescriptions) which truncates read names at the first whitespace. Added "usequality/uq" flag to turn on/off usage of quality information when mapping. Requested by Rob Egan.
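Note on the "trd" flag above: truncating a read name at the first whitespace can be sketched as below. This is an illustration only, not the BBMap source; the names TrimDescription and trimDescription are hypothetical, and treating tabs as whitespace alongside spaces is an assumption.

```java
/** Hypothetical illustration of truncating a read name at the first whitespace. */
public class TrimDescription {

    /** Returns the name up to, but not including, the first whitespace character. */
    static String trimDescription(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (Character.isWhitespace(name.charAt(i))) {
                return name.substring(0, i);
            }
        }
        return name; // no whitespace: the name is already minimal
    }

    public static void main(String[] args) {
        System.out.println(trimDescription("read1 1:N:0:ACGT"));
    }
}
```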
Added "keepbadkeys/kbk" flag to prevent discarding of keys due to low quality. Requested by Rob Egan. Fixed crash with very long reads and very small kmers due to exceeding length of various kmer array buffers. Avg Initial Sites and etc no longer printed for read 2 data. TODO: Support for selecting long-mate-pair orientation has been requested by Alex C. Fixed possible bug in read trimming when the entire read was below the quality threshold. Fixed trim mode bug: "trim=both" was only trimming the right side. "qtrim" is also now an alias for "trim". Fixed bug in ConcurrentGenericReadInputStream causing an incorrect assertion error for input in paired files and read sampling. Found by Alex Copeland. Added insert size histogram: ihist= Added "machineout" flag for machine-readable output stats. TODO: reads_B1_100000x150bp_0S_0I_0D_0U_0N_interleaved.fq.gz (ecoli) has 0% rescued for read1 and 0.7% rescued for read 2. After swapping r1 and r2, .664% of r2 is rescued and .001% of r1 is rescued. Why are they not symmetric? Added 'slow' flag to bbmap for increased accuracy. Still in progress. Added MultiStateAligner11ts to MSA minIdToMinRatio(). Changed the way files are tested for permission to write (moved to Tools). Fixed various places in which version string was parsed as an integer. Added test for "help" and "version" flags. Fixed bug in testing for file existence; noted by Bryce Foster. Fixed issue with scaffold names not being trimmed on whitespace boundaries when 'trd=t'. Noted by Rob Egan. Added pigz (parallel gzip) support, at suggestion of Rob Egan. Improved support for subprocesses and pipethreads; they are now automatically killed when not needed, even if the I/O stream is not finished. This allows gunzip/unpigz when a file is being partially read. Added shellscript test for the hostname 'gpint'; in that case, memory will be capped at 4G per process. Changed the way cris/ros are shut down. 
All must now go through ReadWrite.closeStreams() TODO: Force rtis and tsw to go through that too. TODO: Add "Job.fname" field. Made output threads kill processes also. Modified TrimRead to require minlength parameter. Fixed a bug with gathering statistics in BBMapPacBioSkimmer (found by Matt Scholz). Fixed a bug in which reads with match string containing X/Y were not eligible to be semiperfect (Found by Brian Foster). Fixed a bug related to improving the prior fix; I had inverted an == operator (Found by Brian Foster). Added SiteScore.fixXY(), a fast method to fix reads that go out-of-bounds during alignment. Unfinished; score needs to be altered as a result. Added "pairsonly" or "po" flag. Enabling it will treat unpaired reads as unmapped, so they will be sent to 'outu' instead of 'outm'. Suggested by James Han and Alex Copeland. Added shellscript support for java -Xmx flag (Suggested by James Han). Changed behavior: with 'quickmatch' enabled secondary sites will now get cigar strings (mostly, not all of them). "fast" flag now enables quickmatch (50% speedup in e.coli with low-identity reads). Very minor effect on accuracy. Fixed bug with overflowing gref due GREFLIMIT2_CUSHION padding. Found by Alicia Clum. Fixed bug in which writing the index would use pigz rather than native gzip, allowing reads from scaffolds.txt.gz before the (buffered) writing finished. Rare race condition. Found by Brian Foster. Fixed stdout.fa.gz writing uncompressed via ReadStreamWriter. Added "allowSubprocess" flag to all constructors of TextFile and TextStreamWriter, and made TextFile 'tryAllExtensions' flag the last param. allowSubprocess currently defaults to true for ByteFiles and ReadInput/Output Streams. TODO: TextFile and TextStreamWriter (and maybe others?) may ignore ReadWrite.killProcess(). TODO: RTextOutputStream3 - make allowSubprocess a parameter TODO: Assert that first symbol of reference fasta is '>' to help detect corrupt fastas. 
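Note on the last TODO above (checking that a reference fasta begins with '>'): one possible shape for such a check is sketched below. Nothing here is taken from the BBTools source; the names FastaHeaderCheck and startsWithHeader are hypothetical, and the check is deliberately limited to the first byte.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the proposed sanity check for fasta input. */
public class FastaHeaderCheck {

    /**
     * @param firstBytes the leading bytes of the file, already read by the caller
     * @return true if the data is non-empty and begins with the '>' header symbol
     */
    static boolean startsWithHeader(byte[] firstBytes) {
        return firstBytes != null && firstBytes.length > 0 && firstBytes[0] == '>';
    }

    public static void main(String[] args) {
        byte[] good = ">chr1\nACGT\n".getBytes(StandardCharsets.US_ASCII);
        byte[] bad = "ACGT\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithHeader(good) + " " + startsWithHeader(bad));
    }
}
```

A real implementation would likely assert on this result so that a corrupt reference fails early with a clear message instead of producing a broken index.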
Improved TextStreamWriter, TextFile, and all ReadStream classes usage of ReadWrite's InputStream/OutputStream creation/destruction methods. All InputStream and OutputStream creation/destruction now has an allowSubprocesses flag. Added verbose output to all ReadWrite methods. Fixed bug in which realigned SiteScores were not given a new perfect/semiperfect status. Noted by Brian Foster and Will Andreopoulos. v28. New version because the new I/O system seems to be stable now. Re-enabled bam input/output (via samtools subprocess). Lowered shellscript memory from 85% to 84% to provide space for samtools. Added "-l" to "#!/bin/bash" at top. This may make it less likely for the environment to be messed up. Thanks to Alex Boyd for the tip. Addressed potential bug in start/stop index padding calculation for scaffolds that began or ended with non-ACGT bases. Made superclass for Index. Made superclass for BBMap. Removed around 5000 lines of code as a result of dereplication into superclasses. Added MultiStateAligner11ts, which uses arrays for affine transform instead of if blocks. Changing insertions gave a ~5% speedup; subs gave an immeasurably small speedup. Found bug in calculation of insert penalties during mapping. Fixing this bug increases speed but decreases accuracy, so it was modified toward a compromise. v27. Added command line to sam file header. Added "msa=" flag. You can specify which msa to use by entering the classname. Added initial banded mode. Specify "bandwidth=X" or "bandwidthratio=X" to accelerate alignment. Cleaned up argument parsing a bit. Improved nodisk mode; now does not use the disk at all for indexing. BBSplitter still uses the disk. Added "fast" flag, which changes some parameters to make mapping go faster, with slightly lower sensitivity. Improved error handling; corrupt input files should be more likely to crash with an error message and less likely to hang. Noted by Alex Copeland.
Improved SAM input, particularly coordinates and cigar-string parsing; this should now be correct but requires an indexed reference. Of course this information is irrelevant for mapping so this parsing is turned off by default for bbmap. Increased maximum read speed with ByteFile2, by using 2 threads per file. May be useful in input-speed limited scenarios, as when reading compressed input on a node with many cores. Also accelerates sam input. TODO: Consider moving THREADS to Shared. Updated match/cigar flag syntax. Updated shellscript documentation. Changed ByteFile2 from array lists to arrays; should reduce overhead. TODO: Increase speed of sam input. TODO: Increase speed of output, for all formats. TODO: Finish ReadStreamWriter.addStringList(), which allows formatting to be done in the host. In progress: Moving all MapThread fields to abstract class. MapThread now passes reverse-complemented bases to functions to prevent replication of this array. Fixed very rare bug when a non-semiperfect site becomes semiperfect after realignment, but subsequently is no longer highest-ranked. strictmaxindel can now be assigned a number (e.g. strictmaxindel=5). If a fasta read is broken into pieces, now all pieces will receive the _# suffix in their name. Previously, the first piece was exempt. TODO: Consider changing SamLine.rname to a String and seq, qual to byte[]. Changed SamLine.seq, qual to byte[]. Now stored in original read order and only reversed for minus strand during I/O. Added sortscaffolds flag (requested by Vasanth Singan). Fixed XS tag bug; in some cases read 2 was getting opposite flag (noted by Vasanth Singan). Fixed bug when reading sam files without qualities (noted by Brian Foster). Fixed bug where absent cigar strings were printed as "null" instead of "*" as a result of recent changes to sam I/O (noted by Vasanth Singan). Found error when a read goes off the beginning of a block. Ref padding seems to be absent, because Ns were replaced by random sequence.
Cause is unknown; cannot replicate. Fixed Block.getHitList(int, int). Changed calcAffineScore() to require base array for information when throwing exceptions. Changed generated bamscript to unload samtools module before loading samtools/0.1.19. sam file idflag and stopflag are both now faster, particularly for perfect mappings. But both default to off because they are still slow nonetheless. Fixed bug in BBIndex in which a site was considered perfect because all bases matched the reference, but some of the bases were N. Canonically, reads with Ns can never be perfect even if the ref has Ns in the same locations. Fixed above bug again because it was not fully fixed: CHECKSITES was allowing a read to be classified as perfect even if it contained an N. Increased sam read speed by ~2x; 30MB/s to 66MB/s Increased sam write speed from ~18MB/s to ~32MB/s on my 4-core computer (during mapping), with mapping at peak 42MB/s with out=null. Standalone (no mapping) sam output seems to run at 51MB/s but it's hard to tell. Increased fasta write from 118MB/s to 140 MB/s Increased fastq write from 70MB/s to 100MB/s Increased fastq read from 120MB/s (I think) to 296MB/s (663 megabytes/sec!) with 2 threads or 166MB/s with 1 thread Some of these speed increases come from writing byte[] into char[] buffer held in a ThreadLocal, instead of turning them into Strings or appending them byte-by-byte. All of these speed optimizations caused a few I/O bugs that temporarily affected some users between Oct 1 and Oct 4, 2013. Sorry! Flipped XS tag from + to - or vice versa. I seem to have misinterpreted the Cufflinks documentation (noted by Vasanth Singan). Fixed bug in which (as a result of speed optimizations) reads outside scaffold boundaries, in sam 1.3 format, were not getting clipped (Noted by Brian Foster). Changed default behavior of all shellscripts to run with -Xmx4g if maximum memory cannot be detected (typically, because ulimit=infinity). Was 31. 
Unfortunately things will break either way. Fixed off-by-1 error in sam TLEN calculation; also simplified it to give sign based on leftmost POS and always give a plus and minus even when POS is equal. Added sam NH tag (when ambig=all). Disabled sam XM tag because the bowtie documentation and output do not make any sense. Changed sam MD and NM tags to account for 'N' symbol in cigar strings. Made sam SM tag score compatible with mapping score. Fixed bug in SamLine when cigar=f (null pointer when parsing match string). (Found by Vasanth Singan) Fixed bug in BBMapThread* when local=true and ambiguous=toss (null pointer to read.list). (Found by Alexander Spunde) Changed synthetic read naming and parsing (parsecustom flag) to use " /1" and " /2" at the end of paired read names. (Requested by Kurt LaButti) Increased fastq write to 200MB/s (590 megabytes/s) Increased fasta write to 212MB/s (624 megabytes/s measured by fastq input) Increased sam write to 167MB/s (492 megabytes/s measured by fastq input) Increased bread write to 196MB/s (579 megabytes/s measured by fastq input) bf2 (multithreaded input) is now enabled by default on systems with >4 cores, or in ReformatReads always. Fixed RTextOutputStream3.finishedSuccessfully() returning false when output was in 2 files. Changed output streams to unbuffered. No notable speed increase. Fixed bug in ByteFile2 in which reads would be recycled when end of file was hit (found by Brian Foster, Bryce Foster, and Kecia Duffy). v26. Fixed crash from consecutive newlines in ByteFile. Made SiteScore clonable/copyable. Removed @RG line from headers. It implies that reads should be annotated with additional fields based on the RG line information. Changed sam flags (at advice of Joel Martin). Now single-ended reads will never have flags 0x2, 0x40, or 0x80 set. Added correct insert size average to output stats, in place of old inner distance and mapping length. Fixed crash when detecting length of SamLines with no cigar string.
(Found by Shayna Stein) Added flag "keepnames" which keeps the read names unchanged when writing in sam format. Normally, a trailing "/1", "/2", " 1", or " 2" are stripped off, and if read 2's name differs from read 1's name, read 1's name is used for both. This is to remain spec-compliant with the sam format. However, in some cases (such as grading synthetic reads tagged with the correct mapping location) it is useful to retain the original name of each read. Added local alignment option, "local". Translates global alignments into a local alignments using the same affine transform (and soft-clips ends). Changed killbadpairs default to false. Now by default improperly paired reads are allowed. Merged TranslateColorspaceRead versions into a single class. Added interleaved input and output for bread format. May be useful for error correction pipeline. TODO: Mode where reads are mapped to multiple scaffolds, but are mapped at most one time per scaffold. I.e., remove all but top site per scaffold (and forbid self-mapping). Fixed yet another instance of negative coordinates appearing in an unmapped read, which the new version of samtools can't handle. Fixed bug in counting ambiguous reads; was improperly including in statistics reads that were ambiguous but had a score lower than minratio. Fixed rare crash found related to realignment of reads with ambiguous mappings (found by Rob Egan). Unified many of the differences between the MapThread variants, and added a new self-checking function (checkTopSite) to ensure a Read is self-consistent. Added some bitflag fetch functions to SamLine and fixed 'pairedOnSameChrom()' which was not handling the '=' symbol. TODO: Make GENERATE_BASE_SCORES_FROM_QUALITY a parameter, default false in BBMapPacBio and true elsewhere. (I verified this should work fine) TODO: Make GENERATE_KEY_SCORES_FROM_QUALITY a parameter, default true (probably even in BBMapPacBio). 
(I verified this should work fine) Updated LongM (merged with LongM from Dedupe). Fixed bug in SamLine in which clipped leading indels were not considered, causing potential negative coordinates. (Found by Brian Foster) TODO: Match strings like NNNNNNDDDDDNNNNNmmmmmmmmmmmmmmmmm...mmmmmmm should never exist in the first place. Why did that happen? Added "strictmaxindel" flag (default: strictmaxindel=f). Attempts to kill mappings in which there is a single indel event longer than the "maxindel" setting. Requested by James Han. TODO: Ensure strictmaxindel works in all situations, including rescued paired ends and recursively regenerated padded match strings. TODO: Redo msa to be strictly subtractive. Start with score=100*bases, then use e.g. 0 for match, -1 for del, -370 for sub, -100 for N, etc. No need for negative values. Changed TIMEBITS in MultiStateAligner9PacBio from 10 to 9 to address a score underflow assertion error found by Alicia Clum. The underflow occurred around length 5240; new limit should be around 10480. TODO: Alicia found an error of exceeding gref bounds. Fixed race condition in TextStreamWriter. Improved functionality of splitter. Now you can index once and map subsequently using "basename" without specifying "ref=" every single time. "Reads Used" in output now displays the number of reads used. Before, for paired reads, it would display the number of pairs (half as many). Added bases used to reads used at Kurt's request. Improved bam script generation. Now correctly sets samtools memory based on detected memory, and warns user that crashes may be memory-related. Fixed an obsolete assertion in SamLine found by Alicia. Added XS tag option ("xstag=t") for Cufflinks; the need for this was noted by Vasanth Singan. Added 'N' cigar operation for deletions longer than X bases (intronlen=X). Also needed by Cufflinks. Secondary alignments now get "*" for bases and qualities, as recommended by the SAM spec.
This saves space, but may cause problems when converting sam into other formats. Fixed bug that caused interleaved=true to override in2. Now if you set in and in2, interleaved input will be disabled. (noted by Andrew Tritt). Fixed some low-level bugs in I/O streams. When shutting down streams I was waiting until !Thread.isAlive() rather than Thread.getState()==Thread.State.TERMINATED, which caused a race condition (since a thread is not alive before it starts execution). Added debugging file with random name written to /ref/ directory. This should help debugging if somewhere deep in a pipeline multiple processes try to index at the same location simultaneously. Suggested by Bryce Foster. Fixed log file generation causing a crash if the /ref/ directory did not exist, found by Vasanth Singan. Also logging is now disabled by default but enabled if you set "log=t". Input sequence data will now translate '.' and '-' to 'N' automatically, as some fasta databases appear to use '.' instead of 'N'. (Thanks to Kecia Duffy and James Han) Added capability to convert lowercase reads to upper case (crash on lowercase noted by Vasanth Singan). v25. Increased BBMapPacBio max read length to 6000, and BBMapPacBioSkimmer to 4000. Fixed bugs in padding calculations during match string generation. Improved some assertion error output. Added flag "maxsites" for max alignments to print. Added match field to sitescore. Made untrim() affect sitescores as well. Decreased read array buffer from 500 to 20 in MapPacBio. TODO: stitcher for super long reads. TODO: wrapper for split reference mapping and merging. Improved fillAndScoreLimited to return additional information. Added flag "secondary" to print secondary alignments. Does not yet ensure that all secondary alignments will get cigar strings, but most do. Added flag "quickmatch" to generate match strings for SiteScores during slow align. Speeds up the overall process somewhat (at least on my PC; have not tested it on cluster). 
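The stream-shutdown fix noted above (waiting on Thread.getState()==Thread.State.TERMINATED rather than !Thread.isAlive()) can be illustrated with a minimal Java sketch. The class and method names here are hypothetical and are not taken from the BBTools source; the polling loop is only one assumed way of waiting.

```java
/** Hypothetical illustration of the shutdown race; not the BBTools stream code. */
final class ThreadShutdownSketch {

    /** The racy test: also returns true for a thread that has not started yet. */
    static boolean finishedByIsAlive(Thread t) {
        return !t.isAlive();
    }

    /** The safer test: true only after the thread has run to completion. */
    static boolean finishedByState(Thread t) {
        return t.getState() == Thread.State.TERMINATED;
    }

    /** Polls until the thread has terminated (assumes it will eventually be started). */
    static void awaitTermination(Thread t) {
        while (!finishedByState(t)) {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> { /* stand-in for flushing output */ });
        // Before start(), the two tests disagree about whether the thread is done.
        System.out.println("isAlive test: " + finishedByIsAlive(worker));
        System.out.println("state test:   " + finishedByState(worker));
        worker.start();
        awaitTermination(worker);
        System.out.println("state test after run: " + finishedByState(worker));
    }
}
```

Before start() is called, the isAlive-based test already reports the thread as finished while the state-based test does not, which is the window in which the race described above could occur.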
Improved pruning during slow align by dynamically increasing msa limit. Addressed a bug in which reads sometimes have additional sites aligned to the same coordinates as the primary site. The bug can still occur (typically during match generation or as a result of padding), but is detected and corrected during runtime. Tracked down and fixed a bug relating to negative coordinates in sam output for unmapped reads paired with reads mapped off the beginning of a scaffold, with help from Rob Egan. Disabled frowny-face warning message which had caused some confusion. TODO: Add verification of match strings on site scores. Made superclass for MSA. This will allow merging of redundant code over the various BBMap versions. Fixed a crash-hang out-of-memory error caused by initialization order. Now crashes cleanly and terminates. Found by James Han. Fixed bug in output related to detecting cigar string length under sam 1.4 specification (found by Rob Egan). Added flag "killbadpairs"/"kbp". Added flag "fakequality" for fasta. Permanently fixed bugs related to unexpected short match strings caused by error messages. Increased speed of dynamic program phase when dealing with lots of Ns. TODO: In-line generation of short match string when printing a read, rather than mutating the read. (mutation is now temporary) Added flag, "stoptag". Allows generation of SAM tag YS:i: Added flag, "idtag". Allows generation of SAM tag YI:f: v24. Fixed bug that slightly reduced accuracy for reads with exactly 1 mismatch. They were always skipping slow align, sometimes preventing ambiguous reads from being detected. Increased speed of MakeRocCurve (for automatic grading of sam files from synthetic reads). Had used 1 pass per quality level; now it uses only 1 pass total. Increased accuracy of processing reads and contigs with ambiguous bases (in mapping phase). Adjusted clearzones to use gradient functions and asymptotes rather than step functions. 
Reduces false positives and increases true positives, especially near the old step cutoffs. Fixed trimSitesBelowCutoff assertion that failed for paired reads. Added single scaffold toggle to RandomReads. Default 'singlescaffold=true'; forces reads to come from a single scaffold. This can cause non-termination if no scaffolds are long enough, and may bias against shorter scaffolds. Added min scaffold overlap to RandomReads. Default 'overlap=1'; forces reads to overlap a scaffold at least this much. This can cause non-termination if no scaffolds are long enough, and may bias against shorter scaffolds. Fixed setPerfect(). Previously, reads with 'N' overlapping 'N' in the reference could be considered perfect matches, but no reads containing 'N' should ever be considered a perfect mapping to anything. Formalized definition of semiperfect to require read having no ambiguous bases, and fixed "isSemiperfect()" function accordingly. Shortened and clarified executable names. Fixed soft-clipped read start position calculation (mainly relevant to grading). Prevented reads from being double-counted when grading, when a program gives multiple primary alignments for a read. Fixed a bug in splitter initialization. Added "ambiguous2". Reads that map to multiple references can now be written to distinct files (prefixed by "AMBIGUOUS_") or thrown away, independently of whether they are ambiguous in the normal sense (which includes ambiguous within a single reference). Added statistics tracking per reference and per scaffold. Enable with "scafstats=" or "refstats=". "ambiguous" may now be shortened to "ambig" on the command line. "true" and "false" may now be shortened to t, 1, or f, 0. If omitted entirely, "true" is assumed; e.g. "overwrite" is equivalent to "overwrite=true". Added stderr as a valid output destination specified from the command line. BBSplitter now has a flag, "mapmode"; can be set to normal, accurate, pacbio, or pacbioskimmer.
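The boolean shorthand described above (t, 1, f, 0, and a bare flag meaning true) could be handled along the following lines. This is a sketch under stated assumptions, not the actual BBTools parser: the class and method names are hypothetical, and case-insensitive matching and the exception on unrecognized values are assumptions made for the example.

```java
/** Hypothetical flag-parsing sketch; not the BBTools command-line parser. */
final class FlagSketch {

    /** Interprets a flag value; null means the flag was given with no "=value". */
    static boolean parseBoolean(String value) {
        if (value == null) {
            return true; // bare flag: "overwrite" is treated like "overwrite=true"
        }
        // Case-insensitive matching is an assumption of this sketch.
        switch (value.toLowerCase(java.util.Locale.ROOT)) {
            case "true":
            case "t":
            case "1":
                return true;
            case "false":
            case "f":
            case "0":
                return false;
            default:
                throw new IllegalArgumentException("Not a boolean value: " + value);
        }
    }

    /** Splits "name=value" (or a bare "name") and returns the boolean value. */
    static boolean parseFlag(String arg) {
        int eq = arg.indexOf('=');
        String value = (eq < 0) ? null : arg.substring(eq + 1);
        return parseBoolean(value);
    }

    public static void main(String[] args) {
        String[] examples = {"overwrite", "overwrite=t", "overwrite=0", "overwrite=false"};
        for (String arg : examples) {
            System.out.println(arg + " -> " + parseFlag(arg));
        }
    }
}
```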
Fixed issue where stuff was being written to stdout instead of stderr and ended up in SAM files (found by Brian Foster). TODO: Add secondary alignments. TODO: Unlimited length reads. TODO: Protein mapping. TODO: Soft clipping in both bbmap and GradeSamFile. Should universally adjust coords by soft-clip amount when reported in SAM format. Fixed assertion error concerning reads containing Ns marked as perfect, when aligned to reference Ns (found by Rob Egan). Fixed potential null-pointer error in "showprogress" flag. v23. Created BBSplitter wrapper for BBMap that allows merging any number of references together and splitting the output into different streams. Added support for ambiguous=random with paired reads (before it was limited to unpaired). TODO: Iterative anchored alignment for very long reads, with a full master gref. TODO: untrim=c/m/s/n/r TODO: mode=vfast/veryfast: k=14 minratio=0.8 minhits=2 maxindel=20 TODO: mode=slow/accurate: BBMapi TODO: mode=pacbio: BBMapPacBio k=12 TODO: mode=rnaseq TODO: Put untrim in caclStatistics section TODO: Test with MEGAN. Finished new random read generator. Much faster, and solves coordinate problem with multiple indels. Improved error message on read parsing failures. TODO: Insert size histogram TODO: "outp=", output for reads that mapped paired TODO: "outs=", output for reads that mapped singly Corrected assertion in "isSingleScaffold()". Fixed a rare bug preventing recursive realignment when ambiguous=random (found by Brian Foster). Added samversion/samv flag. Set to 1.3 for cigar strings with 'M' or 1.4 for cigar strings with '=' and 'X'. Default is 1.3. Added enforcement of thread limit when indexing. Added internal autodetection of gpint machines. Set default threadcount for gpints at 2. Improved ability to map with maxindel=0. Added XM:i: optional SAM flag because some programs seem to demand it. Like all extra flags, this is omitted if the read is not mapped.
Otherwise, it is set to 1 for unambiguously mapped reads, and 2 or more for ambiguously mapped reads. The number can range as high as the total number of equal-scoring sites, but this is not guaranteed unless the "ambiguous=random" flag is used. Fixed bug in autodetection of paired ends, found by Rob Egan. v22. Added match histogram support. Added quality histogram support. Added interleaving support to random read generator. Added ability to disable pair rescue ("rescue=false" flag), which can speed things up in some cases. Disabled dynamic-programming slow alignment phase when no indels are allowed. Accelerated rescue in perfect and semiperfect mode. Vastly accelerated paired mapping against references with a very low expected mapping rate. Fixed crash in rescue caused by reads without quality strings (e.g. paired fasta files). (found by Brian Foster) v21. If reference specified is same as already-processed reference, the old index will not be deleted. Added BBMap memory usage estimator to assembly statistics tool: java -Xmx120m jgi.AssemblyStats2 k= Added support for multiple output read streams: all reads (set by out=), mapped reads (set by outm=), and unmapped reads (set by outu=). They can be in different formats and any combination can be used at once. You can set pair output to secondary files with out2, outm2, and outu2. Changed definition of "out=". You can no longer specify split output streams implicitly by using a "#" in the filename; it must be explicit. The "#" wildcard is still allowed for input streams. Fixed a bug with sam input not working. (found by Brian Foster) Added additional interleaved autodetection pattern for reads named "xxxxx 1:xxxx" and "xxxxx 2:xxxx". Fixed a bug with soft-clipped deletions causing an incorrect cigar length. (found by Brian Foster) Fixed a bug with parsing of negative numbers in byte arrays. TODO: Found a new situation in which poly-N reads preferentially map to poly-N reference (probably tip search?)
Fixed a bug in which paired reads occasionally are incorrectly considered non-semiperfect. (found by Brian Foster) Added more assertion tests for perfection/imperfection status. Added blacklist support. This allows selection of output stream based on the name of the scaffold to which a read maps. Created Blacklist class, allowing creation of blacklists and whitelists. Added outb (aka outblacklist) and outb2 streams, to output reads that mapped to blacklisted scaffolds. Added flag "outputblacklisted=" which controls whether blacklisted reads are printed to the "out=" stream. Default is true. Added support for streaming references. e.g. "cat ref1.fa ref2.fa | java BBMap ref=stdin.fa" Updated and reorganized this readme. Removed a dependency on Java 7 libraries (so that the code runs in Java 6). Added per-read error rate histogram. Enable with qhist= TODO: generate standard deviation. Added per-base-position M/S/D/I/N rate tracking. Enable with mhist= Added quality trimming. Reads may be trimmed prior to mapping, and optionally untrimmed after mapping, so that no data is lost. Trimmed bases are reported as soft-clipped in this case. Trimming will extend until at least 2 consecutive bases have a quality greater than trimq (default 5). Added flags: trim=, trimq=<5>, untrim= TODO: Correct insert size in realtime for trim length. TODO: Consider adding a TrimRead pointer to reads, rather than using obj. TODO: Consider extending match string as 'M' rather than 'C' as long as clipped bases match. Found and made safe some instances where reads could be trimmed to less than kmer length. Found and fixed instance where rescue was attempted for length-zero reads. Fixed an instance where perfect reads were not marked perfect (while making match string). v20.1 (not differentiated from v20 since the differences are minor) Fixed a minor, longstanding bug that prevented minus-strand alignment of reads that only had a single valid key (due to low complexity or low quality).
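The quality-trimming rule from the v21 notes above ("trimming will extend until at least 2 consecutive bases have a quality greater than trimq") can be sketched as follows. This is one plausible reading of that sentence, not the BBMap implementation: the class and method names are hypothetical, and treating a read with no qualifying run as fully trimmed is an assumption of the sketch.

```java
/** Hypothetical quality-trimming sketch; not the BBMap trimming code. */
final class QualityTrimSketch {

    /** Number of bases to trim from the left end of the read. */
    static int leftTrimCount(byte[] quals, int trimq) {
        int run = 0; // length of the current run of bases with quality > trimq
        for (int i = 0; i < quals.length; i++) {
            if (quals[i] > trimq) {
                run++;
                if (run >= 2) {
                    return i - 1; // the run starts at i-1, so i-1 bases precede it
                }
            } else {
                run = 0;
            }
        }
        return quals.length; // no qualifying run: assume the whole read is trimmed
    }

    /** Number of bases to trim from the right end of the read. */
    static int rightTrimCount(byte[] quals, int trimq) {
        int run = 0;
        for (int i = quals.length - 1; i >= 0; i--) {
            if (quals[i] > trimq) {
                run++;
                if (run >= 2) {
                    return quals.length - 2 - i; // bases after the run ending at i+1
                }
            } else {
                run = 0;
            }
        }
        return quals.length;
    }

    public static void main(String[] args) {
        byte[] quals = {2, 3, 10, 4, 20, 30, 30};
        int trimq = 5; // default given in the notes above
        System.out.println("left trim:  " + leftTrimCount(quals, trimq));
        System.out.println("right trim: " + rightTrimCount(quals, trimq));
    }
}
```

In this example the isolated base of quality 10 does not stop left trimming, because it is followed by a base at or below trimq; trimming stops only at the run beginning with quality 20.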
Increased accuracy of perfectmode and semiperfectmode, by allowing mapping of reads with only one valid key, without loss of speed. They still don't quite match normal mode since they use fewer keys. Added detection of and error messages for reads that are too long to map. Improved shell script usage information. v20. Made all MapThreads subclasses of MapThread, eliminating duplicate code. Any exception thrown by a MapThread will now be detected, allowing the process to complete normally without hanging. Exceptions (e.g. OutOfMemory) when loading reference genome are now detected, typically causing a crash exit instead of a hang. Exceptions (e.g. OutOfMemory) when generating index are now detected, causing a crash exit instead of a hang. Exceptions in output stream (RTextOutputStream) subthreads are now detected, throwing an exception. Added support for soft clipping. All reads that go off the ends of scaffolds will be soft-clipped when output to SAM format. (The necessity of this was noted by Rob Egan, as negative scaffold indices can cause software such as samtools to crash) v19. Added support for leading FASTA comments (denoted by semicolon). Fixed potential problem in FASTA read input stream with very long reads. Recognizes additional FASTA file extensions: .seq, .fna, .ffn, .frn, .fsa, .fas Disabled gzip subprocesses to circumvent a bug in UGE: Forking can cause a program to be terminated. Gzip is still supported. Slightly reduced memory allocation in shellscript. Ported "Analyze Index" improvement over to all versions (except v5). Added flags: fastaminread, showprogress Fixed problem noted by Rob Egan in which paired-end reads containing mostly 'N' could be rescued by aligning to the poly-N section off the end of a contig. Fixed: Synthetic read headers were being improperly parsed by new FASTQ input stream. Made a new, faster, more correct version of "isSemiperfect". Added "semiperfect" test for reads changed during findDeletions. 
Identified locations in "scoreNoIndels" where call 'N' == ref 'N' is considered a match. Does not seem to cause problems. Noted that SAM flag 0x40 and 0x80 definitions differ from my usage. v18. Fastq read input speed doubled. Fasta read input speed increased 50%. Increased speed of "Analyze Index" by a factor of 3+ (just for BBMap so far; have not yet ported change over to other versions). Fixed an array out-of-bounds bug found by Alicia Clum. Added bam output option (relies on Samtools being installed). Allows gzip subprocesses, which can sometimes improve gzipping and gunzipping speed over Java's implementation (will be used automatically if gzip is installed). This can be disabled with the flags "usegzip=false" and "usegunzip=false". Started a 32-bit mode which allows 4GB per block instead of 2GB, for a slight memory savings (not finished yet). Added nondeterministic random read sampling option. Added flags: minscaf, startpad, stoppad, samplerate, sampleseed, kfilter, usegzip, usegunzip v17. Changed the way error rate statistics are displayed. All now use match string length as denominator. Identified error in random read generator regarding multiple insertions. It will be hard to fix but does not matter much. Found out-of-bounds error when filling gref. Fixed (but maybe not everywhere...). Added random mapping for ambiguous reads. Changed index from 2d array to single array (saves a lot of memory). Increased speed by ~10%. Improved index generation and loading speed (typically more than doubled). Changed chrom format to gzipped. Added "nodisk" flag; index is not written to disk. Fixed a rare out-of-bounds error. Increased speed of perfect read mapping. Fixed rare human PAR bug. v16. Changes since last version: Supports unlimited number of unscaffolded contigs. Supports piping in and out. Set "out=stdout.sam" and "in=stdin.fq" to pipe in a fastq file and pipe out a sam file (other extensions are also supported).
Ambiguously named files (without proper extensions) will be autodetected as fasta or fastq (though I suggest not relying on that). Added additional flags (described in parameters section): minapproxhits, padding, tipsearch, maxindel. minapproxhits has a huge impact on speed. Going from 1 to 2 will typically at least double the speed (on a large genome) at some cost to accuracy. v15. Changes since last version: Contig names are retained for output. SAM header @SQ tags fixed. SAM header @PG tag added. An out-of-bounds error was fixed. An error related to short match strings was found and possibly handled. All versions now give full statistics related to %matches, %substitutions, %deletions, and %insertions (unless match string generation is disabled). Increased speed and accuracy for tiny (<20MB) genomes. Added dynamic detection of scaffold sizes to better partition index, reducing memory in some cases. Added command-line specification of kmer length. Added more command line flags and described them in this readme. Allowed overwriting of existing indices, for ease of use (only when overwrite=true). For efficiency you should still only specify "ref=" the first time you map to a particular reference, and just specify the build number subsequently.