Releases: epruesse/SINA
Minor fix (build issue w/o TBB Malloc)
Minor fix (rounding error in classifier)
- Fixes #93 where an LCA-Quorum of 0.8 on 10 results allowed only one outlier, rather than the expected 2.
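The arithmetic behind this kind of rounding error can be shown with a short sketch. It is not SINA's actual code: the function names and the tolerance `eps` are assumptions made for illustration. With a quorum of 0.8 on 10 results, the number of tolerated outliers is 10 × (1 − 0.8), which comes out slightly below 2 in binary floating point, so truncating it gives 1.

```python
import math


def allowed_outliers_naive(n_results, quorum):
    # Hypothetical naive version: truncate n * (1 - quorum) directly.
    # 10 * (1.0 - 0.8) is 1.9999999999999996, which floors to 1.
    return math.floor(n_results * (1.0 - quorum))


def allowed_outliers_tolerant(n_results, quorum, eps=1e-9):
    # Hypothetical fix: add a small tolerance before truncating so that
    # values a hair below an integer still count as that integer.
    return math.floor(n_results * (1.0 - quorum) + eps)


print(allowed_outliers_naive(10, 0.8))     # 1
print(allowed_outliers_tolerant(10, 0.8))  # 2
```

Exact rational arithmetic (for example `fractions.Fraction`) would be another way to avoid the problem; the tolerance is used here only because it is the smallest change to the naive version.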
Improved CSV output
The old --meta-fmt CSV option has been deprecated in favor of having multiple output modules active. To get CSV output as well as the aligned sequences, you can now write -o aligned.fasta.gz -o aligned.csv. The fields that are written to CSV, FASTA or ARB output types can be configured with the --field (-f) parameter. SINA can now also show a list of all fields available in a reference ARB database using --arb-list-fields FILENAME.
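As a rough illustration of the repeated -o usage, the sketch below assembles such a command line from a script. The helper name, the file names `query.fasta` and `reference.arb`, and the use of `-i` and `-r` for the input and reference database are assumptions not taken from these notes; only the repeated `-o` comes from the text above. The command is printed, not executed.

```python
import shlex


def build_sina_command(query, reference, outputs):
    # Hypothetical helper: "-i" (input) and "-r" (reference database)
    # are assumed here; check `sina --help` for the exact spelling.
    cmd = ["sina", "-i", query, "-r", reference]
    # One "-o" per output file activates one output module each.
    for out in outputs:
        cmd += ["-o", out]
    return cmd


cmd = build_sina_command(
    "query.fasta", "reference.arb", ["aligned.fasta.gz", "aligned.csv"]
)
print(shlex.join(cmd))
```

Building the argument list first and passing it to `subprocess.run(cmd, check=True)` avoids shell quoting issues if file names contain spaces.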
Changelog:
Minor fixes
- All progress bars are now silenced when the output is redirected into a file or pipe
- Progress bars no longer overwrite some of the previous output (i.e. the cursor is no longer moved up too often).
Speedups: Internal Kmer Search Now Default
With 1.6.0, the new, very fast internal search engine has become the default. The --search module has been parallelized and performance has been tweaked in many other places.
Here are some numbers:
Input | Reference | Settings | 1.6.0 | 1.5.0 | speedup |
---|---|---|---|---|---|
V4 | SILVA NR | align | 282/s | 22/s | 12.8 |
V4 | SILVA NR | align & classify | 185/s | 3/s | 61.7 |
V4 | SILVA NR | turn & align & classify | 120/s | 3/s | 40 |
full | SILVA NR | align | 42/s | 3/s | 14 |
full | SILVA NR | align & classify | 35/s | 0.65/s | 58.3 |
full | SILVA NR | turn & align & classify | 33/s | 0.6/s | 55 |
V4 | test (38k) | align | 312/s | 225/s | 1.4 |
V4 | test (38k) | align & classify | 265/s | 25/s | 10.6 |
V4 | test (38k) | turn & align & classify | 260/s | 25/s | 10.4 |
full | test (38k) | align | 58/s | 45/s | 1.3 |
full | test (38k) | align & classify | 51/s | 9.6/s | 5.3 |
full | test (38k) | turn & align & classify | 51/s | 6/s | 8.5 |
(Numbers from a Ryzen 1700 with 32GB and 16 threads)
Prerelease: speedups!!!
It's finally done. Please give it a spin.
With 1.6.0, the new internal search engine is becoming the default. The --search module has been parallelized and performance has been tweaked in many other places.
Towards an Internal Kmer Search Engine
Internal Kmer Search Update
With this release, the internal kmer search is nearing completion. The kmer index is now persisted to disk, computed in parallel, and uses a presence/absence optimization to reduce its total size and search time. It's many times faster than the original PT server based search. (You still need to use --num-pts to make it use multiple threads, though.) Tweaks to the way SINA interacts with ARB and caches sequences internally have reduced the memory usage of the kmer search indexing and use stages, allowing SINA to work with the current SILVA Ref NR on a 16GB machine.
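To make the idea of a kmer index with a presence/absence optimization more concrete, here is a minimal sketch. It shows one plausible reading of the term, namely storing for each kmer either the sequences that contain it or the sequences that lack it, whichever list is shorter. This is an assumption for illustration, not a description of SINA's actual data structures, and all function names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    # Set of distinct kmers in a sequence (presence only, no counts).
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(seqs, k):
    n = len(seqs)
    present = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for km in kmers(seq, k):
            present[km].add(idx)
    index = {}
    for km, ids in present.items():
        if len(ids) > n // 2:
            # Common kmer: store the (shorter) list of sequences lacking it.
            index[km] = (True, set(range(n)) - ids)
        else:
            index[km] = (False, ids)
    return index


def shared_kmer_counts(index, n, query, k):
    # Number of distinct query kmers shared with each reference sequence.
    counts = [0] * n
    base = 0  # kmers assumed present everywhere unless listed as absent
    for km in kmers(query, k):
        entry = index.get(km)
        if entry is None:
            continue
        inverted, ids = entry
        if inverted:
            base += 1
            for i in ids:
                counts[i] -= 1
        else:
            for i in ids:
                counts[i] += 1
    return [c + base for c in counts]


refs = ["ACGTACGT", "ACGTTTTT", "ACGTGGGG", "TTTTGGGG"]
index = build_index(refs, 4)
print(shared_kmer_counts(index, len(refs), "ACGTTTTG", 4))  # [1, 4, 1, 2]
```

In this toy example the kmer ACGT occurs in three of the four references, so the index keeps only the single reference that lacks it. Lookups touch fewer entries for common kmers, which is where both the size and the time savings would come from under this reading.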
Documentation Update
The documentation is now up to date with the current features. A man page is distributed with SINA and available via man sina from conda environments. Text-file versions are shipped in share/doc/sina, and a pretty HTML version rendered by Sphinx is available at https://sina.readthedocs.io.
Evaluation Options Reinstated
The options --show-dist and --fs-msc-max have been reinstated to allow evaluating the accuracy of SINA. New unit tests are in place to verify that the accuracy doesn't accidentally drop. These will help make the switch to the internal kmer search without risking significant changes to the overall accuracy.
Changelog
- update documentation (#20)
- reinstate --show-dist
- reinstate --fs-msc-max
- add choice "exact" to --search-iupac
- change default for --search-kmer-len to match --fs-kmer-len
- parallelize launch of background PT servers
- lower memory usage:
  - avoid redundant sequence caching by libARBDB
  - use compact aligned base (50% on internal sequence cache)
- improve internal kmer search performance
  - add caching of kmer index on disk
  - parallelize kmer index construction
  - add presence/absence optimization
- fix field align_ident_slv added for 100% matches even when not enabled
- fix crash on overhang past alignment edge
- fix libARBDB writing to stdout, clobbering sequence output
- fix out-of-bounds access on iterator in NAST implementation
- remove dependency on boost serialization library
- build release binaries with GCC 7 and C++11 ABI
- add integration tests watching for accuracy regressions (#25)
Parallel SINA
Parallel SINA is here!
Use --num-pts N to specify the number of PT servers you would like working in parallel. The rest of SINA will adapt dynamically to the available resources (if you must, adjust it with --threads).
Please remember that the PT server is rather memory hungry. If you set --num-pts too high, you will run out of memory and SINA will crash.
Other Improvements:
Add search result to output:
Using --add-relatives N you can now ask SINA to add the search result sequences to the sequence output file. If you have --search enabled, it will use the N best results from the alignment based homology search. Otherwise, it will use the N sequences with the highest relative number of kmers shared with each query. Each reference sequence will be added only once.
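The selection rule described above (top N per query, each reference written only once) can be sketched as follows. The function name, the input layout (a mapping from query name to scored hits) and the example names are hypothetical and serve only to illustrate the de-duplication; they do not reflect SINA's internals.

```python
def collect_relatives(hits_per_query, n):
    # hits_per_query: {query_name: [(reference_name, score), ...]}
    # Returns reference names in first-seen order, each at most once.
    seen = set()
    ordered = []
    for hits in hits_per_query.values():
        # Keep the n highest-scoring references for this query.
        best = sorted(hits, key=lambda hit: hit[1], reverse=True)[:n]
        for name, _score in best:
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered


hits = {
    "query1": [("refA", 0.9), ("refB", 0.8), ("refC", 0.5)],
    "query2": [("refB", 0.95), ("refD", 0.7)],
}
print(collect_relatives(hits, 2))  # ['refA', 'refB', 'refD']
```

Here refB is among the best two hits for both queries but appears once in the result, and refC is dropped because it is only the third-best hit for query1.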
Input / Output:
SINA will now read and write gzipped FASTA files transparently. You can also use - as input/output file name to pipe sequences through SINA.
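For readers who pre- or post-process such files in their own scripts, the same convention can be mirrored with a small helper. This is a sketch under assumptions: it decides on gzip by the `.gz` file extension, which may differ from how SINA itself detects compression, and the function name is hypothetical.

```python
import gzip
import sys


def open_sequences(path, mode="r"):
    # Open a FASTA file for text reading ("r") or writing ("w").
    # "-" maps to standard input/output; a ".gz" suffix selects gzip.
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if path == "-":
        return sys.stdin if mode == "r" else sys.stdout
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)
```

A script could then read `open_sequences("aligned.fasta.gz")` and plain `aligned.fasta` through the same code path, or pass `-` to sit in a pipeline after SINA.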
Logging
SINA now has an actual logging facility. You can change its verbosity with -q and -v (repeat to increase or decrease further). The log file specified with --log-file will always be verbose (but not include debug messages).
Parallel SINA - Preview
Parallelization adds a whole new class of bugs that become possible. If this breaks, stalls, crashes or otherwise misbehaves, please create an issue!
- process sequences in parallel (#17, #31)
- add support for gzipped read/write (#29)
- add support for "-" to read/write using pipes
- remove internal pipeline in favor of TBB
- add option --add-relatives; adds ref sequences to output (#19)
- add logging with variable verbosity (#14)
- be smart about locating arb_pt_server binary (#30)
- add --add-relatives adding search result to output (#19)
Maintenance Release
- report number of references discarded due to configured constraints
- fix crash (regression) if no acceptable references found for a query
- fix --search causes a program option error (#28)
- fix race condition in terminating PT server