All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- The `mmseqs search` command has been replaced by a two-step alignment workflow. The first alignment step uses `--alignment-mode 1` and `--max-rejected`, while the second step uses `--alignment-mode 2` and `-c 0.2`. This change reduces the number of alignments that are rejected for not meeting the minimum coverage cutoff and mitigates the issue where the annotation results change when the input sequence order is altered.
- The `--min-ungapped-score` parameter of `mmseqs prefilter` was increased from `20` to `25`.
- The `--max-rejected` parameter of the first `mmseqs align` step was increased from `225` to `280`.
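
As a rough illustration of the two-step alignment workflow described above, the sketch below assembles two `mmseqs align` invocations in Python without running them. The helper `build_two_step_align` and all database names are hypothetical, and feeding the first step's result into the second step is an assumption. Only the flags named in this changelog are shown, so geNomad's actual calls may pass additional options.

```python
def build_two_step_align(query_db, target_db, prefilter_db, out_prefix, max_rejected=280):
    """Return the argument lists for the two alignment steps (nothing is executed)."""
    first_aln = f"{out_prefix}_aln1"   # hypothetical intermediate database name
    second_aln = f"{out_prefix}_aln2"  # hypothetical final database name
    # Step 1: alignment mode 1 with a cap on consecutive rejected alignments.
    step1 = ["mmseqs", "align", query_db, target_db, prefilter_db, first_aln,
             "--alignment-mode", "1", "--max-rejected", str(max_rejected)]
    # Step 2 (assumed to realign the hits kept by step 1): mode 2 with a coverage cutoff.
    step2 = ["mmseqs", "align", query_db, target_db, first_aln, second_aln,
             "--alignment-mode", "2", "-c", "0.2"]
    return step1, step2
```

The two lists could then be passed to `subprocess.run` one after the other.
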
- Replace `np.warnings` with `warnings` to add compatibility with `numpy >= 1.24`.
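
A minimal sketch of what replacing `np.warnings` with the standard-library `warnings` module can look like. The functions `noisy_mean` and `quiet_mean` are hypothetical and only stand in for code that silences a `RuntimeWarning`; they are not taken from geNomad.

```python
import warnings


def noisy_mean(values):
    """Hypothetical helper that warns on empty input, as a numeric library might."""
    if not values:
        warnings.warn("mean of empty sequence", RuntimeWarning)
        return float("nan")
    return sum(values) / len(values)


def quiet_mean(values):
    # Code written against older NumPy might have reached this context manager
    # through np.warnings; the standard-library module provides it directly.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return noisy_mean(values)
```
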
- Update `numba` (`>=0.57`) and `numpy` (`>=1.21`) version requirements.
- Use `casefold` for sequence comparison within the `Sequence` class.
- Remove type annotations of methods of the `Sequence` class that return an instance of `Sequence`.
- Use `console.status` to log the deletion of the `.tar.gz` file during the execution of `download-database`.
- Make the conservative assignment at the family level optional via the `--conservative-taxonomy` parameter. This increases the number of viral genomes assigned to a family when executing geNomad with default parameters.
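
To illustrate the `casefold`-based comparison mentioned above, here is a small stand-in class. It is not geNomad's actual `Sequence` implementation; the attribute names and the matching `__hash__` are assumptions made for the example.

```python
class Sequence:
    """Minimal stand-in for a sequence record; not geNomad's actual class."""

    def __init__(self, accession: str, seq: str):
        self.accession = accession
        self.seq = seq

    def __eq__(self, other):
        # Accept either another Sequence or a plain string.
        if isinstance(other, Sequence):
            other = other.seq
        if not isinstance(other, str):
            return NotImplemented
        # casefold() makes the comparison case-insensitive.
        return self.seq.casefold() == other.casefold()

    def __hash__(self):
        # Hash the case-folded residues so that equal sequences hash equally.
        return hash(self.seq.casefold())
```

With this definition, `Sequence("a", "ACGT") == "acgt"` evaluates to `True`, and two records that differ only in case collapse to one entry in a `set`.
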
- Fix parameter names in the error message of `--conservative` and `--relaxed` (e.g. `--min_score` → `--min-score`).
- Display a progress bar showing the progress of the classification process in `nn-classification`.
- Update `README.md` to the database version 1.3.0.
- Make `mmseqs convertalis` output the whole sequence header instead of gene accessions. This prevents parsing conflicts with geNomad's other components in cases where MMseqs2 uses its built-in special parsers for specific header formats (e.g. RefSeq).
- Add the `--threads` parameter to the `nn-classification` module, which allows controlling the number of threads used for classifying sequences using the neural network model.
- Mention the post-classification filters in the `summary` module description.
- Given that geNomad applies a minimum score filter (since version 1.4.0), the help dialogue of the `--min-score` parameter was modified to remove the following sentence: "By default, the sequence is classified as virus/plasmid if its virus/plasmid score is higher than its chromosome score, regardless of the value".
- The following parameters were added to the MMseqs2 search command: `--max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225`. As a result, changing `--splits` won't affect the search results anymore.
- Mention Docker and the NMDC EDGE implementation in the `README.md`.
- Add the `--min-plasmid-hallmarks-short-seqs` and `--min-virus-hallmarks-short-seqs` parameters. These options allow filtering out short sequences (less than 2,500 bp) that don't encode a minimum number of hallmark genes. By default, short sequences need to encode at least one hallmark to be classified as a virus or a plasmid.
- Add the `--conservative` and `--relaxed` presets that control post-classification filters. The `--conservative` option makes those filters even more aggressive, resulting in more restricted sets of plasmids and viruses that contain only sequences whose classification is strongly supported. The `--relaxed` preset disables all post-classification filters.
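
The short-sequence hallmark filter described above can be sketched as follows. The function name and signature are hypothetical; only the 2,500 bp threshold and the default of one hallmark come from the entry itself.

```python
SHORT_SEQ_LENGTH = 2500  # bp; threshold stated in the changelog entry


def passes_short_seq_filter(length, n_hallmarks, min_hallmarks=1):
    """Return True if a sequence survives the short-sequence hallmark filter."""
    if length >= SHORT_SEQ_LENGTH:
        return True  # the filter only applies to short sequences
    return n_hallmarks >= min_hallmarks
```

Setting `min_hallmarks=0` lets every short sequence through, which corresponds to switching this particular filter off.
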
- Windows with more than 4,000 Ns are ignored when encoding sequences for the neural network classification. The first window is always processed, regardless of the number of Ns.
- Changed the default value of `--min-score` from 0.0 to 0.7.
- Changed the default search sensitivity from 4.0 to 4.2.
- Update `README.md` to version 1.4.0. This includes mentions of the `--conservative` and `--relaxed` flags and a warning about how changes in `--splits` can affect geNomad's output.
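
The rule for N-rich windows noted in the entries above (skip windows with more than 4,000 Ns, but always keep the first one) could look roughly like this. The generator name and the `window_size` default are placeholders chosen for the example, not values confirmed by this changelog.

```python
def windows_to_encode(seq, window_size=6000, max_ns=4000):
    """Yield (start, window) pairs, skipping N-rich windows except the first."""
    for i, start in enumerate(range(0, len(seq), window_size)):
        window = seq[start:start + window_size]
        n_count = window.upper().count("N")
        # The first window is always kept; later ones only if not N-rich.
        if i == 0 or n_count <= max_ns:
            yield start, window
```
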
- Fix a bug in `score-calibration` that happened when `find-proviruses` was executed but no provirus was detected. The module now checks if proviruses were detected (using `utils.check_provirus_execution`) before counting the total number of sequences.
- Require `numpy <1.24`. Fixes #7, which occurs because `numba` doesn't support `numpy >=1.24` yet.
- Check if `find-proviruses` was executed when counting the number of sequences in the `score-calibration` module.
- Add support for AMR annotation.
- Update database parsing to allow BUSCO-based USCGs.
- Sequences with no terminal repeats will be flagged with `No terminal repeats`, as `Linear` can be misleading.
- Print the number of plasmids and viruses in the summary module.
- Set `click.rich_click.MAX_WIDTH` to `None`.
- Reduce the default `--sensitivity` to `4.0`.
- Update `README.md` to version 1.3.0.
- Set `prog_name` in `click.version_option`.
- Mention the Zenodo upload of geNomad's database in `README.md`.
- Add the following sentence to the help dialogue of the `--min-plasmid-marker-enrichment`, `--min-virus-marker-enrichment`, `--min-plasmid-hallmarks`, and `--min-virus-hallmarks` parameters: "This option will be ignored if the annotation module was not executed".
- Apply a uniform prior to the empirical sample composition in `score_batch_correction`. This will shrink the effect of calibration when the empirical composition distribution is very skewed.
- Reduce the `--min-score` in the `README.md` example to 0.7.
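
The changelog does not say how the uniform prior is combined with the empirical composition in `score_batch_correction`. One common way to get this kind of shrinkage is additive smoothing, sketched below under that assumption. The function `smooth_composition` and its `prior_weight` parameter are hypothetical.

```python
def smooth_composition(counts, prior_weight=10.0):
    """Blend observed class counts with a uniform prior (additive smoothing)."""
    n_classes = len(counts)
    total = sum(counts)
    pseudo = prior_weight / n_classes  # equal pseudocount per class
    return [(c + pseudo) / (total + prior_weight) for c in counts]
```

For a skewed sample such as `[98, 1, 1]`, the smoothed proportions move towards one third each, so a calibration step that relies on them is pulled less strongly by the dominant class.
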
- Fix a bug in the score calibration module where the sample size was set to a constant value and the "Your sample has less than 1,000 sequences…" warning would always appear.
- Dockerfile for version 1.0.0.
- `Sequence` class: add support for `str` in `__eq__`.
- `Sequence` class: add a `__hash__` method.
- Compute marker enrichment in the `marker-classification` module.
- Add columns for plasmid and virus marker enrichment to the `_plasmid_summary.tsv` and `_virus_summary.tsv` files.
- Set `--min-plasmid-marker-enrichment` and `--min-virus-marker-enrichment` to `0` as default. This will alter the results when using default parameters.
- Add support for plasmid and virus hallmarks. Requires geNomad database v1.1.
- Add CONJscan annotations to `_plasmid_summary.tsv`. Requires geNomad database v1.1.
- `Sequence` class: simplify `has_dtr` return statement.
- `Sequence` class: make `__repr__` more friendly for long sequences.
- `Sequence` class: rename the `id` property to `accession`.
- Amino acids are now written to `_provirus_aragorn.tsv`.
- Update the XGBoost model file to the `.ubj` format.
- Require `xgboost >=1.6`.
- The taxonomic lineage in `_taxonomy.tsv` and `_virus_summary.tsv` will use `Viruses` as the highest rank, instead of `root`.
- Change order of the columns in `_plasmid_summary.tsv` and `_virus_summary.tsv`.
- Explicitly set `fraction` to `0.5` in `taxopy.find_majority_vote`.
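
To show what a `fraction` threshold of `0.5` means for a majority vote over lineages, here is a pure-Python sketch. It is a simplified illustration and not taxopy's implementation. Lineages are assumed to be plain lists of names ordered from the highest rank downwards.

```python
from collections import Counter


def majority_vote(lineages, fraction=0.5):
    """Return the deepest lineage prefix shared by more than `fraction` of inputs."""
    consensus = []
    candidates = list(lineages)
    depth = 0
    while True:
        # Count the names found at the current depth among the remaining candidates.
        names = Counter(lin[depth] for lin in candidates if len(lin) > depth)
        if not names:
            break
        name, count = names.most_common(1)[0]
        # The name must be supported by strictly more than `fraction` of all inputs.
        if count <= fraction * len(lineages):
            break
        consensus.append(name)
        candidates = [lin for lin in candidates if len(lin) > depth and lin[depth] == name]
        depth += 1
    return consensus
```

With four input lineages and `fraction=0.5`, a name needs at least three supporting lineages to be kept, so the consensus stops at the last rank where that holds.
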
- tRNA coordinates are now 1-indexed.
- Write `summary_execution_info`.
- Fix a problem in `DatabaseDownloader.get_version` where only the major version was compared.
- First release.