All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
1.16.0 - 2024-10-14
- Bug fix on embedding attribute setting.
- Upgrade to `transformers` 4.45.2 and zensols.deeplearn 1.12.0.
1.15.1 - 2024-08-28
- A no-op implementation (`NoOpWordEmbedModel`) of `WordEmbedModel`. This is used in unit tests that would otherwise download large models that do not fit in GitHub's workflow actions environment.
1.15.0 - 2024-05-11
- `ClassifyModelFacade.feature_stash` property override. This property should only be overridden in subclasses of `ClassifyModelFacade` (see the sketch after this list).
- Word piece vectorizer for documents with added word piece embeddings.
- The default for the word piece feature document parser/factory uses an in-memory cache instead of the file system. Persisting embeddings added to features and sentences is not yet implemented.
- Add new RNN layer defaults for easier configuration.
- Rename `word_piece_*` resource library configuration.
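As a rough illustration of the `feature_stash` note above, the sketch below overrides the property in a hypothetical subclass; the import paths and the pass-through body are assumptions for illustration, not taken from the project's sources.

```python
# Minimal sketch (assumed imports and names): override feature_stash only in
# a ClassifyModelFacade subclass rather than patching the facade directly.
from zensols.persist import Stash
from zensols.deepnlp.classify import ClassifyModelFacade

class MyTaskFacade(ClassifyModelFacade):
    """Hypothetical facade for a downstream classification task."""

    @property
    def feature_stash(self) -> Stash:
        # delegate to the parent implementation, then wrap or filter the
        # stash however the downstream task needs
        return super().feature_stash
```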
1.14.0 - 2024-04-14
- Guard on cycles in botched dependency head trees when creating features.
- Upgrade zensols.nlparse to 1.11.0.
1.13.0 - 2024-03-07
- A CLI application for prediction using packaged models.
- Upgrade to zensols.deeplearn 1.11.0 for updated model packaging, downloading, and inference.
1.12.0 - 2024-02-27
- Fix sizing of logits to the padded output of the sequence transformer when word piece tokens are truncated by the HuggingFace tokenizer.
- Fix token level classification prediction dataframes created from results.
- Large refactoring of word piece mapping in `TokenizedDocument`.
- Default to non-padding model truncation in the HuggingFace tokenizer (see the sketch after this list).
- Merged `Feature{Sentence,Document}DataPoint` into `TokenContainerDataPoint`.
- Folded directories with a single module into the parent name:
  - `zensols.deepnlp.batch.domain` -> `zensols.deepnlp.batch`
  - `zensols.deepnlp.cli.app` -> `zensols.deepnlp.cli`
  - `zensols.deepnlp.feature.stash` -> `zensols.deepnlp.feature`
  - `zensols.deepnlp.score.bertscore` -> `zensols.deepnlp.score`
- Fold in the zensols.nlparse `TokenAnnotatedFeatureDocument` class name typo fix.
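The non-padding truncation default refers to standard HuggingFace tokenizer behavior; the sketch below only illustrates those tokenizer arguments with an arbitrary model, not the project's own wrapper.

```python
# Illustrative HuggingFace tokenizer call: truncate to the model maximum
# length without padding (the model name here is arbitrary).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
batch = tokenizer(
    ['a long sentence that might exceed the model maximum length'],
    truncation=True,   # cut word pieces at the model's max length
    padding=False)     # do not pad to a fixed length
print(len(batch['input_ids'][0]))
```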
1.11.1 - 2024-01-04
- Fix fill-mask example after spaCy 3.6 upgrade.
- Add configurable HuggingFace tokenization parameters.
1.11.0 - 2023-12-05
- Upgraded to HuggingFace Transformers 4.35, zensols.deeplearn 1.9, and spaCy 3.6.
- Support for Python 3.11.
- Support for Python 3.9.
1.10.1 - 2023-08-25
- Masked model bug fix.
1.10.0 - 2023-08-16
Downstream moderate risk update release.
- Add `MaskFillPredictor` and its resource library (see the sketch after this list).
- Prevent the GloVe weight archive from re-downloading on every access.
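`MaskFillPredictor` wraps masked-token prediction; the sketch below shows the underlying fill-mask task with the plain HuggingFace `pipeline` API rather than the predictor's own interface, and the model name is arbitrary.

```python
# Illustrative fill-mask prediction with HuggingFace transformers.
from transformers import pipeline

fill = pipeline('fill-mask', model='bert-base-uncased')
for pred in fill('Paris is the [MASK] of France.'):
    # each prediction carries the filled token string and its score
    print(pred['token_str'], round(pred['score'], 4))
```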
1.9.1 - 2023-06-29
- Cleanup downloaded model resources after install.
1.9.0 - 2023-06-09
- Added a BERTScore scoring method to the zensols.nlparse scoring API (see the sketch after this list).
- Upgraded zensols.nlparse to 1.7.0.
- Transformer padding uses longest sentence by default.
- Vectorizer model accessible in Latent Semantic Indexing component.
- Bug fixes for `WordEmbedModel` caching, persisted naming, and the word piece document parser resource library.
- Upgraded zensols.nlparse to 1.6.0.
- Resource library file naming.
- Upgraded zensols.deeplearn to 1.7.0.
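The BERTScore entry above refers to the metric itself; a minimal sketch with the `bert-score` package follows, assuming that package rather than the zensols.nlparse scoring API, whose exact calls are not shown here.

```python
# Illustrative BERTScore computation with the bert-score package.
from bert_score import score

P, R, F1 = score(
    cands=['the model produced this sentence'],
    refs=['the reference sentence'],
    lang='en')
print(float(F1.mean()))
```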
1.8.0 - 2023-04-05
- Upgraded zensols.nlparse to 1.6.0.
- Bug fixes in word piece document API.
1.7.0 - 2023-02-02
- Upgraded zensols.util to 1.13.0.
1.6.0 - 2023-01-23
- Word piece API to map to non-word-piece tokens.
- Add word piece embeddings.
1.5.0 - 2022-11-06
- Sentence BERT (sbert) resource library added and tested.
- Add resource library defaults for HuggingFace local model file downloads.
- Switched additional columns from a tuple to a dictionary to solve ordering in `DataframeDocumentFeatureStash`.
- Fix `OneHotEncodedFeatureDocumentVectorizer` for the document use case.
- Fix the `ClassifyNetwork` model's linear input size calculation so transformers (or models that do not use a terminal CRF layer) can add document level features.
1.4.1 - 2022-10-02
- Transformer model fetch configuration.
1.4.0 - 2022-10-01
- Add a token embedding feature vectorizer.
- Replace the `None` shape component with -1 in the `EnumContainer` vectorizer.
1.3.0 - 2022-08-08
- Update dependent library releases.
- Upgrade to torch 1.12.
- Upgrade to spaCy 3.2.
- Upgrade the resource library with `zensols.util` changes.
1.2.0 - 2022-06-14
This is primarily a refactoring release to simplify the API.
- Resource library configuration taken from examples and made generic for reuse.
- Resource library and example documentation.
- Simplification of the API and examples.
- Added option to tokenize only during encoding for transformer components.
- Fixed transformer expander vectorizer bugs.
- Fixed deallocation issues in test notebook.
- Replaced example model configuration with `--override` option semantics.
1.1.2 - 2022-05-15
- Fixed YML resource library configuration files not being found.
1.1.1 - 2022-05-15
- Retrofit resource library and examples with batch metadata changes from zensols.deeplearn.
1.1.0 - 2022-05-04
- Add a recurrent CRF and a default classify facade to the resource library.
- Tokenized transformer document truncation.
- Token classification resource library.
- More HuggingFace support, models, and tests.
- Facebook fastText embeddings.
- Recurrent embedded CRF uses a new network settings factory method.
- Update examples.
- Pin the `zensols.nlp` version dependency to the minor (second component) release.
- All deep NLP vectorizers inherit from `TransformableFeatureVectorizer` to simplify the class hierarchy. This change now requires `encode_transformed` in the respective vectorizer configurations.
- Embedded Bi{LSTM,GRU,RNN}-CRF: utilize the `recurcrf` module's decode over a re-implementation.
- Change the default dropout/activation order in all layers (that use them) per the literature.
1.0.1 - 2022-02-12
- Runtime benchmarking.
- Missing batch configuration in resource library from zensols.deeplearn.
- Add observer pattern for logging and Pandas data frame / CSV output.
- Word embedding model now compatible with gensim 4.
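The gensim 4 compatibility work mostly concerns renamed `KeyedVectors` attributes; the sketch below shows one such rename as a hedged example, not the project's actual compatibility code.

```python
# Illustrative gensim 3 -> 4 compatibility shim: the vocab attribute was
# renamed to key_to_index on KeyedVectors in gensim 4.
from gensim.models import KeyedVectors

def vocab_keys(kv: KeyedVectors):
    if hasattr(kv, 'key_to_index'):
        return kv.key_to_index         # gensim >= 4
    return kv.vocab                    # gensim 3.x
```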
1.0.0 - 2022-01-25
Major stable release.
- DistilBERT pooler output.
- The `word2vec` model is installed programmatically.
- Clickbate example now also includes RoBERTa and DistilBERT.
- Upgrade to transformers 4.12.5.
- Fix duplicate word embeddings matrix copied to GPU, which saves space and time.
- Other efficiencies such as log guards and data structure creation checks.
- Notebook example fixes and cleanup.
- PyTorch init call in nlp package init so the client can do it before other modules are loaded.
0.0.8 - 2021-10-22
- A factory method in `zensols.deepnlp.WordEmbedModel` to create a Gensim `KeyedVectors` instance to provide word vector operations for all embedding model types.
- Make the sub directory for text embedding models configurable.
- GloVe model automatically downloads embeddings if not present on the file system, using `zensols.install`.
- `FeatureDocumentVectorizerManager.token_feature_ids` defaults to its owned `doc_parser`'s token features.
- Pin dependencies to a working HuggingFace transformers version, as the newer version breaks this release.
- Fix glove embedding factory create functionality.
0.0.7 - 2021-09-22
- Refactored downstream renaming of files from zensols.deeplearn.
- Moved the `ClassificationPredictionMapper` class to the new `classify` module.
- Classification module and classes now fully implement text classification with RNN/LSTM/GRU network types or any HuggingFace transformer with pooler output. This means no coding is necessary for text classification, with the exception of writing a data loader if the data is not in a supported format such as a Pandas dataframe (i.e. a CSV file).
- Configuration resource library.
- Clickbate corpus example and documentation.
0.0.6 - 2021-09-07
- Revert to version 3.8.3 of gensim and support backward/forward compatibility.
- Upgrade zensols libraries.
- Documentation and clean up.
0.0.5 - 2021-08-07
- Upgrade dependencies.
0.0.4 - 2021-08-07
- Sequence/token classification for BiLSTM+CRF and HuggingFace transformers. This has been tested with BERT/DistilBERT/RoBERTa and the large BERT models.
- The HuggingFace transformers `AdamW` optimizer and scheduler for functionality such as fine tuning warm up (see the sketch after this list).
- More NLP facade specific support, such as easier embedding model access.
- Better support for Jupyter notebook rapid prototyping and experimentation.
- Jupyter integration tests in review movie example.
- Upgrade to spaCy 3 via the zensols.nlparse dependency.
- Move feature containers and parser to zensols.nlparse, including test cases.
- Removed the dependency on bcolz as it is no longer maintained. The caching of binary word vectors was replaced with H5PY.
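A minimal sketch of the HuggingFace optimizer and warm up scheduler referenced in the `AdamW` item above; the model and the step counts are placeholders, not the project's defaults, and note that `AdamW` has since been deprecated in newer transformers releases.

```python
# Illustrative fine tuning warm up with the HuggingFace optimizer/scheduler;
# the model and step counts below are placeholders.
import torch
from transformers import AdamW, get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 2)  # stand-in for a transformer model
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000)
```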
0.0.3 - 2021-04-30
- BERT/DistilBERT/RoBERTa transformer word piece tokenizer to linguistic token mapping.
- Upgraded to gensim 4.0.1.
- Upgraded to zensols.deeplearn 0.1.2, which was upgraded to use PyTorch 1.8.
- Added simple vectorizer example.
- Multiprocessing vectorization now supports GPU access via torch multiprocessing subsystem.
- Refactored word embedding (sub) modules.
- Moved BERT transformer embeddings to a separate `transformer` module.
- Refactored vectorizers to standardize around `FeatureDocument` rather than token collection instances.
- Standardize vectorizer shapes.
- Updated examples to use new vectorizer API and zensols.util application CLI.
0.0.2 - 2020-12-29
Maintenance release.
- Upgraded dependencies and tested across Python 3.7, 3.8, 3.9.
0.0.1 - 2020-05-04
- Initial version.