All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
1.16.0 - 2024-10-14
- Bug fix on embedding attribute setting.
- Upgrade to `transformers` 4.45.2 and zensols.deeplearn 1.12.0.
1.15.1 - 2024-08-28
- A no-op implementation (`NoOpWordEmbedModel`) of `WordEmbedModel`. This is used in unit tests that would otherwise download large models that do not fit in GitHub's workflow actions environment.
1.15.0 - 2024-05-11
- `ClassifyModelFacade.feature_stash` property override. This property should only be overridden in subclasses of `ClassifyModelFacade` (see the sketch after this list).
- Word piece vectorizer for documents with added word piece embeddings.
- The default for the word piece feature document parser/factory uses an in-memory cache instead of the file system. Persisting embeddings added to features and sentences is not yet implemented.
- Add new RNN layer defaults for easier configuration.
- Rename `word_piece_*` resource library configuration.
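As a rough illustration of the `feature_stash` note above, the sketch below overrides the property in a hypothetical subclass; the import paths and the pass-through body are assumptions for illustration, not taken from the project's sources.

```python
# Minimal sketch (assumed imports and names): override feature_stash only in
# a ClassifyModelFacade subclass rather than patching the facade directly.
from zensols.persist import Stash
from zensols.deepnlp.classify import ClassifyModelFacade

class MyTaskFacade(ClassifyModelFacade):
    """Hypothetical facade for a downstream classification task."""

    @property
    def feature_stash(self) -> Stash:
        # delegate to the parent implementation, then wrap or filter the
        # stash however the downstream task needs
        return super().feature_stash
```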
1.14.0 - 2024-04-14
- Guard on cycles in botched dependency head trees when creating features.
- Upgrade zensols.nlparse to 1.11.0.
1.13.0 - 2024-03-07
- A CLI application for prediction using packaged models.
- Upgrade to zensols.deeplearn 1.11.0 for updated model packaging, downloading, and inference.
1.12.0 - 2024-02-27
- Fix sizing of logits to the padded output of the sequence transformer when word piece tokens are truncated by the HuggingFace tokenizer.
- Fix token level classification prediction dataframes created from results.
- Large refactoring of word piece mapping in `TokenizedDocument`.
- Default to non-padding model truncation in the HuggingFace tokenizer (see the sketch after this list).
- Merged `Feature{Sentence,Document}DataPoint` into `TokenContainerDataPoint`.
- Folded directories with a single module into the parent name:
  - `zensols.deepnlp.batch.domain` -> `zensols.deepnlp.batch`
  - `zensols.deepnlp.cli.app` -> `zensols.deepnlp.cli`
  - `zensols.deepnlp.feature.stash` -> `zensols.deepnlp.feature`
  - `zensols.deepnlp.score.bertscore` -> `zensols.deepnlp.score`
- Fold in the zensols.nlparse `TokenAnnotatedFeatureDocument` class name typo fix.
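The non-padding truncation default refers to standard HuggingFace tokenizer behavior; the sketch below only illustrates those tokenizer arguments with an arbitrary model, not the project's own wrapper.

```python
# Illustrative HuggingFace tokenizer call: truncate to the model maximum
# length without padding (the model name here is arbitrary).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
batch = tokenizer(
    ['a long sentence that might exceed the model maximum length'],
    truncation=True,   # cut word pieces at the model's max length
    padding=False)     # do not pad to a fixed length
print(len(batch['input_ids'][0]))
```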
1.11.1 - 2024-01-04
- Fix fill-mask example after spaCy 3.6 upgrade.
- Add configurable HuggingFace tokenization parameters.
1.11.0 - 2023-12-05
- Upgraded to HuggingFace Transformers 4.35, zensols.deeplearn 1.9, and spaCy 3.6.
- Support for Python 3.11.
- Support for Python 3.9.
1.10.1 - 2023-08-25
- Masked model bug fix.
1.10.0 - 2023-08-16
Downstream moderate risk update release.
- Add `MaskFillPredictor` and its resource library (see the sketch after this list).
- Prevent the GloVe weight archive from re-downloading on every access.
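`MaskFillPredictor` wraps masked-token prediction; the sketch below shows the underlying fill-mask task with the plain HuggingFace `pipeline` API rather than the predictor's own interface, and the model name is arbitrary.

```python
# Illustrative fill-mask prediction with HuggingFace transformers.
from transformers import pipeline

fill = pipeline('fill-mask', model='bert-base-uncased')
for pred in fill('Paris is the [MASK] of France.'):
    # each prediction carries the filled token string and its score
    print(pred['token_str'], round(pred['score'], 4))
```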
1.9.1 - 2023-06-29
- Cleanup downloaded model resources after install.
1.9.0 - 2023-06-09
- Added a BERTScore scoring method to the zensols.nlparse scoring API (see the sketch after this list).
- Upgraded zensols.nlparse to 1.7.0.
- Transformer padding uses longest sentence by default.
- Vectorizer model accessible in Latent Semantic Indexing component.
- Bug fixes for `WordEmbedModel` caching, persisted naming, and the word piece document parser resource library.
- Upgraded zensols.nlparse to 1.6.0.
- Resource library file naming.
- Upgraded zensols.deeplearn to 1.7.0.
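The BERTScore entry above refers to the metric itself; a minimal sketch with the `bert-score` package follows, assuming that package rather than the zensols.nlparse scoring API, whose exact calls are not shown here.

```python
# Illustrative BERTScore computation with the bert-score package.
from bert_score import score

P, R, F1 = score(
    cands=['the model produced this sentence'],
    refs=['the reference sentence'],
    lang='en')
print(float(F1.mean()))
```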
1.8.0 - 2023-04-05
- Upgraded zensols.nlparse to 1.6.0.
- Bug fixes in word piece document API.
1.7.0 - 2023-02-02
- Upgraded zensols.util to 1.13.0.
1.6.0 - 2023-01-23
- Word piece API to map to non-word-piece tokens.
- Add word piece embeddings.
1.5.0 - 2022-11-06
- Sentence BERT (sbert) resource library added and tested.
- Add resource library defaults for HuggingFace local model file downloads.
- Switched additional columns from a tuple to a dictionary to solve ordering in `DataframeDocumentFeatureStash`.
- Fix `OneHotEncodedFeatureDocumentVectorizer` for the document use case.
- Fix the `ClassifyNetwork` model's linear input size calculation so transformers (or models that do not use a terminal CRF layer) can add document level features.
1.4.1 - 2022-10-02
- Transformer model fetch configuration.
1.4.0 - 2022-10-01
- Add a token embedding feature vectorizer.
- Replace the `None` shape component with -1 in the `EnumContainer` vectorizer.
1.3.0 - 2022-08-08
- Update dependent library releases.
- Upgrade to torch 1.12.
- Upgrade to spaCy 3.2.
- Upgrade the resource library with `zensols.util` changes.
1.2.0 - 2022-06-14
This is primarily a refactoring release to simplify the API.
- Resource library configuration taken from examples and made generic for reuse.
- Resource library and example documentation.
- Simplification of the API and examples.
- Added option to tokenize only during encoding for transformer components.
- Fixed transformer expander vectorizer bugs.
- Fixed deallocation issues in test notebook.
- Replaced example model configuration with `--override` option semantics.
1.1.2 - 2022-05-15
- Fixed YML resource library configuration files not being found.
1.1.1 - 2022-05-15
- Retrofit resource library and examples with batch metadata changes from zensols.deeplearn.
1.1.0 - 2022-05-04
- Add a recurrent CRF and a default classify facade to the resource library.
- Tokenized transformer document truncation.
- Token classification resource library.
- More HuggingFace support, models, and tests.
- Facebook fastText embeddings.
- Recurrent embedded CRF uses a new network settings factory method.
- Update examples.
- Pin the `zensols.nlp` version dependency to the minor (second component) release.
- All deep NLP vectorizers inherit from `TransformableFeatureVectorizer` to simplify the class hierarchy. This change now requires `encode_transformed` in the respective vectorizer configurations.
- Embedded Bi{LSTM,GRU,RNN}-CRF: utilize the `recurcrf` module's decode over a re-implementation.
- Change the default dropout/activation order in all layers (that use them) per the literature.
1.0.1 - 2022-02-12
- Runtime benchmarking.
- Missing batch configuration in resource library from zensols.deeplearn.
- Add observer pattern for logging and Pandas data frame / CSV output.
- Word embedding model now compatible with gensim 4.
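The gensim 4 compatibility work mostly concerns renamed `KeyedVectors` attributes; the sketch below shows one such rename as a hedged example, not the project's actual compatibility code.

```python
# Illustrative gensim 3 -> 4 compatibility shim: the vocab attribute was
# renamed to key_to_index on KeyedVectors in gensim 4.
from gensim.models import KeyedVectors

def vocab_keys(kv: KeyedVectors):
    if hasattr(kv, 'key_to_index'):
        return kv.key_to_index         # gensim >= 4
    return kv.vocab                    # gensim 3.x
```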
1.0.0 - 2022-01-25
Major stable release.
- DistilBERT pooler output.
- The `word2vec` model is installed programmatically.
- Clickbate example now also includes RoBERTa and DistilBERT.
- Upgrade to transformers 4.12.5.
- Fix duplicate word embeddings matrix copied to GPU, which saves space and time.
- Other efficiencies such as log guards and data structure creation checks.
- Notebook example fixes and cleanup.
- PyTorch init call in nlp package init so the client can do it before other modules are loaded.
0.0.8 - 2021-10-22
- A factory method in `zensols.deepnlp.WordEmbedModel` to create a Gensim `KeyedVectors` instance to provide word vector operations for all embedding model types.
- Make the sub directory for text embedding models configurable.
- GloVe model automatically downloads embeddings if not present on the file system, using `zensols.install`.
- `FeatureDocumentVectorizerManager.token_feature_ids` defaults to its owned `doc_parser`'s token features.
- Pin dependencies to a working HuggingFace transformers version, as the newer version breaks this release.
- Fix glove embedding factory create functionality.
0.0.7 - 2021-09-22
- Refactored downstream renaming of files from zensols.deeplearn.
- Moved the `ClassificationPredictionMapper` class to the new `classify` module.
- Classification module and classes now fully implement text classification with RNN/LSTM/GRU network types or any HuggingFace transformer with pooler output. This means no coding is necessary for text classification, with the exception of writing a data loader if the data is not in a supported format such as a Pandas dataframe (i.e. a CSV file).
- Configuration resource library.
- Clickbate corpus example and documentation.
0.0.6 - 2021-09-07
- Revert to version 3.8.3 of gensim and support backward/forward compatibility.
- Upgrade zensols libraries.
- Documentation and clean up.
0.0.5 - 2021-08-07
- Upgrade dependencies.
0.0.4 - 2021-08-07
- Sequence/token classification for BiLSTM+CRF and HuggingFace transformers. This has been tested with BERT/DistilBERT/RoBERTa and the large BERT models.
- The HuggingFace transformers `AdamW` optimizer and scheduler for functionality such as fine tuning warm up (see the sketch after this list).
- More NLP facade specific support, such as easier embedding model access.
- Better support for Jupyter notebook rapid prototyping and experimentation.
- Jupyter integration tests in review movie example.
- Upgrade to spaCy 3 via the zensols.nlparse dependency.
- Move feature containers and parser to zensols.nlparse, including test cases.
- Removed the dependency on bcolz as it is no longer maintained. The caching of binary word vectors was replaced with H5PY.
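A minimal sketch of the HuggingFace optimizer and warm up scheduler referenced in the `AdamW` item above; the model and the step counts are placeholders, not the project's defaults, and note that `AdamW` has since been deprecated in newer transformers releases.

```python
# Illustrative fine tuning warm up with the HuggingFace optimizer/scheduler;
# the model and step counts below are placeholders.
import torch
from transformers import AdamW, get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 2)  # stand-in for a transformer model
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000)
```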
0.0.3 - 2021-04-30
- BERT/DistilBERT/RoBERTa transformer word piece tokenizer to linguistic token mapping.
- Upgraded to gensim 4.0.1.
- Upgraded to zensols.deeplearn 0.1.2, which was upgraded to use PyTorch 1.8.
- Added simple vectorizer example.
- Multiprocessing vectorization now supports GPU access via torch multiprocessing subsystem.
- Refactored word embedding (sub) modules.
- Moved BERT transformer embeddings to a separate `transformer` module.
- Refactored vectorizers to standardize around `FeatureDocument` rather than token collection instances.
- Standardize vectorizer shapes.
- Updated examples to use new vectorizer API and zensols.util application CLI.
0.0.2 - 2020-12-29
Maintenance release.
- Upgraded dependencies and tested across Python 3.7, 3.8, 3.9.
0.0.1 - 2020-05-04
- Initial version.