Change Log

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

1.16.0 - 2024-10-14

Changed

  • Bug fix on embedding attribute setting.
  • Upgrade to transformers 4.45.2 and zensols.deeplearn 1.12.0.

1.15.1 - 2024-08-28

Added

  • A no-op implementation (NoOpWordEmbedModel) of WordEmbedModel. This is used in unit test cases that would otherwise download large models, which do not fit in GitHub's workflow actions environments.

1.15.0 - 2024-05-11

Removed

  • ClassifyModelFacade.feature_stash property override. Overriding this property should only be done in subclasses of ClassifyModelFacade.

Added

  • Word piece vectorizer for documents with added word piece embeddings.

Changed

  • The default for the word piece feature document parser/factory uses an in-memory cache instead of the file system. Persisting embeddings added to features and sentences is not currently implemented.
  • Add new RNN layer defaults for easier configuration.
  • Rename word_piece_* resource library configuration.

1.14.0 - 2024-04-14

Changed

  • Guard on cycles in botched dependency head trees when creating features.
  • Upgrade zensols.nlparse to 1.11.0.
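
The cycle guard noted for this release can be illustrated with a minimal sketch. It is not the library's implementation: the `heads` mapping (token index to head index, with a root token pointing at itself) and the function name `has_head_cycle` are assumptions made for illustration only.

```python
from typing import Dict


def has_head_cycle(heads: Dict[int, int]) -> bool:
    """Return True if following head pointers from any token never ends.

    ``heads`` maps a token index to the index of its head.  A root token
    is assumed to point to itself (a hypothetical convention for this
    sketch), and every head index is assumed to be a key of ``heads``.
    """
    for start in heads:
        seen = set()
        cur = start
        # walk up toward the root, remembering every token visited
        while heads[cur] != cur:
            if cur in seen:
                return True
            seen.add(cur)
            cur = heads[cur]
    return False


well_formed = {0: 1, 1: 1, 2: 1}  # token 1 is the root
botched = {0: 1, 1: 2, 2: 0}      # no root: 0 -> 1 -> 2 -> 0
print(has_head_cycle(well_formed), has_head_cycle(botched))
```

A check of this kind lets feature creation skip or repair a malformed parse rather than loop indefinitely.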

1.13.0 - 2024-03-07

Added

  • A CLI application for prediction using packaged models.

Changed

  • Upgrade to zensols.deeplearn 1.11.0 for updated model packaging, downloading and inferencing.

1.12.0 - 2024-02-27

Changed

  • Fix sizing of logits to the padded output of the sequence transformer when word piece tokens are truncated by the HuggingFace tokenizer.
  • Fix token level classification prediction dataframes created from results.
  • Large refactoring of word piece mapping in TokenizedDocument.
  • Default to non-padding model truncation in HuggingFace tokenizer.
  • Merged Feature{Sentence,Document}DataPoint into TokenContainerDataPoint.
  • Folded directories with single module into parent name:
    • zensols.deepnlp.batch.domain -> zensols.deepnlp.batch
    • zensols.deepnlp.cli.app -> zensols.deepnlp.cli
    • zensols.deepnlp.feature.stash -> zensols.deepnlp.feature
    • zensols.deepnlp.score.bertscore -> zensols.deepnlp.score
  • Fold in the zensols.nlparse TokenAnnotatedFeatureDocument class name typo fix.

1.11.1 - 2024-01-04

Changed

  • Fix fill-mask example after spaCy 3.6 upgrade.

Added

  • Add configurable HuggingFace tokenization parameters.

1.11.0 - 2023-12-05

Added

  • Support for Python 3.11.

Removed

  • Support for Python 3.9.

1.10.1 - 2023-08-25

Changed

  • Masked model bug fix.

1.10.0 - 2023-08-16

Downstream moderate risk update release.

Added

  • Add MaskFillPredictor and resource library.

Changed

  • Prevent glove weight archive from re-downloading on every access.

1.9.1 - 2023-06-29

Changed

  • Cleanup downloaded model resources after install.

1.9.0 - 2023-06-09

Changed

  • Transformer padding uses longest sentence by default.
  • Vectorizer model accessible in Latent Semantic Indexing component.
  • Bug fixes for WordEmbedModel caching, persisted naming and word piece document parser resource library.
  • Upgraded zensols.nlparse to 1.6.0.
  • Resource library file naming.
  • Upgraded zensols.deeplearn to 1.7.0.

1.8.0 - 2023-04-05

Changed

1.7.0 - 2023-02-02

Changed

1.6.0 - 2023-01-23

Added

  • Word piece API to map to non-word-piece tokens.
  • Add word piece embeddings.

1.5.0 - 2022-11-06

Added

  • Sentence BERT (sbert) resource library, with tests.
  • Add HuggingFace local download model files resource library defaults.

Changed

  • Switched additional columns from a tuple to a dictionary to solve ordering in DataframeDocumentFeatureStash.
  • Fix OneHotEncodedFeatureDocumentVectorizer for document use case.
  • Fix model ClassifyNetwork linear input size calculation so transformers (or models that do not use a terminal CRF layer) can add document level features.

1.4.1 - 2022-10-02

Changed

  • Transformer model fetch configuration.

1.4.0 - 2022-10-01

Added

  • Add a token embedding feature vectorizer.

Changed

  • Replace None shape component with -1 in EnumContainer vectorizer.
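
As a rough illustration of the shape change above, the sketch below swaps `None` components for `-1`. The function name `normalize_shape` is hypothetical and is not part of the library's API; it only shows the convention of using `-1` for a dimension whose size is not known ahead of time.

```python
from typing import Optional, Tuple


def normalize_shape(shape: Tuple[Optional[int], ...]) -> Tuple[int, ...]:
    """Replace each unknown (None) dimension with -1 (illustrative only)."""
    return tuple(-1 if dim is None else dim for dim in shape)


print(normalize_shape((None, 12)))
```

Keeping every component an integer means downstream code can treat shapes uniformly without special-casing `None`.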

1.3.0 - 2022-08-08

Dependent libraries update release.

Changed

  • Upgrade to torch 1.12.
  • Upgrade to spaCy 3.2.
  • Upgrade resource library with zensols.util changes.

1.2.0 - 2022-06-14

This is primarily a refactoring release to simplify the API.

Added

  • Resource library configuration taken from examples and made generic for reuse.
  • Resource library and example documentation.

Changed

  • Simplification of the API and examples.
  • Added option to tokenize only during encoding for transformer components.
  • Fixed transformer expander vectorizer bugs.
  • Fixed deallocation issues in test notebook.

Removed

  • Replaced example model configuration with --override option semantics.

1.1.2 - 2022-05-15

Changed

  • Fixed YML resource library configuration files not found.

1.1.1 - 2022-05-15

Changed

  • Retrofit resource library and examples with batch metadata changes from zensols.deeplearn.

1.1.0 - 2022-05-04

Added

  • A recurrent CRF and default classify facade to the resource library.
  • Tokenized transformer document truncation.
  • Token classification resource library.
  • More HuggingFace support, models and tests.
  • Facebook fastText embeddings.

Changed

  • Recurrent embedded CRF uses a new network settings factory method.
  • Update examples.
  • Pin zensols.nlp version dependency to minor (second component) release.
  • All deep NLP vectorizers inherit from TransformableFeatureVectorizer to simplify class hierarchy. This change now requires encode_transformed in respective vectorizer configurations.
  • Embedded Bi{LSTM,GRU,RNN}-CRF: use the recurcrf module decode rather than a re-implementation.
  • Change the default order of dropout and activation in all layers that use them, per the literature.

1.0.1 - 2022-02-12

Added

  • Runtime benchmarking.
  • Missing batch configuration in resource library from zensols.deeplearn.
  • Add observer pattern for logging and Pandas data frame / CSV output.

Changed

  • Word embedding model now compatible with gensim 4.

1.0.0 - 2022-01-25

Major stable release.

Added

  • DistilBERT pooler output.
  • The word2vec model is installed programmatically.
  • Clickbate example now also includes RoBERTa and DistilBERT.

Changed

  • Upgrade to transformers 4.12.5.
  • Fix duplicate word embeddings matrix copied to GPU, which saves space and time.
  • Other efficiencies such as log guards and data structure creation checks.
  • Notebook example fixes and cleanup.

Removed

  • PyTorch init call in nlp package init so the client can do it before other modules are loaded.

0.0.8 - 2021-10-22

Added

  • A factory method in zensols.deepnlp.WordEmbedModel to create a Gensim KeyedVectors instance to provide word vector operations for all embedding model types.
  • Make sub directory in text embedding models configurable.
  • Glove model automatically downloads embeddings if not present on the file system using zensols.install.

Changed

  • FeatureDocumentVectorizerManager.token_feature_ids defaults to its owned doc_parser's token features.
  • Pin the HuggingFace transformers dependency to a working version, as the new version breaks this release.
  • Fix glove embedding factory create functionality.

0.0.7 - 2021-09-22

Changed

  • Refactored downstream renaming of files from zensols.deeplearn.
  • Moved ClassificationPredictionMapper class to new classify module.

Added

  • Classification module and classes now fully implement text classification with RNN/LSTM/GRU network types or any HuggingFace transformer with pooler output. This means no coding is necessary for text classification, except for writing a data loader when the data is not in a supported format such as a Pandas dataframe (e.g. a CSV file).
  • Configuration resource library.
  • Clickbate corpus example and documentation.

0.0.6 - 2021-09-07

Changed

  • Revert to version 3.8.3 of gensim and support backward/forward compatibility.
  • Upgrade zensols libraries.
  • Documentation and clean up.

0.0.5 - 2021-08-07

Changed

  • Upgrade dependencies.

0.0.4 - 2021-08-07

Added

  • Sequence/token classification for BiLSTM+CRF and HuggingFace transformers. This has been tested with BERT/DistilBERT/RoBERTa and the large BERT models.
  • The HuggingFace transformers optimizer for AdamW and scheduler for functionality such as fine tuning warm up.
  • More NLP facade specific support such as easier embedding model access.
  • Better support for Jupyter notebook rapid prototyping and experimentation.
  • Jupyter integration tests in the movie review example.

Removed

  • Move feature containers and parser to zensols.nlparse, including test cases.
  • The dependency on bcolz as it is no longer maintained. The caching of binary word vectors was replaced with H5PY.

0.0.3 - 2021-04-30

Added

  • BERT/DistilBERT/RoBERTa transformer word piece tokenizer to linguistic token mapping.
  • Upgraded to gensim 4.0.1.
  • Upgraded to zensols.deeplearn 0.1.2, which is upgraded to use PyTorch 1.8.
  • Added simple vectorizer example.
  • Multiprocessing vectorization now supports GPU access via torch multiprocessing subsystem.

Changed

  • Refactored word embedding (sub) modules.
  • Moved BERT transformer embeddings to separate transformer module.
  • Refactored vectorizers to standardize around FeatureDocument rather than token collection instances.
  • Standardize vectorizer shapes.
  • Updated examples to use new vectorizer API and zensols.util application CLI.

0.0.2 - 2020-12-29

Maintenance release.

Changed

  • Upgraded dependencies and tested across Python 3.7, 3.8, 3.9.

0.0.1 - 2020-05-04

Added

  • Initial version.