new doc push

plandes · Aug 29, 2024 · 0c0fff6
commit 0c0fff6
Showing 113 changed files with 46,932 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: f3af96c144ccebb77ecf7b5b3554227f
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
diff --git a/CHANGELOG.html b/CHANGELOG.html
diff --git a/CONTRIBUTING.html b/CONTRIBUTING.html
diff --git a/LICENSE.html b/LICENSE.html
diff --git a/_modules/index.html b/_modules/index.html
diff --git a/_modules/zensols/deepnlp.html b/_modules/zensols/deepnlp.html
diff --git a/_modules/zensols/deepnlp/classify/domain.html b/_modules/zensols/deepnlp/classify/domain.html
diff --git a/_modules/zensols/deepnlp/classify/facade.html b/_modules/zensols/deepnlp/classify/facade.html
diff --git a/_modules/zensols/deepnlp/classify/model.html b/_modules/zensols/deepnlp/classify/model.html
diff --git a/_modules/zensols/deepnlp/classify/pred.html b/_modules/zensols/deepnlp/classify/pred.html
diff --git a/_modules/zensols/deepnlp/cli.html b/_modules/zensols/deepnlp/cli.html
diff --git a/_modules/zensols/deepnlp/embed/doc.html b/_modules/zensols/deepnlp/embed/doc.html
diff --git a/_modules/zensols/deepnlp/embed/domain.html b/_modules/zensols/deepnlp/embed/domain.html
diff --git a/_modules/zensols/deepnlp/embed/fasttext.html b/_modules/zensols/deepnlp/embed/fasttext.html
diff --git a/_modules/zensols/deepnlp/embed/glove.html b/_modules/zensols/deepnlp/embed/glove.html
diff --git a/_modules/zensols/deepnlp/embed/word2vec.html b/_modules/zensols/deepnlp/embed/word2vec.html
diff --git a/_modules/zensols/deepnlp/embed/wordtext.html b/_modules/zensols/deepnlp/embed/wordtext.html
diff --git a/_modules/zensols/deepnlp/feature.html b/_modules/zensols/deepnlp/feature.html
diff --git a/_modules/zensols/deepnlp/index/domain.html b/_modules/zensols/deepnlp/index/domain.html
diff --git a/_modules/zensols/deepnlp/index/lda.html b/_modules/zensols/deepnlp/index/lda.html
diff --git a/_modules/zensols/deepnlp/index/lsi.html b/_modules/zensols/deepnlp/index/lsi.html
diff --git a/_modules/zensols/deepnlp/layer/conv.html b/_modules/zensols/deepnlp/layer/conv.html
diff --git a/_modules/zensols/deepnlp/layer/embed.html b/_modules/zensols/deepnlp/layer/embed.html
diff --git a/_modules/zensols/deepnlp/layer/embrecurcrf.html b/_modules/zensols/deepnlp/layer/embrecurcrf.html
diff --git a/_modules/zensols/deepnlp/layer/wordvec.html b/_modules/zensols/deepnlp/layer/wordvec.html
diff --git a/_modules/zensols/deepnlp/model/facade.html b/_modules/zensols/deepnlp/model/facade.html
diff --git a/_modules/zensols/deepnlp/model/sequence.html b/_modules/zensols/deepnlp/model/sequence.html
diff --git a/_modules/zensols/deepnlp/score.html b/_modules/zensols/deepnlp/score.html
diff --git a/_modules/zensols/deepnlp/transformer.html b/_modules/zensols/deepnlp/transformer.html
diff --git a/_modules/zensols/deepnlp/transformer/domain.html b/_modules/zensols/deepnlp/transformer/domain.html
diff --git a/_modules/zensols/deepnlp/transformer/embed.html b/_modules/zensols/deepnlp/transformer/embed.html
diff --git a/_modules/zensols/deepnlp/transformer/layer.html b/_modules/zensols/deepnlp/transformer/layer.html
diff --git a/_modules/zensols/deepnlp/transformer/mask.html b/_modules/zensols/deepnlp/transformer/mask.html
diff --git a/_modules/zensols/deepnlp/transformer/optimizer.html b/_modules/zensols/deepnlp/transformer/optimizer.html
diff --git a/_modules/zensols/deepnlp/transformer/pred.html b/_modules/zensols/deepnlp/transformer/pred.html
diff --git a/_modules/zensols/deepnlp/transformer/resource.html b/_modules/zensols/deepnlp/transformer/resource.html
diff --git a/_modules/zensols/deepnlp/transformer/tokenizer.html b/_modules/zensols/deepnlp/transformer/tokenizer.html
diff --git a/_modules/zensols/deepnlp/transformer/vectorizers.html b/_modules/zensols/deepnlp/transformer/vectorizers.html
diff --git a/_modules/zensols/deepnlp/transformer/wordpiece.html b/_modules/zensols/deepnlp/transformer/wordpiece.html
diff --git a/_modules/zensols/deepnlp/vectorize/embed.html b/_modules/zensols/deepnlp/vectorize/embed.html
diff --git a/_modules/zensols/deepnlp/vectorize/manager.html b/_modules/zensols/deepnlp/vectorize/manager.html
diff --git a/_modules/zensols/deepnlp/vectorize/spacy.html b/_modules/zensols/deepnlp/vectorize/spacy.html
diff --git a/_modules/zensols/deepnlp/vectorize/vectorizers.html b/_modules/zensols/deepnlp/vectorize/vectorizers.html
diff --git a/_sources/CHANGELOG.md.txt b/_sources/CHANGELOG.md.txt
@@ -0,0 +1,390 @@
+# Change Log
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/)
+and this project adheres to [Semantic Versioning](http://semver.org/).
+
+
+## [Unreleased]
+
+
+## [1.15.1] - 2024-08-28
+### Added
+- A no-op implementation (`NoOpWordEmbedModel`) of `WordEmbedModel`.
+  It is used in unit test cases that download large models, which do not fit
+  in GitHub's workflow actions environments.
+
+
+## [1.15.0] - 2024-05-11
+### Removed
+- `ClassifyModelFacade.feature_stash` property override.  This property
+  should only be overridden in subclasses of `ClassifyModelFacade`.
+
+### Added
+- Word piece vectorizer for documents with added word piece embeddings.
+
+### Changed
+- The default for the word piece feature document parser/factory uses an
+  in-memory cache instead of the file system.  Persisting embeddings added to
+  features and sentences is not currently implemented.
+- Add new RNN layer defaults for easier configuration.
+- Rename `word_piece_*` resource library configuration.
+
+
+## [1.14.0] - 2024-04-14
+### Changed
+- Guard on cycles in botched dependency head trees when creating features.
+- Upgrade [zensols.nlparse] to 1.11.0.
+
+
+## [1.13.0] - 2024-03-07
+### Added
+- A CLI application for prediction using packaged models.
+
+### Changed
+- Upgrade [zensols.deeplearn] to 1.11.0 for updated model packaging,
+  downloading and inferencing.
+
+
+## [1.12.0] - 2024-02-27
+### Changed
+- Fix sizing of logits to padded output for sequence transformer for truncated
+  word piece tokens limited by the HuggingFace tokenizer.
+- Fix token level classification prediction dataframes created from results.
+- Large refactoring of word piece mapping in `TokenizedDocument`.
+- Default to non-padding model truncation in HuggingFace tokenizer.
+- Merged `Feature{Sentence,Document}DataPoint` into `TokenContainerDataPoint`.
+- Folded directories with single module into parent name:
+  - `zensols.deepnlp.batch.domain` -> `zensols.deepnlp.batch`
+  - `zensols.deepnlp.cli.app` -> `zensols.deepnlp.cli`
+  - `zensols.deepnlp.feature.stash` -> `zensols.deepnlp.feature`
+  - `zensols.deepnlp.score.bertscore` -> `zensols.deepnlp.score`
+- Fold in the [zensols.nlparse] `TokenAnnotatedFeatureDocument` class name typo fix.
+
+
+## [1.11.1] - 2024-01-04
+### Changed
+- Fix fill-mask example after spaCy 3.6 upgrade.
+
+### Added
+- Add configurable HuggingFace tokenization parameters.
+
+
+## [1.11.0] - 2023-12-05
+### Changed
+- Upgraded to [HuggingFace Transformers] 4.35, [zensols.deeplearn] 1.9,
+  [spaCy] 3.6.
+
+### Added
+- Support for Python 3.11.
+
+### Removed
+- Support for Python 3.9.
+
+
+## [1.10.1] - 2023-08-25
+### Changed
+- Masked model bug fix.
+
+
+## [1.10.0] - 2023-08-16
+Downstream moderate risk update release.
+
+### Added
+- Add `MaskFillPredictor` and resource library.
+
+### Changed
+- Prevent glove weight archive from re-downloading on every access.
+
+
+## [1.9.1] - 2023-06-29
+### Changed
+- Cleanup downloaded model resources after install.
+
+
+## [1.9.0] - 2023-06-09
+### Added
+- Added BERTScore scoring method to [zensols.nlparse] scoring API.
+- Upgraded [zensols.nlparse] to 1.7.0.
+
+### Changed
+- Transformer padding uses longest sentence by default.
+- Vectorizer model accessible in Latent Semantic Indexing component.
+- Bug fixes for `WordEmbedModel` caching, persisted naming and word piece
+  document parser resource library.
+- Upgraded [zensols.nlparse] to 1.6.0.
+- Resource library file naming.
+- Upgraded [zensols.deeplearn] to 1.7.0.
+
+
+## [1.8.0] - 2023-04-05
+### Changed
+- Upgraded [zensols.nlparse] to 1.6.0.
+- Bug fixes in word piece document API.
+
+
+## [1.7.0] - 2023-02-02
+### Changed
+- Upgraded [zensols.util] to 1.13.0.
+
+
+## [1.6.0] - 2023-01-23
+### Added
+- Word piece API to map to non-word-piece tokens.
+- Add word piece embeddings.
+
+
+## [1.5.0] - 2022-11-06
+### Added
+- Sentence BERT (sbert) resource library (tested).
+- Add HuggingFace local download model files resource library defaults.
+
+### Changed
+- Switched additional columns from a tuple to a dictionary to solve ordering in
+  `DataframeDocumentFeatureStash`.
+- Fix `OneHotEncodedFeatureDocumentVectorizer` for document use case.
+- Fix model `ClassifyNetwork` linear input size calculation so transformers (or
+  models that do not use a terminal CRF layer) can add document level features.
+
+
+## [1.4.1] - 2022-10-02
+### Changed
+- Transformer model fetch configuration.
+
+
+## [1.4.0] - 2022-10-01
+### Added
+- Add a token embedding feature vectorizer.
+
+### Changed
+- Replace `None` shape component with -1 in `EnumContainer` vectorizer.
+
+
+## [1.3.0] - 2022-08-08
+Dependent libraries update release.
+
+### Changed
+- Upgrade to torch 1.12.
+- Upgraded to spaCy 3.2.
+- Upgrade resource library with `zensols.util` changes.
+
+
+## [1.2.0] - 2022-06-14
+This is primarily a refactoring release to simplify the API.
+
+### Added
+- Resource library configuration taken from examples and made generic for
+  reuse.
+- Resource library and example documentation.
+
+### Changed
+- Simplification of the API and examples.
+- Added option to tokenize only during encoding for transformer components.
+- Fixed transformer expander vectorizer bugs.
+- Fixed deallocation issues in test notebook.
+
+### Removed
+- Replaced example model configuration with `--override` option semantics.
+
+
+## [1.1.2] - 2022-05-15
+### Changed
+- Fixed YML resource library configuration files not found.
+
+
+## [1.1.1] - 2022-05-15
+### Changed
+- Retrofit resource library and examples with batch metadata changes from
+  [zensols.deeplearn].
+
+
+## [1.1.0] - 2022-05-04
+### Added
+- A recurrent CRF and default classify facade to the resource library.
+- Tokenized transformer document truncation.
+- Token classification resource library.
+- More huggingface support, models and tests.
+- Facebook fastText embeddings.
+
+### Changed
+- Recurrent embedded CRF uses a new network settings factory method.
+- Update examples.
+- Pin `zensols.nlp` version dependency to minor (second component) release.
+- All deep NLP vectorizers inherit from `TransformableFeatureVectorizer` to
+  simplify class hierarchy.  This change now requires `encode_transformed` in
+  respective vectorizer configurations.
+- Embedded Bi{LSTM,GRU,RNN}-CRF: utilize `recurcrf` module decode over
+  re-implementation.
+- Change the default order of dropout and activation in all layers that use
+  them, per the literature.
+
+
+## [1.0.1] - 2022-02-12
+### Added
+- Runtime benchmarking.
+- Missing batch configuration in resource library from [zensols.deeplearn].
+- Add observer pattern for logging and Pandas data frame / CSV output.
+
+### Changed
+- Word embedding model now compatible with gensim 4.
+
+
+## [1.0.0] - 2022-01-25
+Major stable release.
+
+### Added
+- DistilBERT pooler output.
+- The `word2vec` model is installed programmatically.
+- Clickbate example now also includes RoBERTa and DistilBERT.
+
+### Changed
+- Upgrade to transformers 4.12.5.
+- Fix duplicate word embeddings matrix copied to GPU, which saves space and
+  time.
+- Other efficiencies such as log guards and data structure creation checks.
+- Notebook example fixes and cleanup.
+
+### Removed
+- PyTorch init call in nlp package init so the client can do it before other
+  modules are loaded.
+
+
+## [0.0.8] - 2021-10-22
+### Added
+- A factory method in `zensols.deepnlp.WordEmbedModel` to create a Gensim
+  `KeyedVectors` instance to provide word vector operations for all embedding
+  model types.
+- Make sub directory in text embedding models configurable.
+- Glove model automatically downloads embeddings if not present on the file
+  system using `zensols.install`.
+
+### Changed
+- `FeatureDocumentVectorizerManager.token_feature_ids` now defaults to the
+  token features of its owned `doc_parser`.
+- Pin dependencies to a working HuggingFace transformers version, as the new
+  version breaks this release.
+- Fix glove embedding factory create functionality.
+
+
+## [0.0.7] - 2021-09-22
+### Changed
+- Refactored downstream renaming of files from [zensols.deeplearn].
+- Moved `ClassificationPredictionMapper` class to new `classify` module.
+
+### Added
+- Classification module and classes now fully implement text classification
+  with RNN/LSTM/GRU network types or any HuggingFace transformer with pooler
+  output.  This means there is no coding necessary for text classification with
+  the exception of writing a data loader if not in a supported format like
+  Pandas dataframe (i.e. CSV file).
+- Configuration resource library.
+- Clickbate corpus example and documentation.
+
+
+## [0.0.6] - 2021-09-07
+### Changed
+- Revert to version 3.8.3 of gensim and support back/forward compatibility.
+- Upgrade zensols libraries.
+- Documentation and clean up.
+
+
+## [0.0.5] - 2021-08-07
+### Changed
+- Upgrade dependencies.
+
+
+## [0.0.4] - 2021-08-07
+### Added
+- Sequence/token classification for BiLSTM+CRF and HuggingFace transformers.
+  This has been tested with BERT/DistilBERT/RoBERTa and the large BERT models.
+- The HuggingFace transformers optimizer for `AdamW` and scheduler for
+  functionality such as fine tuning warm up.
+- More NLP facade specific support such as easier embedding model access.
+- Better support for Jupyter notebook rapid prototyping and experimentation.
+- Jupyter integration tests in review movie example.
+
+### Changed
+- Upgrade to spaCy 3 via the [zensols.nlparse] dependency.
+
+### Removed
+- Move feature containers and parser to [zensols.nlparse], including test
+  cases.
+- The dependency on [bcolz] as it is no longer maintained.  The caching of
+  binary word vectors was replaced with [H5PY].
+
+
+## [0.0.3] - 2021-04-30
+### Added
+- BERT/DistilBERT/RoBERTa transformer word piece tokenizer to linguistic token
+  mapping.
+- Upgraded to `gensim` 4.0.1.
+- Upgraded to [zensols.deeplearn] 0.1.2, which is upgraded to use [PyTorch] 1.8.
+- Added simple vectorizer example.
+- Multiprocessing vectorization now supports GPU access via torch
+  multiprocessing subsystem.
+
+### Changed
+- Refactored word embedding (sub) modules.
+- Moved BERT transformer embeddings to separate `transformer` module.
+- Refactored vectorizers to standardize around `FeatureDocument` rather than
+  token collection instances.
+- Standardize vectorizer shapes.
+- Updated examples to use new vectorizer API and [zensols.util] application
+  CLI.
+
+
+## [0.0.2] - 2020-12-29
+Maintenance release.
+### Changed
+- Upgraded dependencies and tested across Python 3.7, 3.8, 3.9.
+
+
+## [0.0.1] - 2020-05-04
+### Added
+- Initial version.
+
+
+<!-- links -->
+[Unreleased]: https://github.com/plandes/deepnlp/compare/v1.15.1...HEAD
+[1.15.1]: https://github.com/plandes/deepnlp/compare/v1.15.0...v1.15.1
+[1.15.0]: https://github.com/plandes/deepnlp/compare/v1.14.0...v1.15.0
+[1.14.0]: https://github.com/plandes/deepnlp/compare/v1.13.1...v1.14.0
+[1.13.1]: https://github.com/plandes/deepnlp/compare/v1.13.0...v1.13.1
+[1.13.0]: https://github.com/plandes/deepnlp/compare/v1.12.0...v1.13.0
+[1.12.0]: https://github.com/plandes/deepnlp/compare/v1.11.1...v1.12.0
+[1.11.1]: https://github.com/plandes/deepnlp/compare/v1.11.0...v1.11.1
+[1.11.0]: https://github.com/plandes/deepnlp/compare/v1.10.1...v1.11.0
+[1.10.1]: https://github.com/plandes/deepnlp/compare/v1.10.0...v1.10.1
+[1.10.0]: https://github.com/plandes/deepnlp/compare/v1.9.1...v1.10.0
+[1.9.1]: https://github.com/plandes/deepnlp/compare/v1.9.0...v1.9.1
+[1.9.0]: https://github.com/plandes/deepnlp/compare/v1.8.0...v1.9.0
+[1.8.0]: https://github.com/plandes/deepnlp/compare/v1.7.0...v1.8.0
+[1.7.0]: https://github.com/plandes/deepnlp/compare/v1.6.0...v1.7.0
+[1.6.0]: https://github.com/plandes/deepnlp/compare/v1.5.0...v1.6.0
+[1.5.0]: https://github.com/plandes/deepnlp/compare/v1.4.1...v1.5.0
+[1.4.1]: https://github.com/plandes/deepnlp/compare/v1.4.0...v1.4.1
+[1.4.0]: https://github.com/plandes/deepnlp/compare/v1.3.0...v1.4.0
+[1.3.0]: https://github.com/plandes/deepnlp/compare/v1.2.0...v1.3.0
+[1.2.0]: https://github.com/plandes/deepnlp/compare/v1.1.2...v1.2.0
+[1.1.2]: https://github.com/plandes/deepnlp/compare/v1.1.1...v1.1.2
+[1.1.1]: https://github.com/plandes/deepnlp/compare/v1.1.0...v1.1.1
+[1.1.0]: https://github.com/plandes/deepnlp/compare/v1.0.1...v1.1.0
+[1.0.1]: https://github.com/plandes/deepnlp/compare/v1.0.0...v1.0.1
+[1.0.0]: https://github.com/plandes/deepnlp/compare/v0.0.8...v1.0.0
+[0.0.8]: https://github.com/plandes/deepnlp/compare/v0.0.7...v0.0.8
+[0.0.7]: https://github.com/plandes/deepnlp/compare/v0.0.6...v0.0.7
+[0.0.6]: https://github.com/plandes/deepnlp/compare/v0.0.5...v0.0.6
+[0.0.5]: https://github.com/plandes/deepnlp/compare/v0.0.4...v0.0.5
+[0.0.4]: https://github.com/plandes/deepnlp/compare/v0.0.3...v0.0.4
+[0.0.3]: https://github.com/plandes/deepnlp/compare/v0.0.2...v0.0.3
+[0.0.2]: https://github.com/plandes/deepnlp/compare/v0.0.1...v0.0.2
+[0.0.1]: https://github.com/plandes/deepnlp/compare/v0.0.0...v0.0.1
+
+[zensols.util]: https://github.com/plandes/util
+[zensols.nlparse]: https://github.com/plandes/nlparse
+[zensols.deeplearn]: https://github.com/plandes/deeplearn
+[PyTorch]: https://pytorch.org
+[bcolz]: https://github.com/Blosc/bcolz
+[H5PY]: https://www.h5py.org
+[HuggingFace Transformers]: https://pypi.org/project/transformers/
+[spaCy]: https://spacy.io