DeepZensols Natural Language Processing

Deep learning utility library for natural language processing that aids in feature engineering and embedding layers.

See the full documentation.
See the paper.

Features:

Configurable layers with little to no need to write code.
Natural language specific layers:
- Easily configurable word embedding layers for GloVe, Word2Vec, fastText.
- Huggingface transformer (BERT) context-based word vector layer.
- Full Embedding+BiLSTM-CRF implementation using easy to configure constituent layers.
NLP specific vectorizers that generate zensols deeplearn encoded and decoded batched tensors for spaCy parsed features, dependency tree features, overlapping text features and others.
Embedding layers, provided as batched tensors, and other vectorized linguistic features that are easily swappable at runtime.
Support for token, document and embedding level vectorized features.
Transformer word piece to linguistic token mapping.
Two fully documented reference models provided as both command line applications and Jupyter notebooks.
Command line support for training, testing, debugging, and creating predictions.
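
To make the word piece to linguistic token mapping feature more concrete, here is a minimal sketch in plain Python. It does not use this library's API: the function name `map_word_pieces` is hypothetical, the word pieces are hand-written rather than produced by a real tokenizer, and it assumes the BERT WordPiece convention that continuation pieces start with `##` and that every linguistic token begins with exactly one piece lacking that prefix.

```python
from typing import List, Tuple


def map_word_pieces(tokens: List[str], pieces: List[str],
                    prefix: str = "##") -> List[Tuple[str, List[str]]]:
    """Group word pieces under the linguistic token they were split from.

    Assumes each token starts with exactly one piece that lacks ``prefix``.
    """
    mapping: List[Tuple[str, List[str]]] = []
    tok_iter = iter(tokens)
    for piece in pieces:
        if piece.startswith(prefix) and mapping:
            # continuation piece: it belongs to the most recent token
            mapping[-1][1].append(piece)
        else:
            # a piece without the prefix starts the next linguistic token
            mapping.append((next(tok_iter), [piece]))
    return mapping


# hand-written example input (hypothetical, not real tokenizer output)
tokens = ["The", "vectorizer", "works"]
pieces = ["the", "vector", "##izer", "works"]
aligned = map_word_pieces(tokens, pieces)
for token, word_pieces in aligned:
    print(token, word_pieces)
```

The one-prefixless-piece-per-token assumption breaks when the parser and the transformer tokenizer split text differently (for example around punctuation or contractions), so a production mapping would more likely align on character offsets.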

Documentation

Full documentation
Layers: NLP specific layers such as embeddings and transformers
Vectorizers: specific vectorizers that digitize natural language text into tensors ready as PyTorch input
API reference
Reference Models

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install zensols.deepnlp

Binaries are also available on PyPI.

Usage

The API can be used as is by manually configuring each component. However, this library (like any Zensols API) was designed to be instantiated with inversion of control using resource libraries.
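
As a rough illustration of what instantiating components from configuration (rather than from hand-written constructor calls) can look like, the sketch below uses only the Python standard library. It is not the Zensols configuration API: the `EmbeddingLayer` class, the `REGISTRY` lookup, the `create` function, and the section and option names are all assumptions made up for this example.

```python
import configparser
from dataclasses import dataclass


@dataclass
class EmbeddingLayer:
    """Hypothetical component; not a class from this library."""
    name: str
    dimension: int


# classes that a configuration section is allowed to name (assumption)
REGISTRY = {"EmbeddingLayer": EmbeddingLayer}

# in-lined configuration; section and option names are illustrative only
CONFIG = """
[glove_layer]
class_name = EmbeddingLayer
name = glove
dimension = 50
"""


def create(config: configparser.ConfigParser, section: str):
    """Instantiate the component described by a configuration section."""
    params = dict(config[section])
    cls = REGISTRY[params.pop("class_name")]
    # option values are strings; coerce them to the annotated field types
    kwargs = {key: cls.__annotations__[key](value)
              for key, value in params.items()}
    return cls(**kwargs)


config = configparser.ConfigParser()
config.read_string(CONFIG)
layer = create(config, "glove_layer")
print(layer)
```

The calling code never names the class it builds, so swapping one component for another is a configuration change rather than a code change, which is the general idea behind inversion of control.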

Component

Components and out of the box models are available with little to no coding. However, this simple example that uses the library's components is recommended for starters. The example is a command line application that in-lines a simple configuration needed to create deep learning NLP components.

Similarly, this example is also a command line application, but it uses a masked language model to fill in words.

Reference Models

If you're in a rush, you can dive right into the Clickbate Text Classification reference model, which is a working project that uses this library. However, you'll likely end up reading up on the zensols deeplearn library either before or during the tutorial.

The usage of this library is explained in terms of the reference models:

The Clickbate Text Classification is the best reference model to start with because the only code it consists of is the corpus reader and a module to remove sentence segmentation (the corpus is newline-delimited headlines). It also uses resource libraries, which greatly reduces complexity, whereas the other reference models do not. Also see the Jupyter clickbate classification notebook.
The Movie Review Sentiment model is trained and tested on the Stanford movie review and Cornell sentiment polarity data sets, and assigns a positive or negative score to a natural language movie review by critics. Also see the Jupyter movie sentiment notebook.
The Named Entity Recognizer is trained and tested on the CoNLL 2003 data set to label named entities in natural language text. Also see the Jupyter NER notebook.

The unit test cases are also a good resource for the more detailed programming integration with various parts of the library.
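
For a sense of how small a corpus reader for newline-delimited headlines can be, here is a minimal sketch. It is not the reader shipped with the Clickbate reference model: the function name `read_headlines`, the label strings, and the sample headlines are all made up for illustration, and it assumes one headline per line with blank lines ignored.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_headlines(stream: TextIO, label: str) -> Iterator[Tuple[str, str]]:
    """Yield (headline, label) pairs, one per non-empty line."""
    for line in stream:
        headline = line.strip()
        if headline:
            yield headline, label


# made-up sample data standing in for two corpus files
clickbait = io.StringIO("You will not believe this trick\n\nTen facts about cats\n")
news = io.StringIO("Council approves new budget\n")

rows = list(read_headlines(clickbait, "y")) + list(read_headlines(news, "n"))
for headline, label in rows:
    print(label, headline)
```

Because each line is already a complete document, no sentence segmentation is needed on this kind of corpus.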

Attribution

This project, or reference model code, uses:

Gensim for GloVe, Word2Vec and fastText word embeddings.
Huggingface Transformers for BERT contextual word embeddings.
h5py for fast read access to word embedding vectors.
zensols nlparse for feature generation from spaCy parsing.
zensols deeplearn for deep learning network libraries.

Corpora used include:

Stanford movie review
Cornell sentiment polarity
CoNLL 2003 data set

Citation

If you use this project in your research please use the following BibTeX entry:

@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any input are welcome.

License

MIT License
