All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
1.3.0 - 2022-08-06
- Token indexing mappings accounting for (named entity) multi-word tokens.
- IOB (iob_, iob) features.
- Re-loadable components and component initializers.
- Upgraded to spaCy 3.2
- Add spaCy tokens to spaCy feature tokens.
- Bug fixes in combining and overlapping sentences.
- Switched to shallow copy of document in overlapping sentence doc methods.
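
For readers unfamiliar with IOB (inside, outside, beginning) tagging, the following is a minimal sketch of how such tags are commonly derived from named entity spans. It uses plain Python only. The function name `iob_tags` and the `(start, end, label)` span layout are hypothetical and are not this library's API.

```python
from typing import List, Tuple


def iob_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one IOB tag per token.

    Each span is a hypothetical (start, end, label) triple of token
    indexes, where ``end`` is exclusive.
    """
    # every token starts outside of any entity
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        # the first token of an entity begins it
        tags[start] = f'B-{label}'
        # the remaining tokens are inside the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{label}'
    return tags


tokens = ['Barack', 'Obama', 'visited', 'Paris', '.']
tags = iob_tags(len(tokens), [(0, 2, 'PERSON'), (3, 4, 'GPE')])
print(list(zip(tokens, tags)))
```

A multi-word entity such as "Barack Obama" yields a `B-` tag followed by `I-` tags, which is why token index mappings need to account for multi-word tokens.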
1.2.0 - 2022-06-16
- Remove resource library regular_expression_escape:dollar configuration. Use zensols.util conf_esc:dollar as a replacement.
1.1.2 - 2022-06-14
- Dependency bump.
1.1.1 - 2022-05-15
- Dependency bump.
1.1.0 - 2022-05-04
- Fix resource leaks and other bugs.
- Persist original text along with FeatureDocument rather than reconstruct it from sentence and/or token text.
- A lexical overlapping utility module (overlap).
- A token normalizer that merges tokens into spans (JoinTokenMapper).
- Regular expression matching for entity and merge components (similar to JoinTokenMapper).
- Add back TokenAnnotatedFeatureSentence for downstream packages.
- Add token decorator to the spaCy parser to allow adding or modifying features on creation, separate from the parser class hierarchy.
1.0.1 - 2022-01-25
- Sentences and tokens accessible by index.
- More robust regular expression for token splitting.
- Mapping combiner is persistable with spaCy tokens and handles split named entities.
1.0.0 - 2021-10-22
First major development release.
- A FeatureDocumentCombiner that merges features from different document parsers.
- Top level library NLPError.
- A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.
- Split out optional resource library content into mappers.conf.
- The spaCy model has attribute langres set on LanguageResource to enable creation of factory instances from registered pipe components.
- Fix issue with component creation with no pipeline arguments.
- Remove the DocStash instance, as it was too simple for any practical application.
0.1.3 - 2021-09-21
- Dependency.
- zensols.nlp.lang.DocStash
0.1.2 - 2021-09-21
- Make FeatureDocumentParser callable.
- Fix memory leak in LanguageResource.
- Configuration resource library.
- Configuration for keyword arguments to the add_pipe_comp and example.
0.1.1 - 2021-09-07
- Fixed bug with creating a dict from a FeatureToken.
- Fixed/improved how Feature{Token,Sentence,Document} are dictified (with asdict) and how they are written as text with write.
- Creates a Pandas dataframe from token feature attributes.
- Add back FeatureToken feature ID -> type for write dumping.
- Add lexical location SpacyTokenFeatures.loc, the location in the document as a (starting, ending) range.
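
One way to picture a (starting, ending) lexical range is as character offsets into the document text. The sketch below assumes whitespace tokenization and an exclusive ending offset. Both are assumptions for illustration, and `token_locs` is a hypothetical helper rather than part of this library.

```python
import re
from typing import List, Tuple


def token_locs(text: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Pair each whitespace-delimited token with its (start, end) offsets."""
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r'\S+', text)]


text = 'Dan throws the ball'
locs = token_locs(text)
for tok, (start, end) in locs:
    # slicing the original text with the range recovers the token
    assert text[start:end] == tok
print(locs)
```

Keeping the range with each token allows the original text of a span to be recovered by slicing rather than by rejoining token strings.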
0.1.0 - 2021-08-16
This release simplifies the token attribute level classes in the features module by:
- Using feature IDs instead of trying to make sense of the class property/attribute member data.
- Using the FeatureDocumentParser and FeatureToken to copy spaCy resources to simple picklable Python classes.
Not only does this greatly reduce complexity in the class hierarchy and data copy/move functionality, but it also speeds things up.
- Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn't exist on the source token feature set, it would copy over a None. This now raises an AttributeError instead.
- For TokenAttributes, creation of dicts (either by asdict or get_features) is now consistent with the set attributes and properties of the class. Only those specified in the arguments passed to the methods are included, which default to the FIELD_IDS of the class (which can be overridden at the class level).
- Remove the dictionary creation methods for individual attribute/property features, TokenAttributes.{string}features. These methods are obviated by get_features, which returns all features in FIELD_IDS.
- FeatureDocumentParser.additional_token_feature_ids to simplify token feature IDs passed to feature tokens.
- Remove the TokenAttributes class, as it was just a metadata member holder.
- A spaCy implementation of the TokenFeatures class that somewhat resembles the old TokenFeatures of the old class hierarchy.
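
The stricter copy behavior described in this release can be sketched as follows. This is an illustration in plain Python, not the library's implementation: `SourceToken` and `DetachedToken` are hypothetical stand-ins, and only the `FIELD_IDS` and `asdict` names are taken from the entries above.

```python
from typing import Any, Dict, FrozenSet, Iterable, Optional


class SourceToken:
    """Hypothetical stand-in for a parsed token."""
    def __init__(self, norm: str, tag: str):
        self.norm = norm
        self.tag = tag


class DetachedToken:
    """Hypothetical picklable copy holding only the requested feature IDs."""
    # class level default, which a subclass could override
    FIELD_IDS: FrozenSet[str] = frozenset({'norm', 'tag'})

    def __init__(self, source: Any,
                 feature_ids: Optional[Iterable[str]] = None):
        ids = self.FIELD_IDS if feature_ids is None else frozenset(feature_ids)
        for fid in ids:
            # getattr without a default raises AttributeError when the
            # feature is missing, rather than silently copying None
            setattr(self, fid, getattr(source, fid))
        self._feature_ids = ids

    def asdict(self) -> Dict[str, Any]:
        # only the features that were actually set are included
        return {fid: getattr(self, fid) for fid in sorted(self._feature_ids)}


src = SourceToken('ball', 'NN')
tok = DetachedToken(src)
print(tok.asdict())

try:
    DetachedToken(src, {'norm', 'ent'})
    raised = False
except AttributeError:
    raised = True
print(raised)
```

Failing fast on an unknown feature ID surfaces configuration mistakes at parse time instead of leaving None values to be discovered downstream.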
0.0.15 - 2021-08-07
- Upgrade from spaCy 2.x to 3.x.
- POS feature inclusion by default to support is_pronoun, which is needed after spaCy 3 changed how lemmatization works.
- Move feature containers and parser from zensols.deepnlp, including test cases.
- A sentence index feature (i_sent).
- An index of sentence feature (sent_i).
- Advanced spaCy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
- Add feature containers (FeatureDocument) and parser (FeatureDocumentParser), which were moved over from zensols.deepnlp.
0.0.14 - 2021-04-29
- Upgrade to zensols.util==1.4.1.
- Upgrade documentation API generation.
- Pin dependencies to spaCy 2.3.5 until pip deps are fixed.
- Added sentence index features to reconstruct sentences from documents.
0.0.13 - 2021-01-14
- Fix component adds for spaCy > 2.0.
- Add langres model to API documentation.
0.0.12 - 2020-12-29
- Upgraded zenbuild.
- Switched from Travis to GitHub workflows.
- Tested with Python 3.9.1.
0.0.11 - 2020-12-09
- Add basic token features for non-spaCy parse use cases.
- Rename feature type to feature id.
- TokenFeatures is now a dictable, with to_dict renamed to asdict.
0.0.10 - 2020-12-09
- Sphinx documentation, which includes API docs.
- Settable detached TokenAttributes instances.
- Use dataclasses, which therefore requires Python >= 3.7.
0.0.9 - 2020-05-10
- Home/master move lemmatizing out of default token normalizer.
- Update super method calls to modern (at least) Python 3.7.
- Fix annoying can't find smart_open.gcs bogus warning.
- Remove language resource factory.
- Upgrade to zensols.util 1.2.0 and get rid of custom factories.
- Feature to parse whole special tokens.
- Added Porter stemmer from NLTK.
- Moved word2vec embedding (word2vec.py) to zensols.deepnlp library.
- Moved feature normalization (fnorm.py) to zensols.deepnlp library.
0.0.8 - 2020-04-14
- Upgrade to spaCy 2.2.4 and textacy 0.10.0.
0.0.7 - 2020-01-24
- Added the Porter stemmer from NLTK.
- Better class naming for token mapper.
- Features debugging bug fix.
0.0.6 - 2019-12-14
- Fix Travis.
0.0.5 - 2019-12-14
Data classes are now used so Python 3.7 is now a requirement.
- Feature normalizers were added for neural networks.
- Implemented a better strategy for using language resources with token normalization.
0.0.4 - 2019-11-21
- Adding detachable and picklable token feature set.
0.0.3 - 2019-07-31
- DocStash that parses documents as a factory stash.
0.0.2 - 2019-07-25
- Feature to disable spaCy pipeline components.
- Add configuration for removing punctuation and determiners.
- Skip textacy for document creation since it wasn't used. This is more efficient.
0.0.1 - 2019-07-06
- Initial version.