Change Log

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

1.3.0 - 2022-08-06

Added

  • Token indexing mappings accounting for (named entity) multi-word tokens.
  • IOB (iob_, iob) features.
  • Re-loadable components and component initializers.
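For readers unfamiliar with IOB tagging, the following is a minimal sketch of how one tag per token might be derived from entity spans. It is plain Python and does not use this library's API; the `iob_tags` function and the `(start, end)` token index convention (with `end` exclusive) are assumptions made for illustration only.

```python
from typing import List, Tuple


def iob_tags(n_tokens: int, entities: List[Tuple[int, int]]) -> List[str]:
    """Return one IOB tag per token.

    ``entities`` holds hypothetical (start, end) token index pairs, with
    ``end`` exclusive.  This is an illustration, not the library's API.
    """
    tags = ['O'] * n_tokens          # outside any entity by default
    for start, end in entities:
        tags[start] = 'B'            # the first token begins the entity
        for i in range(start + 1, end):
            tags[i] = 'I'            # the remaining tokens are inside it
    return tags


tokens = ['Barack', 'Obama', 'visited', 'New', 'York', 'City', '.']
for token, tag in zip(tokens, iob_tags(len(tokens), [(0, 2), (3, 6)])):
    print(f'{token}: {tag}')
```

A multi-word entity such as "New York City" spans three tokens here, which is the situation the token indexing mappings above have to account for.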

Changed

  • Upgraded to spaCy 3.2.
  • Add spaCy tokens to spaCy feature tokens.
  • Bug fixes in combining and overlapping sentences.
  • Switched to shallow copy of document in overlapping sentence doc methods.

1.2.0 - 2022-06-16

Removed

  • Remove resource library regular_expression_escape:dollar configuration. Use zensols.util conf_esc:dollar as a replacement.

1.1.2 - 2022-06-14

Changed

  • Dependency bump.

1.1.1 - 2022-05-15

Changed

  • Dependency bump.

1.1.0 - 2022-05-04

Changed

  • Fix resource leaks and other bugs.
  • Persist original text along with FeatureDocument rather than reconstruct it from sentence and/or token text.
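To illustrate why persisting the original text matters, here is a small sketch showing that text rebuilt from tokens can differ from the input. The `Doc` class and its fields are hypothetical and stand in for a document container; they are not this library's classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Doc:
    """Hypothetical document container, for illustration only."""
    tokens: Tuple[str, ...]
    text: str  # the original text, stored exactly as given

    def reconstructed(self) -> str:
        # naive reconstruction: whitespace information is lost
        return ' '.join(self.tokens)


original = 'He said:  "Wait."'
doc = Doc(tokens=('He', 'said', ':', '"', 'Wait', '.', '"'), text=original)
print(repr(doc.text))
print(repr(doc.reconstructed()))
```

The stored `text` round-trips exactly, while the reconstructed string changes the spacing around punctuation.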

Added

  • A lexical overlap utility module (overlap).
  • A token normalizer that merges tokens into spans (JoinTokenMapper).
  • Regular expression matching for entity and merge components (similar to JoinTokenMapper).
  • Add back TokenAnnotatedFeatureSentence for downstream packages.
  • Add a token decorator to the spaCy parser so that features can be added or modified on creation, separately from the parser class hierarchy.

1.0.1 - 2022-01-25

Added

  • Sentences and tokens accessible by index.

Changed

  • More robust regular expression for token splitting.
  • Mapping combiner is persistable with spaCy tokens and handles split named entities.

1.0.0 - 2021-10-22

First major development release.

Added

  • A FeatureDocumentCombiner that merges features from different document parsers.
  • Top level library NLPError.
  • A pipeline component and resource configuration library entry to remove sentence boundaries in a spaCy document.

Changed

  • Split out optional resource library content into mappers.conf.
  • The spaCy model has attribute langres set on LanguageResource to enable creation of factory instances from registered pipe components.
  • Fix issue with component creation with no pipeline arguments.

Removed

  • The DocStash instance as it was too simple for any practical application.

0.1.3 - 2021-09-21

Changed

  • Dependency bump.

Removed

  • zensols.nlp.lang.DocStash

0.1.2 - 2021-09-21

Changed

  • Make FeatureDocumentParser callable.
  • Fix memory leak in LanguageResource.

Added

  • Configuration Resource library.
  • Configuration for keyword arguments to the add_pipe_comp and example.

0.1.1 - 2021-09-07

Changed

  • Fixed bug with creating a dict from a FeatureToken.
  • Fixed and improved how Feature{Token,Sentence,Document} are converted to dictionaries (asdict) and written as text (write).

Added

  • Creates a Pandas dataframe from token feature attributes.
  • Add back FeatureToken feature ID -> type for write dumping.
  • Add the lexical location SpacyTokenFeatures.loc, the token's position in the document as a (starting, ending) range.
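As a rough illustration of a (starting, ending) location range, the sketch below computes character offsets for tokens in a text. The `token_locations` function is hypothetical, and the exclusive end offset is an assumption; the changelog does not say which convention the loc feature uses.

```python
from typing import List, Tuple


def token_locations(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return a (start, end) character range for each token.

    Hypothetical helper, not the library's API.  ``end`` is exclusive here,
    which is an assumption made for this sketch.
    """
    locs = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # search from the previous token's end
        end = start + len(tok)
        locs.append((start, end))
        pos = end
    return locs


text = 'Dogs bark loudly.'
for tok, loc in zip(['Dogs', 'bark', 'loudly', '.'],
                    token_locations(text, ['Dogs', 'bark', 'loudly', '.'])):
    print(tok, loc)
```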

0.1.0 - 2021-08-16

This release simplifies the token attributes level classes in the features module by:

  • Using feature IDs instead of trying to make sense of the class property/attribute member data.
  • Using the FeatureDocumentParser and FeatureToken to copy spaCy resources to simple picklable Python classes.

This not only greatly reduces complexity in the class hierarchy and the data copy/move functionality, but also speeds things up.

Changed

  • Attributes set on detached token features are no longer robust. Before, if a token feature ID was specified, but didn't exist on the source token feature set, it would copy over a None. This now raises an AttributeError instead.
  • For TokenAttributes, creation of dicts (either by asdict or get_features) is now consistent with the set attributes and properties of the class. Only the features passed to these methods are included; they default to the FIELD_IDS of the class (which can be overridden at the class level).
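The stricter copy behavior described above can be sketched as follows. `DetachedFeatures` is a hypothetical stand-in written for this example, not the library's class; it only shows the difference between silently copying `None` and raising `AttributeError` for a feature ID missing on the source.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


class DetachedFeatures:
    """Hypothetical stand-in for a detached token feature set."""

    def __init__(self, source: Any, feature_ids: Iterable[str]):
        self.feature_ids = tuple(feature_ids)
        for fid in self.feature_ids:
            # getattr without a default raises AttributeError when the
            # feature is missing, rather than silently copying None
            setattr(self, fid, getattr(source, fid))

    def asdict(self) -> Dict[str, Any]:
        # only the requested feature IDs end up in the dictionary
        return {fid: getattr(self, fid) for fid in self.feature_ids}


source = SimpleNamespace(norm='dog', pos_='NOUN')
print(DetachedFeatures(source, ['norm', 'pos_']).asdict())
try:
    DetachedFeatures(source, ['norm', 'lemma_'])
except AttributeError as e:
    print('missing feature:', e)
```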

Removed

  • The individual attribute/property dictionary creation methods TokenAttributes.{string}features. These methods are obviated by get_features, which returns all features in FIELD_IDS.
  • FeatureDocumentParser.additional_token_feature_ids to simplify token feature IDs passed to feature tokens.
  • The TokenAttributes class, as it was just a metadata member holder.

Added

  • A spaCy implementation of the TokenFeatures class that somewhat resembles the TokenFeatures of the old class hierarchy.

0.0.15 - 2021-08-07

Changed

  • Upgrade from spaCy 2.x to 3.x.

Added

  • POS feature inclusion by default to support is_pronoun, which is needed after spaCy 3 changed how lemmatization works.
  • Move feature containers and parser from zensols.deepnlp, including test cases.
  • A sentence index feature (i_sent).
  • An index of sentence feature (sent_i).
  • Advanced spaCy configuration by adding component classes. This gives more control over configuring the spaCy pipeline.
  • Add feature containers (FeatureDocument) and parser (FeatureDocumentParser), which were moved over from zensols.deepnlp.

0.0.14 - 2021-04-29

Changed

  • Upgrade to zensols.util==1.4.1.
  • Upgrade documentation API generation.
  • Pin the spaCy dependency to 2.3.5 until pip dependencies are fixed.
  • Added sentence index features to reconstruct sentences from documents.

0.0.13 - 2021-01-14

Changed

  • Fix adding components for spaCy > 2.0.
  • Add langres model to API documentation.

0.0.12 - 2020-12-29

Changed

  • Upgraded zenbuild.
  • Switched from Travis to GitHub workflows.
  • Tested with Python 3.9.1.

0.0.11 - 2020-12-09

Changed

  • Add basic token features for non-spaCy parse use cases.
  • Rename feature type to feature ID.
  • TokenFeatures is now a dictable, with to_dict renamed to asdict.

0.0.10 - 2020-12-09

Added

  • Sphinx documentation, which includes API docs.

Changed

  • Settable detached TokenAttributes instances.
  • Use dataclasses, which requires Python >= 3.7.

0.0.9 - 2020-05-10

Changed

  • Move lemmatizing out of the default token normalizer.
  • Update super method calls to the modern style (Python 3.7 at least).
  • Fix the spurious "can't find smart_open.gcs" warning.
  • Remove language resource factory.
  • Upgrade to zensols.util 1.2.0 and get rid of custom factories.

Added

  • Feature to parse whole special tokens.
  • Added the Porter stemmer from NLTK.

0.0.8 - 2020-04-14

Changed

  • Upgrade to spaCy 2.2.4 and textacy 0.10.0.

0.0.7 - 2020-01-24

Added

  • Added the Porter stemmer from NLTK.

Changed

  • Better class naming for token mapper.
  • Features debugging bug fix.

0.0.6 - 2019-12-14

Changed

  • Fix Travis.

0.0.5 - 2019-12-14

Data classes are now used so Python 3.7 is now a requirement.

Added

  • Feature normalizers were added for neural networks.
  • Implemented a better strategy for using language resources with token normalization.

0.0.4 - 2019-11-21

Added

  • Added a detachable and picklable token feature set.

0.0.3 - 2019-07-31

Added

  • DocStash that parses documents as a factory stash.

0.0.2 - 2019-07-25

Added

  • Feature to disable spaCy pipeline components.
  • Add configuration for removing punctuation and determiners.

Changed

  • Skip textacy for document creation since it wasn't used. This is more efficient.

0.0.1 - 2019-07-06

Added

  • Initial version.