Release Improved I/O, VSM, docs, and more · chartbeat-labs/textacy

Changes:

Rename, refactor, and extend I/O functionality (PR #151)
- Related read/write functions were moved from read.py and write.py into
  format-specific modules, and similar functions were consolidated into one
  with the addition of an arg. For example, write.write_json() and
  write.write_json_lines() => json.write_json(lines=True|False).
- Useful functionality was added to a few readers/writers. For example,
  write_json() now automatically handles python dates/datetimes, writing
  them to disk as ISO-formatted strings rather than raising a TypeError
  ("datetime is not JSON serializable", ugh). CSVs can now be written to /
  read from disk when each row is a dict rather than a list. Reading/writing
  HTTP streams now allows for basic authentication.
- Several things were renamed to improve clarity and consistency from a user's
  perspective, most notably the subpackage name: fileio => io. Others:
  read_file() and write_file() => read_text() and write_text();
  split_record_fields() => split_records(), although I kept an alias
  to the old function for folks; auto_make_dirs boolean kwarg => make_dirs.
- io.open_sesame() now handles zip files (provided they contain only 1 file)
  as it already does for gzip, bz2, and lzma files. On a related note, Python 2
  users can now open lzma (.xz) files if they've installed backports.lzma.
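To illustrate the consolidated `lines` arg and the date handling described above, here is a minimal sketch using only the standard library. It is not textacy's implementation: `write_json_sketch` and `iso_default` are hypothetical names, and the real json.write_json() signature may differ.

```python
import datetime
import io
import json


def iso_default(obj):
    """Fallback for json.dumps: render dates/datetimes as ISO-formatted strings."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    # Anything else still fails loudly, as the stdlib encoder would.
    raise TypeError("{!r} is not JSON serializable".format(obj))


def write_json_sketch(data, fileobj, lines=False):
    """Hypothetical helper: write one JSON document, or one JSON item per line."""
    if lines:
        for item in data:
            fileobj.write(json.dumps(item, default=iso_default) + "\n")
    else:
        fileobj.write(json.dumps(data, default=iso_default))


records = [
    {"title": "first", "published": datetime.date(2020, 1, 2)},
    {"title": "second", "published": datetime.datetime(2020, 1, 3, 9, 30)},
]

# An in-memory buffer stands in for a file on disk.
buf = io.StringIO()
write_json_sketch(records, buf, lines=True)
json_lines = buf.getvalue().splitlines()
```

Passing a `default` callable keeps the stdlib encoder in charge of everything it already handles, and only steps in for the date types that would otherwise raise a TypeError.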
Improve, refactor, and extend vector space model functionality (PRs #156 and #167)
- BM25 term weighting and document-length normalization were implemented, and
  users can now flexibly add and customize individual components of an
  overall weighting scheme (local scaling + global scaling + doc-wise normalization).
  For API sanity, several additions and changes to the Vectorizer init
  params were required --- sorry bout it!
- Given all the new weighting possibilities, a Vectorizer.weighting attribute
  was added for curious users, to give a mathematical representation of how
  values in a doc-term matrix are being calculated. Here's a simple and a
  not-so-simple case:
```
 >>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
 'tf * log((n_docs + 1) / (df + 1)) + 1'
 >>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='bm25', apply_dl=True).weighting
 '(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths))) * log((n_docs - df + 0.5) / (df + 0.5))'
```
- Terms are now sorted alphabetically after fitting, so you'll have a consistent
  and interpretable ordering in your vocabulary and doc-term-matrix.
- A GroupVectorizer class was added, as a child of Vectorizer and
  an extension of typical document-term matrix vectorization. In a typical
  doc-term matrix, each row vector corresponds to the weighted terms
  co-occurring in a single document; here, each row instead corresponds to a
  user-defined group, such as a shared author or publication year, that may
  span multiple documents, without forcing users to merge / concatenate
  those documents themselves.
- Lastly, the vsm.py module was refactored into a vsm subpackage with
  two modules. Imports should stay the same, but the code structure is now
  more amenable to future additions.
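The BM25 weighting string shown above can be worked through for a single term in a single document. The sketch below is a plain-Python reading of that formula, not textacy's code. It assumes the denominator closes before the `* log(...)` factor, and the `k` and `b` defaults are conventional BM25 values picked for illustration rather than textacy's documented defaults.

```python
import math


def bm25_weight(tf, df, n_docs, length, avg_length, k=1.2, b=0.75):
    """Weight of one term in one doc under the BM25 scheme sketched above.

    tf: term count in the doc; df: number of docs containing the term;
    length / avg_length: this doc's length relative to the corpus average.
    k and b are assumed, conventional values.
    """
    # Saturating term frequency, with document-length normalization.
    tf_part = (tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg_length)))
    # BM25-style inverse document frequency.
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part


# Same term counts, but the second doc is twice the average length.
average_doc = bm25_weight(tf=3, df=2, n_docs=10, length=100, avg_length=100)
long_doc = bm25_weight(tf=3, df=2, n_docs=10, length=200, avg_length=100)
```

With `b > 0`, the same term count is worth less in a longer-than-average document, which is the document-length normalization at work.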
Miscellaneous additions and improvements
- Flesch Reading Ease in the textstats module is now multi-lingual! Language-
  specific formulations for German, Spanish, French, Italian, Dutch, and Russian
  were added, in addition to (the default) English. (PR #158, prompted by Issue #155)
- Runtime performance, as well as docs and error messages, of functions for
  generating semantic networks from lists of terms or sentences were improved. (PR #163)
- Labels on named entities from which determiners have been dropped are now
  preserved. There's still a minor gotcha, but it's explained in the docs.
- The size of textacy's data cache can now be set via an environment
  variable, TEXTACY_MAX_CACHE_SIZE, in case the default 2GB cache doesn't
  meet your needs.
- Docstrings were improved in many ways, large and small, throughout the code.
  May they guide you even more effectively than before!
- The package version is now set from a single source. This isn't for you so
  much as me, but it does prevent confusing version mismatches between code,
  PyPI, and docs.
- All tests have been converted from unittest to pytest style. They
  run faster, they're more informative in failure, and they're easier to extend.
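For a sense of what a language-specific Flesch Reading Ease formulation looks like, here is a small sketch covering two languages. The English coefficients are the standard Flesch ones and the German ones follow the commonly cited Amstad adaptation. Whether textacy uses exactly these values is an assumption, and `flesch_reading_ease_sketch` is a hypothetical name rather than the textstats API.

```python
# (base, weight on words per sentence, weight on syllables per word)
FLESCH_COEFFS = {
    "en": (206.835, 1.015, 84.6),  # standard Flesch formula
    "de": (180.0, 1.0, 58.5),      # Amstad adaptation (assumed here)
}


def flesch_reading_ease_sketch(n_words, n_sents, n_syllables, lang="en"):
    """Hypothetical helper: higher scores mean easier-to-read text."""
    if lang not in FLESCH_COEFFS:
        raise ValueError("no coefficients for lang={!r}".format(lang))
    base, sent_weight, syll_weight = FLESCH_COEFFS[lang]
    words_per_sent = n_words / n_sents
    sylls_per_word = n_syllables / n_words
    return base - sent_weight * words_per_sent - syll_weight * sylls_per_word


# The same raw counts score differently depending on the language's formula.
score_en = flesch_reading_ease_sketch(100, 5, 150, lang="en")
score_de = flesch_reading_ease_sketch(100, 5, 150, lang="de")
```

Keeping the coefficients in a per-language table means supporting another language only requires adding one entry.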

Bugfixes:

- Fixed an issue where existing metadata associated with a spacy Doc was being
  overwritten with an empty dict when using it to initialize a textacy Doc.
  Users can still overwrite existing metadata, but only if they pass in new data.
- Added a missing import to the README's usage example. (#149)
- The intersphinx mapping to numpy got fixed (and items for scipy and
  matplotlib were added, too). Taking advantage of that, a bunch of broken
  object links scattered throughout the docs got fixed.
- Fixed broken formatting of old entries in the changelog, for your reading pleasure.
