Improved I/O, VSM, docs, and more
Changes:
- Rename, refactor, and extend I/O functionality (PR #151)
  - Related read/write functions were moved from read.py and write.py into
    format-specific modules, and similar functions were consolidated into one
    with the addition of an arg. For example, write.write_json() and
    write.write_json_lines() => json.write_json(lines=True|False).
  - Useful functionality was added to a few readers/writers. For example,
    write_json() now automatically handles python dates/datetimes, writing
    them to disk as ISO-formatted strings rather than raising a TypeError
    ("datetime is not JSON serializable", ugh). CSVs can now be written to /
    read from disk when each row is a dict rather than a list. Reading/writing
    HTTP streams now allows for basic authentication.
  - Several things were renamed to improve clarity and consistency from a user's
    perspective, most notably the subpackage name: fileio => io. Others:
    read_file() and write_file() => read_text() and write_text();
    split_record_fields() => split_records(), although I kept an alias
    to the old function for folks; auto_make_dirs boolean kwarg => make_dirs.
  - io.open_sesame() now handles zip files (provided they contain only 1 file)
    as it already does for gzip, bz2, and lzma files. On a related note, Python 2
    users can now open lzma (.xz) files if they've installed backports.lzma.
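To make the date handling and the lines=True|False consolidation concrete, here is a minimal sketch using only the standard library. The names write_json_sketch and _to_iso are hypothetical and are not textacy's API; the sketch writes to an already-open text stream rather than to a file path, which is an assumption made to keep it self-contained.

```python
import datetime
import io
import json


def _to_iso(obj):
    """Fallback for json.dumps: turn dates/datetimes into ISO-formatted strings."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError("{!r} is not JSON serializable".format(obj))


def write_json_sketch(data, stream, lines=False):
    """Hypothetical stand-in (NOT textacy's write_json) that writes to a text stream."""
    if lines:
        # JSON Lines: one JSON document per line
        for item in data:
            stream.write(json.dumps(item, default=_to_iso) + "\n")
    else:
        # a single JSON document holding everything
        json.dump(data, stream, default=_to_iso)


records = [
    {"title": "a", "date": datetime.date(2017, 11, 1)},
    {"title": "b", "date": datetime.datetime(2017, 11, 2, 9, 30)},
]
buf = io.StringIO()
write_json_sketch(records, buf, lines=True)
output_lines = buf.getvalue().splitlines()
```

One function with a lines flag covers both layouts, and the default hook means dates never reach the encoder as raw objects.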
- Improve, refactor, and extend vector space model functionality (PRs #156 and #167)
  - BM25 term weighting and document-length normalization were implemented, and
    users can now flexibly add and customize individual components of an
    overall weighting scheme (local scaling + global scaling + doc-wise normalization).
    For API sanity, several additions and changes to the Vectorizer init
    params were required --- sorry bout it!
  - Given all the new weighting possibilities, a Vectorizer.weighting attribute
    was added for curious users, to give a mathematical representation of how
    values in a doc-term matrix are being calculated. Here's a simple and a
    not-so-simple case:

        >>> Vectorizer(apply_idf=True, idf_type='smooth').weighting
        'tf * log((n_docs + 1) / (df + 1)) + 1'
        >>> Vectorizer(tf_type='bm25', apply_idf=True, idf_type='bm25', apply_dl=True).weighting
        '(tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg(lengths))) * log((n_docs - df + 0.5) / (df + 0.5))'

  - Terms are now sorted alphabetically after fitting, so you'll have a consistent
    and interpretable ordering in your vocabulary and doc-term-matrix.
  - A GroupVectorizer class was added, as a child of Vectorizer and
    an extension of typical document-term matrix vectorization, in which each
    row vector corresponds to the weighted terms co-occurring in a single document.
    This allows for customized grouping, such as by a shared author or publication year,
    that may span multiple documents, without forcing users to merge / concatenate
    those documents themselves.
  - Lastly, the vsm.py module was refactored into a vsm subpackage with
    two modules. Imports should stay the same, but the code structure is now
    more amenable to future additions.
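For readers who want to see the BM25 expression as arithmetic, here is a rough sketch of one cell's weight, read off the weighting string reported by Vectorizer.weighting. The function name bm25_weight is hypothetical, the defaults k=1.2 and b=0.75 are commonly used values assumed here rather than taken from textacy, and the grouping of the denominator is my reading of the printed formula.

```python
import math


def bm25_weight(tf, df, n_docs, length, avg_length, k=1.2, b=0.75):
    """Weight of one term in one doc; k and b defaults are assumptions.

    tf: count of the term in the doc        df: number of docs containing the term
    n_docs: number of docs in the corpus    length: number of terms in the doc
    avg_length: mean doc length over the corpus
    """
    if tf == 0:
        return 0.0
    # local scaling with doc-length normalization: saturates as tf grows
    tf_part = (tf * (k + 1)) / (tf + k * (1 - b + b * (length / avg_length)))
    # global scaling: rarer terms (small df) get a larger factor
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part


average_doc = bm25_weight(tf=2, df=1, n_docs=10, length=100, avg_length=100)
long_doc = bm25_weight(tf=2, df=1, n_docs=10, length=300, avg_length=100)
```

With everything else equal, the same term count is worth less in the longer document, which is the point of the doc-length normalization.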
- Miscellaneous additions and improvements
  - Flesch Reading Ease in the textstats module is now multi-lingual!
    Language-specific formulations for German, Spanish, French, Italian, Dutch,
    and Russian were added, in addition to (the default) English.
    (PR #158, prompted by Issue #155)
  - Runtime performance, as well as docs and error messages, of functions for
    generating semantic networks from lists of terms or sentences were improved. (PR #163)
  - Labels on named entities from which determiners have been dropped are now
    preserved. There's still a minor gotcha, but it's explained in the docs.
  - The size of textacy's data cache can now be set via an environment
    variable, TEXTACY_MAX_CACHE_SIZE, in case the default 2GB cache doesn't
    meet your needs.
  - Docstrings were improved in many ways, large and small, throughout the code.
    May they guide you even more effectively than before!
  - The package version is now set from a single source. This isn't for you so
    much as me, but it does prevent confusing version mismatches b/w code, pypi,
    and docs.
  - All tests have been converted from unittest to pytest style. They
    run faster, they're more informative in failure, and they're easier to extend.
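As an illustration of what a language-specific formulation of Flesch Reading Ease looks like, here is a small sketch with the widely cited English coefficients and Amstad's German adaptation. The function name, the coefficient table, and the choice to pass in precomputed counts are all assumptions for the example; textacy's own implementation and coefficients may differ, and the other languages are left out rather than guessed.

```python
# (base, weight on words-per-sentence, weight on syllables-per-word)
FLESCH_COEFFICIENTS = {
    "en": (206.835, 1.015, 84.6),  # Flesch's original English formula
    "de": (180.0, 1.0, 58.5),      # Amstad's German adaptation
}


def flesch_reading_ease(n_words, n_sents, n_syllables, lang="en"):
    """Hypothetical helper: higher scores mean easier text."""
    try:
        base, sent_weight, syll_weight = FLESCH_COEFFICIENTS[lang]
    except KeyError:
        raise ValueError("no coefficients for lang={!r}".format(lang))
    words_per_sent = n_words / n_sents
    sylls_per_word = n_syllables / n_words
    return base - sent_weight * words_per_sent - syll_weight * sylls_per_word


english_score = flesch_reading_ease(100, 5, 150, lang="en")
german_score = flesch_reading_ease(100, 5, 150, lang="de")
```

The same counts give different scores per language because each formulation rescales for that language's typical word and sentence lengths.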
Bugfixes:
- Fixed an issue where existing metadata associated with a spacy Doc was being
  overwritten with an empty dict when using it to initialize a textacy Doc.
  Users can still overwrite existing metadata, but only if they pass in new data.
- Added a missing import to the README's usage example. (#149)
- The intersphinx mapping to numpy got fixed (and items for scipy and
  matplotlib were added, too). Taking advantage of that, a bunch of broken
  object links scattered throughout the docs got fixed.
- Fixed broken formatting of old entries in the changelog, for your reading pleasure.
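The metadata fix boils down to treating "no metadata passed" differently from "empty metadata passed". A minimal sketch of that pattern follows; init_metadata and the plain dict standing in for a doc's stored data are hypothetical and do not reflect how textacy actually stores metadata.

```python
def init_metadata(user_data, metadata=None):
    """Keep metadata already stored in user_data unless new metadata is passed in."""
    if metadata is not None:
        # caller explicitly supplied new data, so overwriting is intended
        user_data["metadata"] = metadata
    else:
        # nothing passed: keep what's there, or start with an empty dict
        user_data.setdefault("metadata", {})
    return user_data["metadata"]


existing = {"metadata": {"author": "someone"}}
kept = init_metadata(existing)
replaced = init_metadata({"metadata": {"author": "someone"}}, {"year": 2017})
fresh = init_metadata({})
```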