Refactor for Consistency, Convenience, and Simplicity
After several months of somewhat organic development, textacy had acquired some rough edges in the API and inconsistencies throughout the code base. This release breaks the existing API in a few (mostly minor) ways for the sake of consistency, user convenience, and simplicity. It also adds some new functionality and enhances existing functionality for a better overall experience and, I hope, more straightforward development moving forward.
Changes:
- Refactored and streamlined `TextDoc`; changed name to `Doc`
  - simplified init params: `lang` can now be a language code string or an equivalent `spacy.Language` object, and `content` is either a string or a `spacy.Doc`; param values and their interactions are better checked for errors and inconsistencies
  - renamed and improved methods for transforming the Doc; for example, `.as_bag_of_terms()` is now `.to_bag_of_terms()`, and terms can be returned as integer ids (default) or as strings, with absolute, relative, or binary frequencies as weights
  - added a performant `.to_bag_of_words()` method, at the cost of less customizability over what gets included in the bag (no stopwords or punctuation); words can be returned as integer ids (default) or as strings, with absolute, relative, or binary frequencies as weights
  - removed methods wrapping `extract` functions, in favor of simply calling those functions on the Doc (see below for updates to the `extract` functions that make this more convenient); for example, `TextDoc.words()` is now `extract.words(Doc)`
  - removed the `.term_counts()` method, which was redundant with `Doc.to_bag_of_terms()`
  - renamed `.term_count()` => `.count()`, and checking + caching results is now smarter and faster
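The weighting options mentioned above (absolute, relative, or binary frequencies) can be illustrated with a small plain-Python sketch. The `bag_of_terms` function below is a hypothetical stand-in for the idea, not textacy's implementation, and the weighting names are chosen for illustration only:

```python
from collections import Counter

def bag_of_terms(terms, weighting="count"):
    """Map each term to a weight: raw count ("count"), relative
    frequency ("freq"), or presence/absence ("binary")."""
    counts = Counter(terms)
    if weighting == "count":
        return dict(counts)
    if weighting == "freq":
        total = sum(counts.values())
        return {term: n / total for term, n in counts.items()}
    if weighting == "binary":
        return {term: 1 for term in counts}
    raise ValueError("weighting must be 'count', 'freq', or 'binary'")

terms = ["cat", "dog", "cat", "bird"]
bag_of_terms(terms)                    # {"cat": 2, "dog": 1, "bird": 1}
bag_of_terms(terms, weighting="freq")  # {"cat": 0.5, "dog": 0.25, "bird": 0.25}
```

The same three weighting schemes apply whether the keys are strings (as here) or integer token ids.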
- Refactored and streamlined `TextCorpus`; changed name to `Corpus`
  - added init params: can now initialize a `Corpus` with a stream of texts, spacy or textacy Docs, and optional metadatas, analogous to `Doc`; accordingly, removed the `.from_texts()` class method
  - refactored, streamlined, bug-fixed, and made consistent the process of adding, getting, and removing documents from a `Corpus`
    - getting/removing by index is now equivalent to the built-in `list` API: `Corpus[:5]` gets the first 5 `Doc`s, and `del Corpus[:5]` removes the first 5, automatically keeping track of corpus statistics for total # docs, sents, and tokens
    - getting/removing by boolean function is now done via the `.get()` and `.remove()` methods, the latter of which now also correctly tracks corpus stats
    - adding documents is split across the `.add_text()`, `.add_texts()`, and `.add_doc()` methods, for performance and clarity reasons
  - added `.word_freqs()` and `.word_doc_freqs()` methods for getting a mapping of word (int id or string) to global weight (absolute, relative, binary, or inverse frequency); akin to a vectorized representation (see: `textacy.vsm`) but in non-vectorized form, which can be useful
  - removed the `.as_doc_term_matrix()` method, which was just wrapping another function; so, instead of `corpus.as_doc_term_matrix((doc.as_terms_list() for doc in corpus))`, do `textacy.vsm.doc_term_matrix((doc.to_terms_list(as_strings=True) for doc in corpus))`
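The list-like get/delete semantics with automatic stat tracking can be sketched in a few lines of plain Python. `MiniCorpus` is a toy stand-in to show the idea, not textacy's `Corpus`:

```python
class MiniCorpus:
    """Toy stand-in showing list-like get/del semantics with
    automatically updated corpus statistics (not textacy's Corpus)."""

    def __init__(self, docs):
        self.docs = list(docs)  # each doc: a list of token strings
        self.n_docs = len(self.docs)
        self.n_tokens = sum(len(doc) for doc in self.docs)

    def __getitem__(self, idx):
        return self.docs[idx]  # int or slice, just like a list

    def __delitem__(self, idx):
        removed = self.docs[idx] if isinstance(idx, slice) else [self.docs[idx]]
        del self.docs[idx]
        # keep the corpus-level stats in sync with the removal
        self.n_docs -= len(removed)
        self.n_tokens -= sum(len(doc) for doc in removed)

corpus = MiniCorpus([["a", "b"], ["c"], ["d", "e", "f"]])
del corpus[:2]  # corpus.n_docs == 1, corpus.n_tokens == 3
```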
- Updated several `extract` functions
  - almost all now accept either a `textacy.Doc` or a `spacy.Doc` as input
  - renamed and improved parameters for filtering for or against certain POS or NE types; for example, `good_pos_tags` is now `include_pos`, which accepts either a single POS tag as a string or a set of POS tags to filter for; the same goes for `exclude_pos`, and analogously for `include_types` and `exclude_types`
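The "string or set of strings" parameter convention can be sketched as follows. `filter_by_pos` is a hypothetical helper that mirrors the convention described above; it is not textacy's code:

```python
def filter_by_pos(tokens, include_pos=None, exclude_pos=None):
    """Filter (word, pos) pairs; include_pos / exclude_pos may each be
    a single POS tag string or a set of tags (hypothetical helper
    mirroring the parameter convention, not textacy's implementation)."""
    # normalize a bare string into a one-element set
    if isinstance(include_pos, str):
        include_pos = {include_pos}
    if isinstance(exclude_pos, str):
        exclude_pos = {exclude_pos}
    return [
        word for word, pos in tokens
        if (include_pos is None or pos in include_pos)
        and (exclude_pos is None or pos not in exclude_pos)
    ]

tokens = [("cats", "NOUN"), ("purr", "VERB"), ("softly", "ADV")]
filter_by_pos(tokens, include_pos="NOUN")            # ["cats"]
filter_by_pos(tokens, include_pos={"NOUN", "VERB"})  # ["cats", "purr"]
```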
- Updated the corpora classes for consistency and added flexibility
  - enforced a consistent API: `.texts()` for a stream of plain-text documents and `.records()` for a stream of dicts containing both text and metadata
  - added filtering options for `RedditReader`, e.g. by date or subreddit, consistent with the other corpora (similar tweaks to `WikiReader` may come later, but it's slightly more complicated...)
  - added a nicer `repr` for the `RedditReader` and `WikiReader` corpora, consistent with the other corpora
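The `.texts()` / `.records()` convention can be illustrated with a toy reader. `MiniReader` is a minimal sketch of that two-method interface, not `RedditReader` or `WikiReader`:

```python
class MiniReader:
    """Toy corpus reader illustrating the .texts() / .records()
    convention (hypothetical; not one of textacy's corpora classes)."""

    def __init__(self, records):
        self._records = records  # list of dicts: {"text": ..., plus metadata}

    def texts(self):
        for record in self._records:  # stream of plain-text documents
            yield record["text"]

    def records(self):
        for record in self._records:  # stream of dicts with text + metadata
            yield record

reader = MiniReader([{"text": "hi", "author": "a"},
                     {"text": "yo", "author": "b"}])
list(reader.texts())  # ["hi", "yo"]
```

Both methods are generators, so a large corpus can be streamed from disk without loading everything into memory at once.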
- Moved `vsm.py` and `network.py` into the top level of `textacy`, and thus removed the `representations` subpackage
  - renamed `vsm.build_doc_term_matrix()` => `vsm.doc_term_matrix()`, because the "build" part of it is obvious
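The idea behind a doc-term matrix function can be sketched in plain Python: take an iterable of term lists (one per document) and produce a matrix of per-document term counts plus a term-to-column mapping. This is an illustrative sketch of the concept, not `vsm.doc_term_matrix()` itself, which returns a sparse matrix:

```python
def doc_term_matrix(terms_lists):
    """Build a dense doc-term count matrix and a term -> column-id
    mapping from an iterable of term lists (conceptual sketch only)."""
    terms_lists = [list(terms) for terms in terms_lists]
    vocab = {}
    for terms in terms_lists:
        for term in terms:
            vocab.setdefault(term, len(vocab))  # assign ids in first-seen order
    matrix = [[0] * len(vocab) for _ in terms_lists]
    for row, terms in zip(matrix, terms_lists):
        for term in terms:
            row[vocab[term]] += 1
    return matrix, vocab

matrix, vocab = doc_term_matrix([["a", "b", "a"], ["b", "c"]])
# matrix == [[2, 1, 0], [0, 1, 1]]; vocab == {"a": 0, "b": 1, "c": 2}
```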
- Renamed `distance.py` => `similarity.py`; all returned values are now similarity metrics in the interval [0, 1], where higher values indicate higher similarity
- Renamed `regexes_etc.py` => `constants.py`, without additional changes
- Renamed `fileio.utils.split_content_and_metadata()` => `fileio.utils.split_record_fields()`, without further changes (except for tweaks to the docstring)
- Added functions to read and write delimited file formats: `fileio.read_csv()` and `fileio.write_csv()`, where the delimiter can be any valid one-character string; gzip/bzip/lzma compression is handled automatically when available
- Added better and more consistent docstrings and usage examples throughout the code base