Release Refactor for Consistency, Convenience, and Simplicity · chartbeat-labs/textacy

After several months of somewhat organic development, textacy had acquired some rough edges in the API and inconsistencies throughout the code base. This release breaks the existing API in a few (mostly minor) ways for the sake of consistency, user convenience, and simplicity. It also adds some new functionality and enhances existing functionality for a better overall experience and, I hope, more straightforward development moving forward.

Changes:

Refactored and streamlined TextDoc; changed name to Doc
- simplified init params: lang can now be a language code string or an equivalent spacy.Language object, and content is either a string or spacy.Doc; param values and their interactions are better checked for errors and inconsistencies
- renamed and improved methods transforming the Doc; for example, .as_bag_of_terms() is now .to_bag_of_terms(), and terms can be returned as integer ids (default) or as strings with absolute, relative, or binary frequencies as weights
- added performant .to_bag_of_words() method, at the cost of less customizability of what gets included in the bag (no stopwords or punctuation); words can be returned as integer ids (default) or as strings with absolute, relative, or binary frequencies as weights
- removed methods wrapping extract functions, in favor of simply calling that function on the Doc (see below for updates to extract functions to make this more convenient); for example, TextDoc.words() is now extract.words(Doc)
- removed .term_counts() method, which was redundant with Doc.to_bag_of_terms()
- renamed .term_count() => .count(), and checking + caching results is now smarter and faster
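
To illustrate the three weighting schemes mentioned for .to_bag_of_terms() and .to_bag_of_words(), here is a minimal pure-Python sketch. It does not use textacy: the function name `bag_of_words` and the `weighting` values "count", "freq", and "binary" are placeholders chosen for this example and should not be read as the library's actual signature.

```python
from collections import Counter


def bag_of_words(tokens, weighting="count"):
    """Map each token to a weight (hypothetical helper, not textacy's API).

    weighting: "count" (absolute), "freq" (relative), or "binary".
    """
    counts = Counter(tokens)
    if weighting == "count":
        # absolute frequency: raw number of occurrences
        return dict(counts)
    if weighting == "freq":
        # relative frequency: occurrences divided by total token count
        total = sum(counts.values())
        return {tok: n / total for tok, n in counts.items()}
    if weighting == "binary":
        # binary: 1 for every token that occurs at least once
        return {tok: 1 for tok in counts}
    raise ValueError("unknown weighting: {!r}".format(weighting))


tokens = ["the", "cat", "sat", "on", "the", "mat"]
bow_count = bag_of_words(tokens, weighting="count")
bow_freq = bag_of_words(tokens, weighting="freq")
bow_binary = bag_of_words(tokens, weighting="binary")
```

The real methods can also return integer ids instead of strings as keys; this sketch keeps strings throughout for readability.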
Refactored and streamlined TextCorpus; changed name to Corpus
- added init params: can now initialize a Corpus with a stream of texts, spacy or textacy Docs, and optional metadatas, analogous to Doc; accordingly, removed .from_texts() class method
- refactored, streamlined, bug-fixed, and made consistent the process of adding, getting, and removing documents from Corpus
  - getting/removing by index is now equivalent to the built-in list API: Corpus[:5] gets the first 5 Docs, and del Corpus[:5] removes the first 5, automatically updating the corpus statistics for the total number of docs, sentences, and tokens
  - getting/removing by boolean function is now done via the .get() and .remove() methods, the latter of which now also correctly tracks corpus stats
  - adding documents is split across the .add_text(), .add_texts(), and .add_doc() methods for performance and clarity reasons
- added .word_freqs() and .word_doc_freqs() methods for getting a mapping of word (int id or string) to global weight (absolute, relative, binary, or inverse frequency); akin to a vectorized representation (see: textacy.vsm) but in non-vectorized form, which can be useful
- removed .as_doc_term_matrix() method, which was just wrapping another function; so, instead of corpus.as_doc_term_matrix((doc.as_terms_list() for doc in corpus)), do textacy.vsm.doc_term_matrix((doc.to_terms_list(as_strings=True) for doc in corpus))
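
The list-like indexing and stats tracking described above can be sketched with a toy container. This is not textacy's Corpus: the class name `MiniCorpus`, its attributes, and the use of plain token lists as "documents" are assumptions made for the example.

```python
class MiniCorpus:
    """Toy list-like container that keeps aggregate stats in sync.

    Hypothetical stand-in for illustration; documents are lists of tokens.
    """

    def __init__(self, docs=()):
        self.docs = []
        self.n_docs = 0
        self.n_tokens = 0
        for doc in docs:
            self.add_doc(doc)

    def add_doc(self, doc):
        self.docs.append(doc)
        self.n_docs += 1
        self.n_tokens += len(doc)

    def __len__(self):
        return self.n_docs

    def __getitem__(self, idx):
        # ints and slices both behave as they do for a built-in list
        return self.docs[idx]

    def __delitem__(self, idx):
        removed = self.docs[idx]
        if not isinstance(idx, slice):
            removed = [removed]
        # update the stats before dropping the documents themselves
        for doc in removed:
            self.n_docs -= 1
            self.n_tokens -= len(doc)
        del self.docs[idx]

    def get(self, match):
        """Yield every document for which match(doc) is true."""
        return (doc for doc in self.docs if match(doc))

    def remove(self, match):
        """Delete every document for which match(doc) is true."""
        # iterate backwards so deletions do not shift pending indexes
        for i in reversed(range(len(self.docs))):
            if match(self.docs[i]):
                del self[i]


corpus = MiniCorpus([["a", "b"], ["c"], ["d", "e", "f"], ["g"]])
first_two = corpus[:2]          # slice access, like a list
del corpus[:2]                  # slice deletion updates the stats
stats_after_del = (corpus.n_docs, corpus.n_tokens)
corpus.remove(lambda doc: len(doc) == 1)   # removal by boolean function
stats_after_remove = (corpus.n_docs, corpus.n_tokens)
```

Routing .remove() through `__delitem__` means there is only one place where the statistics are decremented, which is one way to avoid the kind of stats drift the release notes mention fixing.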
Updated several extract functions
- almost all now accept either a textacy.Doc or spacy.Doc as input
- renamed and improved parameters for filtering for or against certain POS or NE types; for example, good_pos_tags is now include_pos, and will accept either a single POS tag as a string or a set of POS tags to filter for; the same goes for exclude_pos, and analogously for include_types and exclude_types
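
A parameter that accepts "a single tag as a string or a set of tags" is usually handled by normalizing the value up front. The sketch below shows that pattern on plain (word, POS) tuples. The helper names `_as_tag_set` and `filter_words` are hypothetical, and the code does not call textacy or spaCy.

```python
def _as_tag_set(tags):
    """Normalize None, a single tag string, or an iterable of tags to a set."""
    if tags is None:
        return None
    if isinstance(tags, str):
        # a bare string is one tag, not an iterable of characters
        return {tags}
    return set(tags)


def filter_words(tagged_words, include_pos=None, exclude_pos=None):
    """Yield words whose POS tag passes the include/exclude filters.

    tagged_words: iterable of (word, pos) pairs (an assumption for this sketch).
    """
    include = _as_tag_set(include_pos)
    exclude = _as_tag_set(exclude_pos)
    for word, pos in tagged_words:
        if include is not None and pos not in include:
            continue
        if exclude is not None and pos in exclude:
            continue
        yield word


tagged = [("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"), (".", "PUNCT")]
only_nouns = list(filter_words(tagged, include_pos="NOUN"))
nouns_and_verbs = list(filter_words(tagged, include_pos={"NOUN", "VERB"}))
no_punct = list(filter_words(tagged, exclude_pos="PUNCT"))
```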
Updated corpora classes for consistency and added flexibility
- enforced a consistent API: .texts() for a stream of plain text documents and .records() for a stream of dicts containing both text and metadata
- added filtering options for RedditReader, e.g. by date or subreddit, consistent with other corpora (similar tweaks to WikiReader may come later, but it's slightly more complicated...)
- added a nicer repr for RedditReader and WikiReader corpora, consistent with other corpora
Moved vsm.py and network.py into the top-level of textacy and thus removed the representations subpackage
- renamed vsm.build_doc_term_matrix() => vsm.doc_term_matrix(), because the "build" part of it is obvious
Renamed distance.py => similarity.py; all returned values are now similarity metrics in the interval [0, 1], where higher values indicate higher similarity
Renamed regexes_etc.py => constants.py, without additional changes
Renamed fileio.utils.split_content_and_metadata() => fileio.utils.split_record_fields(), without further changes (except for tweaks to the docstring)
Added functions to read and write delimited file formats: fileio.read_csv() and fileio.write_csv(), where the delimiter can be any valid one-char string; gzip/bzip/lzma compression is handled automatically when available
Added better and more consistent docstrings and usage examples throughout the code base
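
For readers curious how delimited-file I/O with automatic compression (as described for fileio.read_csv() and fileio.write_csv()) can work, here is a standard-library sketch. The function names `write_rows` and `read_rows`, and the choice to pick the codec from the file extension, are assumptions for this example; textacy's own implementation and signatures may differ.

```python
import bz2
import csv
import gzip
import lzma
import os
import tempfile

# map file extension -> opener; anything else falls back to plain open()
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def _open_text(path, mode):
    """Open path in text mode, choosing a codec from its extension."""
    ext = os.path.splitext(path)[1]
    opener = _OPENERS.get(ext, open)
    # newline="" is what the csv module expects for its file objects
    return opener(path, mode + "t", encoding="utf-8", newline="")


def write_rows(path, rows, delimiter=","):
    """Write an iterable of rows to a (possibly compressed) delimited file."""
    with _open_text(path, "w") as f:
        csv.writer(f, delimiter=delimiter).writerows(rows)


def read_rows(path, delimiter=","):
    """Read all rows from a (possibly compressed) delimited file."""
    with _open_text(path, "r") as f:
        return list(csv.reader(f, delimiter=delimiter))


rows = [["id", "text"], ["1", "hello"], ["2", "world"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docs.tsv.gz")   # tab-delimited, gzip-compressed
    write_rows(path, rows, delimiter="\t")
    round_tripped = read_rows(path, delimiter="\t")
```

Note that csv readers return every field as a string, so numeric columns need converting after reading.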
