Release VSM and Topic Modeling · chartbeat-labs/textacy

Changes:

Added representations subpackage; includes modules for network and vector space model (VSM) document and corpus representations
- Document-term matrix creation now takes documents represented as a list of terms (rather than as spaCy Docs); splits the tokenization step from vectorization for added flexibility
- Some of this functionality was refactored from existing parts of the package
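To illustrate the split between tokenization and vectorization, here is a minimal pure-Python sketch of building a document-term count matrix from documents that are already lists of terms. The function name `build_doc_term_matrix` and its return shape are assumptions for illustration only, not textacy's actual API, and a real implementation would likely return a sparse matrix rather than nested lists.

```python
from collections import Counter


def build_doc_term_matrix(terms_lists):
    """Build a dense document-term count matrix from pre-tokenized documents.

    Hypothetical helper for illustration; not part of textacy.
    """
    # Assign each distinct term a column index, in sorted order
    all_terms = sorted({term for terms in terms_lists for term in terms})
    vocab = {term: idx for idx, term in enumerate(all_terms)}

    matrix = []
    for terms in terms_lists:
        row = [0] * len(vocab)
        # Count occurrences of each term within this one document
        for term, count in Counter(terms).items():
            row[vocab[term]] = count
        matrix.append(row)
    return matrix, vocab


# Tokenization happens upstream; the vectorizer only ever sees lists of terms
docs_as_terms = [
    ["topic", "model", "topic"],
    ["vector", "space", "model"],
]
doc_term_matrix, vocabulary = build_doc_term_matrix(docs_as_terms)
```

Because the input is just a list of term lists, the caller is free to choose any tokenization (words, n-grams, named entities) before vectorizing.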
Added tm (topic modeling) subpackage, with a main TopicModel class for training, applying, persisting, and interpreting NMF, LDA, and LSA topic models through a single interface
Various improvements to TextDoc and TextCorpus classes
- TextDoc can now be initialized from a spaCy Doc
- Removed caching from TextDoc, because it was a pain and weird and probably not all that useful
- extract-based methods are now generators, like the functions they wrap
- Added .as_semantic_network() and .as_terms_list() methods to TextDoc
- TextCorpus.from_texts() now takes advantage of multithreading via spaCy, if available, and document metadata can be passed in as a paired iterable of dicts
Added read/write functions for sparse scipy matrices
Added fileio.read.split_content_and_metadata() convenience function for splitting (text) content from associated metadata when reading data from disk into a TextDoc or TextCorpus
Renamed fileio.read.get_filenames_in_dir() to fileio.read.get_filenames() and added functionality for matching or ignoring files by name or file extension, and for ignoring invisible files
Rewrote export.docs_to_gensim(), now significantly faster
Imports in __init__.py files for main and subpackages now explicit
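The file filtering described for fileio.read.get_filenames() can be sketched with the standard library alone. The function name `list_filenames` and every parameter below are hypothetical and chosen for illustration; they are not taken from textacy's actual signature.

```python
import os
import re


def list_filenames(dirname, match_regex=None, ignore_regex=None,
                   extension=None, ignore_invisible=True):
    """Yield paths of files in ``dirname`` that pass the given filters.

    Hypothetical helper for illustration; not part of textacy.
    """
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        if ignore_invisible and name.startswith("."):
            continue  # skip dotfiles such as .DS_Store
        if match_regex and not re.search(match_regex, name):
            continue  # name must match this pattern
        if ignore_regex and re.search(ignore_regex, name):
            continue  # name must not match this pattern
        if extension and os.path.splitext(name)[1] != extension:
            continue  # extension must match exactly, e.g. ".txt"
        yield path
```

A call such as `list_filenames("data", extension=".txt", ignore_regex=r"^draft")` would yield only visible `.txt` files whose names do not start with "draft".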

Bugfixes:

textstats.readability_stats() no longer filters out stop words (@henningko #7)
Wikipedia article processing now recursively removes nested markup
extract.ngrams() now filters out ngrams with any space-only tokens
Renamed the include_nps kwarg to include_ncs in all functions that take it, to match the renaming of the associated function from extract.noun_phrases() to extract.noun_chunks()
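As a rough illustration of the extract.ngrams() fix, the sketch below drops any n-gram containing a whitespace-only token. It works on plain strings rather than spaCy tokens, and the function name `ngrams_without_space_tokens` is an assumption for this example, not textacy's implementation.

```python
def ngrams_without_space_tokens(tokens, n):
    """Yield n-grams as tuples, skipping any with a whitespace-only token.

    Hypothetical helper for illustration; not part of textacy.
    """
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # str.isspace() is True only for non-empty, all-whitespace strings
        if any(tok.isspace() for tok in gram):
            continue
        yield gram


tokens = ["topic", "models", " ", "are", "fun"]
bigrams = list(ngrams_without_space_tokens(tokens, 2))
```

Here the two bigrams that include the lone space token are discarded, leaving only those made of real words.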
