VSM and Topic Modeling
Changes:
- Added `representations` subpackage; includes modules for network and vector space model (VSM) document and corpus representations
    - Document-term matrix creation now takes documents represented as a list of terms (rather than as spaCy Docs); splits the tokenization step from vectorization for added flexibility
    - Some of this functionality was refactored from existing parts of the package
- Added `tm` (topic modeling) subpackage, with a main `TopicModel` class for training, applying, persisting, and interpreting NMF, LDA, and LSA topic models through a single interface
- Various improvements to `TextDoc` and `TextCorpus` classes
    - `TextDoc` can now be initialized from a spaCy Doc
    - Removed caching from `TextDoc`, because it was a pain and weird and probably not all that useful
    - `extract`-based methods are now generators, like the functions they wrap
    - Added `.as_semantic_network()` and `.as_terms_list()` methods to `TextDoc`
    - `TextCorpus.from_texts()` now takes advantage of multithreading via spaCy, if available, and document metadata can be passed in as a paired iterable of dicts
- Added read/write functions for sparse scipy matrices
- Added `fileio.read.split_content_and_metadata()` convenience function for splitting (text) content from associated metadata when reading data from disk into a `TextDoc` or `TextCorpus`
- Renamed `fileio.read.get_filenames_in_dir()` to `fileio.read.get_filenames()` and added functionality for matching/ignoring files by their names and file extensions, and for ignoring invisible files
- Rewrote `export.docs_to_gensim()`, now significantly faster
- Imports in `__init__.py` files for main and subpackages now explicit
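
The document-term matrix change listed above (term lists in, rather than spaCy Docs) separates tokenization from vectorization. The following is a minimal sketch of that idea in plain Python, not the package's actual implementation: `build_doc_term_matrix` is a hypothetical helper, and the whitespace split stands in for whatever tokenizer produces the term lists.

```python
from collections import Counter


def build_doc_term_matrix(terms_lists):
    """Build a sparse document-term count matrix from pre-tokenized documents.

    Hypothetical helper for illustration only; not part of the package's API.
    Returns (rows, vocabulary): rows[i] maps column index -> term count for
    document i, and vocabulary maps term -> column index.
    """
    vocabulary = {}
    rows = []
    for terms in terms_lists:
        row = {}
        for term, count in Counter(terms).items():
            # Assign column indices in order of first appearance.
            col = vocabulary.setdefault(term, len(vocabulary))
            row[col] = count
        rows.append(row)
    return rows, vocabulary


# Tokenization is a separate step; a plain whitespace split is assumed here,
# but any tokenizer could produce the term lists.
texts = ["the cat sat", "the dog sat on the cat"]
terms_lists = [text.split() for text in texts]
rows, vocabulary = build_doc_term_matrix(terms_lists)
```

Because the vectorizer only sees lists of terms, the same function works whether the terms are raw tokens, lemmas, or n-grams.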
Bugfixes:
- `textstats.readability_stats()` no longer filters out stop words (@henningko #7)
- Wikipedia article processing now recursively removes nested markup
- `extract.ngrams()` now filters out ngrams with any space-only tokens
- Functions with `include_nps` kwarg changed to `include_ncs`, to match the renaming of the associated function from `extract.noun_phrases()` to `extract.noun_chunks()`
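
To illustrate the n-gram fix listed above, here is a minimal sketch of a generator that skips any n-gram containing a space-only token. It is a hypothetical stand-in written in plain Python over string tokens; the real `extract.ngrams()` operates on spaCy objects and its signature is not reproduced here.

```python
def ngrams(tokens, n):
    """Yield n-grams (as tuples) from a list of string tokens, skipping any
    n-gram that contains a whitespace-only token.

    Hypothetical stand-in for illustration; not the package's extract.ngrams().
    """
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # Drop the whole n-gram if any of its tokens is only whitespace.
        if any(tok.isspace() for tok in gram):
            continue
        yield gram


tokens = ["new", "york", " ", "city", "hall"]
bigrams = list(ngrams(tokens, 2))
```

Here the two bigrams that include the `" "` token are dropped, leaving only `("new", "york")` and `("city", "hall")`.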