Corpora Readers, Better Examples, and Fewer Bugs

@bdewilde released this 20 Jun 16:54

Changes:

  • Added corpora.RedditReader() class for streaming Reddit comments from disk. Its .texts() method yields a stream of plaintext comments, and its .comments() method yields a stream of structured comments as dicts. The reader supports basic filtering by text length and limiting the number of comments returned
  • Refactored the functions for streaming Wikipedia articles from disk into a corpora.WikiReader() class. Its .texts() method yields a stream of plaintext articles, and its .pages() method yields a stream of structured pages as dicts. The reader supports basic filtering by text length and limiting the number of pages returned
  • Updated README and docs with a more comprehensive — and correct — usage example; also added tests to ensure it doesn't get stale
  • Updated requirements to the latest version of spaCy, and added matplotlib for viz
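
The reader pattern described above (stream records from disk, optionally filter by text length, optionally cap the count) can be sketched roughly as follows. This is not textacy's implementation: the `JsonLinesReader` class, its `records()` method, the `min_len` and `limit` parameter names, and the `body` field are all hypothetical, chosen only to illustrate the idea on a small JSON-lines file.

```python
import json
import os
import tempfile
from itertools import islice


class JsonLinesReader:
    """Hypothetical stand-in for a corpus reader; NOT textacy's actual class."""

    def __init__(self, path):
        self.path = path

    def _filtered(self, min_len):
        # Read lazily, one line at a time, so the whole file never sits in memory.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if min_len is None or len(record["body"]) >= min_len:
                    yield record

    def records(self, min_len=None, limit=None):
        """Stream structured records as dicts (assumed parameter names)."""
        # islice with stop=None passes every item through unchanged.
        return islice(self._filtered(min_len), limit)

    def texts(self, min_len=None, limit=None):
        """Stream only the plaintext of each record."""
        for record in self.records(min_len=min_len, limit=limit):
            yield record["body"]


# Build a tiny sample file so the sketch runs end to end.
sample = [
    {"author": "a", "body": "short"},
    {"author": "b", "body": "a somewhat longer comment"},
    {"author": "c", "body": "another comment of decent length"},
    {"author": "d", "body": "yet another long enough comment"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "comments.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

reader = JsonLinesReader(sample_path)
long_texts = list(reader.texts(min_len=10, limit=2))
all_records = list(reader.records())
```

Applying the length filter before the limit means `limit` counts only records that pass the filter, which is one reasonable reading of "filtering by text length and limiting the number of comments returned"; the real classes may order these differently.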

Bugfixes:

  • textacy.preprocess.preprocess_text() is once again imported at the top level, so it is easily reachable via textacy.preprocess_text() (@bretdabaker #14)
  • viz subpackage now included in the docs' API reference
  • missing dependencies added into setup.py so pip install handles everything for folks