jukujala/wiki_word_count

* What?
  Calculates word counts over Wikipedia text, using the output of https://github.com/jukujala/wiki_markup_to_text as input

* Input
  Wiki-text format: one Wikipedia article per line, in the form "title<TAB>string-escaped content"
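
To make the input format concrete, here is a minimal Python 3 sketch of how one input line could be split and unescaped. It assumes "string-escaped" means Python 2 `string_escape`-style escaping (for example `\n` for newlines and `\xNN` for non-ASCII bytes of UTF-8 text); the README does not confirm this. The function name `parse_article_line` is hypothetical and is not part of this repository.

```python
def parse_article_line(line):
    """Split one input line into (title, content) and undo the escaping.

    Assumption: the content was escaped in the style of Python 2's
    'string_escape' codec, so the line is pure ASCII and non-ASCII
    characters appear as \\xNN escapes of their UTF-8 bytes.
    """
    # The title is everything before the first tab.
    title, _, escaped = line.rstrip("\n").partition("\t")
    # Turn backslash escapes back into raw byte values (one code point
    # per byte), then reinterpret those bytes as UTF-8 text.
    raw = escaped.encode("ascii").decode("unicode_escape").encode("latin-1")
    return title, raw.decode("utf-8")
```

For example, `parse_article_line("Cafe\tcaf\\xc3\\xa9\\nmenu")` would give the title `"Cafe"` and a two-line content string starting with `"café"`.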

* Output
  Pickled dictionary mapping words to occurrence counts
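
As a sketch of how the output might be consumed, the following loads the pickled dictionary and lists the most frequent words. The helper names `load_word_counts` and `top_words` are hypothetical, and the code assumes the pickle holds a plain `dict` of word to integer count.

```python
import pickle
from collections import Counter


def load_word_counts(path):
    """Load the pickled word -> count dictionary from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest first."""
    return Counter(counts).most_common(n)
```

A call such as `top_words(load_word_counts("word_tokens.pickle"), 20)` would then return the 20 most frequent words. If the pickle was written by Python 2 and is read under Python 3, passing `encoding="latin1"` to `pickle.load` may be necessary. Only unpickle files you created yourself, since unpickling can execute arbitrary code.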

* Usage
  cat corpus.txt | python decode_strings.py | python build_word_token_dict.py word_tokens.pickle
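
The counting stage of the pipeline above can be illustrated roughly as follows. This is not the code of `build_word_token_dict.py`; it is a minimal sketch that assumes words are lowercased runs of word characters, which may differ from the tokenization the script actually uses. The names `TOKEN_RE` and `count_words` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of letters, digits, or underscores.
TOKEN_RE = re.compile(r"\w+")


def count_words(texts):
    """Count lowercased word tokens across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts
```

Calling `dict(count_words(sys.stdin))` and passing the result to `pickle.dump` would produce a dictionary of the same shape as the output described above.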
