This repository contains a collection of small Ruby scripts for working with text corpora. While there are many sophisticated corpus analysis tools already available, the purpose of these scripts is to provide lightweight, "quick and dirty" analysis of unconventional and not necessarily optimized collections of text in different languages.
The tools grew out of challenges encountered while building stoplists for a large number of African languages, but they have since been used for a variety of other purposes, including providing data and resources for language revitalization.
- ruby
- The `unicode_utils` gem (to convert non-Latin text to lowercase, since `downcase` in the core library cannot handle this; see the sketch below)
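As a quick illustration of why the gem is needed (a minimal sketch; `UnicodeUtils.downcase` is the gem's documented method, while the behavior of core `downcase` depends on your Ruby version):

```ruby
require "unicode_utils/downcase"

word = "ΛΌΓΟΣ" # Greek, uppercase

# Core String#downcase historically left non-Latin characters
# untouched, which breaks case-insensitive frequency counting.
puts word.downcase                 # "ΛΌΓΟΣ" on older Rubies
puts UnicodeUtils.downcase(word)   # "λόγος"
```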
The following scripts can be found in the `scripts/` folder.
This script will print a frequency list to standard output for all texts in a given folder, all words in a file, or all text in a pipe. It has been tailored for use with the texts from the ASP corpus.
```
ruby corpus_freq.rb [path/to/directory]
```

or

```
ruby corpus_freq.rb filename.txt
```

or

```
cat sometext.txt | ruby corpus_freq.rb
```
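At its core, building a frequency list is just tokenizing and counting. Here is a minimal sketch of the idea, not the actual script (which also handles directories and the ASP corpus conventions):

```ruby
require "unicode_utils/downcase"

# Count word frequencies from files given as arguments or from stdin,
# then print them in descending order of frequency.
freq = Hash.new(0)
ARGF.each_line do |line|
  UnicodeUtils.downcase(line).scan(/[[:word:]]+/).each { |w| freq[w] += 1 }
end
freq.sort_by { |_, n| -n }.each { |word, n| puts "#{n}\t#{word}" }
```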
A small library for extracting n-grams from a corpus (for n = some supplied value). This is required for using the `print_ngrams.rb` script.
A script for extracting n-grams (sequences of n adjacent words) from a corpus. As with the other scripts, it can take input from a file, directory, or pipe. It will automatically create a series of files containing lists of n-grams for three predetermined values of n (specifically, bigrams, trigrams, and 4-grams), although the ngrams library allows n to be any arbitrary number.
To output n-grams for different values of n, just edit the script to comment out the two lines with `batch_ngram(dir, corpus)` and replace them with the following:

```
print_ngrams(dir, corpus, n)
```

(where `n` is any number).
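For reference, the underlying technique is a sliding window over the token stream. A hypothetical sketch of the idea (the actual library's public interface is the `print_ngrams`/`batch_ngram` calls shown above):

```ruby
# Return the n-grams of a token array as space-joined strings.
def ngrams(tokens, n)
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

ngrams(%w[the quick brown fox], 2)
# => ["the quick", "quick brown", "brown fox"]
```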
Given a stoplist for a particular language and a text file written in that language, print a frequency list of the most "salient" or "interesting" high frequency words (i.e., words that are not common stop words).
Process a frequency list directly:

```
./salient.rb frequency_list.txt
```

Process any raw text file using a pipe:

```
./corpus_freq.rb filename.txt | ruby salient.rb
```
To use the `salient.rb` script, you first need to configure the location of your list of stop words and the language of the source file. This can be done either by setting these values directly at the top of the script file or by specifying them with command-line options (see the section below for details). If your source files are usually in one particular language, it is probably easier to configure the script itself and use the command-line options for one-offs. The variables to configure in the script are as follows:
- `stoplist_dir`: A directory containing stoplists in JSON format. A good example with a wide selection of languages can be found in the "dist" folder of the stopwords-json project (see the loading sketch below).
- `lang`: The ISO code for the language of the source text (this has to be a language that is available in your stoplist directory). Examples: `de` (German) or `fr` (French).
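Assuming the stopwords-json layout (one JSON array of words per language, in files named by language code such as `de.json`), loading a stoplist looks roughly like this; this is an illustration of the idea, not `salient.rb`'s actual code:

```ruby
require "json"

stoplist_dir = "/path/to/stopwords-json/dist" # assumed location
lang = "de"                                   # ISO language code

# e.g. de.json contains ["aber", "alle", "allem", ...]
stopwords = JSON.parse(File.read(File.join(stoplist_dir, "#{lang}.json")))
puts stopwords.first(5).inspect
```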
Command-line options override the values configured in the script, so they are a convenient way to specify temporary values (e.g. for a language you don't usually work with, or an experimental list of stopwords).
The following options are available:
- `-l` (`--language CODE`): Language code to use for processing
- `-s` (`--stoplist-dir DIR`): Directory containing stoplist files
So, for example, to find salient words in a Portuguese source file (`filename.txt`), you could use the following command:

```
./corpus_freq.rb filename.txt | ruby salient.rb -l pt
```
To specify the location of your stoplist directory, just add it with the `-s` option:

```
./corpus_freq.rb filename.txt | ruby salient.rb -l pt -s /path/to/stoplist_dir
```
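If you want to support the same flags in scripts of your own, Ruby's standard `optparse` library handles them in a few lines. This is a generic sketch, not `salient.rb`'s actual source:

```ruby
require "optparse"

options = {}
OptionParser.new do |opts|
  opts.on("-l", "--language CODE", "Language code to use for processing") do |code|
    options[:lang] = code
  end
  opts.on("-s", "--stoplist-dir DIR", "Directory containing stoplist files") do |dir|
    options[:stoplist_dir] = dir
  end
end.parse!
```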
Print statistics for a specified corpus. Just point this script at your corpus file or directory and it will churn out a wide variety of statistics, including the total number of files, words, and lines in the corpus, the average number of words per file, and the top five most and least frequent words.
See the statistics from the Swahili corpus for an example of what the generated statistics look like.
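As a rough sketch of how such numbers fall out of a frequency table (illustrative only, with made-up toy counts; the real script reports more, including file and line counts):

```ruby
# A toy frequency table standing in for real corpus counts.
freq = { "na" => 42, "ya" => 30, "watu" => 7, "sana" => 5, "leo" => 2, "mwezi" => 1 }

by_count = freq.sort_by { |_, n| -n }
puts "Distinct words: #{freq.size}"
puts "Running words:  #{freq.values.sum}"
puts "Most frequent:  #{by_count.first(5).map(&:first).join(', ')}"
puts "Least frequent: #{by_count.last(5).map(&:first).join(', ')}"
```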
This will extract the wiki text from a small MediaWiki XML dump, such as the database backup dumps found here.
Note: this will only work if the database is small enough to be read entirely into working memory (RAM).
The intended use case is extracting usable/analyzable text from minority-language wikis. The resulting text can be piped to `stats.rb`, `corpus_freq.rb`, or other scripts in this repo.
If you have an extracted XML file:

```
./small_wiki_to_text_corpus.rb database.xml
```

Or work directly with the bzip-compressed dump:

```
bzcat database.bz2 | ruby small_wiki_to_text_corpus.rb
```
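The general approach is simple: read the whole dump into memory (hence the RAM caveat above) and pull the wikitext out of each page's `<text>` element. A crude, regex-based sketch of the technique, not necessarily how the script itself is written:

```ruby
require "cgi"

# Read the entire dump, then print the contents of every <text> element,
# unescaping XML entities along the way.
xml = ARGF.read
xml.scan(%r{<text[^>]*>(.*?)</text>}m) do |(body)|
  puts CGI.unescapeHTML(body)
end
```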
MIT.