Corpus = a collection of text
Token = a single occurrence of a word or punctuation mark (e.g. "the", "rat", ".")
Type = a distinct token; repeated occurrences of the same token count as one type
Sequence = one or more consecutive tokens
Unigram = 1-token sequence
Bigram = 2-token sequence
N-gram = n-token sequence (unigram: n = 1, bigram: n = 2, trigram: n = 3); the term is often used for sequences of 3 or more tokens
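
The definitions above can be sketched in Python. This is only an illustration: the sample corpus, the helper names tokenize and ngrams, the regex-based tokenizer, and the choice to lowercase text (so that "The" and "the" count as one type) are all assumptions, not part of the definitions.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: each run of word characters or each single
    # punctuation mark becomes one token. Lowercasing is an assumption.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngrams(tokens, n):
    # Every run of n consecutive tokens, as a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "The rat ate the cheese. The cat ate the rat."  # made-up sample text

tokens = tokenize(corpus)        # every occurrence counts
types = set(tokens)              # each distinct token counts once
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
bigram_counts = Counter(bigrams)

print(len(tokens), "tokens,", len(types), "types")
print(bigram_counts.most_common(2))
```

In this sample there are 12 tokens but only 6 types, because "the", "rat", "ate" and "." each occur more than once. A sequence of k tokens contains k - n + 1 n-grams, so the 12 tokens yield 11 bigrams and 10 trigrams.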