NLP (natural language processing)

Terminology

Corpus = a collection of text

Token = a word or punctuation mark (e.g. "the", "rat", ".")

Type = a distinct token; repeated occurrences of the same token count as one type (e.g. "the rat ate the cheese" has 5 tokens but 4 types)
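
The token/type distinction can be checked with a short Python sketch. The `tokenize` helper is a hypothetical regex splitter written for these notes, not a standard tokenizer; it assumes each run of word characters is one token and each punctuation character is its own token.

```python
import re

def tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and every punctuation character is a token by itself.
    return re.findall(r"\w+|[^\w\s]", text)

corpus = "the rat ate the cheese."
tokens = tokenize(corpus)   # every occurrence counts
types = set(tokens)         # each distinct token counts once

print(tokens)                    # ['the', 'rat', 'ate', 'the', 'cheese', '.']
print(len(tokens), len(types))   # 6 5
```

"the" occurs twice, so it contributes two tokens but only one type.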

Sequence = one or more consecutive tokens

Unigram = 1-token sequence

Bigram = 2-token sequence

N-gram = n-token sequence (unigram and bigram are the cases n = 1 and n = 2; a 3-token sequence is a trigram)
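
The fixed-length sequences defined above can be extracted by sliding a window over the token list. A minimal Python sketch, assuming the tokens are already available as a list; the function name `ngrams` is chosen here for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the tokens; each window is one sequence.
    # A list of k tokens yields k - n + 1 windows (none if n > k).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "rat", "ate", "the", "cheese", "."]

unigrams = ngrams(tokens, 1)   # 6 sequences of 1 token
bigrams = ngrams(tokens, 2)    # 5 sequences of 2 tokens
trigrams = ngrams(tokens, 3)   # 4 sequences of 3 tokens

print(bigrams[0])    # ('the', 'rat')
print(trigrams[-1])  # ('the', 'cheese', '.')
```

Tuples are used so that each sequence is hashable and can be counted, for example with `collections.Counter`.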