NLP (natural language processing)

Terminology

Corpus = a collection of text

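Token = one occurrence of a word or punctuation mark (e.g. "the", "rat", ".")

Type = a distinct token; repeated occurrences of the same token count as one type (e.g. "the rat saw the cat" has 5 tokens but 4 types)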
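
The difference between tokens and types can be checked with a few lines of Python. This is only a minimal sketch: `tokenize` is a hypothetical helper, and the regex rule (runs of word characters, or single punctuation marks) is an assumption made for illustration, not a standard tokenizer.

```python
import re

def tokenize(text):
    # Hypothetical helper (assumption): each run of word characters is one
    # token, and each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

corpus = "the rat saw the cat."   # a tiny one-sentence corpus
tokens = tokenize(corpus)         # every occurrence counts
types = set(tokens)               # duplicates collapse into one type

print(tokens)      # ['the', 'rat', 'saw', 'the', 'cat', '.']
print(len(tokens)) # 6
print(len(types))  # 5
```

"the" occurs twice, so it contributes two tokens but only one type.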

Sequence = one or more consecutive tokens

Unigram = 1-token sequence

Bigram = 2-token sequence

N-gram = n-token sequence (unigrams and bigrams are the cases n = 1 and n = 2; a 3-token sequence is a trigram)
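
Sequences of n consecutive tokens can be extracted by sliding a window of width n over the token list. The sketch below assumes the text is already tokenized; `ngrams` is a hypothetical helper written for this example, not a library function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # The window starts at each index that leaves room for n tokens;
    # if n exceeds the number of tokens the result is an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "rat", "ate", "cheese", "."]

unigrams = ngrams(tokens, 1)  # 5 one-token sequences
bigrams = ngrams(tokens, 2)   # 4 two-token sequences
trigrams = ngrams(tokens, 3)  # 3 three-token sequences

print(bigrams)
# [('the', 'rat'), ('rat', 'ate'), ('ate', 'cheese'), ('cheese', '.')]
```

A list of k tokens yields k - n + 1 sequences of length n, which is why the counts drop by one each time n grows by one.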