Don't use this tagger for actual research or production! Use spaCy instead (faster, more reliable). I'm leaving this up only as educational material.
This repository contains a trained part-of-speech tagger for Dutch, as well as the code used to train it.
(The file cowparser.py comes from this repository.)
Don't use the tagger in a production environment unless you retrain it yourself on other data. This code only shows how the NLTK tagger works. For real work I recommend TreeTagger, Frog, or spaCy.
Requirements:
- NLTK version 3.1
- Python 3
Key facts:
- The tagger was trained on the NLCOW14 corpus (which in turn was tagged using TreeTagger).
- The accuracy is about 97% on held-out data from the same corpus.
- The small model is trained on 2 million tokens, while the larger model is trained on 10 million tokens.
- The larger model is slightly more accurate than the smaller one, but its file is over three times as large.
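The accuracy figure above is a token-level score: the share of held-out tokens whose predicted tag matches the corpus tag. A minimal sketch of that calculation follows. The names `token_accuracy` and `toy_tagger` are hypothetical and not part of this repository, and the gold data is assumed to be a list of sentences, each a list of (word, tag) pairs. The toy tagger only stands in for the real one so the example runs without a model file.

```python
def token_accuracy(tag_fn, gold_sentences):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    tag_fn: callable taking a list of words, returning (word, tag) pairs.
    gold_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    correct = 0
    total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            if gold_tag == predicted_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0


def toy_tagger(words):
    # Stand-in for tagger.tag: commas are 'punc', everything else 'nounpl'.
    return [(word, 'punc' if word == ',' else 'nounpl') for word in words]


# Tiny hand-made gold sample (illustrative only, not corpus data).
gold = [[('vogels', 'nounpl'), (',', 'punc'), ('ik', 'pronpers')]]
accuracy = token_accuracy(toy_tagger, gold)
print(accuracy)
```

With a loaded model, `tagger.tag` can be passed in place of `toy_tagger`, since it also takes a list of words and returns (word, tag) pairs.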
First run bash create_models.sh to create the models. Then use the following code.
from nltk.tag.perceptron import PerceptronTagger
# This may take a few minutes. (But once loaded, the tagger is really fast!)
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_small.pickle')
# Tag a sentence.
tagger.tag('Alle vogels zijn nesten begonnen , behalve ik en jij .'.split())
Result:
[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
If the text is not tokenized yet, you can use NLTK's built-in tokenizer (be sure to download the NLTK data first):
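import nltk.data
from nltk.tokenize import word_tokenize

# Load the Dutch sentence splitter (requires the NLTK 'punkt' data).
sent_tokenizer = nltk.data.load('tokenizers/punkt/dutch.pickle')

def tokenize(text):
    """Yield each sentence in the text as a list of word tokens."""
    for sentence in sent_tokenizer.tokenize(text):
        yield word_tokenize(sentence)

sentences = tokenize('Alle vogels zijn nesten begonnen, behalve ik en jij. Waar wachten wij nu op?')

# Tag each tokenized sentence with the tagger loaded above.
for sentence in sentences:
    print(tagger.tag(sentence))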
Result:
[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
[('Waar', 'pronadv'), ('wachten', 'verbprespl'), ('wij', 'pronpers'), ('nu', 'adv'), ('op', 'adv'), ('?', '$.')]
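The tagger returns plain lists of (word, tag) tuples, so the output can be processed with ordinary Python. As a small sketch, the snippet below takes the first result shown above, counts the tags, and picks out the nouns. Treating every tag that starts with `noun` as a noun is an assumption based only on the tags visible in this example, not on the full tag set.

```python
from collections import Counter

# Tagger output for the first example sentence, copied from the result above.
tagged = [('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'),
          ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'),
          ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'),
          ('jij', 'pronpers'), ('.', '$.')]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Collect the words whose tag starts with 'noun' (assumed noun prefix).
nouns = [word for word, tag in tagged if tag.startswith('noun')]

print(tag_counts.most_common(2))
print(nouns)
```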