Calling frog from R
========================================================
Frog is a lemmatizer and dependency parser for Dutch which can also be run as a server.
This package contains functions for connecting to a frog server from R and creating a document-term matrix from the resulting tokens. Since this yields a standard `tm` document-term matrix, it can be used e.g. for [corpus analysis](https://github.com/kasperwelbers/corpus-tools/blob/master/howto/howto_compare_corpora.md), [topic modeling](https://github.com/kasperwelbers/corpus-tools/blob/master/howto/howto_latent_dirichlet_allocation_topmod.md), or machine learning using [RTextTools](http://www.rtexttools.net).
See http://ilk.uvt.nl/frog/ for more information on Frog.
Installing and running the frog server
----
The frog daemon (server) needs to be running before you can use this package.
See http://ilk.uvt.nl/frog/ for documentation and installation instructions.
To install frog on Debian/Ubuntu you can use apt:
```{bash}
$ sudo apt-get install frog frogdata ucto
```
To run the frog server on port 9772, use:
```{bash}
$ frog -S 9772
```
If you only want to POS-tag and lemmatize,
you can add the `--skip` option to leave out the parsing and morphological analysis, which speeds up the analysis and conserves memory:
```{bash}
$ frog -S 9772 --skip=acpm
```
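Since the server has to be up before any call from R can succeed, it can help to check the port first. The following is a minimal sketch using only base R; the helper name `frog_is_up` is hypothetical and not part of `frogr`, and it only tests whether something accepts TCP connections on the given port, not whether that something is actually frog.

```r
# Hypothetical helper (not part of frogr): TRUE if a TCP connection
# to host:port can be opened within `timeout` seconds, FALSE otherwise.
frog_is_up <- function(host = "localhost", port = 9772, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, blocking = TRUE,
                       open = "r+", timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}
```

Calling `frog_is_up()` before sending text gives a clearer failure than a connection error halfway through a script.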
Installing frogr
----
`frogr` can be installed directly from this GitHub repository using devtools:
```{r, message=FALSE}
if (!require(devtools)) {install.packages("devtools"); library(devtools)}
install_github("vanatteveldt/frogr")
library(frogr)
```
If devtools is unavailable (e.g. on Windows), you can also copy the file [frog.r](R/frog.r) and source it directly.
In that case, make sure the packages `tm`, `Matrix` and `zoo` are installed.
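When sourcing the file by hand, one way to see which of those packages still need installing is sketched below. The helper name `missing_packages` is an assumption made for this example and is not part of `frogr`.

```r
# Hypothetical helper (not part of frogr): return the subset of `pkgs`
# that cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Packages named above as requirements for the standalone frog.r file.
todo <- missing_packages(c("tm", "Matrix", "zoo"))
todo
```

If `todo` is not empty, `install.packages(todo)` followed by `source("R/frog.r")` should leave you with the same functions as the installed package.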
Calling frog
---
The function `call_frog` calls the frog server with a given text and returns a data frame:
```{r}
text = c("Mijn kat Toby heeft nooit van andere katten gehouden.",
"Maar andere katjes houden wel van hem!")
tokens = call_frog(text, host="localhost", port=9772)
head(tokens)
```
Note that if you run frog with the `--skip=` argument, some columns will only contain NA values.
The `sentence` and `majorpos` columns are not produced by frog but included here for convenience. `majorpos` is simply the part of the POS tag before the first parenthesis.
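To make the `majorpos` rule concrete, here is a small sketch of the described transformation in base R. The POS tags are illustrative values in the style of frog's output, and the regular expression is one possible way to apply the rule, not necessarily how `frogr` computes the column internally.

```r
# Illustrative POS tags in the style of frog's output.
pos <- c("N(soort,ev,basis,zijd,stan)", "WW(pv,tgw,ev)", "LET()")

# Drop everything from the first opening parenthesis onwards.
majorpos <- sub("\\(.*$", "", pos)
majorpos
```

This leaves `N`, `WW` and `LET`, the coarse categories used when filtering tokens by part of speech.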
Creating a document-term matrix
----
To create a document-term matrix from the frog output (or in fact from any list of tokens), you can use the `create_dtm` function:
```{r}
m = create_dtm(tokens$docid, tokens$lemma)
as.matrix(m)
```
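As a rough illustration of what such a matrix contains, the sketch below cross-tabulates a small, made-up token list with base R. It only shows the idea of counting lemmas per document; it is not the implementation of `create_dtm`, which returns a `tm` matrix rather than a plain one.

```r
# Made-up token list: one row per token, with a document id and a lemma.
toy <- data.frame(
  docid = c(1, 1, 1, 2, 2),
  lemma = c("kat", "houden", "kat", "kat", "houden"),
  stringsAsFactors = FALSE
)

# Rows are documents, columns are lemmas, cells are occurrence counts.
counts <- unclass(table(toy$docid, toy$lemma))
counts
```

Any pair of equally long vectors (one of document ids, one of terms) can be counted this way, which is why `create_dtm` takes its two arguments as separate vectors.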
Of course, you can also filter the tokens first, e.g. to keep only nouns and verbs:
```{r}
subset = tokens[tokens$majorpos %in% c("N", "WW"), ]
m = create_dtm(subset$sent, subset$lemma)
as.matrix(m)
```
As you can see, all forms of cat (_kat_, _katten_, _katjes_), love (_gehouden_, _houden_), and have (_heeft_) are properly lemmatized.