Below is a list of customizable settings that can be used to define the type of trees to be extracted and the associated information in the output. The default values are visible in the config.ini
file and can be modified by following these instructions.
| General | Tree specification | Tree restrictions | Statistics | Visualisation | Threshold |
|---|---|---|---|---|---|
| input | node_type | size | association_measures | example | max_lines |
| output | labeled | head | compare | grew_match | frequency_threshold |
| | label_subtypes | ignored_labels | | depsearch | |
| | fixed | allowed_labels | | | |
| | | query | | | |
For details on the settings pertaining to the tool performance, testing and rare use cases see advanced settings.
Value: <path to the input file or directory>
The --input
parameter defines the location of the input file or directory, i.e. one or more files in the .conllu
format. The tool is primarily aimed at processing corpora based on the Universal Dependencies annotation scheme, but can also be used for any other dependency-parsed corpus complying with the CoNLL-U format, regardless of the tagsets used. The only condition is that each sentence has at least one root node labeled root (regardless of casing).
Value: <path to the output file>
STARK produces a single tab-separated file (.tsv) as output, the name and the location of which is defined using the --output
setting. The output file gives a list of all the trees matching the input criteria sorted by descending frequency, as illustrated by the sample output file here.
Values: form, lemma, upos, xpos, feats, deprel
The --node_type
parameter specifies which characteristics of the tokens should be considered when extracting and counting the trees: word form (value form, e.g. 'went'), lemma (lemma, e.g. 'go'), part-of-speech tag (upos, e.g. 'VERB'), morphological features (feats, e.g. 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin'), language-specific tag (xpos, e.g. 'VBD'), or dependency role (deprel, e.g. 'obj'). You can also combine these values using the '+' operator (e.g. lemma+upos). If you do not want to differentiate your trees based on their nodes, simply comment out the '--node_type' parameter to get trees with underscores as nodes.
For example, specifying the option form returns trees of the type 'Mary <nsubj went', while specifying the option upos returns trees of the type 'PROPN <nsubj VERB'. If this parameter is not specified, the trees are returned in the form '_ <nsubj _'.
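The effect of a combined value such as lemma+upos can be pictured with a small Python sketch. The helper `node_label` and the dictionary-based token are hypothetical and not part of STARK; in particular, joining the combined values with '+' is an assumption about the rendering, not a statement about the actual output.

```python
# Hypothetical helper, not part of STARK: builds a node label from a
# CoNLL-U-style token according to a node_type value such as "lemma+upos".
def node_label(token, node_type=None):
    if not node_type:
        # No node_type given: the node is rendered as an underscore.
        return "_"
    # Look up each '+'-separated attribute and join the values with '+'
    # (the joining character is an assumption made for this sketch).
    return "+".join(token[attr] for attr in node_type.split("+"))


token = {"form": "went", "lemma": "go", "upos": "VERB", "xpos": "VBD", "deprel": "root"}
print(node_label(token, "form"))
print(node_label(token, "lemma+upos"))
print(node_label(token))
```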
Values: yes, no
The obligatory --labeled
parameter specifies whether trees should be differentiated based on the syntactic relations (dependency labels) between the nodes of the tree (value yes), or not (value no).
For example, if the first option differentiates between trees 'NOUN <nsubj VERB' and 'NOUN <obj VERB', the second option considers them as two instances of the same tree, i.e. 'NOUN < VERB'.
Values: yes, no
The obligatory --label_subtypes
parameter specifies whether (labeled) trees should be differentiated based on label extensions, i.e. colon-marked relation sub-types (value yes), or not (value no).
For example, if specifying the option yes differentiates between trees 'NOUN <nsubj:pass VERB' and 'NOUN <nsubj VERB', specifying the option no considers them as two instances of the same tree, i.e. 'NOUN <nsubj VERB'.
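As a rough illustration of how the labeled and label_subtypes settings interact, the following Python sketch normalises a single dependency label. The function `edge_label` is a hypothetical name used only for this example and does not reflect STARK's internal code.

```python
# Hypothetical helper, not part of STARK: shows how a dependency label
# would be normalised under the labeled / label_subtypes settings.
def edge_label(deprel, labeled=True, label_subtypes=True):
    if not labeled:
        # Unlabeled trees: the relation name is dropped entirely.
        return ""
    if not label_subtypes:
        # Keep only the part before the first colon (nsubj:pass -> nsubj).
        return deprel.split(":", 1)[0]
    return deprel


for settings in [(True, True), (True, False), (False, False)]:
    print(settings, repr(edge_label("nsubj:pass", *settings)))
```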
Values: yes, no
The obligatory --fixed
parameter allows the users to specify whether they consider the order of the nodes in the tree, i.e. the surface word order, to be a distinctive feature of the trees (value yes) or not (value no).
For example, if the input treebank contained the sentences ‘John gave the apple to Mary’ and ‘John the apple gave to Mary’ (an odd example in English but common in languages with free word order), using the yes option would extract 'gave > apple' and 'apple < gave' as two distinct trees, while the no option would consider them as two instances of the same tree, i.e. 'gave > apple'.
Note that each of the two options is associated with specific formatting of the trees in the output. When choosing the fixed = yes option, the tree description in the first column reflects the word order of the nodes on the surface (e.g. '(seemingly < easy) < example'). On the other hand, when choosing the fixed = no option, the description of the tree in the first column is order-agnostic, with heads always preceding their dependents, i.e. all the arrows always pointing to the right (e.g. 'example > (easy > seemingly)').
The second, order-agnostic description of a tree can also be produced by using the --depsearch
option (value yes), which, in combination with fixed = yes, might be useful for users investigating word order variation.
In contrast to the obligatory settings above specifying the criteria for defining the types of trees to be extracted, STARK also allows the users to restrict the extraction procedure to specific trees through the five options presented below.
Value: <integer number or range>
The obligatory --size
parameter allows the users to define the size of the trees to appear in the output file, i.e. the number of tokens (typically words) in the trees under investigation, which can either be specified as an integer number (e.g. 1, 2, 3 … ) or a range (e.g. 2-15). If you want to retrieve all possible trees regardless of size, set the maximum value to a very large number, e.g. 1-10000.
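A size value is therefore either a single integer or a hyphen-separated range. The Python sketch below shows one way such a value could be interpreted; `parse_size` is a hypothetical helper written for this illustration, and the validation rules are assumptions.

```python
# Hypothetical helper, not part of STARK: turns a size value such as
# "3" or "2-15" into an inclusive (minimum, maximum) pair.
def parse_size(value):
    value = value.strip()
    if "-" in value:
        low, high = (int(part) for part in value.split("-", 1))
    else:
        # A single integer means trees of exactly that size.
        low = high = int(value)
    if low < 1 or high < low:
        raise ValueError(f"invalid size specification: {value!r}")
    return low, high


print(parse_size("3"))
print(parse_size("2-15"))
print(parse_size("1-10000"))
```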
Value: <list of allowed head characteristics>
The optional --head
parameter allows the users to define specific constraints on the head node (i.e. the word that all other words in the (sub-)tree depend on) in the form of attribute-value pairs specifying its lexical or grammatical features.
For example, upos=NOUN would only return trees with nouns as heads (nominal phrases) and discard trees spanning from words belonging to other part-of-speech categories. Several restrictions on the head node can be combined using the '|' (OR), '&' (AND) and '!' (NOT) operators, e.g. specifying lemma=chair&upos=NOUN|lemma=bank&upos=VERB to extract trees spanning from the noun 'chair' or the verb 'bank'.
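The following Python sketch illustrates how a head restriction combining '|' and '&' could be evaluated against a token. It assumes that '&' binds more tightly than '|', which is consistent with the example above but not confirmed here, and it leaves out the '!' operator because its exact syntax is not shown. The function `head_matches` is hypothetical.

```python
# Hypothetical helper, not part of STARK: checks a head token against a
# restriction built from attribute=value pairs, '&' (AND) and '|' (OR).
# Assumption: '&' binds more tightly than '|'. Negation ('!') is omitted.
def head_matches(token, restriction):
    for alternative in restriction.split("|"):
        conditions = [cond.split("=", 1) for cond in alternative.split("&")]
        # Every condition inside one alternative has to hold.
        if all(token.get(attr) == value for attr, value in conditions):
            return True
    return False


restriction = "lemma=chair&upos=NOUN|lemma=bank&upos=VERB"
print(head_matches({"lemma": "chair", "upos": "NOUN"}, restriction))
print(head_matches({"lemma": "chair", "upos": "VERB"}, restriction))
```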
Value: <list of dependency relations to be ignored>
The optional --ignored_labels
parameter defines a list of dependency relations that are to be ignored when matching the trees and thus not displayed in the results file.
For example, specifying ignored_labels = punct produces a list of matched trees that do not include the punct relation (even if it is present in the actual tree). In addition to ignoring a certain type of relation, such as punctuation or other clause-peripheral phenomena, this is a particularly useful feature for users interested in a limited set of relations only, such as core predicate arguments. Such users would then use this parameter as a negative filter by listing every relation except the core predicate arguments they want to keep (e.g. nsubj, obj). In contrast to the --allowed_labels
parameter below, this parameter does not exclude trees containing a given relation; it only ignores that relation when it occurs in a tree. Two or more relations should be separated by the '|' operator.
Value: <whitelist of allowed dependency relations>
The optional --allowed_labels
parameter defines a list of dependency relations that are allowed to occur in the trees to be extracted (i.e. a whitelist subset of all possible dependency labels) in the form of a list separated by the '|' operator.
For example, specifying allowed_labels = obj|iobj|nsubj extracts only trees in which every relation is one of these three and discards all other trees. In contrast to the --ignored_labels
parameter above, the presence of any other label in the tree automatically excludes such tree from being matched and counted.
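The difference between the two filters can be sketched in Python on a simplified tree represented as a flat list of relation names. The function `apply_label_filters` is hypothetical, and the actual extraction in STARK operates on full (sub-)trees rather than on such lists.

```python
# Hypothetical helper, not part of STARK. A tree is simplified to the list
# of dependency relations it contains.
def apply_label_filters(relations, ignored=None, allowed=None):
    ignored_set = set(ignored.split("|")) if ignored else set()
    allowed_set = set(allowed.split("|")) if allowed else None
    # ignored_labels: drop the listed relations but keep the tree.
    kept = [rel for rel in relations if rel not in ignored_set]
    # allowed_labels: reject the whole tree if any other relation remains.
    if allowed_set is not None and any(rel not in allowed_set for rel in kept):
        return None
    return kept


tree = ["nsubj", "obj", "punct"]
print(apply_label_filters(tree, ignored="punct"))           # tree kept, punct dropped
print(apply_label_filters(tree, allowed="obj|iobj|nsubj"))  # tree rejected
```

Returning `None` for a rejected tree is a choice made for this sketch only; it simply marks the tree as excluded from matching and counting.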
Value: <pre-defined tree query>
Finally, the optional --query
parameter allows the users to define a specific tree structure to be extracted by using the dep_search query language.
For example, the query upos=NOUN >amod (_ >advmod _) would return nouns that govern an adjectival modifier modified by an adverbial modifier, e.g. trees of the type 'seemingly easy example'. The query language requires the attributes to be written in full (e.g. upos=VERB, form=went, L=go) and also supports using the '|' (OR), '&' (AND), and '!' (NOT) operators. For the latter, the program enables negations of specific relation types (e.g. A >!case B), while negations of relations as such (e.g. A !> B ) are currently not supported.
When --query is specified, the output takes into account tree specification settings, such as --node_type, but ignores all other tree restriction settings, such as --size.
By default, STARK produces a list of trees with the absolute frequency (raw count) and the relative frequency (normalized count per million tokens) of the trees in the input treebank. In addition, two optional types of statistics can also be computed in the output to help identify compelling syntactic phenomena.
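The relative frequency is a simple normalisation of the raw count by treebank size. A minimal Python sketch, with the hypothetical function name `relative_frequency`:

```python
# Hypothetical helper, not part of STARK: normalised count per million tokens.
def relative_frequency(count, corpus_tokens, per=1_000_000):
    return count * per / corpus_tokens


# A tree occurring 50 times in a treebank of 200,000 tokens.
print(relative_frequency(50, 200_000))
```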
Values: yes, no
The optional --association_measures
parameter (value yes) produces information on the strength of statistical association between the nodes of the tree by computing several common association scores (MI, MI3, Dice, logDice, t-score, simple-LL). This is a particularly useful feature for treebank-driven collocation extraction and lexical analysis. Note that association scores are computed only for trees with a maximum length of 10 words; trees exceeding this length are assigned a NaN value.
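To give an idea of what such scores express, the Python sketch below computes some of them for the simplest case of a two-node tree, using common textbook formulations for word pairs. How STARK generalises these measures to larger trees is not covered here, so the function `association_scores` and its formulas should be read as assumptions rather than as the tool's implementation.

```python
import math


# Hypothetical helper, not part of STARK. Textbook pair-based formulations:
# f_xy = frequency of the two-node tree, f_x and f_y = frequencies of the
# individual nodes, n_tokens = treebank size.
def association_scores(f_xy, f_x, f_y, n_tokens):
    expected = f_x * f_y / n_tokens  # co-occurrence expected by chance
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "MI": math.log2(f_xy / expected),
        "MI3": math.log2(f_xy ** 3 / expected),
        "Dice": dice,
        "logDice": 14 + math.log2(dice),
        "t-score": (f_xy - expected) / math.sqrt(f_xy),
    }


scores = association_scores(f_xy=10, f_x=20, f_y=20, n_tokens=1000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```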
Value: <name of the reference treebank>
In addition, STARK can also be used to identify key or statistically significant phenomena in the input treebank by comparing the frequency of the extracted trees to that of another, so-called reference treebank. This is triggered by using the optional --compare
parameter which takes the name of the second, reference treebank as input (e.g. sl_ssj-ud-dev.conllu) to compute the frequencies in both treebanks and compare them using the simple ratio comparison and several common keyness scores (LL, BIC, log ratio, odds ratio and %DIFF). This feature is particularly useful for research on language- or genre-specific syntactic phenomena.
If a tree occurring in the first treebank is absent from the second treebank (i.e. its frequency is 0), a very small constant (0.000000000000000001, i.e. 10^-18) is used as a proxy for zero when computing the keyness scores to avoid division by zero. For the simple ratio, a NaN value is given instead.
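The comparison can be sketched in Python as follows. The formulas are the commonly used formulations of log-likelihood, log ratio and %DIFF from corpus linguistics and are assumed here, not taken from STARK's code; the function `keyness` is hypothetical, and the zero proxy uses the value written out above.

```python
import math

ZERO_PROXY = 0.000000000000000001  # stand-in for a zero frequency


# Hypothetical helper, not part of STARK. freq_a / size_a describe the input
# treebank, freq_b / size_b the reference treebank; freq_a must be positive.
def keyness(freq_a, size_a, freq_b, size_b):
    fb = freq_b if freq_b > 0 else ZERO_PROXY
    norm_a, norm_b = freq_a / size_a, fb / size_b
    total = freq_a + fb
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    return {
        # The simple ratio is undefined when the reference frequency is 0.
        "ratio": norm_a / norm_b if freq_b > 0 else float("nan"),
        "LL": 2 * (freq_a * math.log(freq_a / expected_a)
                   + fb * math.log(fb / expected_b)),
        "log_ratio": math.log2(norm_a / norm_b),
        "pct_diff": (norm_a - norm_b) * 100 / norm_b,
    }


print(keyness(100, 10_000, 50, 10_000))
print(keyness(100, 10_000, 0, 10_000))
```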
In addition to the default description of the trees featured in the first column of the output, which is based on the easy-to-read dep_search query language (e.g. 'ADJ <amod NOUN'), STARK can also produce two alternative ways of describing a tree, which also enable the users to visualize specific instances of the trees in the related treebank-browsing services.
Values: yes, no
First, the optional --grew_match
parameter (value yes) produces trees in accordance with the Grew query language (e.g. 'pattern {A [upos="NOUN"]; B [upos="ADJ"]; A -[amod]-> B }'), which is used by the Grew-match online treebank browsing service featuring the latest collections of UD treebanks available in more than 240 languages.
If the name of the input treebank begins with the standard declaration of the language code and the treebank name (e.g. en_gum-ud..., fr_rhapsodie-ud..., sl_ssj-ud...), the grew_match = yes option will also produce direct URL links to the instances of the tree in the latest version of the given input treebank, e.g. this URL for the 'ADJ <amod NOUN' case at hand.
Values: yes, no
Second, the optional --depsearch
parameter (value yes) produces trees in accordance with the dep_search query language (e.g. 'NOUN >amod ADJ'), which is used by the SETS online treebank-browsing service. Unfortunately, SETS is no longer maintained, but some derivations of it still exist, such as Drevesnik.
Values: yes, no
Additionally, using the --example
parameter (value yes) produces an additional column with one random sentence containing the tree, in which the nodes of the tree are explicitly marked, e.g. the sentence We went to see [the]A [new]B [trains]C. for a tree of the type 'DET < ADJ < NOUN'.
Value: <minimum number of tree occurrences in the input treebank>
To limit the number of trees in the output file, the optional --frequency_threshold
parameter can be used to limit the extraction to trees occurring at least a given number of times by specifying the minimum absolute frequency of the tree in the treebank (e.g. 5 to limit the search to trees occurring 5 or more times).
Value: <maximum number of lines in the output file>
Similarly, the optional --max_lines
parameter defines the maximum number of trees (lines) in the output file, which gives a frequency-ranked list of trees. For example, value 100 returns only the top-100 most frequent trees matching the input criteria.
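The combined effect of the two threshold settings can be illustrated with a short Python sketch over a table of tree counts. The function `top_trees` and the sample counts are made up for this example and do not come from STARK or from any real treebank.

```python
from collections import Counter


# Hypothetical helper, not part of STARK: keeps trees at or above the
# frequency threshold and truncates the frequency-ranked list to max_lines.
def top_trees(counts, frequency_threshold=None, max_lines=None):
    rows = [
        (tree, n)
        for tree, n in counts.most_common()  # sorted by descending frequency
        if frequency_threshold is None or n >= frequency_threshold
    ]
    return rows if max_lines is None else rows[:max_lines]


# Invented counts, for illustration only.
counts = Counter({"ADJ <amod NOUN": 12, "DET <det NOUN": 30, "NOUN <nsubj VERB": 4})
print(top_trees(counts, frequency_threshold=5))
print(top_trees(counts, max_lines=1))
```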