STARK settings explained

Below is a list of customizable settings that can be used to define the type of trees to be extracted and the associated information in the output. The default values are visible in the config.ini file and can be modified by following these instructions.

| General | Tree specification | Tree restrictions | Statistics | Visualisation | Threshold |
|---|---|---|---|---|---|
| input | node_type | size | association_measures | example | max_lines |
| output | labeled | head | compare | grew_match | frequency_threshold |
| | label_subtypes | ignored_labels | | depsearch | |
| | fixed | allowed_labels | | | |
| | | query | | | |

For details on the settings pertaining to the tool performance, testing and rare use cases see advanced settings.

General settings

--input

Value: <path to the input file or directory>

The --input parameter defines the location of the input file or directory, i.e. one or more files in the .conllu format. The tool is primarily aimed at processing corpora based on the Universal Dependencies annotation scheme, but can also be used for any other dependency-parsed corpus complying with the CoNLL-U format, regardless of the tagsets used. The only condition is that each sentence contains at least one root node labeled root (in any casing).
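The root requirement can be checked before running the tool. The sketch below is a standalone illustration and not part of STARK: the helper name `sentences_missing_root` is hypothetical, and it assumes standard 10-column CoNLL-U with the relation in the eighth column and sentences separated by blank lines.

```python
def sentences_missing_root(conllu_text):
    """Return 1-based indices of sentences with no 'root' relation (any casing)."""
    missing = []
    blocks = [b for b in conllu_text.strip().split("\n\n") if b.strip()]
    for i, block in enumerate(blocks, start=1):
        has_root = False
        for line in block.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip empty and comment lines
            cols = line.split("\t")
            # DEPREL is the 8th column in CoNLL-U
            if len(cols) >= 8 and cols[7].lower() == "root":
                has_root = True
                break
        if not has_root:
            missing.append(i)
    return missing


sample = (
    "# text = Mary went\n"
    "1\tMary\tMary\tPROPN\tNNP\t_\t2\tnsubj\t_\t_\n"
    "2\twent\tgo\tVERB\tVBD\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\tdiscourse\t_\t_\n"
)
print(sentences_missing_root(sample))  # the second sentence has no root
```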

--output

Value: <path to the output file>

STARK produces a single tab-separated file (.tsv) as output, the name and the location of which is defined using the --output setting. The output file gives a list of all the trees matching the input criteria sorted by descending frequency, as illustrated by the sample output file here.

Tree specification

--node_type

Values: form, lemma, upos, xpos, feats, deprel

The --node_type parameter specifies which characteristics of the tokens should be considered when extracting and counting the trees: word form (value form, e.g. 'went'), lemma (lemma, e.g. 'go'), part-of-speech tag (upos, e.g. 'VERB'), morphological features (feats, e.g. 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin'), language-specific tag (xpos, e.g. 'VBD'), dependency role (deprel, e.g. 'obj'). You can also combine these values using the '+' operator (e.g. lemma+upos). If you do not want to differentiate your trees based on their nodes, simply comment out the '--node_type' parameter to get trees with underscores as nodes.

For example, specifying the option form returns trees of the type 'Mary <nsubj went', while specifying the option upos returns trees of the type 'PROPN <nsubj VERB'. If this parameter is not specified, the trees are returned in the form '_ <nsubj _'.
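The effect of the node type on the tree description can be sketched as follows. This is only an illustration of the idea, not STARK's code: the token dictionaries and the helpers `node_label` and `describe` are hypothetical, and combined values such as lemma+upos are left out because their output formatting is not specified here.

```python
# Hypothetical token records for the sentence fragment "Mary went"
token_mary = {"form": "Mary", "lemma": "Mary", "upos": "PROPN", "xpos": "NNP", "deprel": "nsubj"}
token_went = {"form": "went", "lemma": "go", "upos": "VERB", "xpos": "VBD", "deprel": "root"}


def node_label(token, node_type=None):
    """Pick the token attribute used as the node label; '_' if no node type is set."""
    if node_type is None:
        return "_"
    return token[node_type]


def describe(dependent, head, node_type=None):
    """Describe a two-node tree in which the dependent precedes its head."""
    left = node_label(dependent, node_type)
    right = node_label(head, node_type)
    return f"{left} <{dependent['deprel']} {right}"


print(describe(token_mary, token_went, "form"))
print(describe(token_mary, token_went, "upos"))
print(describe(token_mary, token_went))
```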

--labeled

Values: yes, no

The obligatory --labeled parameter specifies whether trees should be differentiated based on the syntactic relations (dependency labels) between the nodes of the tree (value yes), or not (value no).

For example, whereas the yes option differentiates between the trees 'NOUN <nsubj VERB' and 'NOUN <obj VERB', the no option considers them two instances of the same tree, i.e. 'NOUN < VERB'.

--label_subtypes

Values: yes, no

The obligatory --label_subtypes parameter specifies whether (labeled) trees should be differentiated based on label extensions, i.e. colon-marked relation sub-types (value yes), or not (value no).

For example, whereas the yes option differentiates between the trees 'NOUN <nsubj:pass VERB' and 'NOUN <nsubj VERB', the no option considers them two instances of the same tree, i.e. 'NOUN <nsubj VERB'.
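The combined effect of --labeled and --label_subtypes on a single relation can be sketched like this. The function `relation_arrow` is a hypothetical helper written for illustration; it only assumes that sub-types are whatever follows the first colon in the label.

```python
def relation_arrow(deprel, labeled=True, label_subtypes=True, direction="<"):
    """Format one relation according to the labeled / label_subtypes settings."""
    if not labeled:
        return direction  # unlabeled trees keep only the arrow
    if not label_subtypes:
        deprel = deprel.split(":", 1)[0]  # drop the colon-marked sub-type
    return f"{direction}{deprel}"


print(relation_arrow("nsubj:pass"))
print(relation_arrow("nsubj:pass", label_subtypes=False))
print(relation_arrow("nsubj:pass", labeled=False))
```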

--fixed

Values: yes, no

The obligatory --fixed parameter allows the users to specify whether they consider the order of the nodes in the tree, i.e. the surface word order, to be a distinctive feature of the trees (value yes) or not (value no).

For example, if the input treebank contained the sentences ‘John gave the apple to Mary’ and ‘John the apple gave to Mary’ (an odd example in English but common in languages with free word order), the yes option would extract 'gave > apple' and 'apple < gave' as two distinct trees, while the no option would consider them two instances of the same tree, i.e. 'gave > apple'.

Note that each of the two options is associated with specific formatting of the trees in the output. When choosing the fixed = yes option, the tree description in the first column reflects the word order of the nodes on the surface (e.g. '(seemingly < easy) < example'). On the other hand, when choosing the fixed = no option, the description of the tree in the first column is order-agnostic, with heads always preceding their dependents, i.e. all the arrows always pointing to the right (e.g. 'example > (easy > seemingly)').

The second, order-agnostic description of a tree can also be produced by using the --depsearch option (value yes), which, in combination with fixed = yes, might be useful for users investigating word order variation.
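The two output formats can be illustrated with a small renderer for unlabeled trees. This is a sketch under stated assumptions rather than STARK's implementation: the `node` and `render` helpers are hypothetical, and the way several dependents of one head are chained is a guess, since only single-dependent examples are given above.

```python
def node(form, pos, children=()):
    """Hypothetical tree node: word form, surface position, list of dependents."""
    return {"form": form, "pos": pos, "children": list(children)}


def render(tree, fixed):
    """Render an unlabeled tree; dependents with their own dependents are parenthesised."""
    def wrap(child):
        text = render(child, fixed)
        return f"({text})" if child["children"] else text

    children = sorted(tree["children"], key=lambda c: c["pos"])
    if not fixed:
        # order-agnostic: head first, every arrow points to the right
        return " > ".join([tree["form"]] + [wrap(c) for c in children])
    # surface order: dependents left of the head use '<', those to the right use '>'
    left = [f"{wrap(c)} < " for c in children if c["pos"] < tree["pos"]]
    right = [f" > {wrap(c)}" for c in children if c["pos"] > tree["pos"]]
    return "".join(left + [tree["form"]] + right)


phrase = node("example", 3, [node("easy", 2, [node("seemingly", 1)])])
print(render(phrase, fixed=True))
print(render(phrase, fixed=False))
```

With fixed = no, 'gave' followed by 'apple' and 'apple' followed by 'gave' both render as 'gave > apple', which is why they are counted as one tree.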

Restriction to specific structures

In contrast to the obligatory settings above specifying the criteria for defining the types of trees to be extracted, STARK also allows the users to restrict the extraction procedure to specific trees through the five options presented below.

--size

Value: <integer number or range>

The obligatory --size parameter allows the users to define the size of the trees to appear in the output file, i.e. the number of tokens (typically words) in the trees under investigation, which can either be specified as an integer number (e.g. 1, 2, 3 … ) or a range (e.g. 2-15). If you want to retrieve all possible trees regardless of size, set the maximum value to a very large number, e.g. 1-10000.
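A size value of either kind can be read into an inclusive minimum and maximum. The helper `parse_size` below is a hypothetical sketch of that reading, not the tool's own parser.

```python
def parse_size(value):
    """Parse '3' or '2-15' into an inclusive (min, max) pair of tree sizes."""
    value = value.strip()
    if "-" in value:
        low_text, high_text = value.split("-", 1)
        low, high = int(low_text), int(high_text)
    else:
        low = high = int(value)
    if low < 1 or high < low:
        raise ValueError(f"invalid size: {value!r}")
    return low, high


print(parse_size("3"))
print(parse_size("2-15"))
```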

--head

Value: <list of allowed head characteristics>

The optional --head parameter allows the users to define specific constraints on the head node (i.e. the word that all other words in the (sub-)tree depend on) in the form of attribute-value pairs specifying its lexical or grammatical features.

For example, upos=NOUN would only return trees with nouns as heads (nominal phrases) and discard trees spanning from words belonging to other part-of-speech categories. Several restrictions on the head node can be combined by using the '|' (OR), '&' (AND) and '!' (NOT) operators, e.g. specifying lemma=chair&upos=NOUN|lemma=bank&upos=VERB extracts trees spanning from the noun 'chair' or the verb 'bank'.
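The way such an expression selects heads can be sketched as below. The function `head_matches` is hypothetical and assumes that '&' binds more tightly than '|', so that the example expression reads as (lemma=chair AND upos=NOUN) OR (lemma=bank AND upos=VERB). The '!' operator is left out because its exact syntax is not shown here.

```python
def head_matches(token, expression):
    """Evaluate 'attr=value' pairs joined by '&' (AND) and '|' (OR) against a token."""
    for alternative in expression.split("|"):
        pairs = [condition.split("=", 1) for condition in alternative.split("&")]
        if all(token.get(attr.strip()) == value.strip() for attr, value in pairs):
            return True  # one OR-branch is fully satisfied
    return False


expression = "lemma=chair&upos=NOUN|lemma=bank&upos=VERB"
print(head_matches({"lemma": "chair", "upos": "NOUN"}, expression))
print(head_matches({"lemma": "chair", "upos": "VERB"}, expression))
```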

--ignored_labels

Value: <list of dependency relations to be ignored>

The optional --ignored_labels parameter defines a list of dependency relations that are to be ignored when matching the trees and thus not displayed in the results file.

For example, specifying ignored_labels = punct produces a list of matched trees that do not include the punct relation (even if it is present in the actual tree). Besides skipping a certain type of relation, such as punctuation or other clause-peripheral phenomena, this feature is particularly useful for users interested in a limited set of relations only, such as core predicate arguments. Such users can use this parameter as a negative filter by listing all relations except those of interest (e.g. everything but nsubj and obj). In contrast to the --allowed_labels parameter below, this parameter does not exclude trees containing a given relation, but only ignores that relation when it occurs in a tree. Two or more relations should be separated by the '|' operator.

--allowed_labels

Value: <whitelist of allowed dependency relations>

The optional --allowed_labels parameter defines a list of dependency relations that are allowed to occur in the trees to be extracted (i.e. a whitelist subset of all possible dependency labels) in the form of a list separated by the '|' operator.

For example, specifying allowed_labels = obj|iobj|nsubj extracts only trees in which every relation is one of these three. In contrast to the --ignored_labels parameter above, the presence of any other label in the tree automatically excludes such a tree from being matched and counted.
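The difference between the two label filters can be made concrete with a simplified sketch that treats a tree as a flat list of its relation labels. The function `apply_label_filters` is hypothetical, and applying the ignore list before the allow list is an assumption made only for this illustration.

```python
def apply_label_filters(relations, ignored_labels=(), allowed_labels=None):
    """Return the relations kept for matching, or None if the tree is excluded.

    relations: dependency labels occurring in one candidate tree.
    """
    # ignored labels are dropped from the tree, which itself is still matched
    kept = [rel for rel in relations if rel not in ignored_labels]
    # any label outside the allow list excludes the whole tree
    if allowed_labels is not None and any(rel not in allowed_labels for rel in kept):
        return None
    return kept


tree = ["nsubj", "obj", "punct"]
print(apply_label_filters(tree, ignored_labels={"punct"}))
print(apply_label_filters(tree, allowed_labels={"obj", "iobj", "nsubj"}))
```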

--query

Value: <pre-defined tree query>

Finally, the optional --query parameter allows the users to define a specific tree structure to be extracted by using the dep_search query language.

For example, the query upos=NOUN >amod (_ >advmod _) would return nouns that govern an adjectival modifier modified by an adverbial modifier, e.g. trees of the type 'seemingly easy example'. The query language requires the attributes to be written in full (e.g. upos=VERB, form=went, L=go) and also supports using the '|' (OR), '&' (AND), and '!' (NOT) operators. For the latter, the program enables negations of specific relation types (e.g. A >!case B), while negations of relations as such (e.g. A !> B ) are currently not supported.

When --query is specified, the output takes into account tree specification settings, such as --node_type, but ignores all other tree restriction settings, such as --size.

Statistics

By default, STARK produces a list of trees with the absolute frequency (raw count) and the relative frequency (normalized count per million tokens) of the trees in the input treebank. In addition, two optional types of statistics can also be computed in the output to help identify compelling syntactic phenomena.
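The relative frequency is a plain per-million normalization of the raw count. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def relative_frequency(count, corpus_tokens):
    """Normalize an absolute frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_tokens


# a tree seen 25 times in a 500,000-token treebank
print(relative_frequency(25, 500_000))
```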

--association_measures

Values: yes, no

The optional --association_measures parameter (value yes) produces information on the strength of statistical association between the nodes of the tree by computing several common association scores (MI, MI3, Dice, logDice, t-score, simple-LL). This is a particularly useful feature for treebank-driven collocation extraction and lexical analysis. Note that association scores are computed only for trees of up to 10 words; trees exceeding this length are assigned a NaN value.
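For orientation, the textbook two-word definitions of these scores can be sketched as follows. This is an assumption-laden illustration: the function `association_scores` is hypothetical, it covers only the two-node case, and STARK's own generalization to larger trees may be computed differently.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Textbook two-node association scores.

    f_xy: co-occurrence count, f_x and f_y: counts of each node, n: corpus size.
    """
    expected = f_x * f_y / n  # co-occurrences expected under independence
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "MI": math.log2(f_xy / expected),
        "MI3": math.log2(f_xy ** 3 / expected),
        "Dice": dice,
        "logDice": 14 + math.log2(dice),
        "t-score": (f_xy - expected) / math.sqrt(f_xy),
        "simple-LL": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


scores = association_scores(f_xy=8, f_x=16, f_y=32, n=1024)
print(scores["MI"], scores["MI3"])
```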

--compare

Value: <path to the reference treebank file>

In addition, STARK can also be used to identify key or statistically significant phenomena in the input treebank by comparing the frequency of the extracted trees to that of another, so-called reference treebank. This is triggered by using the optional --compare parameter which takes the name of the second, reference treebank as input (e.g. sl_ssj-ud-dev.conllu) to compute the frequencies in both treebanks and compare them using the simple ratio comparison and several common keyness scores (LL, BIC, log ratio, odds ratio and %DIFF). This feature is particularly useful for research on language- or genre-specific syntactic phenomena.

If a tree occurring in the first treebank is absent from the second treebank (i.e. its frequency is 0), one quadrillionth (0.000000000000000001) is used as a proxy for zero when computing the keyness scores, to avoid complications arising from division by zero. For the simple ratio, a NaN value is given instead.
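The zero-proxy handling can be sketched with the commonly used definitions of a few of these scores. Everything here is an illustration under assumptions: the function `keyness` is hypothetical, the simple ratio is taken to be the ratio of relative frequencies, and the formulas are the standard corpus-linguistic ones, which may differ in detail from what STARK computes.

```python
import math

ZERO_PROXY = 0.000000000000000001  # the proxy value quoted in the text above


def keyness(freq_a, size_a, freq_b, size_b):
    """Compare a tree's frequency in treebank A with a reference treebank B."""
    b = freq_b if freq_b > 0 else ZERO_PROXY
    total = freq_a + b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    rel_a, rel_b = freq_a / size_a, b / size_b
    return {
        # the simple ratio is undefined when the tree is absent from B
        "ratio": rel_a / rel_b if freq_b > 0 else float("nan"),
        "LL": 2 * (freq_a * math.log(freq_a / expected_a) + b * math.log(b / expected_b)),
        "log_ratio": math.log2(rel_a / rel_b),
        "%DIFF": (rel_a - rel_b) * 100 / rel_b,
    }


print(keyness(100, 10_000, 50, 10_000))
print(keyness(100, 10_000, 0, 10_000)["ratio"])
```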

Alternative visualisation and examples

In addition to the default description of the trees featured in the first column of the output, which is based on the easy-to-read dep_search query language (e.g. 'ADJ <amod NOUN'), STARK can also produce two alternative ways of describing a tree, which also enable the users to visualize specific instances of the trees in the related treebank-browsing services.

--grew_match

Values: yes, no

First, the optional --grew_match parameter (value yes) produces trees in accordance with the Grew query language (e.g. 'pattern {A [upos="NOUN"]; B [upos="ADJ"]; A -[amod]-> B }'), which is used by the Grew-match online treebank browsing service featuring the latest collections of UD treebanks available in more than 240 languages.

If the name of the input treebank begins with the standard declaration of the language code and the treebank name (e.g. en_gum-ud..., fr_rhapsodie-ud..., sl_ssj-ud...), the grew_match = yes option will also produce direct URL links to the instances of the tree in the latest version of the given input treebank, e.g. this URL for the 'ADJ <amod NOUN' case at hand.
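The shape of such a Grew pattern for a simple two-node tree can be reproduced with a few lines of string formatting. The helper `grew_pattern` is hypothetical and only mirrors the example pattern quoted above; it does not build the URL links.

```python
def grew_pattern(head_upos, dep_upos, deprel):
    """Build a Grew pattern for a head A governing a dependent B via deprel."""
    return (
        f'pattern {{A [upos="{head_upos}"]; '
        f'B [upos="{dep_upos}"]; '
        f"A -[{deprel}]-> B }}"
    )


print(grew_pattern("NOUN", "ADJ", "amod"))
```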

--depsearch

Values: yes, no

Second, the optional --depsearch parameter (value yes) produces trees in accordance with the dep_search query language (e.g. 'NOUN >amod ADJ'), which is used by the SETS online treebank-browsing service. Unfortunately, SETS is no longer maintained, but some derivations of it still exist, such as Drevesnik.

--example

Values: yes, no

Additionally, using the --example parameter (value yes) produces an additional column with one random sentence containing the tree, in which the nodes of the tree are explicitly marked, e.g. the sentence 'We went to see [the]A [new]B [trains]C.' for a tree of the type 'DET < ADJ < NOUN'.
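The node marking can be sketched as below. The helper `mark_nodes` is hypothetical; it joins tokens with single spaces, so unlike the example above it leaves a space before the final punctuation mark.

```python
import string


def mark_nodes(tokens, node_positions):
    """Wrap the tokens at node_positions (0-based) as [token]A, [token]B, ..."""
    letters = {pos: string.ascii_uppercase[i] for i, pos in enumerate(node_positions)}
    marked = [
        f"[{token}]{letters[i]}" if i in letters else token
        for i, token in enumerate(tokens)
    ]
    return " ".join(marked)


tokens = ["We", "went", "to", "see", "the", "new", "trains", "."]
print(mark_nodes(tokens, [4, 5, 6]))
```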

Threshold settings

--frequency_threshold

Value: <minimum number of tree occurrences in the input treebank>

To limit the number of trees in the output file, the optional --frequency_threshold parameter can be used to restrict the extraction to trees occurring at least a given number of times, by specifying the minimum absolute frequency of the tree in the treebank (e.g. 5 to limit the search to trees occurring 5 or more times).

--max_lines

Value: <maximum number of lines in the output file>

Similarly, the optional --max_lines parameter defines the maximum number of trees (lines) in the output file, which gives a frequency-ranked list of trees. For example, value 100 returns only the top-100 most frequent trees matching the input criteria.
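The combined effect of the two threshold settings on a frequency list can be sketched as follows. The function `apply_thresholds` and the counts are made up for illustration and are not taken from the tool.

```python
def apply_thresholds(tree_counts, frequency_threshold=None, max_lines=None):
    """Rank trees by descending frequency, then apply both optional limits.

    tree_counts: dict mapping a tree description to its absolute frequency.
    """
    ranked = sorted(tree_counts.items(), key=lambda item: item[1], reverse=True)
    if frequency_threshold is not None:
        ranked = [(tree, freq) for tree, freq in ranked if freq >= frequency_threshold]
    if max_lines is not None:
        ranked = ranked[:max_lines]
    return ranked


counts = {"ADJ <amod NOUN": 10, "DET <det NOUN": 5, "NOUN <nsubj VERB": 4, "VERB >obj NOUN": 7}
print(apply_thresholds(counts, frequency_threshold=5, max_lines=2))
```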