Skip to content
This repository has been archived by the owner on Aug 16, 2023. It is now read-only.

Latest commit

 

History

History
19 lines (8 loc) · 2.45 KB

README.md

File metadata and controls

19 lines (8 loc) · 2.45 KB

Reordering of ambiguous morphological analyses

Problem: Even after disambiguation, some morphological analyses in Vabamorf's output remain ambiguous. These ambiguous analyses are not sorted by probabilities. So, a commonly used strategy -- picking the first analysis in case of ambiguity -- is not a good one on Vabamorf's plain output, as the first analysis may not be the most likely one.

Solution: We analyse manually annotated corpus (Estonian UD treebank) and collect frequencies of correct variants of ambiguous morphological analyses. As a result, we get lexicons that can be used for reordering ambiguous analyses (by likelihoods) with the help of MorphAnalysisReorderer (see this tutorial for details). This repository contains the source for creating the lexicons and evaluating their reordering performance.

Requirements: EstNLTK version 1.6.5+

Processing steps

  • 01_convert_ud_corpus_to_vm.ipynb -- converts Estonian UD treebank's data from UD format to Vabamorf's format. Saves results as EstNLTK's JSON files into folder 'UD_converted';

  • 02_word_to_analyses_freq_lexicons.ipynb -- re-annotates data with EstNLTK's Vabamorf, finds matches between automatic analyses and gold standard ones, and creates word_to_analyses frequency lexicons. A word_to_analyses lexicon shows for each ambiguous word, how many times each of its analysis was a correct one (according to manual annotations). Evaluates different word_to_analyses frequency lexicons on the test set, and examines how much the reordeing improves chances of getting the first analysis as a correct one;

  • 03_category_freq_lexicons.ipynb -- re-annotates data with EstNLTK's Vabamorf, finds matches between automatic analyses and gold standard ones, and creates partofspeech and form frequency lexicons. These lexicons show for each partofspeech and form category, how frequently the category appeared as a correct category of ambiguous words. Evaluates category frequency lexicons (in combination with the best word_to_analyses frequency lexicon) on the test set, and examines how much reordeings improve chances of getting the first analysis as a correct one;