A tool for Thesaurus Extension using Label Propagation methods. From a text corpus and an existing thesaurus, it generates suggestions for extending the existing synonym sets. This tool was developed during the Master's Thesis "Label Propagation for Tax Law Thesaurus Extension" at the Chair "Software Engineering for Business Information Systems (sebis)", Technical University of Munich (TUM).
Thesis Abstract. With the rise of digitalization, information retrieval has to cope with increasing amounts of digitized content. Legal content providers invest heavily in building domain-specific ontologies such as thesauri in order to retrieve a significantly larger number of relevant documents. Label propagation is a family of graph-based semi-supervised machine learning algorithms; since 2002, many label propagation methods have been developed, e.g. to identify groups of similar nodes in graphs. In this thesis, we test the suitability of label propagation methods for extending a thesaurus from the tax law domain. The graph on which label propagation operates is a similarity graph constructed from word embeddings. We cover the process end to end and conduct several parameter studies to understand the impact of certain hyper-parameters on the overall performance. The results are then evaluated in manual studies and compared with a baseline approach.
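For readers unfamiliar with the technique, the core label propagation iteration can be sketched in a few lines of NumPy. This is a generic label-spreading scheme with clamping of the known labels, not the exact method evaluated in the thesis, and the matrices below are toy values:

```python
import numpy as np

def propagate_labels(W, y, n_iter=50):
    """Iteratively spread labels over a similarity graph.

    W : (n, n) symmetric similarity matrix (the graph's edge weights)
    y : (n, k) initial label matrix; unlabeled nodes are all-zero rows
    Returns an (n, k) matrix of soft label scores after propagation.
    """
    # Row-normalize W into a transition matrix
    P = W / W.sum(axis=1, keepdims=True)
    labeled = y.sum(axis=1) > 0
    F = y.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F                # spread labels along the edges
        F[labeled] = y[labeled]  # clamp the nodes whose labels are known
    return F

# Toy graph: node 1 is unlabeled but strongly connected to node 0,
# so it should end up with node 0's label (class 0).
W = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
y = np.array([[1.0, 0.0],   # node 0: class 0
              [0.0, 0.0],   # node 1: unlabeled
              [0.0, 1.0]])  # node 2: class 1
F = propagate_labels(W, y)
```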
The tool was implemented using a pipes-and-filters architecture.
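Conceptually, a pipes-and-filters pipeline threads each phase's output into the next phase. A minimal sketch of this idea (the phase functions here are toy stand-ins, not the project's real phases from `phase1.py` etc.):

```python
from functools import reduce

def run_pipeline(data, phases):
    """Feed the output of each phase (filter) into the next one (pipe)."""
    return reduce(lambda result, phase: phase(result), phases, data)

# Toy stand-in phases; the real pipeline's phases are separate scripts.
phases = [
    lambda corpus: [t.lower() for t in corpus],  # normalize tokens
    lambda tokens: sorted(set(tokens)),          # deduplicate
]
result = run_pipeline(["Steuer", "Recht", "steuer"], phases)  # ['recht', 'steuer']
```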
- Install `pipenv` (Installation Guide).
- Install the project's requirements with `pipenv install`.
- Input: This tool expects a set of text corpus files in `data/RW40jsons` and the thesaurus in `data/german_relat_pretty-20180605.json`. See `phase1.py` and `phase4.py` for information about the expected file formats.
- Start: See `sample_commands.md` for examples of how to run the pipeline with a set of hyper-parameters. The default hyper-parameter set is given in `base_config.py`; the hyper-parameters can be overwritten when running a start script.
- Output: In a run, the output of each individual phase is stored in `output/<PHASE_FOLDER>/<DATE>`. Most important are `08_propagation_evaluation` and `XX_runs`. In `08_propagation_evaluation`, the evaluation statistics are stored as `stats.json`, together with a table containing the predictions, training set and test set (`main.txt`, most often referred to as `df_evaluation` in the other scripts). In `XX_runs`, a run's log is stored. If multiple runs were triggered via `multi_runs.py` (each with a different training/test set), the combined statistics of all individual runs are additionally stored as `all_stats.json`.
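As an illustration, a run's statistics can be read back with a few lines of Python. The folder layout follows the description above; the keys inside `stats.json` are not documented here, so this sketch simply returns the parsed dictionary:

```python
import json
from pathlib import Path

def load_stats(run_dir):
    """Read a run's evaluation statistics (stats.json) from an output folder.

    run_dir would typically be something like
    output/08_propagation_evaluation/<DATE> -- the exact keys inside
    stats.json depend on the run, so the parsed dict is returned as-is.
    """
    return json.loads((Path(run_dir) / "stats.json").read_text())
```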
Via `purew2v_parameter_studies.py`, the synset vector baseline that we introduced in our thesis can be executed. It requires a set of word embeddings and one or more thesaurus training/test splits. See `sample_commands.md` for an example.
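One common way to build such a baseline, and our reading of the synset vector idea, is to represent each synonym set by the centroid of its members' embedding vectors and rank candidate terms by cosine similarity to that centroid. The following sketch uses made-up vectors and is not the code from `purew2v_parameter_studies.py`:

```python
import numpy as np

def synset_vector(embeddings, synset):
    """Centroid of the member words' embedding vectors."""
    return np.mean([embeddings[w] for w in synset], axis=0)

def rank_candidates(embeddings, synset, candidates):
    """Rank candidate terms by cosine similarity to the synset vector."""
    center = synset_vector(embeddings, synset)
    def cosine(v):
        return float(v @ center / (np.linalg.norm(v) * np.linalg.norm(center)))
    return sorted(candidates, key=lambda w: cosine(embeddings[w]), reverse=True)

# Made-up 2-d embeddings: "d" points in nearly the same direction as the
# synset {"a", "b"}, so it should be ranked before "c".
emb = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]),
       "c": np.array([0.0, 1.0]), "d": np.array([0.8, 0.2])}
ranking = rank_candidates(emb, ["a", "b"], ["c", "d"])  # ['d', 'c']
```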
In `ipynbs`, we provide some exemplary Jupyter notebooks that were used to generate (a) statistics, (b) diagrams, and (c) the Excel files for the manual evaluations. You can explore them by running `pipenv shell` and then starting Jupyter with `jupyter notebook`.
- For GloVe embeddings to work, GloVe needs to be installed and built manually. The build path must then either be specified in `base_config.py` or passed as a parameter when calling `main.py` or `multi_run.py`.
- This tool was developed for use on Ubuntu 18.04.1 LTS (Bionic Beaver) and macOS 10.14 (Mojave).
- In case you have any questions, feel free to contact the authors or open an issue!