thesis

Repository for my Master's thesis, The effects of word segmentation quality on word alignments. The thesis PDF can be found here. This repository covers the following:

Datasets: English-German, English-Romanian, English-Hindi. To add any other dataset, create a folder named after the language pair in data/input and place in it one txt file per language, named in the format 'eng_with_X.txt' where X is the number of sentences (e.g. 10k), together with the gold standard. See the examples in data/input.
Alignment models: Fastalign, Eflomal
Sampling methods: Dropout
Tokenization: space mode, no space mode

These parameters and others can be set in settings.py.
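
The dataset layout described above can be checked before a run. The following is a minimal sketch, not part of the repository: the helper names expected_input_files and missing_input_files are hypothetical, and the gold standard file name is assumed to follow the eng_deu.gold pattern shown in the project structure.

```python
from pathlib import Path
from typing import List


def expected_input_files(src: str, tgt: str, size: str,
                         root: str = "data/input") -> List[Path]:
    """List the files a language pair folder is expected to contain.

    Assumption: the folder is named '<src>-<tgt>' and the gold standard
    is named '<src>_<tgt>.gold', as in the eng-deu example.
    """
    pair_dir = Path(root) / f"{src}-{tgt}"
    return [
        pair_dir / f"{src}_with_{size}.txt",   # source sentences
        pair_dir / f"{tgt}_with_{size}.txt",   # target sentences
        pair_dir / f"{src}_{tgt}.gold",        # gold standard alignments
    ]


def missing_input_files(src: str, tgt: str, size: str,
                        root: str = "data/input") -> List[Path]:
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_input_files(src, tgt, size, root)
            if not p.is_file()]
```

For example, missing_input_files("eng", "ron", "10k") returns an empty list when the English-Romanian folder is complete.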

Installation and run

Fastalign installation

sudo apt-get install libgoogle-perftools-dev libsparsehash-dev
cd /path/to/project
mkdir tools
cd tools
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build
cd build
sudo apt install cmake
cmake ..
make

Eflomal installation

cd /path/to/project/tools
git clone https://github.com/robertostling/eflomal.git
cd eflomal
make
sudo make install
python3 setup.py install

Install dependencies

pip install -r requirements.txt

Set your desired parameters in settings.py. To run the whole pipeline:

./run.sh

If you get an error like /bin/bash^M: bad interpreter: No such file or directory, the script has Windows (CRLF) line endings. Strip them and run it again:

sed -i -e 's/\r$//' run.sh # https://stackoverflow.com/questions/14219092/bash-script-and-bin-bashm-bad-interpreter-no-such-file-or-directory
./run.sh
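
On systems where sed -i is unavailable or behaves differently, the same cleanup can be done in Python. This is a minimal sketch that mirrors the sed expression above; the function name strip_carriage_returns is hypothetical and not part of the repository.

```python
import re
from pathlib import Path


def strip_carriage_returns(path: str) -> int:
    """Remove a carriage return at the end of each line, in place.

    Returns the number of carriage returns removed, so 0 means the
    file already had Unix line endings and was left untouched.
    """
    file = Path(path)
    data = file.read_bytes()
    # With MULTILINE, '$' matches before every newline and at end of file,
    # which is the same anchor the sed expression 's/\r$//' uses per line.
    cleaned, count = re.subn(rb"\r$", b"", data, flags=re.MULTILINE)
    if count:
        file.write_bytes(cleaned)
    return count
```

Calling strip_carriage_returns("run.sh") overwrites the file contents in place, so the executable bit on the script is kept.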

Project structure

.
├── data
│   ├── input
│   │   ├── eng-deu
│   │   │   ├── eng_with_10k.txt   # input txt file with 10k English sentences
│   │   │   ├── deu_with_10k.txt
│   │   │   ├── eng_deu.gold       # gold standard alignments
│   │   │   ├── eng.model          # merge list for English, space mode
│   │   │   ├── deu.model
│   │   │   ├── eng_ns.model       # merge list for English, no space mode
│   │   │   └── deu_ns.model
│   │   ├── eng-ron
│   │   └── eng-hin
│   ├── normal_bpe
│   │   ├── segmentations      # files obtained by applying BPE to corpus
│   │   │   └── *.bpe
│   │   ├── fastalign          # files obtained from fastalign 
│   │   │   └── *.wgdfa
│   │   └── eflomal            # files obtained from eflomal 
│   │       └── *.wgdfa
│   └── dropout_bpe
│       ├── segmentations
│       │   └── *.bpe
│       ├── fastalign
│       │   └── *.wgdfa
│       └── eflomal
├── doc                        # LaTeX files for the writing of the thesis
│   ├── figures
│   ├── sections
│   └── *.tex files
├── reports
│   ├── scores_normal_bpe      # scores for BPE
│   │   └── *.csv, *.png
│   └── scores_dropout_bpe     # scores for BPE dropout space/no space, and depending on dropout rate
│       ├── space
│       │   ├── 0.1
│       │   └── 0.2
│       │       └── *.csv, *.png
│       └── no space
│           └── 0.1
│               └── *.csv, *.png
├── src                        # Python files
│   ├── learn_bpe.py
│   ├── apply_bpe.py
│   ├── extract_alignments.py
│   └── calc_align_score.py
├── tools                      # fastalign, eflomal installation directories
│   ├── fastalign
│   └── eflomal
├── .gitignore
├── README.md
├── requirements.txt
└── settings.py
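
The tree above suggests that the segmented corpus files are handed to the aligners. As a rough illustration of that hand-off, fast_align reads one sentence pair per line, with the source and target sentences separated by " ||| ". The sketch below is not the repository's code, and the function name to_fast_align_lines is hypothetical.

```python
from typing import Iterator, Sequence


def to_fast_align_lines(src_sentences: Sequence[str],
                        tgt_sentences: Sequence[str]) -> Iterator[str]:
    """Yield 'source ||| target' lines in the format fast_align reads.

    Sentences are paired by position, so both sides must have the same
    length; otherwise the output would drift out of sync with the gold
    standard, which is indexed by sentence.
    """
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target must have the same number of sentences")
    for src, tgt in zip(src_sentences, tgt_sentences):
        # Strip the trailing newline and surrounding whitespace of each sentence.
        yield f"{src.strip()} ||| {tgt.strip()}"
```

Writing the yielded lines to a text file, one per line, gives an input that can be passed to fast_align.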
