This repo contains scripts and metadata for the paper "Corpus-based dialectometry with topic models". For links to the original data, see the end of this README. There are two scripts: one for pre-processing your data into character n-grams and one for the actual topic modeling. For Morfessor segmentation, please refer to https://morfessor.readthedocs.io/en/latest/general.html.
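If you want to try Morfessor-based segmentation before the n-gramming step, the Morfessor 2.0 Python API can be used roughly as follows. This is only a minimal sketch based on the linked documentation, not part of this repo: the file name is a placeholder, and the calls should be checked against the current Morfessor docs.

```python
import morfessor

# Train a baseline Morfessor model on a plain-text corpus
# ('training_corpus.txt' is a placeholder file name).
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('training_corpus.txt'))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Segment an individual word into morphs.
segments, cost = model.viterbi_segment('dialectometry')
print(segments)
```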
- ngramming.py assumes your data is stored in a folder as .txt files. Run the script with
python3 ngramming.py your-corpus-name
This will result in four JSON files: words_corpus, bigram_corpus, trigram_corpus and fourgram_corpus.
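As a rough illustration of this preprocessing step, character n-grams can be extracted along the following lines. This is a minimal sketch, not the actual ngramming.py: it assumes a {document: list of tokens/n-grams} JSON layout, which the real script may structure differently.

```python
import json
import sys
from pathlib import Path

def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)] or [word]

def build_corpora(corpus_dir):
    """Read every .txt file in corpus_dir and collect words and n-grams per file."""
    corpora = {"words": {}, "bigram": {}, "trigram": {}, "fourgram": {}}
    for txt_file in Path(corpus_dir).glob("*.txt"):
        words = txt_file.read_text(encoding="utf-8").split()
        corpora["words"][txt_file.stem] = words
        for name, n in (("bigram", 2), ("trigram", 3), ("fourgram", 4)):
            corpora[name][txt_file.stem] = [g for w in words for g in char_ngrams(w, n)]
    return corpora

if __name__ == "__main__":
    # Write one JSON file per corpus type, mirroring the file names listed above.
    for name, data in build_corpora(sys.argv[1]).items():
        out_name = "words_corpus" if name == "words" else name + "_corpus"
        with open(out_name + ".json", "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)
```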
- dialectTopicModel.py assumes your data is stored in the aforementioned JSON files. Several arguments can be changed when running the model; two example calls are shown below, followed by a sketch of the general approach.
- 5-component NMF model of SKN on bigrams
dialectTopicModel.topic_model('skn', 'bigram', 'skn_bigram', 'nmf', 5, use_idf=True, norm='l2', sublinear=True)
- 2-component LDA model of Archimob on words
dialectTopicModel.topic_model('archimob', 'words', 'archimob_words', 'lda', 2, relevance=True, lambda_=0.2)
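For orientation, the general modeling approach can be sketched with scikit-learn as below. This is not the implementation in dialectTopicModel.py: the function name, the JSON layout, the fixed random_state, and the omission of the relevance/lambda_ options are all assumptions made for the sketch.

```python
import json
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def topic_model_sketch(corpus_file, method, n_components,
                       use_idf=True, norm="l2", sublinear=True):
    """Fit an NMF or LDA topic model on a {doc_id: [tokens]} JSON corpus."""
    with open(corpus_file, encoding="utf-8") as f:
        corpus = json.load(f)
    docs = [" ".join(tokens) for tokens in corpus.values()]

    # Treat every whitespace-separated token (e.g. a character n-gram) as a term.
    token_pattern = r"\S+"

    if method == "nmf":
        # NMF is typically fit on TF-IDF weighted features.
        vectorizer = TfidfVectorizer(use_idf=use_idf, norm=norm,
                                     sublinear_tf=sublinear,
                                     token_pattern=token_pattern)
        model = NMF(n_components=n_components, random_state=0)
    else:
        # LDA expects raw term counts.
        vectorizer = CountVectorizer(token_pattern=token_pattern)
        model = LatentDirichletAllocation(n_components=n_components,
                                          random_state=0)

    doc_term = vectorizer.fit_transform(docs)
    doc_topic = model.fit_transform(doc_term)   # documents x components
    return doc_topic, model.components_, vectorizer.get_feature_names_out()

# Roughly analogous to the 5-component NMF-on-bigrams call above:
# topic_model_sketch("bigram_corpus.json", "nmf", 5)
```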
Original data sources:
- Samples of Spoken Finnish: https://korp.csc.fi/download/SKN/skn-vrt/
- Norwegian Dialect Corpus: http://www.tekstlab.uio.no/scandiasyn/download.html
- Archimob Corpus of Swiss German: https://spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html