For more details on NOW corpus & HPC usage, please refer to this documentation.
The following files can be accessed via `/RelMod/model_input/`
in TU Delft Webdata, unless otherwise stated.
- `large_model_input_gs1.pkl` - 21.98 GB. It can be downloaded here. This link is from the LARN paper.
- `larn_data_smaller.pt` - 11.45 GB. The same contents as `large_model_input_gs.pkl`, but the data type is 32-bit. You can choose to use this one or the previous one, depending on how much memory you have.
- `target_word_ix_counter.pk` - 234 KB. Counts of each unique word in the corpus.
- `LARN-us-ru-4tp.pk` - 719.1 MB. Topic-related articles that are used as input in `LARN.py`.
- `trained-model*` - outputs from the ARM model.
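The size difference between the two model-input files (21.98 GB vs. 11.45 GB) roughly matches what you would expect from halving the element width from 64-bit to 32-bit. A quick NumPy sketch (the array shape here is arbitrary, not taken from the actual files):

```python
import numpy as np

# A 64-bit array and its 32-bit copy: same values, half the bytes.
a64 = np.zeros(1_000, dtype=np.float64)
a32 = a64.astype(np.float32)

print(a64.nbytes)  # 8000
print(a32.nbytes)  # 4000
```

So pick the `.pt` file if memory is tight, at the cost of reduced numeric precision.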
To train the model, you can run the following command line. To run the model on the HPC, please use the sbatch file `ARM.sbatch`
in the folder `ARM/sbatch/`.
python ARM.py
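For reference, an sbatch wrapper for this command typically looks like the sketch below. The job name, time limit, and memory request are placeholders; the actual `ARM.sbatch` in `ARM/sbatch/` may request different resources:

```shell
#!/bin/bash
#SBATCH --job-name=ARM
#SBATCH --time=04:00:00
#SBATCH --mem=32G

# Run the training script on the allocated node.
python ARM.py
```

Submit it with `sbatch ARM/sbatch/ARM.sbatch`.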
To produce the temporal trend plot, you can run the following command line. To run the model on the HPC, please use the sbatch file `tmp-trend.sbatch`
in the folder `ARM/sbatch/`.
python TemporalTrend.py
`ARM/`
- `Topic/`
  - `info*` - outputs from BERTopic. There are 10 clusters of topics per pair of countries, and each cluster has 10 keywords. This file contains the c-TF-IDF score of each keyword.
  - `prob*` - outputs from BERTopic. They are the probabilities of the assigned topic per document.
  - `embeddings*` - topic embeddings, obtained by multiplying the document embeddings (generated by a sentence transformer with the pretrained model `all-mpnet-base-v2`) with `prob*`.
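The multiplication that produces the topic embeddings can be sketched as follows. The shapes are assumptions, not a literal copy of the notebook: `prob*` is taken to be an `(n_docs, n_topics)` matrix, and `all-mpnet-base-v2` produces 768-dimensional sentence embeddings.

```python
import numpy as np

# Random stand-ins for the real BERTopic probabilities and document
# embeddings; only the shapes matter for this sketch.
rng = np.random.default_rng(0)
prob = rng.random((5, 10))
prob /= prob.sum(axis=1, keepdims=True)  # normalize rows to probabilities
doc_emb = rng.standard_normal((5, 768))

# One embedding per topic: a probability-weighted sum of document embeddings.
topic_emb = prob.T @ doc_emb
print(topic_emb.shape)  # (10, 768)
```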
- `sbatch/` - the sbatch files that can be submitted to the HPC. Please refer to the documentation if you need more information on the HPC.
- `plot/` - temporal plots for 4 topics for US-RU, created by `TemporalTrend.py`.
- `TopicEmbedding.ipynb` - the script used to create `info*`, `prob*`, and `embeddings*`.
- `ARM.py` - trains the model.
- `TemporalTrend.py` - produces the temporal trend per topic. The output plots are in the `plot/` folder in this GitHub repository.
- `constants.py` - defines the constants that are used in the other scripts. You can change the input data directories here.
- `modules.py` - contains the modules of ARM.
- `preprocessing.py` - is supposed to preprocess the raw textual data from May 2019 to April 2022 (which can be downloaded from TU Delft Webdata), but I have not tried this script yet. This script is also from the LARN paper.
- `utils.py` - contains some common functions.
`LARN-all-data/`
- contains the LARN model, which can be run with all data (i.e., the original model from the paper).

`LARN/`
- contains the LARN model, which can be run with topic-related articles.
- `FilterArticle.py` - selects the articles that contain at least one keyword from each topic.
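A minimal sketch of the kind of filtering `FilterArticle.py` performs. The topic keywords, sample articles, and function names here are made up for illustration; the real keywords come from the BERTopic clusters:

```python
# Hypothetical topic keywords (the real ones come from the BERTopic output).
topics = {
    "energy": ["gas", "pipeline", "oil"],
    "sanctions": ["sanction", "embargo"],
}

def matches_topic(article, keywords):
    """True if the article contains at least one of the topic's keywords."""
    text = article.lower()
    return any(kw in text for kw in keywords)

articles = [
    "New pipeline deal signed this week.",
    "Weather report for Tuesday.",
]

# Keep, per topic, only the articles that mention at least one keyword.
selected = {
    topic: [a for a in articles if matches_topic(a, kws)]
    for topic, kws in topics.items()
}
print(selected["energy"])  # ['New pipeline deal signed this week.']
```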