PhD-project

For more details on the NOW corpus and HPC usage, please refer to this documentation.

Data

The following files can be accessed via /RelMod/model_input/ in TU Delft Webdata, unless otherwise stated.

  • large_model_input_gs1.pkl - 21.98 GB. It can be downloaded here. The link comes from the LARN paper.
  • larn_data_smaller.pt - 11.45 GB. The same contents as large_model_input_gs.pkl, but stored as 32-bit data. Use either this file or the previous one, depending on how much memory you have.
  • target_word_ix_counter.pk - 234 KB. Counts of each unique word in the corpus.
  • LARN-us-ru-4tp.pk - 719.1 MB. Topic-related articles used as input to LARN.py.
  • trained-model* - outputs from the ARM model.
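
As a rough illustration of how the word-count file might be inspected, the sketch below loads a pickled counter and lists the most frequent entries. It assumes the file holds a dict-like mapping from word index to count; the actual layout of target_word_ix_counter.pk is not documented here, and the helper name load_word_counts and the stand-in data are hypothetical.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def load_word_counts(path):
    """Load a pickled mapping of word index -> count and return it as a Counter."""
    with open(path, "rb") as fh:
        data = pickle.load(fh)
    return Counter(data)


# Stand-in file so the sketch runs without the real data.
# Assumption: the real file holds a dict-like {word_index: count}.
demo_path = Path(tempfile.mkdtemp()) / "target_word_ix_counter.pk"
with open(demo_path, "wb") as fh:
    pickle.dump({0: 120, 1: 45, 2: 45, 3: 7}, fh)

counts = load_word_counts(demo_path)
total = sum(counts.values())      # total number of counted tokens
top_two = counts.most_common(2)   # the two most frequent word indices
print(total, top_two)
```

To try this on the real file, replace demo_path with the path of the downloaded copy. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.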

Command Line Arguments

To train the model, run the following command. To run the model on HPC, use the sbatch file ARM.sbatch in the folder ARM/sbatch/.

python ARM.py

To produce the temporal trend plot, run the following command. To run it on HPC, use the sbatch file tmp-trend.sbatch in the folder ARM/sbatch/.

python TemporalTrend.py

Files and Directories

  • ARM/
    • Topic/
      • info* - outputs from BERTopic. There are 10 topic clusters per country pair, and each cluster has 10 keywords. These files contain the c-TF-IDF score of each keyword.
      • prob* - outputs from BERTopic. The probability of the assigned topic for each document.
      • embeddings* - topic embeddings obtained by multiplying the document embeddings (generated by a sentence transformer with the pretrained model all-mpnet-base-v2) by prob*.
    • sbatch/ - the sbatch files that can be submitted to HPC. Please refer to the documentation if you need more information on HPC.
    • plot/ - temporal plots of 4 topics for US-RU, created by TemporalTrend.py.
    • TopicEmbedding.ipynb - the script used to create info*, prob* and embeddings*.
    • ARM.py - trains the model.
    • TemporalTrend.py - produces the temporal trend per topic. The output plots are in the plot/ folder of this GitHub repository.
    • constants.py - defines the constants used in the other scripts. You can change the input data directories here.
    • modules.py - contains the modules of ARM.
    • preprocessing.py - intended to preprocess the raw textual data from May 2019 to April 2022 (which can be downloaded from TU Delft Webdata). I have not tried this script yet. It also comes from the LARN paper.
    • utils.py - contains some common functions.
  • LARN-all-data/ contains the LARN model that can be run with all data (i.e., the original model from the paper).
  • LARN/ contains the LARN model that can be run with topic-related articles.
    • FilterArticle.py - selects the articles that contain at least one keyword from each topic.
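
The filtering rule described for FilterArticle.py can be sketched as follows. This is a minimal illustration that reads the description literally (an article is kept only if it mentions at least one keyword from every topic); it is not the code in the repository. The function names, the whitespace tokenization, and the example topics and articles are all assumptions made for the sketch.

```python
def contains_keyword(text, keywords):
    """True if any keyword occurs as a whole token in the lower-cased text."""
    tokens = set(text.lower().split())
    return any(kw.lower() in tokens for kw in keywords)


def filter_articles(articles, topic_keywords):
    """Keep articles that mention at least one keyword from every topic."""
    return [
        art for art in articles
        if all(contains_keyword(art, kws) for kws in topic_keywords.values())
    ]


# Hypothetical topics and articles, for illustration only.
topic_keywords = {
    "energy": ["gas", "pipeline"],
    "security": ["sanctions", "treaty"],
}
articles = [
    "new sanctions target the gas sector",
    "the pipeline opened on schedule",
    "treaty talks resume next week",
]
selected = filter_articles(articles, topic_keywords)
print(selected)
```

If the intended rule is instead "at least one keyword from any topic", replacing all with any in filter_articles gives that looser behaviour.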

ARM model structure