# memo-canonical-novels 📚

This repository contains the code used to produce the embeddings, plots, and results for our paper:

"Canonical Status and Literary Influence: A Comparative Study of Danish Novels from the Modern Breakthrough (1870–1900)", presented at NLP4DH, co-located with EMNLP 2024.

## Useful directions 📌

- `memo_canonical_novels/`: the main folder, containing the source code for the project; this is where you will find the Makefile used to create embeddings
- `notebooks/`: the notebooks used for the analysis; `analysis.py` is the main notebook, `tfidf_comparison.py` compares the embeddings with TF-IDF, and the other notebooks contain sanity checks
- `figures/`: the figures generated by the notebooks
- `data/`: the saved embeddings (`.json`) used for the analysis; generated embeddings are also written here (see the loading sketch below)
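
The saved embedding files in `data/` can be inspected directly. Below is a minimal sketch, assuming a hypothetical file name (`data/embeddings.json`) and assuming the JSON maps novel identifiers to embedding vectors; check the actual files in `data/` for the real names and layout:

```python
import json

import numpy as np

# Hypothetical file name: use one of the actual .json files saved in data/
with open("data/embeddings.json", "r", encoding="utf-8") as f:
    raw = json.load(f)

# Assumed layout: {novel_id: [float, float, ...], ...}
vectors = {novel_id: np.asarray(vec, dtype=np.float32) for novel_id, vec in raw.items()}

first_id = next(iter(vectors))
print(first_id, vectors[first_id].shape)
```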

## Data & paper 📝

The dataset used is available on Hugging Face.
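
As a minimal sketch, the dataset could be pulled from the Hub with the `datasets` library; the identifier below is a placeholder, not the actual dataset path:

```python
from datasets import load_dataset

# Placeholder identifier: replace with the dataset's actual path on the Hugging Face Hub
dataset = load_dataset("ORG_NAME/DATASET_NAME")
print(dataset)
```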

Please cite our paper if you use the code or the embeddings:

```bibtex
@inproceedings{feldkamp-etal-2024-canonical,
    title = "Canonical Status and Literary Influence: A Comparative Study of {D}anish Novels from the Modern Breakthrough (1870{--}1900)",
    author = "Feldkamp, Pascale  and
      Lassche, Alie  and
      Kostkan, Jan  and
      Kardos, M{\'a}rton  and
      Enevoldsen, Kenneth  and
      Baunvig, Katrine  and
      Nielbo, Kristoffer",
    editor = {H{\"a}m{\"a}l{\"a}inen, Mika  and
      {\"O}hman, Emily  and
      Miyagawa, So  and
      Alnajjar, Khalid  and
      Bizzoni, Yuri},
    booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
    month = nov,
    year = "2024",
    address = "Miami, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.nlp4dh-1.14",
    pages = "140--155"
}
```

## Project Organization 🏗️

```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── notebooks          <- Jupyter notebooks.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         memo_canonical_novels and configuration for tools like black
│
├── figures            <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── src                <- Source code for use in this project, e.g. for making embeddings.
    │
    ├── __init__.py             <- Makes memo_canonical_novels a Python module
    │
    ├── config.py               <- Stores useful variables and configuration
    │
    ├── dataset.py              <- Scripts to download or generate data
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py          <- Code to run model inference with trained models
    │   └── train.py            <- Code to train models
    │
    ├── pooling.py              <- Code to create average embeddings from raw embeddings
    │
    └── plots.py                <- Code to create visualizations
```
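
`pooling.py` creates average embeddings from raw embeddings (presumably one pooled vector per novel). The snippet below is a minimal sketch of such mean pooling, assuming the raw embeddings are a list of per-chunk vectors for a single novel; it is not the repository's actual implementation:

```python
import numpy as np

def mean_pool(chunk_embeddings: list[list[float]]) -> np.ndarray:
    """Average a novel's raw (chunk-level) embeddings into one document vector."""
    return np.asarray(chunk_embeddings, dtype=np.float32).mean(axis=0)

# Toy example: three 4-dimensional chunk embeddings for a single novel
doc_vector = mean_pool([
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.1, 0.0, 0.1],
    [0.2, 0.3, 0.2, 0.3],
])
print(doc_vector.shape)  # (4,)
```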