
Would you like to post a simple example notebook? #1

Open
sharon-gao opened this issue Oct 23, 2020 · 4 comments

@sharon-gao

Hi,

Thanks for releasing your code in this repository! It would be exciting to try the neural topic models with Knowledge Distillation.

I was wondering whether you could provide a simple example as guidance, perhaps as a Jupyter notebook, so that I know which file to run first.

Best wishes

@ahoho
Owner

ahoho commented Oct 26, 2020

Thanks for your interest in our work! We intend to put together a fuller how-to in the future, but for the time being, here's a rough outline of the necessary steps. Please let us know if you run into problems---I'll compile these into a readme once we're sure this all works.

  1. As of now, you'll need two conda environments to run both the BERT teacher and topic modeling student (which is a modification of Scholar). The environment files are defined in teacher/teacher.yml and scholar/scholar.yml for the teacher and topic model, respectively. For example:
    conda env create -f teacher/teacher.yml
    (edit the first line in the yml file if you want to change the name of the resulting environment; the default is transformers28).
  2. We don't have a general-purpose data processing pipeline put together yet, but you can use the IMDb format as a guide (a rough sketch of the jsonlist format appears after these steps):
conda activate scholar
python data/imdb/download_imdb.py

# main preprocessing script
python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist
# create a dev split from the train data
python create_dev_split.py
  3. Run the teacher model. Below is what we used for IMDb:
conda activate transformers28

python teacher/bert_reconstruction.py \
    --input-dir ./data/imdb/processed-dev \
    --output-dir ./data/imdb/processed-dev/logits \
    --do-train \
    --evaluate-during-training \
    --truncate-dev-set-for-eval 120 \
    --logging-steps 200 \
    --save-steps 1000 \
    --num-train-epochs 6 \
    --seed 42 \
    --num-workers 4 \
    --batch-size 20 \
    --gradient-accumulation-steps 8 \
    --document-split-pooling mean-over-logits
  4. Collect the logits from the teacher model (the --checkpoint-folder-pattern argument accepts glob pattern matching, e.g. "checkpoint-*", in case you want to create logits for different stages of training; be sure to enclose the pattern in double quotes):
conda activate transformers28

python teacher/bert_reconstruction.py \
    --output-dir ./data/imdb/processed-dev/logits \
    --seed 42 \
    --num-workers 6 \
    --get-reps \
    --checkpoint-folder-pattern "checkpoint-9000" \
    --save-doc-logits \
    --no-dev
  5. Run the topic model. This is the messiest part of the code and we will be cleaning it up, but in the meantime, my apologies for all the extraneous/obscure arguments:
conda activate scholar

python scholar/run_scholar.py \
    ./data/imdb/processed-dev \
    --dev-metric npmi \
    -k 50 \
    --epochs 500 \
    --patience 500 \
    --batch-size 200 \
    --background-embeddings \
    --device 0 \
    --dev-prefix dev \
    -lr 0.002 \
    --alpha 0.5 \
    --eta-bn-anneal-step-const 0.25 \
    --doc-reps-dir ./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits \
    --use-doc-layer \
    --no-bow-reconstruction-loss \
    --doc-reconstruction-weight 0.5 \
    --doc-reconstruction-temp 1.0 \
    --doc-reconstruction-logit-clipping 10.0 \
    -o ./outputs/imdb
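
To make the expected data format a bit more concrete: each line of train.jsonlist should be a single JSON object. The exact field names are whatever download_imdb.py writes, so the record shown in the comment below is only a hypothetical illustration (a text field for the document, plus optional metadata such as rating):

# each line of train.jsonlist is one JSON document; a hypothetical record might look like:
head -n 1 data/imdb/train.jsonlist
# {"id": "train_0", "text": "A surprisingly moving film with a terrific lead performance.", "rating": 9}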

@sharon-gao
Author

Hi Alexander! When using python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist, what should I use for the --label or --label_dict option? I got the following error when using --label '1,2,3,4':
Traceback (most recent call last):
  File "../data/imdb/preprocess_data.py", line 661, in <module>
    main(sys.argv)
  File "../data/imdb/preprocess_data.py", line 181, in main
    preprocess_data(
  File "../data/imdb/preprocess_data.py", line 297, in preprocess_data
    for ids, tokens, labels in pool.imap(partial(_process_item, **kwargs), group):
  File "/usr/local/Cellar/[email protected]/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
KeyError: '1'

@ahoho
Owner

ahoho commented Oct 29, 2020

Hm, I thought we had it set up so that the labels were optional, but in the meantime I think you can run --labels rating to get it to create labels for the associated review score (you aren't obligated to use them later).
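
For concreteness, that would mean rerunning the preprocessing command from above with the flag added, something like the sketch below (assuming the flag is spelled --labels and that the IMDb jsonlist records contain a rating field; adjust to --label if that's what the script defines):

conda activate scholar

python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed \
    --vocab_size 5000 \
    --test data/imdb/test.jsonlist \
    --labels rating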

@Pranav-Goel should also be able to help with this. It's possible I'm missing something, or we just need to update the code.

@Pranav-Goel
Collaborator

Pranav-Goel commented Oct 31, 2020

Hi, sorry I am getting to this quite late. I think the preprocess script should be able to run without you having to provide any labels. Do you have labels that you want to provide?
If not, did you try running it without labels and get an error? If so, could you share that error message (i.e., from running without --label)?
