OpeNTF via NMT (OpenNMT) #243

Open
thangk opened this issue Jun 24, 2024 · 27 comments
Assignees
Labels
experiment Running a study or baseline for results

Comments

@thangk
Collaborator

thangk commented Jun 24, 2024

Tested dataset

data/preprocessed/dblp/dblp.v12.json.filtered.mt75.ts3

Input type

Sparse matrix

Command used

python -u main.py -data ../data/preprocessed/dblp/dblp.v12.json.filtered.mt75.ts3 -domain dblp -model nmt

Observations

The script ran through all 3 folds and produced results without errors, but it did not produce any predictions.

image

Next step(s)

  • Test with various parameters (optimization).
  • Test also with w2v input.
  • Test also with other datasets.
  • This test was run on a local lab PC. Also get it running on the Matrix server (as of this post, the same codebase does not run on the Matrix server without errors).
  • Ultimately look to dockerize the project.
@hosseinfani
Member

hosseinfani commented Jun 24, 2024

Hi @thangk
thanks for the progress log.

  • Just a quick note that you need to bring the prediction files and calculate the metrics we have in our codebase like precision, map, ndcg, ...

OpenNMT only gives you translation metrics such as perplexity (ppl), as seen in the image.

  • Also, schedule running nmt using gnn-based embeddings after w2v.
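
A minimal sketch of the ranking metrics mentioned above (precision, map, ndcg), computed from a ranked list of predicted members against the true team. This is not the evaluation code in our codebase; the function names and the toy member ids are assumptions for illustration only.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted members that belong to the true team."""
    return sum(1 for m in ranked[:k] if m in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i at which a true member appears."""
    hits, total = 0, 0.0
    for i, m in enumerate(ranked, start=1):
        if m in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, m in enumerate(ranked[:k], start=1)
        if m in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical prediction for one test team (made-up ids, not real results).
ranked = ["m7", "m2", "m9", "m4"]   # predicted members, best first
relevant = {"m2", "m4"}             # true members of the team

print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, relevant, 4))
```

map is then the mean of `average_precision` over all test teams.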

@jamil2388 please advise

@hosseinfani hosseinfani added the experiment Running a study or baseline for results label Jun 24, 2024
@jamil2388
Member

@hosseinfani, @thangk for now, I am putting a doc link here. This contains almost all sets of arguments used for onmt pipeline.

https://community.libretranslate.com/t/documentation-for-opennmt-py-parameters/927/

I think this argument on that page might help us dump the prediction files:
--dump_preds

I also advise Kap to learn about the behavior of the translation metrics used in the current runs. It will help a lot in understanding the model's train and test behavior, and eventually tell us the direction of adjustments.

Thanks!

@hosseinfani
Member

@jamil2388 thanks.

@thangk one more thing. When exploring hyperparameters, also see how you can use OpenNMT for different types of translators, because we need to study the effect of the translator on our work. These translators should be published in a paper so that we can cite them in ours. I think the OpenNMT community keeps updating their codebase to include more and more new translators, which helps you for our task (this is like @jamil2388 using different gnn methods from pyg for team formation).

@thangk
Collaborator Author

thangk commented Jun 28, 2024

Hi @hosseinfani,
I'll continue my question here if that's okay.

Continuing the conversation on whether or not to average all the folds' eval metrics to get one set of data for each epoch setting (i.e., 500, 1000):

I was referring to these. Each fold produces its own eval metrics. There's one more, fold2, below fold1, which isn't visible in the screenshot. I am thinking the right approach is to average the e500 and e1000 pairs across all 3 folds to put in the excel.

image

@thangk
Collaborator Author

thangk commented Jun 28, 2024

I saw some charts we've used in some papers, and I can see those papers use the average of the folds. I'll follow the same approach.
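
A minimal sketch of the averaging described above, assuming each fold's eval metrics have been loaded as a dict of metric name to value. The metric names and numbers are hypothetical placeholders, not results from these runs.

```python
from statistics import mean

def average_folds(fold_metrics):
    """Average every metric over the folds.

    fold_metrics is a list with one {metric_name: value} dict per fold;
    all folds are assumed to report the same metric names.
    """
    names = fold_metrics[0].keys()
    return {name: mean(fold[name] for fold in fold_metrics) for name in names}

# Hypothetical per-fold values for one epoch setting (e.g., e500).
e500 = [
    {"P_2": 0.010, "ndcg_cut_2": 0.020},  # fold0
    {"P_2": 0.020, "ndcg_cut_2": 0.030},  # fold1
    {"P_2": 0.030, "ndcg_cut_2": 0.040},  # fold2
]

avg = average_folds(e500)
print(avg)
```

The same call would be repeated for the e1000 values to get the second row for the spreadsheet.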

@hosseinfani
Member

hosseinfani commented Jun 28, 2024

Hi @thangk
thanks for bringing the conversation here :)

Now I see. There should be another file with no fold index, like test.epoch*, that includes the average of the folds.

But you're right about averaging the folds.

@thangk
Collaborator Author

thangk commented Jun 28, 2024

> There should be another file with no fold index, like test.epoch*, that includes the average of the folds.

Yes, I see one outside the fold folders.

image

@hosseinfani
Member

@thangk my preference is to keep the progress logs in issues like this one, rather than in Teams chats or elsewhere.

@thangk
Collaborator Author

thangk commented Jul 31, 2024

Yesterday, I ran three (Transformer, ConvS2S, RNN with attention) seq2seq-based models on the dblp (filtered) dataset and out of the three, only two (ConvS2S and RNN with attention) ran successfully with the baseline configs I've set.

Here are the first run results for ConvS2S (left) and RNN with attention (right)

image

It seems there are issues with the shape of the input in the transformer model. I'll dig into the issue.

image

@thangk
Collaborator Author

thangk commented Aug 8, 2024

This was the first run of all datasets using the ConvS2S model.

image

Hyperparameters:

word_vec_size: 128
cnn_size: 512
layers: 15
cnn_kernel_width: 3

encoder_type: cnn
decoder_type: cnn

optim: adam
learning_rate: 0.001
learning_rate_decay: 0.9
start_decay_steps: 50
decay_steps: 50
batch_size: 4
dropout: 0.5

@hosseinfani
Member

@thangk
can you add the results of pure bnn and fnn, with 1-hot skills as the input?

@thangk
Collaborator Author

thangk commented Aug 8, 2024

> @thangk can you add the results of pure bnn and fnn, with 1-hot skills as the input?

I was thinking of putting the best results from Jamil's FNN and BNN. Do you want me to put the pure FNN and BNN from Rad et al.'s paper?

@hosseinfani
Member

yes, I believe Jamil has reproduced the results already.

@thangk
Collaborator Author

thangk commented Aug 8, 2024

> yes, I believe Jamil has reproduced the results already.

yeah, he has the results for imdb and dblp. I'm gathering them for these tables.

@thangk
Collaborator Author

thangk commented Aug 9, 2024

@hosseinfani

This is what I currently have for imdb. I am working on dblp now. The transformer model isn't working quite right as it needs some more debugging.

dblp

ConvS2S
t99375.s29661.m14214.etcnn.l512.wv256.lr0.0005.b16.e1000
RNN
t99375.s29661.m14214.etrnn.l512.wv256.lr0.0005.b16.e1000

image

imdb

ConvS2S
t32059.s23.m2011.etcnn.l512.wv256.lr0.0005.b16.e1000
RNN
t32059.s23.m2011.etrnn.l512.wv256.lr0.0005.b16.e1000

image

hyperparameters for Run 3:

# ConvS2S
word_vec_size: 256
cnn_size: 512
layers: 10
cnn_kernel_width: 3

encoder_type: cnn
decoder_type: cnn

optim: adam
learning_rate: 0.0005
learning_rate_decay: 0.95
start_decay_steps: 100
decay_steps: 100
batch_size: 16
dropout: 0.4

# RNN
word_vec_size: 256
rnn_size: 512
layers: 2

encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM

optim: adam
learning_rate: 0.0005
learning_rate_decay: 0.95
start_decay_steps: 100
decay_steps: 100
batch_size: 16
dropout: 0.4

Edit: dblp results added.

@hosseinfani
Member

> How would I say this if I don't yet see a substantial performance improvement over the others? Since it's a first for team formation, I'm not sure I even have the best settings yet. I've tried a few, but they still aren't as good as the fnn or bnn values. Can I say it has the potential to be a viable option for team formation tasks but needs further research?

Hi @thangk

here is my reply:

Regarding the low performance of seq-2-seq: these models map one sentence to another, that is, the input space and the output space are each the size of a language's vocabulary (~100k tokens), and they keep the order between tokens. If they're not performing well, we need to find out why, and then how to change or customize them for our problem. For imdb, it makes sense because the input space is just 20-30 words, which have to be mapped to a large output space. So, one reason we can give is the sparsity of the source sequence/language. What else?
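
To make the input-space versus output-space point concrete, here is a minimal sketch that treats each team as a (source, target) sentence pair, with skills as source tokens and members as target tokens, and counts the two vocabularies. The `s`/`m` token prefixes, the function names, and the toy teams are assumptions for illustration; this is not necessarily how our nmt.py builds its inputs.

```python
def to_parallel_pair(skill_ids, member_ids):
    """Turn one team into a (source, target) pair of space-separated tokens."""
    src = " ".join(f"s{i}" for i in sorted(skill_ids))
    tgt = " ".join(f"m{i}" for i in sorted(member_ids))
    return src, tgt

def vocab_sizes(teams):
    """Count the distinct source (skill) and target (member) tokens."""
    skills = {s for team_skills, _ in teams for s in team_skills}
    members = {m for _, team_members in teams for m in team_members}
    return len(skills), len(members)

# Hypothetical teams: (set of skill ids, set of member ids).
teams = [({3, 1}, {10, 42}), ({1}, {42, 7, 99})]

for skills, members in teams:
    print(to_parallel_pair(skills, members))
print(vocab_sizes(teams))
```

When the first number returned by `vocab_sizes` is tiny compared with the second, the source "language" has very few words with which to distinguish many different target teams.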

@thangk
Collaborator Author

thangk commented Aug 9, 2024

> How would I say this if I don't yet see a substantial performance improvement over the others? Since it's a first for team formation, I'm not sure I even have the best settings yet. I've tried a few, but they still aren't as good as the fnn or bnn values. Can I say it has the potential to be a viable option for team formation tasks but needs further research?
>
> Hi @thangk
>
> here is my reply:
>
> Regarding the low performance of seq-2-seq: these models map one sentence to another, that is, the input space and the output space are each the size of a language's vocabulary (~100k tokens), and they keep the order between tokens. If they're not performing well, we need to find out why, and then how to change or customize them for our problem. For imdb, it makes sense because the input space is just 20-30 words, which have to be mapped to a large output space. So, one reason we can give is the sparsity of the source sequence/language. What else?

I see. I've also added the pure bnn, bnn_emb and rrn from other papers as the baselines. After doing this, my results aren't too far off, and some are even better than the baselines. So, this validates my statement in the abstract.

Still, I'm eager to find more optimized hyperparameters and will do so. In the meantime, I'll keep these data and work more on the write-up. I'm also running gith and uspt on both convs2s and rnn with the same hyperparameters.

@thangk
Collaborator Author

thangk commented Aug 9, 2024

results for gith and uspt with convs2s and rnn using the same hyperparameters as the other two datasets

image

image

@thangk
Collaborator Author

thangk commented Aug 15, 2024

I ran a bunch of tests today to see how the metrics respond to the hyperparameters

image

3 produced the best result so far, except on AUCROC (which I'll keep working on), and I'll run more tests starting from this result.

@thangk
Collaborator Author

thangk commented Aug 17, 2024

I noticed that we hardcoded the checkpoints to 500 in nmt.py even though we have a field for it in the config file. I was wondering why my latest run with a large epoch count was using so much space. I've commented the hardcoded value out now, so it shouldn't have this big space issue anymore.
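
A minimal sketch of the fix described above: read the checkpoint interval from the config and only fall back to the old value when the field is missing. It assumes the hardcoded 500 was a checkpoint interval in steps and that the config field is `save_checkpoint_steps`; the helper names are hypothetical and this is not the actual nmt.py code.

```python
DEFAULT_CHECKPOINT_STEPS = 500  # the previously hardcoded value (assumed to be an interval in steps)

def checkpoint_steps(config):
    """Prefer the interval from the config; fall back to the old default."""
    return int(config.get("save_checkpoint_steps", DEFAULT_CHECKPOINT_STEPS))

def num_checkpoints(train_steps, every):
    """How many checkpoints a run writes when saving once every `every` steps."""
    return train_steps // every

# Hypothetical config dict, e.g. loaded from the YAML config file.
config = {"train_steps": 200000, "save_checkpoint_steps": 5000}

every = checkpoint_steps(config)
print(every, num_checkpoints(config["train_steps"], every))
print(num_checkpoints(config["train_steps"], DEFAULT_CHECKPOINT_STEPS))
```

The two counts show why a long run with a small interval fills the disk. OpenNMT-py also documents a `keep_checkpoint` option for limiting how many checkpoints are kept, which may be worth checking for the installed version.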

image

It was using a lot of space

image

I'll delete this right away as soon as it's finished training. I've calculated how much more it'll take, and we have enough space to complete this training.

@thangk
Collaborator Author

thangk commented Aug 22, 2024

I was able to run the Transformer model with the following settings:

# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 8000
valid_steps: 5000
train_steps: 200000

# Batching
bucket_size: 10000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]

# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]

beam_size: 5
length_penalty: 1.0

And here's the result compared to the others:

image

@hosseinfani
Member

hosseinfani commented Aug 22, 2024

We need to make the comparison between the nmt models themselves and the bnn and fnn models. So, please do:

  • Run the nmt models with the same number of epochs, layers, and layer size as the bnn and fnn models. Most probably the results will be bad.
  • Then, increase one hyperparameter only, like the number of epochs (e.g., to 1000), for the nmt models and for bnn and fnn. Most probably the time/memory for bnn and fnn will be a lot, while the time/memory for nmt stays tractable. So, simply report that at the time we got the nmt models' results, the bnn and fnn models were still running.
  • Then, go ahead with the number of layers, ....

This way, we can argue that although we run the nmt models with more layers or epochs, which may give them an advantage over the bnn and fnn, the bnn and fnn cannot even make use of more epochs or layers within the same running time/memory.

@thangk let me know if you need more clarification.

@thangk
Collaborator Author

thangk commented Aug 22, 2024

@hosseinfani

Okay, I will redo the models with settings comparable to the FNN's and BNN's. Apparently, the models I've posted in the tables were trained by steps instead of epochs. I'll find the epoch values used for Jamil's FNN and BNN numbers in the table, do the math, and then rerun at the same number of epochs.

I'll update the table again shortly.

@thangk
Collaborator Author

thangk commented Aug 23, 2024

@hosseinfani

I was able to run the Transformer model as part of one of this week's tasks: finding one more architecture to include in the comparisons. The following results were run 2-3 days ago (before we had the discussion about making as many settings the same or similar as possible), which is why the settings aren't close. But they show that I was able to run one more model. I'll adjust the settings to be as close as I reasonably can for future comparisons.

Note: the epoch values also look strange because OpenNMT-py uses "steps" instead of epochs to determine the training cycles. I realized this after these tests and used the following formula to convert from steps to epochs, which is why the epoch values are unusual. I'll address this better in future tests.

Formula for steps to epochs:

Steps per epoch = Size of sample / Batch size
Num of epochs = Train steps / Steps per epoch
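
A minimal sketch of the conversion above, in both directions. It rounds steps per epoch up, on the assumption that the last partial batch still counts as a step, so it can differ slightly from the plain division in the formula. The function names and example numbers are hypothetical. With batch_type "tokens", batch_size is counted in tokens, so the sample size would also have to be counted in tokens, and the result is only approximate.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Optimizer steps needed to see every sample once (last partial batch included)."""
    return math.ceil(num_samples / batch_size)

def steps_to_epochs(train_steps, num_samples, batch_size):
    """Equivalent number of epochs for a run configured by train_steps."""
    return train_steps / steps_per_epoch(num_samples, batch_size)

def epochs_to_steps(epochs, num_samples, batch_size):
    """train_steps to configure so that a run covers the given number of epochs."""
    return epochs * steps_per_epoch(num_samples, batch_size)

# Hypothetical numbers: 1000 training samples, 16 samples per batch.
print(steps_per_epoch(1000, 16))
print(steps_to_epochs(6300, 1000, 16))
print(epochs_to_steps(10, 1000, 16))
```

`epochs_to_steps` is the direction needed to rerun the nmt models at the same epoch count as the FNN and BNN baselines.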

Transformer model, "gith1", on gith dataset

image

Transformer model, "imdb1", on imdb dataset

image

The CSV files are available in the OpeNTF - NMT folder on MS Teams.

The hyperparameter settings used for the above:

Transformer gith1

# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 4000
valid_steps: 5000
train_steps: 200000

# Batching
bucket_size: 10000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]

# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]

beam_size: 5
length_penalty: 1.0

Transformer imdb1

# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 8000
valid_steps: 5000
train_steps: 200000

# Batching
bucket_size: 5000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]

# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]

beam_size: 5
length_penalty: 1.0

These settings are slightly different to accommodate the difference in datasets (i.e., a larger dataset requires more train steps).
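
One visible difference between the two configs is warmup_steps (4000 for gith1, 8000 for imdb1). A minimal sketch of how that changes the learning rate under decay_method "noam", using the schedule from the original Transformer paper: linear warmup followed by inverse square-root decay, scaled by learning_rate. Treating hidden_size (256) as the model size is an assumption here, and the function name is hypothetical.

```python
def noam_lr(step, warmup_steps, model_size, base_lr=1.0):
    """Learning rate at `step`: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * model_size ** -0.5 * scale

# The rate peaks at step == warmup_steps.
peak_gith1 = noam_lr(4000, warmup_steps=4000, model_size=256)
peak_imdb1 = noam_lr(8000, warmup_steps=8000, model_size=256)

print(peak_gith1, peak_imdb1)
```

Under these assumptions the longer warmup reaches a lower peak rate and reaches it later, which is why learning_rate: 1 in the configs does not mean the optimizer ever steps with a rate of 1.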

Again, this is to show that a 3rd model is available for future comparisons. I'll continue tweaking the settings to bring them as close as possible to the baselines we'd be comparing against.

@hosseinfani
Member

@thangk
thanks for the update.
Just a quick note: please put the results of different datasets in different tables.

@thangk
Collaborator Author

thangk commented Nov 12, 2024

Update

The following are the latest results for the 3 models (t-teamrec, c-teamrec, r-teamrec) at the same settings. I'm working on finding the best settings for each model on each dataset, to be finished within the coming weeks (I will update with further information). I'll do the same for two new models which I am also looking to add to the research.

image
image
image
image

@hosseinfani
Member

@thangk thank you. Few notes:

  • We need a statistical significance test
  • We need more s2s variations
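
One possible way to run the significance test: a paired sign-flip permutation test on per-team metric values of two models evaluated on the same test teams. This is only a sketch; the choice of test, the function name, and the toy numbers are assumptions, and a paired t-test or Wilcoxon signed-rank test would be alternatives.

```python
import random

def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided p-value for the mean paired difference between a and b.

    a and b hold the same metric for the same test teams under two models.
    Under the null hypothesis the sign of each paired difference is arbitrary,
    so signs are flipped at random and the observed mean difference is
    compared against the resampled ones.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-team scores (made up, not results from these runs).
model_a = [i / 100 for i in range(20, 40)]
model_b = [x - 0.1 for x in model_a]

print(paired_permutation_test(model_a, model_b))
```

A small p-value would suggest that the difference between the two models is unlikely to be due to chance on these test teams.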


3 participants