Hello Fani Lab team,
I hope you are well! I wanted to use OpeNTF to run a baseline model. Ideally, your function would take the ids of my train/test sets (e.g., paper ids in the dblp data) as an input parameter, which would be very handy, but I can't find such an option.
I found in main.py that if I don't do a time split, the train/test split calls sklearn's train_test_split().
So my question is: do you ensure that people (e.g., authors in the dblp data) in the test set have appeared in the train set? If yes, could you please point me to the code where this is done? If not, could you please explain why it isn't needed?
Thank you very much,
Lingling
littlebeanbean7 changed the title from "How to ensure to ppl in same test set have been seen in the train set?" to "How to ensure to ppl in test set have been seen in the train set?" on Nov 15, 2024
Hi @littlebeanbean7,
There is no guarantee. The split is based on team instances, so there is a chance that an expert, or even all experts of a team, have not been seen during training.
However, there is a filtering step in our pipeline that removes sparse experts, that is, experts who have fewer than a minimum number of teams. This way, each remaining expert has at least that many teams in the dataset. Hence, when you split, there is a chance that the expert appears in some teams of both the train and test sets.
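The filtering idea described above can be illustrated with a minimal sketch. This is not OpeNTF's own code: the function name `filter_sparse_experts`, the `min_teams` parameter, and the toy `teams` dictionary (team id mapped to its experts) are all assumptions made for illustration.

```python
from collections import Counter

def filter_sparse_experts(teams, min_teams):
    """Hypothetical helper: remove experts seen in fewer than `min_teams` teams.

    `teams` maps a team id (e.g., a paper id) to a list of expert ids.
    Teams left with no experts after filtering are dropped.
    """
    # Count in how many teams each expert appears (once per team).
    counts = Counter(e for members in teams.values() for e in set(members))
    kept = {}
    for team_id, members in teams.items():
        dense = [e for e in members if counts[e] >= min_teams]
        if dense:
            kept[team_id] = dense
    return kept

# Toy data, made up for this example.
teams = {
    "p1": ["alice", "bob"],
    "p2": ["alice", "carol"],
    "p3": ["bob", "alice"],
    "p4": ["dave"],
}
filtered = filter_sparse_experts(teams, min_teams=2)
```

Here `carol` and `dave` each appear in only one team, so they are removed, and team `p4` disappears because it has no remaining experts. Filtering like this raises the odds that an expert lands on both sides of a split, but it does not guarantee it.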
Also, since the evaluation is n-fold and the final result is the average over n models, one trained per fold, the chance that an expert is zero-shot is even lower.
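To see whether a given split actually leaves some experts unseen, one option is to check it directly. The sketch below is an assumption-laden illustration, not part of OpeNTF: `split_team_ids` is a hypothetical stand-in for an instance-based random split, and `unseen_experts` is a hypothetical helper that lists test-set experts who never occur in any training team.

```python
import random

def split_team_ids(team_ids, test_ratio, seed):
    """Hypothetical instance-based split: shuffle team ids and cut off a test part."""
    ids = sorted(team_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_ratio))
    return ids[n_test:], ids[:n_test]  # (train ids, test ids)

def unseen_experts(teams, train_ids, test_ids):
    """Return the experts in test teams that appear in no training team."""
    seen = {e for t in train_ids for e in teams[t]}
    return {e for t in test_ids for e in teams[t]} - seen

# Toy data, made up for this example.
teams = {
    "p1": ["alice", "bob"],
    "p2": ["alice", "carol"],
    "p3": ["bob", "alice"],
    "p4": ["dave"],
}

# A random split over team instances, as in the discussion above.
train_ids, test_ids = split_team_ids(teams.keys(), test_ratio=0.25, seed=0)
zero_shot = unseen_experts(teams, train_ids, test_ids)

# A fixed split where two test experts are never seen in training.
fixed_zero_shot = unseen_experts(teams, ["p1", "p3"], ["p2", "p4"])
```

Running such a check once per fold would show how often zero-shot experts occur in practice for a particular dataset and filtering threshold.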
Thank you for your kind reply, @hosseinfani! I will add a chunk of code to main.py that loads my own train and test sets, to ensure a fair comparison with my experiment.
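A split driven by predefined ids, as proposed here, might look roughly like the following. This is only a sketch under stated assumptions: `split_by_ids` is a hypothetical helper, the `teams` dictionary is toy data, and none of it reflects how main.py is actually organized.

```python
def split_by_ids(teams, train_ids, test_ids):
    """Hypothetical helper: partition `teams` using caller-supplied team ids."""
    train_ids, test_ids = set(train_ids), set(test_ids)
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(f"ids in both train and test: {sorted(overlap)}")
    missing = (train_ids | test_ids) - teams.keys()
    if missing:
        raise ValueError(f"ids not found in the dataset: {sorted(missing)}")
    train = {t: m for t, m in teams.items() if t in train_ids}
    test = {t: m for t, m in teams.items() if t in test_ids}
    return train, test

# Toy data, made up for this example.
teams = {
    "p1": ["alice", "bob"],
    "p2": ["alice", "carol"],
    "p3": ["bob", "alice"],
}
train, test = split_by_ids(teams, train_ids=["p1", "p2"], test_ids=["p3"])
```

Rejecting overlapping or unknown ids up front keeps the comparison fair, since both systems are then guaranteed to see exactly the same train and test teams.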