
How to ensure to ppl in test set have been seen in the train set? #265

Open
littlebeanbean7 opened this issue Nov 15, 2024 · 2 comments

@littlebeanbean7

Hello Fani Lab team,

I hope you are well! I wanted to use OpeNTF to run baseline models. Ideally, it would be very handy if your function could take my train/test set's ids (e.g., paper ids in the dblp data) as an input parameter, but I don't find such an option.

I found in main.py that if I don't do a time-based split, the train/test split is done by calling sklearn's train_test_split().

Then, my question is: do you ensure that people (e.g., authors in the dblp data) in the test set have appeared in the train set? If yes, could you please point me to the code where you do this? If not, could you please explain why it isn't needed?
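To make the concern concrete, here is a toy check (my own sketch with made-up teams, not OpeNTF code): a plain random split over team instances, like train_test_split does, can leave some test-set experts completely unseen in training.

```python
# Toy illustration: randomly split team instances and check which test-set
# experts never appear in any training team (a shuffle split like sklearn's
# train_test_split; teams and expert ids below are made up).
import random

# Hypothetical teams: (team_id, set of expert ids).
teams = [
    (0, {"a", "b"}),
    (1, {"b", "c"}),
    (2, {"c", "d"}),
    (3, {"d", "e"}),
    (4, {"a", "e"}),
]

rng = random.Random(0)
shuffled = teams[:]
rng.shuffle(shuffled)

test_size = 2
test, train = shuffled[:test_size], shuffled[test_size:]

train_experts = set().union(*(m for _, m in train))
test_experts = set().union(*(m for _, m in test))

# Experts the model would face "zero-shot" at test time.
unseen = test_experts - train_experts
print(sorted(unseen))
```

Nothing in the split itself forces `unseen` to be empty; whether it is depends on which teams the shuffle happens to put on each side.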

Thank you very much,
Lingling

@littlebeanbean7 littlebeanbean7 changed the title How to ensure to ppl in same test set have been seen in the train set? How to ensure to ppl in test set have been seen in the train set? Nov 15, 2024
@hosseinfani
Member

Hi @littlebeanbean7
There is no guarantee. The split is based on team instances, so there is a chance that an expert, or even all experts of a team, have not been seen during training.

However, there is a filtering step in our pipeline that filters out sparse experts, that is, removes experts who appear in fewer than a given number of teams. This way, you can make sure that each expert has at least some number of teams in the dataset. Hence, when you split, there is a good chance the expert ends up in both train and test through different teams.
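As a rough sketch of that filtering idea (the function and threshold name below are illustrative, not the pipeline's actual code):

```python
# Sketch: drop experts who appear in fewer than min_nteam teams, then drop
# any team left with no experts. min_nteam is a hypothetical parameter name.
from collections import Counter

def filter_sparse_experts(teams, min_nteam=2):
    counts = Counter(e for members in teams for e in members)
    kept = {e for e, c in counts.items() if c >= min_nteam}
    filtered = [members & kept for members in teams]
    return [m for m in filtered if m]  # discard now-empty teams

teams = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "a"}, {"e"}]
# "e" appears in only one team, so it (and its singleton team) is removed.
print(filter_sparse_experts(teams, min_nteam=2))
```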

Also, since the evaluation is n-fold and the final result is the average over n models, each trained on one fold, the chance of a zero-shot expert is even lower.
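A toy illustration of why folds help (not the OpeNTF pipeline; simple round-robin folds stand in for the real splitter): once every expert appears in at least two teams, most folds keep at least one of each test expert's teams on the training side.

```python
# Toy n-fold check: for each fold used as the test set, list the test-set
# experts that never appear in the training folds. Teams are made up, and
# every expert below appears in at least two teams.
teams = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "a"}, {"a", "c"}, {"b", "d"}]

n = 3
folds = [teams[i::n] for i in range(n)]  # simple round-robin folds

unseen_per_fold = []
for i, test in enumerate(folds):
    train = [t for j, f in enumerate(folds) if j != i for t in f]
    unseen = set().union(*test) - set().union(*train)
    unseen_per_fold.append(unseen)
    print(f"fold {i}: unseen test experts = {sorted(unseen)}")
# Here every fold ends up with no unseen test experts.
```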

@littlebeanbean7
Author

Thank you for your kind reply @hosseinfani! I will add a chunk of code to main.py to load fixed train and test sets, to ensure a fair comparison with my experiment.
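Something like the following sketch (the file names and JSON id-list format are my own assumptions, not part of main.py):

```python
# Sketch: build a fixed train/test split from externally provided team ids
# (e.g., dblp paper ids), instead of calling train_test_split.
import json
import os
import tempfile

def split_by_ids(teams_by_id, train_id_file, test_id_file):
    """Index teams by externally provided id lists stored as JSON arrays."""
    with open(train_id_file) as f:
        train_ids = json.load(f)
    with open(test_id_file) as f:
        test_ids = json.load(f)
    return ([teams_by_id[i] for i in train_ids],
            [teams_by_id[i] for i in test_ids])

# Demo with hypothetical paper ids mapped to expert sets.
teams_by_id = {"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"a", "c"}}
with tempfile.TemporaryDirectory() as d:
    train_f = os.path.join(d, "train_ids.json")
    test_f = os.path.join(d, "test_ids.json")
    with open(train_f, "w") as f:
        json.dump(["p1", "p2"], f)
    with open(test_f, "w") as f:
        json.dump(["p3"], f)
    train, test = split_by_ids(teams_by_id, train_f, test_f)
print(len(train), len(test))
```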
