On the dataset configuration #38

Open
beotborry opened this issue Dec 13, 2023 · 4 comments

@beotborry

beotborry commented Dec 13, 2023

First of all, thank you for your wonderful project, and congratulations on the acceptance!

I have a few questions about the NSD dataset.

  1. Your paper states that there are 24,980 training samples and 2,770 test samples. Does that mean there are 24,980 training pairs of (voxel, img) and 2,770 test pairs per subject? I ask because the code sets num_train = 8557 + 300 and num_val = 982, and the numbers do not match: 8857 * 3 = 26571 and 982 * 3 = 2946. Could you clarify where the paper's numbers come from?

  2. After inspecting the training data for subject 1, I found duplicate images. For example, the images in the 0th and 400th tuples are the same. I checked all 982 pairs in the test set, and they all have distinct images; the training set does not. Are these duplicates in the training set expected? (I found 5970 unique tuples in the training dataset.)

  3. In addition, there seems to be an error in how the validation loader is built. If 982 % batch_size != 0, the last incomplete batch appears to be dropped. I think this should be fixed, and I wonder whether it would change the results reported in the paper.

    .batched(val_batch_size, partial=False)
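
To make the concern in point 3 concrete, here is a minimal sketch in plain Python (not the repository's loader) that counts how many of the 982 validation samples survive batching when incomplete batches are dropped. The function name `samples_seen` is hypothetical, and its `partial` flag only mirrors the argument shown above.

```python
def samples_seen(num_samples: int, batch_size: int, partial: bool) -> int:
    """Count the samples yielded when a stream is cut into fixed-size batches."""
    full_batches, remainder = divmod(num_samples, batch_size)
    seen = full_batches * batch_size
    if partial:
        # Keep the trailing incomplete batch instead of dropping it.
        seen += remainder
    return seen


# Illustrative batch sizes only; 982 = 2 * 491, so few sizes divide it evenly.
for batch_size in (1, 2, 32, 300, 491, 982):
    dropped = 982 - samples_seen(982, batch_size, partial=False)
    print(f"batch_size={batch_size}: {dropped} samples dropped")
```

Under this sketch, nothing is dropped only when the batch size divides 982 evenly, for example when the whole validation set is loaded as a single batch of 982.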

Thank you.

@XuZhang2

XuZhang2 commented Jan 18, 2024

@beotborry Hi, I ran into the same issue while training. The actual number of training steps is four times the expected number (1104 vs. 276). How did you solve this problem? Do I need to change the values of the num_train and num_test variables? I would appreciate your help. Also, I find that 8859+300 seems to be the right number. Maybe the problem is in the get_dataloaders function.

@PaulScotti
Collaborator

PaulScotti commented Jan 18, 2024

Sorry for the delay. All images in the Natural Scenes Dataset were seen up to 3 times by the subjects. During training we use every sample (which explains the duplicates, up to 3 per image), but during testing we averaged across the same-image repeats (which explains the lack of duplicates there). Given that the validation set is the test set, there are 982 unique test images, so there shouldn't be an incomplete batch (if there is one, partial should probably be set to True to allow incomplete batches to go through). num_train should actually be multiplied by 3 because of how we process the samples that are not averaged across repeats, so that's a mistake on our part (and was used for the results in the paper). I don't know how much that would tangibly affect results; probably not much.
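
The difference between the two views of the data can be sketched as follows. This is an illustrative example in plain Python, not the repository's code: the function `average_repeats`, the `(image_id, voxels)` layout, and the toy values are all assumptions made for the sketch.

```python
from collections import defaultdict


def average_repeats(samples):
    """Average voxel vectors that share an image id.

    `samples` is an iterable of (image_id, voxels) pairs, where `voxels`
    is a list of floats. Returns {image_id: averaged voxel list}.
    """
    grouped = defaultdict(list)
    for image_id, voxels in samples:
        grouped[image_id].append(voxels)
    averaged = {}
    for image_id, repeats in grouped.items():
        n = len(repeats)
        # zip(*repeats) walks the repeats voxel by voxel.
        averaged[image_id] = [sum(column) / n for column in zip(*repeats)]
    return averaged


# Toy data (made up): image 7 was seen three times, image 9 once.
toy = [(7, [1.0, 2.0]), (9, [4.0, 4.0]), (7, [2.0, 4.0]), (7, [3.0, 6.0])]
train_view = toy                   # training keeps every repeat as its own sample
test_view = average_repeats(toy)   # testing keeps one averaged entry per image
print(len(train_view), len(test_view))
```

In this toy case the training view has four samples with a repeated image, while the test view has two entries with distinct images, which matches the pattern reported in the question.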

@song-wensong

In the provided JSON file (metadata_subj01.json), the values for train, val, and test are as follows:

"train": 8559,
"val": 300,
"test": 982

However, in the initial code snippet, the default values for num_train and num_val are specified as follows:

"num_train": 8859,
"num_val": 982,

I have the following questions regarding the differences:

Why is there a discrepancy between the values of train in the JSON file and num_train in the default code?

Additionally, why is num_val equal to test in the JSON file?

@PaulScotti
Collaborator

Because when we were developing the model, we used the validation set as our test set. For the final models used in the paper, we consolidated the validation set into the training set and used the test set as our test set. 8559 + 300 = 8859
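
As a rough sketch of that bookkeeping, the defaults can be derived from the metadata values quoted in this thread. The JSON string is reconstructed from those values, and the variable names `max_repeats` and `upper_bound_train_samples` are hypothetical; the factor of 3 reflects the "up to 3 times" repeat count mentioned earlier in the thread, so the product is an upper bound rather than an exact sample count.

```python
import json

# Values copied from the metadata quoted in this thread; the exact file layout is assumed.
metadata = json.loads('{"train": 8559, "val": 300, "test": 982}')

num_train = metadata["train"] + metadata["val"]  # validation set folded into training
num_val = metadata["test"]                       # test set used for evaluation

max_repeats = 3  # each image was seen up to 3 times (assumed upper bound)
upper_bound_train_samples = num_train * max_repeats

print(num_train, num_val, upper_bound_train_samples)
```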
