
Training with custom data gets stuck on the first iteration. #15

Open
SamiurRahman1 opened this issue Jan 5, 2021 · 25 comments
@SamiurRahman1

When I try to train a model with my custom data, it gets stuck with the following output:

No previous checkpoint found!
0it [00:00, ?it/s]

It stays like this until I interrupt the training. I can also see that the GPU is in use, but I don't see anything else happening. Is this normal? As far as I know, I should be able to see logs such as accuracy, loss, epoch count, etc.

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Jan 6, 2021

Yes, you should be able to see some logs here. Can you please check whether a folder is created in the maml directory for your logs? Also, would it be possible for you to share the code, perhaps on a fork or another branch here? What about the dataset?

@SamiurRahman1
Author

A folder is actually created in the master folder, since I moved my script into the master folder before running it. Of course, I would be happy to share my code and dataset.

@SamiurRahman1
Author

I have created a fork and uploaded my code; here are links to the dataset and log files. Thank you again for your help.

Dataset: https://drive.google.com/file/d/1l37mCGycof3qI58gvjfDJ5tCE_cISEGz/view?usp=sharing

log file: https://drive.google.com/file/d/1yMrQlwn9AqaVWvOBuBt-jyCYPPAEsn27/view?usp=sharing

@siavash-khodadadeh
Owner

Thanks! Please share the link to the fork with me. I will look into it soon; however, I am a little busy with some other projects right now, so it might take a couple of days before I get back to you.

@SamiurRahman1
Author

Here is the link to the repo: https://github.com/SamiurRahman1/MetaLearning-TF2.0

@siavash-khodadadeh
Owner

Can you please point me to the python files you added?

@SamiurRahman1
Author

Hello, did you have some time to look at the code?

@SamiurRahman1
Author

Hi, I am still waiting for a reply on this, if possible.

@siavash-khodadadeh
Owner

Hello,

I have been a little bit busy. Unfortunately, I cannot check this from the Python files you shared on Google Drive, since it is hard to track changes there. Please put everything on GitHub, and I can check out that particular branch and debug it.

Thank you very much,
Siavash

@SamiurRahman1
Author

Hi, uploading to or creating a branch in your repo is disabled, for obvious reasons. I created a pull request and uploaded the files there. Maybe you can find them that way? If not, could you please tell me exactly how I should upload them?

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Feb 18, 2021

Sorry for my late reply; I have been busy. I looked at the code. It seems to me that your dataset has only 6 classes; is that correct? In that case, do you want to do meta-learning on it, or do you just want to use it for testing? If you want to do meta-learning, you need different tasks: your meta-batch-size is 4 and n is 5, which means you need at least 20 classes. The program should probably check this before running and give an appropriate error message; please let me know if this is the case. One thing you can try is setting meta-batch-size=1 and seeing whether the program still gets stuck. Thanks again for using this repo.
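The missing pre-run check suggested above could be sketched as follows (hypothetical helper, not part of the repo; the name check_enough_classes and its parameters are assumptions). The idea is that sampling a meta-batch of disjoint n-way tasks requires at least meta_batch_size * n distinct classes, so a dataset that is too small should fail fast instead of hanging:

```python
# Hypothetical sanity check (not in the repo): an n-way task sampler that
# draws disjoint tasks per meta-batch needs meta_batch_size * n classes.
def check_enough_classes(num_classes, n, meta_batch_size):
    required = meta_batch_size * n
    if num_classes < required:
        raise ValueError(
            f"meta-batch-size={meta_batch_size} with n={n} needs at least "
            f"{required} classes, but the dataset provides only {num_classes}."
        )
```

For the dataset in this thread, check_enough_classes(6, 5, 4) would raise immediately instead of leaving the training loop stuck at 0it.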

@SamiurRahman1
Author

Hi, thanks for your reply. I'm trying to train a model with the dataset. I'll try your suggestion and get back to you.

@SamiurRahman1
Author

Hi, so I tried running the training with meta-batch-size=1, but unfortunately it still gets stuck. Is there anything else I need to change if I want to train the model with only 6 classes?

@siavash-khodadadeh
Owner

Okay, I see in the dataset class that there are 4 classes for training and 2 classes for validation. In that case, there is no way to generate 5-way tasks during training, because there are only 4 training classes. Can you please try using all 6 classes for training, validation, and test, just to see whether that is the problem? Also, you can set n=4 instead of 5 with meta-batch-size=1; but since your validation set has just two classes, I think you might hit the same problem there.

@SamiurRahman1
Author

When I set n=4, num_train_classes=6, num_val_classes=6, it throws the following error:

ValueError: in user code:

    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/base_model.py:254 meta_train_loop  *
        task_final_acc, task_final_loss = self.get_losses_of_tasks_batch(method='train')(
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:237 inner_train_loop  *
        self.create_meta_model(self.updated_models[0], self.model, gradients)
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:133 create_meta_model  *
        model_layer = model_layer.get_layer(layer_name)
    /home/lili/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:2398 get_layer  **
        raise ValueError('No such layer: ' + name + '.')

    ValueError: No such layer: simple_model.

@siavash-khodadadeh
Owner

I do not think you should set num_train_classes and num_val_classes to 6, because that would mean you have at least 12 classes, with the remainder used for testing. Can you please make sure your function get_train_val_test_folders(self) -> Tuple returns the same classes for all three splits? Let me try to write it here:

def get_train_val_test_folders(self) -> Tuple:
    # Requires: import os, random  and  from typing import Tuple
    my_dir = "/data/yali/sam/Project/MetaLearning-TF2.0-master/data/Family/"

    # One folder per damage type.
    damage_img = [os.path.join(my_dir, damage) for damage in os.listdir(my_dir)]
    random.shuffle(damage_img)

    # Use the same classes for the train, validation, and test splits.
    train_chars = damage_img
    val_chars = damage_img
    test_chars = damage_img

    # Map each class folder to the list of its instance paths.
    train_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in train_chars}
    val_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in val_chars}
    test_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in test_chars}
    return train_classes, val_classes, test_classes

@SamiurRahman1
Author

I made the changes to the function as you suggested and set n=4, num_train_classes=3, num_val_classes=3. Unfortunately, I am getting the same error.

@siavash-khodadadeh
Copy link
Owner

Can you run maml_omniglot?

@SamiurRahman1
Author

With the Omniglot dataset, you mean?

@SamiurRahman1
Author

Hmm, interestingly, I get the same error when I run maml_omniglot.py.

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Feb 23, 2021

It seems to be something related to the TF version. What is your TensorFlow version?
I can run maml_omniglot.py with TF 2.2.0-rc2.
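A startup guard along these lines could catch the mismatch before training starts. This is a minimal sketch, not part of the repo; KNOWN_GOOD_TF and the helper names are assumptions, and it compares only the numeric version core (so it need not import TensorFlow itself, which would be passed in as tf.__version__):

```python
# Hypothetical startup guard (not in the repo): compare the installed
# TensorFlow version against the release the code is known to work with.
KNOWN_GOOD_TF = "2.2.0-rc2"

def version_core(version):
    # "2.2.0-rc2" -> (2, 2, 0); pre-release suffixes after "-" are ignored.
    return tuple(int(part) for part in version.split("-")[0].split("."))

def tf_version_warning(installed, known_good=KNOWN_GOOD_TF):
    # Return a warning string on mismatch, or None when versions agree.
    if version_core(installed) != version_core(known_good):
        return (f"warning: tested with TF {known_good}, found {installed}; "
                f"Keras internals such as get_layer() may behave differently")
    return None
```

With the versions from this thread, tf_version_warning("2.3.1") returns a warning, while tf_version_warning("2.2.0-rc2") returns None.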

@SamiurRahman1
Author

My TF version is 2.3.1.

@WuJi1

WuJi1 commented Aug 11, 2021

I encountered the same problem with TF version 2.3.1.
Has this error been solved?

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Aug 11, 2021

The code currently works with TF version 2.2.0-rc2. I would be glad to receive a merge request updating the version if you are interested.
