
Training with custom data gets stuck on the first iteration. #15

Open
SamiurRahman1 opened this issue Jan 5, 2021 · 25 comments
@SamiurRahman1

When I try to train a model with my custom data, it gets stuck with the following output:

No previous checkpoint found!
0it [00:00, ?it/s]

It stays like this until I interrupt the training. I can also see that the GPU is in use, but I don't see anything else happening. Is this normal? As far as I know, I should be able to see logs such as accuracy, loss, epoch count, etc.

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Jan 6, 2021

Yes, you should be able to see some logs here. Can you please check whether a folder is created in the maml directory for your logs? Also, would it be possible for you to share the code, perhaps on a fork or another branch here? What about the dataset?

@SamiurRahman1
Author

A folder is actually created in the master folder, since I moved my script into the master folder before running it. Of course, I would be happy to share my code and dataset.

@SamiurRahman1
Author

I have created a fork and uploaded my code; here are links to the dataset and log files. Thank you again for your help.

Dataset: https://drive.google.com/file/d/1l37mCGycof3qI58gvjfDJ5tCE_cISEGz/view?usp=sharing

log file: https://drive.google.com/file/d/1yMrQlwn9AqaVWvOBuBt-jyCYPPAEsn27/view?usp=sharing

@siavash-khodadadeh
Owner

Thanks! Please share the link to the fork with me. I will look into it soon; however, I am a little busy with some other projects right now, so it might take a couple of days before I get back to you.

@SamiurRahman1
Author

Here is the link to the repo: https://github.com/SamiurRahman1/MetaLearning-TF2.0

@siavash-khodadadeh
Owner

Can you please point me to the python files you added?

@SamiurRahman1
Author

Hello, did you have some time to look at the code?

@SamiurRahman1
Author

Hi, I am still waiting for a reply on this, if possible.

@siavash-khodadadeh
Owner

Hello,

I have been a little bit busy. Unfortunately, I cannot check this from the Python files you shared on Google Drive, since it is hard to track changes there. Please put everything on GitHub, and I can check out that particular branch and debug it.

Thank you very much,
Siavash

@SamiurRahman1
Author

Hi, uploading to or creating a branch in your repo is disabled, for obvious reasons. I created a pull request and uploaded the files there. Maybe you can find them that way? If not, could you please tell me exactly how I should upload them?

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Feb 18, 2021

Sorry for my late reply; I have been busy. I looked at the code. It seems to me that your dataset has only 6 classes; is that correct? In that case, do you want to do meta-learning on it, or do you just want to use it for testing? If you want to do meta-learning, you need different tasks: your meta-batch-size is 4 and n is 5, which means you need at least 20 classes. The program should probably check this before running and give an appropriate error message; please let me know if this is the case. One thing you can try is setting meta-batch-size=1 and seeing whether the program still gets stuck. Thanks again for using this repo.
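The missing pre-run check suggested above could be sketched as follows (hypothetical helper, not part of the repo; the name check_enough_classes and its parameters are assumptions). The idea is that sampling a meta-batch of disjoint n-way tasks requires at least meta_batch_size * n distinct classes, so a dataset that is too small should fail fast instead of hanging:

```python
# Hypothetical sanity check (not in the repo): an n-way task sampler that
# draws disjoint tasks per meta-batch needs meta_batch_size * n classes.
def check_enough_classes(num_classes, n, meta_batch_size):
    required = meta_batch_size * n
    if num_classes < required:
        raise ValueError(
            f"meta-batch-size={meta_batch_size} with n={n} needs at least "
            f"{required} classes, but the dataset provides only {num_classes}."
        )
```

For the dataset in this thread, check_enough_classes(6, 5, 4) would raise immediately instead of leaving the training loop stuck at 0it.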

@SamiurRahman1
Author

Hi, thanks for your reply. I'm trying to train a model with the dataset. I'll try your suggestion and get back to you.

@SamiurRahman1
Author

Hi, so I tried running the training with meta-batch-size=1, but unfortunately it still gets stuck. Is there anything else I need to change if I want to train the model with only 6 classes?

@siavash-khodadadeh
Owner

Okay, I see in the dataset class that there are 4 classes for training and 2 classes for validation. In that case, there is no way to generate 5-way tasks during training, because there are only 4 training classes. Can you please try using all 6 classes for training, validation, and test, just to see whether that is the problem? Also, you can set n=4 instead of 5 with meta-batch-size=1; but since your validation set has just two classes, I think you might hit the same problem there.

@SamiurRahman1
Author

When I set n=4, num_train_classes=6, num_val_classes=6, it throws the following error:

ValueError: in user code:

    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/base_model.py:254 meta_train_loop  *
        task_final_acc, task_final_loss = self.get_losses_of_tasks_batch(method='train')(
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:237 inner_train_loop  *
        self.create_meta_model(self.updated_models[0], self.model, gradients)
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:133 create_meta_model  *
        model_layer = model_layer.get_layer(layer_name)
    /home/lili/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:2398 get_layer  **
        raise ValueError('No such layer: ' + name + '.')

    ValueError: No such layer: simple_model.

@siavash-khodadadeh
Owner

I do not think you should set num_train_classes and num_val_classes to 6, because that would mean you have at least 12 classes, with the remainder used for testing. Can you please make sure your function get_train_val_test_folders(self) -> Tuple returns the same classes for all three splits? Let me try to write it here:

def get_train_val_test_folders(self) -> Tuple:
    # Requires: import os, random  and  from typing import Tuple
    my_dir = "/data/yali/sam/Project/MetaLearning-TF2.0-master/data/Family/"

    # One folder per damage type.
    damage_img = [os.path.join(my_dir, damage) for damage in os.listdir(my_dir)]
    random.shuffle(damage_img)

    # Use the same classes for the train, validation, and test splits.
    train_chars = damage_img
    val_chars = damage_img
    test_chars = damage_img

    # Map each class folder to the list of its instance paths.
    train_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in train_chars}
    val_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in val_chars}
    test_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in test_chars}
    return train_classes, val_classes, test_classes

@SamiurRahman1
Author

I made the changes to the function as you suggested and set n=4, num_train_classes=3, num_val_classes=3. Unfortunately, I am getting the same error.

@siavash-khodadadeh
Copy link
Owner

Can you run maml_omniglot?

@SamiurRahman1
Author

With the Omniglot dataset, you mean?

@SamiurRahman1
Author

Hmm, interestingly, I get the same error when I run maml_omniglot.py.

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Feb 23, 2021

It seems to be something related to the TF version. What is your TensorFlow version?
I can run maml_omniglot.py with TF 2.2.0-rc2.
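A startup guard along these lines could catch the mismatch before training starts. This is a minimal sketch, not part of the repo; KNOWN_GOOD_TF and the helper names are assumptions, and it compares only the numeric version core (so it need not import TensorFlow itself, which would be passed in as tf.__version__):

```python
# Hypothetical startup guard (not in the repo): compare the installed
# TensorFlow version against the release the code is known to work with.
KNOWN_GOOD_TF = "2.2.0-rc2"

def version_core(version):
    # "2.2.0-rc2" -> (2, 2, 0); pre-release suffixes after "-" are ignored.
    return tuple(int(part) for part in version.split("-")[0].split("."))

def tf_version_warning(installed, known_good=KNOWN_GOOD_TF):
    # Return a warning string on mismatch, or None when versions agree.
    if version_core(installed) != version_core(known_good):
        return (f"warning: tested with TF {known_good}, found {installed}; "
                f"Keras internals such as get_layer() may behave differently")
    return None
```

With the versions from this thread, tf_version_warning("2.3.1") returns a warning, while tf_version_warning("2.2.0-rc2") returns None.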

@SamiurRahman1
Author

My TF version is 2.3.1.

@WuJi1

WuJi1 commented Aug 11, 2021

I encountered the same problem with TF version 2.3.1.
Has this error been solved?

@siavash-khodadadeh
Owner

siavash-khodadadeh commented Aug 11, 2021

The code currently works with TF version 2.2.0-rc2. I would be glad to receive a merge request updating the version if you are interested.
