Sana Training time limit & epochs saved at every step (bug) #62

Open
AfterHAL opened this issue Dec 3, 2024 · 11 comments

Comments

@AfterHAL

AfterHAL commented Dec 3, 2024

Hi.
I was surprised to see this "stopping training at epoch ... due to time limit":
[screenshot of the training log showing the stop message]

After taking a look at the code, I found this time-limit check set to nearly 4 hours:

                if (time.time() - training_start_time) / 3600 > 3.8:
                    logger.info(f"Stopping training at epoch {epoch}, step {global_step} due to time limit.")
                    return

So I changed the limit to 500 (hours), and then restarted my training session.
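For anyone else hitting this, a minimal sketch of how that hard-coded limit could instead be read from the config (note: `config.train.train_hours` is an assumed field, not an existing option in this repo):

    # Hypothetical: read the limit from the config instead of hard-coding 3.8 hours.
    # `config.train.train_hours` is an assumed field; add it to the config before using this.
    max_train_hours = getattr(config.train, "train_hours", 3.8)
    if (time.time() - training_start_time) / 3600 > max_train_hours:
        logger.info(f"Stopping training at epoch {epoch}, step {global_step} due to time limit.")
        return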

After around 4 hours, it started to record a pth checkpoint file at every step:

[screenshot of .pth checkpoint files being written at every step]

Could it be because of this line of code in train.py (debug is set to true by default in train.sh)?

  if epoch % config.train.save_model_epochs == 0 or epoch == config.train.num_epochs and not config.debug:
      # save epoch checkpoint
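For reference, Python's `and` binds tighter than `or`, so the condition above is parsed as sketched below: the `not config.debug` guard only applies to the final-epoch clause, and the periodic epoch save still fires even with debug on. This is just a note on operator precedence, not a claim about what caused the every-step saves (see the maintainer's answer further down):

    # How Python parses the condition above:
    #     (epoch % config.train.save_model_epochs == 0)
    #     or (epoch == config.train.num_epochs and not config.debug)
    # If the intent were to skip *all* epoch saves in debug mode, explicit
    # parentheses would be needed:
    if (epoch % config.train.save_model_epochs == 0
            or epoch == config.train.num_epochs) and not config.debug:
        pass  # save epoch checkpoint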

Now, I wonder what happens each time I restart the training because there are a lot of images in my dataset (70k):

  • Will training resumed from "latest.pth" go through the dataset from the beginning, in the order listed in the meta_data.json file?
  • Or is the dataset shuffled at each restart?

It seems the dataset is shuffled, because I set multi_scale: true in my config and the code reads:
[EDIT:] I just saw that multi_scale is set to false in the training start command line (train.sh):

--model.multi_scale=false

So, the question remains...

[screenshot of the build_dataloader code]

(build_dataloader function docs : https://mmdetection.readthedocs.io/en/v2.20.0/_modules/mmdet/datasets/builder.html)

So, I hope my dataset is shuffled at every restart, but I'm not sure how to check it.
Can someone help me with both of these (the .pth file saved at every step, and the dataset shuffling)?
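One way to check, assuming the training script ultimately builds a standard PyTorch `DataLoader` (the snippet below uses a stand-in dataset, not the repo's actual loader built from meta_data.json), is to look at the sampler type and print the first few indices it yields across two runs:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; the real one is built from meta_data.json by the repo's code.
    dataset = TensorDataset(torch.arange(100))
    loader = DataLoader(dataset, batch_size=4, shuffle=True)

    # RandomSampler -> order is reshuffled on each run (unless the RNG seed is fixed);
    # SequentialSampler -> fixed meta_data.json order.
    print(type(loader.sampler).__name__)
    print(list(loader.sampler)[:10])  # first indices of this run's visiting order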
Regards.

@Deng-Xian-Sheng

This seems very unusual. Can you provide your training configuration file and training command?

Since you modified the configuration file, its contents are particularly important.

@Deng-Xian-Sheng

I'm not even sure I understand what you're saying.

You mean there is a time limit for training?
That shouldn't happen; usually only steps or epochs are limited.

You also said that a checkpoint is saved at every training step?

That is also alarming: the hard disk will fill up, and your 70k dataset will make the situation worse.

Is it because you turned on debug?

Is this on by default?

I think debug may be intended for debugging the code that implements training, rather than for the training process itself.

What I mean is that the training code implementation is fairly complicated, so debug exists for convenience while testing it. When actually running training, it should be turned off.

@Deng-Xian-Sheng

What I understand least is your requirement.

Why do you want to shuffle the dataset every time you start training?

I feel that this is not necessary and doesn't even make sense.

I think you may be looking for a way to resume training from a certain checkpoint?

I remember that parameter is mentioned in the readme file.

Do you expect not to resume from a checkpoint? That would be a very strange idea.

@Deng-Xian-Sheng

Do you know what epoch means?

It means the entire dataset is processed once.

That means, if you interrupt training at a checkpoint somewhere in epoch 1,

when you resume, you should not retrain what was already trained, nor shuffle the dataset, but continue from where the previous run stopped.

However, I think these things are relatively low-level and torch should handle them, especially saving checkpoints and then loading them.

That is quite low-level, and low-level things rarely go wrong.

The more likely place for mistakes is the upper layer: documentation and usage errors.
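As a rough illustration of what resuming "from where the previous run stopped" involves, here is a generic PyTorch sketch (not the Sana training code); note that it does not capture the dataloader's position within an epoch, which is exactly the part the question is about:

    import torch

    def save_checkpoint(path, model, optimizer, epoch, global_step):
        # Persist everything needed to continue training later.
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "global_step": global_step,
        }, path)

    def load_checkpoint(path, model, optimizer):
        # Restore the state and return the counters to resume from.
        ckpt = torch.load(path, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["epoch"], ckpt["global_step"]

    # The sampler/RNG state is not saved here, so a resumed run typically restarts
    # the current epoch's iteration order rather than continuing mid-epoch.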

@Deng-Xian-Sheng

Also, are you fine-tuning or training from scratch? The dataset is quite large.

How is the training going?

Hmm... I'd like to know whether the training is fast. That will help me prepare a budget.

I just rented an H800 with 80 GB of memory for a few hours. With a larger batch size, I don't know whether I can get good results within 24 hours.

It's quite expensive, just over $1 per hour.

By the way, have you tried multi-GPU training to speed things up?

Many training setups show no speed difference between multi-GPU and single GPU.

@lawrence-cj
Collaborator

lawrence-cj commented Dec 4, 2024

So, I hope my dataset is shuffled at every restart, but I'm not sure how to check it.
Can someone help me with both of these (the .pth file saved at every step, and the dataset shuffling)?

I need to know your training config and command to help. Would you pls give more info about it?

@lawrence-cj
Collaborator

Sorry for the bug here. The reason a checkpoint is saved at every step is another 4-hour check here:

if global_step % config.train.save_model_steps == 0 or (time.time() - training_start_time) / 3600 > 3.8:

Already fixed in the latest commit.
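For anyone patching an older checkout by hand, one possible workaround (just a sketch; not necessarily what the actual fix commit does) is to let the time-based condition trigger a save only once:

    # `time_limit_save_done` is a hypothetical flag, initialized to False before the training loop.
    time_limit_exceeded = (time.time() - training_start_time) / 3600 > 3.8
    if global_step % config.train.save_model_steps == 0 or (time_limit_exceeded and not time_limit_save_done):
        if time_limit_exceeded:
            time_limit_save_done = True
        # ... save the step checkpoint as before ...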

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

Sorry for the bug here. The reason a checkpoint is saved at every step is another 4-hour check here:

if global_step % config.train.save_model_steps == 0 or (time.time() - training_start_time) / 3600 > 3.8:

Already fixed in the latest commit.

Thank you @lawrence-cj.

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

I'm not even sure I understand what you're saying.

You mean there is a time limit for training? That shouldn't happen; usually only steps or epochs are limited.

I'm not just saying it; if you read my question, you can see I posted the code that sets the time limit.

You also said that a checkpoint is saved at every training step?

No, I didn't say that it will; I said that it does (see the screen capture in my question).

That is also alarming: the hard disk will fill up, and your 70k dataset will make the situation worse.

Yes. I guess that writing a new 8 GB file every 40 seconds will fill up my drives. Perhaps that's part of why I think it is a bug, but mostly it's because checkpoints were only supposed to be saved at the steps specified in the config file.

Is it because you turned on debug?

Who knows?

Is this on by default?

In my question, I wrote "(debug is set to true by default in train.sh)", so perhaps it is set to true by default in the train.sh file.

I think debug may be intended for debugging the code that implements training, rather than for the training process itself.

What I mean is that the training code implementation is fairly complicated, so debug exists for convenience while testing it. When actually running training, it should be turned off.

Sorry. I thought that debugging was just for the fun of seeing a lot of tracing outputs in the console.

(I prefer to joke about all this.)

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

What I understand least is your requirement.

Why do you want to shuffle the dataset every time you start training?

I feel that this is not necessary and doesn't even make sense.

I think you may be looking for a way to resume training from a certain checkpoint?

I remember that parameter is mentioned in the readme file.

Do you expect not to resume from a checkpoint? That would be a very strange idea.

I don't want to shuffle my dataset.
I expected the training to finish at least one epoch, but because of the time limit I had to restart the training process multiple times.
That's when I started thinking about how the training process deals with those restarts, and then I saw this "shuffle" option in the code. So I was just wondering about it...

By the way, I was also wondering how the training runs through the dataset if I change my dataset before resuming.

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

Do you know what epoch means?

It means the entire dataset is processed once.

That means, if you interrupt training at a checkpoint somewhere in epoch 1,

when you resume, you should not retrain what was already trained, nor shuffle the dataset, but continue from where the previous run stopped.

However, I think these things are relatively low-level and torch should handle them, especially saving checkpoints and then loading them.

That is quite low-level, and low-level things rarely go wrong.

The more likely place for mistakes is the upper layer: documentation and usage errors.

Of course, low-level programmers never make mistakes!
(Sorry, I should have known.)
