Sana Training time limit & epochs saved at every step (bug) #62

Open
AfterHAL opened this issue Dec 3, 2024 · 11 comments

Comments

@AfterHAL

AfterHAL commented Dec 3, 2024

Hi.
I was surprised to see this "stopping training at epoch ... due to time limit":
[screenshot of the training log showing the stop message]

After taking a look at the code, I found this time-limit check set to nearly 4 hours:

                if (time.time() - training_start_time) / 3600 > 3.8:
                    logger.info(f"Stopping training at epoch {epoch}, step {global_step} due to time limit.")
                    return

So I changed the limit to 500 (hours), and then restarted my training session.
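For anyone else hitting this, a minimal sketch of how that hard-coded limit could instead be read from the config (note: `config.train.train_hours` is an assumed field, not an existing option in this repo):

    # Hypothetical: read the limit from the config instead of hard-coding 3.8 hours.
    # `config.train.train_hours` is an assumed field; add it to the config before using this.
    max_train_hours = getattr(config.train, "train_hours", 3.8)
    if (time.time() - training_start_time) / 3600 > max_train_hours:
        logger.info(f"Stopping training at epoch {epoch}, step {global_step} due to time limit.")
        return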

After around 4 hours, it started to record a pth checkpoint file at every step:

[screenshot of .pth checkpoint files being written at every step]

Could it be because of this line of code in train.py (debug is set to true by default in train.sh)?

  if epoch % config.train.save_model_epochs == 0 or epoch == config.train.num_epochs and not config.debug:
      # save epoch checkpoint
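For reference, Python's `and` binds tighter than `or`, so the condition above is parsed as sketched below: the `not config.debug` guard only applies to the final-epoch clause, and the periodic epoch save still fires even with debug on. This is just a note on operator precedence, not a claim about what caused the every-step saves (see the maintainer's answer further down):

    # How Python parses the condition above:
    #     (epoch % config.train.save_model_epochs == 0)
    #     or (epoch == config.train.num_epochs and not config.debug)
    # If the intent were to skip *all* epoch saves in debug mode, explicit
    # parentheses would be needed:
    if (epoch % config.train.save_model_epochs == 0
            or epoch == config.train.num_epochs) and not config.debug:
        pass  # save epoch checkpoint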

Now, I wonder what happens each time I restart the training because there are a lot of images in my dataset (70k):

  • Will training resumed from "latest.pth" go through the dataset from the beginning, in the order listed in the meta_data.json file?
  • Or is the dataset shuffled at each restart?

It seems the dataset is shuffled, because I set multi_scale: true in my config and the code reads:
[EDIT:] I just saw that multi_scale is set to false in the training start command line (train.sh):

--model.multi_scale=false

So, the question remains...

[screenshot of the build_dataloader code]

(build_dataloader function docs : https://mmdetection.readthedocs.io/en/v2.20.0/_modules/mmdet/datasets/builder.html)

So, I hope my dataset is shuffled at every restart, but I'm not sure how to check it.
Can someone help me with both of these (the .pth file saved at every step, and the dataset shuffling)?
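One way to check, assuming the training script ultimately builds a standard PyTorch `DataLoader` (the snippet below uses a stand-in dataset, not the repo's actual loader built from meta_data.json), is to look at the sampler type and print the first few indices it yields across two runs:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; the real one is built from meta_data.json by the repo's code.
    dataset = TensorDataset(torch.arange(100))
    loader = DataLoader(dataset, batch_size=4, shuffle=True)

    # RandomSampler -> order is reshuffled on each run (unless the RNG seed is fixed);
    # SequentialSampler -> fixed meta_data.json order.
    print(type(loader.sampler).__name__)
    print(list(loader.sampler)[:10])  # first indices of this run's visiting order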
Regards.

@Deng-Xian-Sheng

This seems very unusual. Can you provide your training configuration file and training command?

Since you modified the configuration file, its contents are particularly important.

@Deng-Xian-Sheng

I'm not even sure I understand what you're saying.

You mean there is a time limit for training?
That shouldn't happen; usually only steps or epochs are limited.

You also said that a checkpoint is saved at every training step?

That is also alarming: the hard disk will fill up, and your 70k dataset will make the situation worse.

Is it because you turned on debug?

Is this on by default?

I think debug may be intended for debugging the code that implements training, rather than for the training process itself.

What I mean is that the training code implementation is fairly complicated, so debug exists for convenience while testing it. When actually running training, it should be turned off.

@Deng-Xian-Sheng

What I understand least is your requirement.

Why do you want to shuffle the dataset every time you start training?

I feel that this is not necessary and doesn't even make sense.

I think you may be looking for a way to resume training from a certain checkpoint?

I remember that parameter is mentioned in the readme file.

Do you expect not to resume from a checkpoint? That would be a very strange idea.

@Deng-Xian-Sheng

Do you know what epoch means?

It means the entire dataset is processed once.

That means, if you interrupt training at a checkpoint somewhere in epoch 1,

when you resume, you should not retrain what was already trained, nor shuffle the dataset, but continue from where the previous run stopped.

However, I think these things are relatively low-level and torch should handle them, especially saving checkpoints and then loading them.

That is quite low-level, and low-level things rarely go wrong.

The more likely place for mistakes is the upper layer: documentation and usage errors.
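As a rough illustration of what resuming "from where the previous run stopped" involves, here is a generic PyTorch sketch (not the Sana training code); note that it does not capture the dataloader's position within an epoch, which is exactly the part the question is about:

    import torch

    def save_checkpoint(path, model, optimizer, epoch, global_step):
        # Persist everything needed to continue training later.
        torch.save({
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "global_step": global_step,
        }, path)

    def load_checkpoint(path, model, optimizer):
        # Restore the state and return the counters to resume from.
        ckpt = torch.load(path, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        return ckpt["epoch"], ckpt["global_step"]

    # The sampler/RNG state is not saved here, so a resumed run typically restarts
    # the current epoch's iteration order rather than continuing mid-epoch.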

@Deng-Xian-Sheng

Also, are you fine-tuning or training from scratch? The dataset is quite large.

How is the training going?

Hmm... I'd like to know whether the training is fast. That will help me prepare a budget.

I just rented an H800 with 80 GB of memory for a few hours. With a larger batch size, I don't know whether I can get good results within 24 hours.

It's quite expensive, just over $1 per hour.

By the way, have you tried multi-GPU training to speed things up?

Many training setups show no speed difference between multi-GPU and single GPU.

@lawrence-cj
Collaborator

lawrence-cj commented Dec 4, 2024

So, I hope my dataset is shuffled at every restart, but I'm not sure how to check it.
Can someone help me with both of these (the .pth file saved at every step, and the dataset shuffling)?

I need to know your training config and command to help. Would you pls give more info about it?

@lawrence-cj
Collaborator

Sorry for the bug here. The reason a checkpoint is saved at every step is another 4-hour check here:

if global_step % config.train.save_model_steps == 0 or (time.time() - training_start_time) / 3600 > 3.8:

Already fixed in the latest commit.
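For anyone patching an older checkout by hand, one possible workaround (just a sketch; not necessarily what the actual fix commit does) is to let the time-based condition trigger a save only once:

    # `time_limit_save_done` is a hypothetical flag, initialized to False before the training loop.
    time_limit_exceeded = (time.time() - training_start_time) / 3600 > 3.8
    if global_step % config.train.save_model_steps == 0 or (time_limit_exceeded and not time_limit_save_done):
        if time_limit_exceeded:
            time_limit_save_done = True
        # ... save the step checkpoint as before ...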

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

Sorry for the bug here. The reason a checkpoint is saved at every step is another 4-hour check here:

if global_step % config.train.save_model_steps == 0 or (time.time() - training_start_time) / 3600 > 3.8:

Already fixed in the latest commit.

Thank you @lawrence-cj.

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

I'm not even sure I understand what you're saying.

You mean there is a time limit for training? That shouldn't happen; usually only steps or epochs are limited.

I'm not just saying it; if you read my question, you can see I posted the code that sets the time limit.

You also said that a checkpoint is saved at every training step?

No, I didn't say that it will; I said that it does (see the screen capture in my question).

That is also alarming: the hard disk will fill up, and your 70k dataset will make the situation worse.

Yes. I guess that writing a new 8 GB file every 40 seconds will fill up my drives. Perhaps that's part of why I think it is a bug, but mostly it's because checkpoints were only supposed to be saved at the steps specified in the config file.

Is it because you turned on debug?

Who knows?

Is this on by default?

In my question, I wrote "(debug is set to true by default in train.sh)", so perhaps it is set to true by default in the train.sh file.

I think debug may be intended for debugging the code that implements training, rather than for the training process itself.

What I mean is that the training code implementation is fairly complicated, so debug exists for convenience while testing it. When actually running training, it should be turned off.

Sorry. I thought that debugging was just for the fun of seeing a lot of tracing outputs in the console.

(I prefer to joke about all this.)

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

What I understand least is your requirement.

Why do you want to shuffle the dataset every time you start training?

I feel that this is not necessary and doesn't even make sense.

I think you may be looking for a way to resume training from a certain checkpoint?

I remember that parameter is mentioned in the readme file.

Do you expect not to resume from a checkpoint? That would be a very strange idea.

I don't want to shuffle my dataset.
I expected the training to finish at least one epoch, but because of the time limit I had to restart the training process multiple times.
That's when I started thinking about how the training process deals with those restarts, and then I saw this "shuffle" option in the code. So I was just wondering about it...

By the way, I was also wondering how the training runs through the dataset if I change my dataset before resuming.

@AfterHAL
Author

AfterHAL commented Dec 4, 2024

Do you know what epoch means?

It means the entire dataset is processed once.

That means, if you interrupt training at a checkpoint somewhere in epoch 1,

when you resume, you should not retrain what was already trained, nor shuffle the dataset, but continue from where the previous run stopped.

However, I think these things are relatively low-level and torch should handle them, especially saving checkpoints and then loading them.

That is quite low-level, and low-level things rarely go wrong.

The more likely place for mistakes is the upper layer: documentation and usage errors.

Of course, low-level programmers never make mistakes!
(Sorry, I should have known.)
