Sana Training time limit & epochs saved at every step (bug) #62
Comments
I feel this is very unusual. Can you provide your training configuration file and training command? If you modified the configuration file, its contents are especially important.
I'm not even sure I understand what you're saying. You mean there is a time limit for training? You also say a checkpoint is saved at every training step? That is alarming: the disk will fill up, and your 70k dataset will make it worse. Is it because you turned on debug? Is that on by default? I suspect debug is meant for debugging the training code itself rather than the training process; it is convenient while developing the train implementation, but it should be turned off when actually running training.
What I understand least is your requirement. Why do you want to shuffle the dataset every time you start training? That seems unnecessary and even pointless. I think you may actually be looking for a way to resume training from a checkpoint? I remember that parameter is mentioned in the README. Or do you expect not to resume from a checkpoint? That would be a very strange idea.
Do you know what an epoch means? It means the whole dataset is processed once. So if you interrupt training at a checkpoint somewhere in epoch 1, then when you resume you should not retrain what was already trained, nor reshuffle the dataset, but continue from where the previous run stopped. That said, these things are relatively low-level and torch should handle them, especially saving checkpoints and loading them back. Low-level machinery rarely goes wrong; errors are more likely in the upper layer, i.e. documentation and usage mistakes.
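To be concrete, the save/resume machinery I mean is the generic PyTorch checkpoint pattern sketched below (illustration only, not the Sana repository's code; the function names are mine):

```python
import torch

def save_checkpoint(path, model, optimizer, epoch, global_step):
    # Persist everything needed to continue training where it stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "global_step": global_step,
    }, path)

def load_checkpoint(path, model, optimizer):
    # Restore model/optimizer state and return the position to resume from.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"], ckpt["global_step"]
```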
Also, are you fine-tuning or training from scratch? The dataset is quite large. How is the training going? I'd like to know whether it is fast, because it would help me prepare a budget. I just got an H800 (80 GB of memory) for a few hours. With a larger batch size, I don't know whether I can get good results in 24 hours; at just over $1 per hour it's expensive. By the way, have you tried multi-GPU training? Many training setups show no speed difference between multi-GPU and a single GPU.
I need to know your training config and command to help. Could you please give more info about them?
Sorry for the bug here. The reason a checkpoint is saved at every step is another check for 4 hours here: Line 451 in 1b2901b
Already fixed in the latest commit.
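For anyone hitting the same thing, the failure mode is roughly the shape sketched below (illustration only; the names and exact logic are assumptions, not the actual code at Line 451):

```python
import time

TIME_LIMIT_SECONDS = 4 * 60 * 60  # the ~4-hour wall-clock check discussed above

def should_save_now(step, train_start_time, regular_save_steps):
    # Once the elapsed time exceeds the limit, this condition is true on
    # every subsequent step, so a checkpoint is written at each step instead
    # of only at the configured save interval.
    over_time = (time.time() - train_start_time) > TIME_LIMIT_SECONDS
    return over_time or (step % regular_save_steps == 0)
```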
Thank you @lawrence-cj. |
I don't just mean it: if you read my question, you can see I posted the code that sets the time limit.
No, I didn't say that it will; I said that it does (check the screen capture in my question).
Yes, I guess that writing a new 8 GB file every 40 seconds will fill up my drives. Perhaps that is part of why I think it is a bug, but mostly it is because checkpoints were supposed to be saved only at the steps specified in the config file.
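(For scale: 8 GB every 40 seconds is 90 × 8 GB = 720 GB per hour, or roughly 17 TB per day.)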
Who knows?
In my question, I said that debug is set to true by default in train.sh, so perhaps it is on by default.
Sorry. I thought that debugging was just for the fun of seeing a lot of tracing outputs in the console. (I prefer to joke about all this.)
I don't want to shuffle my dataset. By the way, I was also wondering how training runs through the dataset if I change the dataset before resuming.
Of course, low-level programmers never make mistakes!
Hi.
I was surprised to see this "stopping training at epoch ... due to time limit":
After taking a look at the code, I found this "time limit" function set to nearly 4 hours:
So I changed the limit to 500 (hours), and then restarted my training session.
After around 4 hours, it started to record a .pth checkpoint file at every step:
Could it be because of this line of code in `train.py`? (debug is set to true by default in `train.sh`)

Now, I wonder what happens each time I restart the training, because there are a lot of images in my dataset (70k): does it reshuffle the images listed in the `meta_data.json` file? It seems the dataset is shuffled, because I set `multi_scale: true` in my config and the code reads:

[EDIT:] I just saw that `multi_scale` is set to false in the start-training (`train.sh`) command line. So, the question remains...
(`build_dataloader` function docs: https://mmdetection.readthedocs.io/en/v2.20.0/_modules/mmdet/datasets/builder.html)
So, I hope my dataset is shuffled at every restart, but I'm not sure how to check it.
Can someone help me with it? (the per-step .pth file saving and the image shuffling)
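For reference, the kind of check I have in mind looks like this (only a sketch with a toy dataset, assuming a standard PyTorch `DataLoader`; none of these names come from the Sana code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real 70k-image dataset: each "sample" is just its index.
dataset = TensorDataset(torch.arange(70_000))

def first_indices(seed: int, shuffle: bool, n: int = 8):
    # Log the first few sample indices the loader yields. If the order changes
    # between two fresh starts (different seeds), the data is being shuffled;
    # if it is always [0, 1, 2, ...], it is not.
    g = torch.Generator().manual_seed(seed)
    loader = DataLoader(dataset, batch_size=1, shuffle=shuffle, generator=g)
    return [int(batch[0]) for batch, _ in zip(loader, range(n))]

print(first_indices(seed=0, shuffle=True))   # some permuted order
print(first_indices(seed=1, shuffle=True))   # a different permuted order
print(first_indices(seed=0, shuffle=False))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The same idea works on the real dataloader by printing the image paths of the first few batches right after each restart.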
Regards.