sync optimizer problem #43
Comments
On which OS are you training your model?
I'm not sure, but maybe this message can be ignored, as I have seen a lot of printed logs with this message on GitHub and on Slack.
About your first question: it is normal that 3 sessions are started when using 2 GPUs. There is one for each GPU worker and one for the parameter server that holds the parameters. About the second question: I will check it out when I have the time.
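To make the session count concrete, here is a minimal sketch in plain TensorFlow 1.x (not Nabu's own cluster code; the addresses are placeholders): one parameter-server task plus one worker task per GPU, so two GPUs give three processes in total.

```python
import tensorflow as tf

# One ps task that holds the parameters, one worker task per GPU.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Inside a worker process, replica_device_setter places the variables on the
# ps task while the compute ops stay on the local worker device.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device="/job:worker/task:0")):
    weights = tf.Variable(tf.zeros([10]), name="weights")
```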
@AzizCode92 I use Docker with CentOS 7.4; I have met [...] either.
@vrenkens
@vrenkens When I use 3 GPUs, I found the number is 5, so I guess that, apart from the ps and the chief worker, the master session starts twice on each of the other worker GPUs.
Can you please share the output of [...]?
I doubt that this is the case. I have only tested it with 2 workers and 1 ps, because I don't have bigger machines to play with. The overview of your jobs looks normal. From your first post it seems that both workers are working, no?
See here: the task_index=0 is shared between the worker and the ps.
@AzizCode92 The task_name is different. The indices for different task names can be the same.
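A small sketch of this point (placeholder addresses, not Nabu's actual configuration): task indices are scoped per job name, so a ps task and a worker task can both use task_index=0 without clashing.

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Same index, different job names -> two distinct servers.
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0)
worker_server = tf.train.Server(cluster, job_name="worker", task_index=0)
```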
@vrenkens Yes, when the first two master sessions start, only one worker is working; when the last master session starts, the second GPU starts working.
That is pretty normal behavior; it can take a while before all workers are online.
Yes. So please help me find the sync replicas optimizer can't apply_gradients error.
I will take a look when I have the time.
Hi everyone: on the TF website, SyncReplicasOptimizer has a [...]
Hi Vincent,
I have another problem. I use GPUs 0 and 1 for training with numbatches_to_aggregate=0 in the default config standardtrainer.cfg, but I found 3 "Start master session" lines in the log. Is this behavior right?
In contrast, when I set numbatches_to_aggregate=2 to use the sync replicas optimizer, there is an error message like [...]
So I added the global_step parameter to apply_gradients_op in the function _update and started training, but no training log is printed anymore.
How should global_step be passed to apply_gradients_op? @vrenkens
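For reference, here is a hedged sketch in plain TensorFlow 1.x of how tf.train.SyncReplicasOptimizer is usually wired up. It is not Nabu's _update code: the toy loss and single-replica settings are placeholders, and in a real two-GPU run replicas_to_aggregate would presumably follow numbatches_to_aggregate. The two points it shows are that apply_gradients takes the global step explicitly, and that every worker needs the hook from make_session_run_hook; without the chief's hook the token queue never gets tokens, the workers block, and the training log simply stops.

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()

# Toy model so the sketch is self-contained.
weight = tf.Variable(0.0, name="weight")
loss = tf.square(weight - 1.0)

base_opt = tf.train.GradientDescentOptimizer(0.1)
# replicas_to_aggregate=1 keeps this runnable as a single process; in the
# distributed run it would be the number of batches to aggregate (e.g. 2).
opt = tf.train.SyncReplicasOptimizer(
    base_opt, replicas_to_aggregate=1, total_num_replicas=1)

grads_and_vars = opt.compute_gradients(loss)
# The key point from the question: pass global_step to apply_gradients.
train_op = opt.apply_gradients(grads_and_vars, global_step=global_step)

# Every worker needs this hook (is_chief=True only on the chief worker);
# it runs the chief queue runner that applies the aggregated gradients
# and fills the sync token queue the workers wait on.
sync_hook = opt.make_session_run_hook(is_chief=True)

with tf.train.MonitoredTrainingSession(hooks=[sync_hook]) as sess:
    for _ in range(5):
        sess.run(train_op)
```

If Nabu's _update builds the training op itself, the same idea should carry over: hand the trainer's global step tensor to apply_gradients and keep the sync hook attached to the session on every worker.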