
sync optimizer problem #43

Open
fanlu opened this issue Aug 21, 2018 · 15 comments

@fanlu

fanlu commented Aug 21, 2018

Hi Vincent,
I have another problem. I use GPUs 0 and 1 for training with numbatches_to_aggregate=0 in the default standardtrainer.cfg config, but I see three "Start master session" lines in the log. Is this behavior correct?

2018-08-21 07:20:16.556159: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 07:20:22.265699: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 6066331584d0d493 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 07:20:22.711547: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 51400be6a5c9e9bc with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.890086, learning rate: 0.001000 
         time elapsed: 8.318348 sec
         peak memory usage: 21/22604 MB
WORKER 0: step 1/15600 loss: 3.776481, learning rate: 0.001000 
         time elapsed: 1.858370 sec
         peak memory usage: 688/22604 MB
WORKER 0: step 2/15600 loss: 3.670535, learning rate: 0.001000 
         time elapsed: 1.519805 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 3/15600 loss: 3.695373, learning rate: 0.001000 
         time elapsed: 1.608191 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 4/15600 loss: 3.627747, learning rate: 0.000999 
         time elapsed: 2.627351 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 5/15600 loss: 3.646843, learning rate: 0.000999 
         time elapsed: 1.744121 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 6/15600 loss: 3.629820, learning rate: 0.000999 
         time elapsed: 1.468782 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 7/15600 loss: 3.638589, learning rate: 0.000999 
         time elapsed: 1.179851 sec
         peak memory usage: 893/22604 MB
2018-08-21 07:20:52.899330: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c5b480c57e96eea5 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 8/15600 loss: 3.618140, learning rate: 0.000999 
         time elapsed: 1.060105 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 9/15600 loss: 3.596526, learning rate: 0.000999 
         time elapsed: 2.330892 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 10/15600 loss: 3.601762, learning rate: 0.000999 
         time elapsed: 1.179645 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 11/15600 loss: 3.615230, learning rate: 0.000998 
         time elapsed: 0.954928 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 12/15600 loss: 3.602469, learning rate: 0.000998 
         time elapsed: 1.251472 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 13/15600 loss: 3.601032, learning rate: 0.000998 
         time elapsed: 0.886364 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 14/15600 loss: 3.603386, learning rate: 0.000998 
         time elapsed: 1.565970 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 15/15600 loss: 3.622668, learning rate: 0.000998 
         time elapsed: 0.885060 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 16/15600 loss: 3.613049, learning rate: 0.000998 
         time elapsed: 1.336841 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 17/15600 loss: 3.610620, learning rate: 0.000997 
         time elapsed: 0.818247 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 18/15600 loss: 3.618244, learning rate: 0.000997 
         time elapsed: 0.795765 sec
         peak memory usage: 893/22604 MB
WORKER 1: step 17/15600 loss: 3.590606, learning rate: 0.000997 
         time elapsed: 8.500966 sec
         peak memory usage: 893/22604 MB

In contrast, when I set numbatches_to_aggregate=2 to use the sync replicas optimizer, I get an error message like:

Traceback (most recent call last):
  File "nabu/scripts/train.py", line 112, in <module>
    testing=False)
  File "nabu/scripts/train.py", line 90, in train
    tr.train(testing)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 607, in train
    outputs = self._create_graph()
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 187, in _create_graph
    cluster=cluster)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 569, in _update
    name='apply_gradients')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 238, in apply_gradients
    raise ValueError("Global step is required to check staleness")
ValueError: Global step is required to check staleness

So I added the global_step parameter to the apply_gradients_op call in the _update function:

# operation to apply the gradients
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    name='apply_gradients',
    global_step=global_step)

and started training, but no training log is printed anymore.
How should global_step be set for apply_gradients_op? @vrenkens
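For illustration, here is a minimal, self-contained sketch of passing a global step to SyncReplicasOptimizer.apply_gradients (TF 1.x). The dummy variable used as the loss and the replica counts are assumptions for the example, not nabu's trainer code:

import tensorflow as tf

# Create (or reuse) the graph's global step; SyncReplicasOptimizer needs it
# to check gradient staleness.
global_step = tf.train.get_or_create_global_step()

# Dummy "loss": a single trainable variable, just to have some gradients.
loss = tf.get_variable('w', shape=[], initializer=tf.ones_initializer())

optimizer = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.001),
    replicas_to_aggregate=2,
    total_num_replicas=2)

grads_and_vars = optimizer.compute_gradients(loss)
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    global_step=global_step,
    name='apply_gradients')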

@AzizCode92
Contributor

AzizCode92 commented Aug 21, 2018

Which OS are you training your model on?
I see almost the same output in my logs. My concern is about the

TF GPU device with id 0 was not registered

message. I'm not sure, but it can probably be ignored; I have seen a lot of logs with this message on GitHub and on Slack.

@vrenkens
Owner

About your first question: it is normal that 3 sessions are started when using 2 GPUs. There is one for each GPU worker and one for the parameter server that holds the parameters.
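For reference, a hypothetical TF 1.x cluster layout matching that description (two workers plus one parameter server); the host:port addresses below are placeholders, not nabu's actual cluster file:

import tensorflow as tf

# Two GPU workers and one parameter server, each of which ends up with its
# own master session.
cluster = tf.train.ClusterSpec({
    'worker': ['localhost:2223', 'localhost:2224'],
    'ps': ['localhost:2222'],
})

# Each process starts a server for its own job/task, e.g. worker 0:
server = tf.train.Server(cluster, job_name='worker', task_index=0)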

About the second question: I will check it out when I have the time

@fanlu
Author

fanlu commented Aug 21, 2018

@AzizCode92 I use Docker on CentOS 7.4, and I have also run into

TF GPU device with id 0 was not registered

@fanlu
Author

fanlu commented Aug 21, 2018

@vrenkens
Including the ps, the count is 4:

2018-08-21 08:29:04.230358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-21 08:29:04.312654: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 0314965d18d48a0e with config: 
2018-08-21 08:29:04.314325: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.314370: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317363: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317405: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.703447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-08-21 08:29:04.703531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-08-21 08:29:05.458095: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.458151: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462393: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462629: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470015: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470082: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 08:29:10.788696: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 843a2ba3e9062528 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 08:29:12.074345: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session b7a26beb96e7ca5a with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.883478, learning rate: 0.001000 
         time elapsed: 7.911287 sec
         peak memory usage: 22/22604 MB
WORKER 0: step 2/15600 loss: 3.766951, learning rate: 0.001000 
         time elapsed: 1.964389 sec
         peak memory usage: 731/22604 MB
WORKER 0: step 4/15600 loss: 3.677392, learning rate: 0.000999 
         time elapsed: 1.568714 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 6/15600 loss: 3.674108, learning rate: 0.000999 
         time elapsed: 2.856795 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 8/15600 loss: 3.673018, learning rate: 0.000999 
         time elapsed: 1.789929 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 10/15600 loss: 3.652077, learning rate: 0.000999 
         time elapsed: 1.303749 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 12/15600 loss: 3.645280, learning rate: 0.000998 
         time elapsed: 1.514070 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 14/15600 loss: 3.627352, learning rate: 0.000998 
         time elapsed: 1.308973 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 16/15600 loss: 3.628558, learning rate: 0.000998 
         time elapsed: 1.871332 sec
         peak memory usage: 884/22604 MB
2018-08-21 08:29:41.694464: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c98110034232ee37 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 18/15600 loss: 3.619331, learning rate: 0.000997 
         time elapsed: 1.154132 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 20/15600 loss: 3.636102, learning rate: 0.000997 
         time elapsed: 1.139632 sec
         peak memory usage: 884/22604 MB

@fanlu
Author

fanlu commented Aug 21, 2018

@vrenkens When I use 3 GPUs, the count is 5, so I guess that apart from the ps and the chief worker, the master session starts twice on each of the other workers.

@AzizCode92
Contributor

Can you please share the output of

ps aux | grep python

@fanlu
Author

fanlu commented Aug 21, 2018

@AzizCode92

root     3130930 3130927  0 16:28 pts/20   00:00:04 python nabu/scripts/prepare_train.py --recipe=config/recipes/LAS/TIMIT --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2 --mode=single_machine
root     3130987 3130930 22 16:28 pts/20   00:02:18 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=ps --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root     3130988 3130930 99 16:28 pts/20   00:34:05 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root     3130989 3130930 99 16:28 pts/20   00:32:22 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=1 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2

@vrenkens
Owner

I doubt that this is the case. I have only tested it with 2 workers and 1 ps, because I don't have bigger machines to play with. The overview of your jobs looks normal. From your first post it seems that both workers are working, no?

@AzizCode92
Contributor

See here: task_index=0 is shared between the worker and the ps.
I thought that the ps would be working on a different task.

@vrenkens
Owner

@AzizCode92 The job_name is different. The indices for different job names can be the same.

@fanlu
Author

fanlu commented Aug 21, 2018

@vrenkens Yes, when the first two master sessions start, only one worker is working; once the last master session starts, the second GPU starts working.

@vrenkens
Owner

That is pretty normal behavior; it can take a while before all workers are online.

@fanlu
Author

fanlu commented Aug 21, 2018

Yes. So please help me track down the error where the sync replicas optimizer can't apply gradients.
Also, why do you use a separate increment_step op? I find it odd.
How about tf.train.get_or_create_global_step()?
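To make the contrast concrete, a small sketch of the two patterns (TF 1.x); the variable names are just for illustration:

import tensorflow as tf

# (a) An explicit step variable with a separate increment op:
step = tf.get_variable('step', shape=[], dtype=tf.int64,
                       initializer=tf.zeros_initializer(), trainable=False)
increment_step = tf.assign_add(step, 1, name='increment_step')

# (b) The built-in helper; apply_gradients can increment it automatically
# when it is passed as global_step:
global_step = tf.train.get_or_create_global_step()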

@vrenkens
Owner

I will take a look when I have the time

@HawkAaron

Hi everyone,

According to the TF documentation, SyncReplicasOptimizer provides a sync_replicas_hook (created with make_session_run_hook) that should be managed by a MonitoredTrainingSession.
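For reference, a rough, self-contained sketch of that pattern in TF 1.x; it uses an in-process single-worker server and a dummy variable as the model, so it only illustrates the API, not nabu's trainer:

import tensorflow as tf

# Stand-in "cluster": a single in-process server.
server = tf.train.Server.create_local_server()

# Dummy model: one trainable variable acting as the loss.
loss = tf.get_variable('w', shape=[], initializer=tf.ones_initializer())
global_step = tf.train.get_or_create_global_step()

opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.001),
    replicas_to_aggregate=1,
    total_num_replicas=1)
train_op = opt.minimize(loss, global_step=global_step)

# The hook sets up the token queue; on the chief it also runs the init ops.
sync_hook = opt.make_session_run_hook(is_chief=True)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True,
                                       hooks=[sync_hook]) as sess:
    for _ in range(3):
        sess.run(train_op)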
