
sync optimizer problem #43

Open
fanlu opened this issue Aug 21, 2018 · 15 comments

@fanlu

fanlu commented Aug 21, 2018

Hi Vincent,
I have another problem. I use GPUs 0 and 1 for training with numbatches_to_aggregate=0 in the default standardtrainer.cfg config, but I see three "Start master session" lines in the log. Is this behavior correct?

2018-08-21 07:20:16.556159: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 07:20:22.265699: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 6066331584d0d493 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 07:20:22.711547: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 51400be6a5c9e9bc with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.890086, learning rate: 0.001000 
         time elapsed: 8.318348 sec
         peak memory usage: 21/22604 MB
WORKER 0: step 1/15600 loss: 3.776481, learning rate: 0.001000 
         time elapsed: 1.858370 sec
         peak memory usage: 688/22604 MB
WORKER 0: step 2/15600 loss: 3.670535, learning rate: 0.001000 
         time elapsed: 1.519805 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 3/15600 loss: 3.695373, learning rate: 0.001000 
         time elapsed: 1.608191 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 4/15600 loss: 3.627747, learning rate: 0.000999 
         time elapsed: 2.627351 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 5/15600 loss: 3.646843, learning rate: 0.000999 
         time elapsed: 1.744121 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 6/15600 loss: 3.629820, learning rate: 0.000999 
         time elapsed: 1.468782 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 7/15600 loss: 3.638589, learning rate: 0.000999 
         time elapsed: 1.179851 sec
         peak memory usage: 893/22604 MB
2018-08-21 07:20:52.899330: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c5b480c57e96eea5 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 8/15600 loss: 3.618140, learning rate: 0.000999 
         time elapsed: 1.060105 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 9/15600 loss: 3.596526, learning rate: 0.000999 
         time elapsed: 2.330892 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 10/15600 loss: 3.601762, learning rate: 0.000999 
         time elapsed: 1.179645 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 11/15600 loss: 3.615230, learning rate: 0.000998 
         time elapsed: 0.954928 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 12/15600 loss: 3.602469, learning rate: 0.000998 
         time elapsed: 1.251472 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 13/15600 loss: 3.601032, learning rate: 0.000998 
         time elapsed: 0.886364 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 14/15600 loss: 3.603386, learning rate: 0.000998 
         time elapsed: 1.565970 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 15/15600 loss: 3.622668, learning rate: 0.000998 
         time elapsed: 0.885060 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 16/15600 loss: 3.613049, learning rate: 0.000998 
         time elapsed: 1.336841 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 17/15600 loss: 3.610620, learning rate: 0.000997 
         time elapsed: 0.818247 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 18/15600 loss: 3.618244, learning rate: 0.000997 
         time elapsed: 0.795765 sec
         peak memory usage: 893/22604 MB
WORKER 1: step 17/15600 loss: 3.590606, learning rate: 0.000997 
         time elapsed: 8.500966 sec
         peak memory usage: 893/22604 MB

In contrast, when I set numbatches_to_aggregate=2 to use the sync replicas optimizer, I get an error message like:

Traceback (most recent call last):
  File "nabu/scripts/train.py", line 112, in <module>
    testing=False)
  File "nabu/scripts/train.py", line 90, in train
    tr.train(testing)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 607, in train
    outputs = self._create_graph()
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 187, in _create_graph
    cluster=cluster)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 569, in _update
    name='apply_gradients')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 238, in apply_gradients
    raise ValueError("Global step is required to check staleness")
ValueError: Global step is required to check staleness

So I added the global_step parameter to the apply_gradients_op call in the _update function:

# operation to apply the gradients
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    name='apply_gradients',
    global_step=global_step)

and started training, but no training log is printed anymore.
How should global_step be set for apply_gradients_op? @vrenkens
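For illustration, here is a minimal, self-contained sketch of passing a global step to SyncReplicasOptimizer.apply_gradients (TF 1.x). The dummy variable used as the loss and the replica counts are assumptions for the example, not nabu's trainer code:

import tensorflow as tf

# Create (or reuse) the graph's global step; SyncReplicasOptimizer needs it
# to check gradient staleness.
global_step = tf.train.get_or_create_global_step()

# Dummy "loss": a single trainable variable, just to have some gradients.
loss = tf.get_variable('w', shape=[], initializer=tf.ones_initializer())

optimizer = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.001),
    replicas_to_aggregate=2,
    total_num_replicas=2)

grads_and_vars = optimizer.compute_gradients(loss)
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    global_step=global_step,
    name='apply_gradients')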

@AzizCode92
Contributor

AzizCode92 commented Aug 21, 2018

Which OS are you training your model on?
I see almost the same output in my logs. My concern is about the

TF GPU device with id 0 was not registered

message. I'm not sure, but it can probably be ignored; I have seen a lot of logs with this message on GitHub and on Slack.

@vrenkens
Owner

About your first question: it is normal that 3 sessions are started when using 2 GPUs. There is one for each GPU worker and one for the parameter server that holds the parameters.
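For reference, a hypothetical TF 1.x cluster layout matching that description (two workers plus one parameter server); the host:port addresses below are placeholders, not nabu's actual cluster file:

import tensorflow as tf

# Two GPU workers and one parameter server, each of which ends up with its
# own master session.
cluster = tf.train.ClusterSpec({
    'worker': ['localhost:2223', 'localhost:2224'],
    'ps': ['localhost:2222'],
})

# Each process starts a server for its own job/task, e.g. worker 0:
server = tf.train.Server(cluster, job_name='worker', task_index=0)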

About the second question: I will check it out when I have the time

@fanlu
Author

fanlu commented Aug 21, 2018

@AzizCode92 I use Docker on CentOS 7.4, and I have also run into

TF GPU device with id 0 was not registered

@fanlu
Author

fanlu commented Aug 21, 2018

@vrenkens
Including the ps, the count is 4:

2018-08-21 08:29:04.230358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-21 08:29:04.312654: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 0314965d18d48a0e with config: 
2018-08-21 08:29:04.314325: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.314370: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317363: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317405: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.703447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-08-21 08:29:04.703531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-08-21 08:29:05.458095: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.458151: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462393: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462629: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470015: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470082: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 08:29:10.788696: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 843a2ba3e9062528 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 08:29:12.074345: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session b7a26beb96e7ca5a with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.883478, learning rate: 0.001000 
         time elapsed: 7.911287 sec
         peak memory usage: 22/22604 MB
WORKER 0: step 2/15600 loss: 3.766951, learning rate: 0.001000 
         time elapsed: 1.964389 sec
         peak memory usage: 731/22604 MB
WORKER 0: step 4/15600 loss: 3.677392, learning rate: 0.000999 
         time elapsed: 1.568714 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 6/15600 loss: 3.674108, learning rate: 0.000999 
         time elapsed: 2.856795 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 8/15600 loss: 3.673018, learning rate: 0.000999 
         time elapsed: 1.789929 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 10/15600 loss: 3.652077, learning rate: 0.000999 
         time elapsed: 1.303749 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 12/15600 loss: 3.645280, learning rate: 0.000998 
         time elapsed: 1.514070 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 14/15600 loss: 3.627352, learning rate: 0.000998 
         time elapsed: 1.308973 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 16/15600 loss: 3.628558, learning rate: 0.000998 
         time elapsed: 1.871332 sec
         peak memory usage: 884/22604 MB
2018-08-21 08:29:41.694464: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c98110034232ee37 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 18/15600 loss: 3.619331, learning rate: 0.000997 
         time elapsed: 1.154132 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 20/15600 loss: 3.636102, learning rate: 0.000997 
         time elapsed: 1.139632 sec
         peak memory usage: 884/22604 MB

@fanlu
Author

fanlu commented Aug 21, 2018

@vrenkens When I use 3 GPUs, the count is 5, so I guess that apart from the ps and the chief worker, the master session starts twice on each of the other workers.

@AzizCode92
Contributor

Can you please share the output of

ps aux | grep python

@fanlu
Author

fanlu commented Aug 21, 2018

@AzizCode92

root     3130930 3130927  0 16:28 pts/20   00:00:04 python nabu/scripts/prepare_train.py --recipe=config/recipes/LAS/TIMIT --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2 --mode=single_machine
root     3130987 3130930 22 16:28 pts/20   00:02:18 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=ps --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root     3130988 3130930 99 16:28 pts/20   00:34:05 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root     3130989 3130930 99 16:28 pts/20   00:32:22 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=1 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2

@vrenkens
Owner

I doubt that this is the case. I have only tested it with 2 workers and 1 ps, because I don't have bigger machines to play with. The overview of your jobs looks normal. From your first post it seems that both workers are working, no?

@AzizCode92
Contributor

See here: task_index=0 is shared between the worker and the ps.
I thought that the ps would be working on a different task.

@vrenkens
Owner

@AzizCode92 The job_name is different. The indices for different job names can be the same.

@fanlu
Author

fanlu commented Aug 21, 2018

@vrenkens Yes, when the first two master sessions start, only one worker is working; once the last master session starts, the second GPU starts working.

@vrenkens
Owner

That is pretty normal behavior; it can take a while before all workers are online.

@fanlu
Author

fanlu commented Aug 21, 2018

Yes. So please help me track down the error where the sync replicas optimizer can't apply gradients.
Also, why do you use a separate increment_step op? I find it odd.
How about tf.train.get_or_create_global_step()?
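To make the contrast concrete, a small sketch of the two patterns (TF 1.x); the variable names are just for illustration:

import tensorflow as tf

# (a) An explicit step variable with a separate increment op:
step = tf.get_variable('step', shape=[], dtype=tf.int64,
                       initializer=tf.zeros_initializer(), trainable=False)
increment_step = tf.assign_add(step, 1, name='increment_step')

# (b) The built-in helper; apply_gradients can increment it automatically
# when it is passed as global_step:
global_step = tf.train.get_or_create_global_step()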

@vrenkens
Owner

I will take a look when I have the time

@HawkAaron

Hi everyone,

According to the TF documentation, SyncReplicasOptimizer provides a sync_replicas_hook (created with make_session_run_hook) that should be managed by a MonitoredTrainingSession.
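For reference, a rough, self-contained sketch of that pattern in TF 1.x; it uses an in-process single-worker server and a dummy variable as the model, so it only illustrates the API, not nabu's trainer:

import tensorflow as tf

# Stand-in "cluster": a single in-process server.
server = tf.train.Server.create_local_server()

# Dummy model: one trainable variable acting as the loss.
loss = tf.get_variable('w', shape=[], initializer=tf.ones_initializer())
global_step = tf.train.get_or_create_global_step()

opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.001),
    replicas_to_aggregate=1,
    total_num_replicas=1)
train_op = opt.minimize(loss, global_step=global_step)

# The hook sets up the token queue; on the chief it also runs the init ops.
sync_hook = opt.make_session_run_hook(is_chief=True)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True,
                                       hooks=[sync_hook]) as sess:
    for _ in range(3):
        sess.run(train_op)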
