possible for distribute training #3

Open · zhangqijun opened this issue Sep 16, 2019 · 4 comments
@zhangqijun

I have 3 machines, each with one 2080 Ti. Do you have any suggestions about the training method and dataset loader (from tfrecords)?

@moono
Owner

moono commented Sep 17, 2019

I'm not an expert in distributed training systems, but I have tried (tested) some distributed training before, so here are my thoughts.

You could try the other strategies in tf.distribute.experimental besides tf.distribute.MirroredStrategy, but I recommend putting your GPUs in one machine and using tf.distribute.MirroredStrategy. Other strategies such as ParameterServerStrategy are a bit more difficult to set up than MirroredStrategy, and how the machines are organized (e.g. the network settings) affects performance.

If you want to try multi-machine training anyway, follow the guide in the old contrib README; see the Multi-worker Training section.
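
For the single-machine route, a minimal sketch of how a TF 1.x Estimator picks up MirroredStrategy (the model_fn and input_fn here are toy placeholders, not this repo's actual code):

import tensorflow as tf

# Toy model_fn/input_fn placeholders just to make the sketch complete;
# the real ones would come from the repo's train.py.
def model_fn(features, labels, mode):
    w = tf.get_variable('w', shape=[], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(features['x'] * w - labels))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    return tf.data.Dataset.from_tensors(({'x': [1.0]}, [0.0])).repeat(100)

# Mirror the model across every GPU visible on this one machine.
distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
estimator.train(input_fn=input_fn)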

@zhangqijun
Author

zhangqijun commented Sep 24, 2019

First of all, thank you for your help.

I changed train.py line 74 from

distribution = tf.contrib.distribute.MirroredStrategy()

to

distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=1)

and added

import os
import json

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["192.168.108.15:2222", "192.168.108.11:2222"],
    },
    "task": {"type": "worker", "index": 0}
})

at the beginning of the code. But it seems each machine starts its server as "localhost":2222 and they cannot communicate with each other. I'm trying to fix this problem. I also tried standalone client mode with

tf.contrib.distribute.run_standard_tensorflow_server().join()

but it has the same problem. Any thoughts about what I'm missing?
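
(For reference, a hedged sketch of how the per-machine TF_CONFIG is usually laid out for CollectiveAllReduceStrategy: every worker shares the same cluster spec, but each machine sets its own task index. set_tf_config is just an illustrative helper, not from this repo.)

import json
import os

WORKERS = ["192.168.108.15:2222", "192.168.108.11:2222"]

def set_tf_config(my_index):
    # Use my_index=0 on 192.168.108.15 and my_index=1 on 192.168.108.11;
    # only the "index" field differs between the two machines.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": WORKERS},
        "task": {"type": "worker", "index": my_index},
    })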


@zhangqijun
Author

Stupid me, I found I was using a global system proxy.
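
(A hedged guess at what that kind of fix looks like in practice: TensorFlow's gRPC channels can honor system proxy environment variables, so clearing them, or exempting the worker IPs, before the workers start is one common remedy. Which variables matter depends on the environment.)

import os

# Clear common proxy variables (assumption: these were routing the worker
# traffic through the proxy) and exempt the worker IPs explicitly.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)
os.environ["no_proxy"] = "192.168.108.15,192.168.108.11,localhost"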

@ucasiggcas

If I use my own dataset, should I put face/ in datasets/? That is, datasets/face/**.jpg?
