possible for distribute training #3

Open · zhangqijun opened this issue Sep 16, 2019 · 4 comments
@zhangqijun

I have 3 machines, each with one 2080 Ti. Do you have any suggestions about the training method and dataset loader (from tfrecords)?

@moono
Owner

moono commented Sep 17, 2019

I'm not an expert in distributed training systems, but I have tried (tested) some distributed training before, so here are my thoughts.

You could try the other strategies in tf.distribute.experimental besides tf.distribute.MirroredStrategy, but I recommend putting your GPUs in one machine and using tf.distribute.MirroredStrategy. Other strategies such as ParameterServerStrategy are a bit more difficult to set up than MirroredStrategy, and how the machines are organized (e.g. the network settings) affects performance.

If you want to try multi-machine training anyway, follow the guide in the old contrib README; see the Multi-worker Training section.
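
For the single-machine route, a minimal sketch of how a TF 1.x Estimator picks up MirroredStrategy (the model_fn and input_fn here are toy placeholders, not this repo's actual code):

import tensorflow as tf

# Toy model_fn/input_fn placeholders just to make the sketch complete;
# the real ones would come from the repo's train.py.
def model_fn(features, labels, mode):
    w = tf.get_variable('w', shape=[], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(features['x'] * w - labels))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    return tf.data.Dataset.from_tensors(({'x': [1.0]}, [0.0])).repeat(100)

# Mirror the model across every GPU visible on this one machine.
distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
estimator.train(input_fn=input_fn)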

@zhangqijun
Author

zhangqijun commented Sep 24, 2019

First of all, thank you for your help.

I changed train.py line 74 from

distribution = tf.contrib.distribute.MirroredStrategy()

to

distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=1)

and added

import os
import json

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["192.168.108.15:2222", "192.168.108.11:2222"],
    },
    "task": {"type": "worker", "index": 0}
})

at the beginning of the code. But it seems each machine starts its server as "localhost":2222 and they cannot communicate with each other. I'm trying to fix this problem. I also tried standalone client mode with

tf.contrib.distribute.run_standard_tensorflow_server().join()

but it has the same problem. Any thoughts about what I'm missing?
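
(For reference, a hedged sketch of how the per-machine TF_CONFIG is usually laid out for CollectiveAllReduceStrategy: every worker shares the same cluster spec, but each machine sets its own task index. set_tf_config is just an illustrative helper, not from this repo.)

import json
import os

WORKERS = ["192.168.108.15:2222", "192.168.108.11:2222"]

def set_tf_config(my_index):
    # Use my_index=0 on 192.168.108.15 and my_index=1 on 192.168.108.11;
    # only the "index" field differs between the two machines.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": WORKERS},
        "task": {"type": "worker", "index": my_index},
    })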


@zhangqijun
Author

Stupid me, I found I was using a global system proxy.
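
(A hedged guess at what that kind of fix looks like in practice: TensorFlow's gRPC channels can honor system proxy environment variables, so clearing them, or exempting the worker IPs, before the workers start is one common remedy. Which variables matter depends on the environment.)

import os

# Clear common proxy variables (assumption: these were routing the worker
# traffic through the proxy) and exempt the worker IPs explicitly.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)
os.environ["no_proxy"] = "192.168.108.15,192.168.108.11,localhost"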

@ucasiggcas

If I use my own dataset, should I put face/ in datasets/? That is, datasets/face/**.jpg?
