Minor cosmetic changes
shankgan committed Mar 1, 2021
1 parent 137ea6a commit b7fe9f9
Showing 4 changed files with 18 additions and 20 deletions.
26 changes: 13 additions & 13 deletions distribution_strategy/multi_worker_mirrored_strategy/README.md
@@ -41,7 +41,7 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet
cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```

4. Edit the `myjob.template.jinja` file to edit job parameters.
3. Edit the `myjob.template.jinja` file to edit job parameters.
1. `script` - which training program needs to be run. This should be either
`keras_mnist.py` or `custom_training_mnist.py` or `your_own_training_example.py`

@@ -51,9 +51,9 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet

4. `port` - the port used by tensorflow worker processes to communicate with each other

5. `model_checkpoint_dir` - directory where the model is checkpointed and saved from the chief worker process.
5. `checkpoint_pvc_name` - name of the persistent-volume-claim that will contain the checkpointed model.

6. `checkpoint_pvc_name` - name of the persistent-volume-claim which should be mounted at `model_checkpoint_dir`. This volume will contain the checkpointed model.
6. `model_checkpoint_dir` - mount location for inspecting the trained model in the volume inspector pod. Meant to be set if Volume inspector pod is mounted.

7. `image` - name of the docker image created in step 2 that needs to be loaded onto the cluster

@@ -63,25 +63,25 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet

10. `create_volume_inspector` - Create a pod to inspect the contents of the volume after the training job is complete. If this is `True`, `deploy` cannot be `True` since the checkpoint volume can be mounted as read-write by a single node. Inspection cannot happen when training is happening.

5. Run the job:
4. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. [Optional] First set `deploy` to `False`, `create_pvc_checkpoint` to `True` and set the name of `checkpoint_pvc_name` appropriately. Then run
2. [Optional: If Persistent volume does not already exist on cluster] First set `deploy` to `False`, `create_pvc_checkpoint` to `True` and set the name of `checkpoint_pvc_name` appropriately in the .jinja file. Then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create a persistent volume claim where you can checkpoint your image.
This will create a persistent volume claim where you can checkpoint your image. In GKE, this claim will auto-create a GCE persistent disk resource to back up the claim.

3. Set `deploy` to `True` with all parameters specified in step 4 and then run
3. Set `deploy` to `True`, `create_pvc_checkpoint` to `False`, with all parameters specified in step 4 and then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the Kubernetes jobs on the clusters. Each Job has a single service-endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running
@@ -101,17 +101,17 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet

4. Once the jobs are finished (based on the logs/output of kubectl get jobs),
the trained model can be inspected by a volume inspector pod. Set `deploy` to `False`
and `create_volume_inspector` to True. Then run
and `create_volume_inspector` to True. Also set `model_checkpoint_dir` to indicate location where trained model will be mounted. Then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl create -n <namespace> -f -
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

Then, access the pod through ssh
This will create the volume inspector pod. Then, access the pod through ssh

```sh
kubectl get pods -n <namespace>
kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/bash
kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/sh
```

The contents of the trained model are available for inspection at `model_checkpoint_dir`.
@@ -10,4 +10,4 @@ WORKDIR /app

COPY . /app/

ENTRYPOINT ["python", "/keras_mnist.py"]
ENTRYPOINT ["python", "keras_mnist.py"]
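
Note on the `ENTRYPOINT` change: the old argument `/keras_mnist.py` is an absolute path at the filesystem root, while `COPY . /app/` places the build context under `/app`. The new relative argument is resolved against the working directory set by `WORKDIR /app`. The sketch below only mimics that path resolution with Python's `posixpath`; it does not invoke Docker, the `resolve` helper is hypothetical, and it assumes `keras_mnist.py` sits at the top of the build context.

```python
import posixpath

# Working directory set by the Dockerfile's WORKDIR instruction.
WORKDIR = "/app"


def resolve(script_arg: str, cwd: str = WORKDIR) -> str:
    """Resolve a script argument the way a process started in `cwd` would.

    An absolute argument ignores `cwd`; a relative one is joined onto it.
    """
    return posixpath.normpath(posixpath.join(cwd, script_arg))


old_path = resolve("/keras_mnist.py")  # argument before this commit
new_path = resolve("keras_mnist.py")   # argument after this commit

print("old ENTRYPOINT looks for:", old_path)
print("new ENTRYPOINT looks for:", new_path)
```

Under that assumption, only the second path falls inside the directory that `COPY . /app/` populates.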
@@ -102,7 +102,7 @@ def main():
callbacks = [tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=write_filepath(strategy))]
with strategy.scope():
multi_worker_model = build_and_compile_cnn_model()
multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70,
multi_worker_model.fit(multi_worker_dataset, epochs=10, steps_per_epoch=70,
callbacks=callbacks)
multi_worker_model.save(filepath=write_filepath(strategy))

@@ -1,3 +1,4 @@
{%- set name = "tf-learning" -%}
{%- set image = "image-name" -%}
{%- set worker_replicas = 2 -%}
{%- set script = "keras_mnist.py" -%}
@@ -44,6 +45,7 @@ spec:
job: worker
task: "{{ i }}"
spec:
restartPolicy: Never
containers:
- name: tensorflow
image: {{ image }}
@@ -55,12 +57,9 @@ spec:
env:
- name: TF_CONFIG
value: '{"cluster": {"worker": [{{ worker_hosts() }}]}, "task": {"type": "worker", "index": {{ i }}}}'
args:
- "--model_checkpoint_dir={{ model_checkpoint_dir }}"
restartPolicy: Never
{% if i == 0 %}
volumeMounts:
- mountPath: "{{ model_checkpoint_dir }}"
- mountPath: /pvcmnt
name: pvc-mount
volumes:
- name: pvc-mount
@@ -103,7 +102,6 @@ spec:
resources:
limits:
memory: 512Mi
cpu: "1"
---
{% endif %}
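
For reference, the `TF_CONFIG` value that the template renders for each worker can be reproduced in Python. This is a minimal sketch, not the template's own code: the `worker_hosts()` macro body is not shown in this diff, so the `host:port` format, the host names, and port `5000` below are assumptions, and `build_tf_config` is a hypothetical helper. Only the JSON layout (`cluster.worker` list plus a `task` entry with `type` and `index`) is taken from the template line above.

```python
import json


def build_tf_config(worker_hosts, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`.

    Mirrors the layout in the template: every worker sees the same cluster
    list and differs only in its own task index.
    """
    return json.dumps(
        {
            "cluster": {"worker": list(worker_hosts)},
            "task": {"type": "worker", "index": index},
        }
    )


# Hypothetical addresses for worker_replicas = 2; real names come from the
# services the template creates, and the port from its `port` parameter.
hosts = [f"tf-learning-worker-{i}:5000" for i in range(2)]

configs = [build_tf_config(hosts, i) for i in range(len(hosts))]
for config in configs:
    print(config)
```

Each pod receives one of these strings through its `TF_CONFIG` environment variable; the worker with index 0 is the one the template gives the `pvc-mount` volume.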
