Adding example to train the resnet56 model using MultiworkerMirroredT…
…raining example on the cifar-10 dataset
shankgan committed May 14, 2021
1 parent 2fb34e4 commit 188b8cd
Showing 5 changed files with 670 additions and 6 deletions.
120 changes: 115 additions & 5 deletions distribution_strategy/multi_worker_mirrored_strategy/README.md
@@ -4,7 +4,7 @@
The steps below train models using [MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) with the TensorFlow 2.x API on Kubernetes.

Reference programs such as [keras_mnist.py](examples/keras_mnist.py),
[custom_training_mnist.py](examples/custom_training_mnist.py), and [keras_resnet_cifar.py](examples/keras_resnet_cifar.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.

@@ -28,14 +28,39 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet

5. Install [Docker](https://docs.docker.com/get-docker/) for your system, while also creating an account that you can associate with your container images.

6. For the MNIST examples, model storage and checkpointing require a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can be mounted onto the chief worker pod. The steps below include the YAML to create a persistent-volume-claim for GKE backed by GCEPersistentDisk.

### Additional prerequisites for resnet56 example

1. Create a
[service account](https://cloud.google.com/compute/docs/access/service-accounts)
and download its key file in JSON format. Assign the Storage Admin role for
[Google Cloud Storage](https://cloud.google.com/storage/) to this service account:

```bash
gcloud iam service-accounts create <service_account_id> --display-name="<display_name>"
```

```bash
gcloud projects add-iam-policy-binding <project_id> \
--member="serviceAccount:<service_account_id>@<project_id>.iam.gserviceaccount.com" \
--role="roles/storage.admin"
```
2. Create a Kubernetes secret from the JSON key file of your service account:

```bash
kubectl create secret generic credential --from-file=key.json=<path_to_json_file>
```

3. For GPU-based training, ensure your Kubernetes cluster has a node pool with GPUs enabled.
The steps to achieve this on GKE are available [here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus).
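
Before creating the secret in step 2, it can help to confirm that the downloaded file really is a service-account key. The sketch below is a minimal check and is not part of this repository; the helper name `validate_key` and the choice of which fields to check are assumptions made for illustration.

```python
import json

# Fields that a Google Cloud service-account key file is expected to contain.
# Checking only these four is an illustrative choice, not an exhaustive schema.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_key(key):
    """Return the service-account e-mail if the parsed key looks valid.

    `key` is the dictionary obtained by parsing the JSON key file.
    Raises ValueError when a required field is missing or the type is wrong.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']!r}")
    return key["client_email"]


def validate_key_file(path):
    """Load a JSON key file from `path` and validate its contents."""
    with open(path) as handle:
        return validate_key(json.load(handle))
```

Calling `validate_key_file("<path_to_json_file>")` with the same path passed to `kubectl create secret` returns the account e-mail, which should match the service account created in step 1.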

## Steps to train mnist examples

1. Follow the instructions for building and pushing the Docker image to a docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `MultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
@@ -114,4 +139,89 @@ here are instructions to [create GKE clusters](https://cloud.google.com/kubernet
kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/sh
```

The contents of the trained model are available for inspection at `model_checkpoint_dir`.

## Steps to train resnet examples

1. Follow the instructions for building and pushing the Docker image using `Dockerfile.gpu` to a docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `EnhancedMultiWorkerMirroredTemplate.yaml.jinja`

```sh
cp kubernetes/EnhancedMultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```
3. Create three buckets, one each for the dataset, the model checkpoints, and the training logs, using either the GCP web UI or the gsutil tool (included with the gcloud tool you installed above):

```bash
gsutil mb gs://<bucket_name>
```
You will use these bucket names to set `data_dir`, `log_dir` and `model_dir` when editing the job parameters in step 5.


4. Download the CIFAR-10 data and place it in your `data_dir` bucket. The [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory provides the download script used below. Alternatively, use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself.

```bash
python cifar10_download_and_extract.py
```

Upload the contents of the `cifar-10-batches-bin` directory to your `data_dir` bucket:

```bash
gsutil -m cp cifar-10-batches-bin/* gs://<your_data_dir>/
```

5. Edit the `myjob.template.jinja` file to set the job parameters:
1. `script` - which training program needs to be run. This should be either
`keras_resnet_cifar.py` or `your_own_training_example.py`

2. `name` - the prefix attached to all the Kubernetes jobs created

3. `worker_replicas` - number of parallel worker processes that train the example

4. `port` - the port used by tensorflow worker processes to communicate with each other.

5. `model_dir` - the GCP bucket path that stores the model checkpoints `gs://model_dir/`

6. `image` - name of the docker image created in step 1 that needs to be loaded onto the cluster

7. `log_dir` - the GCP bucket path where the logs are stored `gs://log_dir/`

8. `data_dir` - the GCP bucket path for the Cifar-10 dataset `gs://data_dir/`

9. `gcp_credential_secret` - the name of the secret created in the Kubernetes cluster that contains the service account credentials

10. `batch_size` - the global batch size used for training

11. `num_train_epoch` - the number of training epochs

6. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. Deploy the training workloads in the cluster

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This creates the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running:

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```

By default, this also deploys TensorBoard on the cluster. To find its service, run:

```sh
kubectl get services -n <namespace> | grep tensorboard
```

Note the external IP of the service and the `port` previously configured in the YAML.
The TensorBoard service should be accessible through the web at `http://tensorboard-external-ip:port`.

3. The final model should be available in the GCP bucket corresponding to the `model_dir` configured in the YAML.
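
Because `batch_size` above is the global batch size, each replica processes only a share of it. The arithmetic can be sketched as follows, assuming one replica per GPU and a hypothetical `gpus_per_worker` value that is not a parameter of the template:

```python
def per_replica_batch_size(global_batch_size, worker_replicas, gpus_per_worker=1):
    """Approximate share of the global batch handled by each replica.

    Assumptions (not taken from the template): every worker has the same
    number of GPUs, each GPU is one replica, and a CPU-only worker counts
    as a single replica.
    """
    replicas_in_sync = worker_replicas * max(gpus_per_worker, 1)
    # Integer division: this sketch assumes the global batch divides evenly.
    return global_batch_size // replicas_in_sync
```

For instance, a global batch of 128 spread over 2 workers with one GPU each leaves 64 examples per replica, so raising `worker_replicas` without raising `batch_size` shrinks the per-replica batch.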
@@ -0,0 +1,30 @@
FROM tensorflow/tensorflow:2.3.1-gpu-jupyter

# Refresh the package index first; -y keeps the build non-interactive
RUN apt-get update && \
apt-get install -y python3 python3-pip

RUN pip3 install absl-py && \
pip3 install portpicker

# Install git and vim
RUN apt-get update && \
apt-get install -y git && \
apt-get install -y vim

WORKDIR /app

RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \
mv models tensorflow_models && \
git clone https://github.com/tensorflow/model-optimization.git && \
mv model-optimization tensorflow_model_optimization

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

COPY . /app/

ENV PYTHONPATH="${PYTHONPATH}:/:/app/tensorflow_models"

CMD ["python", "resnet_cifar_multiworker_strategy_keras.py"]
@@ -3,11 +3,13 @@
This directory contains examples of MultiWorkerMirrored training along with the Dockerfiles to build them.

- [Dockerfile](Dockerfile) contains all dependencies required to build a container image using docker with the training examples
- [Dockerfile.gpu](Dockerfile.gpu) contains all dependencies required to build a container image using docker with GPU support and the TensorFlow Model Garden
- [keras_mnist.py](mnist.py) demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](mnist.py) demonstrates how to train a fashion MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and Tensorflow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).

- [keras_resnet_cifar.py](keras_resnet_cifar.py) demonstrates how to train the resnet56 model on the Cifar-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that
@@ -51,3 +53,10 @@ The [custom_training_mnist.py](mnist.py) example demonstrates how to train a fas
[tf.distribute.MultiWorkerMirroredStrategy and Tensorflow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
The final model is saved to disk by the chief worker process. The disk is assumed to be mounted onto the running container by the cluster manager.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.

## Running the keras_resnet_cifar.py example

The [keras_resnet_cifar.py](keras_resnet_cifar.py) example demonstrates how to train a ResNet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to a Google Cloud Storage bucket.
The example assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
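
To make the role of `TF_CONFIG` concrete, the sketch below parses a value of the documented shape (a `cluster` map plus a `task` entry) and reports what a worker can learn from it. The helper name `describe_task` and the host names are hypothetical; the example script itself may read the variable differently.

```python
import json


def describe_task(tf_config_json):
    """Summarize a TF_CONFIG string: cluster size, own address, chief status."""
    config = json.loads(tf_config_json)
    cluster = config["cluster"]
    task_type = config["task"]["type"]
    task_index = config["task"]["index"]

    num_workers = len(cluster.get("chief", [])) + len(cluster.get("worker", []))
    # When the cluster has no dedicated "chief" job, worker 0 takes that role.
    is_chief = task_type == "chief" or (
        task_type == "worker" and task_index == 0 and "chief" not in cluster
    )
    return {
        "num_workers": num_workers,
        "address": cluster[task_type][task_index],
        "is_chief": is_chief,
    }


# Hypothetical two-worker cluster, seen from the second worker.
EXAMPLE_TF_CONFIG = json.dumps({
    "cluster": {"worker": ["worker-0.example:5000", "worker-1.example:5000"]},
    "task": {"type": "worker", "index": 1},
})

print(describe_task(EXAMPLE_TF_CONFIG))
```

Inside a running pod, the same function could be called with `os.environ["TF_CONFIG"]` to log which worker is acting as chief.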