Adding an example to train the ResNet56 model on the CIFAR-10 dataset using MultiWorkerMirroredStrategy

shankgan committed May 19, 2021
1 parent 2fb34e4 commit 598bb4d
Showing 5 changed files with 670 additions and 6 deletions.
120 changes: 115 additions & 5 deletions distribution_strategy/multi_worker_mirrored_strategy/README.md
The steps below train models with [MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) and the TensorFlow 2.x API on the Kubernetes platform.

Reference programs such as [keras_mnist.py](examples/keras_mnist.py),
[custom_training_mnist.py](examples/custom_training_mnist.py), and [keras_resnet_cifar.py](examples/keras_resnet_cifar.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.


5. Install [Docker](https://docs.docker.com/get-docker/) for your system, and create an account that you can associate with your container images.

6. For the MNIST examples, model storage and checkpointing require a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can be mounted onto the chief worker pod. The steps below include the YAML to create a persistent-volume-claim on GKE backed by GCEPersistentDisk.

### Additional prerequisites for the resnet56 example

1. Create a
[service account](https://cloud.google.com/compute/docs/access/service-accounts)
and download its key file in JSON format. Assign the Storage Admin role for
[Google Cloud Storage](https://cloud.google.com/storage/) to this service account:

```bash
gcloud iam service-accounts create <service_account_id> --display-name="<display_name>"
```

```bash
gcloud projects add-iam-policy-binding <project_id> \
  --member="serviceAccount:<service_account_id>@<project_id>.iam.gserviceaccount.com" \
  --role="roles/storage.admin"
```
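
The JSON key file referenced in step 2 can then be downloaded with the `keys create` subcommand, for example:

```bash
gcloud iam service-accounts keys create key.json \
  --iam-account="<service_account_id>@<project_id>.iam.gserviceaccount.com"
```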
2. Create a Kubernetes secret from the JSON key file of your service account:

```bash
kubectl create secret generic credential --from-file=key.json=<path_to_json_file>
```
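
To sanity-check the secret (the output lists the key name but not its contents):

```bash
kubectl describe secret credential
```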

3. For GPU-based training, ensure your Kubernetes cluster has a node pool with GPUs enabled.
The steps to achieve this on GKE are available [here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus).
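
As a sketch, adding a GPU node pool to an existing GKE cluster might look like the following; the pool name, accelerator type, and counts are illustrative, so pick ones available in your zone:

```bash
# Add a node pool with one NVIDIA T4 GPU per node (illustrative values).
gcloud container node-pools create gpu-pool \
  --cluster=<cluster_name> \
  --zone=<zone> \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=2

# GKE also requires the NVIDIA device drivers on the nodes, e.g. via the
# driver-installer DaemonSet described in the guide linked above.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```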

## Steps to train the MNIST examples

1. Follow the instructions for building and pushing the Docker image to a docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `MultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```

…

```sh
kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/sh
```

The contents of the trained model are available for inspection at `model_checkpoint_dir`.

## Steps to train the resnet56 example

1. Follow the instructions for building and pushing the Docker image using `Dockerfile.gpu` to a Docker registry
in the [Docker README](examples/README.md).
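
    A typical build-and-push sequence, run from the examples directory, might look like this (the image name and tag are hypothetical):

    ```bash
    docker build -f Dockerfile.gpu -t <docker_username>/resnet-cifar:v1 .
    docker push <docker_username>/resnet-cifar:v1
    ```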

2. Copy the template file `EnhancedMultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/EnhancedMultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```
3. Create three buckets for model data, checkpoints, and training logs using either the GCP web UI or the gsutil tool (included with the gcloud tool you installed above):

```bash
gsutil mb gs://<bucket_name>
```
You will use these bucket names for `data_dir`, `log_dir`, and `model_dir` in steps 4 and 5 below.
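
For instance, one run of this step might look like the following; bucket names are globally unique, so these are hypothetical:

```bash
gsutil mb gs://my-resnet-cifar-data
gsutil mb gs://my-resnet-cifar-ckpts
gsutil mb gs://my-resnet-cifar-logs
```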


4. Download the CIFAR-10 data and place it in your `data_dir` bucket. Head to the [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory to obtain the CIFAR-10 data. Alternatively, you can use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself.

```bash
python cifar10_download_and_extract.py
```

Upload the contents of the `cifar-10-batches-bin` directory to your `data_dir` bucket.

```bash
gsutil -m cp cifar-10-batches-bin/* gs://<your_data_dir>/
```
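
You can confirm the upload (the listing should show the `data_batch_*.bin` and `test_batch.bin` files):

```bash
gsutil ls gs://<your_data_dir>/
```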

5. Edit the `myjob.template.jinja` file to set the job parameters:
    1. `script` - the training program to run. This should be either
    `keras_resnet_cifar.py` or `your_own_training_example.py`

    2. `name` - the prefix attached to all the Kubernetes jobs created

    3. `worker_replicas` - the number of parallel worker processes that train the example

    4. `port` - the port used by the TensorFlow worker processes to communicate with each other

    5. `model_dir` - the GCP bucket path that stores the model checkpoints, e.g. `gs://model_dir/`

    6. `image` - the name of the Docker image created in step 1 that needs to be loaded onto the cluster

    7. `log_dir` - the GCP bucket path where the logs are stored, e.g. `gs://log_dir/`

    8. `data_dir` - the GCP bucket path for the CIFAR-10 dataset, e.g. `gs://data_dir/`

    9. `gcp_credential_secret` - the name of the Kubernetes secret that contains the service account credentials

    10. `batch_size` - the global batch size used for training

    11. `num_train_epoch` - the number of training epochs

6. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. Deploy the training workloads in the cluster

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running:

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```
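
To follow the training output of an individual worker, list the pods behind the jobs and stream one pod's logs:

```bash
kubectl get pods -n <namespace>
kubectl logs -f -n <namespace> <worker-pod-name>
```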

By default, this also deploys TensorBoard on the cluster.

```sh
kubectl get services -n <namespace> | grep tensorboard
```

Note the external IP corresponding to the service and the previously configured `port` in the YAML.
The TensorBoard service should be reachable at `http://<tensorboard-external-ip>:<port>`.
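
One way to read the external IP once it is assigned (this assumes the template exposes TensorBoard as a LoadBalancer service; the service name is whatever the grep above printed):

```bash
kubectl get service <tensorboard-service> -n <namespace> \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```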

3. The final model should be available in the GCP bucket corresponding to the `model_dir` configured in the YAML.
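
    To pull the trained model down for local inspection, something like:

    ```bash
    mkdir -p trained_model
    gsutil -m cp -r gs://<model_dir>/* trained_model/
    ```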
30 changes: 30 additions & 0 deletions distribution_strategy/multi_worker_mirrored_strategy/examples/Dockerfile.gpu
FROM tensorflow/tensorflow:2.3.1-gpu-jupyter

RUN apt-get update && \
    apt-get install -y python3 python3-pip

RUN pip3 install absl-py portpicker

# Install git and vim
RUN apt-get update && \
    apt-get install -y git vim

WORKDIR /app

RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \
mv models tensorflow_models && \
git clone https://github.com/tensorflow/model-optimization.git && \
mv model-optimization tensorflow_model_optimization

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

COPY . /app/

ENV PYTHONPATH="${PYTHONPATH}:/:/app/tensorflow_models"

CMD ["python", "resnet_cifar_multiworker_strategy_keras.py"]
distribution_strategy/multi_worker_mirrored_strategy/examples/README.md
This directory contains examples of MultiWorkerMirrored training along with the Dockerfiles to build them

- [Dockerfile](Dockerfile) contains all dependencies required to build a container image using docker with the training examples
- [Dockerfile.gpu](Dockerfile.gpu) contains all dependencies required to build a GPU-enabled container image using docker with the TensorFlow Model Garden
- [keras_mnist.py](keras_mnist.py) demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](custom_training_mnist.py) demonstrates how to train a fashion MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and TensorFlow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
- [keras_resnet_cifar.py](keras_resnet_cifar.py) demonstrates how to train the ResNet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).

## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that …

The [custom_training_mnist.py](custom_training_mnist.py) example demonstrates how to train a fashion MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and TensorFlow 2.0 Custom Training Loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
The final model is saved to disk by the chief worker process. The disk is assumed to be mounted onto the running container by the cluster manager.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.

## Running the keras_resnet_cifar.py example

The [keras_resnet_cifar.py](keras_resnet_cifar.py) example demonstrates how to train a ResNet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to the GCP storage bucket configured as `model_dir`.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
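
For illustration, a `TF_CONFIG` value for a two-worker job might look like the following; the host names and port are hypothetical, and in the cluster the jinja template generates the real value:

```bash
export TF_CONFIG='{"cluster": {"worker": ["myjob-worker-0:5000", "myjob-worker-1:5000"]}, "task": {"type": "worker", "index": 0}}'
```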