The Cloud TPU Monitoring Debugging repository contains all the infrastructure and logic required to monitor and debug jobs running on Cloud TPU.
Terraform is used to deploy resources in a Google Cloud project. Terraform is an open-source tool for setting up and managing Google Cloud infrastructure from configuration files. This repository helps customers deploy various Google Cloud resources via script, without manual effort.
The cloud-tpu-diagnostics PyPI package contains all the logic to monitor, debug, and profile jobs running on Cloud TPU.
- Follow this link to install Terraform on your desktop.
- Run terraform init to initialize the Google Cloud Terraform provider. This command adds the necessary plugins and builds the .terraform directory.
- If the Terraform Google Cloud provider version is updated, run terraform init --upgrade for the update to take effect.
- You can also run terraform plan to validate resource declarations and catch syntax errors or version mismatches before deploying the resources.
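The commands above can be chained into a small preflight helper. This is a minimal sketch, not something the repository ships: the function name tf_preflight is hypothetical, the extra terraform validate step is an addition, and the -chdir option assumes Terraform 0.14 or later.

```shell
# Hypothetical helper (not part of this repository): initialize a Terraform
# directory, then check and preview its configuration without deploying.
# Usage: tf_preflight <terraform-directory>
tf_preflight() (
  dir="${1:?usage: tf_preflight <terraform-directory>}"
  # -chdir makes Terraform behave as if it were started inside "$dir".
  terraform -chdir="$dir" init &&      # download plugins, build .terraform
  terraform -chdir="$dir" validate &&  # catch syntax and declaration errors
  terraform -chdir="$dir" plan         # preview changes; nothing is deployed
)
```

For example, tf_preflight gcp_resources/gce previews the GCE deployment described later in this document. Terraform may still prompt for input variables during the plan step.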
By default, Terraform stores state locally in a file named terraform.tfstate. This default configuration can make Terraform difficult for teams to use, especially when many users run Terraform at the same time and each machine has its own understanding of the current infrastructure. To help avoid such issues, this section configures a remote state that points to a Google Cloud Storage (GCS) bucket.
- In Cloud Shell, create the GCS bucket:
  gsutil mb gs://${GCS_BUCKET_NAME}
- Enable Object Versioning to keep the history of your deployments. Enabling Object Versioning increases storage costs, which you can mitigate by configuring Object Lifecycle Management to delete old state versions.
  gsutil versioning set on gs://${GCS_BUCKET_NAME}
- Enter the name of the GCS bucket created above when you run terraform init to initialize Terraform.
  Initializing the backend...
  bucket
    The name of the Google Cloud Storage bucket
    Enter a value: <GCS_BUCKET_NAME>
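The three steps above can also be scripted. The sketch below is illustrative only: the function names, the default of keeping five noncurrent versions, and the use of -backend-config to answer the bucket prompt are assumptions, not something this repository provides. It also assumes that bucket is the only backend setting the repository leaves unset.

```shell
# Hypothetical helpers (not part of this repository).

# Create a versioned GCS bucket for Terraform state and cap stored history.
# Usage: create_state_bucket <bucket-name> [noncurrent-versions-to-keep]
create_state_bucket() (
  bucket="${1:?usage: create_state_bucket <bucket-name> [versions-to-keep]}"
  keep="${2:-5}"  # assumed default; tune to your retention needs
  rules="$(mktemp)"
  # Lifecycle rule: delete a noncurrent object version once at least
  # $keep newer versions of the same object exist.
  cat > "$rules" <<EOF
{"rule": [{"action": {"type": "Delete"},
           "condition": {"isLive": false, "numNewerVersions": ${keep}}}]}
EOF
  gsutil mb "gs://${bucket}" &&
  gsutil versioning set on "gs://${bucket}" &&
  gsutil lifecycle set "$rules" "gs://${bucket}"
  status=$?
  rm -f "$rules"
  return "$status"
)

# Initialize a Terraform directory against that bucket without the prompt.
# Usage: init_with_state_bucket <terraform-directory> <bucket-name>
init_with_state_bucket() (
  dir="${1:?usage: init_with_state_bucket <terraform-directory> <bucket-name>}"
  bucket="${2:?usage: init_with_state_bucket <terraform-directory> <bucket-name>}"
  # -backend-config passes the backend's "bucket" setting on the command line.
  terraform -chdir="$dir" init -backend-config="bucket=${bucket}"
)
```

A typical sequence would be create_state_bucket "${GCS_BUCKET_NAME}" followed by init_with_state_bucket gcp_resources/gce "${GCS_BUCKET_NAME}". Both functions run in a subshell, so they leave the caller's variables untouched.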
The following resources are managed in this directory:
- Monitoring Dashboard: This is an outlier dashboard that displays statistics and outlier mode for TPU metrics.
- Debugging Dashboard: This dashboard displays the stack traces collected in Cloud Logging for the process running on TPU VMs.
- Logging Storage: This is a user-defined log bucket to store stack traces. Creating a new log bucket is optional. If you choose not to create a separate log bucket, the stack traces are collected in the _Default log bucket.
Run terraform init && terraform apply inside the gcp_resources/gce directory to deploy all the resources mentioned above for TPU workloads running on GCE. You will be prompted to provide values for some input variables. After you confirm the action, all the resources are deployed automatically in your GCP project.
Run terraform init && terraform apply inside the gcp_resources/gke directory to deploy all the resources mentioned above for TPU workloads running on GKE. You will be prompted to provide values for some input variables. After you confirm the action, all the resources are deployed automatically in your GCP project.
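The two deployments above differ only in the platform directory, so they can be wrapped in one function. This is a sketch under assumptions: deploy_tpu_resources is a hypothetical name, the function must be run from the repository root, and -chdir requires Terraform 0.14 or later.

```shell
# Hypothetical helper (not part of this repository): deploy every resource
# for one platform. Terraform still prompts for input variables and for
# confirmation before applying.
# Usage: deploy_tpu_resources <gce|gke>
deploy_tpu_resources() (
  case "${1:-}" in
    gce|gke) ;;  # the only two platform directories named in this document
    *) echo "usage: deploy_tpu_resources <gce|gke>" >&2; return 2 ;;
  esac
  terraform -chdir="gcp_resources/$1" init &&
  terraform -chdir="gcp_resources/$1" apply
)
```

Rejecting unknown platform names up front avoids running Terraform against a directory that does not exist.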
NOTE: See the guide below for more details about GCE-specific and GKE-specific resources and prerequisites.
Follow the guide below to deploy the resources individually:
Run terraform init && terraform apply inside gcp_resources/gce/resources/dashboard/monitoring_dashboard/ to deploy only the monitoring dashboard for GCE in your GCP project.
If the node_prefix parameter is not specified in the input variable var.monitoring_dashboard_config, or is set to an empty string, the metrics on the dashboard plot data points for all TPU VMs in your GCP project.
For instance, if you provide {"node_prefix": "test"} as the value for the input variable var.monitoring_dashboard_config, the metrics on the monitoring dashboard show data points only for TPU VMs whose node names start with test. Refer to this doc for more information on node prefix for TPUs in multislice.
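The node prefix can also be supplied on the command line instead of at the interactive prompt. The sketch below uses Terraform's standard -var option; the function name is hypothetical, and it assumes that var.monitoring_dashboard_config accepts the same object value through -var as it does at the prompt. Terraform still prompts for any other required variables.

```shell
# Hypothetical helper (not part of this repository): deploy the GCE
# monitoring dashboard, optionally restricted to a node-name prefix.
# Usage: deploy_gce_monitoring_dashboard [node-prefix]
deploy_gce_monitoring_dashboard() (
  prefix="${1:-}"  # empty string: plot data points for all TPU VMs
  dir="gcp_resources/gce/resources/dashboard/monitoring_dashboard"
  terraform -chdir="$dir" init &&
  terraform -chdir="$dir" apply \
    -var="monitoring_dashboard_config={\"node_prefix\": \"${prefix}\"}"
)
```

Calling deploy_gce_monitoring_dashboard test limits the dashboard to TPU VMs whose node names start with test. The prefix is inserted into the value unescaped, so it should not contain double quotes.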
Run terraform init && terraform apply inside gcp_resources/gke/resources/dashboard/monitoring_dashboard/ to deploy only the monitoring dashboard for GKE in your GCP project.
Run terraform init && terraform apply inside gcp_resources/gce/resources/dashboard/logging_dashboard/ to deploy only the debugging dashboard for GCE in your GCP project.
Run terraform init && terraform apply inside gcp_resources/gke/resources/dashboard/logging_dashboard/ to deploy only the debugging dashboard for GKE in your GCP project.
To view traces in the debugging dashboard, users need to add a sidecar container to their TPU workload running on GKE. The name of the sidecar container must match the regex [a-z-0-9]*stacktrace[a-z-0-9]*. Here is an example of the sidecar container that should be added:
containers:
  - name: stacktrace-log-collector
    image: busybox:1.28
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
    args: [/bin/sh, -c, "while [ ! -d /tmp/debugging ]; do sleep 60; done; while [ ! -e /tmp/debugging/* ]; do sleep 60; done; tail -n+1 -f /tmp/debugging/*"]
    volumeMounts:
    - name: tpu-debug-logs
      readOnly: true
      mountPath: /tmp/debugging
  - name: <main_container>
    .....
    .....
volumes:
- name: tpu-debug-logs
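Before deploying, a candidate sidecar name can be checked against the naming rule in plain shell. This is a conservative sketch, not the dashboard's own matching logic: it reads the character class as lowercase letters, digits, and hyphens, and assumes the whole name must match. The function name is hypothetical.

```shell
# Hypothetical helper (not part of this repository): succeed only if a
# container name contains "stacktrace" and uses nothing but lowercase
# letters, digits, and hyphens.
# Usage: is_stacktrace_sidecar_name <container-name>
is_stacktrace_sidecar_name() (
  LC_ALL=C  # make the a-z range ASCII-only, whatever the caller's locale
  case "${1:-}" in
    *[!a-z0-9-]*) return 1 ;;  # a character outside a-z, 0-9, '-'
    *stacktrace*) return 0 ;;
    *) return 1 ;;             # required substring is missing
  esac
)
```

With this check, stacktrace-log-collector from the example above is accepted, while log-collector (no stacktrace substring) and Stacktrace_Logs (uppercase letter and underscore) are rejected.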
Run terraform init && terraform apply inside gcp_resources/gce/resources/log_storage/ to deploy a separate log bucket to store stack traces for GCE. You will be prompted to provide the name of your GCP project and the bucket configuration. You can also set the retention period for the bucket.
Run terraform init && terraform apply inside gcp_resources/gke/resources/log_storage/ to deploy a separate log bucket to store stack traces for GKE. You will be prompted to provide the name of your GCP project and the bucket configuration. You can also set the retention period for the bucket. Make sure the sidecar container is running in your GKE cluster, as described in the Debugging Dashboard section for GKE.
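The individual deployments above follow a regular directory layout, which the sketch below captures as a lookup function. The function name and the short resource keys are hypothetical; the paths are the ones listed in this document, relative to the repository root.

```shell
# Hypothetical helper (not part of this repository): print the Terraform
# directory for one individually deployable resource.
# Usage: tpu_resource_dir <gce|gke> <monitoring_dashboard|logging_dashboard|log_storage>
tpu_resource_dir() {
  case "${1:-}" in
    gce|gke) ;;
    *) echo "unknown platform: ${1:-}" >&2; return 2 ;;
  esac
  case "${2:-}" in
    monitoring_dashboard|logging_dashboard)
      # Both dashboards live under resources/dashboard/.
      echo "gcp_resources/$1/resources/dashboard/$2" ;;
    log_storage)
      echo "gcp_resources/$1/resources/log_storage" ;;
    *) echo "unknown resource: ${2:-}" >&2; return 2 ;;
  esac
}
```

For example, tpu_resource_dir gke log_storage prints gcp_resources/gke/resources/log_storage, which is the directory to change into before running terraform init && terraform apply.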