fix: NIM Pattern Enhancement 20240716 (#588)
hustshawn authored Jul 16, 2024
1 parent c9b2208 commit 3cfd28c
Showing 3 changed files with 47 additions and 16 deletions.
3 changes: 2 additions & 1 deletion ai-ml/nvidia-triton-server/nvidia-nim.tf
@@ -124,6 +124,7 @@ resource "helm_release" "nim_llm" {
]

depends_on = [
- null_resource.download_nim_deploy
+ null_resource.download_nim_deploy,
+ module.eks_blueprints_addons.ingress_nginx
]
}
5 changes: 4 additions & 1 deletion gen-ai/inference/nvidia-nim/nim-client/client.py
@@ -55,7 +55,10 @@ async def main(FLAGS):
"top_k": 20,
"max_tokens": 512,
}
- client = openai.AsyncOpenAI(base_url=FLAGS.url)
+ client = openai.AsyncOpenAI(
+     base_url=FLAGS.url,
+     api_key="not_used_for_self_host", # To avoid report OPENAI_API_KEY missing
+ )
with open(FLAGS.input_prompts, "r") as file:
print(f"Loading inputs from `{FLAGS.input_prompts}`...")
prompts = file.readlines()
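
The change above passes a placeholder `api_key` because the OpenAI Python client refuses to start when neither the argument nor the `OPENAI_API_KEY` environment variable is set, even if the self-hosted endpoint never checks the key. The sketch below illustrates the same idea with only the standard library instead of the `openai` package. It assumes the endpoint is OpenAI-compatible and serves a `completions` route under the base URL; the helper name `build_completion_request`, the base URL, and the prompt are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, api_key="not_used_for_self_host"):
    """Build (but do not send) a POST request for an OpenAI-style completions route."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 512}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder bearer token; assumed to be ignored by a self-hosted server.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical base URL, e.g. after port-forwarding the service to localhost.
    req = build_completion_request("http://localhost:8000/v1", "meta/llama3-8b-instruct", "Hello")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req).read()
```

Building the request separately from sending it keeps the example checkable without a running server.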
55 changes: 41 additions & 14 deletions website/docs/gen-ai/inference/nvidia-nim-llama3.md
@@ -30,6 +30,8 @@ NIMs are packaged as container images on a per model/model family basis. Each NI

![NIM Architecture](img/nim-architecture.png)

+ Source: https://docs.nvidia.com/nim/large-language-models/latest/introduction.html#architecture
+
## Overview of this deployment pattern on Amazon EKS

This pattern combines the capabilities of NVIDIA NIM, Amazon Elastic Kubernetes Service (EKS), and various AWS services to deliver a high-performance and cost-optimized model serving infrastructure.
@@ -52,6 +54,9 @@ By combining these components, our proposed solution delivers a powerful and cos

Before getting started with NVIDIA NIM, ensure you have the following:

+ <details>
+ <summary>Click to expand the NVIDIA NIM account setup details</summary>
+
**NVIDIA AI Enterprise Account**

- Register for an NVIDIA AI Enterprise account. If you don't have one, you can sign up for a trial account using this [link](https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=NVAIEnterprise).
@@ -87,6 +92,7 @@ echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-s
docker pull nvcr.io/nim/meta/llama3-8b-instruct:latest
```
You do not have to wait for the pull to complete; the goal is only to confirm that the API key is valid for pulling the image.
+ </details>

The following are required to run this tutorial
- An active AWS account with admin equivalent permissions
@@ -319,11 +325,14 @@ kubectl apply -f genaiperf-deploy.yaml
```

Once the pod is ready with a running status of `1/1`, you can exec into the pod.

```bash
export POD_NAME=$(kubectl get po -l app=tritonserver -ojsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD_NAME -- bash
```

Run the test against the deployed NIM Llama3 model:

```bash
genai-perf \
-m meta/llama3-8b-instruct \
@@ -342,6 +351,7 @@ genai-perf \
--profile-export-file my_profile_export.json \
--url nim-llm.nim:8000
```

You should see output similar to the following:

```bash
@@ -362,20 +372,19 @@ You should be able to see the [metrics](https://docs.nvidia.com/deeplearning/tri

To understand the command line options, please refer to [this documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html#command-line-options).


## Observability

As part of this blueprint, we have also deployed the Kube Prometheus stack, which provides Prometheus server and Grafana deployments for monitoring and observability.

First, let's verify the services deployed by the Kube Prometheus stack:

```bash
- kubectl get svc -n kube-prometheus-stack
+ kubectl get svc -n monitoring
```

You should see output similar to this:

```text
- kubectl get svc -n kube-prometheus-stack
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-prometheus-stack-grafana ClusterIP 172.20.225.77 <none> 80/TCP 10m
kube-prometheus-stack-kube-state-metrics ClusterIP 172.20.237.248 <none> 8080/TCP 10m
@@ -394,7 +403,9 @@ kubectl port-forward -n nim svc/nim-llm 8000
curl localhost:8000/metrics # run this in another terminal
```
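
The `/metrics` endpoint queried above is expected to return the Prometheus text exposition format. The sketch below shows one way to read such output in Python, assuming the `curl` response has been captured as a string. It is a simplified parser that skips comment lines and ignores optional timestamps; the metric names in `SAMPLE` are made up for illustration and are not real NIM metric names.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into a {series: value} dict.

    Simplified: comment lines are skipped and trailing timestamps are ignored.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the closing brace is the key.
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        samples[series] = float(fields[0])
    return samples


# Hypothetical sample; real metric names come from the running service.
SAMPLE = """\
# HELP example_requests_total Hypothetical counter for illustration only.
# TYPE example_requests_total counter
example_requests_total{model="meta/llama3-8b-instruct"} 42
example_cache_usage_ratio 0.25
"""

if __name__ == "__main__":
    for series, value in parse_prometheus_text(SAMPLE).items():
        print(series, value)
```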

- We also provided a pre-configured Grafana dashboard. In the Grafana dashboard below, it contains several important metrics:
+ ### Grafana Dashboard
+
+ We provide a pre-configured Grafana dashboard to better visualize NIM status. The dashboard shown below contains several important metrics:

- **Time to First Token (TTFT)**: The latency between the initial inference request to the model and the return of the first token.
- **Inter-Token Latency (ITL)**: The latency between each token after the first.
@@ -404,34 +415,50 @@ You can find more metrics description from this [document](https://docs.nvidia.c

![NVIDIA LLM Server](img/nim-dashboard.png)
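
To make the TTFT and ITL definitions above concrete, here is a small sketch that derives both from per-token arrival timestamps. It assumes you have recorded the request send time and the arrival time of each streamed token yourself; the function name `latency_metrics` is hypothetical, and reporting ITL as the mean gap between consecutive tokens is one reasonable reading of the definition, not necessarily how the dashboard aggregates it.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft, mean_itl) in the same time unit as the inputs.

    request_time: when the inference request was sent.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # Time to First Token: delay from the request to the first token.
    ttft = token_times[0] - request_time
    # Inter-Token Latency: gaps between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


if __name__ == "__main__":
    ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
    print(f"TTFT={ttft}s, mean ITL={itl}s")
```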

- You can visualize these metrics using the Grafana. To view the Grafana dashboard to monitor these metrics, follow the steps below:
+ To view the Grafana dashboard to monitor these metrics, follow the steps below:

- ```bash
- - Port-forward Grafana service:
- kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
+ <details>
+ <summary>Click to expand details</summary>

+ **1. Retrieve the Grafana password.**
+
- - Grafana Admin user
- admin
+ The password is saved in AWS Secrets Manager. The Terraform command below shows the secret name.

- - Get secret name from Terraform output
+ ```bash
terraform output grafana_secret_name
+ ```

- - Get admin user password
+ Then use the secret name from the output to run the command below:
+
+ ```bash
aws secretsmanager get-secret-value --secret-id <grafana_secret_name_output> --region $AWS_REGION --query "SecretString" --output text
```

- **Login to Grafana:**
+ **2. Expose the Grafana Service**
+
+ Use port-forward to expose the Grafana service.
+
+ ```bash
+ kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
+ ```
+
+ **3. Login to Grafana:**

- Open your web browser and navigate to [http://localhost:3000](http://localhost:3000).
- Login with the username `admin` and the password retrieved from AWS Secrets Manager.

- **Open the NIM Monitoring Dashboard:**
+ **4. Open the NIM Monitoring Dashboard:**

- Once logged in, click "Dashboards" on the left sidebar and search "nim"
- Find the `NVIDIA NIM Monitoring` dashboard in the list.
- Click it to open the dashboard.

You should now see the metrics displayed on the Grafana dashboard, allowing you to monitor the performance of your NVIDIA NIM service deployment.
+ </details>
+
+ :::info
+ At the time of writing, NVIDIA also provides an example Grafana dashboard. You can find it [here](https://docs.nvidia.com/nim/large-language-models/latest/observability.html#grafana).
+ :::

## Cleanup
