feat: Llama2 Chat Inference with Inf2 instances #356
Conversation
:::danger
Note: Use of this Llama-2 model is governed by the Meta license.
In order to download the model weights and tokenizer, please visit the [website](https://ai.meta.com/) and accept our License before requesting access here.
"accept our License before requesting access here." -> accept the license before requesting access.
**Scalability and Availability**
One of the key challenges in deploying large language models (`LLMs`) like Llama-2 is the scalability and availability of suitable hardware. Traditional `GPU` instances often face scarcity due to high demand, making it challenging to provision and scale resources effectively.
In contrast, AWS Neuron instances, such as `trn1.32xlarge`, `trn1n.32xlarge`, `inf2.24xlarge` and `inf2.48xlarge`, are tailor-made for LLM workloads. They offer both scalability and availability, ensuring that you can deploy and scale your `Llama-2` models as needed, without resource bottlenecks or delays.
"tailor-made for LLM workloads" -> purpose built for high-performance deep learning (DL) training and inference of generative AI models, including LLMs.
# Llama-2-Chat on EKS: Deploying Llama-2-13b Chat Model with Ray Serve and Gradio
Welcome to the comprehensive guide on deploying the [Meta Llama-2-13b chat](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).
In this tutorial, you will not only learn how to harness the power of Llama-2, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [AWS Neuron](https://aws.amazon.com/machine-learning/neuron/) instances, such as `inf2.24xlarge` and `inf2.48xlarge`, which are optimized for deploying and scaling large language models.
AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
### **Which Llama-2 model size should I use?**
The best Llama-2 model size for you will depend on your specific needs, and it may not always be the largest model for achieving the highest performance. It's advisable to weigh factors such as computational resources, response time, and cost-efficiency when selecting the appropriate Llama-2 model size. The decision should be based on a comprehensive assessment of your application's goals and constraints.

## Inference on AWS Neuron Instances: Unlocking the Full Potential of Llama-2
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
**Llama-2** can be deployed on a variety of hardware platforms, each with its own set of advantages. However, when it comes to maximizing the efficiency, scalability, and cost-effectiveness of Llama-2, [AWS Neuron instances](https://aws.amazon.com/ec2/instance-types/inf2/) shine as the optimal choice.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
**Scalability and Availability**
One of the key challenges in deploying large language models (`LLMs`) like Llama-2 is the scalability and availability of suitable hardware. Traditional `GPU` instances often face scarcity due to high demand, making it challenging to provision and scale resources effectively.
In contrast, AWS Neuron instances, such as `trn1.32xlarge`, `trn1n.32xlarge`, `inf2.24xlarge` and `inf2.48xlarge`, are tailor-made for LLM workloads. They offer both scalability and availability, ensuring that you can deploy and scale your `Llama-2` models as needed, without resource bottlenecks or delays.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
**Cost Optimization:**
Running LLMs on traditional GPU instances can be cost-prohibitive, especially given the scarcity of GPUs and their competitive pricing.
AWS Neuron instances provide a cost-effective alternative. By offering dedicated hardware optimized for AI and machine learning tasks, Neuron instances allow you to achieve top-notch performance at a fraction of the cost.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
In conclusion, you will have successfully deployed the **Llama-2-13b chat** model on EKS with Ray Serve and created a ChatGPT-style chat web UI using Gradio.
This opens up exciting possibilities for natural language processing and chatbot development.

In summary, when it comes to deploying and scaling Llama-2, AWS Neuron instances offer a compelling advantage.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
LGTM!
What does this PR do?
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.
This pull request (PR) introduces several enhancements to the blueprint, focusing on deploying and managing the Llama2 13B chat model on Inf2 instances. Here's a breakdown of what this PR does:
New Managed Node Groups for Inf2:
Adds new managed node groups tailored for Inf2 instances. These node groups are optimized for running inference workloads.
New Karpenter Provisioners:
Introduces new Karpenter provisioners specifically designed for both Trainium and Inf2 instances. This ensures efficient resource allocation and management.
JupyterHub Addon:
Implements the JupyterHub addon for this blueprint, enabling users to run Jupyter notebooks seamlessly within the environment.
NGINX and ALB Controllers:
Integrates NGINX and ALB (Application Load Balancer) controllers. These controllers enhance the blueprint's capabilities for routing and load balancing.
JupyterHub Notebook Example:
Includes a JupyterHub notebook example demonstrating how to run the Llama2 13B chat model on inf2.24xlarge instances. This serves as a practical guide for users.
Llama2 13B Chat Model Deployment:
Deploys the Llama2 13B chat model on inf2.48xlarge instances using the Ray Serve framework. This setup is optimized for high-performance model serving; a hedged sketch of such a deployment follows this list.
Gradio Web UI:
Adds a Gradio Web UI for the Llama2 chat model, so users can interact with the model through a friendly web interface (see the companion sketch after the summary below).
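The Ray Serve script itself isn't shown in this conversation view, so here is a minimal, hypothetical sketch of what such a deployment could look like, assuming the checkpoint is loaded with `transformers-neuronx` and sharded across NeuronCores via tensor parallelism. The model ID, `tp_degree`, NeuronCore resource count, and sampling parameters below are illustrative assumptions, not the blueprint's actual values.

```python
# Hypothetical sketch -- names and parameters are illustrative, not the
# blueprint's actual Ray Serve script.
import torch
from ray import serve
from starlette.requests import Request
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling


@serve.deployment(
    # Assumes the Ray cluster exposes NeuronCores as a custom resource;
    # an inf2.48xlarge exposes 24 NeuronCores (12 Inferentia2 chips).
    ray_actor_options={"resources": {"neuron_cores": 24}},
)
class Llama2ChatModel:
    def __init__(self):
        model_id = "NousResearch/Llama-2-13b-chat-hf"  # assumed weights location
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Shard the model across 24 NeuronCores with tensor parallelism,
        # then compile it for Inferentia2.
        self.model = LlamaForSampling.from_pretrained(
            model_id, batch_size=1, tp_degree=24, amp="f16"
        )
        self.model.to_neuron()

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        with torch.inference_mode():
            generated = self.model.sample(input_ids, sequence_length=2048, top_k=50)
        return self.tokenizer.decode(generated[0], skip_special_tokens=True)


# Deploy the application; Ray Serve listens for HTTP traffic on port 8000 by default.
serve.run(Llama2ChatModel.bind(), name="llama2")
```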
These enhancements collectively improve the functionality and usability of the blueprint, making it easier for users to deploy and utilize the Llama2 chat model on Inf2 instances for various inference tasks.
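To round out the picture, the Gradio front end described above could be as small as the following hypothetical sketch; the endpoint URL and port are placeholder assumptions, not the blueprint's actual service addresses.

```python
# Hypothetical sketch -- the Ray Serve endpoint URL and port are placeholders.
import gradio as gr
import requests

SERVE_URL = "http://localhost:8000/llama2"  # assumed Ray Serve HTTP endpoint


def chat(message, history):
    # Forward the user's prompt to the Ray Serve deployment and return its reply.
    resp = requests.post(SERVE_URL, json={"prompt": message}, timeout=300)
    resp.raise_for_status()
    return resp.text


# ChatInterface renders a ChatGPT-style conversation UI around the handler.
gr.ChatInterface(chat, title="Llama-2-13b Chat on Inf2").launch(
    server_name="0.0.0.0", server_port=7860
)
```

In this setup, the NGINX or ALB ingress added by the PR would be what routes external traffic to a UI like this.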
Motivation
More

- `website/docs` or `website/blog` section for this feature
- `pre-commit run -a` with this PR. Link for installing pre-commit locally

For Moderators

Additional Notes