feat: Llama2 Chat Inference with Inf2 instances #356
Conversation
:::danger
Note: Use of this Llama-2 model is governed by the Meta license.
In order to download the model weights and tokenizer, please visit the [website](https://ai.meta.com/) and accept our License before requesting access here.
"accept our License before requesting access here." -> accept the license before requesting access.
**Scalability and Availability**
One of the key challenges in deploying large language models (`LLMs`) like Llama-2 is the scalability and availability of suitable hardware. Traditional `GPU` instances often face scarcity due to high demand, making it challenging to provision and scale resources effectively.
In contrast, AWS Neuron instances, such as `trn1.32xlarge`, `trn1n.32xlarge`, `inf2.24xlarge` and `inf2.48xlarge`, are tailor-made for LLM workloads. They offer both scalability and availability, ensuring that you can deploy and scale your `Llama-2` models as needed, without resource bottlenecks or delays.
"tailor-made for LLM workloads" -> purpose built for high-performance deep learning (DL) training and inference of generative AI models, including LLMs.
# Llama-2-Chat on EKS: Deploying Llama-2-13b Chat Model with Ray Serve and Gradio
Welcome to the comprehensive guide on deploying the [Meta Llama-2-13b chat](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).
In this tutorial, you will not only learn how to harness the power of Llama-2, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [AWS Neuron](https://aws.amazon.com/machine-learning/neuron/) instances, such as `inf2.24xlarge` and `inf2.48xlarge`, which are optimized for deploying and scaling large language models.
AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
### **Which Llama-2 model size should I use?**
The best Llama-2 model size for you will depend on your specific needs, and it may not always be the largest model for achieving the highest performance. It's advisable to weigh factors such as computational resources, response time, and cost-efficiency when selecting the appropriate Llama-2 model size. The decision should be based on a comprehensive assessment of your application's goals and constraints.

## Inference on AWS Neuron Instances: Unlocking the Full Potential of Llama-2
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
**Llama-2** can be deployed on a variety of hardware platforms, each with its own set of advantages. However, when it comes to maximizing the efficiency, scalability, and cost-effectiveness of Llama-2, [AWS Neuron instances](https://aws.amazon.com/ec2/instance-types/inf2/) shine as the optimal choice.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
**Scalability and Availability**
One of the key challenges in deploying large language models (`LLMs`) like Llama-2 is the scalability and availability of suitable hardware. Traditional `GPU` instances often face scarcity due to high demand, making it challenging to provision and scale resources effectively.
In contrast, AWS Neuron instances, such as `trn1.32xlarge`, `trn1n.32xlarge`, `inf2.24xlarge` and `inf2.48xlarge`, are tailor-made for LLM workloads. They offer both scalability and availability, ensuring that you can deploy and scale your `Llama-2` models as needed, without resource bottlenecks or delays.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
**Cost Optimization:**
Running LLMs on traditional GPU instances can be cost-prohibitive, especially given the scarcity of GPUs and their competitive pricing.
AWS Neuron instances provide a cost-effective alternative. By offering dedicated hardware optimized for AI and machine learning tasks, Neuron instances allow you to achieve top-notch performance at a fraction of the cost.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
In conclusion, you will have successfully deployed the **Llama-2-13b chat** model on EKS with Ray Serve and created a ChatGPT-style chat web UI using Gradio.
This opens up exciting possibilities for natural language processing and chatbot development.

In summary, when it comes to deploying and scaling Llama-2, AWS Neuron instances offer a compelling advantage.
as above - AWS Neuron is the SDK. We prefer to call out trn1/inf2 instances (powered by AWS Trainium and Inferentia) rather than brand the instances as 'Neuron'.
LGTM!
What does this PR do?
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.
This pull request (PR) introduces several enhancements to the blueprint, focusing on deploying and managing the Llama2 13B chat model on Inf2 instances. Here's a breakdown of what this PR does:
New Managed Node Groups for Inf2:
Adds new managed node groups tailored for Inf2 instances. These node groups are optimized for running inference workloads.
New Karpenter Provisioners:
Introduces new Karpenter provisioners specifically designed for both Trainium and Inf2 instances. This ensures efficient resource allocation and management.
JupyterHub Addon:
Implements the JupyterHub addon for this blueprint, enabling users to run Jupyter notebooks seamlessly within the environment.
NGINX and ALB Controllers:
Integrates NGINX and ALB (Application Load Balancer) controllers. These controllers enhance the blueprint's capabilities for routing and load balancing.
JupyterHub Notebook Example:
Includes a JupyterHub notebook example demonstrating how to run the Llama2 13B chat model on inf2.24xlarge instances. This serves as a practical guide for users.
Llama2 13B Chat Model Deployment:
Deploys the Llama2 13B chat model on inf2.48xlarge instances using the Ray Serve framework. This setup is optimized for high-performance model serving; a hedged sketch of such a deployment follows this list.
Gradio Web UI:
Adds a Gradio Web UI for the Llama2 chat model, so users can interact with the model through a friendly web interface (see the companion sketch after the summary below).
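The Ray Serve script itself isn't shown in this conversation view, so here is a minimal, hypothetical sketch of what such a deployment could look like, assuming the checkpoint is loaded with `transformers-neuronx` and sharded across NeuronCores via tensor parallelism. The model ID, `tp_degree`, NeuronCore resource count, and sampling parameters below are illustrative assumptions, not the blueprint's actual values.

```python
# Hypothetical sketch -- names and parameters are illustrative, not the
# blueprint's actual Ray Serve script.
import torch
from ray import serve
from starlette.requests import Request
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling


@serve.deployment(
    # Assumes the Ray cluster exposes NeuronCores as a custom resource;
    # an inf2.48xlarge exposes 24 NeuronCores (12 Inferentia2 chips).
    ray_actor_options={"resources": {"neuron_cores": 24}},
)
class Llama2ChatModel:
    def __init__(self):
        model_id = "NousResearch/Llama-2-13b-chat-hf"  # assumed weights location
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Shard the model across 24 NeuronCores with tensor parallelism,
        # then compile it for Inferentia2.
        self.model = LlamaForSampling.from_pretrained(
            model_id, batch_size=1, tp_degree=24, amp="f16"
        )
        self.model.to_neuron()

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        with torch.inference_mode():
            generated = self.model.sample(input_ids, sequence_length=2048, top_k=50)
        return self.tokenizer.decode(generated[0], skip_special_tokens=True)


# Deploy the application; Ray Serve listens for HTTP traffic on port 8000 by default.
serve.run(Llama2ChatModel.bind(), name="llama2")
```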
These enhancements collectively improve the functionality and usability of the blueprint, making it easier for users to deploy and utilize the Llama2 chat model on Inf2 instances for various inference tasks.
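To round out the picture, the Gradio front end described above could be as small as the following hypothetical sketch; the endpoint URL and port are placeholder assumptions, not the blueprint's actual service addresses.

```python
# Hypothetical sketch -- the Ray Serve endpoint URL and port are placeholders.
import gradio as gr
import requests

SERVE_URL = "http://localhost:8000/llama2"  # assumed Ray Serve HTTP endpoint


def chat(message, history):
    # Forward the user's prompt to the Ray Serve deployment and return its reply.
    resp = requests.post(SERVE_URL, json={"prompt": message}, timeout=300)
    resp.raise_for_status()
    return resp.text


# ChatInterface renders a ChatGPT-style conversation UI around the handler.
gr.ChatInterface(chat, title="Llama-2-13b Chat on Inf2").launch(
    server_name="0.0.0.0", server_port=7860
)
```

In this setup, the NGINX or ALB ingress added by the PR would be what routes external traffic to a UI like this.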
Motivation
More

- `website/docs` or `website/blog` section for this feature
- `pre-commit run -a` with this PR. Link for installing pre-commit locally

For Moderators

Additional Notes