Ray serve with Llama.cpp for CPU inference on Graviton #706

ddynwzh1992 · 2024-11-19T04:20:14Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.

What is the outcome that you are trying to reach?

Run CPU inference on EKS with Graviton instances, using Ray Serve with llama.cpp to serve the model.

Describe the solution you would like

Describe alternatives you have considered

Additional context

ddynwzh1992 · 2024-12-04T03:05:05Z

This blueprint will include a Ray deployment of llama.cpp for model inference, a script for quantizing the model and rearranging the model weights, and a benchmark script.
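To make the Ray deployment piece concrete, here is a minimal sketch of what a Ray Serve wrapper around llama.cpp might look like. It assumes the `llama-cpp-python` bindings and a GGUF model file already present on the node; the names `parse_request`, `build_app`, and `LlamaCppDeployment`, the request shape (`prompt`, `max_tokens`), and the default thread count are all hypothetical and not taken from the blueprint.

```python
from typing import Any, Dict


def parse_request(body: Dict[str, Any], default_max_tokens: int = 128) -> Dict[str, Any]:
    """Validate a JSON request body and return completion arguments."""
    prompt = body.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("request body needs a non-empty 'prompt' string")
    max_tokens = int(body.get("max_tokens", default_max_tokens))
    return {"prompt": prompt, "max_tokens": max_tokens}


def build_app(model_path: str, n_threads: int = 4):
    """Build a Ray Serve application that loads one llama.cpp model per replica."""
    # Imported lazily so this module still loads where Ray or
    # llama-cpp-python is not installed (for example, in unit tests).
    from llama_cpp import Llama
    from ray import serve

    # Reserve as many CPUs per replica as llama.cpp will use for threads.
    @serve.deployment(ray_actor_options={"num_cpus": n_threads})
    class LlamaCppDeployment:
        def __init__(self) -> None:
            # Each replica holds its own copy of the model in memory.
            self.llm = Llama(model_path=model_path, n_threads=n_threads)

        async def __call__(self, request) -> Dict[str, str]:
            args = parse_request(await request.json())
            # This call blocks the replica until generation finishes,
            # which is acceptable for a sketch but limits concurrency.
            result = self.llm.create_completion(
                args["prompt"], max_tokens=args["max_tokens"]
            )
            return {"text": result["choices"][0]["text"]}

    return LlamaCppDeployment.bind()
```

With Ray running, something like `serve.run(build_app("/models/model.gguf"))` would start the application (the path is a placeholder). Matching `num_cpus` to `n_threads` is one way to keep replicas from oversubscribing Graviton cores; the right values depend on the instance size and the model.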

ddynwzh1992 · 2024-12-17T00:49:01Z

Ray on EKS will help with scalable, distributed inference jobs, and Graviton will help with cost efficiency.
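One way the benchmark script could tie throughput to cost efficiency is sketched below. This is only an illustration: `benchmark` and `cost_per_million_tokens` are hypothetical helpers, the `generate` callable stands in for whatever client calls the deployed model, and the hourly price is a parameter you would supply yourself rather than a real Graviton price.

```python
import time
from typing import Callable, Dict, List


def benchmark(generate: Callable[[str], int], prompts: List[str]) -> Dict[str, float]:
    """Time `generate` over all prompts.

    `generate` must return the number of tokens it produced for a prompt.
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    rate = total_tokens / elapsed if elapsed > 0 else 0.0
    return {
        "tokens": float(total_tokens),
        "seconds": elapsed,
        "tokens_per_second": rate,
    }


def cost_per_million_tokens(tokens_per_second: float, hourly_price_usd: float) -> float:
    """Convert sustained throughput and an instance price into USD per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000
```

Running the same prompts against different instance types and feeding each measured `tokens_per_second` into `cost_per_million_tokens` with that instance's price gives a like-for-like comparison, assuming the node is kept busy for the whole hour.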
