In addition to the OpenAI API, AgentScope also supports open-source models served behind a post-request API. This document introduces how to quickly set up local model API serving with different inference engines.
Flask is a lightweight web application framework. It is easy to build a local model API serving with Flask.

Here we provide two Flask examples, using the Transformers and ModelScope libraries respectively. You can build your own model API serving with a few modifications.
Install Flask and Transformers with the following command:

```bash
pip install flask torch transformers accelerate
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, set up the model API serving by running the following command:
```bash
python flask_transformers/setup_hf_service.py \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --device "cuda:0" \
    --port 8000
```
You can replace `meta-llama/Llama-2-7b-chat-hf` with any model card in the Hugging Face model hub.
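For reference, the sketch below shows what such a Flask service might look like. It is a simplified, hypothetical illustration rather than the actual `./flask_transformers/setup_hf_service.py`, and the request/response field names (`inputs`, `response`) are assumptions; check the script for the fields it really uses.

```python
# Minimal sketch of a Flask service wrapping a Transformers pipeline.
# Hypothetical illustration only; see ./flask_transformers/setup_hf_service.py
# for the actual script used in this example.
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup (assumes a CUDA device is available).
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device="cuda:0",
)

@app.route("/llm/", methods=["POST"])
def generate():
    data = request.get_json()
    prompt = data.pop("inputs")      # assumed payload key, for illustration
    outputs = pipe(prompt, **data)   # forward remaining args (e.g. max_length)
    return jsonify({"response": outputs[0]["generated_text"]})

if __name__ == "__main__":
    app.run(port=8000)
```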
In AgentScope, you can load the model with the following model config (`./flask_transformers/model_config.json`):
```json
{
    "model_type": "post_api",
    "config_name": "flask_llama2-7b-chat",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
```
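With this config registered, using the model from AgentScope might look like the following minimal sketch. It assumes a recent AgentScope version; check the AgentScope documentation for the exact agent and message APIs of your installed version.

```python
# Minimal sketch: use the Flask-served model through AgentScope's post_api wrapper.
# Assumes a recent AgentScope version; check your installed version's API.
import agentscope
from agentscope.agents import DialogAgent
from agentscope.message import Msg

# Register the model config defined above.
agentscope.init(model_configs="./flask_transformers/model_config.json")

# Create an agent that talks to the local Flask service.
agent = DialogAgent(
    name="assistant",
    sys_prompt="You are a helpful assistant.",
    model_config_name="flask_llama2-7b-chat",
)

reply = agent(Msg(name="user", content="Hello!", role="user"))
print(reply.content)
```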
In this model serving, the messages from post requests should be in STRING format. You can use the chat templates provided by transformers with a little modification in `./flask_transformers/setup_hf_service.py`.
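Because the service expects a plain string, chat messages need to be flattened into a single prompt at some point. One common approach with transformers is the tokenizer's `apply_chat_template` method; the sketch below is illustrative and assumes the model's tokenizer ships a chat template.

```python
# Sketch: convert a list of chat messages into a single string prompt
# using the model's chat template (assumes the tokenizer provides one).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AgentScope?"},
]

# tokenize=False returns the formatted prompt as a string;
# add_generation_prompt=True appends the assistant turn marker.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```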
Install Flask and ModelScope with the following command:

```bash
pip install flask torch modelscope
```
Taking the model `modelscope/Llama-2-7b-ms` and port `8000` as an example, set up the model API serving by running the following command:
```bash
python flask_modelscope/setup_ms_service.py \
    --model_name_or_path modelscope/Llama-2-7b-ms \
    --device "cuda:0" \
    --port 8000
```
You can replace `modelscope/Llama-2-7b-ms` with any model card in the ModelScope model hub.
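Before loading it into AgentScope, you can smoke-test the running service with a plain POST request. The payload keys below (`inputs` plus generation arguments) are assumptions for illustration; check `flask_modelscope/setup_ms_service.py` for the fields the service actually reads.

```python
# Quick smoke test of the local Flask service with a plain POST request.
# The payload keys are illustrative assumptions; check setup_ms_service.py
# for the exact fields the service expects.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/llm/",
    json={
        "inputs": "Hello, who are you?",
        "max_length": 256,
        "temperature": 0.5,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```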
In AgentScope, you can load the model with the following model config (`flask_modelscope/model_config.json`):
```json
{
    "model_type": "post_api",
    "config_name": "flask_llama2-7b-ms",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
```
Similar to the Transformers example, the messages from post requests should be in STRING format.
FastChat is an open platform that provides quick setup for model serving with OpenAI-compatible RESTful APIs.
To install FastChat, run:

```bash
pip install "fschat[model_worker,webui]"
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, run the following command to set up the model API serving:
```bash
bash fastchat_script/fastchat_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000
```
Refer to the supported model list of FastChat.
Now you can load the model in AgentScope with the following model config (`fastchat_script/model_config.json`):
```json
{
    "model_type": "openai",
    "config_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
```
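Since FastChat exposes an OpenAI-compatible API, you can also verify the endpoint directly with the official `openai` Python client (v1.x interface) before using it from AgentScope:

```python
# Verify the FastChat endpoint with the OpenAI Python client (openai>=1.0).
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",                        # FastChat does not check the key
    base_url="http://127.0.0.1:8000/v1/",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    temperature=0.5,
)
print(completion.choices[0].message.content)
```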
vllm is a high-throughput inference and serving engine for LLMs.
To install vllm, run:

```bash
pip install vllm
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, run the following command to set up the model API serving:
```bash
./vllm_script/vllm_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000
```
Please refer to the supported model list of vllm.
Now you can load the model in AgentScope with the following model config (`vllm_script/model_config.json`):
```json
{
    "model_type": "openai",
    "config_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
```
Both Hugging Face and ModelScope provide model inference APIs, which can be used with the AgentScope post_api model wrapper.
Taking `gpt2` in the Hugging Face Inference API as an example, you can use the following model config in AgentScope:
```json
{
    "model_type": "post_api",
    "config_name": "gpt2",
    "headers": {
        "Authorization": "Bearer {YOUR_API_TOKEN}"
    },
    "api_url": "https://api-inference.huggingface.co/models/gpt2"
}
```
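For reference, the post_api wrapper issues HTTP POST requests to the configured `api_url` with the given headers; a roughly equivalent direct call to the Hugging Face Inference API looks like the sketch below (replace `{YOUR_API_TOKEN}` with your own token):

```python
# Direct call to the Hugging Face Inference API for gpt2, roughly what the
# post_api config above amounts to. Replace the token placeholder.
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer {YOUR_API_TOKEN}"}

resp = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "AgentScope is"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```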