In addition to the OpenAI API, AgentScope also supports open-source models served behind a post-request API. This document introduces how to quickly set up local model API serving with different inference engines.
Flask is a lightweight web application framework. It is easy to build a local model API serving with Flask.

Here we provide two Flask examples, using the Transformers and ModelScope libraries respectively. You can build your own model API serving with a few modifications.
Install Flask and Transformers with the following command:

```bash
pip install flask torch transformers accelerate
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, set up the model API serving by running the following command:
```bash
python flask_transformers/setup_hf_service.py \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --device "cuda:0" \
    --port 8000
```
You can replace `meta-llama/Llama-2-7b-chat-hf` with any model card in the Hugging Face model hub.
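For reference, the sketch below shows what such a Flask service might look like. It is a simplified, hypothetical illustration rather than the actual `./flask_transformers/setup_hf_service.py`, and the request/response field names (`inputs`, `response`) are assumptions; check the script for the fields it really uses.

```python
# Minimal sketch of a Flask service wrapping a Transformers pipeline.
# Hypothetical illustration only; see ./flask_transformers/setup_hf_service.py
# for the actual script used in this example.
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup (assumes a CUDA device is available).
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device="cuda:0",
)

@app.route("/llm/", methods=["POST"])
def generate():
    data = request.get_json()
    prompt = data.pop("inputs")      # assumed payload key, for illustration
    outputs = pipe(prompt, **data)   # forward remaining args (e.g. max_length)
    return jsonify({"response": outputs[0]["generated_text"]})

if __name__ == "__main__":
    app.run(port=8000)
```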
In AgentScope, you can load the model with the following model config (`./flask_transformers/model_config.json`):
```json
{
    "model_type": "post_api",
    "config_name": "flask_llama2-7b-chat",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
```
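With this config registered, using the model from AgentScope might look like the following minimal sketch. It assumes a recent AgentScope version; check the AgentScope documentation for the exact agent and message APIs of your installed version.

```python
# Minimal sketch: use the Flask-served model through AgentScope's post_api wrapper.
# Assumes a recent AgentScope version; check your installed version's API.
import agentscope
from agentscope.agents import DialogAgent
from agentscope.message import Msg

# Register the model config defined above.
agentscope.init(model_configs="./flask_transformers/model_config.json")

# Create an agent that talks to the local Flask service.
agent = DialogAgent(
    name="assistant",
    sys_prompt="You are a helpful assistant.",
    model_config_name="flask_llama2-7b-chat",
)

reply = agent(Msg(name="user", content="Hello!", role="user"))
print(reply.content)
```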
In this model serving, the messages from post requests should be in STRING format. You can use the chat templates provided by transformers with a little modification in `./flask_transformers/setup_hf_service.py`.
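Because the service expects a plain string, chat messages need to be flattened into a single prompt at some point. One common approach with transformers is the tokenizer's `apply_chat_template` method; the sketch below is illustrative and assumes the model's tokenizer ships a chat template.

```python
# Sketch: convert a list of chat messages into a single string prompt
# using the model's chat template (assumes the tokenizer provides one).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AgentScope?"},
]

# tokenize=False returns the formatted prompt as a string;
# add_generation_prompt=True appends the assistant turn marker.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```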
Install Flask and ModelScope with the following command:

```bash
pip install flask torch modelscope
```
Taking the model `modelscope/Llama-2-7b-ms` and port `8000` as an example, set up the model API serving by running the following command:
```bash
python flask_modelscope/setup_ms_service.py \
    --model_name_or_path modelscope/Llama-2-7b-ms \
    --device "cuda:0" \
    --port 8000
```
You can replace `modelscope/Llama-2-7b-ms` with any model card in the ModelScope model hub.
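Before loading it into AgentScope, you can smoke-test the running service with a plain POST request. The payload keys below (`inputs` plus generation arguments) are assumptions for illustration; check `flask_modelscope/setup_ms_service.py` for the fields the service actually reads.

```python
# Quick smoke test of the local Flask service with a plain POST request.
# The payload keys are illustrative assumptions; check setup_ms_service.py
# for the exact fields the service expects.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/llm/",
    json={
        "inputs": "Hello, who are you?",
        "max_length": 256,
        "temperature": 0.5,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```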
In AgentScope, you can load the model with the following model config (`flask_modelscope/model_config.json`):
```json
{
    "model_type": "post_api",
    "config_name": "flask_llama2-7b-ms",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
```
Similar to the Transformers example, the messages from post requests should be in STRING format.
FastChat is an open platform that provides quick setup for model serving with OpenAI-compatible RESTful APIs.
To install FastChat, run:

```bash
pip install "fschat[model_worker,webui]"
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, run the following command to set up the model API serving:
```bash
bash fastchat_script/fastchat_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000
```
Refer to the supported model list of FastChat.
Now you can load the model in AgentScope with the following model config (`fastchat_script/model_config.json`):
```json
{
    "model_type": "openai",
    "config_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
```
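Since FastChat exposes an OpenAI-compatible API, you can also verify the endpoint directly with the official `openai` Python client (v1.x interface) before using it from AgentScope:

```python
# Verify the FastChat endpoint with the OpenAI Python client (openai>=1.0).
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",                        # FastChat does not check the key
    base_url="http://127.0.0.1:8000/v1/",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    temperature=0.5,
)
print(completion.choices[0].message.content)
```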
vllm is a high-throughput inference and serving engine for LLMs.
To install vllm, run:

```bash
pip install vllm
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, run the following command to set up the model API serving:
```bash
./vllm_script/vllm_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000
```
Please refer to the supported model list of vllm.
Now you can load the model in AgentScope with the following model config (`vllm_script/model_config.json`):
```json
{
    "model_type": "openai",
    "config_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
```
Both Hugging Face and ModelScope provide model inference APIs, which can be used with the AgentScope post_api model wrapper.
Taking `gpt2` in the Hugging Face Inference API as an example, you can use the following model config in AgentScope:
```json
{
    "model_type": "post_api",
    "config_name": "gpt2",
    "headers": {
        "Authorization": "Bearer {YOUR_API_TOKEN}"
    },
    "api_url": "https://api-inference.huggingface.co/models/gpt2"
}
```
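For reference, the post_api wrapper issues HTTP POST requests to the configured `api_url` with the given headers; a roughly equivalent direct call to the Hugging Face Inference API looks like the sketch below (replace `{YOUR_API_TOKEN}` with your own token):

```python
# Direct call to the Hugging Face Inference API for gpt2, roughly what the
# post_api config above amounts to. Replace the token placeholder.
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer {YOUR_API_TOKEN}"}

resp = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "AgentScope is"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```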