-
Notifications
You must be signed in to change notification settings - Fork 68
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Adds documentation for adapters (#1197)
- Loading branch information
Showing
3 changed files
with
276 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,186 @@ | ||
# Adapters | ||
|
||
**Note that this API is experimental and is subject to change. Using it requires the environment variable feature flag `ENABLE_ADAPTERS_PREVIEW`.** | ||
|
||
DJL Serving has first class support for adapters. | ||
Adapters are patches or changes that can be made to a model to fine tune it for a particular usage. | ||
The benefit of adapters rather than whole model fine-tuning is that they are often smaller and easier to distribute alongside a base model. | ||
This can allow for multiple adapters used at the same system, and sometimes even in the same batch. | ||
|
||
With DJL Serving, it is possible to easily work with adapters. | ||
You can create models that accept adapters, use DJL Serving to manage your available adapters, and call models with adapters. | ||
|
||
For a concrete usage, see the [large model inference adapters example notebook](http://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/multi_lora_adapter_inference.html). | ||
|
||
## Managing Adapters | ||
|
||
There are several options to choose between for managing your set of adapters. | ||
|
||
### Adapters local directory (Recommended) | ||
|
||
The easiest option is to use an adapters local directory. | ||
This is as easy as adding a directory of adapters alongside your model files. | ||
It should contain an overarching adapters directory with an artifact directory for each adapter to add. | ||
This works best for having a manageable set of adapters as they are all loaded on startup. | ||
It can be used in conjunction with services like [Amazon SageMaker Single Model Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-single-model.html). | ||
|
||
``` | ||
ls: | ||
serving.properties | ||
model.py (optional) | ||
model artifacts... (optional) | ||
adapters/ | ||
adapter1/ | ||
... | ||
adapter2/ | ||
... | ||
``` | ||
|
||
### Management API | ||
|
||
The next option is to manage adapters using the management API. | ||
This can be used in conjunction with our existing management API and supports all standard restful options. | ||
See the [Adapter Management API Documentation](adapters_api.md) for details. | ||
However, this option may be difficult to use inside wrapping systems such as Amazon SageMaker. | ||
|
||
``` | ||
GET models/{modelName}/adapters - List adapters | ||
GET models/{modelName}/adapters/{adapterName} - Get adapter description | ||
POST models/{modelName}/adapters - Create adapter | ||
DEL models/{modelName}/adapters/{adapterName} - Delete adapter | ||
``` | ||
|
||
### Workflow Adapters | ||
|
||
The final option for working with adapters is through the [DJL Serving workflows system](workflows.md). | ||
You can use the adapter `WorkflowFunction` to create and call an adapted version of a model within the workflow. | ||
With our workflows, multiple workflows sharing models will be de-duplicated. | ||
So, the effect of having multiple adapters can be easily made with having one workflow for each adapter. | ||
This system can be used on [Amazon SageMaker Multi-Model Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html). | ||
|
||
``` | ||
workflow.json: | ||
{ | ||
"name": "adapter1", | ||
"version": "0.1", | ||
"models": { | ||
"m": "src/test/resources/adaptecho" | ||
}, | ||
"configs": { | ||
"adapters": { | ||
"a1": { | ||
"model": "m", | ||
"src": "url1" | ||
} | ||
} | ||
}, | ||
"workflow": { | ||
"out": ["adapter", "a1", "in"] | ||
} | ||
} | ||
``` | ||
|
||
## Calling Adapters | ||
|
||
When calling a model with an adapter, you are also able to specify which adapter to use. | ||
We currently support several techniques for passing in the adapter data. | ||
|
||
There are a few things to keep in mind which choosing a calling technique. | ||
|
||
1. If you are using workflow adapters, there is no need to specify an adapter as it is included in the workflow. | ||
Instead, just call the workflow as normal. | ||
2. Each technique must be implemented by the model handler by parsing the adapter from the Input parameter, content, or body respectively. | ||
Our built-in implementations support it from all options | ||
3. Some of these techniques may not work in all situations. For example, only the custom attributes strategy will work in Amazon SageMaker as it blocks the other options. | ||
|
||
|
||
### (Recommended) Adapters parameter Calling | ||
|
||
This passes the adapter as part of the requeset body. | ||
It will also work with client-side batching and will allow multiple adapters to be passed with one for each input. | ||
If the adapters is not passed, the base model will be used for inference. | ||
You can also specify the base model for an element in the batch by using the empty string `""`. | ||
|
||
``` | ||
curl -X POST http://127.0.0.1:8080/invocations \ | ||
-H "Content-Type: application/json" \ | ||
-H "X-Amzn-SageMaker-Target-Model: base-1.tar.gz" \ | ||
-d '{"inputs": ["How is the weather"], "adapters": ["adapter_1"], "parameters": {"max_new_tokens": 25}}' | ||
``` | ||
|
||
### Input Content (and query parameter) | ||
|
||
This passes the adapter through a query parameter. | ||
It is reflected in the Input content. | ||
This will not work in Amazon SageMaker. | ||
|
||
``` | ||
curl -X POST http://127.0.0.1:8080/invocations?adapter=adapter_1 \ | ||
-H "Content-Type: application/json" \ | ||
-H "X-Amzn-SageMaker-Target-Model: base-1.tar.gz" \ | ||
-d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}' | ||
``` | ||
|
||
### SageMaker Custom Attributes | ||
|
||
This passes the adapter through the use of the SageMaker Custom Attributes header. | ||
It is reflected in the Input properties. | ||
This option will work in Amazon SageMaker. | ||
|
||
``` | ||
curl -X POST http://127.0.0.1:8080/invocations \ | ||
-H "Content-Type: application/json" \ | ||
-H "X-Amzn-SageMaker-Target-Model: base-1.tar.gz" \ | ||
-H "X-Amzn-SageMaker-Custom-Attributes: adapter=adapter_1" | ||
-d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}' | ||
``` | ||
|
||
## Models with Adapters | ||
|
||
There are two kinds of LoRA that can be put onto various engines. | ||
|
||
* **Merged LoRA** - This will apply the adapter by modifying the base model in place. | ||
It has zero added latency during execution, but has a cost to apply or unapply the merge. | ||
It works best for cases with only a few adapters. | ||
It is best for single adapter batches, but doesn’t support multi-adapter batches | ||
* **Unmerged LoRA** - This will alter the model operators to factor in the adapters without changing the base model. | ||
It has a higher inference latency for the additional adapter operations. | ||
However, it does support multi-adapter batches. | ||
It works best for use cases with large numbers of adapters. | ||
|
||
With our default handlers, we currently support unmerged LoRA for CausalLM through the huggingface handler. | ||
Support for other model types and handlers is coming soon. | ||
|
||
### Writing Custom Adapter Models | ||
|
||
Right now, adapters are only supported through our Python engine and not any of the other DJL engines. | ||
See instructions to get started with writing a [python handler](modes.md#python-mode). | ||
|
||
To add support for adapters, you must first add the register and unregister like below. | ||
These can then take the adapters and save the src, pre-download it, cache it in memory, or cache it on an accelerator device. | ||
|
||
```python | ||
def register_adapter(inputs: Input): | ||
name = inputs.get_properties()["name"] | ||
src = inputs.get_properties()["src"] | ||
# Do adapter registration tasks | ||
return Output().add("Successfully registered adapter") | ||
|
||
def unregister_adapter(inputs: Input): | ||
name = inputs.get_properties()["name"] | ||
# Do adapter unregistration tasks | ||
return Output().add("Successfully unregistered adapter") | ||
|
||
|
||
def handle(inputs: Input): | ||
... | ||
``` | ||
|
||
Within the handler, you must parse the adapter from the inputs. | ||
The adapter information can be put in the inputs property, content, or body depending on the technique(s) used. | ||
If desired, you can also parse multiple adapter passing options for greater ease of calling. | ||
|
||
From there, you can use those as part of your inference call. | ||
This will depend on the specific python deep learning framework you are using for inference. | ||
For example, you would update the [HF Accelerate](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate) by passing in the adapters parameter. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
# DJL Serving Adapters Management API | ||
|
||
**Note that this API is experimental and is subject to change. Using it requires the environment variable feature flag `ENABLE_ADAPTERS_PREVIEW`.** | ||
|
||
DJL Serving provides a set of API allow user to manage adapters at runtime: | ||
|
||
1. [Register an adapter](#register-an-adapter) | ||
3. [Describe an adapter's status](#describe-adapter) | ||
4. [Unregister an adapter](#unregister-an-adapter) | ||
5. [List registered adapters](#list-adapters) | ||
|
||
This is an extension of the [Management API](management_api.md) and can be accessed the same. | ||
|
||
## Adapter Management APIs | ||
|
||
### Register an adapter | ||
|
||
`POST /models/{modelName}/adapters` | ||
|
||
* name - The adapter name. | ||
* src - The adapter src. It currently requires a file, but eventually an id or URL can be supported depending on the model handler. | ||
|
||
```bash | ||
curl -X POST "http://localhost:8080/models/adaptecho/adapters?name=a1&src=..." | ||
|
||
{ | ||
"status": "Adapter \"a1\" registered." | ||
} | ||
``` | ||
|
||
### Describe adapter | ||
|
||
`GET /models/{model_name}/adapters/{adapter_name}` | ||
|
||
Use the Describe Adapter API to get the status of an adapter: | ||
|
||
```bash | ||
curl http://localhost:8080/models/adaptecho/adapters/a1 | ||
|
||
[ | ||
{ | ||
"name": "a1", | ||
"src": "..." | ||
} | ||
] | ||
``` | ||
|
||
### Unregister an adapter | ||
|
||
`DELETE /models/{model_name}/adapters/{adapter_name}` | ||
|
||
Use the Unregister Adapter API to free up system resources: | ||
|
||
```bash | ||
curl -X DELETE http://localhost:8080/models/adaptecho/adapters/a1 | ||
|
||
{ | ||
"status": "Adapter \"a1\" unregistered" | ||
} | ||
``` | ||
|
||
### List adapters | ||
|
||
`GET /models/{model_name}/adapters` | ||
|
||
* limit - (optional) the maximum number of items to return. It is passed as a query parameter. The default value is `100`. | ||
* next_page_token - (optional) queries for next page. It is passed as a query parameter. This value is return by a previous API call. | ||
|
||
Use the Adapters API to query current registered adapters: | ||
|
||
```bash | ||
curl "http://localhost:8080/models/adaptecho/adapters" | ||
``` | ||
|
||
This API supports pagination: | ||
|
||
```bash | ||
curl "http://localhost:8080/models/adaptecho/adapters?limit=2&next_page_token=0" | ||
|
||
{ | ||
"adapters": [ | ||
{ | ||
"name": "a1", | ||
"src": "..." | ||
} | ||
] | ||
} | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters