Skip to content

Commit

Permalink
Adds documentation for adapters (#1197)
Browse files Browse the repository at this point in the history
  • Loading branch information
zachgk authored Oct 18, 2023
1 parent b47f115 commit f86c355
Show file tree
Hide file tree
Showing 3 changed files with 276 additions and 0 deletions.
186 changes: 186 additions & 0 deletions serving/docs/adapters.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
# Adapters

**Note that this API is experimental and is subject to change. Using it requires the environment variable feature flag `ENABLE_ADAPTERS_PREVIEW`.**

DJL Serving has first class support for adapters.
Adapters are patches or changes that can be made to a model to fine tune it for a particular usage.
The benefit of adapters rather than whole model fine-tuning is that they are often smaller and easier to distribute alongside a base model.
This can allow for multiple adapters used at the same system, and sometimes even in the same batch.

With DJL Serving, it is possible to easily work with adapters.
You can create models that accept adapters, use DJL Serving to manage your available adapters, and call models with adapters.

For a concrete usage, see the [large model inference adapters example notebook](http://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/multi_lora_adapter_inference.html).

## Managing Adapters

There are several options to choose between for managing your set of adapters.

### Adapters local directory (Recommended)

The easiest option is to use an adapters local directory.
This is as easy as adding a directory of adapters alongside your model files.
It should contain an overarching adapters directory with an artifact directory for each adapter to add.
This works best for having a manageable set of adapters as they are all loaded on startup.
It can be used in conjunction with services like [Amazon SageMaker Single Model Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-single-model.html).

```
ls:
serving.properties
model.py (optional)
model artifacts... (optional)
adapters/
adapter1/
...
adapter2/
...
```

### Management API

The next option is to manage adapters using the management API.
This can be used in conjunction with our existing management API and supports all standard restful options.
See the [Adapter Management API Documentation](adapters_api.md) for details.
However, this option may be difficult to use inside wrapping systems such as Amazon SageMaker.

```
GET models/{modelName}/adapters - List adapters
GET models/{modelName}/adapters/{adapterName} - Get adapter description
POST models/{modelName}/adapters - Create adapter
DEL models/{modelName}/adapters/{adapterName} - Delete adapter
```

### Workflow Adapters

The final option for working with adapters is through the [DJL Serving workflows system](workflows.md).
You can use the adapter `WorkflowFunction` to create and call an adapted version of a model within the workflow.
With our workflows, multiple workflows sharing models will be de-duplicated.
So, the effect of having multiple adapters can be easily made with having one workflow for each adapter.
This system can be used on [Amazon SageMaker Multi-Model Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html).

```
workflow.json:
{
"name": "adapter1",
"version": "0.1",
"models": {
"m": "src/test/resources/adaptecho"
},
"configs": {
"adapters": {
"a1": {
"model": "m",
"src": "url1"
}
}
},
"workflow": {
"out": ["adapter", "a1", "in"]
}
}
```

## Calling Adapters

When calling a model with an adapter, you are also able to specify which adapter to use.
We currently support several techniques for passing in the adapter data.

There are a few things to keep in mind which choosing a calling technique.

1. If you are using workflow adapters, there is no need to specify an adapter as it is included in the workflow.
Instead, just call the workflow as normal.
2. Each technique must be implemented by the model handler by parsing the adapter from the Input parameter, content, or body respectively.
Our built-in implementations support it from all options
3. Some of these techniques may not work in all situations. For example, only the custom attributes strategy will work in Amazon SageMaker as it blocks the other options.


### (Recommended) Adapters parameter Calling

This passes the adapter as part of the requeset body.
It will also work with client-side batching and will allow multiple adapters to be passed with one for each input.
If the adapters is not passed, the base model will be used for inference.
You can also specify the base model for an element in the batch by using the empty string `""`.

```
curl -X POST http://127.0.0.1:8080/invocations \
-H "Content-Type: application/json" \
-H "X-Amzn-SageMaker-Target-Model: base-1.tar.gz" \
-d '{"inputs": ["How is the weather"], "adapters": ["adapter_1"], "parameters": {"max_new_tokens": 25}}'
```

### Input Content (and query parameter)

This passes the adapter through a query parameter.
It is reflected in the Input content.
This will not work in Amazon SageMaker.

```
curl -X POST http://127.0.0.1:8080/invocations?adapter=adapter_1 \
-H "Content-Type: application/json" \
-H "X-Amzn-SageMaker-Target-Model: base-1.tar.gz" \
-d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}'
```

### SageMaker Custom Attributes

This passes the adapter through the use of the SageMaker Custom Attributes header.
It is reflected in the Input properties.
This option will work in Amazon SageMaker.

```
curl -X POST http://127.0.0.1:8080/invocations \
-H "Content-Type: application/json" \
-H "X-Amzn-SageMaker-Target-Model: base-1.tar.gz" \
-H "X-Amzn-SageMaker-Custom-Attributes: adapter=adapter_1"
-d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}'
```

## Models with Adapters

There are two kinds of LoRA that can be put onto various engines.

* **Merged LoRA** - This will apply the adapter by modifying the base model in place.
It has zero added latency during execution, but has a cost to apply or unapply the merge.
It works best for cases with only a few adapters.
It is best for single adapter batches, but doesn’t support multi-adapter batches
* **Unmerged LoRA** - This will alter the model operators to factor in the adapters without changing the base model.
It has a higher inference latency for the additional adapter operations.
However, it does support multi-adapter batches.
It works best for use cases with large numbers of adapters.

With our default handlers, we currently support unmerged LoRA for CausalLM through the huggingface handler.
Support for other model types and handlers is coming soon.

### Writing Custom Adapter Models

Right now, adapters are only supported through our Python engine and not any of the other DJL engines.
See instructions to get started with writing a [python handler](modes.md#python-mode).

To add support for adapters, you must first add the register and unregister like below.
These can then take the adapters and save the src, pre-download it, cache it in memory, or cache it on an accelerator device.

```python
def register_adapter(inputs: Input):
name = inputs.get_properties()["name"]
src = inputs.get_properties()["src"]
# Do adapter registration tasks
return Output().add("Successfully registered adapter")

def unregister_adapter(inputs: Input):
name = inputs.get_properties()["name"]
# Do adapter unregistration tasks
return Output().add("Successfully unregistered adapter")


def handle(inputs: Input):
...
```

Within the handler, you must parse the adapter from the inputs.
The adapter information can be put in the inputs property, content, or body depending on the technique(s) used.
If desired, you can also parse multiple adapter passing options for greater ease of calling.

From there, you can use those as part of your inference call.
This will depend on the specific python deep learning framework you are using for inference.
For example, you would update the [HF Accelerate](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate) by passing in the adapters parameter.
88 changes: 88 additions & 0 deletions serving/docs/adapters_api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# DJL Serving Adapters Management API

**Note that this API is experimental and is subject to change. Using it requires the environment variable feature flag `ENABLE_ADAPTERS_PREVIEW`.**

DJL Serving provides a set of API allow user to manage adapters at runtime:

1. [Register an adapter](#register-an-adapter)
3. [Describe an adapter's status](#describe-adapter)
4. [Unregister an adapter](#unregister-an-adapter)
5. [List registered adapters](#list-adapters)

This is an extension of the [Management API](management_api.md) and can be accessed the same.

## Adapter Management APIs

### Register an adapter

`POST /models/{modelName}/adapters`

* name - The adapter name.
* src - The adapter src. It currently requires a file, but eventually an id or URL can be supported depending on the model handler.

```bash
curl -X POST "http://localhost:8080/models/adaptecho/adapters?name=a1&src=..."

{
"status": "Adapter \"a1\" registered."
}
```

### Describe adapter

`GET /models/{model_name}/adapters/{adapter_name}`

Use the Describe Adapter API to get the status of an adapter:

```bash
curl http://localhost:8080/models/adaptecho/adapters/a1

[
{
"name": "a1",
"src": "..."
}
]
```

### Unregister an adapter

`DELETE /models/{model_name}/adapters/{adapter_name}`

Use the Unregister Adapter API to free up system resources:

```bash
curl -X DELETE http://localhost:8080/models/adaptecho/adapters/a1

{
"status": "Adapter \"a1\" unregistered"
}
```

### List adapters

`GET /models/{model_name}/adapters`

* limit - (optional) the maximum number of items to return. It is passed as a query parameter. The default value is `100`.
* next_page_token - (optional) queries for next page. It is passed as a query parameter. This value is return by a previous API call.

Use the Adapters API to query current registered adapters:

```bash
curl "http://localhost:8080/models/adaptecho/adapters"
```

This API supports pagination:

```bash
curl "http://localhost:8080/models/adaptecho/adapters?limit=2&next_page_token=0"

{
"adapters": [
{
"name": "a1",
"src": "..."
}
]
}
```
2 changes: 2 additions & 0 deletions serving/docs/management_api.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ DJL Serving provides a set of API allow user to manage models at runtime:
4. [Unregister a model](#unregister-a-model-or-workflow)
5. [List registered models](#list-workflows)

In addition, there is also the [adapter management API](adapters_api.md) for managing adapters.

Management API is listening on port 8080 and only accessible from localhost by default. To change the default setting, see [DJL Serving Configuration](configuration.md).

Similar as [Inference API](inference_api.md).
Expand Down

0 comments on commit f86c355

Please sign in to comment.