You set per model settings by adding a serving.properties file in the root of your model directory (or .zip). These apply for all engines and modes.
An example serving.properties
can be found here.
In serving.properties
, you can set the following properties. Model properties are accessible to Translator
and python handler functions.
engine
: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, etc.load_on_devices
: A ; delimited devices list, which the model to be loaded on, default to load on all devices.translatorFactory
: Specify the TranslatorFactory.job_queue_size
: Specify the job queue size at model level, this will override globaljob_queue_size
, default is1000
.batch_size
: the dynamic batch size, default is1
.max_batch_delay
- the maximum delay for batch aggregation in millis, default value is100
milliseconds.max_idle_time
- the maximum idle time in seconds before the worker thread is scaled down, default is60
seconds.log_request_metric
: Enable model metrics (inference, pre-process and post-process latency) logging.metrics_aggregation
: Number of model metrics to aggregate, default is1000
.minWorkers
: Minimum number of workers, default is1
.maxWorkers
: Maximum number of workers, default is#CPU/OMP_NUM_THREAD
for CPU, GPU default is2
, inferentia default is2
(PyTorch engine),1
(Python engine) .gpu.minWorkers
: Minimum number of workers for GPU.gpu.maxWorkers
: Maximum number of workers for GPU.cpu.minWorkers
: Minimum number of workers for CPU.cpu.maxWorkers
: Maximum number of workers for CPU.required_memory_mb
: Specify the required memory (CPU and GPU) in MB to load the model.gpu.required_memory_mb
: Specify the required GPU memory in MB to load the model.reserved_memory_mb
: Reserve memory in MB to avoid system out of memory.gpu.reserved_memory_mb
: Reserve GPU memory in MB to avoid system out of memory.
In serving.properties
, you can also set options (prefixed with option
) and properties.
The options will be passed to Model.load(Path modelPath, String prefix, Map<String, ?> options)
API.
It allows you to set engine specific configurations.
Here are some of the available option properties:
# set model file name prefix if different from folder name
option.modeName=resnet18_v1
# PyTorch options
option.mapLocation=true
option.extraFiles=foo.txt,bar.txt
# ONNXRuntime options
option.interOpNumThreads=2
option.intraOpNumThreads=2
option.executionMode=SEQUENTIAL
option.optLevel=BASIC_OPT
option.memoryPatternOptimization=true
option.cpuArenaAllocator=true
option.disablePerSessionThreads=true
option.customOpLibrary=myops.so
option.disablePerSessionThreads=true
option.ortDevice=TensorRT/ROCM/CoreML
# Python model options
# Mark model as failure after python process crashing 10 times
retry_threshold=10
option.pythonExecutable=python3
option.entryPoint=huggingface.py
option.handler=hanlde
option.predict_timeout=120
option.model_loading_timeout=10
option.parallel_loading=true
option.tensor_parallel_degree=2
option.enable_venv=true
option.rolling_batch=auto
#option.rolling_batch=lmi-dist
option.max_rolling_batch_size=64
# max output size in bytes, default to 60M
option.max_output_size=67108864
Most of the options can also be overridden by an environment variable with the OPTION_
prefix.
For example:
# to enable rolling batch with only environment variable:
export OPTION_ROLLING_BATCH=auto
You can set number of workers for each model: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8
For example, set minimum workers and maximum workers for your model:
minWorkers=32
maxWorkers=64
Or you can configure minimum workers and maximum workers differently for GPU and CPU:
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
job queue size, batch size, max batch delay, max worker idle time can be configured at per model level, this will override global settings:
job_queue_size=10
batch_size=2
max_batch_delay=1
max_idle_time=120
You can configure which device to load the model on, default is *:
load_on_devices=gpu4;gpu5
# or simply:
load_on_devices=4;5
For Python engine, we recommend set minWorkers
and maxWorkers
to be the same since python
worker scale up and down is expensive.
You may also need to consider OMP_NUM_THREAD
when setting number workers. OMP_NUM_THREAD
is default
to 1
, you can unset OMP_NUM_THREAD
by setting NO_OMP_NUM_THREADS=true
. If OMP_NUM_THREAD
is unset,
the maxWorkers
will be default to 2 (larger maxWorkers
with non 1 OMP_NUM_THREAD
can cause thread
contention, and reduce throughput).
Set minimum workers and maximum workers for your model:
minWorkers=32
maxWorkers=64
# idle time in seconds before the worker thread is scaled down
max_idle_time=120
Or set minimum workers and maximum workers differently for GPU and CPU:
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
Note: Loading model in Python mode is pretty heavy. We recommend to set minWorker
and maxWorker
to be the same value to avoid unnecessary load and unload.
Or override global job_queue_size
:
job_queue_size=10
To enable dynamic batching:
batch_size=2
max_batch_delay=1
To enable rolling batch for Python engine:
# lmi-dist requires running mpi mode
engine=Python
option.mpi_mode=true
option.rolling_batch=auto
# use FlashAttention
#option.rolling_batch=lmi-dist
#option.rolling_batch=scheduler
option.max_rolling_batch_size=64
To enable fast model downloading, you can store your model artifacts (weights) in a S3 bucket, and
only keep the model code and metadata in the model.tar.gz
(.zip) file. DJL can leverage
s5cmd to download uncompressed files from S3 with extremely fast
speed.
To enable s5cmd
downloading, you can configure serving.properties
as the following:
option.model_id=s3://YOUR_BUCKET/...
If you want to deploy multiple python models, but their dependencies has conflict, you can enable python virtual environments for your model:
option.enable_venv=true