diff --git a/docs/faq.rst b/docs/faq.rst
new file mode 100644
index 0000000..07e5c1a
--- /dev/null
+++ b/docs/faq.rst
@@ -0,0 +1,59 @@
+Frequently Asked Questions (FAQ)
+================================
+
+
+Is my "model_fn" called at each invocation?
+-------------------------------------------
+
+No.
+
+The :func:`model_fn` function is called during the very first invocation only.
+Once the model has been loaded, it is retained in memory for as long as the service runs.
+
+To speed up the very first invocation, it is possible to trigger the `model_fn` hook in advance.
+To do this, simply call :func:`inference_server.warmup`.
+
+For example, when using Gunicorn, this could be done from a post-fork Gunicorn hook::
+
+    def post_fork(server, worker):
+        worker.log.info("Warming up worker...")
+        inference_server.warmup()
+
+
+Does **inference-server** support async/ASGI webservers?
+--------------------------------------------------------
+
+No.
+
+**inference-server** is a WSGI application to be used by synchronous webservers.
+
+For most ML models that is the correct choice, as model inference is typically CPU-bound.
+Therefore, a multi-process WSGI server is a good choice, with the number of workers equal to the number of CPU cores available.
+
+For more details see :ref:`deployment:Configuring Gunicorn workers`.
+
+
+My model is leaking memory, how do I address that?
+--------------------------------------------------
+
+If the memory leak is outside your control, one approach would be to periodically restart the webserver workers.
+
+For example, when using Gunicorn, it is possible to specify a maximum number of HTTP requests (`max_requests`) after which a given worker should be restarted.
+Gunicorn additionally allows a random offset (`max_requests_jitter`) to be added such that worker restarts are staggered.
+
+For more details see the `Gunicorn settings documentation <https://docs.gunicorn.org/en/stable/settings.html>`_.
+
+
+How do I invoke my model using a data stream from my favourite message queue system?
+------------------------------------------------------------------------------------
+
+By design, **inference-server** is an HTTP web server and uses a simple request-response model.
+
+This allows it to be deployed in most environments, including not only AWS SageMaker but also a local Dockerized service.
+The web server can also be accessed from a range of environments, including AWS itself as well as other providers in a multi-cloud setup.
+
+Depending on the messaging/queueing system and cloud environment, you have various options for integrating a model deployed with **inference-server** with a message stream.
+
+For example, in AWS, you could deploy a Lambda function which consumes messages from AWS SQS and sends them as HTTP requests to AWS SageMaker.
+Equally, the Lambda function could write the SageMaker response to another SQS queue.
+Of course, instead of a Lambda function you could use any other compute platform to deploy similar logic, including an EKS pod or ECS task.
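+
+As a rough sketch, such a Lambda handler might look as follows; the endpoint name ``my-endpoint`` and the JSON content type are illustrative assumptions, not part of **inference-server**::
+
+    import boto3
+
+    sagemaker = boto3.client("sagemaker-runtime")
+
+    def handler(event, context):
+        # An SQS-triggered invocation delivers a batch of messages under "Records"
+        for record in event["Records"]:
+            # Forward the raw message body to the SageMaker endpoint serving the model
+            response = sagemaker.invoke_endpoint(
+                EndpointName="my-endpoint",  # hypothetical endpoint name
+                ContentType="application/json",
+                Body=record["body"],
+            )
+            prediction = response["Body"].read()
+            # The prediction could now be written to another SQS queue, a database, etc.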
diff --git a/docs/index.rst b/docs/index.rst
index c76d669..744b98d 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -4,7 +4,7 @@ inference-server
 Deploy your AI/ML model to Amazon SageMaker for Real-Time Inference and Batch Transform using your own Docker container image.

 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: Contents

    introduction
@@ -12,6 +12,7 @@ Deploy your AI/ML model to Amazon SageMaker for Real-Time Inference and Batch Tr
    batch_transform
    deployment
    testing
+   faq

    modules
diff --git a/docs/modules.rst b/docs/modules.rst
index e85375c..0555b36 100644
--- a/docs/modules.rst
+++ b/docs/modules.rst
@@ -1,8 +1,8 @@
-API Documentation
-=================
+API reference documentation
+===========================

 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1

    inference_server
    inference_server_testing
diff --git a/src/inference_server/__init__.py b/src/inference_server/__init__.py
index 5ab9c29..9f7ff0b 100644
--- a/src/inference_server/__init__.py
+++ b/src/inference_server/__init__.py
@@ -42,6 +42,7 @@
     "MIMEAccept",  # Exporting for plugin developers' convenience
     "create_app",
     "plugin_hook",
+    "warmup",
 )

 #: Library version, e.g. 1.0.0, taken from Git tags
@@ -70,12 +71,20 @@ class BatchStrategy(enum.Enum):


 def create_app() -> "WSGIApplication":
-    """Initialize and return the WSGI application"""
+    """
+    Initialize and return the WSGI application
+
+    This is the WSGI application factory function that needs to be passed to a WSGI-compatible web server.
+    """
     return _app


 def warmup() -> None:
-    """Initialize any additional resources upfront"""
+    """
+    Initialize any additional resources upfront
+
+    This will call the ``model_fn`` plugin hook.
+    """
     _model()