Auto-performance tuning, testing, and inference configuration for ML workloads. Bring your ML models, push to start, and wait for AIOps automation to find the optimal configuration you need for production inference. Built for production ML workloads on Kubernetes.
Explore the docs »
View Demo
·
Report Bug
·
Request Feature
⚡️🍻 for those who just want to drink their inference, not figure out how to build the entire bar to serve it.
You've already spent ample time and energy brewing your delicious ML model. Now you need to serve it... but you have no bar or bartender... you only have the retail space (cloud) and the customer requirements (product/users). Figuring out the right architecture and model-serving capabilities is a whole different conundrum from what you did in your notebook.
ReallyFast is the bartender-as-a-service... including Vegas-grade capabilities with hole-in-the-wall cost optimizations out of the box. In other words, it takes your model and figures out the optimal production architecture and configuration for your desired performance.
ML applications that require high inference performance (e.g., high throughput and low latency) need optimal configuration settings across many tuning parameters in the application and infrastructure layers. Traditionally, ML applications rely on software engineers to tune these configurations by hand––e.g., thread & memory usage, worker counts, container image types & environment settings, framework compilation settings, CPU/GPU optimization frameworks & their settings, batching modes, and so on.
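To make that tuning surface concrete, here is a minimal, purely illustrative sketch (not part of this library) of the serving-layer knobs a single Python/Gunicorn inference service already exposes before Kubernetes even enters the picture:

```python
# gunicorn.conf.py -- illustrative only; every value below interacts with
# CPU pinning, batching, and pod resource limits in non-obvious ways.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # the classic web heuristic, rarely optimal for ML
threads = 2                                    # per-worker threads; these contend with BLAS/OMP threads
worker_class = "gthread"                       # sync vs. gthread vs. gevent changes the latency profile
timeout = 120                                  # must cover cold-start model loading

# The model runtime reads its own knobs from the container environment, e.g.:
#   OMP_NUM_THREADS, MKL_NUM_THREADS, KMP_AFFINITY, TF_NUM_INTRAOP_THREADS
```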
However, this manual method has several limitations:
- Tuning the full configuration space is an NP-hard problem, and engineers can only ever explore a small fraction of it.
- Engineers spend significant time and exploratory development tuning even that small set of configurations.
- Engineers are usually good at tuning the application layers they know well, e.g., Python/Flask/Gunicorn, but often lack the expertise to tune the rest of the stack for optimal inference performance: container runtimes, hardware, and accelerator frameworks (MKL, ONNX, CUDA, etc.); one slice of that layer is sketched after this list.
- Kubernetes complicates this problem space even further.
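As an example of that unfamiliar accelerator-framework territory: ONNX Runtime's session options control threading and graph optimization, and the defaults are rarely what a latency-sensitive service wants. A minimal, illustrative sketch, assuming an exported `model.onnx` and the `onnxruntime` package:

```python
import onnxruntime as ort

# Illustrative only: the right values depend on the hardware, the pod's CPU
# limits, and how the Gunicorn workers/threads above it are configured.
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4                  # threads per operator; should respect the pod's CPU limit
opts.inter_op_num_threads = 1                  # parallelism across operators
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```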
These limitations are severe for production inference workloads that need to scale efficiently. Production ML apps fail because they cannot serve users promptly, cannot scale to critical mass, or incur the high costs of naive scaling methods.
Our enterprise customers get the smarts of our global, continuously trained performance models, as well as economies-of-scale discounts plus multi-model and model-pipeline capabilities. As an OSS user, you still get our base model, which is still wicked smart, along with this library, which automates some annoying performance-testing and tuning tasks for you.
- Clone the repo
- Drop in your model
- Put in some human-level requirements, e.g. "I need to scale to 10,000 users" or "I can't spend more than $5,000 a month"
- Push to start
- Wait for the results
... ok, it's not quite that simple: you need a cloud environment and a budget of around $100. There is also some advanced config if you want to really make this brew spicy; a rough sketch of the basic flow is below.
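For illustration only, the basic flow might look something like this; the names below (`reallyfast`, `Tuner`, `run`) are assumptions made for this example rather than the library's actual API, so check the docs for the real interface.

```python
# Hypothetical sketch -- module, class, and parameter names here are
# illustrative assumptions, not this library's actual API.
from reallyfast import Tuner  # assumed entry point

tuner = Tuner(
    model_path="models/churn_classifier.onnx",  # drop in your model
    requirements={                              # human-level requirements
        "target_users": 10_000,                 # "I need to scale to 10,000 users"
        "max_monthly_cost_usd": 5_000,          # "spend no more than $5,000 a month"
    },
    cloud_budget_usd=100,                       # experiment budget for the tuning runs
)

report = tuner.run()        # push to start, wait for the results
print(report.best_config)   # recommended architecture & configuration
```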