Optimize your model to speedup inference with OpenVINO and Neural Compressor
+Optimize your model to speedup inference with OpenVINO , Neural Compressor and IPEX
Accelerate your training and inference workflows with AWS Trainium and AWS Inferentia
Accelerate your training and inference workflows with Google TPUs
-> [!TIP] -> Some packages provide hardware-agnostic features (e.g. INC interface in Optimum Intel). - - ## Open-source integrations 🤗 Optimum also supports a variety of open-source frameworks to make model optimization very easy. diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx index c08b3f92e5c..27733574c80 100644 --- a/docs/source/installation.mdx +++ b/docs/source/installation.mdx @@ -25,6 +25,7 @@ If you'd like to use the accelerator-specific features of 🤗 Optimum, you can | [ONNX Runtime](https://huggingface.co/docs/optimum/onnxruntime/overview) | `pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]` | | [Intel Neural Compressor](https://huggingface.co/docs/optimum/intel/index) | `pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]` | | [OpenVINO](https://huggingface.co/docs/optimum/intel/index) | `pip install --upgrade --upgrade-strategy eager optimum[openvino]` | +| [IPEX](https://huggingface.co/docs/optimum/intel/index) | `pip install --upgrade --upgrade-strategy eager optimum[ipex]` | | [NVIDIA TensorRT-LLM](https://huggingface.co/docs/optimum/main/en/nvidia_overview) | `docker run -it --gpus all --ipc host huggingface/optimum-nvidia` | | [AMD Instinct GPUs and Ryzen AI NPU](https://huggingface.co/docs/optimum/amd/index) | `pip install --upgrade --upgrade-strategy eager optimum[amd]` | | [AWS Trainum & Inferentia](https://huggingface.co/docs/optimum-neuron/index) | `pip install --upgrade --upgrade-strategy eager optimum[neuronx]` | diff --git a/docs/source/onnxruntime/package_reference/modeling_ort.mdx b/docs/source/onnxruntime/package_reference/modeling_ort.mdx index 65b2b60195a..2c93ab3ac0d 100644 --- a/docs/source/onnxruntime/package_reference/modeling_ort.mdx +++ b/docs/source/onnxruntime/package_reference/modeling_ort.mdx @@ -119,6 +119,11 @@ The following ORT classes are available for the following custom tasks. ## Stable Diffusion +#### ORTDiffusionPipeline + +[[autodoc]] onnxruntime.ORTDiffusionPipeline + - __call__ + #### ORTStableDiffusionPipeline [[autodoc]] onnxruntime.ORTStableDiffusionPipeline diff --git a/docs/source/onnxruntime/usage_guides/models.mdx b/docs/source/onnxruntime/usage_guides/models.mdx index 131822e9568..27ac446096b 100644 --- a/docs/source/onnxruntime/usage_guides/models.mdx +++ b/docs/source/onnxruntime/usage_guides/models.mdx @@ -4,263 +4,128 @@ Optimum is a utility package for building and running inference with accelerated Optimum can be used to load optimized models from the [Hugging Face Hub](hf.co/models) and create pipelines to run accelerated inference without rewriting your APIs. -## Switching from Transformers to Optimum -The `optimum.onnxruntime.ORTModelForXXX` model classes are API compatible with Hugging Face Transformers models. This -means you can just replace your `AutoModelForXXX` class with the corresponding `ORTModelForXXX` class in `optimum.onnxruntime`. +## Loading -You do not need to adapt your code to get it to work with `ORTModelForXXX` classes: +### Transformers models -```diff -from transformers import AutoTokenizer, pipeline --from transformers import AutoModelForQuestionAnswering -+from optimum.onnxruntime import ORTModelForQuestionAnswering - --model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # PyTorch checkpoint -+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # ONNX checkpoint -tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2") - -onnx_qa = pipeline("question-answering",model=model,tokenizer=tokenizer) - -question = "What's my name?" -context = "My name is Philipp and I live in Nuremberg." -pred = onnx_qa(question, context) -``` - -### Loading a vanilla Transformers model - -Because the model you want to work with might not be already converted to ONNX, [`~optimum.onnxruntime.ORTModel`] -includes a method to convert vanilla Transformers models to ONNX ones. Simply pass `export=True` to the -[`~optimum.onnxruntime.ORTModel.from_pretrained`] method, and your model will be loaded and converted to ONNX on-the-fly: - -```python ->>> from optimum.onnxruntime import ORTModelForSequenceClassification - ->>> # Load the model from the hub and export it to the ONNX format ->>> model = ORTModelForSequenceClassification.from_pretrained( -... "distilbert-base-uncased-finetuned-sst-2-english", export=True -... ) -``` - -### Pushing ONNX models to the Hugging Face Hub - -It is also possible, just as with regular [`~transformers.PreTrainedModel`]s, to push your `ORTModelForXXX` to the -[Hugging Face Model Hub](https://hf.co/models): - -```python ->>> from optimum.onnxruntime import ORTModelForSequenceClassification +Once your model was [exported to the ONNX format](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model), you can load it by replacing `AutoModelForXxx` with the corresponding `ORTModelForXxx` class. ->>> # Load the model from the hub and export it to the ONNX format ->>> model = ORTModelForSequenceClassification.from_pretrained( -... "distilbert-base-uncased-finetuned-sst-2-english", export=True -... ) +```diff + from transformers import AutoTokenizer, pipeline +- from transformers import AutoModelForCausalLM ++ from optimum.onnxruntime import ORTModelForCausalLM ->>> # Save the converted model ->>> model.save_pretrained("a_local_path_for_convert_onnx_model") +- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B) # PyTorch checkpoint ++ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint + tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B") -# Push the onnx model to HF Hub ->>> model.push_to_hub( # doctest: +SKIP -... "a_local_path_for_convert_onnx_model", repository_id="my-onnx-repo", use_auth_token=True -... ) + pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + result = pipe("He never went out without a book under his arm") ``` -## Sequence-to-sequence models +More information for all the supported `ORTModelForXxx` in our [documentation](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort) -Sequence-to-sequence (Seq2Seq) models can also be used when running inference with ONNX Runtime. When Seq2Seq models -are exported to the ONNX format, they are decomposed into three parts that are later combined during inference: -- The encoder part of the model -- The decoder part of the model + the language modeling head -- The same decoder part of the model + language modeling head but taking and using pre-computed key / values as inputs and -outputs. This makes inference faster. -Here is an example of how you can load a T5 model to the ONNX format and run inference for a translation task: +### Diffusers models +Once your model was [exported to the ONNX format](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model), you can load it by replacing `DiffusionPipeline` with the corresponding `ORTDiffusionPipeline` class. -```python ->>> from transformers import AutoTokenizer, pipeline ->>> from optimum.onnxruntime import ORTModelForSeq2SeqLM - -# Load the model from the hub and export it to the ONNX format ->>> model_name = "t5-small" ->>> model = ORTModelForSeq2SeqLM.from_pretrained(model_name, export=True) ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) - -# Create a pipeline ->>> onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer) ->>> text = "He never went out without a book under his arm, and he often came back with two." ->>> result = onnx_translation(text) ->>> # [{'translation_text': "Il n'est jamais sorti sans un livre sous son bras, et il est souvent revenu avec deux."}] -``` -## Stable Diffusion - -Stable Diffusion models can also be used when running inference with ONNX Runtime. When Stable Diffusion models -are exported to the ONNX format, they are split into four components that are later combined during inference: -- The text encoder -- The U-NET -- The VAE encoder -- The VAE decoder - -Make sure you have 🤗 Diffusers installed. - -To install `diffusers`: -```bash -pip install diffusers +```diff +- from diffusers import DiffusionPipeline ++ from optimum.onnxruntime import ORTDiffusionPipeline + + model_id = "runwayml/stable-diffusion-v1-5" +- pipeline = DiffusionPipeline.from_pretrained(model_id) ++ pipeline = ORTDiffusionPipeline.from_pretrained(model_id, revision="onnx") + prompt = "sailing ship in storm by Leonardo da Vinci" + image = pipeline(prompt).images[0] ``` -### Text-to-Image -Here is an example of how you can load an ONNX Stable Diffusion model and run inference using ONNX Runtime: +### Sentence Transformers models -```python -from optimum.onnxruntime import ORTStableDiffusionPipeline - -model_id = "runwayml/stable-diffusion-v1-5" -pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, revision="onnx") -prompt = "sailing ship in storm by Leonardo da Vinci" -image = pipeline(prompt).images[0] -``` - -To load your PyTorch model and convert it to ONNX on-the-fly, you can set `export=True`. +Once your model was [exported to the ONNX format](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model), you can load it by replacing `AutoModel` with the corresponding `ORTModelForFeatureExtraction` class. -```python -pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True) - -# Don't forget to save the ONNX model -save_directory = "a_local_path" -pipeline.save_pretrained(save_directory) +```diff + from transformers import AutoTokenizer +- from transformers import AutoModel ++ from optimum.onnxruntime import ORTModelForFeatureExtraction + + tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") +- model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") ++ model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2") + inputs = tokenizer("This is an example sentence", return_tensors="pt") + outputs = model(**inputs) ``` -