Add Aria (#34157)
* Add Aria
---------

Co-authored-by: Cyril Vallez <[email protected]>
Co-authored-by: Arthur <[email protected]>
3 people authored Dec 6, 2024
1 parent 15ab310 commit 9ad4c93
Showing 32 changed files with 6,244 additions and 7 deletions.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -810,6 +810,8 @@
title: ALIGN
- local: model_doc/altclip
title: AltCLIP
- local: model_doc/aria
title: Aria
- local: model_doc/blip
title: BLIP
- local: model_doc/blip-2
3 changes: 3 additions & 0 deletions docs/source/en/index.md
@@ -62,6 +62,8 @@ Flax), PyTorch, and/or TensorFlow.
| [ALBERT](model_doc/albert) | ✅ | ✅ | ✅ |
| [ALIGN](model_doc/align) | ✅ | ❌ | ❌ |
| [AltCLIP](model_doc/altclip) | ✅ | ❌ | ❌ |
| [Aria](model_doc/aria) | ✅ | ❌ | ❌ |
| [AriaText](model_doc/aria_text) | ✅ | ❌ | ❌ |
| [Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer) | ✅ | ❌ | ❌ |
| [Autoformer](model_doc/autoformer) | ✅ | ❌ | ❌ |
| [Bark](model_doc/bark) | ✅ | ❌ | ❌ |
@@ -172,6 +174,7 @@ Flax), PyTorch, and/or TensorFlow.
| [IDEFICS](model_doc/idefics) | ✅ | ✅ | ❌ |
| [Idefics2](model_doc/idefics2) | ✅ | ❌ | ❌ |
| [Idefics3](model_doc/idefics3) | ✅ | ❌ | ❌ |
| [Idefics3VisionTransformer](model_doc/idefics3_vision) | ✅ | ❌ | ❌ |
| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ |
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
106 changes: 106 additions & 0 deletions docs/source/en/model_doc/aria.md
@@ -0,0 +1,106 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Aria

## Overview

The Aria model was proposed in [Aria: An Open Multimodal Native Mixture-of-Experts Model](https://huggingface.co/papers/2410.05993) by Li et al. from the Rhymes.AI team.

Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It is built on a Mixture-of-Experts architecture that activates 3.9B parameters per visual token and 3.5B per text token.

The abstract from the paper is the following:

*Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles for adoptions, let alone adaptations. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-expert model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive against the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoptions and adaptations of Aria in real-world applications.*

This model was contributed by [m-ric](https://huggingface.co/m-ric).
The original code can be found [here](https://github.com/rhymes-ai/Aria).
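
The composite design is visible from the configuration alone. The following is a minimal sketch (not from the original doc), assuming the nested `text_config`/`vision_config` layout implied by the classes this PR exports:

```python
from transformers import AriaConfig

# Minimal sketch: AriaConfig nests the MoE text backbone config and the
# vision tower config added in this PR; we only read what the checkpoint declares.
config = AriaConfig.from_pretrained("rhymes-ai/Aria")
print(type(config.text_config).__name__)    # expected: AriaTextConfig
print(type(config.vision_config).__name__)  # the Idefics3-based vision tower config
```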

## Usage tips

Here's how to use the model for vision tasks:
```python
import requests
import torch
from PIL import Image

from transformers import AriaProcessor, AriaForConditionalGeneration

model_id_or_path = "rhymes-ai/Aria"

# Load in bfloat16 to halve memory use (assumes a bf16-capable GPU).
model = AriaForConditionalGeneration.from_pretrained(
    model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16
)

processor = AriaProcessor.from_pretrained(model_id_or_path)

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is the image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs = inputs.to(model.device)  # `to()` returns a new BatchFeature, so reassign

output = model.generate(
    **inputs,
    max_new_tokens=15,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
    do_sample=True,
    temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
```
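
The same `model`/`processor` pair can also be prompted without an image. A hedged sketch (not from the original doc), continuing from the snippet above and assuming the chat template accepts image-free messages:

```python
# Hedged sketch: text-only generation through the same processor/model pair.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Write a haiku about autumn."}]}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=40,
    stop_strings=["<|im_end|>"],
    tokenizer=processor.tokenizer,
)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```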


## AriaImageProcessor

[[autodoc]] AriaImageProcessor

## AriaProcessor

[[autodoc]] AriaProcessor

## AriaTextConfig

[[autodoc]] AriaTextConfig

## AriaConfig

[[autodoc]] AriaConfig

## AriaTextModel

[[autodoc]] AriaTextModel

## AriaTextForCausalLM

[[autodoc]] AriaTextForCausalLM

## AriaForConditionalGeneration

[[autodoc]] AriaForConditionalGeneration
- forward
7 changes: 7 additions & 0 deletions docs/source/en/model_doc/idefics3.md
@@ -51,6 +51,13 @@ This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts)

[[autodoc]] Idefics3Config

## Idefics3VisionConfig

[[autodoc]] Idefics3VisionConfig

## Idefics3VisionTransformer

[[autodoc]] Idefics3VisionTransformer
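
With the vision tower now exported as a standalone class, it can be smoke-tested in isolation. A hedged sketch (not part of the diff), assuming the usual `PreTrainedModel` construction and a `pixel_values` forward input:

```python
import torch
from transformers import Idefics3VisionConfig, Idefics3VisionTransformer

# Hypothetical smoke test: build the standalone vision tower from a default,
# randomly initialized config and run one random image through it.
config = Idefics3VisionConfig()
vision_model = Idefics3VisionTransformer(config)

pixel_values = torch.randn(1, 3, config.image_size, config.image_size)
with torch.no_grad():
    outputs = vision_model(pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # (1, num_patches, hidden_size)
```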

## Idefics3Model

2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -37,6 +37,7 @@ FlashAttention-2 is experimental and may change considerably in future versions.
2. partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them

FlashAttention-2 is currently supported for the following architectures:
* [Aria](https://huggingface.co/docs/transformers/model_doc/aria#transformers.AriaForConditionalGeneration)
* [Bark](https://huggingface.co/docs/transformers/model_doc/bark#transformers.BarkModel)
* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel)
* [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon)
@@ -216,6 +217,7 @@ PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.o

For now, Transformers supports SDPA inference and training for the following architectures:
* [Albert](https://huggingface.co/docs/transformers/model_doc/albert#transformers.AlbertModel)
* [Aria](https://huggingface.co/docs/transformers/model_doc/aria#transformers.AriaForConditionalGeneration)
* [Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTModel)
* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel)
* [Bert](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel)
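Both backends listed above are selected with the same `attn_implementation` argument at load time. The snippet below is a sketch, not part of the diff; the FlashAttention-2 variant assumes the `flash-attn` package is installed and requires a half-precision dtype:

```python
import torch
from transformers import AriaForConditionalGeneration

# Sketch: opt into FlashAttention-2 (requires flash-attn and fp16/bf16 weights).
model = AriaForConditionalGeneration.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Or use PyTorch's built-in SDPA kernels instead.
model = AriaForConditionalGeneration.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```
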
32 changes: 32 additions & 0 deletions src/transformers/__init__.py
@@ -170,6 +170,11 @@
"AltCLIPTextConfig",
"AltCLIPVisionConfig",
],
"models.aria": [
"AriaConfig",
"AriaProcessor",
"AriaTextConfig",
],
"models.audio_spectrogram_transformer": [
"ASTConfig",
"ASTFeatureExtractor",
@@ -1176,6 +1181,7 @@
_import_structure["image_processing_base"] = ["ImageProcessingMixin"]
_import_structure["image_processing_utils"] = ["BaseImageProcessor"]
_import_structure["image_utils"] = ["ImageFeatureExtractionMixin"]
_import_structure["models.aria"].extend(["AriaImageProcessor"])
_import_structure["models.beit"].extend(["BeitFeatureExtractor", "BeitImageProcessor"])
_import_structure["models.bit"].extend(["BitImageProcessor"])
_import_structure["models.blip"].extend(["BlipImageProcessor"])
@@ -1406,6 +1412,15 @@
"AltCLIPVisionModel",
]
)
_import_structure["models.aria"].extend(
[
"AriaForConditionalGeneration",
"AriaPreTrainedModel",
"AriaTextForCausalLM",
"AriaTextModel",
"AriaTextPreTrainedModel",
]
)
_import_structure["models.audio_spectrogram_transformer"].extend(
[
"ASTForAudioClassification",
@@ -2461,6 +2476,8 @@
"Idefics3Model",
"Idefics3PreTrainedModel",
"Idefics3Processor",
"Idefics3VisionConfig",
"Idefics3VisionTransformer",
]
)
_import_structure["models.ijepa"].extend(
@@ -5033,6 +5050,11 @@
        AltCLIPTextConfig,
        AltCLIPVisionConfig,
    )
    from .models.aria import (
        AriaConfig,
        AriaProcessor,
        AriaTextConfig,
    )
    from .models.audio_spectrogram_transformer import (
        ASTConfig,
        ASTFeatureExtractor,
@@ -6096,6 +6118,7 @@
        from .image_processing_base import ImageProcessingMixin
        from .image_processing_utils import BaseImageProcessor
        from .image_utils import ImageFeatureExtractionMixin
        from .models.aria import AriaImageProcessor
        from .models.beit import BeitFeatureExtractor, BeitImageProcessor
        from .models.bit import BitImageProcessor
        from .models.blip import BlipImageProcessor
@@ -6325,6 +6348,13 @@
            AltCLIPTextModel,
            AltCLIPVisionModel,
        )
        from .models.aria import (
            AriaForConditionalGeneration,
            AriaPreTrainedModel,
            AriaTextForCausalLM,
            AriaTextModel,
            AriaTextPreTrainedModel,
        )
        from .models.audio_spectrogram_transformer import (
            ASTForAudioClassification,
            ASTModel,
@@ -7189,6 +7219,8 @@
            Idefics3Model,
            Idefics3PreTrainedModel,
            Idefics3Processor,
            Idefics3VisionConfig,
            Idefics3VisionTransformer,
        )
        from .models.ijepa import (
            IJepaForImageClassification,
1 change: 1 addition & 0 deletions src/transformers/generation/utils.py
@@ -1465,6 +1465,7 @@ def _prepare_generated_length(
        elif (
            model_input_name == "inputs_embeds"
            and input_ids_length != inputs_tensor.shape[1]
            and input_ids_length != 0
            and not self.config.is_encoder_decoder
        ):
            generation_config.max_length -= inputs_tensor.shape[1]
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -16,6 +16,7 @@
    albert,
    align,
    altclip,
    aria,
    audio_spectrogram_transformer,
    auto,
    autoformer,
30 changes: 30 additions & 0 deletions src/transformers/models/aria/__init__.py
@@ -0,0 +1,30 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_aria import *
    from .image_processing_aria import *
    from .modeling_aria import *
    from .processing_aria import *

else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
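
For context on the `_LazyModule` pattern above (not part of the diff): submodule imports are deferred until a symbol is first accessed. A small sketch of the observable behavior, assuming `transformers` is installed:

```python
# Sketch: attribute access on the lazy module triggers the real import.
from transformers.models import aria

print(type(aria).__name__)  # _LazyModule, not a regular module object
model_cls = aria.AriaForConditionalGeneration  # modeling_aria is imported here
print(model_cls.__name__)
```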