Add Molmo (7B-D, 7B-O, 70B) #33962

Open · wants to merge 145 commits into base: main

Changes from 75 commits (145 commits total)

Commits
dc6fcac
add base convert keys + chat template
molbap Oct 1, 2024
574e01f
Merge branch 'main' into add_molmo
molbap Oct 2, 2024
0bd413b
draft: add up modular files for molmo
molbap Oct 4, 2024
9e454e4
Squashed commit of the following:
molbap Oct 8, 2024
d82c471
sync changes
molbap Oct 8, 2024
339a8d3
push a simple fix
ArthurZucker Oct 8, 2024
c0c25d6
finish fixing
ArthurZucker Oct 8, 2024
5ee6a44
Merge branch 'main' into add_molmo
molbap Oct 8, 2024
33e43ec
suppress diff
molbap Oct 8, 2024
d23e1c1
Merge branch 'main' into add_molmo
molbap Oct 10, 2024
c8c12fe
fix
ArthurZucker Oct 10, 2024
0909c02
style
ArthurZucker Oct 10, 2024
1799d20
add config + 2d pooling
molbap Oct 10, 2024
fb133d4
suppress changes
molbap Oct 10, 2024
5ba4105
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap Oct 10, 2024
a2a6a9b
fix
ArthurZucker Oct 10, 2024
8fe7a9f
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
ArthurZucker Oct 10, 2024
20681f5
conversion works :raised_hands:
molbap Oct 11, 2024
c85af98
fixup
molbap Oct 11, 2024
35ea3cc
handle missing MOLMO_VISION_ATTENTION_CLASSES
molbap Oct 11, 2024
ab79d0e
fix
molbap Oct 11, 2024
b9bdf99
fix fused keys mismatch
molbap Oct 15, 2024
98d5ccd
fix
molbap Oct 15, 2024
3bca742
[Modular-breaking] add manually vision attention classes list
molbap Oct 15, 2024
a13fe05
finish weight conversion script
molbap Oct 15, 2024
fac8dfd
add more keys
molbap Oct 16, 2024
c1e5f19
flipped the linear layers
molbap Oct 16, 2024
a68e5f5
add pooling forward + draft general forward
molbap Oct 16, 2024
8298b80
modeling file with swiglu, forward(input_ids) passing
molbap Oct 16, 2024
9f69c6b
BIG push of image processor
molbap Oct 23, 2024
0711e08
add missing objects to init
molbap Oct 23, 2024
7efe22e
Merge branch 'main' into add_molmo
molbap Nov 5, 2024
f5bd3b0
fix up wrong channel dimension
molbap Nov 7, 2024
3ae884f
fix typo
molbap Nov 7, 2024
3ef60c0
add missing image token indices used in forward
molbap Nov 19, 2024
cf9d4ab
pad patch orderings
molbap Nov 19, 2024
91a2d3c
clean up conversion script
molbap Nov 19, 2024
0f7904f
remind that tests are TODO
molbap Nov 19, 2024
577e347
merge main
zucchini-nlp Nov 21, 2024
b514041
at least it runs like this
zucchini-nlp Nov 24, 2024
cf6cb5d
add bos token
molbap Nov 27, 2024
26c517d
add bos token in prompt
molbap Nov 27, 2024
35c168d
fix processor, missing batching img_mask
molbap Nov 27, 2024
e7275c7
fix image masks + batching
molbap Nov 27, 2024
3e7530d
working version
zucchini-nlp Nov 27, 2024
4bbc89b
+1 only on non masked indices
zucchini-nlp Nov 27, 2024
54e072b
attemp 1 to make modular work
zucchini-nlp Nov 27, 2024
1e99752
update conversion to fit all ckpt + chat template + clean up a bit
zucchini-nlp Nov 27, 2024
92a1f31
fix processing tests
zucchini-nlp Nov 27, 2024
42330e0
add more tests (failing for now)
zucchini-nlp Nov 27, 2024
932f6d1
fix the conversion
zucchini-nlp Nov 27, 2024
aafb827
done!
zucchini-nlp Nov 27, 2024
36cc6dd
nit
zucchini-nlp Nov 27, 2024
f399c3a
some tests are failing, coming back tomorrow
zucchini-nlp Nov 27, 2024
7322227
adapt to any image format
molbap Nov 27, 2024
e4db50a
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap Nov 27, 2024
205a755
try to get batched generation working
molbap Nov 28, 2024
eb61617
fix other tests, should work now
zucchini-nlp Nov 28, 2024
b77d947
adjust test for batching
zucchini-nlp Nov 28, 2024
ba4dd50
little bit of style
zucchini-nlp Nov 28, 2024
0e2d184
docs + imports + automapping
zucchini-nlp Nov 28, 2024
9a83706
remove images kwargs
zucchini-nlp Nov 28, 2024
171eb8e
some unused config attributes
zucchini-nlp Nov 28, 2024
35b517a
remove additional vocab size and pad lm head
zucchini-nlp Nov 28, 2024
6a0cbc5
remove einops dependency
molbap Nov 28, 2024
5c7b141
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap Nov 28, 2024
434d4b1
dont skip these tests
zucchini-nlp Nov 28, 2024
4645f97
format + add integration testing
molbap Nov 28, 2024
48f2e21
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap Nov 28, 2024
4bb4e48
fix tests + fix 72B conversion
molbap Nov 29, 2024
e676782
fix format
molbap Nov 29, 2024
a74bda2
modualr kinda works but adds extra classes like `VisionVisionModel` :(
zucchini-nlp Nov 29, 2024
2c428ae
accomodate 7B-O version as well (broken)
molbap Nov 29, 2024
d338153
merge, fix conflicts and clean up modular extra code
molbap Nov 29, 2024
00376c4
fix 7B-O
zucchini-nlp Dec 2, 2024
48354fe
remove unused code path
zucchini-nlp Dec 2, 2024
d738493
nit
zucchini-nlp Dec 3, 2024
d0e90d4
make modular work mostly
zucchini-nlp Dec 3, 2024
f06b6d9
fix imports
zucchini-nlp Dec 3, 2024
9fc25c0
update modulat last time
zucchini-nlp Dec 3, 2024
38dc9e8
fix copies
zucchini-nlp Dec 3, 2024
eb77f3c
fix copies
zucchini-nlp Dec 4, 2024
190cc35
fix tests
zucchini-nlp Dec 4, 2024
84ed244
initial push of fast processor
molbap Dec 6, 2024
b4d48d5
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap Dec 6, 2024
1298d08
Merge branch 'main' into add_molmo
molbap Dec 10, 2024
6687d43
fix various issues + tests
molbap Dec 10, 2024
5f79577
add Molmo submodules as private
molbap Dec 10, 2024
9e72758
do not test submodules
molbap Dec 10, 2024
439aed6
[run-slow] molmo
molbap Dec 10, 2024
5a6a965
underscore prefixed method is not public
molbap Dec 10, 2024
b9746a8
fix tests
molbap Dec 10, 2024
2090ed6
fix docs
molbap Dec 10, 2024
8ad3a25
[run-slow] molmo
molbap Dec 10, 2024
0d10ee4
Merge branch 'main' into add_molmo
molbap Dec 10, 2024
9bd96f5
fix cache shape
molbap Dec 10, 2024
af5468b
[run-slow] molmo
molbap Dec 10, 2024
c02c6de
trigger CI
molbap Dec 10, 2024
5f35055
mark flaky test
molbap Dec 10, 2024
2b7af87
add missing objects
molbap Dec 10, 2024
9f0f09d
add config to init
molbap Dec 10, 2024
74ebb24
more init fixes
molbap Dec 10, 2024
8b00c44
fix style
molbap Dec 10, 2024
d6403ad
fix?
molbap Dec 10, 2024
eb43cb9
fix
molbap Dec 10, 2024
33f0624
what is this again
molbap Dec 10, 2024
cc59007
Merge branch 'main' into add_molmo
molbap Dec 10, 2024
23ae692
is this real life
molbap Dec 10, 2024
4c456e7
it was real life, fix broken eager
molbap Dec 10, 2024
91f2820
fix attribtues
molbap Dec 10, 2024
e2df6bc
this attention should be fixed
molbap Dec 10, 2024
ae77cc6
set 7b test to bf16
molbap Dec 11, 2024
166b28a
[run-slow] molmo
molbap Dec 11, 2024
50bcb7c
Merge branch 'main' into add_molmo
molbap Dec 11, 2024
bf012d8
[run-slow] molmo
molbap Dec 11, 2024
6e0634b
fix text (variability T4/A100)
molbap Dec 11, 2024
8569fd0
push clean Fast (x3!) image processor
molbap Dec 12, 2024
fd401bc
Merge branch 'main' into add_molmo
molbap Dec 12, 2024
86acf22
fix modular changes from main
molbap Dec 12, 2024
1ebea3c
Merge branch 'main' into add_molmo
molbap Dec 16, 2024
5ebc6f0
push fast image proc with device check
molbap Dec 23, 2024
19d2689
push fast image proc with device check
molbap Dec 23, 2024
c652bb9
format
molbap Dec 23, 2024
50c21e5
images kwargs were missing
molbap Dec 23, 2024
092da76
merge and fix conflicts
molbap Dec 23, 2024
1254eac
style
molbap Dec 23, 2024
bd39143
update with modular conversion
molbap Dec 23, 2024
3efcb13
add torch import
molbap Dec 23, 2024
56ae76f
style
molbap Dec 23, 2024
9417ff7
protect import
molbap Dec 23, 2024
51f9336
fix modular
molbap Dec 23, 2024
3719481
Merge branch 'main' into add_molmo
molbap Jan 7, 2025
f394b02
cherry-pick: cohere (from 67c3fcd4f32c64e07f302f00243be7d54914d78b)
molbap Jan 8, 2025
e418aa3
fix modular with cohere interface
molbap Jan 8, 2025
5af0b57
fixup cohere all imports
molbap Jan 8, 2025
a574b93
fix bf16 test output
molbap Jan 8, 2025
9f3018d
fix
molbap Jan 8, 2025
e2d1ba8
style
molbap Jan 8, 2025
c872095
Merge branch 'main' into add_molmo
molbap Jan 9, 2025
41ab3a7
uniformize fast image processor
molbap Jan 9, 2025
dd74b78
Merge branch 'main' into add_molmo
molbap Jan 9, 2025
d052666
fix merge
molbap Jan 9, 2025
0a822f4
unbloat modular a tad
molbap Jan 9, 2025
8ebf44f
fix import
molbap Jan 9, 2025
4e6070f
fix modular
molbap Jan 9, 2025
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -224,6 +224,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ |
| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ |
| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
| [Molmo](model_doc/molmo) | ✅ | ❌ | ❌ |
| [Moshi](model_doc/moshi) | ✅ | ❌ | ❌ |
| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ |
| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ |
118 changes: 118 additions & 0 deletions docs/source/en/model_doc/molmo.md
@@ -0,0 +1,118 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Molmo

## Overview

The Molmo model was proposed in [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models](https://arxiv.org/abs/2409.17146) by Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi.
Comment on lines +21 to +22
Member

Suggested change
The Molmo model was proposed in [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
]([https://arxiv.org/abs/2409.17146]) by Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi.
The Molmo model was proposed in [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models]([https://arxiv.org/abs/2409.17146]) by Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi.


Molmo, developed by the AllenAI team, is an open-source multimodal AI model capable of processing text and images within a unified framework. It outperforms larger models in efficiency and accuracy, leveraging high-quality datasets like PixMo for tasks such as captioning, question answering, and visual pointing.

The abstract from the paper is the following:

*Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation.*
Comment on lines +28 to +29
Member

Suggested change
*Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation.
*
*Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation.*


<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/molmo_arch.png"
alt="drawing" width="600"/>

<small> Molmo incorporates images by encoding various patches of the input image. Taken from the <a href="https://arxiv.org/abs/2409.17146">original paper.</a> </small>


Tips:

- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating.
Member

Suggested change
- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating.
- We recommend calling `processor.tokenizer.padding_side = "left"` for batched generation because it leads to more accurate results.



This model was contributed by [Molbap](https://huggingface.co/Molbap).


## Usage example

### Single image inference

Here's how to load the model and perform inference in half-precision (`torch.float16`):

```python
from transformers import MolmoForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
import requests

model = MolmoForConditionalGeneration.from_pretrained("allenai/Molmo-7B-D-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("allenai/Molmo-7B-D-hf")

image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```
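
### Batched generation

As noted in the tips above, batched generation works best with left padding. The snippet below is a minimal sketch of how this could look, reusing the checkpoint and the processor call pattern from the single-image example; the second image URL and the `padding=True` argument are illustrative assumptions rather than a tested recipe.

```python
from transformers import MolmoForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
import requests

model = MolmoForConditionalGeneration.from_pretrained(
    "allenai/Molmo-7B-D-hf", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("allenai/Molmo-7B-D-hf")

# Left padding gives more accurate results for batched generation
processor.tokenizer.padding_side = "left"

# Two example images (URLs are placeholders)
urls = [
    "https://picsum.photos/id/237/536/354",
    "https://picsum.photos/id/238/536/354",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Same prompt for every image; pad the text inputs to a common length
inputs = processor(images, [prompt] * len(images), padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output, skip_special_tokens=True))
```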


## MolmoConfig

[[autodoc]] MolmoConfig

## MolmoTextConfig

[[autodoc]] MolmoTextConfig

## MolmoVisionConfig

[[autodoc]] MolmoVisionConfig

## MolmoPoolingConfig

[[autodoc]] MolmoPoolingConfig

## MolmoImageProcessor

[[autodoc]] MolmoImageProcessor

## MolmoProcessor

[[autodoc]] MolmoProcessor

## MolmoTextModel

[[autodoc]] MolmoTextModel
- forward

## MolmoForCausalLM

[[autodoc]] MolmoForCausalLM
- forward

## MolmoForConditionalGeneration

[[autodoc]] MolmoForConditionalGeneration
- forward
2 changes: 1 addition & 1 deletion docs/source/en/modular_transformers.md
@@ -194,4 +194,4 @@ We now also support special cases like
class GemmaVisionModel(CLIPModel):
pass
```
where the name of your class `GemmaVision` is not the same as the modular `Gemma`. This is super useful for composite models.
where the name of your class `GemmaVision` is not the same as the modular `Gemma`. This is super useful for composite models.
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -65,6 +65,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
* [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
* [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
* [Molmo](https://huggingface.co/docs/transformers/model_doc/molmo)
* [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)
* [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava)
* [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)
@@ -256,6 +257,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [Molmo](https://huggingface.co/docs/transformers/model_doc/molmo)
* [Moshi](https://huggingface.co/docs/transformers/model_doc/moshi#transformers.MoshiModel)
* [Musicgen](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenModel)
* [MusicGen Melody](https://huggingface.co/docs/transformers/model_doc/musicgen_melody#transformers.MusicgenMelodyModel)
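Since the diff above adds Molmo to both the FlashAttention-2 and SDPA support lists, a minimal sketch of selecting either backend is shown below. It uses the standard `attn_implementation` argument from Transformers; the checkpoint name is taken from the doc example in this PR, and FlashAttention-2 additionally assumes the `flash-attn` package is installed and the model is loaded in fp16/bf16.

```python
from transformers import MolmoForConditionalGeneration
import torch

# SDPA is the default when available; selecting it explicitly:
model = MolmoForConditionalGeneration.from_pretrained(
    "allenai/Molmo-7B-D-hf",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)

# FlashAttention-2 (requires the flash-attn package and half-precision weights):
model = MolmoForConditionalGeneration.from_pretrained(
    "allenai/Molmo-7B-D-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```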
23 changes: 23 additions & 0 deletions src/transformers/__init__.py
@@ -593,6 +593,7 @@
"models.mobilenet_v2": ["MobileNetV2Config"],
"models.mobilevit": ["MobileViTConfig"],
"models.mobilevitv2": ["MobileViTV2Config"],
"models.molmo": ["MolmoConfig", "MolmoImageProcessor", "MolmoProcessor"],
"models.moshi": [
"MoshiConfig",
"MoshiDepthConfig",
@@ -1220,6 +1221,7 @@
_import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
_import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"])
_import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
_import_structure["models.molmo"].append("MolmoImageProcessor")
_import_structure["models.nougat"].append("NougatImageProcessor")
_import_structure["models.oneformer"].extend(["OneFormerImageProcessor"])
_import_structure["models.owlv2"].append("Owlv2ImageProcessor")
@@ -2805,6 +2807,15 @@
"MobileViTV2PreTrainedModel",
]
)
_import_structure["models.molmo"].extend(
[
"MolmoForCausalLM",
"MolmoForConditionalGeneration",
"MolmoPreTrainedModel",
"MolmoTextModel",
]
)

_import_structure["models.moshi"].extend(
[
"MoshiForCausalLM",
@@ -5486,6 +5497,11 @@
from .models.mobilevitv2 import (
MobileViTV2Config,
)
from .models.molmo import (
MolmoConfig,
MolmoImageProcessor,
MolmoProcessor,
)
from .models.moshi import (
MoshiConfig,
MoshiDepthConfig,
@@ -6151,6 +6167,7 @@
MobileNetV2ImageProcessor,
)
from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
from .models.molmo import MolmoImageProcessor
from .models.nougat import NougatImageProcessor
from .models.oneformer import OneFormerImageProcessor
from .models.owlv2 import Owlv2ImageProcessor
@@ -7442,6 +7459,12 @@
MobileViTV2Model,
MobileViTV2PreTrainedModel,
)
from .models.molmo import (
MolmoForCausalLM,
MolmoForConditionalGeneration,
MolmoPreTrainedModel,
MolmoTextModel,
)
from .models.moshi import (
MoshiForCausalLM,
MoshiForConditionalGeneration,
2 changes: 1 addition & 1 deletion src/transformers/integrations/awq.py
@@ -383,7 +383,7 @@ def _fuse_awq_attention_layers(model, module, modules_to_fuse, current_module_na
The `QuantAttentionFused` class as it only supports that class
for now.
"""
from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV
from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV, WQLinear_IPEX

module_has_been_fused = False

1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -162,6 +162,7 @@
mobilenet_v2,
mobilevit,
mobilevitv2,
molmo,
moshi,
mpnet,
mpt,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -180,6 +180,7 @@
("mobilenet_v2", "MobileNetV2Config"),
("mobilevit", "MobileViTConfig"),
("mobilevitv2", "MobileViTV2Config"),
("molmo", "MolmoConfig"),
("moshi", "MoshiConfig"),
("mpnet", "MPNetConfig"),
("mpt", "MptConfig"),
@@ -494,6 +495,7 @@
("mobilenet_v2", "MobileNetV2"),
("mobilevit", "MobileViT"),
("mobilevitv2", "MobileViTV2"),
("molmo", "Molmo"),
("moshi", "Moshi"),
("mpnet", "MPNet"),
("mpt", "MPT"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -109,6 +109,7 @@
("mobilenet_v2", ("MobileNetV2ImageProcessor",)),
("mobilevit", ("MobileViTImageProcessor",)),
("mobilevitv2", ("MobileViTImageProcessor",)),
("molmo", ("MolmoImageProcessor",)),
("nat", ("ViTImageProcessor", "ViTImageProcessorFast")),
("nougat", ("NougatImageProcessor",)),
("oneformer", ("OneFormerImageProcessor",)),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -510,6 +510,7 @@
("mistral", "MistralForCausalLM"),
("mixtral", "MixtralForCausalLM"),
("mllama", "MllamaForCausalLM"),
("molmo", "MolmoForCausalLM"),
("moshi", "MoshiForCausalLM"),
("mpt", "MptForCausalLM"),
("musicgen", "MusicgenForCausalLM"),
@@ -754,6 +755,7 @@
("llava_next_video", "LlavaNextVideoForConditionalGeneration"),
("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
("mllama", "MllamaForConditionalGeneration"),
("molmo", "MolmoForConditionalGeneration"),
("paligemma", "PaliGemmaForConditionalGeneration"),
("pix2struct", "Pix2StructForConditionalGeneration"),
("qwen2_vl", "Qwen2VLForConditionalGeneration"),
@@ -779,6 +781,7 @@
("llava_next", "LlavaNextForConditionalGeneration"),
("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
("mllama", "MllamaForConditionalGeneration"),
("molmo", "MolmoForConditionalGeneration"),
("paligemma", "PaliGemmaForConditionalGeneration"),
("pix2struct", "Pix2StructForConditionalGeneration"),
("pixtral", "LlavaForConditionalGeneration"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -79,6 +79,7 @@
("mctct", "MCTCTProcessor"),
("mgp-str", "MgpstrProcessor"),
("mllama", "MllamaProcessor"),
("molmo", "MolmoProcessor"),
("oneformer", "OneFormerProcessor"),
("owlv2", "Owlv2Processor"),
("owlvit", "OwlViTProcessor"),
@@ -332,6 +333,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
elif type(config) in PROCESSOR_MAPPING:
return PROCESSOR_MAPPING[type(config)].from_pretrained(pretrained_model_name_or_path, **kwargs)

print("BUT WHY", processor_class)
Collaborator
😉 to remove!

Contributor Author
lol, some debugging struggles scars left

# At this stage, there doesn't seem to be a `Processor` class available for this model, so let's try a
# tokenizer.
try:
29 changes: 29 additions & 0 deletions src/transformers/models/molmo/__init__.py
@@ -0,0 +1,29 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_molmo import *
from .image_processing_molmo import *
from .modeling_molmo import *
from .processing_molmo import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
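
For context, the `_LazyModule` pattern in this `__init__.py` means the concrete Molmo classes are only imported on first attribute access rather than at package import time. A minimal sketch of the observable behavior (assumes a Transformers install that includes this PR's `molmo` module):

```python
# The submodule object itself is a _LazyModule placeholder
from transformers.models import molmo

print(type(molmo).__name__)  # "_LazyModule" — nothing model-specific loaded yet

# Accessing an attribute triggers the real import of configuration_molmo
print(molmo.MolmoConfig)
```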