Add Molmo (7B-D, 7B-O, 70B) #33962
@@ -0,0 +1,118 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Molmo

## Overview

The Molmo model was proposed in [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models](https://arxiv.org/abs/2409.17146) by Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi.

Molmo, developed by the AllenAI team, is an open-source multimodal model that processes text and images within a unified framework. Leveraging high-quality datasets such as PixMo, it handles tasks like captioning, question answering, and visual pointing while outperforming larger models in efficiency and accuracy.

The abstract from the paper is the following:

*Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/molmo_arch.png"
alt="drawing" width="600"/>

<small> Molmo incorporates images by encoding various patches of the input image. Taken from the <a href="https://arxiv.org/abs/2409.17146">original paper.</a> </small>
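The patch-based image encoding described in the caption can be illustrated with a minimal sketch. This is illustrative only and is not Molmo's actual preprocessing; `patchify` and the toy 4x4 "image" are hypothetical stand-ins for the idea of splitting an input into fixed-size crops that a vision backbone consumes.

```python
# Illustrative sketch only: split a 2D grid into non-overlapping
# patch x patch tiles, row-major, the basic idea behind patch encoding.
def patchify(grid, patch=2):
    """Return the list of patch x patch tiles covering the grid."""
    h, w = len(grid), len(grid[0])
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tiles.append([row[left:left + patch] for row in grid[top:top + patch]])
    return tiles

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
tiles = patchify(image)
print(len(tiles))   # 4
print(tiles[0])     # [[1, 2], [5, 6]]
```

Each tile is then embedded independently before being fed, together with the text tokens, to the language model.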
Tips:

- We advise users to use `padding_side="left"` for batched generation, as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating.
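The reason left padding matters can be sketched without the model: a decoder-only LM continues from the last token, so pad tokens must sit on the left, before the prompt, not between the prompt and the generated text. The `left_pad` helper below is a hypothetical illustration, not a transformers API.

```python
# Illustrative sketch only: left-pad token-id sequences so every prompt
# ends at the same position, which is where generation resumes.
def left_pad(batch, pad_id=0):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

padded = left_pad([[5, 6, 7], [8]], pad_id=0)
print(padded)  # [[5, 6, 7], [0, 0, 8]] — both prompts end at the final slot
```

With right padding, the shorter sequence would end in pad tokens and the model would generate after them, degrading the output.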
This model was contributed by [Molbap](https://huggingface.co/Molbap).

## Usage example

### Single image inference

Here's how to load the model and perform inference in half-precision (`torch.float16`):

```python
from transformers import MolmoForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
import requests

model = MolmoForConditionalGeneration.from_pretrained("allenai/Molmo-7B-D-hf", torch_dtype="float16", device_map="auto")
processor = AutoProcessor.from_pretrained("allenai/Molmo-7B-D-hf")

image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# autoregressively complete the prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```

## MolmoConfig

[[autodoc]] MolmoConfig

## MolmoTextConfig

[[autodoc]] MolmoTextConfig

## MolmoVisionConfig

[[autodoc]] MolmoVisionConfig

## MolmoPoolingConfig

[[autodoc]] MolmoPoolingConfig

## MolmoImageProcessor

[[autodoc]] MolmoImageProcessor

## MolmoProcessor

[[autodoc]] MolmoProcessor

## MolmoTextModel

[[autodoc]] MolmoTextModel
    - forward

## MolmoForCausalLM

[[autodoc]] MolmoForCausalLM
    - forward

## MolmoForConditionalGeneration

[[autodoc]] MolmoForConditionalGeneration
    - forward
@@ -162,6 +162,7 @@
    mobilenet_v2,
    mobilevit,
    mobilevitv2,
    molmo,
    moshi,
    mpnet,
    mpt,
@@ -79,6 +79,7 @@
    ("mctct", "MCTCTProcessor"),
    ("mgp-str", "MgpstrProcessor"),
    ("mllama", "MllamaProcessor"),
    ("molmo", "MolmoProcessor"),
    ("oneformer", "OneFormerProcessor"),
    ("owlv2", "Owlv2Processor"),
    ("owlvit", "OwlViTProcessor"),
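The mapping being extended here feeds a simple registry-dispatch pattern: a model-type string from the config selects which processor class to instantiate. A minimal sketch of that idea, with a hypothetical `resolve_processor` helper and a trimmed-down registry (not the real transformers internals):

```python
# Illustrative sketch only: model-type string -> processor class name,
# mirroring how an auto-class mapping resolves the right processor.
PROCESSOR_REGISTRY = {
    "mllama": "MllamaProcessor",
    "molmo": "MolmoProcessor",
    "oneformer": "OneFormerProcessor",
}

def resolve_processor(model_type):
    """Return the registered processor class name, or raise for unknown types."""
    try:
        return PROCESSOR_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type!r}")

print(resolve_processor("molmo"))  # MolmoProcessor
```

Adding one `("molmo", "MolmoProcessor")` entry is all that is needed for `AutoProcessor.from_pretrained` to find the new class.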
@@ -332,6 +333,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
    elif type(config) in PROCESSOR_MAPPING:
        return PROCESSOR_MAPPING[type(config)].from_pretrained(pretrained_model_name_or_path, **kwargs)

    print("BUT WHY", processor_class)

Review comment: 😉 to remove!

Reply: lol, some debugging struggles scars left

    # At this stage, there doesn't seem to be a `Processor` class available for this model, so let's try a
    # tokenizer.
    try:
@@ -0,0 +1,29 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_molmo import *
    from .image_processing_molmo import *
    from .modeling_molmo import *
    from .processing_molmo import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
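The `_LazyModule` trick above swaps the module object in `sys.modules` for one that defers the heavy submodule imports until an attribute is first accessed. A self-contained sketch of that pattern, assuming a simplified `LazyModule` class (not the real transformers implementation):

```python
# Illustrative sketch only: a module-like object whose attribute access
# triggers the real import, so importing the package stays cheap.
import importlib
import types


class LazyModule(types.ModuleType):
    def __init__(self, name, submodules):
        super().__init__(name)
        self._submodules = submodules  # attribute name -> real module path
        self._loaded = {}              # cache of already-imported modules

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if attr not in self._submodules:
            raise AttributeError(attr)
        if attr not in self._loaded:
            self._loaded[attr] = importlib.import_module(self._submodules[attr])
        return self._loaded[attr]


# Nothing is imported until `lazy.json_backend` is first touched.
lazy = LazyModule("demo", {"json_backend": "json"})
print(lazy.json_backend.dumps({"ok": True}))  # {"ok": true}
```

The real `_LazyModule` additionally reads the import structure from the source file via `define_import_structure`, but the deferral mechanism is the same.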