
SmolVLM-Instruct not following instructions #188

Closed
tmoroney opened this issue Jan 28, 2025 · 3 comments · Fixed by #191

tmoroney commented Jan 28, 2025

I am trying to get the new SmolVLM 256M and 500M models to output JSON using mlx-vlm, but the models are unable to follow simple instructions. In comparison, the Qwen2-VL-2B-Instruct-4bit model performs this task perfectly, as you can see below. I am wondering whether this is an issue with the current mlx-vlm implementation of the SmolVLM models, or whether these new SmolVLM models are just poor at instruction following. The reason I want to use the Smol models is that they claim to be more efficient than the Qwen2-VL models, but in my testing with mlx-vlm, the mlx-community/SmolVLM-256M-Instruct-4bit model uses twice as much memory as mlx-community/Qwen2-VL-2B-Instruct-4bit.

It also gives this warning during generation: Some kwargs in processor config are unused and will not have any effect: image_seq_len.

My code:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/SmolVLM-256M-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["frames/frame-01.jpg", "frames/frame-02.jpg", "frames/frame-03.jpg", "frames/frame-04.jpg", "frames/frame-05.jpg", "frames/frame-06.jpg"]
prompt = """
        Describe the scene - output json in the following format:
        {
                "scene": "concise scene description",
                "num-people": "number of people in the scene",
                "shot-angle": "possible labels: hight-angle, eye-level, low-angle, overhead, aerial, other",
                "shot-location": "possible labels: int, ext, other",
                "shot-motion": "possible labels: tilt, handheld, locked, pan, zoom, aerial",
                "shot-size": "possible labels: close-up, medium, wide, extreme-wide, extreme-close-up, other",
                "shot-subject": "possible labels: object, text, limb, human, face, location, animal, cartoon, other",
                "shot-type": "possible labels: insert, single, two-shot, three-shot, ots, group-shot, other"
        }
        Make sure to correctly identify wide shots.
        """

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False, max_tokens=1000)
print(output)
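Independently of the underlying model issue, generated text often arrives wrapped in extra prose, so a small post-processing step can make a pipeline like the one above more robust. This is a sketch, not part of mlx-vlm; `extract_json` is a hypothetical helper name. It pulls the first `{...}` span out of the generated text and tries to parse it:

```python
import json
import re

def extract_json(text):
    """Return the first JSON object found in model output, or None.

    Uses a greedy match from the first '{' to the last '}', which
    covers the common case of a single JSON object surrounded by prose.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

With this, `extract_json(output)` returns a dict when the model produced valid JSON (as Qwen2-VL does below) and `None` when it rambled instead (as SmolVLM does), so the failure mode is at least detectable.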

mlx-community/Qwen2-VL-2B-Instruct-4bit output:

{
    "scene": "A group of large, industrial-looking structures are suspended in the air, likely part of a transportation system.",
    "num-people": "0",
    "shot-angle": "low-angle",
    "shot-location": "ext",
    "shot-motion": "locked",
    "shot-size": "extreme-wide",
    "shot-subject": "other",
    "shot-type": "group-shot"
}

mlx-community/SmolVLM-256M-Instruct-4bit output:

 The image shows a blue train car with a white roof, white walls, and a blue roof. The train car has a white interior and white windows. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof.
Blaizzy (Owner) commented Jan 29, 2025

hey @tmoroney

Yes, I can confirm that the mlx-vlm implementation fails to produce structured outputs, while the transformers implementation works.

I traced it to do_image_splitting. When it is set to false, mlx-vlm matches the transformers response (see the image below).

@pcuenca how can we fix this? Idefics 2 had the same issue; in the past I had to set do_image_splitting to false there as well to fix OCR problems.

[Image: output comparison with do_image_splitting set to false]

Blaizzy (Owner) commented Jan 29, 2025

@pcuenca I found the solution ✅

A PR will be up soon.

tmoroney (Author) commented
Great, thanks for your help :)
