
SmolVLM-Instruct not following instructions #188

Closed
tmoroney opened this issue Jan 28, 2025 · 3 comments · Fixed by #191

tmoroney commented Jan 28, 2025

I am trying to get the new SmolVLM 256M and 500M models to output JSON using mlx-vlm, but the models are unable to follow simple instructions. In comparison, the Qwen2-VL-2B-Instruct-4bit model performs this task perfectly, as you can see below. I am wondering whether this is an issue with the current mlx-vlm implementation of the SmolVLM models, or whether these new SmolVLM models are just poor at instruction following. The reason I want to use the Smol models is that they claim to be more efficient than the Qwen2-VL models, but in my testing with mlx-vlm, the mlx-community/SmolVLM-256M-Instruct-4bit model uses twice as much memory as mlx-community/Qwen2-VL-2B-Instruct-4bit.

It also gives this warning during generation: Some kwargs in processor config are unused and will not have any effect: image_seq_len.

My code:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/SmolVLM-256M-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["frames/frame-01.jpg", "frames/frame-02.jpg", "frames/frame-03.jpg", "frames/frame-04.jpg", "frames/frame-05.jpg", "frames/frame-06.jpg"]
prompt = """
        Describe the scene - output json in the following format:
        {
                "scene": "concise scene description",
                "num-people": "number of people in the scene",
                "shot-angle": "possible labels: hight-angle, eye-level, low-angle, overhead, aerial, other",
                "shot-location": "possible labels: int, ext, other",
                "shot-motion": "possible labels: tilt, handheld, locked, pan, zoom, aerial",
                "shot-size": "possible labels: close-up, medium, wide, extreme-wide, extreme-close-up, other",
                "shot-subject": "possible labels: object, text, limb, human, face, location, animal, cartoon, other",
                "shot-type": "possible labels: insert, single, two-shot, three-shot, ots, group-shot, other"
        }
        Make sure to correctly identify wide shots.
        """

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False, max_tokens=1000)
print(output)
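Independently of the underlying model issue, generated text often arrives wrapped in extra prose, so a small post-processing step can make a pipeline like the one above more robust. This is a sketch, not part of mlx-vlm; `extract_json` is a hypothetical helper name. It pulls the first `{...}` span out of the generated text and tries to parse it:

```python
import json
import re

def extract_json(text):
    """Return the first JSON object found in model output, or None.

    Uses a greedy match from the first '{' to the last '}', which
    covers the common case of a single JSON object surrounded by prose.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

With this, `extract_json(output)` returns a dict when the model produced valid JSON (as Qwen2-VL does below) and `None` when it rambled instead (as SmolVLM does), so the failure mode is at least detectable.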

mlx-community/Qwen2-VL-2B-Instruct-4bit output:

{
    "scene": "A group of large, industrial-looking structures are suspended in the air, likely part of a transportation system.",
    "num-people": "0",
    "shot-angle": "low-angle",
    "shot-location": "ext",
    "shot-motion": "locked",
    "shot-size": "extreme-wide",
    "shot-subject": "other",
    "shot-type": "group-shot"
}

mlx-community/SmolVLM-256M-Instruct-4bit output:

 The image shows a blue train car with a white roof, white walls, and a blue roof. The train car has a white interior and white windows. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof.
Blaizzy (Owner) commented Jan 29, 2025

hey @tmoroney

Yes, I can confirm that the mlx-vlm implementation fails to produce structured outputs, while the transformers implementation works.

I traced it to do_image_splitting. When it is set to false, mlx-vlm matches the transformers response (see the image below).

@pcuenca how can we fix this? Idefics 2 had the same issue; in the past I had to set do_image_splitting to false there as well to fix OCR problems.

[Image: output comparison with do_image_splitting set to false]

Blaizzy (Owner) commented Jan 29, 2025

@pcuenca I found the solution ✅

A PR will be up soon.

tmoroney (Author) commented
Great, thanks for your help :)
