I am trying to get the new SmolVLM 256M and 500M models to output JSON using mlx-vlm, but the models are unable to follow simple instructions. In comparison, the Qwen2-VL-2B-Instruct-4bit model performs this task perfectly, as you can see below. I am wondering whether this is an issue with the current mlx-vlm implementation for the SmolVLM models, or whether these new SmolVLM models are just bad at instruction following. The reason I want to use the Smol models is that they claim to be more efficient than the Qwen2-VL models, but in my testing with mlx-vlm, the mlx-community/SmolVLM-256M-Instruct-4bit model uses 2x the memory of mlx-community/Qwen2-VL-2B-Instruct-4bit.
It also gives this warning during generation: Some kwargs in processor config are unused and will not have any effect: image_seq_len.
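For anyone who wants to reproduce the memory comparison, a sketch along these lines should work. The peak_memory_gb helper is just something I put together for this, and it assumes your mlx version exposes mx.metal.reset_peak_memory() and mx.metal.get_peak_memory(); the exact names may differ between mlx releases.

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

def peak_memory_gb(model_path, prompt, images, max_tokens=1000):
    # Reset the peak-memory counter, run one generation, and report the peak.
    # Assumes mx.metal exposes reset_peak_memory/get_peak_memory in this mlx version.
    mx.metal.reset_peak_memory()
    model, processor = load(model_path)
    config = load_config(model_path)
    formatted = apply_chat_template(processor, config, prompt, num_images=len(images))
    generate(model, processor, formatted, images, verbose=False, max_tokens=max_tokens)
    return mx.metal.get_peak_memory() / 1e9  # bytes -> GB

# Example: compare both models on the same prompt and frames
# print(peak_memory_gb("mlx-community/SmolVLM-256M-Instruct-4bit", prompt, images))
# print(peak_memory_gb("mlx-community/Qwen2-VL-2B-Instruct-4bit", prompt, images))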
My code:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model, processor, and config
model_path = "mlx-community/SmolVLM-256M-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Input frames
images = ["frames/frame-01.jpg", "frames/frame-02.jpg", "frames/frame-03.jpg", "frames/frame-04.jpg", "frames/frame-05.jpg", "frames/frame-06.jpg"]
prompt = """
Describe the scene - output json in the following format:
{
  "scene": "concise scene description",
  "num-people": "number of people in the scene",
  "shot-angle": "possible labels: high-angle, eye-level, low-angle, overhead, aerial, other",
  "shot-location": "possible labels: int, ext, other",
  "shot-motion": "possible labels: tilt, handheld, locked, pan, zoom, aerial",
  "shot-size": "possible labels: close-up, medium, wide, extreme-wide, extreme-close-up, other",
  "shot-subject": "possible labels: object, text, limb, human, face, location, animal, cartoon, other",
  "shot-type": "possible labels: insert, single, two-shot, three-shot, ots, group-shot, other"
}
Make sure to correctly identify wide shots.
"""
# Apply the model's chat template and generate up to 1000 tokens
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)
output = generate(model, processor, formatted_prompt, images, verbose=False, max_tokens=1000)
print(output)
mlx-community/Qwen2-VL-2B-Instruct-4bit output:
{
  "scene": "A group of large, industrial-looking structures are suspended in the air, likely part of a transportation system.",
  "num-people": "0",
  "shot-angle": "low-angle",
  "shot-location": "ext",
  "shot-motion": "locked",
  "shot-size": "extreme-wide",
  "shot-subject": "other",
  "shot-type": "group-shot"
}
mlx-community/SmolVLM-256M-Instruct-4bit output:
The image shows a blue train car with a white roof, white walls, and a blue roof. The train car has a white interior and white windows. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof. The train car has a white roof with a blue stripe. The train car has a white interior with a blue roof.
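For anyone reproducing this, a quick way to decide whether a reply counts as valid JSON is something like the sketch below. It is not part of mlx-vlm; extract_json is just a hypothetical helper, and the greedy regex extraction is only one simple approach.

import json
import re

def extract_json(text):
    # Grab the first {...} span in the reply and try to parse it.
    # Returns None if the model did not produce parseable JSON.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# 'output' is the string returned by generate() in the script above
result = extract_json(output)
print(result if result is not None else "no valid JSON in output")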