Support Vision (image) queries on OpenAI, Azure, Claude3, etc. #231
Replies: 5 comments
-
This depends on whether I can find a seamless way to integrate multi-modal interaction into the user experience. The main limitation is time: I can only afford about one weekend a month to work on gptel's non-janitorial tasks right now, i.e. adding features/design work that requires more thinking and experimentation. (The first feature in the queue right now is the function calling API, for which there's an open PR.) I'm also not familiar with the multi-modal APIs, perhaps you can help me with a couple of basic questions:
The link format is a UI issue. I don't know yet, but I'm guessing the standard Org/Markdown link formats should work fine.
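For illustration, the standard inline image link forms in the two markup languages look like this (the paths are hypothetical):

```
[[file:~/pictures/screenshot.png]]          Org file link (displayed inline by Org)
![screenshot](~/pictures/screenshot.png)    Markdown image link
```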
-
I might contribute a PR after I familiarize myself with the gptel code. There are a bunch of utility functions here for Gemini and Claude; they allow other image formats, and they probably also downscale images server-side.
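To make the request shape concrete: the OpenAI vision endpoint accepts images as base64 data URIs inside a content-parts message. The sketch below is purely illustrative (gptel itself is Emacs Lisp); the function names are hypothetical, and it leaves any client-side downscaling to the caller.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Read an image file and return a data URI of the kind the
    OpenAI vision API accepts.  The mime type is guessed from the
    file extension; large images would need downscaling by the
    caller if the server doesn't do it."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"unrecognized image type: {path}")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{data}"


def vision_message(prompt, image_path):
    """Build a single user message mixing text and one image, in the
    content-parts shape used by OpenAI-style vision requests."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_to_data_uri(image_path)}},
        ],
    }
```

Anthropic's Claude 3 API uses a different but analogous shape (a `"type": "image"` part with an inline base64 `source`), so a backend-dispatching client would need one such builder per provider.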
-
That's convenient, thanks for the explanation.
Thanks! But do note that image/video support is a low priority for gptel, behind better error reporting, UI improvements, completion (#206), function calling support (#209) and embeddings, although that last one will probably be a separate package as well.
-
feature-vision-preview.mp4
feature-vision-preview-2.mp4
I'll create a
-
I have added vision support to gptel. It's actually a little more general than vision support -- a lot of the changes are about specifying per-model capabilities, to pave the way for adding function calling, JSON output and image output (DALL-E etc.) uniformly.
There are two ways to use it.
-
I have an OpenAI vision request implemented in my fork of org-ai:
https://github.com/doctorguile/org-ai/blob/org/org-ai-block.el#L189
but currently org-ai only supports OpenAI
and I see that gptel now supports multiple backends including claude3.
For org-ai, because it's Org-focused, it's easy for me to just use an Org link for the image reference (and Org can also display the image inline, which is nice).
For gptel, any idea what input format is most appropriate for referencing an image?
I'm assuming karthik or users of gptel are interested in a vision/text multi-modal API.
Thanks
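If Org links were used as the input format, the client would need to pull file paths out of the prompt text before building the request. A minimal sketch of that extraction step, in illustrative Python rather than the Emacs Lisp either project actually uses (the regex covers only plain `[[file:...]]` links, with or without a description):

```python
import re

# Matches [[file:PATH]] and [[file:PATH][description]] Org links,
# capturing PATH.  Angle-bracket and bare links are not handled.
ORG_FILE_LINK = re.compile(r"\[\[file:([^\]\[]+?)\](?:\[[^\]]*\])?\]")


def org_image_paths(text, exts=(".png", ".jpg", ".jpeg", ".gif", ".webp")):
    """Return paths from Org file links whose extension looks like an
    image, in order of appearance."""
    return [path for path in ORG_FILE_LINK.findall(text)
            if path.lower().endswith(exts)]
```

A gptel-style client could then replace each matched link with a placeholder in the text part of the message and attach the referenced images as separate content parts.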