Support Vision (image) queries on OpenAI, Azure, Claude3, etc. #231
Replies: 5 comments
-
This depends on whether I can find a seamless way to integrate multi-modal interaction into the user experience. The main limitation is time: I can only afford about one weekend a month to work on gptel's non-janitorial tasks right now, i.e. adding features/design work that requires more thinking and experimentation. (The first feature in the queue right now is the function calling API, for which there's an open PR.) I'm also not familiar with the multi-modal APIs, perhaps you can help me with a couple of basic questions:
The link format is a UI issue. I don't know yet, but I'm guessing the standard Org/Markdown link formats should work fine.
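For illustration, the standard inline image link forms in the two markup languages look like this (the paths are hypothetical):

```
[[file:~/pictures/screenshot.png]]          Org file link (displayed inline by Org)
![screenshot](~/pictures/screenshot.png)    Markdown image link
```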
-
I might contribute a PR after I familiarize myself with the gptel code. There are a bunch of utility functions here for Gemini and Claude; they allow other image formats, and they probably also downscale images server-side.
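To make the request shape concrete: the OpenAI vision endpoint accepts images as base64 data URIs inside a content-parts message. The sketch below is purely illustrative (gptel itself is Emacs Lisp); the function names are hypothetical, and it leaves any client-side downscaling to the caller.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Read an image file and return a data URI of the kind the
    OpenAI vision API accepts.  The mime type is guessed from the
    file extension; large images would need downscaling by the
    caller if the server doesn't do it."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"unrecognized image type: {path}")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{data}"


def vision_message(prompt, image_path):
    """Build a single user message mixing text and one image, in the
    content-parts shape used by OpenAI-style vision requests."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": image_to_data_uri(image_path)}},
        ],
    }
```

Anthropic's Claude 3 API uses a different but analogous shape (a `"type": "image"` part with an inline base64 `source`), so a backend-dispatching client would need one such builder per provider.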
-
That's convenient, thanks for the explanation.
Thanks! But do note that image/video support is a low priority for gptel, behind better error reporting, UI improvements, completion (#206), function calling support (#209) and embeddings, although that last one will probably be a separate package as well.
-
feature-vision-preview.mp4
feature-vision-preview-2.mp4
I'll create a
-
I have added vision support to gptel. It's actually a little more general than vision support -- a lot of the changes are about specifying per-model capabilities, to pave the way for adding function calling, JSON output and image output (DALL-E etc.) uniformly.
There are two ways to use it.
-
I have an OpenAI vision request implemented in my fork of org-ai:
https://github.com/doctorguile/org-ai/blob/org/org-ai-block.el#L189
but currently org-ai only supports OpenAI
and I see that gptel now supports multiple backends including claude3.
For org-ai, because it's Org-focused, it's easy for me to just use an Org link for the image reference (and Org can also display the image inline, which is nice).
For gptel, any idea what input format is most appropriate for referencing an image?
I'm assuming karthik or users of gptel are interested in a vision/text multi-modal API.
Thanks
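If Org links were used as the input format, the client would need to pull file paths out of the prompt text before building the request. A minimal sketch of that extraction step, in illustrative Python rather than the Emacs Lisp either project actually uses (the regex covers only plain `[[file:...]]` links, with or without a description):

```python
import re

# Matches [[file:PATH]] and [[file:PATH][description]] Org links,
# capturing PATH.  Angle-bracket and bare links are not handled.
ORG_FILE_LINK = re.compile(r"\[\[file:([^\]\[]+?)\](?:\[[^\]]*\])?\]")


def org_image_paths(text, exts=(".png", ".jpg", ".jpeg", ".gif", ".webp")):
    """Return paths from Org file links whose extension looks like an
    image, in order of appearance."""
    return [path for path in ORG_FILE_LINK.findall(text)
            if path.lower().endswith(exts)]
```

A gptel-style client could then replace each matched link with a placeholder in the text part of the message and attach the referenced images as separate content parts.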