[Roadmap] Multimodal Orchestration #1975
Comments
Just to verify, the text analyzer won't exist, and this cost won't be paid, unless the image generation capability is actually added to the agent. |
IIRC, if you send a message that has content of images to an LLM that doesn't support image ingestion, the API request will fail (for OpenAI at least). So in the implementation, each agent could declare the modalities it supports:

```python
class Agent(Protocol):
    @property
    def modalities(self) -> List[str]:
        """The modalities the LLM can accept/understand etc.."""
        ...

    def add_modality(self, modal: str) -> None:
        ...
```

```python
class ConversableAgent(LLMAgent):
    def __init__(self, ...):
        ...
        self._modalities = ["text"]

    @property
    def modalities(self) -> List[str]:
        return self._modalities

    def add_modality(self, modal: str) -> None:
        self._modalities.append(modal)
```

However, this is a breaking change because the `ModelClient` protocol's `create` method would also need to receive the modalities:

```python
class ModelClient(Protocol):
    ...
    def create(self, params: Dict[str, Any], modalities: List[str]) -> ModelClientResponseProtocol:
        ...
```

Probably there are better ways out there, but this is what I thought about doing recently |
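A minimal sketch of how a client could use the declared modalities to avoid the failing API call mentioned above; the `strip_unsupported_content` helper and its placement are assumptions for illustration, not part of the proposal:

```python
from typing import Any, Dict, List

def strip_unsupported_content(params: Dict[str, Any], modalities: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: flatten image parts to a text placeholder when the
    target model does not declare the 'image' modality."""
    if "image" in modalities:
        return params
    cleaned_messages = []
    for msg in params.get("messages", []):
        content = msg.get("content")
        if isinstance(content, list):
            # Keep text parts; replace image parts with a simple placeholder.
            parts = [
                part.get("text", "") if part.get("type") == "text" else "<image>"
                for part in content
            ]
            msg = {**msg, "content": " ".join(parts)}
        cleaned_messages.append(msg)
    return {**params, "messages": cleaned_messages}
```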
@WaelKarkoub Thanks for your suggestions! Very helpful~ I am thinking about adding it to the Client, with some parameters in config_list. Let me try to create a PR and will tag you!!! |
Feel free to give any suggestions! @tomorrmato @deerleo @Josephrp @antoan @ViperVille007 @scortLi @ibishara |
Come on AutoGen, we can't let CrewAI or LangChain take over. Let's use chain of thought. We started this multi-agent ish and this is the thanks we get. Let's go full desktop, full futuristic GUI application, call it "AutoGen Studio X", created for advanced GenX users. Let's stop playing around at the backdoor and just drop through the roof with this multi-conversational, multimodal agent-to-agent invasion... Give the people what they want. "Slay the Sequence" |
see 0.4 architecture |
Tip
Want to get involved?
We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.
Integrating multimodal and language-only agents presents significant challenges, as few tools currently support seamless inter-agent communication. For example, when one agent generates an image (through DALLE or code-based figure creation), a separate GPT-3 agent may encounter difficulties interpreting the visual content, leading to errors or simplifications like converting images into an `<image>` tag. To ensure smooth operation, users must carefully design the agent workflow to avoid unexpected issues, such as the Plot-Critic scenarios in both the GPT-4V and DALLE notebooks. Hence, group chat, graph chat, nested chat, sequential chat, and many other pre-designed workflows do not work out-of-the-box for multimodal features.
Our goal is to enable seamless interaction between multimodal and text-only agents, making it possible to include any agent in a conversation regardless of whether its llm_config connects it to a multimodal model.
Currently, the MultimodalConversableAgent specifically processes input message content to interpret images and format messages before communicating with the client. However, this approach complicates orchestration with other agents. For instance, the GroupChatManager lacks visual processing capabilities and thus cannot properly distribute tasks, and a Coder agent cannot read images and therefore cannot write matplotlib code.
The problem becomes more severe if we enable image generation, audio, OCR, and video (in the future).
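To make the formatting gap concrete, here is a rough sketch of the two message representations involved; the `<img>` tag string follows AutoGen's multimodal convention, the content list follows the OpenAI vision API, and the variable names and URL are illustrative only:

```python
# String-style message, which a text-only agent (e.g., the GroupChatManager) sees as-is:
text_style_message = {
    "role": "user",
    "content": "Here is the chart: <img https://example.com/chart.png>. Please critique it.",
}

# OpenAI-style multimodal content list, which a vision-capable client expects:
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Here is the chart. Please critique it."},
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
    ],
}
```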
Common Issues [Example]
Issue with group chat: #1838
Fix for wrong handling of message: #2118
Things to consider:
Current issues and solutions
Issue #2142
Quick fix to resolve the issue: `generate_init_message` for Multimodal Messages #2124
Important Multimodal Updates
We suggest three major changes, categorized by their focus on accuracy and efficiency, followed by a few future additions for image generation, audio processing, and OCR capabilities.
Major Update 0: Message Format
Contributor: @WaelKarkoub
The `message` parameter includes an HTML tag for embedding images, audio, video, etc. While the input is complex and contains regular HTML content, users can easily disable this feature and use the OpenAI format for the message (a list of text/multimodal content). The `parse_tags_from_content` function will parse the input string and retrieve the multimodal content. Coding: #1881, needs more discussion.
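As a rough illustration of the tag-parsing idea (not the actual `parse_tags_from_content` implementation), a minimal parser could split a tagged string into OpenAI-style content parts; the `split_tags` name and the audio/video part format are assumptions:

```python
import re
from typing import Any, Dict, List

TAG_PATTERN = re.compile(r"<(img|audio|video)\s+([^>]+)>")

def split_tags(message: str) -> List[Dict[str, Any]]:
    """Split a message string into OpenAI-style multimodal content parts."""
    content: List[Dict[str, Any]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(message):
        text = message[cursor:match.start()].strip()
        if text:
            content.append({"type": "text", "text": text})
        tag, value = match.group(1), match.group(2).strip()
        if tag == "img":
            content.append({"type": "image_url", "image_url": {"url": value}})
        else:
            # Audio/video parts use a placeholder structure; the format is still under discussion.
            content.append({"type": tag, tag: {"url": value}})
        cursor = match.end()
    tail = message[cursor:].strip()
    if tail:
        content.append({"type": "text", "text": tail})
    return content

# Example: split_tags("Critique this plot <img https://example.com/plot.png> please.")
```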
Major Update 1: Full Vision Support inside Client
Users only need to set up the `config_list`; no other code change is needed. The support is implemented inside `class OpenAIClient`, which involves adapting messages to a multimodal format based on the configuration.
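For illustration, a `config_list` entry along these lines would point an agent at a vision-capable OpenAI model; the model name and key handling are examples, not a prescription:

```python
import os

# Example configuration for a vision-capable model; no other code change needed.
config_list = [
    {
        "model": "gpt-4-vision-preview",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]
llm_config = {"config_list": config_list, "temperature": 0.5}
```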
Major Update 2: Efficient Vision Capability
`process_last_received_message` is added to check if an image exists.
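A small sketch of how such a capability might register that check via an agent hook; it assumes ConversableAgent's `register_hook` with the `process_last_received_message` hook, and the capability class and captioning callable here are illustrative, not the existing VisionCapability:

```python
from autogen import ConversableAgent

class LightweightVisionCapability:
    """Illustrative capability: only invoke the (costly) image captioner when the
    last received message actually contains an image tag."""

    def __init__(self, caption_image):
        self._caption_image = caption_image  # callable: image URL -> caption text

    def add_to_agent(self, agent: ConversableAgent) -> None:
        agent.register_hook("process_last_received_message", self._process)

    def _process(self, message):
        if not isinstance(message, str) or "<img" not in message:
            return message  # no image present, so skip any vision-model call
        # A full implementation would replace each <img url> tag with
        # self._caption_image(url) so that text-only agents can read the result.
        return message
```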
Update 3: Image Generation Capability
Update 4: Audio Capabilities
For audio inputs, integrate speech-to-text within `process_last_received_message`. For outputs, integrate text-to-speech functionalities within `process_message_before_send`.
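A brief sketch of where those two hooks would sit; `speech_to_text` and `text_to_speech` are hypothetical helpers, and the hook signatures are simplified relative to AutoGen's hook API:

```python
def add_audio_hooks(agent, speech_to_text, text_to_speech):
    """Attach hypothetical audio handling to an agent via the two hooks above."""

    def transcribe(message):
        # Incoming: transcribe audio attachments before the agent reasons over them.
        return speech_to_text(message) if "<audio" in str(message) else message

    def speak(sender, message, recipient, silent):
        # Outgoing: synthesize speech for the reply while still returning text to the chat.
        text_to_speech(str(message))
        return message

    agent.register_hook("process_last_received_message", transcribe)
    agent.register_hook("process_message_before_send", speak)
```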
Update 5: OCR Capability (within VisionCapability)
Enable with `perform_OCR=True` or `perform_OCR='auto'`.
Update 6: Coordinate Detection Capability (within VisionCapability)
Enable with `coordinate_detection=my_detection_callable`.
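For illustration only, the two proposed options might look roughly like this once added; `perform_OCR`, `coordinate_detection`, and `my_detection_callable` come from this roadmap and are not part of the current VisionCapability signature:

```python
def my_detection_callable(image_path: str):
    """Hypothetical detector: return labeled bounding boxes found in the image."""
    return [{"label": "button", "box": (10, 20, 110, 60)}]

# Proposed option values, as sketched in Updates 5 and 6 above:
proposed_vision_options = {
    "perform_OCR": "auto",                          # or True / False
    "coordinate_detection": my_detection_callable,  # optional callable
}
```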
Additional context
@schauppi created a useful MM WebAgent for AutoGen.
Many other contributors also have great insights; please feel free to comment below.