
[Roadmap] Multimodal Orchestration #1975

Closed
BeibinLi opened this issue Mar 12, 2024 · 7 comments
Assignees
Labels
0.2 (Issues which are related to the pre-0.4 codebase) · needs-triage · roadmap (Issues related to roadmap of AutoGen)

Comments

@BeibinLi (Collaborator) commented Mar 12, 2024

Tip

Want to get involved?

We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.

Integrating multimodal and language-only agents presents significant challenges, as few tools currently support seamless inter-agent communication. For example, when one agent generates an image (through DALLE or code-based figure creation), a separate GPT-3 agent may encounter difficulties interpreting the visual content, leading to errors or simplifications like converting images into a <image> tag.

To ensure smooth operation, users must carefully design the agent workflow to avoid unexpected issues, such as the Plot-Critic scenarios in both the GPT-4V and DALLE notebooks. As a result, group chat, graph chat, nested chat, sequential chat, and many other pre-designed workflows do not work out-of-the-box with multimodal features.

Our goal is to enable seamless interaction between multimodal and text-only agents, making it possible to include them in conversations regardless of whether their llm_config connects them to a multimodal model.

Currently, the MultimodalConversableAgent specifically processes input message content to interpret images and format messages prior to client communication. However, this approach can complicate orchestration with other agents. For instance, the GroupChatManager lacks visual processing capabilities and thus cannot properly distribute tasks, and a Coder agent cannot read the image and therefore cannot write the corresponding matplotlib code.

The problem becomes more severe if we enable image generation, audio, OCR, and video (in the future).
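
For concreteness, the snippet below illustrates the gap: an OpenAI-style multimodal message stores content as a list of typed parts, while a text-only agent expects a plain string. This is illustrative only; the prompt and URL are placeholders.

# Illustrative only: an OpenAI-style multimodal message stores content as a
# list of typed parts, which a text-only agent cannot interpret directly.
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What trend does this plot show?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/plot.png"}},
    ],
}

# A text-only agent expects a plain string instead:
text_message = {"role": "user", "content": "What trend does this plot show?"}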

Common Issues [Example]

Issue with group chat: #1838
Fix for wrong handling of message: #2118

Things to consider:

  • What if a user sends an image to a non-vision agent
  • What if a user sends an image to a vision agent backed by an LLM-only model
  • How to manage memory (compressibility) with message transformations
  • How to save on API costs

Current issues and solutions

Issue #2142
Quick Fix to resolve the issue:


Important Multimodal Updates

We suggest three major changes, categorized by their focus on accuracy and efficiency. We then outline a few further additions for image generation, audio processing, and OCR capabilities.

Major Update 0: Message Format

Major Update 1: Full Vision Support inside Client

  • PR Support multimodal (GPT-4V) directly in OpenAIClient #2026
  • User Experience: Users simply include their multimodal API information in the config_list; no other code change is needed (see the sketch after this list).
  • Description: We aim to provide multimodal capabilities across all agents, allowing them to access visual content in conversation histories without data loss.
  • Implementation: Transfer the multimodal message-processing feature from the "MultimodalConversableAgent" to the OpenAIClient class. This involves adapting messages to a multimodal format based on the configuration.
  • Runtime Cost: $O(n) * Cost_{mm}$, where $n$ is the length of a conversation.
  • Limitation: As of now, GPT-4V's processing speed is significantly lower than GPT-4's, which hurts efficiency if multimodal processing is overused.
  • Legacy PR: [Outdated] Support multimodal (GPT-4V) directly in OpenAIClient #2013
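
As a rough sketch of the intended user experience (the model name and API key are placeholders, not part of the PR):

import autogen

# Point an agent at a vision-capable model purely through config_list.
config_list = [
    {
        "model": "gpt-4-vision-preview",  # placeholder model name
        "api_key": "sk-...",              # placeholder key
    }
]

assistant = autogen.AssistantAgent(
    name="vision_assistant",
    llm_config={"config_list": config_list, "temperature": 0.0},
)
# With this update, image-bearing messages in the conversation history are
# adapted to the multimodal format inside the client; no other code changes.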

Major Update 2: Efficient Vision Capability

  • PR Add vision capability #2025
  • User Experience: By integrating the "VisionCapability", agents can transcribe images into text captions, requiring only a single line of code per agent (see the sketch after this list).
  • Description: When the vision capability sees an image, it transcribes the image into a caption and answers questions related to the image.
  • Implementation: As in the PR, a hook on process_last_received_message is added to check whether an image exists.
  • Runtime Cost: $O(m) * Cost_{mm} + O(n) * Cost_{lm}$, where $n$ is the length of the conversation and $m$ is the number of images (aka, multimodal inputs).
  • Limitation: This approach may result in the loss of detailed image information.
  • Legacy PR: [Outdate] Add vision capability #1926
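
A minimal usage sketch, assuming the capability follows the add_to_agent pattern from PR #2025; config values are placeholders.

from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# A text-only coder agent gains image understanding via captioning.
coder = ConversableAgent(
    name="coder",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]},
)

vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]}
)
vision.add_to_agent(coder)  # the single line that enables the capability
# Incoming images are transcribed into captions via the
# process_last_received_message hook, so the coder's text-only LLM can use them.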

Why two different ways to enable the "vision" feature? Answer: We propose two distinct approaches for enabling "vision" to satisfy different requirements. [Update 1], offering comprehensive multimodal support, allows all agents within a workflow to use a multimodal client, ensuring no information loss but at a higher computational cost. [Update 2], focusing on efficient vision capability, transcribes images into text captions, enabling broader application with text-only models at reduced cost but with potential information loss. This dual strategy provides flexibility, allowing users to choose the optimal balance between accuracy and efficiency based on their specific needs and resources.

Update 3: Image Generation Capability

  • PR [Feature] Adds Image Generation Capability 2.0 #1907
  • User Experience: Agents can gain the ability to generate images with the "Image Generation Capability".
  • Description: This capability processes each received message and allows the agent to call DALLE or another image generation API to generate images.
  • Implementation: This capability invokes a text analyzer to decide whether an image should be generated; if yes, it rewrites the prompt and then calls the image generation API (see the sketch after this list).
  • Runtime Cost: $O(m) * (Cost_{mm} + Cost_{mm}) + 2 * O(n) * Cost_{lm}$, where $n$ is the length of the conversation and $m$ is the number of images generated. Note the $2 *$ factor because all messages go through the text analyzer.
  • Limitation: If an agent with the image generation capability enabled rarely generates images in its conversations, the API calls made by the text analyzer become wasted cost.
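
A sketch of how the capability from PR #1907 might be attached; class and parameter names follow the PR's direction and may differ in the merged version.

from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities import generate_images

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}
dalle_config = {"config_list": [{"model": "dall-e-3", "api_key": "sk-..."}]}

painter = ConversableAgent(name="painter", llm_config=llm_config)

image_gen = generate_images.ImageGeneration(
    image_generator=generate_images.DalleImageGenerator(llm_config=dalle_config)
)
image_gen.add_to_agent(painter)
# Each received message first passes through a text analyzer, which decides
# whether to rewrite the prompt and call the image-generation API.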

Update 4: Audio Capabilities

  • PR [Feature] Adds Audio Capability to Conversable Agents #2098
  • User Experience: Introducing "Audio Capability" allows agents to process and generate audio content, enhancing interaction with users.
  • Description: This feature adds capabilities for speech-to-text and text-to-speech conversion (e.g., with Whisper and TTS).
  • Runtime Cost: $O(n) * Cost_{mm}$
  • Implementation: For audio inputs, mark them with dedicated tags and invoke Whisper for transcription inside process_last_received_message. For outputs, integrate text-to-speech inside process_message_before_send (both hooks are sketched after this list).
  • Limitation: Connecting to the user's input and output devices (mic and speaker) can be hard and error-prone across different operating systems.
  • Related PRs: U/kinnym/multimodalwork #2090
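
A rough sketch of the two hook points named above; transcribe and synthesize are hypothetical stand-ins for Whisper and a TTS backend, and the hook signatures assume the current ConversableAgent hook API.

from autogen import ConversableAgent

def transcribe(audio_path: str) -> str:
    # Placeholder: call Whisper here and return the transcript.
    return f"[transcript of {audio_path}]"

def synthesize(text: str) -> None:
    # Placeholder: call a TTS model here and play the result.
    pass

def audio_to_text_hook(message):
    # Incoming: if the last received message points at an audio file,
    # replace it with its transcript before the LLM sees it.
    if isinstance(message, str) and message.endswith((".wav", ".mp3")):
        return transcribe(message)
    return message

def text_to_audio_hook(sender, message, recipient, silent):
    # Outgoing: speak the reply before sending, then pass it through unchanged.
    if isinstance(message, str):
        synthesize(message)
    return message

agent = ConversableAgent(name="audio_agent", llm_config=False)
agent.register_hook("process_last_received_message", audio_to_text_hook)
agent.register_hook("process_message_before_send", text_to_audio_hook)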

Update 5: OCR Capability (within VisionCapability)

  • User perspective: When creating the "Vision Capability" mentioned above, users can set the parameter perform_OCR=True or perform_OCR='auto'.
  • Description: GPT-4V, even as the state-of-the-art multimodal model, still struggles to recognize text in images. Adding this capability to invoke a Python package that extracts text could help LLMs or LMMs improve Q&A accuracy.
  • Implementation Details: Use tools from Adam's AgentBench and GAIA branch to perform OCR directly in the Vision Capability (a rough sketch follows this list).
  • Runtime Cost: 0 for API calls, $O(m)$ for OCR calls.
  • Limitation: Invoking OCR functions may cause latency or runtime errors on different operating systems.
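
A hypothetical sketch of what perform_OCR could do inside the Vision Capability; pytesseract is just one possible local backend, not a commitment of this roadmap.

from PIL import Image
import pytesseract

def ocr_image(image_path: str) -> str:
    # Local OCR pass: no API cost, O(m) calls for m images.
    return pytesseract.image_to_string(Image.open(image_path))

def caption_with_ocr(image_path: str, caption: str, perform_ocr: bool = True) -> str:
    # Append extracted text to the caption so text-only LLMs can answer
    # questions about text embedded in the image.
    if not perform_ocr:
        return caption
    extracted = ocr_image(image_path).strip()
    return f"{caption}\n[OCR text]: {extracted}" if extracted else caption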

Update 6: Coordinate Detection Capability (within VisionCapability)

  • User perspective: When creating the "Vision Capability" mentioned above, users can set the parameter coordinate_detection=my_detection_callable (interface sketched after this list).
  • Description: As pointed out by @afourney, @yadong-lu, and many contributors, GPT-4V cannot identify coordinates correctly. So, we provide an interface for users to plug in their own coordinate detection algorithm.
  • Implementation Details: Use a separate model or function to perform the coordinate detection.
  • Runtime Cost: 0 for API calls, $O(m)$ for detection calls.
  • Limitation: We do not provide an actual implementation, only a parameter for users to configure.
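
A hypothetical sketch of the coordinate_detection interface; the callable name, return structure, and sample values are illustrative only.

from typing import Dict, List, Tuple

def my_detection_callable(image_path: str) -> List[Dict[str, object]]:
    # Plug in any detector here (GPT-4V alone is unreliable for coordinates).
    # Return a list of labeled bounding boxes in pixel coordinates.
    box: Tuple[int, int, int, int] = (120, 340, 210, 372)  # illustrative values
    return [{"label": "submit button", "box": box}]

# Proposed usage (parameter name from this roadmap item):
# VisionCapability(lmm_config=..., coordinate_detection=my_detection_callable)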

Additional context

@schauppi created a useful MM WebAgent for AutoGen.

Many other contributors also have great insights, and please feel free to comment below.

@rickyloynd-microsoft (Contributor):

  • Limitation: if the agent does not generate images very often in its conversation, the API calls made in the text analyzer would be costly.

Just to verify, the text analyzer won't exist, and this cost won't be paid, unless the image generation capability is actually added to the agent.

@WaelKarkoub (Contributor):

Implementation: Transferring the multimodal message processing feature from the "MultimodalConversableAgent" to the class OpenAIWrapper. This involves adapting messages to a multimodal format based on the configuration.

IIRC, if you send a message that has content of images to an LLM that doesn't support image ingestion, the API request will fail (for OpenAI at least). So in the implementation, OpenAIWrapper must be aware of what modalities it can accept to generate the right message to send.

Each Agent could have a property that lists all the modalities it can accept, and it must be mutable (in case we add custom capabilities).

from typing import List, Protocol

# Sketch: extend the existing autogen Agent protocol and ConversableAgent class
# (LLMAgent refers to the existing autogen base protocol).
class Agent(Protocol):
    @property
    def modalities(self) -> List[str]:
        """The modalities the LLM can accept/understand, etc."""
        ...

    def add_modality(self, modal: str) -> None:
        ...


class ConversableAgent(LLMAgent):
    def __init__(self, *args, **kwargs):
        ...  # existing initialization
        self._modalities = ["text"]

    @property
    def modalities(self) -> List[str]:
        return self._modalities

    def add_modality(self, modal: str) -> None:
        self._modalities.append(modal)

However, this is a breaking change, because the ModelClient interface now has to accept the list of modalities as an input to the create method:

class ModelClient(Protocol):
    ...

    def create(self, params: Dict[str, Any], modalities: List[str]) -> ModelClientResponseProtocol:
        ...

Probably there are better ways out there, but this is what I thought about doing recently

@BeibinLi (Collaborator, Author):

@WaelKarkoub Thanks for your suggestions! Very helpful~

I am thinking about adding it to the Client, with some parameters in config_list. Let me try to create a PR and will tag you!!!

@BeibinLi (Collaborator, Author):

Feel free to give any suggestions!

@tomorrmato @deerleo @Josephrp @antoan @ViperVille007 @scortLi @ibishara

@BeibinLi (Collaborator, Author):

@awssenera

@BeibinLi BeibinLi changed the title [Feature Request]: Multimodal Orchestration [Roadmap]: Multimodal Orchestration Mar 14, 2024
@BeibinLi BeibinLi added the roadmap Issues related to roadmap of AutoGen label Mar 15, 2024
@krowflow:

Come on AutoGen, we can't let CrewAI or LangChain take over. Let's use chain of thought. We started this multi-agent ish and this is the thanks we get. Let's go full desktop, full futuristic GUI application, call it "Autogen Studio X", created for advanced GenX users. Let's stop playing around at the backdoor and just drop through the roof with this multi-conversational, multimodal agent-to-agent invasion... Give the people what they want. "Slay the Sequence"

@rysweet rysweet added the 0.2 (Issues which are related to the pre-0.4 codebase) and needs-triage labels Oct 2, 2024
@rysweet (Collaborator) commented Oct 18, 2024

see 0.4 architecture

@rysweet rysweet closed this as completed Oct 18, 2024