Planning as code - where does it end? #174
-
I think the answer is simply that we have to gauge things on a per-use basis. You already mentioned a web browser agent that needs multiple pauses before deciding what to do next, and there's a ready example in the vlm browser branch of this repo, where the LM is instructed to "Proceed in several steps rather than trying to do it all in one shot". It makes sense there, so the tool is molded to fit the circumstance. It seems to me that you're asking to put the constraint on the tool instead of on its application.
-
Hi Jeremy, thank you for your input! I was simply wondering whether people had the same questions as I did about the boundaries of a step, and whether some interesting ways of seeing things would emerge. As I understand it, your point of view would be to evaluate the agent with different settings for the "step prompting policy" and find the best one. Still, creating specialized tools (such as tools that parse a result string into a fixed format) can extend the length of a step, since the agent no longer needs the outer loop to extract the result of a previous step. I think that is more a question of who does what.
-
Hello everyone,
Something has been bothering me since I started hacking with smolagents, and I'm curious to hear the community's thoughts about it.
The smolagents approach of using code instead of JSON for planning enables better composability and multiple tool calls per step. This effectively blurs the line between planning and execution, as a single "step" can contain multiple tool calls within one code block.
Should we push this philosophy to its logical conclusion and have agents plan entire workflows as a single code block when possible (considering we have the tools to do so)?
For example:
Query: If the temperature in Paris is lower than the temperature in New York, give me 3 museums to visit in Paris, otherwise give me 3 parks in New York.
Using smolagents's default tooling:
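Roughly, such a run could look like the sketch below. This is an illustration only, not actual agent output: `web_search` is a hypothetical stub with made-up canned strings standing in for a generic search tool, and each function stands for the code block the agent would write in one step, with the outer loop feeding the observation back in between.

```python
import re

# Hypothetical stand-in for a generic search tool (canned, made-up results).
# It returns free text, so the result must be read before the plan can go on.
def web_search(query: str) -> str:
    canned = {
        "current temperature in Paris": "Temperature in Paris: 8 degrees Celsius",
        "current temperature in New York": "Temperature in New York: 12 degrees Celsius",
        "3 museums to visit in Paris": "Louvre, Orsay, Pompidou",
        "3 parks in New York": "Central Park, Prospect Park, The High Line",
    }
    return canned[query]

def step_1() -> str:
    # Step 1: fetch raw text and hand it back as the observation.
    paris = web_search("current temperature in Paris")
    new_york = web_search("current temperature in New York")
    return f"{paris}\n{new_york}"

def step_2(observation: str) -> str:
    # Step 2: only possible once the step 1 observation has been read.
    # The regex plays the role of the model reading the two numbers.
    paris_c, new_york_c = (
        float(value) for value in re.findall(r"(-?\d+(?:\.\d+)?) degrees", observation)
    )
    if paris_c < new_york_c:
        return web_search("3 museums to visit in Paris")
    return web_search("3 parks in New York")

observation = step_1()  # the outer loop passes this to the next step
answer = step_2(observation)
print(answer)
```

The branch cannot be written in step 1 because the temperatures only exist as text in the observation, so the comparison has to wait for a second step.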
But if we had enough tools:
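A minimal sketch of what that single block might look like. `get_temperature` and `find_places` are hypothetical specialized tools, stubbed here with made-up data so the snippet runs; they are not part of smolagents.

```python
# Hypothetical specialized tools returning typed values (stubbed, made-up data).
def get_temperature(city: str) -> float:
    return {"Paris": 8.0, "New York": 12.0}[city]

def find_places(city: str, kind: str, limit: int) -> list[str]:
    places = {
        ("Paris", "museum"): ["Louvre", "Orsay", "Pompidou", "Rodin"],
        ("New York", "park"): ["Central Park", "Prospect Park", "The High Line"],
    }
    return places[(city, kind)][:limit]

# The whole query planned as one code block: no intermediate observation
# is needed because every tool returns a value the code can use directly.
def plan() -> list[str]:
    if get_temperature("Paris") < get_temperature("New York"):
        return find_places("Paris", "museum", limit=3)
    return find_places("New York", "park", limit=3)

result = plan()
print(result)
```

Here the condition and both branches fit in one step, because the tools do the parsing that the outer loop would otherwise have to do.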
This raises several questions (at least for me):
How do we balance the benefits of code-based composition against the need for dynamic adaptation and error recovery? In other words, how do we define a step?
If we had enough tools to do anything, should there be a way to specify the max number of tool_calls made in a step?
What is the purpose of maintaining an outer agent loop if we move towards larger code blocks? Mainly error handling, recovery and memory, I suppose? (I get that some use cases, like a web browsing agent, would be difficult to plan as a one-shot workflow without recoding an agent itself.)
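To make the second question concrete, here is one way a per-step cap on tool calls could be sketched. `StepBudget`, `ToolCallBudgetExceeded` and the stubbed `get_temperature` are all hypothetical names for illustration; this is not an existing smolagents option.

```python
import functools

class ToolCallBudgetExceeded(RuntimeError):
    """Raised when a step tries to make more tool calls than allowed."""

class StepBudget:
    # Hypothetical helper: counts tool calls and is reset once per step.
    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self.calls = 0

    def reset(self) -> None:
        self.calls = 0

    def wrap(self, tool):
        @functools.wraps(tool)
        def limited(*args, **kwargs):
            if self.calls >= self.max_tool_calls:
                raise ToolCallBudgetExceeded(
                    f"more than {self.max_tool_calls} tool calls in one step"
                )
            self.calls += 1
            return tool(*args, **kwargs)
        return limited

# Stubbed tool with made-up data, only here to exercise the budget.
def get_temperature(city: str) -> float:
    return {"Paris": 8.0, "New York": 12.0}[city]

budget = StepBudget(max_tool_calls=2)
limited_temperature = budget.wrap(get_temperature)

budget.reset()  # an outer loop would call this at the start of each step
limited_temperature("Paris")
limited_temperature("New York")
try:
    limited_temperature("Paris")  # third call within the same step
except ToolCallBudgetExceeded as error:
    print(f"step aborted: {error}")
```

With a cap like this, the exception would surface as a step error, so the outer loop could ask the agent to split its plan into smaller steps.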
I hope this makes sense. I'm not sure what exactly is bothering me, but I feel there's some abstraction to be found in all this.