Planning as code - where does it end? #174

printfhere · 2025-01-13T14:01:43Z

Hello everyone,

Something has been bothering me since I started hacking with smolagents, and I'm curious to hear the community's thoughts about it.

The smolagents approach of using code instead of JSON for planning enables better composability and multiple tool calls per step. This effectively blurs the line between planning and execution steps, as a single "step" can contain multiple tool calls within a code block.

Should we push this philosophy to its logical conclusion and have agents plan entire workflows as a single code block when possible (considering we have the tools to do so)?

For example:

Query: If the temperature in Paris is lower than the temperature in New York, give me 3 museums to visit in Paris, otherwise give me 3 parks in New York.

Using smolagents' default tooling:

# Current sequential approach
web_search("weather in paris")
web_search("weather in new york")
# Wait for next planning step...

But if we had enough tools:

# Full workflow in one plan
nyc_weather = web_search("weather in new york")
paris_weather = web_search("weather in paris")
nyc_temp = parse_temperature(nyc_weather, output_format="float")
paris_temp = parse_temperature(paris_weather, output_format="float")
if paris_temp < nyc_temp:
    # Continue logic...
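
To make the idea concrete, here is a minimal runnable sketch of such a one-shot plan. Everything in it is an assumption for illustration: `web_search` is stubbed with canned strings, `parse_temperature` is a hypothetical tool, and `next_query` is just one possible way the branch could continue, not something smolagents provides.

```python
import re
from typing import Callable


# Hypothetical stub: canned strings stand in for a real search tool.
def web_search(query: str) -> str:
    canned = {
        "weather in new york": "New York: 9.5 C, clear",
        "weather in paris": "Paris: 4.0 C, cloudy",
    }
    return canned[query]


# Hypothetical parsing tool: pulls the first "<number> C" out of a string.
def parse_temperature(text: str, output_format: str = "float") -> float:
    if output_format != "float":
        raise ValueError("only 'float' is supported in this sketch")
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*C\b", text)
    if match is None:
        raise ValueError(f"no temperature found in {text!r}")
    return float(match.group(1))


# The whole workflow as one plan: search, parse, branch.
def next_query(search: Callable[[str], str] = web_search) -> str:
    nyc_temp = parse_temperature(search("weather in new york"))
    paris_temp = parse_temperature(search("weather in paris"))
    if paris_temp < nyc_temp:
        return "3 museums to visit in paris"
    return "3 parks in new york"
```

Note that the plan only works because the parsing tool raises on unexpected input; without an outer loop, that exception is the only signal that the plan's assumptions about the search output were wrong.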

This raises several questions (at least for me):

How do we balance the benefits of code-based composition against the need for dynamic adaptation and error recovery? In other words, how do we define a step?
If we had enough tools to do anything, should there be a way to specify the maximum number of tool calls made in a step?
What is the purpose of maintaining an outer agent loop if we move towards larger code blocks? Mainly error handling, recovery, and memory, I suppose? (I get that some use cases, like a web-browsing agent, would be difficult to plan as a one-shot workflow without re-coding an agent itself.)
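
As a rough illustration of the second question, a per-step cap on tool calls could be sketched as a wrapper around the tools. This is an assumption about how one might do it, not an existing smolagents feature; `StepBudget` and `ToolCallBudgetExceeded` are hypothetical names.

```python
from typing import Any, Callable


class ToolCallBudgetExceeded(RuntimeError):
    """Raised when a step tries to make more tool calls than allowed."""


class StepBudget:
    """Counts tool calls made through wrapped tools within one step."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.calls = 0

    def wrap(self, tool: Callable[..., Any]) -> Callable[..., Any]:
        def guarded(*args: Any, **kwargs: Any) -> Any:
            # Refuse the call before running the tool once the cap is hit.
            if self.calls >= self.max_calls:
                raise ToolCallBudgetExceeded(
                    f"step exceeded {self.max_calls} tool calls"
                )
            self.calls += 1
            return tool(*args, **kwargs)

        return guarded

    def reset(self) -> None:
        # Call this at the start of each new step.
        self.calls = 0
```

An outer loop could catch `ToolCallBudgetExceeded`, record what was computed so far, and ask for a new plan, which is one way of making "a step" an explicit, enforceable unit.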

I hope this makes sense. I'm not sure exactly what bothers me, but I feel there's some abstraction to be found in all this.

JeremyBickel · 2025-01-13T23:11:40Z

I think the answer is simply that we have to gauge things on a per-use basis. You already mentioned a web browser agent that needs multiple pauses before deciding what to do next, and there's a ready example from the vlm browser branch of this repo where the LM is instructed to "Proceed in several steps rather than trying to do it all in one shot". It makes sense there, so the tool is molded to fit the circumstance. It seems to me that you're asking to put the constraint on the tool instead of on its application.

printfhere · 2025-01-14T11:48:48Z

Hi Jeremy, thank you for your input!

I was simply wondering if people had the same questions as I did about the boundaries of a step and if some interesting ways of seeing things would emerge.

As I understand it, your point of view would be to evaluate the agent with different settings for the "step prompting policy" and find the best.

Still, creating specialized tools (such as a tool that parses a result string into a given format) can lengthen a step, since the agent no longer needs the outer loop to extract the result of a previous step. I think that is more of a "who does what" problem.

sunpazed Jan 16, 2025

Interesting question – I was recently asking myself the same thing. To test this, I wrote a small agent to answer the following question:

Which city is currently the coldest? New York, Glasgow, or Shanghai? Will I need an umbrella in this city in the next few days? Respond in natural language.

I wrote a custom Tool to fetch the current weather, the forecast, and the precipitation via a REST API. Here's a video of how the agent tackled the problem:

smolagents-weather.mp4

The agent fetched the weather results first, and then based on the shape of the data, extracted the additional data it needed in the next step. I believe this is an optimal planning strategy for the agent, unless the schema of the response is well known upfront.
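
The two-step pattern described above can be sketched roughly as follows. This is not the agent's actual code: `fetch_weather` is a hypothetical stub, and the payload shape and numbers are made up purely for illustration.

```python
import json

# Made-up payloads; a real tool would call a REST API instead.
_FAKE_API = {
    "New York": {"current": {"temp_c": 2.0},
                 "forecast": [{"precip_mm": 0.0}, {"precip_mm": 3.2}]},
    "Glasgow": {"current": {"temp_c": 5.0},
                "forecast": [{"precip_mm": 1.1}, {"precip_mm": 0.0}]},
    "Shanghai": {"current": {"temp_c": 8.0},
                 "forecast": [{"precip_mm": 0.0}, {"precip_mm": 0.0}]},
}


def fetch_weather(city: str) -> str:
    """Hypothetical tool: returns a JSON string of unknown shape."""
    return json.dumps(_FAKE_API[city])


def step_one(cities):
    """Step 1: fetch everything and only report the top-level keys seen."""
    raw = {city: json.loads(fetch_weather(city)) for city in cities}
    observed_keys = sorted(next(iter(raw.values())).keys())
    return raw, observed_keys


def step_two(raw):
    """Step 2: written only after the keys from step 1 are known."""
    coldest = min(raw, key=lambda city: raw[city]["current"]["temp_c"])
    needs_umbrella = any(
        day["precip_mm"] > 0 for day in raw[coldest]["forecast"]
    )
    return coldest, needs_umbrella
```

The point of the split is that `step_two` hard-codes key names such as `current` and `forecast`, which the agent can only know after observing the output of `step_one`.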

I've also seen this approach with a text-to-SQL agent I wrote, even when the table schema and description are well defined. For very complex questions, the agent will generate multiple smaller SQL queries, and then coalesce the results of previous steps in its final answer.
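
A toy version of that "several small queries, then coalesce" behaviour, using Python's built-in `sqlite3` module. The `orders` table and its rows are invented for the example and have nothing to do with the agent's real schema.

```python
import sqlite3

# In-memory database with an invented table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount)
    VALUES ('ada', 10.0), ('ada', 30.0), ('bob', 5.0);
    """
)

# Step 1: a small query to find the top-spending customer.
top_customer, total = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC LIMIT 1"
).fetchone()

# Step 2: a follow-up query that depends on the result of step 1.
order_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer = ?", (top_customer,)
).fetchone()[0]

# Final step: coalesce the earlier results into one answer.
answer = f"{top_customer} spent {total:.2f} across {order_count} orders"
conn.close()
```

The same answer could be produced by a single larger query, so the trade-off is between one query that is harder to get right and several smaller ones whose intermediate results can be inspected between steps.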
