We evaluated `freeact` using five state-of-the-art models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
The evaluation was performed on two benchmark datasets: `m-ric/agents_medium_benchmark_2` and the MATH subset of `m-ric/smol_agents_benchmark`. Both datasets were created by the smolagents team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH (a sketch for loading them follows the results table below):
model | GAIA (exact_match) | GSM8K (exact_match) | MATH (exact_match) | SimpleQA (exact_match) | SimpleQA (llm_as_judge) |
---|---|---|---|---|---|
claude-3-5-sonnet-20241022 | 53.1 | 95.7 | 90.0 | 57.5 | 72.5 |
claude-3-5-haiku-20241022 | 31.2 | 90.0 | 76.0 | 52.5 | 70.0 |
gemini-2.0-flash-exp | 34.4 | 95.7 | 88.0 | 50.0 | 65.0 |
qwen2p5-coder-32b-instruct | 25.0 | 95.7 | 88.0 | 52.5 | 65.0 |
deepseek-v3 | 37.5 | 91.4 | 88.0 | 60.0 | 67.5 |
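The two benchmark datasets can be inspected directly with the 🤗 `datasets` library. The following is a minimal sketch; the split name and the column used to select the MATH subset are assumptions, not part of the freeact evaluation code:

```python
from datasets import load_dataset

# Load the benchmark tasks from the Hugging Face Hub.
# NOTE: the split name "test" is an assumption; adjust to the dataset's actual split.
medium_benchmark = load_dataset("m-ric/agents_medium_benchmark_2", split="test")
smol_benchmark = load_dataset("m-ric/smol_agents_benchmark", split="test")

# Keep only the MATH tasks, assuming a column identifies the source benchmark.
math_subset = smol_benchmark.filter(lambda row: row["source"] == "MATH")

print(medium_benchmark)
print(math_subset)
```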
When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on `m-ric/agents_medium_benchmark_2` (the only dataset with available smolagents reference data), we observed the following outcomes (evaluation conducted on 2025-01-07):
agent | model | prompt | GAIA | GSM8K | SimpleQA |
---|---|---|---|---|---|
freeact | claude-3-5-sonnet-20241022 | zero-shot | 53.1 | 95.7 | 57.5 |
smolagents | claude-3-5-sonnet-20241022 | few-shot | 43.8 | 91.4 | 47.5 |
Interestingly, these results were achieved with zero-shot prompting in `freeact`, while the smolagents implementation uses few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools (converted to skills).
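For illustration only, the difference between the two prompting styles can be sketched as follows; neither string reflects the actual prompts used by freeact or smolagents:

```python
# Purely schematic: real prompts in both frameworks are more elaborate.
ZERO_SHOT_INSTRUCTIONS = (
    "Solve the task by writing and executing Python code "
    "that uses the available skills. Return the final answer."
)

# Few-shot prompting additionally includes worked examples in the prompt.
FEW_SHOT_INSTRUCTIONS = ZERO_SHOT_INSTRUCTIONS + (
    "\n\nExample\n"
    "Task: What is 2 + 3?\n"
    "Code: result = 2 + 3\n"
    "Answer: 5"
)
```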
Clone the `freeact` repository:

```bash
git clone https://github.com/freeact/freeact.git
```
Set up the development environment following DEVELOPMENT.md, but use this installation command:

```bash
poetry install --with eval
```
Create a `.env` file with Anthropic, Gemini, Fireworks AI, SerpAPI, and OpenAI API keys:

```
# Claude 3.5 Sonnet and Haiku
ANTHROPIC_API_KEY=...
# Gemini 2 Flash Experimental
GOOGLE_API_KEY=...
# Qwen 2.5 Coder 32B Instruct and DeepSeek V3
FIREWORKS_API_KEY=...
# Google Web Search
SERPAPI_API_KEY=...
# GPT-4 Judge (SimpleQA evaluation)
OPENAI_API_KEY=...
```
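Before starting a run, it can help to verify that all keys are picked up. A small check using `python-dotenv` (an assumption; the evaluation script may load the `.env` file through its own mechanism):

```python
import os

from dotenv import load_dotenv

# Load variables from the .env file in the current working directory.
load_dotenv()

required = [
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "FIREWORKS_API_KEY",
    "SERPAPI_API_KEY",
    "OPENAI_API_KEY",
]

missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing API keys in .env: {', '.join(missing)}")
print("All required API keys are set.")
```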
Then run the evaluation script for each model:

```bash
python evaluation/evaluate.py \
  --model-name claude-3-5-sonnet-20241022 \
  --run-id claude-3-5-sonnet-20241022

python evaluation/evaluate.py \
  --model-name claude-3-5-haiku-20241022 \
  --run-id claude-3-5-haiku-20241022

python evaluation/evaluate.py \
  --model-name gemini-2.0-flash-exp \
  --run-id gemini-2.0-flash-exp

python evaluation/evaluate.py \
  --model-name qwen2p5-coder-32b-instruct \
  --run-id qwen2p5-coder-32b-instruct

python evaluation/evaluate.py \
  --model-name deepseek-v3 \
  --run-id deepseek-v3
```
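Since each invocation differs only in the model name, the five runs can also be scripted, for example from Python (equivalent to the commands above):

```python
import subprocess

MODELS = [
    "claude-3-5-sonnet-20241022",
    "claude-3-5-haiku-20241022",
    "gemini-2.0-flash-exp",
    "qwen2p5-coder-32b-instruct",
    "deepseek-v3",
]

# Run evaluation/evaluate.py once per model, using the model name as run id.
for model in MODELS:
    subprocess.run(
        [
            "python", "evaluation/evaluate.py",
            "--model-name", model,
            "--run-id", model,
        ],
        check=True,
    )
```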
Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available here.
Score the results:

```bash
python evaluation/score.py \
  --evaluation-dir output/evaluation/claude-3-5-sonnet-20241022 \
  --evaluation-dir output/evaluation/claude-3-5-haiku-20241022 \
  --evaluation-dir output/evaluation/gemini-2.0-flash-exp \
  --evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct \
  --evaluation-dir output/evaluation/deepseek-v3
```
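The scored results can then be inspected programmatically, for example with pandas. This is a sketch under the assumption that `score.py` writes a CSV per run directory; check the actual output files it produces:

```python
from pathlib import Path

import pandas as pd

# NOTE: the file name "scores.csv" is an assumption; adjust to the actual
# artifact written by evaluation/score.py.
scores_path = Path("output/evaluation/claude-3-5-sonnet-20241022/scores.csv")
scores = pd.read_csv(scores_path)
print(scores.head())
```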
Generate visualizations and reports:

```bash
python evaluation/report.py performance

python evaluation/report.py performance-comparison \
  --model-name claude-3-5-sonnet-20241022 \
  --reference-results-file evaluation/reference/agents_medium_benchmark_2/smolagents-20250107.csv

python evaluation/report.py performance-comparison \
  --model-name qwen2p5-coder-32b-instruct \
  --reference-results-file evaluation/reference/agents_medium_benchmark_2/smolagents-20250107.csv
```
Plots are saved to `output/evaluation-report`.