Evaluation

We evaluated freeact using five state-of-the-art models:

  • Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)
  • Claude 3.5 Haiku (claude-3-5-haiku-20241022)
  • Gemini 2.0 Flash (gemini-2.0-flash-exp)
  • Qwen 2.5 Coder 32B Instruct (qwen2p5-coder-32b-instruct)
  • DeepSeek V3 (deepseek-v3)

The evaluation was performed using two benchmark datasets: m-ric/agents_medium_benchmark_2 and the MATH subset of m-ric/smol_agents_benchmark. Both datasets were created by the smolagents team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH.
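
To inspect the underlying tasks, the benchmark datasets can be loaded with the Hugging Face datasets library. This is a minimal sketch and not part of the evaluation scripts; it assumes the datasets package is installed and that the dataset loads with its default configuration:

python -c "
from datasets import load_dataset

# download and print the benchmark created by the smolagents team
print(load_dataset('m-ric/agents_medium_benchmark_2'))
"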

Performance

| Model | GAIA (exact_match) | GSM8K (exact_match) | MATH (exact_match) | SimpleQA (exact_match) | SimpleQA (llm_as_judge) |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20241022 | 53.1 | 95.7 | 90.0 | 57.5 | 72.5 |
| claude-3-5-haiku-20241022 | 31.2 | 90.0 | 76.0 | 52.5 | 70.0 |
| gemini-2.0-flash-exp | 34.4 | 95.7 | 88.0 | 50.0 | 65.0 |
| qwen2p5-coder-32b-instruct | 25.0 | 95.7 | 88.0 | 52.5 | 65.0 |
| deepseek-v3 | 37.5 | 91.4 | 88.0 | 60.0 | 67.5 |

Comparing our results with smolagents using claude-3-5-sonnet-20241022 on m-ric/agents_medium_benchmark_2 (the only dataset for which smolagents reference data is available), we observed the following outcomes (evaluation conducted on 2025-01-07):

Performance comparison

| Agent | Model | Prompt | GAIA | GSM8K | SimpleQA |
|---|---|---|---|---|---|
| freeact | claude-3-5-sonnet-20241022 | zero-shot | 53.1 | 95.7 | 57.5 |
| smolagents | claude-3-5-sonnet-20241022 | few-shot | 43.8 | 91.4 | 47.5 |

Notably, freeact achieved these results with zero-shot prompting, whereas the smolagents implementation uses few-shot prompting. To ensure a fair comparison, we used identical evaluation protocols and the same tools (converted to skills).

Running

Clone the freeact repository:

git clone https://github.com/freeact/freeact.git

Set up the development environment following DEVELOPMENT.md, but use this installation command:

poetry install --with eval
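
In condensed form, the full setup looks like this (assuming Poetry is already installed, as described in DEVELOPMENT.md):

git clone https://github.com/freeact/freeact.git
cd freeact
poetry install --with eval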

Create a .env file with Anthropic, Gemini, Fireworks AI, SerpAPI, and OpenAI API keys:

# Claude 3.5 Sonnet and Haiku
ANTHROPIC_API_KEY=...

# Gemini 2.0 Flash Experimental
GOOGLE_API_KEY=...

# Qwen 2.5 Coder 32B Instruct and DeepSeek V3
FIREWORKS_API_KEY=...

# Google Web Search
SERPAPI_API_KEY=...

# GPT-4 Judge (SimpleQA evaluation)
OPENAI_API_KEY=...
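
Before starting a run, you can optionally verify that all keys are present with a small bash snippet (a sketch that assumes the .env file contains plain KEY=value lines):

# export the variables from .env into the current shell, then report missing keys
set -a; source .env; set +a
for key in ANTHROPIC_API_KEY GOOGLE_API_KEY FIREWORKS_API_KEY SERPAPI_API_KEY OPENAI_API_KEY; do
  [ -n "${!key}" ] || echo "missing: $key"
done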

Then run the evaluation script for each model:

python evaluation/evaluate.py \
    --model-name claude-3-5-sonnet-20241022 \
    --run-id claude-3-5-sonnet-20241022

python evaluation/evaluate.py \
    --model-name claude-3-5-haiku-20241022 \
    --run-id claude-3-5-haiku-20241022

python evaluation/evaluate.py \
    --model-name gemini-2.0-flash-exp \
    --run-id gemini-2.0-flash-exp

python evaluation/evaluate.py \
    --model-name qwen2p5-coder-32b-instruct \
    --run-id qwen2p5-coder-32b-instruct

python evaluation/evaluate.py \
    --model-name deepseek-v3 \
    --run-id deepseek-v3
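
Equivalently, all five runs can be started sequentially with a single loop (the run id is set to the model name, matching the commands above):

for model in \
    claude-3-5-sonnet-20241022 \
    claude-3-5-haiku-20241022 \
    gemini-2.0-flash-exp \
    qwen2p5-coder-32b-instruct \
    deepseek-v3; do
  python evaluation/evaluate.py --model-name "$model" --run-id "$model"
done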

Results are saved in output/evaluation/<run-id>. Pre-generated outputs from our runs are available here.

Analysis

Score the results:

python evaluation/score.py \
  --evaluation-dir output/evaluation/claude-3-5-sonnet-20241022 \
  --evaluation-dir output/evaluation/claude-3-5-haiku-20241022 \
  --evaluation-dir output/evaluation/gemini-2.0-flash-exp \
  --evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct \
  --evaluation-dir output/evaluation/deepseek-v3
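
The same invocation can be generated from the run directories themselves (assuming output/evaluation contains only the five runs listed above):

# build one --evaluation-dir flag per run directory
python evaluation/score.py \
  $(printf -- '--evaluation-dir %s ' output/evaluation/*)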

Generate visualizations and reports:

python evaluation/report.py performance

python evaluation/report.py performance-comparison \
  --model-name claude-3-5-sonnet-20241022 \
  --reference-results-file evaluation/reference/agents_medium_benchmark_2/smolagents-20250107.csv

python evaluation/report.py performance-comparison \
  --model-name qwen2p5-coder-32b-instruct \
  --reference-results-file evaluation/reference/agents_medium_benchmark_2/smolagents-20250107.csv

Plots are saved to output/evaluation-report.