Merge pull request #168 from AlibabaResearch/flowbench
Flowbench
Showing 14 changed files with 1,765 additions and 0 deletions.
@@ -0,0 +1,112 @@
<div align="center">
<h1 align="center"> 🌊 FlowBench 🌊</h1>
<b>FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents</b>

<p align="center"><font size=6>📃</font> <a target="_self" href="https://arxiv.org/abs/2406.14884"> <img style="height:14pt" src="https://img.shields.io/badge/-Paper-red?style=flat&logo=arxiv"></a> <font size=6>•</font> <font size=6>🔔</font> <a target="_self" href="https://github.com/Justherozen/FlowBench"> <img style="height:14pt" src="https://img.shields.io/badge/-Code-pink?style=flat&logo=github"></a></p>
</div>
## Overview
This repository contains the source data and code for our EMNLP 2024 paper [FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents](https://arxiv.org/abs/2406.14884). We propose FlowBench, a comprehensive benchmark for workflow-guided agent planning. We first revisit and formalize the different workflow knowledge formats used for agent planning. FlowBench covers an extensive taxonomy (6 domains, 22 roles, 51 scenarios) and different knowledge formats (text, code, flowchart) to align comprehensively with real-world applications. The benchmark data is constructed through a three-phase pipeline of task collection, workflow organization, and session generation. FlowBench features several distinctive characteristics, including broad coverage, difficulty, expert-level annotation, and support for multi-round user-agent interaction. Through extensive experiments on FlowBench, we find that even the best-performing model, GPT-4o, fails to deliver satisfactory results on this challenging benchmark. We hope our work provides meaningful insights for future research on workflow-guided agent planning. An overview of FlowBench is shown below:

![overview of flowbench](./resources/flowbench.png)

> *Please find more details of this work in our paper.*

### Dataset Introduction

Download `turn_data.zip` and `session_data.zip` from [Google Drive](https://drive.google.com/drive/folders/1PFzA5e-fuKpVZvAHP-otBhWPdU60O3d4?usp=sharing). After extracting them, you will get two folders, `turn_data` and `session_data`; move both into the `data` directory. These folders contain the benchmark data at the session level and the turn level. All workflow knowledge, in its different formats, is organized in `knowledge.json`.
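
After moving the folders, the `data` directory should look roughly like the sketch below (an illustration only; the files inside the two folders and the exact location of `knowledge.json` follow the released archives):

```
data/
├── turn_data/      # turn-level benchmark data
└── session_data/   # session-level benchmark data
```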
### Evaluating workflow-guided agent planning

##### Dependencies

To install requirements:

    pip install -r requirements.txt

##### API preparation

Set up your OpenAI API key in `./utils/keys.json`:

```
api_key: "Your OPENAI key"
```
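
For reference, a minimal sketch of how such a key could be consumed, assuming `keys.json` holds a JSON object with an `api_key` field and that the `openai` v1 client is used (the repository's own loading code under `./utils/` may differ):

```
import json
from openai import OpenAI

# Assumed layout: ./utils/keys.json is a JSON object with an "api_key" field.
with open("./utils/keys.json") as f:
    keys = json.load(f)

# Pass the key explicitly to the client instead of relying on environment variables.
client = OpenAI(api_key=keys["api_key"])
```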

After that, you can conduct the turn-level and session-level evaluations.

##### Turn-level evaluation

- To generate the single-turn predictions for different test samples, please run

```
python ./turn_level/turn_inference.py --input_path INPUT_FOLDER --output_path OUTPUT_FOLDER
```

- Then you can calculate and display the evaluation metrics with the following command, where `OUTPUT_FOLDER` is the output path of the previous generation step.

```
python ./turn_level/turn_metric_display.py --output_path OUTPUT_FOLDER
```

##### Session-level evaluation

- To simulate the predicted sessions, use the following command in simulate mode, where `INPUT_PATH`, `OUTPUT_PATH`, and `EVAL_PATH` indicate the paths for the test input, the generated simulations, and the simulation evaluation results, respectively.

```
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
```

- After session simulation, you can calculate and save the evaluation metrics using the eval mode as follows.

```
python ./session_level/session_simulate.py --mode eval --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
```

- Finally, you can display the evaluation metrics for each scenario and optionally save them to an Excel file.

```
python ./session_level/session_metric_display.py --eval_path EVAL_PATH
```

You can specify the LLM used for generation, the LLM used as a judge, and the LLM used for environment simulation from the command line.
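
As an illustration only, such an invocation might look like the sketch below; the model-selection flag names (`--agent_model`, `--judge_model`, `--env_model`) are hypothetical placeholders, so please check the argparse definitions in `session_simulate.py` for the actual flags:

```
# The flags after --eval_path are hypothetical placeholders for the model-selection options.
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH --agent_model gpt-4o --judge_model gpt-4o --env_model gpt-4o
```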

##### Future plans

Apart from the scenarios presented in the paper, we will incorporate additional scenarios. We will also keep refining our benchmark quality and evaluation framework as part of our future initiatives!

### Citation

If you use or extend our work, please cite the paper as follows:
```
@misc{xiao2024flowbenchrevisitingbenchmarkingworkflowguided,
      title={FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents},
      author={Ruixuan Xiao and Wentao Ma and Ke Wang and Yuchuan Wu and Junbo Zhao and Haobo Wang and Fei Huang and Yongbin Li},
      year={2024},
      eprint={2406.14884},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.14884},
}
```
@@ -0,0 +1,6 @@
regex
pandas
numpy
openai
jsonlines
xlsxwriter
@@ -0,0 +1,6 @@
# To simulate the predicted sessions, use the following command.
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH
# After session simulation, you can calculate and save the evaluation metrics as follows.
python ./session_level/session_simulate.py --mode eval --output_path OUTPUT_PATH --eval_path EVAL_PATH
# Finally, you can display the evaluation metrics for each scenario and optionally save them to an Excel file.
python ./session_level/session_metric_display.py --eval_path EVAL_PATH
@@ -0,0 +1,4 @@
# To generate the single-turn predictions for different test samples, please run
python ./turn_level/turn_inference.py --input_path INPUT_FOLDER --output_path OUTPUT_FOLDER
# Then you can calculate and display the evaluation metrics with the following command.
python ./turn_level/turn_metric_display.py --input_path OUTPUT_FOLDER
@@ -0,0 +1,84 @@
import os
import jsonlines
import pandas as pd
import argparse


def compute_session_metrics(input_directory, output_excel=''):
    final_progress = []
    final_all_session = 0
    final_right_session = 0
    final_right_api_num = 0
    final_all_api_num_gt = 0
    final_all_api_num_pre = 0
    if output_excel:
        excel_path = pd.ExcelWriter(output_excel, engine='xlsxwriter')
    jsonl_files = [f for f in os.listdir(input_directory) if f.endswith('.jsonl')]

    for file_name in jsonl_files:
        file_path = os.path.join(input_directory, file_name)
        gpt_success = []
        gpt_progress = []
        api_num_right = []
        api_num_all_gt = []
        api_num_all_pre = []

        data = []
        with jsonlines.open(file_path) as reader:
            for obj in reader:
                data.append(obj)
                gpt_success.append(int(obj.get('success_gpt')))
                gpt_progress.append(float(obj.get('progress_gpt')))
                api_num_right.append(obj.get('right_api_num'))
                api_num_all_gt.append(obj.get('all_api_num_gt'))
                api_num_all_pre.append(obj.get('all_api_num_pre'))

        # Per-scenario metrics: session success rate, average progress, and tool precision/recall.
        tmp_output = {
            "scenarios": file_name,
            "success_rate": sum(gpt_success) / len(gpt_success) if gpt_success else 0,
            "avg_progress": sum(gpt_progress) / len(gpt_progress) if gpt_progress else 0,
            "tool_precision": sum(api_num_right) / sum(api_num_all_pre) if api_num_all_pre else 0,
            "tool_recall": sum(api_num_right) / sum(api_num_all_gt) if api_num_all_gt else 0,
        }

        print(tmp_output)
        # Accumulate counts for the overall metrics across all scenarios.
        final_progress.extend(gpt_progress)
        final_all_session += len(gpt_success)
        final_right_session += sum(gpt_success)

        final_right_api_num += sum(api_num_right)
        final_all_api_num_gt += sum(api_num_all_gt)
        final_all_api_num_pre += sum(api_num_all_pre)
        if output_excel:
            df = pd.DataFrame(data)
            # Excel sheet names are limited to 31 characters.
            df.to_excel(excel_path, sheet_name=file_name.split('.')[0][:31], index=False)

    final_gpt_success = final_right_session / final_all_session if final_all_session > 0 else 0
    final_gpt_progress = sum(final_progress) / len(final_progress) if final_progress else 0
    final_api_prec = final_right_api_num / final_all_api_num_pre if final_all_api_num_pre > 0 else 0
    final_api_recall = final_right_api_num / final_all_api_num_gt if final_all_api_num_gt > 0 else 0

    final_tmp_output = {
        "scenarios": "All",
        "success_rate": final_gpt_success,
        "avg_progress": final_gpt_progress,
        "tool_precision": final_api_prec,
        "tool_recall": final_api_recall
    }
    print("--------------")
    print(final_tmp_output)
    print(final_all_session)
    if output_excel:
        df_final = pd.DataFrame([final_tmp_output])
        df_final.to_excel(excel_path, sheet_name='Overall Metrics', index=False)
        excel_path.close()  # close() saves the workbook (public replacement for the private _save())


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process paths and modes.")

    # Add arguments
    parser.add_argument("--output_excel", help="Path to the output Excel file")
    parser.add_argument("--eval_path", required=True, help="Path to the input directory for metric display")

    # Parse arguments
    args = parser.parse_args()
    compute_session_metrics(args.eval_path, args.output_excel)
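
For reference, a typical invocation of the metric-display script above, using the flags defined in its argparse, together with the per-line fields it reads from each evaluation `.jsonl` file (field values below are illustrative):

```
# Display per-scenario and overall metrics; --output_excel is optional.
python ./session_level/session_metric_display.py --eval_path EVAL_PATH --output_excel metrics.xlsx

# Each line of the .jsonl files under EVAL_PATH is expected to carry at least:
# {"success_gpt": 1, "progress_gpt": 0.8, "right_api_num": 3, "all_api_num_gt": 4, "all_api_num_pre": 3}
```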