Merge pull request #168 from AlibabaResearch/flowbench
Flowbench
Showing 14 changed files with 1,765 additions and 0 deletions.
@@ -0,0 +1,112 @@
<div align="center">
<h1 align="center"> 🌊 FlowBench 🌊</h1>
<b>FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents</b>

<p align="center"><font size=6>📃</font> <a target="_self" href="https://arxiv.org/abs/2406.14884"> <img style="height:14pt" src="https://img.shields.io/badge/-Paper-red?style=flat&logo=arxiv"></a> <font size=6>•</font> <font size=6>🔔</font> <a target="_self" href="https://github.com/Justherozen/FlowBench"> <img style="height:14pt" src="https://img.shields.io/badge/-Code-pink?style=flat&logo=github"></a></p>
</div>
## Overview
This repository contains the source data and code for our EMNLP 2024 paper [FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents](https://arxiv.org/abs/2406.14884). We propose FlowBench, a comprehensive benchmark for workflow-guided agent planning. We first revisit and formalize the different workflow knowledge formats used for agent planning. FlowBench covers an extensive taxonomy (6 domains, 22 roles, 51 scenarios) and different knowledge formats (text, code, flowchart) to align comprehensively with real-world applications. The benchmark data is constructed through a three-phase pipeline of task collection, workflow organization, and session generation. FlowBench features several distinctive characteristics, including broad coverage, difficulty, expert-level annotation, and support for multi-round user-agent interaction. Through extensive experiments on FlowBench, we find that even the best-performing model, GPT-4o, fails to deliver satisfactory results on this challenging benchmark. We hope our work provides meaningful insights for future research on workflow-guided agent planning. An overview of FlowBench is shown below:

![overview of flowbench](./resources/flowbench.png)

> *Please find more details of this work in our paper.*

### Dataset Introduction

Download `turn_data.zip` and `session_data.zip` from [Google Drive](https://drive.google.com/drive/folders/1PFzA5e-fuKpVZvAHP-otBhWPdU60O3d4?usp=sharing). After extracting them, you will get two folders, `turn_data` and `session_data`; move both into the `data` directory. These folders contain the benchmark data at the session level and the turn level. All workflow knowledge, in its different formats, is organized in `knowledge.json`.
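
After moving the folders, the `data` directory should look roughly like the sketch below (an illustration only; the files inside the two folders and the exact location of `knowledge.json` follow the released archives):

```
data/
├── turn_data/      # turn-level benchmark data
└── session_data/   # session-level benchmark data
```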
### Evaluating workflow-guided agent planning

##### Dependencies

To install requirements:

    pip install -r requirements.txt

##### API preparation

Set up your OpenAI API key in `./utils/keys.json`:

```
api_key: "Your OPENAI key"
```
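
For reference, a minimal sketch of how such a key could be consumed, assuming `keys.json` holds a JSON object with an `api_key` field and that the `openai` v1 client is used (the repository's own loading code under `./utils/` may differ):

```
import json
from openai import OpenAI

# Assumed layout: ./utils/keys.json is a JSON object with an "api_key" field.
with open("./utils/keys.json") as f:
    keys = json.load(f)

# Pass the key explicitly to the client instead of relying on environment variables.
client = OpenAI(api_key=keys["api_key"])
```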

After that, you can conduct the turn-level and session-level evaluations.

##### Turn-level evaluation

- To generate the single-turn predictions for different test samples, please run

```
python ./turn_level/turn_inference.py --input_path INPUT_FOLDER --output_path OUTPUT_FOLDER
```

- Then you can calculate and display the evaluation metrics with the following command, where `OUTPUT_FOLDER` is the output path of the previous generation step.

```
python ./turn_level/turn_metric_display.py --output_path OUTPUT_FOLDER
```

##### Session-level evaluation

- To simulate the predicted sessions, use the following command in simulate mode, where `INPUT_PATH`, `OUTPUT_PATH`, and `EVAL_PATH` indicate the paths for the test input, the generated simulations, and the simulation evaluation results, respectively.

```
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
```

- After session simulation, you can calculate and save the evaluation metrics using the eval mode as follows.

```
python ./session_level/session_simulate.py --mode eval --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH
```

- Finally, you can display the evaluation metrics for each scenario and optionally save them to an Excel file.

```
python ./session_level/session_metric_display.py --eval_path EVAL_PATH
```

You can specify the LLM used for generation, the LLM used as a judge, and the LLM used for environment simulation from the command line.
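
As an illustration only, such an invocation might look like the sketch below; the model-selection flag names (`--agent_model`, `--judge_model`, `--env_model`) are hypothetical placeholders, so please check the argparse definitions in `session_simulate.py` for the actual flags:

```
# The flags after --eval_path are hypothetical placeholders for the model-selection options.
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH --eval_path EVAL_PATH --agent_model gpt-4o --judge_model gpt-4o --env_model gpt-4o
```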

##### Future plans

Apart from the scenarios presented in the paper, we will incorporate additional scenarios. We will also keep refining our benchmark quality and evaluation framework as part of our future initiatives!

### Citation

If you use or extend our work, please cite the paper as follows:
```
@misc{xiao2024flowbenchrevisitingbenchmarkingworkflowguided,
      title={FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents},
      author={Ruixuan Xiao and Wentao Ma and Ke Wang and Yuchuan Wu and Junbo Zhao and Haobo Wang and Fei Huang and Yongbin Li},
      year={2024},
      eprint={2406.14884},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.14884},
}
```
@@ -0,0 +1,6 @@
regex
pandas
numpy
openai
jsonlines
xlsxwriter
@@ -0,0 +1,6 @@
# To simulate the predicted sessions, use the following command.
python ./session_level/session_simulate.py --mode simulate --input_path INPUT_PATH --output_path OUTPUT_PATH
# After session simulation, you can calculate and save the evaluation metrics as follows.
python ./session_level/session_simulate.py --mode eval --output_path OUTPUT_PATH --eval_path EVAL_PATH
# Finally, you can display the evaluation metrics for each scenario and optionally save them to an Excel file.
python ./session_level/session_metric_display.py --eval_path EVAL_PATH
@@ -0,0 +1,4 @@
# To generate the single-turn predictions for different test samples, please run
python ./turn_level/turn_inference.py --input_path INPUT_FOLDER --output_path OUTPUT_FOLDER
# Then you can calculate and display the evaluation metrics with the following command.
python ./turn_level/turn_metric_display.py --input_path OUTPUT_FOLDER
@@ -0,0 +1,84 @@
import os
import jsonlines
import pandas as pd
import argparse


def compute_session_metrics(input_directory, output_excel=''):
    final_progress = []
    final_all_session = 0
    final_right_session = 0
    final_right_api_num = 0
    final_all_api_num_gt = 0
    final_all_api_num_pre = 0
    if output_excel:
        excel_path = pd.ExcelWriter(output_excel, engine='xlsxwriter')
    jsonl_files = [f for f in os.listdir(input_directory) if f.endswith('.jsonl')]

    for file_name in jsonl_files:
        file_path = os.path.join(input_directory, file_name)
        gpt_success = []
        gpt_progress = []
        api_num_right = []
        api_num_all_gt = []
        api_num_all_pre = []

        data = []
        with jsonlines.open(file_path) as reader:
            for obj in reader:
                data.append(obj)
                gpt_success.append(int(obj.get('success_gpt')))
                gpt_progress.append(float(obj.get('progress_gpt')))
                api_num_right.append(obj.get('right_api_num'))
                api_num_all_gt.append(obj.get('all_api_num_gt'))
                api_num_all_pre.append(obj.get('all_api_num_pre'))

        # Per-scenario metrics: session success rate, average progress, and tool precision/recall.
        tmp_output = {
            "scenarios": file_name,
            "success_rate": sum(gpt_success) / len(gpt_success) if gpt_success else 0,
            "avg_progress": sum(gpt_progress) / len(gpt_progress) if gpt_progress else 0,
            "tool_precision": sum(api_num_right) / sum(api_num_all_pre) if api_num_all_pre else 0,
            "tool_recall": sum(api_num_right) / sum(api_num_all_gt) if api_num_all_gt else 0,
        }

        print(tmp_output)
        # Accumulate counts for the overall metrics across all scenarios.
        final_progress.extend(gpt_progress)
        final_all_session += len(gpt_success)
        final_right_session += sum(gpt_success)

        final_right_api_num += sum(api_num_right)
        final_all_api_num_gt += sum(api_num_all_gt)
        final_all_api_num_pre += sum(api_num_all_pre)
        if output_excel:
            df = pd.DataFrame(data)
            # Excel sheet names are limited to 31 characters.
            df.to_excel(excel_path, sheet_name=file_name.split('.')[0][:31], index=False)

    final_gpt_success = final_right_session / final_all_session if final_all_session > 0 else 0
    final_gpt_progress = sum(final_progress) / len(final_progress) if final_progress else 0
    final_api_prec = final_right_api_num / final_all_api_num_pre if final_all_api_num_pre > 0 else 0
    final_api_recall = final_right_api_num / final_all_api_num_gt if final_all_api_num_gt > 0 else 0

    final_tmp_output = {
        "scenarios": "All",
        "success_rate": final_gpt_success,
        "avg_progress": final_gpt_progress,
        "tool_precision": final_api_prec,
        "tool_recall": final_api_recall
    }
    print("--------------")
    print(final_tmp_output)
    print(final_all_session)
    if output_excel:
        df_final = pd.DataFrame([final_tmp_output])
        df_final.to_excel(excel_path, sheet_name='Overall Metrics', index=False)
        excel_path.close()  # close() saves the workbook (public replacement for the private _save())


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process paths and modes.")

    # Add arguments
    parser.add_argument("--output_excel", help="Path to the output Excel file")
    parser.add_argument("--eval_path", required=True, help="Path to the input directory for metric display")

    # Parse arguments
    args = parser.parse_args()
    compute_session_metrics(args.eval_path, args.output_excel)
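
For reference, a typical invocation of the metric-display script above, using the flags defined in its argparse, together with the per-line fields it reads from each evaluation `.jsonl` file (field values below are illustrative):

```
# Display per-scenario and overall metrics; --output_excel is optional.
python ./session_level/session_metric_display.py --eval_path EVAL_PATH --output_excel metrics.xlsx

# Each line of the .jsonl files under EVAL_PATH is expected to carry at least:
# {"success_gpt": 1, "progress_gpt": 0.8, "right_api_num": 3, "all_api_num_gt": 4, "all_api_num_pre": 3}
```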