🛰️ DiscoveryBench Integration
This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that will help assess the agents' capabilities in multi-step, complex problem-solving. The benchmark aims to provide comprehensive insights into how well OpenHands agents handle data-driven scientific discovery workflows.
📋 Tasks
1. Clone and set up DiscoveryBench repository
Clone the DiscoveryBench Git repository and install dependencies.
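A rough sketch of this step in Python using `subprocess`; the repository URL points at the public allenai/discoverybench repo, while the clone location and the presence of a `requirements.txt` are assumptions:

```python
import subprocess
import sys
from pathlib import Path

DISCOVERYBENCH_REPO = "https://github.com/allenai/discoverybench.git"
CLONE_DIR = Path("evaluation/discoverybench/discoverybench")  # assumed location


def setup_discoverybench() -> Path:
    """Clone DiscoveryBench (if not already present) and install its dependencies."""
    if not CLONE_DIR.exists():
        subprocess.run(
            ["git", "clone", DISCOVERYBENCH_REPO, str(CLONE_DIR)], check=True
        )
    # Assumes the repo ships a requirements.txt covering what the evaluation needs.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", str(CLONE_DIR / "requirements.txt")],
        check=True,
    )
    return CLONE_DIR
```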
2. Create dataset for evaluation
Create a custom function that creates a dataset from the cloned repository.
Prepare the dataset for evaluation.
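A possible shape for that function, assuming the cloned repo keeps one metadata_*.json per task directory and that each file lists the natural-language queries for that task (the `queries` and `gold_hypothesis` field names below are assumptions):

```python
import glob
import json
import os

import pandas as pd


def create_dataset(repo_dir: str, split: str = "discoverybench/real/test") -> pd.DataFrame:
    """Flatten DiscoveryBench task metadata into one row per (task, query)."""
    rows = []
    pattern = os.path.join(repo_dir, split, "*", "metadata_*.json")
    for metadata_path in sorted(glob.glob(pattern)):
        with open(metadata_path) as f:
            metadata = json.load(f)
        for query in metadata.get("queries", []):
            rows.append(
                {
                    "task_dir": os.path.dirname(metadata_path),
                    "metadata_path": metadata_path,
                    "query": query,  # kept verbatim; per-query structure may vary
                    "gold_hypothesis": metadata.get("gold_hypothesis"),  # assumed field
                }
            )
    df = pd.DataFrame(rows)
    df["instance_id"] = df.index.astype(str)
    return df
```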
3. Generate evaluation metadata and process each instance
Create metadata using the `make_metadata` function, including dataset and task info.
Use the `process_instance` method to prepare evaluation queries for each dataset instance.
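A sketch of how this could look, modeled on the shared helpers other OpenHands benchmarks import from `evaluation.utils.shared`; the exact signatures, keyword names, config name, and prompt wording are assumptions:

```python
from evaluation.utils.shared import EvalMetadata, EvalOutput, make_metadata
from openhands.core.config import get_llm_config_arg


def get_instruction(instance) -> str:
    """Turn one dataset row into the query handed to the agent."""
    return (
        "You are given a scientific dataset in /workspace. "
        "Answer the following discovery query and state your final hypothesis:\n"
        f"{instance.query}"
    )


def process_instance(instance, metadata: EvalMetadata, reset_logger: bool = True) -> EvalOutput:
    instruction = get_instruction(instance)
    # Runtime creation, the agent run, and scoring are sketched under steps 4 and 5.
    raise NotImplementedError


llm_config = get_llm_config_arg("eval_gpt4")  # "eval_gpt4" is a placeholder config name
metadata = make_metadata(
    llm_config=llm_config,
    dataset_name="discoverybench",
    agent_class="CodeActAgent",
    max_iterations=30,
    eval_note=None,
    eval_output_dir="evaluation/evaluation_outputs",
)
```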
4. Set up runtime
Create the runtime environment for experimentation.
Initialize the runtime by copying the necessary data files into the container.
Start OpenHands with the instance query and the task data available inside the container.
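One way this could fit together, following the pattern other OpenHands benchmarks use; the module paths, the `copy_to`/`run_controller` signatures, and the CSV-only filter are assumptions:

```python
import asyncio
import os

from openhands.core.main import create_runtime, run_controller
from openhands.events.action import MessageAction


def run_single_instance(config, instance, instruction: str):
    """Create a sandboxed runtime, copy the task data in, and run the agent."""
    runtime = create_runtime(config)

    # Copy this task's data files into the sandbox workspace.
    for name in os.listdir(instance.task_dir):
        if name.endswith(".csv"):
            runtime.copy_to(os.path.join(instance.task_dir, name), "/workspace")

    # Start OpenHands with the instance query; the data is now inside the container.
    state = asyncio.run(
        run_controller(
            config=config,
            initial_user_action=MessageAction(content=instruction),
            runtime=runtime,
        )
    )
    return state
```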
5. Run the evaluation workflow
Extract the results generated by the OpenHands agents.
Analyze the results, comparing generated hypotheses to gold-standard outputs.
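A sketch of the extraction side; how the final hypothesis is pulled out of the OpenHands `State` is an assumption, and `evaluate_hypothesis` is a hypothetical wrapper around whatever hypothesis-matching entry point DiscoveryBench exposes:

```python
from openhands.events.action import MessageAction


def extract_final_hypothesis(state) -> str:
    """Return the last natural-language message the agent produced."""
    for event in reversed(list(state.history)):
        if isinstance(event, MessageAction) and event.source == "agent":
            return event.content
    return ""


def evaluate_hypothesis(generated: str, gold: str) -> dict:
    """Hypothetical hook into DiscoveryBench's hypothesis-matching evaluation."""
    # In practice this would call the benchmark's own evaluator and return its
    # per-dimension scores (e.g. how well context, variables, and relation match).
    raise NotImplementedError
```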
6. Compile final results into test result dictionary
Save all metrics and results into the `test_result` dictionary for final analysis.
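For example, as a minimal helper (the exact keys of `test_result` are assumptions, chosen to mirror what the previous steps produce):

```python
def build_test_result(instance, generated_hypothesis: str, eval_scores: dict) -> dict:
    """Collect everything downstream analysis needs into one test_result dict."""
    return {
        "query": instance["query"],
        "gold_hypothesis": instance["gold_hypothesis"],
        "generated_hypothesis": generated_hypothesis,
        "eval_scores": eval_scores,
    }
```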
7. Log and save evaluation outputs
Ensure all outputs are logged and stored for reporting.
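A minimal sketch, assuming the usual OpenHands convention of appending one JSON object per instance to an `output.jsonl` under the evaluation output directory:

```python
import json
import os


def log_and_save(eval_output_dir: str, instance_id: str, test_result: dict) -> None:
    """Append one instance's result to output.jsonl so nothing is lost between runs."""
    os.makedirs(eval_output_dir, exist_ok=True)
    output_file = os.path.join(eval_output_dir, "output.jsonl")
    with open(output_file, "a") as f:
        f.write(json.dumps({"instance_id": instance_id, "test_result": test_result}) + "\n")
```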
8. Validate the integration
Perform end-to-end validation of DiscoveryBench within OpenHands to ensure correct functionality.
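A smoke-test sketch: run a single instance end to end and check that a well-formed record reaches `output.jsonl`. The script path and CLI flags follow the conventions of other OpenHands benchmarks and are assumptions:

```python
import glob
import json
import subprocess


def validate_integration(llm_config_name: str = "eval_gpt4") -> None:
    """Run one DiscoveryBench instance through OpenHands and sanity-check the output."""
    subprocess.run(
        [
            "python",
            "evaluation/discoverybench/run_infer.py",  # assumed script location
            "--llm-config", llm_config_name,
            "--agent-cls", "CodeActAgent",
            "--eval-n-limit", "1",
        ],
        check=True,
    )
    # The output directory is nested by dataset/agent/model, so search for the file.
    matches = glob.glob("evaluation/evaluation_outputs/**/output.jsonl", recursive=True)
    assert matches, "no output.jsonl was written"
    with open(matches[0]) as f:
        records = [json.loads(line) for line in f if line.strip()]
    assert records and "test_result" in records[0]
```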