[ROADMAP] DiscoveryBench Integration #2

Open · 13 tasks · Ethan0456 opened this issue Oct 15, 2024 · 0 comments
Labels: enhancement (New feature or request)

Ethan0456 commented Oct 15, 2024

🛰️ DiscoveryBench Integration

This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that will help assess the agents' capabilities in multi-step, complex problem-solving. The benchmark aims to provide comprehensive insights into how well OpenHands agents handle data-driven scientific discovery workflows.

📋 Tasks

1. Clone and set up DiscoveryBench repository

  • Clone the DiscoveryBench Git repository and install dependencies.

2. Create dataset for evaluation

  • Create a custom function that creates a dataset from the cloned repository (a rough sketch follows this task list).
  • Prepare the dataset for evaluation.

3. Generate evaluation metadata and process each instance

  • Create metadata using the make_metadata function, including dataset and task info (see the harness driver sketch after this list).
  • Use the process_instance method to prepare evaluation queries for each dataset instance.

4. Set up runtime

  • Create the runtime environment for experimentation.
  • Initialize the runtime by copying the necessary data files into the container.
  • Start OpenHands with the instance query and the data inside the container (a process_instance skeleton covering tasks 4-7 follows this list).

5. Run the evaluation workflow

  • Extract the results generated by the OpenHands agents.
  • Analyze the results, comparing generated hypotheses to gold-standard outputs.

6. Compile final results into test result dictionary

  • Save all metrics and results into the test_result dictionary for final analysis.

7. Log and save evaluation outputs

  • Ensure all outputs are logged and stored for reporting.

8. Validate the integration

  • Perform end-to-end validation of DiscoveryBench within OpenHands to ensure correct functionality.
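
As a reference for task 2, here is a minimal sketch of a dataset-construction helper. It assumes the cloned DiscoveryBench repository exposes its tasks as per-directory `metadata_*.json` files under `discoverybench/<split>/`; the directory layout and the field names (`queries`, `qid`, `question`, `gold_hypothesis`) are assumptions to be verified against the actual repo.

```python
# Sketch of task 2: build an evaluation dataset from the cloned DiscoveryBench repo.
# Directory layout and JSON field names below are assumptions, not confirmed structure.
import glob
import json
import os

import pandas as pd


def create_dataset(repo_dir: str, split: str = 'real/test') -> pd.DataFrame:
    """Build one row per (task, query) pair found under the given split."""
    rows = []
    pattern = os.path.join(repo_dir, 'discoverybench', split, '*', 'metadata_*.json')
    for metadata_path in sorted(glob.glob(pattern)):
        with open(metadata_path) as f:
            metadata = json.load(f)
        task_dir = os.path.dirname(metadata_path)
        # 'queries', 'qid', 'question' and 'gold_hypothesis' are illustrative field names.
        for query in metadata.get('queries', []):
            rows.append(
                {
                    'instance_id': f'{os.path.basename(task_dir)}_{query.get("qid")}',
                    'task_dir': task_dir,
                    'metadata_path': metadata_path,
                    'query': query.get('question'),
                    'gold_hypothesis': query.get('gold_hypothesis'),
                }
            )
    return pd.DataFrame(rows)
```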
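For task 3, the driver script would likely mirror the other OpenHands benchmarks: build an `EvalMetadata` via `make_metadata`, then hand a `process_instance` callback to `run_evaluation`. The exact signatures of `make_metadata`, `prepare_dataset` and `run_evaluation` live in `evaluation/utils/shared.py` and may differ from this outline; the `get_parser`/`get_llm_config_arg` imports and the `--discoverybench-repo` flag are assumptions modelled on existing benchmark scripts.

```python
# Sketch of tasks 3 and 5: the evaluation driver, modelled on other OpenHands benchmarks.
import os

from evaluation.utils.shared import make_metadata, prepare_dataset, run_evaluation
from openhands.core.config import get_llm_config_arg, get_parser

if __name__ == '__main__':
    parser = get_parser()
    parser.add_argument(
        '--discoverybench-repo',
        type=str,
        default='./DiscoveryBench',
        help='Path to the cloned DiscoveryBench repository',
    )
    args = parser.parse_args()

    # Task 2: build the dataset from the cloned repo (create_dataset sketched above).
    dataset = create_dataset(args.discoverybench_repo)

    # Task 3: create the evaluation metadata, including dataset and task info.
    metadata = make_metadata(
        llm_config=get_llm_config_arg(args.llm_config),
        dataset_name='discoverybench',
        agent_class=args.agent_cls,
        max_iterations=args.max_iterations,
        eval_note=args.eval_note,
        eval_output_dir=args.eval_output_dir,
    )
    output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')

    # Limit/resume handling, as in other benchmarks.
    instances = prepare_dataset(dataset, output_file, args.eval_n_limit)

    # Tasks 4-7 happen inside process_instance (next sketch), run once per instance.
    run_evaluation(
        instances, metadata, output_file, args.eval_num_workers, process_instance
    )
```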
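Tasks 4 through 7 would then be handled per instance inside `process_instance`. The skeleton below follows the pattern of existing harnesses: `create_runtime`, `run_controller`, `MessageAction`, `EvalOutput` and `Runtime.copy_to` are the names used elsewhere in the repo at the time of writing (and may have changed), while `get_config`, `extract_hypothesis` and `compare_hypotheses` are hypothetical helpers specific to this sketch.

```python
# Sketch of tasks 4-7: per-instance runtime setup, agent run, and result extraction.
import asyncio

from evaluation.utils.shared import EvalMetadata, EvalOutput
from openhands.core.main import create_runtime, run_controller
from openhands.events.action import MessageAction


def process_instance(
    instance, metadata: EvalMetadata, reset_logger: bool = True
) -> EvalOutput:
    # Hypothetical helper that turns EvalMetadata into an AppConfig for this benchmark.
    config = get_config(metadata)

    # Task 4: set up the runtime and copy the instance's data files into the container.
    runtime = create_runtime(config)
    for data_file in instance['data_files']:
        runtime.copy_to(data_file, '/workspace')

    # Task 4 (cont.): start OpenHands with the instance query.
    instruction = (
        'The datasets for this task are available under /workspace.\n'
        f'{instance["query"]}\n'
        'State your final hypothesis clearly at the end.'
    )
    state = asyncio.run(
        run_controller(
            config=config,
            initial_user_action=MessageAction(content=instruction),
            runtime=runtime,
        )
    )

    # Task 5: extract the generated hypothesis and compare it to the gold output
    # (extract_hypothesis / compare_hypotheses are hypothetical helpers).
    generated_hypothesis = extract_hypothesis(state)
    score = compare_hypotheses(generated_hypothesis, instance['gold_hypothesis'])

    # Task 6: compile all metrics and results into the test_result dictionary.
    test_result = {
        'query': instance['query'],
        'generated_hypothesis': generated_hypothesis,
        'gold_hypothesis': instance['gold_hypothesis'],
        'score': score,
    }

    # Task 7: returning an EvalOutput lets run_evaluation log it to output.jsonl.
    # Field names follow EvalOutput in evaluation/utils/shared.py; adjust as needed.
    return EvalOutput(
        instance_id=instance['instance_id'],
        instruction=instruction,
        metadata=metadata,
        history=[],
        metrics=None,
        error=None,
        test_result=test_result,
    )
```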