Create Automated Benchmarking Suite #366

Open
TimothyStiles opened this issue Sep 23, 2023 · 5 comments
Labels
devops: Improvements to DevOps (e.g. GitHub actions, linting, etc.)
hard: A major or complex undertaking
low priority: Would be nice to fix, but doesn't have to happen right now/there are more important things
stale

Comments

@TimothyStiles
Collaborator

It'd be really cool to have a benchmarking suite that we can run to see if we've unintentionally introduced any performance changes before merging into main.

Idea would be that on PR creation we'd run the benchmarks on both the main branch and the PR branch, and use it to highlight any significant changes (positive and negative).
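A minimal sketch of that comparison step, assuming the benchmark runs on each branch have already been reduced to a mapping from benchmark name to time per operation. The benchmark names, the timings, and the 10% threshold are all hypothetical and only illustrate the idea; a real setup would also need repeated runs to separate noise from genuine changes.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    base_ns: float  # time per operation on main
    pr_ns: float    # time per operation on the PR branch

    @property
    def ratio(self) -> float:
        # A ratio above 1.0 means the PR branch is slower than main.
        return self.pr_ns / self.base_ns


def significant_changes(base, pr, threshold=0.10):
    """Return benchmarks whose time moved by more than `threshold`, in either direction.

    Only benchmarks present in both result sets are compared.
    """
    changes = []
    for name in sorted(base.keys() & pr.keys()):
        change = Change(name, base[name], pr[name])
        if abs(change.ratio - 1.0) > threshold:
            changes.append(change)
    return changes


# Hypothetical timings in nanoseconds per operation.
main_results = {"ParseGenbank": 1000.0, "ParseFasta": 200.0, "Fold": 5000.0}
pr_results = {"ParseGenbank": 1300.0, "ParseFasta": 205.0, "Fold": 4000.0}

for c in significant_changes(main_results, pr_results):
    direction = "slower" if c.ratio > 1.0 else "faster"
    print(f"{c.name}: {abs(c.ratio - 1.0):.0%} {direction}")
```

Reporting both directions matches the goal above of highlighting positive as well as negative changes; small movements such as the hypothetical `ParseFasta` result stay below the threshold and are not reported.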

We can start slow with what we'd consider "problem areas" and continue out from there.

@TimothyStiles TimothyStiles converted this from a draft issue Sep 23, 2023
@carreter
Collaborator

I think this might combine well with #362. If we have a tutorial series that takes a real-world example end to end through our package, we can also use it to benchmark performance.

@carreter carreter added devops Improvements to DevOps (e.g. GitHub actions, linting, etc.) low priority Would be nice to fix, but doesn't have to happen right now/there are more important things hard A major or complex undertaking labels Sep 23, 2023
@Koeng101
Contributor

I think it'd be great to benchmark against all of GenBank or UniProt or PDB. It would take a server with decent hard drives, or just a lot of data per month to stream, and it would actually validate that our parsers work well.

@carreter
Collaborator

This is a great idea! We could have our new CI/CD pipeline (#365) incorporate this.

I don't think it'd be advisable to have it run against ALL of these massive datasets every time we merge, but we could have it pick a consistent, representative subset.
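One way to get a consistent subset is to hash each record's identifier and keep the records that land in a fixed range of buckets. The sketch below assumes every record has a stable accession string; the `ACC...` identifiers and the `in_subset` helper are made up for illustration. A hash-based sample is uniform across identifiers, so it is only "representative" in that sense and would not guarantee coverage of rare record types.

```python
import hashlib


def in_subset(accession: str, percent: int = 1) -> bool:
    """Deterministically decide whether a record belongs to the benchmark subset.

    The decision depends only on the accession string, so the same records
    are selected on every run and on every machine.
    """
    digest = hashlib.sha256(accession.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


# Hypothetical accession identifiers.
accessions = [f"ACC{i:06d}" for i in range(10_000)]
subset = [a for a in accessions if in_subset(a, percent=5)]
print(len(subset))  # roughly 5% of the input, and the same records every run
```

Because membership depends only on the identifier, new database entries fall in or out of the subset without changing which existing records are selected.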

It'd be nice to also have all new entries in these DBs run against the latest version of our parsers.

Also, these DBs aren't that big size-wise since it's just text and not image data, right? I have no clue, this is a genuine question.

@Koeng101
Contributor

GenBank, I think, is a little over a terabyte, so not that bad. UniProt is like 250 GB. SRA, on the other hand, is 33 petabytes (and the Wayback Machine is 57 petabytes), so that kinda puts it into perspective. There is NO WAY we could handle SRA, but GenBank + UniProt would probably be doable.


This issue has had no activity in the past 2 months. Marking as stale.

@github-actions github-actions bot added the stale label Nov 23, 2023

3 participants