Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: integrate sketching utils from directsketch #516

Merged
merged 16 commits into from
Nov 20, 2024
Merged

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Nov 15, 2024

In directsketch, I tried to come up with a set of sketching utils that would make reading parameters + building sketches easier and more standardized. I introduced a series of buildutils -- BuildRecord,BuildManifest, BuildCollection, etc that are similar to their sourmash counterparts, but enable building and updating.

Here, I copy the utils over from directsketch and replace the two async functions with sync counterparts that enable writing the sigs and manifest from the buildutils. Then, I use BuildCollection in manysketch and singlesketch to manage parameter string handling, sig template generation, and sig writing.

Main differences vs prior strategy:

  • We replace build_siginfo with BuildCollection sig templates, and read the param string in via BuildCollection. There are now safety checks in place: we only allow a single moltype, abund value, scaled value, etc. We do allow multiple kmer sizes.
  • The build collection uses the exact same strategy for adding data to each template signature, and for writing signatures to file
  • Since we have a mutable BuildRecord, we don't need to iterate through all the sigs and the end and build new Records for them prior to writing a Manifest. However, we do need to make sure the BuildRecords are updated appropriately after adding data + before writing the file. This is handled within the BuildCollection methods.
  • Here I added counter for the # sequences processed per file while adding data, in order to keep the reporting that was in singlesketch. Didn't think it would impact performance much, but we are incrementing the counter after every record in the fasta, so maybe?

To do:

  • benchmark (thanks @ctb!)

Future:

  • improve parallelization, particularly when running few files
  • could we use InnerStorage for thread safety/parallelization?

@bluegenes
Copy link
Contributor Author

bluegenes commented Nov 15, 2024

status:
zip files are getting written, but can't (yet) be read with sourmash. Running into this error again:

sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Extra data field contains disk number")' at src/core/src/storage/mod.rs:390

Apparently the zip that is installed in my environment is v2.2, which has the bug mentioned here, and not v2.0, which is in Cargo.toml.

...why/how?

Update: I'm not sure why/how that was happening, but I did some cleaning + updating, and now the right versions seem to be installing. Hopefully this will stay fixed.

@bluegenes
Copy link
Contributor Author

bluegenes commented Nov 16, 2024

The only remaining failing tests are the ones with "incomplete" param strings -- i.e. user doesn't provide a moltype or doesn't provide a ksize. Temporarily commented out.

The current parameter strategy = DNA default (k=31,scaled=1000,noabund), then modify any parameters that are provided. Previously, we required a moltype and ksize to be present in the parameter string.

What do we want to require? I think maybe the right solution may be to require moltype, but not necessarily scaled or ksize? If a moltype is provided, we can use the default scaled/ksize. However, we don't necessarily want to assume DNA if the moltype is not provided?

thoughts @ctb?

Update: I've changed the strategy to require moltype. I think this matches our sketch fromfile strategy. Note that requiring ksize is not explicitly mentioned anywhere, and I think I only did it to prevent issues, since I didn't have comprehensive parameter string checks or signature defaults.

bluegenes and others added 4 commits November 15, 2024 16:57
* buildutils for singlesketch

* consolidate sig building

* consolidate sig writing

* gz + zip writing!

* check gz is actually gz

* black formatting
@ctb
Copy link
Collaborator

ctb commented Nov 16, 2024

benchmarks:

::::::::::::::
outputs.paper/benchmarks-2024-11-16.v0_9_11/sketch_SRR5650070.txt
::::::::::::::
s       h:m:s   max_rss max_vms max_uss max_pss io_in   io_out  mean_load
cpu_time
454.5841        0:07:34 134.50  4571.02 131.88  132.54  0.00    0.00    96.24  
437.51
--More--(Next file: outputs.paper/benchmarks-2024-11-16.v0_9_11_pr516/sketch_SRR::::::::::::::
outputs.paper/benchmarks-2024-11-16.v0_9_11_pr516/sketch_SRR5650070.txt
::::::::::::::
s       h:m:s   max_rss max_vms max_uss max_pss io_in   io_out  mean_load
cpu_time
469.6802        0:07:49 146.27  4570.58 145.22  145.34  13.77   10.91   94.03  
441.90
--More--(Next file: outputs.paper/benchmarks-2024-11-16.v0_9_11_pr526/sketch_SRR::::::::::::::
outputs.paper/benchmarks-2024-11-16.v0_9_11_pr526/sketch_SRR5650070.txt
::::::::::::::
s       h:m:s   max_rss max_vms max_uss max_pss io_in   io_out  mean_load
cpu_time
5855.4110       1:37:35 123.00  4535.66 133.02  133.68  6.99    6.42    4023.00
235563.59

I think simple parallelization per #526 isn't going to work :) :(

@bluegenes bluegenes changed the title WIP: integrate sketching utils from directsketch MRG: integrate sketching utils from directsketch Nov 18, 2024
@bluegenes
Copy link
Contributor Author

ok @ctb - ready for review

@bluegenes
Copy link
Contributor Author

@mr-eyes Here is where I modify singlesketch (the other PR was just into here), but I would be happy for you to parallelize singlesketch however you like, even if it doesn't use the sig writing/data addition utils I added here. I mostly just didn't want to support parse_param_str or build_siginfo anymore, since they're duplicated in BuildCollection, but with better checks. That meant I switched to BuildCollection for sig templates. Do you think you can use that?

@mr-eyes
Copy link
Member

mr-eyes commented Nov 18, 2024

@mr-eyes Here is where I modify singlesketch (the other PR was just into here), but I would be happy for you to parallelize singlesketch however you like, even if it doesn't use the sig writing/data addition utils I added here. I mostly just didn't want to support parse_param_str or build_siginfo anymore, since they're duplicated in BuildCollection, but with better checks. That meant I switched to BuildCollection for sig templates. Do you think you can use that?

Hi Tessa! This is cool! I will see while writing the code, but I think it's gonna work smoothly. Thank you!

@bluegenes
Copy link
Contributor Author

@ctb any chance you could run the benchmarks again? I modified MultiSelect to work in-place, and I'm not sure if that changes much, if anything. I'm really only using it to select template sigs we can use for a given file.

@ctb
Copy link
Collaborator

ctb commented Nov 19, 2024

on it!

@ctb
Copy link
Collaborator

ctb commented Nov 19, 2024

% more outputs.paper/benchmarks/sketch_SRR5650070.txt
s       h:m:s   max_rss max_vms max_uss max_pss io_in   io_out  mean_load
cpu_time
473.1543        0:07:53 133.00  4570.59 130.99  131.66  2807.45 6.42    92.57  
438.12

Copy link
Collaborator

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup!

Are you OK with this code eventually be moved over to sourmash, maybe? And if so are you ok with the license changing to BSD? (Just want to get it on record ;)) If so, please also create a sourmash issue suggesting that we consolidate some of this over there!

@bluegenes
Copy link
Contributor Author

Nice cleanup!

Are you OK with this code eventually be moved over to sourmash, maybe? And if so are you ok with the license changing to BSD? (Just want to get it on record ;)) If so, please also create a sourmash issue suggesting that we consolidate some of this over there!

Yep, that was the intention. Will make a sourmash issue for it!

@bluegenes bluegenes merged commit c8229d5 into main Nov 20, 2024
3 checks passed
@bluegenes bluegenes deleted the integrate-buildutils branch November 20, 2024 22:28
@ctb ctb mentioned this pull request Dec 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants