Incorporate the high-memory BWT implementation from fmlrc2 #9

Open
Ebedthan opened this issue May 5, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

Contributor

Ebedthan commented May 5, 2022

Hi @holtjma,

Sorry to bother you, and sorry for opening more issues and PRs than anyone else in recent days.
I believe this tool would greatly benefit from this addition.
I would like to see whether we can incorporate the high-memory BWT implementation from fmlrc2 here. Before starting, I would appreciate your recommendations and direction so that it is not a waste of time for either of us. And if it is not something you would like for msbwt2, we can just leave things as they are.

Thanks!

Member

holtjma commented May 5, 2022

It's no bother, though I am curious as to why the sudden interest in the crate.

Anyways, incorporating the exact implementation is likely tricky in the crate's current state; the main issue being that k-mer caching isn't really built into msbwt2 currently. However, if you ignore caching, it should basically be a copy-paste-modify using a mixture of sources from msbwt2 and fmlrc2. My suggestions for going about this:

  1. Really familiarize yourself with what exactly the fmlrc2 BWT implementation is doing, both in code and concept. You might also find useful info in the paper.
  2. I suggest starting from the RLE BWT implementation for the initial copy-paste-modify. That will get you the unit tests you need to pass. It may also be worth reviewing the fmlrc2 implementation to see if any of those tests are worth adding.
  3. Replace the corresponding RLE implementations with those from fmlrc2. I note again that you'll likely need to remove/ignore anything related to caching for now.

That's the high level process I would use. I just haven't had a need for it yet.
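For readers unfamiliar with the idea, here is a minimal sketch of what a cache-free, uncompressed BWT with a sampled occurrence table can look like. It is not the fmlrc2 or msbwt2 code: `PlainBwt`, its fields, and the `sample` parameter are hypothetical names chosen for illustration, and the symbol encoding (`$`, `A`, `C`, `G`, `N`, `T` as 0 to 5) is an assumption. It trades memory (one byte per symbol plus count samples) for simple, fast rank queries.

```rust
/// Assumed alphabet size: $, A, C, G, N, T encoded as 0..=5.
const VC_LEN: usize = 6;

/// Hypothetical uncompressed BWT with sampled occurrence counts.
pub struct PlainBwt {
    /// One encoded symbol per byte (no run-length compression).
    bwt: Vec<u8>,
    /// Distance between stored occurrence-count samples.
    sample: usize,
    /// samples[i][c] = number of symbol c in bwt[..i * sample].
    samples: Vec<[u64; VC_LEN]>,
    /// start[c] = number of symbols smaller than c in the whole BWT.
    start: [u64; VC_LEN],
}

impl PlainBwt {
    /// Builds the count samples and start offsets in a single pass.
    pub fn new(bwt: Vec<u8>, sample: usize) -> Self {
        assert!(sample > 0, "sample interval must be positive");
        let mut counts = [0u64; VC_LEN];
        let mut samples = Vec::with_capacity(bwt.len() / sample + 1);
        for (i, &c) in bwt.iter().enumerate() {
            if i % sample == 0 {
                samples.push(counts);
            }
            counts[c as usize] += 1;
        }
        // Make sure a query at idx == bwt.len() always has a sample to start from.
        if bwt.len() % sample == 0 {
            samples.push(counts);
        }
        let mut start = [0u64; VC_LEN];
        let mut total = 0u64;
        for c in 0..VC_LEN {
            start[c] = total;
            total += counts[c];
        }
        PlainBwt { bwt, sample, samples, start }
    }

    /// Number of occurrences of symbol c in bwt[..idx].
    pub fn occ(&self, c: u8, idx: usize) -> u64 {
        let block = idx / self.sample;
        let base = self.samples[block][c as usize];
        let scanned = self.bwt[block * self.sample..idx]
            .iter()
            .filter(|&&x| x == c)
            .count() as u64;
        base + scanned
    }

    /// One backward-search step: narrows [lo, hi) by prepending symbol c.
    pub fn constrain_range(&self, c: u8, lo: u64, hi: u64) -> (u64, u64) {
        let s = self.start[c as usize];
        (s + self.occ(c, lo as usize), s + self.occ(c, hi as usize))
    }

    /// Counts occurrences of an encoded k-mer, with no k-mer caching.
    pub fn count_kmer(&self, kmer: &[u8]) -> u64 {
        let (mut lo, mut hi) = (0u64, self.bwt.len() as u64);
        for &c in kmer.iter().rev() {
            let (l, h) = self.constrain_range(c, lo, hi);
            lo = l;
            hi = h;
            if lo == hi {
                return 0;
            }
        }
        hi - lo
    }
}
```

As a usage example, the BWT of `ACAC$` is `CC$AA`, encoded as `[2, 2, 0, 1, 1]`; building `PlainBwt::new(vec![2, 2, 0, 1, 1], 2)` and calling `count_kmer(&[1, 2])` counts the occurrences of `AC`. A k-mer cache, if added later, would sit in front of `count_kmer` and supply a precomputed starting range.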

Contributor Author

Ebedthan commented May 5, 2022

It's not such a sudden interest for me :) This crate caught my attention over the last 4 months. I have tens of long- and short-read transcriptomes and genomes to analyze and correct. It is really useful to type only one command to build the msbwt, compared to the old version, which is OS-dependent and more tedious in my opinion.

I have had some free hours since yesterday, so I tried to invest in the crate for my future work, and I hope I have helped at least a bit. I don't know if I will have much free time in the coming days, but we'll see how things go.

For the incorporation of the high-memory BWT, I'll try to do my best when I can.

Thanks.

P.S. I'm also the person behind the Rust for Bioinformatics Twitter profile, and the community loves your work!

Member

holtjma commented May 5, 2022

Yea, the initial intent was to make building a bit more user friendly; granted, it's also not as efficient as the ropebwt2 approach in its current state. I just haven't had time to really theorize/implement faster versions given that this is currently a side-project (i.e. not directly work-related).

As for the high-mem impl., I'm happy to help review/iterate on that if you do decide to work on it!

Also, good to know the person behind the twitter handle haha

holtjma added the enhancement label Jul 26, 2022