-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
suffix array/ bwt construction algorithms other than libdivsufsort #27
Comments
There are plans for this yes. We are meeting in Japan in 2 weeks so I suggest if you can, that you put all the feature requests that you want as issues in here so we can discuss them. |
Currently, we I mocked up some prototype that uses the builder pattern. So it would look like this:
|
Ah, that's good to know! Thanks.
Ideally we could just pass a lambda to @cpockrandt Please chime in on these topics! |
Currently all inmemory construction (using constuct_im) is done by writing all temporary files to a "ram filesystem". So, if you pass in a "buffer" to consturct_im it is first copied to another buffer (a virtual file) in the ram_fs. Do you think this is not sufficient? If the sa_algorithm you pass in doesn't want to write stuff to disk (or the virtual ram fs) it doesn't have to do that. However, the wavelet tree construction algorithms all use streaming to be more memory efficient. In the ram_fs case, the stream would just be from a virtual file instead of from disk. In the end, we can copy 3-4GB/s these days so that should never be a bottleneck. In general we were thinking of moving towards just memory mapping everything letting the operating system handle things which is good practice anyway as far as I know. |
So the entire SA is copied from memory to mem-fs (or the other way around) at least once, or not? This would double the peak memory requirement... PS: On an unrelated note: I also added progress reporting capabilities to some of our SACAs which was highly appreciated by our users (if something is expected to run somewhere between a day and a week it really helps to know whether it is closer to a day or a week!). Have you thought about this? |
Currently (and also in the future) we use an object called cache_config which contains a "file_map". The map stores locations of stuff we have constructed for reuse. In the current version of sdsl you could construct the SA and BWT externally and then add it do the file_map object. The CSA/CST construction process will then reuse those files instead of constructing the objects again. So if you want to do your own thing you can. There are currently stages in the construct of a CSA where temporary files are written to the fs (virtual or not). Creating a separate inmemory version which doesn't just use a "virtual file" in the background would create lots of code deduplication. I think in the ram filesystem we move memory around instead of copying whenever possible. We have some memory consumption visualizations and we are almost as efficient in terms of space consumption as possible and we are working on something to make this more efficient. The main issue is when to discard objects (e.g. the full SA or PSI) that are not needed anymore. Regarding the "progress display", we could add this functionality to the new construction process. Maybe a Have you looked into existing work on SACAs which do (1) parallel and (2) on disk. I remember Dominik Kempa has been doing at lot of work in this area. I think all the code is available and I was planning to integrate this into sdsl if I ever have time and we have a reasonable interface for different SACAs. There is also a really good parallel libdivsufsort implementation here out there which gives almost linear speedup. |
Hm, I understand that these changes are non-trivial....
... but I still don't see how you can avoid doubling the peak memory usage in your model. Unless of course you detect when you are writing to ram-fs and then delete the currently held (real) memory block-wise after writing blocks – something that sounds rather inefficient and creates exactly the "special-case-code-overhead" that you are trying to avoid.
Yes, I don't think this should be a requirement, it's just bonus.
We have fully parallel SACAs based on sorting in SeqAn. They work for non-repetitive data like protein sequences, but have problems with repetitive data like genomes. As on-disk SACAs we have Skew-variants, but they require a lot of disk space. If you have better alternatives here that would be very welcome! |
Do you have the mockup somewhere in a separate branch @mpetri? We'd then start implementing SA construction algorithms (at first our own inplace radix implementation from SeqAn). If you know a parallelizable algorithm that performs well in practice and you'd like to have in the SDSL, we're glad for any recommendation! |
I don't think the mockup that I created is very useful. I would just use the
SA construction algorithms that should be included somehow:
|
@mpetri @h-2 @simongog @cpockrandt I am a little puzzeled about the semantics of I can understand this design decision, as it makes the code easier to use. How whould we know that if the last character of the text is So instead it would be indeed cool, if I can just add my own SACA algorithm from outside. It must get a text and an sdsl::int_vector for the SA. In general it is no problem to adapt divsufsort to work on a iterator instead of a One open question that I have is the current design of the construction call. There are basically two public interfaces: void construct(t_index& idx, const std::string& file, cache_config& config, uint8_t num_bytes = 0); and void construct_im(t_index& idx, t_data&& data, uint8_t num_bytes = 0) The first one always expects the cache to be given (also it is not documented). How is this interface intended to be used? I mean I can now cache a SA that might be different to the text that I pass. Basically, I think that I always need to construct something from a text. That is either inmemory or a file which needs to be loaded. So instead of passing the cache_config as a parameter, I would return it from the function with the filled data. If the cache_config is not captured then all temporary resources will be freed. Otherwise, the resources live as long as someone holds a handle on the cache_config. That would allow to still call more or less internal interfaces to construct certain data structures in a lazy fashion but simplifies the standard use case for the clients and makes the both interfaces more alike. So here would be my current design proposal for the construcion functions: cache_config construct(t_index& idx, const std::string& file, uint8_t num_bytes = 0);
text_cache_config<t_data> construct_im(t_index& idx, t_data&& data, uint8_t num_bytes = 0) Later we can discuss about the config system. If we are allowing to set the construction algorithm from the client then the config chain So that is my current take. Maybe I am missing a lot of issues and other design problems so feedback and other solutions/alternatives would be highly appreciated, especially since this is my first rodeo with the sdsl. Many thanks. |
We had already spoken about this with @simongog in march:
For both technical and licensing reasons we need replacements for libdivsufsort. We have both in-memory and external algorithms in SeqAn currently, that we could modernise and contribute if you would like to have them. Alternatively we would require some form interface so that we could use to plug our own algorithms into current functionality.
@simongog suggested you had planned on doing the second thing anyway. Do you have any concrete plans for this?
@cpockrandt what are your thought on this? Please contribute to this discussion as you are the one who will have to do the work either way 😉
The text was updated successfully, but these errors were encountered: