Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimisations #32

Open
dasunpubudumal opened this issue Jun 20, 2024 · 4 comments · May be fixed by #33
Open

Optimisations #32

dasunpubudumal opened this issue Jun 20, 2024 · 4 comments · May be fixed by #33

Comments

@dasunpubudumal
Copy link
Contributor

dasunpubudumal commented Jun 20, 2024

Now that we have started writing tests in #22, we could start adding some optimisations.

We can start optimisations with src/tpf_fasta.rs, as the file is being covered by tests, and is likely to be fully covered fairly soon.

I will update this issue with comments when I discover any other optimisations possible.

@dasunpubudumal dasunpubudumal linked a pull request Jun 20, 2024 that will close this issue
@dasunpubudumal
Copy link
Contributor Author

dasunpubudumal commented Jun 20, 2024

Here are some optimisations that I have uncovered:

  1. get_uniques function can be optimised by introducing a std::collections::HashSet. It should not invoke to_owned() on new scaffold strings, as it will use String's Clone trait to clone the value, which is undesirable.

It could be better if we can update the function as follows:

pub fn get_uniques(tpf_list: &[Tpf]) -> Vec<&String> {
    let mut uniques = HashSet::new();
    for tpf in tpf_list {
        uniques.insert(&tpf.new_scaffold);
    }
    uniques.into_iter().collect()
 }

This will not copy values in-memory, as it will only use Deref trait when trying to insert values to the set. Note that this changes the function signature, and it demands a refactoring where the function is used.

  1. subset_vec_tpf function could be optimised by using Rust's filter function provided to iterators.
pub fn subset_vec_tpf<'a>(
    tpf: &'a [Tpf],
    fasta: (&std::string::String, &usize),
) -> Vec<&'a Tpf> {
    //
    // Subset the Vec<TPF> based on a search through the fasta
    //
    tpf.iter().filter(|&i| i.ori_scaffold == *fasta.0).collect()
}

@dasunpubudumal
Copy link
Contributor Author

dasunpubudumal commented Jul 3, 2024

  1. File logic can be refactored. In src/tpf_fasta.rs:166, both File::Create and OpenOptions are used to create an open a file. Instead, the following could be used (without using File::Create):
let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(output);

OpenOptions docs suggests:

This builder exposes the ability to configure how a File is opened and what operations are permitted on the open file. The File::open and File::create methods are aliases for commonly used options using this builder.

@dasunpubudumal
Copy link
Contributor Author

  1. The usage of stacker::maybe_grow in tpf_fasta needs to be carefully considered.

From maybe_grow function documentation:

Once a program has reached the end of its stack, a temporary stack on the heap is allocated and is switched to for the duration of a closure.

This function is intended to be called at manually instrumented points in a program where recursion is known to happen quite a bit. This function will check to see if we're within red_zone bytes of the end of the stack, and if so it will allocate a new stack of at least stack_size bytes. The closure f is guaranteed to run on a stack with at least red_zone bytes, and it will be run on the current stack if there's space available.

There's always box-types (Rust's equivalent to C++ smart pointers) which are there to allocate memory in the heap. Allocating extra stack space in the heap is a little unconventional. Stack is for allocating types that the compiler can identify the size of at compile time. Allocating something in the heap (e.g. recursive data structures like linked lists) incurs an additional overhead of a system signal to call a memory allocator but that is sort of a requirement to facilitate dynamically sized types such as vectors and strings.

@dasunpubudumal
Copy link
Contributor Author

  1. Updates to the logic in curate_fasta function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant