Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating graph or mesh should be marked unsafe #11

Closed
gfaster opened this issue Jul 22, 2023 · 9 comments · Fixed by #13
Closed

Creating graph or mesh should be marked unsafe #11

gfaster opened this issue Jul 22, 2023 · 9 comments · Fixed by #13

Comments

@gfaster
Copy link
Contributor

gfaster commented Jul 22, 2023

The problem

All partition functions in METIS will cause undefined behavior if given not well formed CSR graphs.

For example:

#[test]
fn segfault() {
    let adjncy = &mut [1, 9999];
    let xadj = &mut [0, 2];
    let part = &mut [0];
    Graph::new(1, 1, xadj, adjncy).part_recursive(part).unwrap();
}

This is expected as METIS does unchecked array indexing using values in adjncy in many places.

Because of this, the API is unsound and should be changed.

How can this be fixed

According to the Rust API Guidelines, the way to fix this and keep the signatures the same will be to add a validation function that checks every value of adjncy that it is less than xadj.len() - 1. The current implementations could be moved to an _unchecked version of the function.

impl Graph {
    fn new(...) -> Result<Graph> {
        Graph::validiate(...)?;
        unsafe { Graph::new_unchecked(...) }
    }
    unsafe fn new_unchecked(...) -> Graph {
        // current implemtation
        // maybe it includes existing asserts?
    }
}

// same process with Mesh
@hhirtz
Copy link
Member

hhirtz commented Jul 24, 2023

Good catch! Thanks for the comprehensive bug report. Your fix looks good but is a breaking change. We could raise the error in part_recursive and part_kway that already return a Result, but I guess it's better to return the error as early as possible? I will push a fix and this will go into v0.2.

@hhirtz
Copy link
Member

hhirtz commented Jul 24, 2023

Actually, there is an issue. We cannot do any meaningful check in Graph::new, because doing so requires knowing the numbering used: https://docs.rs/metis/latest/metis/option/enum.Numbering.html.

This is a pretty weird feature IMO, and I don't know anyone who uses this (let alone with these rust bindings). On the other hand, I don't want to be too opinionated and would rather keep the API close to metis.h.

So we can either

  • raise the error late in part_recursive and part_kway. The check happens at every partition call, but it does not break the API.
  • raise the error in Graph::new. It breaks the API and we'd have to drop the numbering feature, but the error is raised early and the check is done only once.

@cedricchevalier19
Copy link
Member

This is a pretty weird feature IMO, and I don't know anyone who uses this (let alone with these rust bindings). On the other hand, I don't want to be too opinionated and would rather keep the API close to metis.h.

I do not think a lot of Rust users use numbering starting from 1.

So we can either

* raise the error late in `part_recursive` and `part_kway`. The check happens at every partition call, but it does not break the API.

* raise the error in `Graph::new`. It breaks the API and we'd have to drop the numbering feature, but the error is raised early and the check is done only once.

Please do the second option. Not having Fortran arrays is not a problem. It will also avoid a lot of bugs playing with offsets (my personal experience dealing with offset in Scotch).

@hhirtz
Copy link
Member

hhirtz commented Jul 24, 2023

I do not think a lot of Rust users use numbering starting from 1.

This is true. At least all CSR crates that I know of start from 0 (sprs, nalgebra-sparse and russell_sparse).

It will also avoid a lot of bugs playing with offsets (my personal experience dealing with offset in Scotch).

METIS plays pretty safe and decrements all values from xadj and adjncy by 1 before partitioning 1. This is a pain for us however because it's not thread safe and requires mutable refs everywhere.

If we prevent Fortran numbering maybe we could use shared references instead and reopen #8. However METIS could still be mutating input values in other places.

Footnotes

  1. https://github.com/KarypisLab/METIS/blob/e0f1b88b8efcb24ffa0ec55eabb78fbe61e58ae7/libmetis/pmetis.c#L116-L120
    https://github.com/KarypisLab/METIS/blob/e0f1b88b8efcb24ffa0ec55eabb78fbe61e58ae7/libmetis/fortran.c#L16-L28

hhirtz added a commit that referenced this issue Jul 27, 2023
This fixes the unsound Graph::new and Mesh::new.
See the following report for more details:

<#11>

Co-authored-by: Gavin Rohrer <[email protected]>
@oisyn
Copy link
Contributor

oisyn commented Oct 4, 2023

Just to add my 2cts, I was experiencing spurious segfaults and heap corruption. I spent a considerable amount of time debugging the issue, only to find out that it was my input that was causing the problem.

The cause turned out to be that I was passing a weight of 0 for some edges. I meticulously verified my data, but this was something I overlooked and I figured it out when re-reading the METIS API documentation. It would've been a life saver if this was checked by the crate 😄.

I would therefore opt for the first option, raising the error on the part_* calls, because then you'll be able check additional data (in my case, the edge weights) as well.

@cedricchevalier19
Copy link
Member

I think we have to do a lot better at checking data before calling Metis, that is kind of fragile depending on user input. Raising errors should be generalized to all functions, with unchecked versions as well.
We should not panic.

As we speak about API changes, we should think about not having the partitioning function tied to the Graph structure and perhaps use a builder pattern to construct the data.

@gfaster
Copy link
Contributor Author

gfaster commented Oct 5, 2023

I've been doing a lot of investigation into the METIS source in my efforts to port it to Rust and there are a whole bunch of other subtle ways you can cause undefined behavior. For example, graphs with degree that is "too large" will overflow the index in a way that is very difficult to predict (for example, in kwayrefine.c). There doesn't seem to be a safe way of doing it short of forcing METIS itself to do a ton more checks.

Edit: the example in kwayrefine.c is worse than it seems, since it's most likely to trigger in a coarsened graph, rather than the original. You could potentially calculate the maximum possible interior degree in the coarsest graph, but that will be expensive and still likely incomplete.

@oisyn
Copy link
Contributor

oisyn commented Oct 6, 2023

I'm usually a fan of "don't let perfect be the enemy of good enough", so I would opt for adding the checks even if we can't cover them all. However, then calling the variant with the checks 'safe' is a different question altogether, as technically it still isn't. But still better than the current situation, though...

@hhirtz
Copy link
Member

hhirtz commented Oct 9, 2023

Yeah. If only there was a clear specification of what METIS accepts as input (whether CSR indices are sorted, value ranges, etc. the manual doesn't specify these if i remember correctly), then we could do those checks in a "safe" wrapper. But the manual is pretty light on details, and with every change upstream can add another way to trigger undefined behavior. Technically {Graph,Mesh}::part_* & mesh_to_dual must be marked unsafe then, even with the checks.

As Cédric said we can also do the checks/raise errors early in all set_* functions. That would double the API size (with unchecked variants) but allow the Graph and Mesh structures to be reused without doing all the checks.

A "good" thing about this is that we don't have to add all the checks at once. Once the API is fixed they can be added incrementally i think, at least if all functions return a Result, something like:

impl<'a> Graph<'a> {
    pub fn new(...) -> Result<Self, ...> { ... }
    pub unsafe fn new_unchecked(...) -> Self { ... }

    // could also take `self`, but then the user must
    // be able to retrieve the graph from the error type
    pub fn set_*(&mut self, ...) -> Result<&mut Self, ...> { ... }
    pub unsafe fn set_*_unchecked(&mut self, ...) -> &mut Self { ... }

    // Take a &mut self instead of owned so that
    // Graph can be reused without doing all the checks
    pub unsafe fn part_*(&mut self, ...) -> Result<...> { ... }

    // no need for part_*_unchecked, if all checks
    // happen in set_* and new? we can always
    // add it later without breaking compatiblity
}

// same with Mesh

pub unsafe fn mesh_to_dual(...) -> Result<Dual, ...> { ... }

I currently don't have time to continue work on #13 so feel free to take it from there if you want to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants