# Refactoring rulinalg in terms of a lower level BLAS-like interface #155
I think this is a solid idea and a good way to stretch into supporting opt-in blas (and others) in the future. A few comments:
I know that this is necessary, but I think we should take some care in doing this in the safest way possible. For example, consider:

```rust
let mut data = Vec::with_capacity(n * n);
unsafe { data.set_len(n * n); }

// C hasn't been initialized but can be efficiently filled
let c = Matrix::new(n, n, data);
let d = dgemm(1.0, 0.0, a, b, c);
```

This is just to illustrate that to be optimal here we need to do some unsafe things.

Note that I think naming is a tricky issue as well. The BLAS names are (perhaps a little too) compact and easy to remember/understand once you are familiar with them. But I think that for new users it can be pretty difficult to figure out which function you need. I'd like to try and work around this where possible. There is also the …

One other concern is that this could be a really huge sub-project. Because of this we need to have a good battle-plan going in. Maybe we can break down the work into small but useful chunks, so that if things get too much we can take a break without stalling development elsewhere.

On one slightly unrelated note: I have been thinking a little more about the arithmetic traits lately. In particular, the fact that we have …
---

You bring up some very good points! The problem of uninitialized input is difficult to resolve within the confines of a safe API. As you say, we can write an internal unsafe one and expose a safe abstraction on top, but this still means we have to deal with unsafe pointer-based APIs spread across the code base, which is also not entirely ideal. I'm having some vague ideas about a …
I think this can be solved by good documentation: basically a table of the mathematical meaning of each operation, with a link to the corresponding function. I'd like to read a bit more about BLIS - it seems very well structured.
I think that until we get specialization, we will probably have to resort to runtime checks. However, I think it will take us a while before we get to the point of specializing individual types for most of the operations in question. I think we can get pretty far with some generic manual loop unrolling and auto-vectorization.
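Until specialization lands, the runtime check could look roughly like the following. This is a minimal sketch; `scale` and `scale_f64` are made-up names, not existing rulinalg API.

```rust
use std::any::TypeId;
use std::ops::MulAssign;

// Hypothetical optimized kernel for f64 (stand-in for explicit SIMD,
// a BLAS call, etc.).
fn scale_f64(data: &mut [f64], factor: f64) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}

// Generic entry point: check the type at runtime and dispatch to the
// specialized kernel when possible; otherwise use the generic loop.
fn scale<T: Copy + MulAssign + 'static>(data: &mut [T], factor: T) {
    if TypeId::of::<T>() == TypeId::of::<f64>() {
        // Safe: we just verified that T is f64.
        let data = unsafe {
            std::slice::from_raw_parts_mut(data.as_mut_ptr() as *mut f64, data.len())
        };
        let factor = unsafe { *(&factor as *const T as *const f64) };
        scale_f64(data, factor);
    } else {
        for x in data.iter_mut() {
            *x *= factor;
        }
    }
}
```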
Agree completely. I think as long as we keep the …
Can I ask why you think it should be the other way around? This is actually an area in which I have a very strong opinion. Basically, it boils down to the following: arrays are the wrong abstraction for linear algebra. Conversely, matrices are the right abstraction for linear algebra.

For arrays, it may make sense for multiplication to mean elementwise multiplication. However, for matrices, matrix multiplication is a very well-defined operation, and you'd be hard pressed to find any mathematician or engineer who would expect `*` to mean anything other than matrix multiplication. The fact that one often works with matrices as if they were arrays in e.g. numpy …

As you can probably tell by now, I definitely don't think you should change it ;-) But I'm curious to hear why you were considering it!
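For illustration, the convention under discussion looks like this today. This is a small sketch assuming rulinalg's current `Matrix::new`, reference `Mul`, and `elemul` methods.

```rust
use rulinalg::matrix::Matrix;

fn main() {
    let a = Matrix::new(2, 2, vec![1.0, 2.0, 3.0, 4.0]);
    let b = Matrix::new(2, 2, vec![5.0, 6.0, 7.0, 8.0]);

    // `*` is matrix multiplication, as in mathematics...
    let product = &a * &b;

    // ...while the elementwise (Hadamard) product has its own method.
    let hadamard = a.elemul(&b);
}
```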
---

I agree that this wouldn't be the best. I think however that we might be able to avoid falling back to pointers:

```rust
extern crate blas_sys;

#[cfg(not(blas))]
unsafe fn gemm<T: Float>(alpha: T, beta: T,
                         a: &MatrixSlice<T>, b: &MatrixSlice<T>,
                         c: &mut MatrixSliceMut<T>) {
    // Implement C = β C + α A B
}

#[cfg(blas)]
unsafe fn gemm<T: Float>(alpha: T, beta: T,
                         a: &MatrixSlice<T>, b: &MatrixSlice<T>,
                         c: &mut MatrixSliceMut<T>) {
    // Check the type of T and call either `dgemm` or `sgemm`
    blas_sys::cblas_dgemm(/* pull all these arguments from the input objects */)
}
```

The above function is unsafe because of the initialization issue in the `c` matrix.
I agree with you here, and actually it is because of this that I designed things as they are now. I also dislike how this is not the case in every other library I use. But I am not too sure whether fighting this convention is such a good idea, and this is basically why I had been considering changing things. For now though, let's keep putting up the good fight and keep things how they are! :)

Edit: My above functions should probably also include some arguments to determine whether the matrices should be transposed. (So far as I'm aware, computing …)
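One possible shape for those transpose arguments, mirroring BLAS's op(A)/op(B) flags. This is a hypothetical sketch, not an agreed-upon API.

```rust
/// Whether an operand should be read as itself or as its transpose.
#[derive(Clone, Copy)]
pub enum Transpose {
    NoTrans,
    Trans,
}

// The gemm signatures above would then grow two extra parameters:
// unsafe fn gemm<T: Float>(trans_a: Transpose, trans_b: Transpose,
//                          alpha: T, beta: T, ...)
```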
---

If I understand you correctly, you want to construct a matrix from uninitialized data and then pass it into the function. I assume that you intend it to be called like:

```rust
let mut data = Vec::with_capacity(n * n);
unsafe { data.set_len(n * n); }
let mut c = Matrix::new(n, n, data);
gemm(1.0, 0.0, &a, &b, &mut c);
```

(I also want to note that this is only well defined in the first place if `T: Copy`.)

I'm a little worried that this is a dangerous pattern to start using. It requires …
I'm confused. MATLAB and most libraries I've used use `*` to denote matrix multiplication.
---

I understand and agree with the concern. However, I'm not sure that there is a good way around this. At some layer in our abstraction, we will want to have the choice to use an uninitialized …
I meant specifically a situation like the following:

```rust
pub struct Float128(f64, f64);

impl Float for Float128 {}
```

But this type has the wrong size, and we invoke undefined behaviour when we pass its raw pointer to BLAS. We would have to either panic at runtime when we see a bad `T` …
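A minimal sketch of what the runtime-panic option might look like. The function name and approach are assumptions, not existing rulinalg code.

```rust
use std::any::TypeId;

/// Hypothetical guard: refuse to hand T off to a BLAS backend unless
/// it is exactly f32 or f64, since anything else has no BLAS routine
/// and may not even be FFI-safe.
fn assert_blas_compatible<T: 'static>() {
    let t = TypeId::of::<T>();
    if t != TypeId::of::<f32>() && t != TypeId::of::<f64>() {
        panic!("the BLAS backend only supports f32 and f64");
    }
}
```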
Even more reason to keep things as they are :). I had got MATLAB mixed up in my head; I thought it also followed numpy's pattern. I've been using arrayfire recently, which also has a dedicated `matmul` function.
---

(This is an answer only to the question of uninitialized data.)

I think it should be possible to encapsulate the possibility of uninitialized data in a slightly more safe manner. Ultimately unsafe will be needed to actually do something with the data, but I'd like to encapsulate this as well as possible. I've been toying with an idea which I will sketch out below. I'm not sure if it works out, but it might be a start.

```rust
trait MatrixBuffer<T> {
    fn is_initialized(&self) -> bool;
    fn rows(&self) -> usize;
    fn cols(&self) -> usize;

    /// Returns a matrix slice whose data is guaranteed to be initialized.
    fn as_mut_slice<'a>(&'a mut self) -> MatrixSliceMut<'a, T>;

    /// Returns a pointer to the mutable data. The data may be uninitialized.
    unsafe fn as_mut_ptr(&mut self) -> *mut T;
}

impl<T> MatrixBuffer<T> for MatrixSliceMut<T> {
    fn is_initialized(&self) -> bool { true }
    unsafe fn as_mut_ptr(&mut self) -> *mut T { self.as_mut_ptr() }
    // Rest of the impl...
}

impl<T> MatrixBuffer<T> for Matrix<T> {
    // Same as MatrixSliceMut
}

struct UninitializedBuffer<T: Copy> {
    data: Vec<T>
}

impl<T: Copy> UninitializedBuffer<T> {
    fn new(rows: usize, cols: usize) -> Self { ... }
    fn into_matrix(self) -> Matrix<T> { ... }
}

impl<T: Copy> MatrixBuffer<T> for UninitializedBuffer<T> {
    // ...
}

fn gemm<T: Float, M: MatrixBuffer<T>>(
    alpha: T,
    beta: T,
    a: &MatrixSlice<T>,
    b: &MatrixSlice<T>,
    c: &mut M) {
    // Implement C = β C + α A B
}

fn main() {
    let (a, b) = ...;

    {
        // Call gemm with an uninitialized buffer
        let mut c = UninitializedBuffer::new(100, 100);
        gemm(1.0, 0.0, &a, &b, &mut c);
        let c = c.into_matrix();
    }

    {
        // Call gemm with an existing/initialized matrix/buffer
        let mut c = Matrix::zeros(100, 100);
        gemm(1.0, 1.0, &a, &b, &mut c);
    }
}
```

I stress that the above is a very rough sketch, and I don't even know if it would work out in terms of lifetimes etc. The idea here is to prevent consumers of these APIs from having to deal with unsafe. Moreover, dealing with uninitialized data in each BLAS-like function is entirely optional: one can just call `as_mut_slice()` and work only with data that is guaranteed to be initialized.

Do you think something like this could maybe work?
---

I think the runtime type check works quite well for this issue. We check the type and specialize if possible; otherwise we fall back to some fairly efficient implementation of our own (that presumably performs loop unrolling in the kernel at a minimum). This way we can also cover integers for the cases in which this makes sense, which BLAS does not handle at all. Hopefully we can move to compile-time specialization in the future.

That said, this approach risks leading us into a situation where we have to provide unnecessary "fallback" implementations. I think it's actually completely OK to restrict the types supported by rulinalg to a specific set. For example, as you say, we could provide a trait …
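One minimal way such a restriction could look. This is a sketch; `BlasScalar` and the sealing pattern are illustrative, not a settled design.

```rust
// A private module "seals" the trait so downstream crates cannot add
// implementations, keeping the supported scalar set closed.
mod private {
    pub trait Sealed {}
    impl Sealed for f32 {}
    impl Sealed for f64 {}
}

/// The closed set of scalar types the BLAS-like module accepts.
pub trait BlasScalar: private::Sealed + Copy {}
impl BlasScalar for f32 {}
impl BlasScalar for f64 {}
```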
---

It looks quite appealing in the way you've sketched it out. However, I'm not convinced we can allow the user to safely construct an `UninitializedBuffer` …

All that said, I do see the benefit of an `UninitializedBuffer` …

I think what you laid out here is a good approach. And fortunately, feature flags work in our favour here: a user will be able to use the BLAS-like functions on integers etc. if they haven't opted in to BLAS. But when they activate the flag, the compiler will only accept …

However, we would probably want to stick to poorly optimized functions for anything but floats initially. And tackling …
---

Actually, the way I had it in mind, it should be completely safe. At least, to my understanding, … The conclusion is that accessing any safe methods on `UninitializedBuffer` …
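A sketch of why the constructor itself can be a safe function under this reasoning, assuming the `UninitializedBuffer` design from earlier in the thread; the details are illustrative.

```rust
pub struct UninitializedBuffer<T: Copy> {
    data: Vec<T>,
    rows: usize,
    cols: usize,
}

impl<T: Copy> UninitializedBuffer<T> {
    // This can be a safe fn: the allocation via Vec::with_capacity is
    // fine, and the type never exposes the uninitialized contents
    // through its safe methods - only through the unsafe as_mut_ptr.
    pub fn new(rows: usize, cols: usize) -> Self {
        let mut data = Vec::with_capacity(rows * cols);
        unsafe {
            data.set_len(rows * cols);
        }
        UninitializedBuffer { data, rows, cols }
    }
}
```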
Actually, I think we should do BLAS integration in a different way. Specifically, opting in to BLAS should not change the API in any way. Just imagine if you're using both integers and floats in your application, and you enable BLAS: suddenly the code involving integers simply breaks because BLAS does not support it. Instead, I think we should make the BLAS integration transparent, in that the BLAS implementations are called for floating point types "under the hood". For now we'd have to rely on runtime checks, but eventually we can hopefully replace them with generic specialization.

Also, a significant barrier to BLAS support is that, as far as I know, BLAS only supports column-major matrices...?

EDIT: It seems that CBLAS and similar also support row-major matrices.
---

I think that this could work, actually. At the very least we should try this idea out and see if it gives us what we want with the additional safety.
I can see your argument here, and I think the implementation isn't much different. Instead of replacing the whole function with a typed BLAS abstraction, we would keep the same …
I know that blas-sys allows row-major inputs, but I'm not sure if there is any performance hit for using this. I think that row and column strides let us encode both column-major and row-major indexing, and as a result it is possible to optimize a BLAS library to be efficient for both.
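For reference, this is the stride trick being referred to, in a minimal sketch with illustrative names.

```rust
// With explicit strides, one indexing formula covers both layouts.
fn element_offset(i: usize, j: usize, row_stride: usize, col_stride: usize) -> usize {
    i * row_stride + j * col_stride
}

fn main() {
    let (m, n) = (3, 4);
    // Row-major m x n: moving down a row jumps n elements.
    assert_eq!(element_offset(1, 2, n, 1), 6);
    // Column-major m x n: moving right a column jumps m elements.
    assert_eq!(element_offset(1, 2, 1, m), 7);
}
```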
---

Yeah, I agree it's worth trying out. We could attempt to rewrite one of the BLAS-1 routines as an experiment. I'm still partial to a BLIS-like interface instead of BLAS, however, because it seems to incur a lot less work to implement it from scratch compared to BLAS. I will try to read more about BLIS when I have the time. As far as I understand, a BLAS API can be written on top of BLIS, so it might make sense in either case.
Yes, exactly!

I just had an idea. Not sure if it works out. How about we implement each BLAS-like function as a trait, and implement it for each datatype? You'd have roughly:

```rust
impl Gemm<...> for f64 {
    fn gemm(...) {}
}

// Call gemm for f64
f64::gemm(...)
```

However, could this make writing generic code harder? Or could we define the `Real` trait along the lines of:

```rust
pub trait Real: Gemm + <other BLAS functions> {}
```

This way we could still write generic functions in terms of the `Real` trait.

In the end I think this gets unnecessarily complicated compared to runtime type checking, but it's perhaps still an interesting idea.
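To make the trait-per-function idea slightly more concrete, here is a hedged sketch. All names and signatures are hypothetical, and plain slices stand in for matrix types.

```rust
// One trait per BLAS-like routine, implemented per scalar type.
pub trait Gemm: Copy {
    /// C = beta * C + alpha * A * B, with row-major m x k, k x n, m x n operands.
    fn gemm(alpha: Self, beta: Self, a: &[Self], b: &[Self], c: &mut [Self],
            m: usize, n: usize, k: usize);
}

impl Gemm for f64 {
    fn gemm(alpha: f64, beta: f64, a: &[f64], b: &[f64], c: &mut [f64],
            m: usize, n: usize, k: usize) {
        // Naive triple loop; a BLAS-backed build could call dgemm here instead.
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0;
                for l in 0..k {
                    acc += a[i * k + l] * b[l * n + j];
                }
                c[i * n + j] = beta * c[i * n + j] + alpha * acc;
            }
        }
    }
}

// The supertrait then lets generic code ask for "all BLAS functions" at once.
pub trait Real: Gemm {}
impl Real for f64 {}
```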
---

BLIS is exciting. When I evaluated it (probably a whole year ago), I concluded it's not a drop-in solution with multiple great implementations, the way BLAS is. At that time, BLIS had very high overhead for small problems. I can't really quantify that, but consider that OpenBLAS also makes sure that even small matrix operations are quick. Python's numpy would also benefit from BLIS but doesn't use it, from what I understand.
---

@bluss @Andlon - I have to admit that I don't know too much about BLIS. I also looked into it around a year ago, when you first published your …

I worry that we lose some functionality this way. For example, what if a user has their own custom type which implements `Float`? We would presumably need something like:

```rust
impl<T: Float> Gemm<T> for T { }
```
---

Thanks for your input, @bluss! Did you have the impression that the overhead for small matrices is a fundamental restriction of the architecture of BLIS, or that it is/was an implementation detail?

The reason I'm curious to see if we can use something inspired by BLIS for rulinalg is that it seems to provide a shorter path to a reasonably performant system. Whereas implementing something BLAS-like seems to require optimizing code in many, many places, the promise of BLIS is that you can implement all operations in terms of a comparatively small number of kernels, and that optimizing these kernels automatically leads to optimized code for the rest of the routines. My point is that BLIS seems to offer a greater performance/effort ratio, which in the case of …

It might seem unnecessary to implement a BLIS-like API just to build a BLAS-like API on top, but if implementing a relatively high-performance BLIS-like API from scratch is significantly simpler than implementing a relatively high-performance BLAS-like API from scratch, it would probably still be worthwhile. Also note that I still haven't had the time to actually read any of the BLIS papers yet. I hope to get around to it soon, but I'm pretty busy these days!
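For readers unfamiliar with BLIS, the idea being referenced is roughly the following. This is a toy sketch; the block sizes and packing layout are illustrative only.

```rust
// In BLIS, all level-3 routines reduce to loops that pack operands into
// contiguous panels and call one tiny, hand-optimized micro-kernel.
const MR: usize = 4; // rows of the C micro-tile
const NR: usize = 4; // columns of the C micro-tile

/// Computes C += A_panel * B_panel for an MR x NR tile of C, where the
/// panels are packed so that step l holds column l of A (MR values)
/// and row l of B (NR values).
fn micro_kernel(k: usize, a_panel: &[f64], b_panel: &[f64], c: &mut [f64; MR * NR]) {
    for l in 0..k {
        for i in 0..MR {
            let a = a_panel[l * MR + i];
            for j in 0..NR {
                c[i * NR + j] += a * b_panel[l * NR + j];
            }
        }
    }
}
```

Only this kernel needs architecture-specific tuning; everything else is portable blocking and packing code.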
---

Yeah, this is a trade-off. Though keep in mind that in almost all cases, users will use either `f32` or `f64` …

It is also possible to provide default implementations of the traits (assuming …
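A sketch of what such a default implementation could look like, assuming the hypothetical `Gemm` trait above and the `num` crate's `Zero` for a starting value.

```rust
use std::ops::{Add, Mul};
use num::Zero;

// A default method body gives every eligible type the naive fallback
// for free; specific types like f64 can override it with something fast.
pub trait Gemm: Copy + Zero + Add<Output = Self> + Mul<Output = Self> {
    fn gemm(alpha: Self, beta: Self, a: &[Self], b: &[Self], c: &mut [Self],
            m: usize, n: usize, k: usize) {
        for i in 0..m {
            for j in 0..n {
                let mut acc = Self::zero();
                for l in 0..k {
                    acc = acc + a[i * k + l] * b[l * n + j];
                }
                c[i * n + j] = beta * c[i * n + j] + alpha * acc;
            }
        }
    }
}

// Opting in becomes a one-liner for a custom numeric type.
impl Gemm for i32 {}
```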
---

I think the BLIS/BLAS composition you described sounds like a reasonable approach. I'm not sure exactly how it will work (due to a lack of familiarity with BLIS), but it seems like a good way to kick things off. I think one of the more important aspects is that rulinalg works with the higher-level BLAS API, so that we can easily swap in BLAS bindings if we need to (having the option to do this at the BLIS level would be great too!).
All of the arithmetic operations would work for complex types with the appropriate trait bounds. But most of our decompositions require floats explicitly, as complex types change some of these algorithms (in mostly minor ways, I think).
Yes, me too. Maybe we can explore doing this through a trait interface later, once things are working.
---

Here's a talk about something BLIS-like. Maybe too multidimensional for y'all, but there are always ideas worth taking: https://www.youtube.com/watch?v=FbsJ6urR4x0
---

That was a very interesting talk, @bluss! I've never really had the need to work with tensors, so it's not my field of expertise, but I quite enjoyed seeing how the matrix packing that BLIS employs can be used to efficiently compute tensor contractions. Very cool!
---

As I've been working on rewriting some of our decompositions, it has become increasingly clear that we could massively benefit from a rulinalg-specific BLAS-like API. More specifically, I propose the creation of a new `blas` module which has BLAS-like functions covering different functionality from the three levels of BLAS.

I propose rewriting all matrix/vector arithmetic (i.e. `Add`, `Sub`, `Mul` implementations) in terms of such functions. This has some significant benefits, among them that we no longer need to rely on `Mul`/`Add`/etc. to do the right thing (which they won't when computing something like `C = 2 * C + 3 * A * B`; in this case you'd have unnecessary memory allocations).

I'm not decided on the exact naming of the functions. I think we should be somewhat BLAS-compatible in order to easily facilitate the aforementioned opt-in BLAS support, but I'm more a fan of the BLIS set of routines. See also this paper. I became aware of this framework through @bluss's blog post on `matrixmultiply`.

We need these APIs to be as generic as possible. I think they need to follow these rules:

- Take `MatrixSlice(Mut)` for matrices and `&[T]` or `&mut [T]` for vector-like types. The reason for the regular slice instead of e.g. `Vector` is that some algorithms require you to treat e.g. partial rows of matrices as if they were vectors. This was the case in my implementation of the "GAXPY-rich Cholesky" in the Cholesky decomposition rewrite (initial work, #150).

An example (see the sketch below): we could use this to e.g. implement matrix-vector products, and this particular pattern comes up a lot in linear algebra algorithms.
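The original example snippet did not survive extraction; the following is a hedged sketch of the kind of slice-based routine described, in the spirit of a BLAS-1 `axpy`. The name and trait bounds are illustrative.

```rust
use std::ops::{Add, Mul};

/// y <- y + alpha * x, over plain slices, so it works equally well on
/// a full vector's data or on part of a matrix row.
fn axpy<T>(alpha: T, x: &[T], y: &mut [T])
    where T: Copy + Add<Output = T> + Mul<Output = T>
{
    assert_eq!(x.len(), y.len());
    for (yi, &xi) in y.iter_mut().zip(x) {
        *yi = *yi + alpha * xi;
    }
}
```

A matrix-vector product can then be phrased as a sequence of such `axpy` calls over the columns of the matrix, which is exactly the GAXPY formulation mentioned above.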
If people agree that this could be a good idea, I suggest we start by making this `blas` (or whatever name it should have) module entirely private for now. Then we can refactor in multiple stages and make decisions about the API. Hopefully some time in the future we'll be confident enough in our choices to make it a public module.

Thoughts/questions/ideas/criticism of this idea are very welcome!