Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LIBM - sin, cos, ln, exp and friends. #126

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

andy-thomason
Copy link

This is a placeholder for implementing libm functions in stdsimd.

I would appreciate help and advice to integrate with existing work and create a wide range of tests.

@programmerjake
Copy link
Member

We'd want to translate sin/cos/etc. to LLVM's built-in functions since there are some architectures with native instructions for those (Libre-SOC's SimpleV will have them, SPIR-V has them).

Also, mul_add should (arguably) translate to llvm's fma intrinsic: #102

@andy-thomason
Copy link
Author

It is worth noting that LLVM's default behaviour is to call libm on each scalar element,
which is hundreds (possibly thousands) of times slower.

Whilst many cpus have had microcoded trig functions, they were not widely used by game
developers as they tend to have very low throughput. It would be interesting to compare the two.
It is very hard to beat densely packed FMAs.

Time will tell, I guess.

@andy-thomason
Copy link
Author

Worth looking at the intel trig intrinsics.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SVML&expand=5236,5236&text=_mm256_sin_ps

If we are to have a race.

@programmerjake
Copy link
Member

It is worth noting that LLVM's default behaviour is to call libm on each scalar element,
which is hundreds (possibly thousands) of times slower.

Yup, part of #109 is adding support for vector-math to llvm so llvm will generate calls to vector-math when there aren't good native instructions available.

Whilst many cpus have had microcoded trig functions, they were not widely used by game
developers as they tend to have very low throughput. It would be interesting to compare the two.

In the case of rustc targeting SPIR-V or Libre-SOC's SimpleV, using sin intrinsics is by far the best approach since many GPUs (including Libre-SOC's cores) do have high-performance hardware for calculating sin/cos/etc. faster than using a bunch of FMAs.

It is very hard to beat densely packed FMAs.

True.

Time will tell, I guess.

Copy link
Member

@workingjubilee workingjubilee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! Thank you for your patience with review!

While I agree that we would want a hardware-accelerable sin(), I feel doubtful that such is a priority: for the vast majority of architectures, we can better-specify what we need using software. As such, I would consider "and hardware-accelerated where possible" to be the optimization here, and so we can start with strictly software.

I am more interested in adding tests to this that check that we are within some measure of sanity of the current scalar sin implementation, using the existing proptest framework. Unlike most of those tests, a fairly fuzzy "within margins" check is fine here. You can find most of those in macro form in https://github.com/rust-lang/stdsimd/blob/7b66032ed511229410b0879d384b06cfcacf0538/crates/core_simd/tests/ops_macros.rs

I am also curious if this is more or less precise than such, on average.

Comment on lines 7 to 11
// This does not seem to be implemented, yet.
#[inline]
fn mul_add(self, mul: Self, add: Self) -> Self {
self * mul + add
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This does not seem to be implemented, yet.
#[inline]
fn mul_add(self, mul: Self, add: Self) -> Self {
self * mul + add
}

A hardware-accelerable version of this was implemented in #138.

Copy link
Member

@thomcc thomcc Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that version compiled to libm calls when hardware support isn't guaranteed to be available? Feels like a dubious tradeoff for this if so, especially given that this is already computing an approximated fsin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because it is IEEE-754 compliant.

In general since we have a path forward for fixing the libm problem in LLVM we shouldn't get hung up on it elsewhere.

/// assert!((x.sin() - SimdF32::<8>::splat((1.0_f32).sin())).abs().horizontal_max() < 0.0000002);
/// ```
#[inline]
pub fn sin(&self) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Would this implementation be shared for f64, same values but different types? If so, we could macro this for applying to both, though my math brain is feeling deeply skeptical that such is how it works at the moment. It's been A Minute since I did trigonometry.

@calebzulawski
Copy link
Member

@workingjubilee I think I disagree with that sentiment--we already have sin and cos platform intrinsics, right? And we could choose to link against libmvec (which is limited but does provide trig) right now if we wanted to, which would support the vast majority of use cases. I'm not sure it makes sense when that seems to be a strictly better choice to me.

@workingjubilee
Copy link
Member

@workingjubilee I think I disagree with that sentiment--we already have sin and cos platform intrinsics, right? And we could choose to link against libmvec (which is limited but does provide trig) right now if we wanted to, which would support the vast majority of use cases. I'm not sure it makes sense when that seems to be a strictly better choice to me.

Ah, well, an implementation of linkage against libmvec would also be fine then, I suppose?

@andy-thomason
Copy link
Author

As an update.

Sorry about the lack of progress, but the day job is intervening.

I now have a full set of f32 and most of the f64 implementations.

I still need to wrap them in a trait and add optional limit checks and
low-value approximations (to make ULP fans happy).

Linking with a shared library (if it is) will work but is not desirable as the call overhead and
lack of inlinability will substantially reduce performance.

This is what happens currently with the scalar libm, and the performance is a couple of orders
of magnitude off optimal.

@andy-thomason
Copy link
Author

Could I get some help creating a test strategy for this?

I now have a more or less complete set (draft) of all the libm functions use by std.

With a few exceptions (such as tan) they have about the same error profile as the
std (glibc) versions.

I have a number of tests which include tables of accurate results generated using BigDecimal
which we can compare the std functions and our functions for errors.

We would ideally iterate over the generated table comparing the vector results with the
accurate results and ensuring that the error falls within bounds.

Another test would be to compare most of the range of values with the std version to
check that this is in bounds, also.

I am hoping to publish two versions of this, a fast version and a conforming version.

The fast versions use single polynomial evaluations (like my stub example) and the
conforming versions use segmented evaluations and have extra checks for small
values and overflows. Other permutations are possible.

The current tests use the same function and test for f32 and f64, which is not the case
for libm functions. The f64 versions have more terms and may even have different modalities.

Any help would be welcome.

The current scalar generated functions are here:

https://github.com/extendr/doctor-syn/tree/accurate-evaluation/tests

@workingjubilee workingjubilee force-pushed the master branch 4 times, most recently from cc78a1c to ab8eec7 Compare October 22, 2021 01:21
@workingjubilee
Copy link
Member

workingjubilee commented Jan 19, 2022

Hello! I am really sorry for not getting back to you earlier: we got sidetracked by "wait, how do we get things moving into std?" and "wait, how do we deal with any of the libm problem at all?" on our end. As a result, we have moved float functions that may be scalarized by LLVM to call libm (and thus depend on runtime support) into a new crate that implements a new trait, added in this commit:
ecc00ef

That trait will become available as std::simd::StdFloat with an upcoming (soon?) sync PR into libstd. Ideally we will find a more lasting solution, but if you want to call fn mul_add conditionally, that is where that function is currently exposed.

I will try to get back to you very shortly on the test strategy question.

@andy-thomason
Copy link
Author

Thanks, @workingjubilee, I've been following your progress and you have been doing great things these last few months.

I now have the generator as a tool - https://github.com/doctor-syn/doctor-syn/tree/main/libmgen. This generates Rust and C
code, but could easilly also generate SIMD code.

The tool generates accurately generated tables of (x, y) values (to 40+ digits) for testing. Accuracy of f64 functions
are typically good to n * 10-16 where n, a small integer, depends on the input value - floating point numbers
are more accurate when small. We could make more accurate versions at some performance cost - for example using
gather and look-up tables - and could make versions which give good accuracy for small input values (sin(x) = x for small x).

We don't handle range limits and NANs at the moment, but it is easy to add these. LLVM's round, min and max
are suboptimal, no doubt due to compatibility with libm, which has strange NAN behaviour - max(NAN, x) = x.

Use cases are quite variable from games which are usually latency sensitive to analysis which is throughput sensitive.

@workingjubilee
Copy link
Member

Ah, apparently "try" did not happen very fast.

Interesting, the tests currently fail on my local in debug, but only for the asinh:

---- test_libm32::asinh_f32 stdout ----
thread 'test_libm32::asinh_f32' panicked at '
x     =[ -0.9978027343750000,  -0.9977722167968750,  -0.9977416992187500,  -0.9977111816406250]
e     =[  0.0000000000000000,  -0.0000001192092896,  -0.0000000596046448,   0.0000002980232239]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[ -0.8798189759254456,  -0.8797975182533264,  -0.8797759413719177,  -0.8797540068626404]
scalar=[ -0.8798189759254456,  -0.8797973990440369,  -0.8797758817672729,  -0.8797543048858643]
vector=[000000000000bf613bd1, 000000000000bf613a69, 000000000000bf6138ff, 000000000000bf61378f]
scalar=[000000000000bf613bd1, 000000000000bf613a67, 000000000000bf6138fe, 000000000000bf613794]
vector_fn=|x: f32x4| x.asinh()', crates/std_float/src/test_libm32.rs:651:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/panicking.rs:142:14
   2: std_float::test_libm32::asinh_f32
             at ./src/test_libm32.rs:651:5
   3: std_float::test_libm32::asinh_f32::{{closure}}
             at ./src/test_libm32.rs:647:1
   4: core::ops::function::FnOnce::call_once
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5

Doesn't look much different on my Ryzen with optimizations on:

---- test_libm32::asinh_f32 stdout ----
thread 'test_libm32::asinh_f32' panicked at '
x     =[ -0.9978027343750000,  -0.9977722167968750,  -0.9977416992187500,  -0.9977111816406250]
e     =[  0.0000000000000000,  -0.0000001192092896,  -0.0000000596046448,   0.0000002980232239]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[ -0.8798189759254456,  -0.8797975182533264,  -0.8797759413719177,  -0.8797540068626404]
scalar=[ -0.8798189759254456,  -0.8797973990440369,  -0.8797758817672729,  -0.8797543048858643]
vector=[000000000000bf613bd1, 000000000000bf613a69, 000000000000bf6138ff, 000000000000bf61378f]
scalar=[000000000000bf613bd1, 000000000000bf613a67, 000000000000bf6138fe, 000000000000bf613794]
vector_fn=|x: f32x4| x.asinh()', crates/std_float/src/test_libm32.rs:651:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/panicking.rs:142:14
   2: std_float::test_libm32::asinh_f32
   3: core::ops::function::FnOnce::call_once
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@andy-thomason
Copy link
Author

Thanks for looking at this @workingjubilee .

For the moment, I'll just push the limit up a bit. We can do some more work on accuracy once the complete set is in place.

@workingjubilee
Copy link
Member

Thanks for looking at this @workingjubilee .

For the moment, I'll just push the limit up a bit. We can do some more work on accuracy once the complete set is in place.

Yeah, I agree that accuracy-tuning is not as important as having the option for an implementation that is "std-invariant".

Thank you for adjusting for formatting!

@workingjubilee
Copy link
Member

---- test_libm::ln stdout ----
thread 'main' panicked at '
x     =[ -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000]
e     =[                 NaN,                  NaN,                  NaN,                  NaN]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[                 NaN,                  NaN,                  NaN,                  NaN]
scalar=[                 NaN,                  NaN,                  NaN,                  NaN]
vector=[000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000]
scalar=[0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff]
vector_fn=|x: vector_type| x.ln()', crates/std_float/src/test_libm.rs:416:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- test_libm::ln stdout ----
thread 'main' panicked at '
x     =[ -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000]
e     =[                 NaN,                  NaN,                  NaN,                  NaN]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[                 NaN,                  NaN,                  NaN,                  NaN]
scalar=[                 NaN,                  NaN,                  NaN,                  NaN]
vector=[000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000]
scalar=[0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff]
vector_fn=|x: vector_type| x.ln()', crates/std_float/src/test_libm.rs:416:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Hmm, @programmerjake is there anything weird about MIPS we should be alert to here? That's the only one we're failing this test on.

@programmerjake
Copy link
Member

Hmm, @programmerjake is there anything weird about MIPS we should be alert to here? That's the only one we're failing this test on.

I'd guess it might have something to do with old MIPS having the signalling/quiet NaN bit inverted from what's in later IEEE 754 standards?

@programmerjake
Copy link
Member

if we're checking for bit equality, floats should only check like so:

fn check_eq(a: f32, b: f32) -> bool {
    if a.is_nan() || b.is_nan() {
        a.is_nan() && b.is_nan()
    } else {
        a.to_bits() == b.to_bits()
    }
}

@andy-thomason
Copy link
Author

I was hoping to replicate NaNs exactly, but there seem to be no standards in libm variants.

MIPs uses signalling NaNs and some libs use NaN in place of Inf.

For example:

log2(-1.0) Should be (non-signalling) NaN -> gnu uses -NaN
log2(0.0) Should be -Inf

So I have to agree with @programmerjake that we should just test for NaNness

We also have some issues with denormals on some CPUs which affect std::fxx::MAX and MIN_POSITIVE.
I generally prefer if libs give the same answer on all platforms, rather than specialising them.

@workingjubilee
Copy link
Member

Hm. Which CPUs are having the denormal problems?

@andy-thomason
Copy link
Author

Mostly i586 variants failing. Most likely a problem with the os libm, but could be using 80387 instructions.

@workingjubilee
Copy link
Member

Oh, yes. Those are going to attempt to use the x87 FPU for f64 at the very least, maybe for f32s as well.

@andy-thomason
Copy link
Author

andy-thomason commented Mar 14, 2022

I've added some initial benchmarks:

running 3 tests
test sin_f32 ... bench: 4,289 ns/iter (+/- 21)
test sin_f32x16 ... bench: 292 ns/iter (+/- 78)
test sin_f32x4 ... bench: 608 ns/iter (+/- 18)

These show about a 20x performance increase over libm.
Larger vector sizes allow interleaving of instructions even though we
only have f32x8 registers (ymms) on my laptop.

You need to build with:

RUSTFLAGS="-C target-cpu=native" cargo bench

On modern hardware or the code gen is apalling.

@andy-thomason
Copy link
Author

I would request:

  • Simd::from(scalar) instead of splat -> this would enable functions to take Into<Simd> everywhere.
  • Much larger vector sizes f32x64 and f32x128 to gain more interleave.

@workingjubilee
Copy link
Member

We discussed the idea of using Into<Simd> and we decided that, even if we expose a trait that offers such a behavior (splat scalars and the identity function on vectors), we don't necessarily want to conflate the usage of From::from (which is very highly general) with the "autosplat" implication like that.

@andy-thomason
Copy link
Author

Hi @workingjubilee

Thanks for the feedback. I'll try to attend one of your Monday meetings soon, as it has been a while.

I'll try to dedicate some days for this, but my dance card is full of training and crypto!

@burrbull
Copy link

burrbull commented Aug 6, 2022

Hello.
burrbull/sleef-rs#30
I try to port sleef crate from packed_simd on portable_simd but I get error in release:

  Compiling sleef v0.1.0 (/home/runner/work/sleef-rs/sleef-rs)
error: failed to parse bitcode for LTO module: Explicit gep type does not match pointee type of pointer operand (Producer: 'LLVM14.0.6-rust-1.64.0-nightly' Reader: 'LLVM 14.0.6-rust-1.64.0-nightly')
Error: failed to parse bitcode for LTO module: Explicit gep type does not match pointee type of pointer operand (Producer: 'LLVM14.0.6-rust-1.64.0-nightly' Reader: 'LLVM 14.0.6-rust-1.64.0-nightly')
error: could not compile `sleef` due to previous error
Error: The process '/home/runner/.cargo/bin/cargo' failed with exit code 101

In dev mode all compiles and works.
No any unsafe in code.

UPD. Fixed. Looks like issues in gather_or_default

@andy-thomason
Copy link
Author

It would be interesting to benchmark against sleef. I had some good conversations with the author some time ago,
but of course sleef is a very different thing!

It would also be useful to merge this branch and any help to do that would be welcome.

@burrbull
Copy link

burrbull commented Aug 9, 2022

@burrbull
Copy link

burrbull commented Aug 9, 2022

Your implementation is very fast when "fma" is enabled, but what about accuracy?

@andy-thomason
Copy link
Author

This implementation has the same accuracy as libm and several times the performance. The scalar version is
also quite significantly faster.

The tests test against libm implementations of various kinds which vary in accuracy. It is very hard to get
the last bit of accuracy because of the "Tablemakers paradox". At the expense of a table lookup - which
becomes exponentially more costly in multithreaded processes - you can get an extra bit of accuracy
but we choose not to do this.

As with all SIMD code, the only way to get it to run efficiently is to use modern hardware, ie. -C target-cpu=native.
Older CPUs do not fully support SIMD operations or have such short vectors that there is no benefit.

It is a shame that this library only supports up to 512 bit vectors as longer ones exist and cause better performance
even on shorter register machines due to instruction interleaving.

@programmerjake
Copy link
Member

It is a shame that this library only supports up to 512 bit vectors

That's incorrect...it supports Simd<u64, 64> which is 4096 bits. it supports all element types up to length 64 (we're planning on increasing that later).
https://rust.godbolt.org/z/fa7PMvrza

@andy-thomason
Copy link
Author

Excellent.

@miguelraz
Copy link
Contributor

Bump - how can we push out this work over the finish line?

Additionally, I see we're missing a sincos implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants