Skip to content

v2.9.0

Latest
Compare
Choose a tag to compare
@MrUnbelievable92 MrUnbelievable92 released this 26 Oct 19:25
· 17 commits to master since this release
7f4752f

Known Issues

  • half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
  • bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
  • float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does

Fixes

  • Fixed XML documentation not showing descriptions for valid Promise flags
  • Fixed cminmax documentation
  • bitmask64 with numBits equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2
  • Fixed uint8 to float8 type conversion if compiling for AVX2
  • Fixed incorrect mod implementations
  • (ISSUE #16) Fixed float and double (r)cbrt edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector float implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid Promise flags allow for call-site selection of faster code paths

Additions

Divider<T>

Divider<T> is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type T, a Divider<T> instance replaces division operations by multiplication-, shift- and rounding operations, utilizing the most suitable of 2 algorithms, typically used by compilers for compile time constant divisors.
Divider<T> was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
Divider<T> is NOT meant to replace divison operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable i is the divisor in a for-loop).
Numerous Promise flags allow for faster operations, provided that the Divider<T> instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Bursts inabilty to compile-time evaluate typeof(T) often requires explicit initialization (example: new Divider<byte>((byte)42)). DEBUG only validity checks ensure correct initialization and usage.
The current Divider API consists of...:

  • / and % operators:
    • LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of the that type
    • LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of the that type
    • LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
    • LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
  • DivRem member methods
  • EvenlyDivides member methods
  • T Divisor as a readonly property
  • public const Promises within Divider<T>, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
  • Get/SetInnerDivider<U> methods: get or set a scalar- or vector Divider<U> within a Divider<T>
  • Component shuffles: Divider<T>.wzxy swizzle "operators" as properties.

NOTE: Get/SetInnerDivider<U> methods and Divider<T>.[a][b][c][d] properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a Divider<int2> from a Divider<int4>. Get/SetInnerDivider<U> are placeholderholders both for these operations as well as for the v[a]_[b] properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.

quadruple (PREVIEW)

Analogous to (U)Int128, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE754 compliant and in the typical 1 sign bit, 15 exponent bits, 112 mantissa bits format.
NOTE: quadruple is in preview for an unforseeable amount of time. This means that it is neither completely optimized, nor are all maxmath functions available for it at this time.
The following functions have been implemented: ToString and Parse (no perfect roundtrip guaranteed), All constants (example: PI_QUAD), Random128 NextQuadruple (optionally with min and max values), all type conversions except for decimal, -(unary), +(binary), -(binary), *, /, %, ==, !=, <, <=, >, >=, fmod, mad, msub, rcp, isnan, isinf, isfinite, isnormal, issubnormal, round, floor, ceil, trunc, roundtoint (and all other integer variations), fastsqrt, (r)sqrt, (r)cbrt, isinrange, approx, select, compareto, min, max, copysign, nextgreater, nextsmaller, nexttoward, radians, degrees, chgsign

Functions

  • Added isnormal and issubnormal functions for floating point types
  • Added hypot and inthypot functions for calculating [int]sqrt(a * a + b * b) without overflow, unless an optional Promise parameter with its NoOverflow flag set is passed as a compile time constant argument
  • Added roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
  • Added cor and cxor. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
  • Split approx into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
  • Added roundmultiple(x, m), floormultiple(x, m), ceilmultiple(x, m) and truncmultiple(x, m) for all types, rounding x to the nearest multiple of any positive m with the selected rounding mode (for example: ceilmultiple rounds x to the nearest greater multiple of m)
  • Added a whole stack of bit manipulation functions for all scalar- and vector integer types: parityodd, parityeven, countzerobits, l1cnt, t1cnt, lzmask, tzmask, l1mask, t1mask, bits_extractlowest0, bits_masktolowest, bits_masktolowest0, bits_maskfromlowest, bits_maskfromlowest0, bits_setlowest, bits_surroundlowest and bits_surroundlowest0

Global Compilation Options

  • Added Global Compilation Options for OptimizeFor, FloatMode and FloatPrecision. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to OptimizeFor.Performance, FloatMode.Default and FloatPrecision.Standard, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs

Improvements

Meta

  • This library now fully supports ARM CPUs' SIMD instructions (huge!). It utilizes SSE2NEON and SIMDe to convert x86 SIMD instructions to ARM SIMD instructions or instruction sequences. Because of this, generated ARM code will sometimes remain slightly unoptimized, because the author is unable to verify correctness of ARM specific optimizations with unit tests in most cases.

Performance

  • Implemented optimized (u)long vector to float vector type convesion operators
  • Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
  • Implemented an AssumeRangeAttribute equivalent for all vectorized functions with known return value ranges
  • Implemented more optimal (U)Int128 comparison operators
  • Implemented optimal (U)Int128 multiplication operations with- and division and modulo operations by compile time constants
  • Implemented optimal (U)Int128 division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any varation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()
  • Increased valid Promise.Unsafe0 range for (u)long intcbrt from [0, 2^46} to [0, 2^48]
  • Added an optional Promise parameter to gamma
  • Added an optional Promise parameter to erf(c)
  • Added an optional Promise parameter to gcd and lcm
  • Added quarter and half scalar- and vector function overloads for min, max, minmax, clamp, saturate, isinrange, trunc, round, ceil, floor and sign
  • Removed the only non-optimizing branch in vector code in the entire library within the long2/3/4 >> operator if the shift amount is not a compile time constant for a ~30% performance gain
  • Reduced branch predictor penalty of gcd by moving the loop condition to an earlier part of the loops code
  • Reduced latency of internal byte to uint/float and ushort to ulong/double vector zero-extension conversions (and back) if an entire SIMD register is converted, by ~6 cycles. This operation is especially relevant in byte vector shl and shr operations, which are very common throughout the library, including within loops. Consequently, this improvement should yield significant performance gains
  • Reduced latency of gcd within the loop by 1 to 8 clock cycles at the cost of 0 to 9 clock cycles outside the loop if compiling for SSE4 or higher, by adding 1 to 6 instructions.
  • Reduced latency of comb for all scalar and vector types.
    1. (s)byte, (u)short and scalar (u)int overloads now make use of an algorithm that only requires one division per loop iteration instead of two, if the type fits into a single hardware register when cast to the next wider integer type. This algorithm is more than twice as fast.
    2. Implemented Divider<T> into both algorithms. This yields a massive performance gain of 2.3x to 20x (!), at the cost of an immense amount of code size
    3. The first eight loop iterations of both algorithms have been extracted from the loop and optimized by hand. This yields a performance gain between ~15 (16 bit) and ~600 (64 bit) clock cycles.

Both ii. and iii. are disabled if the global compilation tag OptimizeFor is set to OptimizeFor.Size.

  • Reduced latency of half{X} to (s)byte[X}, (u)short{X}, (u)long{X}, and half8 to (u)int8 type conversion from 10 down to 4 or 5 cycles, also affecting managed C# fallback performance. Unity.Mathematics' implementation remains suboptimal (for now... hint)
  • Reduced latency of SSE2 fallback code for ulong vector < and > operators by 1 cycle and reduced code size by removing 1 instruction
  • Reduced latency of (s)byte vector floorlog2/ceillog2 by 1 cycle and reduced code size by removing 1 to 2 instructions, as well as 1 constant read from RAM
  • Reduced latency of vector long (n)abs by 0 to 2 cycles (highly CPU specific) and reduced code size by removing 1 to 2 instructions
  • Reduced latency of all_dif overloads for all (s)byte vectors and short16 by 10% to 20%; the (s)byte overloads now use ~35% less constant data read from RAM
  • Reduced latency of (s)byte16/32 division and modulo operations by 6 cycles by adding 6 instructions
  • Reduced latency of vector float to (u)long type conversion by 5 to 10 cycles
  • Reduced latency of quarter/half to integer type conversion operations by 6 or more cycles
  • Reduced latency, code size and constant data read from RAM of SSE2 fallback code for (s)byte vector shrl, shra and shl drastically when the shift amount is constant
  • Reduced latency of vector rol/ror drastically when the rotation value is constant and a multiple of 8
  • Reduced latency of (u)int8 and (u)long vector l/tzcnt by up to 2 cycles and removed 1 instruction
  • Reduced latency of float8 to uint8 conversion operator by ~6 cycles if compiling for AVX2
  • Reduced latency of scalar- and vector double fastrsqrt by 4 cycles and increased accuracy of the result
  • Reduced latency of (u)long intsqrt by reducing the latency of each loop interation by 1 (signed) or 2 (unsigned) cycles, at the cost of an extra 5 (unsigned) or 6 (signed) instructions with a latency of 3 cycles outside the loop
  • Reduced latency of scalar (u)int intlog10 by 1 cycle and removed 1 instruction
  • Reduced latency of vector double to (u)long type conversion operators by 1 cycle and removed 5 instructions
  • Reduced latency of previously internal roundto(u)long(double x) by ~10 cycles, also reducing (u)long vector division latency by ~20 cycles down to ~94 cycles
  • Reduced latency of squaring (u)long vectors by 2 to 3 cycles and removed 2 instructions, also reducing (u)long vector inpow latency, as squaring is part of the loop
  • Reduced latency of all floating point nexttoward functions by 1 cycle
  • Reduced latency of (u)long vector multiplication by scalar- or vector integer types with bit-width less than or equal to 32 by 2 cycles and removed 1 instruction
  • Reduced latency of signed 8 bit integer division by 1 cycle if compiling for SSSE3 or higher
  • Reduced latency of managed- and SSE2 fallback code paths for integer vector isinrange(x, min, max) function overloads by ~2 cycles
  • Reduced latency of floating point isinrange(x, min, max) function overloads by ~7 cycles
  • Reduced latency of float/double to (U)Int128 type conversion drastically (hundreds of cycles...!)
  • Reduced code size of vector bits_depositparallel and bits_extractparallel by removing 1 instruction if compiling for SSE4 or higher
  • Reduced code size of vectorized perm by removing 3 instructions
  • Reduced code size of vectorized intpow by removing 5 or more instructions
  • Reduced code size of ushort8 to float8 type conversion by removing 1 instruction
  • Added SSE2 fallback code for (u)long2/3/4 to (s)byte2/3/4 type conversion
  • Added SSE2 fallback code for countbits
  • Added SSE2 fallback code for mulsaturated for int2/3/4
  • Added AVX fallback code for float8 and double3/4 gamma
  • Added AVX fallback code for float8 and double3/4 erf(c)
  • Added AVX fallback code for double3/4 (r)cbrt

Changes

  • Bumped C# Dev Tools dependency to version 1.0.9
  • Bumped Unity.Mathematics dependency to version 1.3.1 and implemented square for all types, chgsign for floating point types and the constants TAU (present since 2020, updated documentation), PI2, PIHALF, TODEGREES and TORADIANS
  • Bumped Unity.Burst dependency to version 1.8.18
  • Moved Promise type declaration from MaxMath.maxmath (class) to MaxMath (namespace)
  • Two-parameter all_dif overloads are no longer a wrapper for math.all(a != b); these now return true if and only if both vectors do not share any components with each other
  • double to (u)long type conversion outside of the (u)long domain is now undefined behavior (just like in the C standard) - the Mono runtime changed its behavior with an upgrade to some Unity 2022 version and is more performance intensive to follow "correctly" now. Reduced latency by 2 to 4 cycles, removed up to 5 instructions and 3 pieces of constant data read from RAM
  • rol/ror return values are now undefined if the rotation amount is outside [0, BITS) (compliant with Unity.Mathematics). Reduced latency by up to 2 cycles if the rotation amount is not a compile time constant

Fixed Oversights

  • Added asbyte(sbyte{X}), assbyte(byte{X}), asshort(ushort{X}), asushort(short{X}), asint(uint8), asuint(int8), aslong(ulong{X}), asulong(long{X})
  • Added minmax(a, b, out min, out max) for scalar types; Improved managed fallback performance also
  • Added minmaxmag(a, b, out minmag, out maxmag) for scalar types; Improved managed fallback performance also
  • Added bitmask overloads for bool2 and bool3, an oversight by the Unity.Mathematics team
  • Added missing (u)long matrix from/to (u)int matrix conversion operators
  • Added C# unboxing casts to (U)Int128 Equals(object other)
  • countbits(uint8 x) now returns an int8

Upcoming

The next release, 3.0, is going to be a major deviation from the previous premise of this library being an extension to Unity.Mathematics. It is going to become a wrapper library instead, replacing Unity.Mathematics entirely, while retaining the things that work well and cannot be done with MaxMath and Burst alone.
The most important guideline is an easy transition from either Unity.Mathematics alone or in combination with MaxMath. The maxmath class will be renamed to math; a simple case sensitive find&replace ("Unity.Mathematics" -> "MaxMath" and "maxmath" -> "math") will be enough to migrate to this new version.
Reasons for this are:

  • Performance: Integer division/mod is not vectorized in Unity.Mathematics - just as an example. More dramatically Unity.Mathematics' bool vectors perform poorly and a solution is at hand.
  • Unity.Mathematics is not really updated anymore. Thus we cannot expect certain fixes, like half types not being IEEE754 compliant at all.

Release 3.0 is going to be released as both an extension- and as a wrapper library. It is only going to focus on extra test coverage and possible bugfixes, i.e. there are not going to be any improvements or additions.
Questions are welcome. Please use the issue tracker or seek contact on Discord