Known Issues
- `half8` `==` and `!=` operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
- `bool` vectors generated from operations on non-`(s)byte` vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- `float8` `min()` and `max()` functions don't handle NaNs the same way Unity.Mathematics does
Fixes
- Fixed XML documentation not showing descriptions for valid `Promise` flags
- Fixed `cminmax` documentation
- `bitmask64` with `numBits` equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2
- Fixed `uint8` to `float8` type conversion if compiling for AVX2
- Fixed incorrect `mod` implementations
- (ISSUE #16) Fixed `float` and `double` `(r)cbrt` edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector `float` implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid `Promise` flags allow for call-site selection of faster code paths
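The `bitmask64` item above concerns the `numBits == 64` edge case. A minimal sketch of one plausible way such an edge case arises in plain C# follows; it is an illustration only, not MaxMath's actual code. `NaiveMask` and `SafeMask` are hypothetical helper names. C# masks the shift count of a 64 bit shift to its low 6 bits, so shifting by 64 behaves like shifting by 0.

```csharp
using System;

// Hypothetical helpers for illustration; not part of MaxMath.
public static class BitmaskSketch
{
    // Naive version: for numBits == 64 the shift count is masked to 0,
    // so (1UL << 64) evaluates to 1 and the result is 0 instead of all ones.
    public static ulong NaiveMask(int numBits)
    {
        return (1UL << numBits) - 1UL;
    }

    // Guarded version: valid for numBits in [0, 64].
    public static ulong SafeMask(int numBits)
    {
        if (numBits < 0 || numBits > 64)
        {
            throw new ArgumentOutOfRangeException(nameof(numBits));
        }

        return numBits == 64 ? ulong.MaxValue : (1UL << numBits) - 1UL;
    }
}
```

With these helpers, `BitmaskSketch.NaiveMask(64)` yields 0, while `BitmaskSketch.SafeMask(64)` yields a value with all 64 bits set.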
Additions
Divider<T>
`Divider<T>` is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type `T`, a `Divider<T>` instance replaces division operations with multiplication, shift and rounding operations, using the more suitable of 2 algorithms that compilers typically apply to compile time constant divisors.
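To make the idea of replacing a division with a multiplication and a shift concrete, here is a minimal sketch for unsigned 32 bit scalars. It uses one well-known multiply-and-shift scheme (precompute `ceil(2^64 / d)` once, then take the high 64 bits of the product with the dividend). This is an assumption chosen for illustration; the text above does not say which two algorithms `Divider<T>` actually uses. `MagicDivider32` and its members are hypothetical names and not part of MaxMath.

```csharp
using System;

// Hypothetical illustration type; not MaxMath's Divider<T>.
public readonly struct MagicDivider32
{
    private readonly uint divisor;
    private readonly ulong magic; // ceil(2^64 / divisor) for divisor >= 2

    public MagicDivider32(uint divisor)
    {
        if (divisor == 0)
        {
            throw new DivideByZeroException();
        }

        this.divisor = divisor;
        // For divisor >= 2 this equals ceil(2^64 / divisor).
        // For divisor == 1 it wraps around to 0, which Divide handles separately.
        this.magic = unchecked(ulong.MaxValue / divisor + 1UL);
    }

    // Computes n / divisor without a division instruction.
    public uint Divide(uint n)
    {
        if (divisor == 1)
        {
            return n;
        }

        // High 64 bits of the 96 bit product magic * n, built from 32 bit halves.
        ulong hi = (magic >> 32) * n;
        ulong lo = (magic & 0xFFFFFFFFUL) * n;
        return (uint)((hi + (lo >> 32)) >> 32);
    }

    // Computes n % divisor from the quotient.
    public uint Remainder(uint n)
    {
        return n - Divide(n) * divisor;
    }

    // True if divisor divides n without a remainder.
    public bool IsMultiple(uint n)
    {
        return Remainder(n) == 0;
    }
}
```

The precomputation costs one real division, so a scheme like this only pays off when the same divisor is reused, for example `var d = new MagicDivider32(7);` followed by many calls to `d.Divide(x)`.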
`Divider<T>` was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
`Divider<T>` is NOT meant to replace division operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable `i` is the divisor in a for-loop).
Numerous `Promise` flags allow for faster operations, provided that the `Divider<T>` instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Burst's inability to compile-time evaluate `typeof(T)` often requires explicit initialization (example: `new Divider<byte>((byte)42)`). `DEBUG`-only validity checks ensure correct initialization and usage.
The current Divider API consists of...:
- `/` and `%` operators:
  - LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of that type
  - LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of that type
  - LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
  - LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
- `DivRem` member methods
- `EvenlyDivides` member methods
- `T Divisor` as a readonly property
- `public const` `Promise`s within `Divider<T>`, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
- `Get/SetInnerDivider<U>` methods: get or set a scalar- or vector `Divider<U>` within a `Divider<T>`
- Component shuffles: `Divider<T>.wzxy` swizzle "operators" as properties.
NOTE: `Get/SetInnerDivider<U>` methods and `Divider<T>.[a][b][c][d]` properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a `Divider<int2>` from a `Divider<int4>`. `Get/SetInnerDivider<U>` are placeholders both for these operations as well as for the `v[a]_[b]` properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.
quadruple (PREVIEW)
Analogous to `(U)Int128`, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE 754 compliant and uses the typical format of 1 sign bit, 15 exponent bits and 112 mantissa bits.
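The 1/15/112 bit layout can be illustrated by pulling the fields out of the upper 64 bits of a 128 bit pattern. The sketch below only assumes the standard IEEE 754 binary128 layout; how `quadruple` stores its bits internally is not stated here, and `Binary128Layout` with its members is a hypothetical helper, not MaxMath API.

```csharp
using System;

// Hypothetical helper for illustration; not part of MaxMath.
public static class Binary128Layout
{
    // Exponent bias of IEEE 754 binary128: 2^14 - 1.
    public const int ExponentBias = 16383;

    // 'hi' is the upper 64 bits of the 128 bit pattern:
    // bit 63 = sign, bits 62..48 = biased exponent, bits 47..0 = top 48 mantissa bits.
    public static int Sign(ulong hi)
    {
        return (int)(hi >> 63);
    }

    public static int BiasedExponent(ulong hi)
    {
        return (int)((hi >> 48) & 0x7FFF);
    }

    public static ulong MantissaHigh48(ulong hi)
    {
        return hi & 0x0000FFFFFFFFFFFFUL;
    }
}
```

For example, the value 1.0 has the upper half `0x3FFF000000000000` (sign 0, biased exponent 16383, mantissa 0), and -2.0 has the upper half `0xC000000000000000` (sign 1, biased exponent 16384). The remaining 64 mantissa bits live in the lower half of the pattern.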
NOTE: `quadruple` is in preview for an unforeseeable amount of time. This means that it is neither completely optimized, nor are all `maxmath` functions available for it at this time.
The following functions have been implemented: `ToString` and `Parse` (no perfect roundtrip guaranteed), all constants (example: `PI_QUAD`), `Random128` `NextQuadruple` (optionally with min and max values), all type conversions except for `decimal`, `-` (unary), `+` (binary), `-` (binary), `*`, `/`, `%`, `==`, `!=`, `<`, `<=`, `>`, `>=`, `fmod`, `mad`, `msub`, `rcp`, `isnan`, `isinf`, `isfinite`, `isnormal`, `issubnormal`, `round`, `floor`, `ceil`, `trunc`, `roundtoint` (and all other integer variations), `fastsqrt`, `(r)sqrt`, `(r)cbrt`, `isinrange`, `approx`, `select`, `compareto`, `min`, `max`, `copysign`, `nextgreater`, `nextsmaller`, `nexttoward`, `radians`, `degrees`, `chgsign`
Functions
- Added `isnormal` and `issubnormal` functions for floating point types
- Added `hypot` and `inthypot` functions for calculating `[int]sqrt(a * a + b * b)` without overflow, unless an optional `Promise` parameter with its `NoOverflow` flag set is passed as a compile time constant argument
- Added `roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128`. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
- Added `cor` and `cxor`. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
- Split `approx` into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
- Added `roundmultiple(x, m)`, `floormultiple(x, m)`, `ceilmultiple(x, m)` and `truncmultiple(x, m)` for all types, rounding x to the nearest multiple of any positive m with the selected rounding mode (for example: `ceilmultiple` rounds x to the nearest greater multiple of m)
- Added a whole stack of bit manipulation functions for all scalar- and vector integer types: `parityodd`, `parityeven`, `countzerobits`, `l1cnt`, `t1cnt`, `lzmask`, `tzmask`, `l1mask`, `t1mask`, `bits_extractlowest0`, `bits_masktolowest`, `bits_masktolowest0`, `bits_maskfromlowest`, `bits_maskfromlowest0`, `bits_setlowest`, `bits_surroundlowest` and `bits_surroundlowest0`
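A few of the additions above can be sketched in scalar C# to show the intended semantics. This is a rough illustration under stated assumptions, not the library's implementation: `FunctionSketches` and its member names are hypothetical, the overflow-free `hypot` uses the common scaling trick (which may differ from what MaxMath does), vectors are modeled as arrays, and the multiple-rounding helpers assume unsigned inputs, a positive `m`, and that exact multiples are returned unchanged.

```csharp
using System;

// Hypothetical helpers for illustration; not part of MaxMath.
public static class FunctionSketches
{
    // sqrt(a * a + b * b) without overflowing the intermediate squares:
    // factor out the larger magnitude so only a ratio <= 1 is squared.
    // NaN and infinity handling is intentionally omitted.
    public static double HypotScaled(double a, double b)
    {
        double x = Math.Abs(a);
        double y = Math.Abs(b);
        double max = Math.Max(x, y);
        double min = Math.Min(x, y);

        if (max == 0.0)
        {
            return 0.0;
        }

        double ratio = min / max;
        return max * Math.Sqrt(1.0 + ratio * ratio);
    }

    // Bitwise OR across all elements, like a "cor" reduction.
    public static uint Cor(uint[] v)
    {
        uint acc = 0;
        foreach (uint e in v)
        {
            acc |= e;
        }
        return acc;
    }

    // Bitwise XOR across all elements, like a "cxor" reduction.
    public static uint Cxor(uint[] v)
    {
        uint acc = 0;
        foreach (uint e in v)
        {
            acc ^= e;
        }
        return acc;
    }

    // Largest multiple of m that is <= x (m must be positive).
    public static uint FloorMultiple(uint x, uint m)
    {
        return x - x % m;
    }

    // Smallest multiple of m that is >= x (m must be positive).
    // May wrap around if the result does not fit into 32 bits.
    public static uint CeilMultiple(uint x, uint m)
    {
        uint r = x % m;
        return r == 0 ? x : x + (m - r);
    }
}
```

For instance, `Math.Sqrt(1e200 * 1e200 + 1e200 * 1e200)` overflows to infinity, while `FunctionSketches.HypotScaled(1e200, 1e200)` stays finite.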
Global Compilation Options
- Added Global Compilation Options for `OptimizeFor`, `FloatMode` and `FloatPrecision`. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to `OptimizeFor.Performance`, `FloatMode.Default` and `FloatPrecision.Standard`, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs
Improvements
Meta
- This library now fully supports ARM CPUs' SIMD instructions (huge!). It utilizes SSE2NEON and SIMDe to convert x86 SIMD instructions to ARM SIMD instructions or instruction sequences. Because of this, generated ARM code will sometimes remain slightly unoptimized, because the author is unable to verify correctness of ARM specific optimizations with unit tests in most cases.
Performance
- Implemented optimized `(u)long` vector to `float` vector type conversion operators
- Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
- Implemented an `AssumeRangeAttribute` equivalent for all vectorized functions with known return value ranges
- Implemented more optimal `(U)Int128` comparison operators
- Implemented optimal `(U)Int128` multiplication operations with- and division and modulo operations by compile time constants
- Implemented optimal `(U)Int128` division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any variation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing `Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()`
- Increased valid `Promise.Unsafe0` range for `(u)long` `intcbrt` from [0, 2^46] to [0, 2^48]
- Added an optional `Promise` parameter to `gamma`
- Added an optional `Promise` parameter to `erf(c)`
- Added an optional `Promise` parameter to `gcd` and `lcm`
- Added `quarter` and `half` scalar- and vector function overloads for `min`, `max`, `minmax`, `clamp`, `saturate`, `isinrange`, `trunc`, `round`, `ceil`, `floor` and `sign`
- Removed the only non-optimizing branch in vector code in the entire library within the `long2/3/4` `>>` operator if the shift amount is not a compile time constant for a ~30% performance gain
- Reduced branch predictor penalty of `gcd` by moving the loop condition to an earlier part of the loop's code
- Reduced latency of internal `byte` to `uint`/`float` and `ushort` to `ulong`/`double` vector zero-extension conversions (and back) if an entire SIMD register is converted, by ~6 cycles. This operation is especially relevant in `byte` vector `shl` and `shr` operations, which are very common throughout the library, including within loops. Consequently, this improvement should yield significant performance gains
- Reduced latency of `gcd` within the loop by 1 to 8 clock cycles at the cost of 0 to 9 clock cycles outside the loop if compiling for SSE4 or higher, by adding 1 to 6 instructions.
- Reduced latency of `comb` for all scalar and vector types.
  i. `(s)byte`, `(u)short` and scalar `(u)int` overloads now make use of an algorithm that only requires one division per loop iteration instead of two, if the type fits into a single hardware register when cast to the next wider integer type. This algorithm is more than twice as fast.
  ii. Implemented `Divider<T>` into both algorithms. This yields a massive performance gain of 2.3x to 20x (!), at the cost of an immense amount of code size
  iii. The first eight loop iterations of both algorithms have been extracted from the loop and optimized by hand. This yields a performance gain between ~15 (16 bit) and ~600 (64 bit) clock cycles.
  Both ii. and iii. are disabled if the global compilation tag `OptimizeFor` is set to `OptimizeFor.Size`.
- Reduced latency of `half{X}` to `(s)byte{X}`, `(u)short{X}`, `(u)long{X}`, and `half8` to `(u)int8` type conversion from 10 down to 4 or 5 cycles, also affecting managed C# fallback performance. Unity.Mathematics' implementation remains suboptimal (for now... hint)
- Reduced latency of SSE2 fallback code for `ulong` vector `<` and `>` operators by 1 cycle and reduced code size by removing 1 instruction
- Reduced latency of `(s)byte` vector `floorlog2`/`ceillog2` by 1 cycle and reduced code size by removing 1 to 2 instructions, as well as 1 constant read from RAM
- Reduced latency of vector `long` `(n)abs` by 0 to 2 cycles (highly CPU specific) and reduced code size by removing 1 to 2 instructions
- Reduced latency of `all_dif` overloads for all `(s)byte` vectors and `short16` by 10% to 20%; the `(s)byte` overloads now use ~35% less constant data read from RAM
- Reduced latency of `(s)byte16/32` division and modulo operations by 6 cycles by adding 6 instructions
- Reduced latency of vector `float` to `(u)long` type conversion by 5 to 10 cycles
- Reduced latency of `quarter`/`half` to integer type conversion operations by 6 or more cycles
- Reduced latency, code size and constant data read from RAM of SSE2 fallback code for `(s)byte` vector `shrl`, `shra` and `shl` drastically when the shift amount is constant
- Reduced latency of vector `rol`/`ror` drastically when the rotation value is constant and a multiple of 8
- Reduced latency of `(u)int8` and `(u)long` vector `l/tzcnt` by up to 2 cycles and removed 1 instruction
- Reduced latency of `float8` to `uint8` conversion operator by ~6 cycles if compiling for AVX2
- Reduced latency of scalar- and vector `double` `fastrsqrt` by 4 cycles and increased accuracy of the result
- Reduced latency of `(u)long` `intsqrt` by reducing the latency of each loop iteration by 1 (signed) or 2 (unsigned) cycles, at the cost of an extra 5 (unsigned) or 6 (signed) instructions with a latency of 3 cycles outside the loop
- Reduced latency of scalar `(u)int` `intlog10` by 1 cycle and removed 1 instruction
- Reduced latency of vector `double` to `(u)long` type conversion operators by 1 cycle and removed 5 instructions
- Reduced latency of previously internal `roundto(u)long(double x)` by ~10 cycles, also reducing `(u)long` vector division latency by ~20 cycles down to ~94 cycles
- Reduced latency of squaring `(u)long` vectors by 2 to 3 cycles and removed 2 instructions, also reducing `(u)long` vector `intpow` latency, as squaring is part of the loop
- Reduced latency of all floating point `nexttoward` functions by 1 cycle
- Reduced latency of `(u)long` vector multiplication by scalar- or vector integer types with bit-width less than or equal to 32 by 2 cycles and removed 1 instruction
- Reduced latency of signed 8 bit integer division by 1 cycle if compiling for SSSE3 or higher
- Reduced latency of managed- and SSE2 fallback code paths for integer vector `isinrange(x, min, max)` function overloads by ~2 cycles
- Reduced latency of floating point `isinrange(x, min, max)` function overloads by ~7 cycles
- Reduced latency of `float`/`double` to `(U)Int128` type conversion drastically (hundreds of cycles...!)
- Reduced code size of vector `bits_depositparallel` and `bits_extractparallel` by removing 1 instruction if compiling for SSE4 or higher
- Reduced code size of vectorized `perm` by removing 3 instructions
- Reduced code size of vectorized `intpow` by removing 5 or more instructions
- Reduced code size of `ushort8` to `float8` type conversion by removing 1 instruction
- Added SSE2 fallback code for `(u)long2/3/4` to `(s)byte2/3/4` type conversion
- Added SSE2 fallback code for `countbits`
- Added SSE2 fallback code for `mulsaturated` for `int2/3/4`
- Added AVX fallback code for `float8` and `double3/4` `gamma`
- Added AVX fallback code for `float8` and `double3/4` `erf(c)`
- Added AVX fallback code for `double3/4` `(r)cbrt`
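The `comb` item in the list above mentions an algorithm that needs only one division per loop iteration when intermediate values fit into the next wider integer type. A minimal sketch of the textbook multiplicative formula with that property follows. It is an assumption that this resembles the library's approach; `CombSketch` is a hypothetical name, the result is assumed to fit into 32 bits, and returning 0 for `k > n` is a convention chosen for this sketch.

```csharp
using System;

// Hypothetical helper for illustration; not part of MaxMath.
public static class CombSketch
{
    // Binomial coefficient "n choose k" for results that fit into 32 bits.
    public static uint Comb(uint n, uint k)
    {
        if (k > n)
        {
            return 0;
        }

        // Use symmetry to keep the loop short: C(n, k) == C(n, n - k).
        if (k > n - k)
        {
            k = n - k;
        }

        // 64 bit accumulator: the product below never exceeds 64 bits
        // as long as the final result fits into 32 bits.
        ulong c = 1;
        for (uint i = 1; i <= k; i++)
        {
            // After this step c == C(n - k + i, i); the division is always exact.
            c = c * (n - k + i) / i;
        }

        return (uint)c;
    }
}
```

Because the divisor `i` simply counts upward, a loop of this shape is also the kind of place where a precomputed or vectorized divider can replace the hardware division.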
Changes
- Bumped C# Dev Tools dependency to version 1.0.9
- Bumped Unity.Mathematics dependency to version 1.3.1 and implemented `square` for all types, `chgsign` for floating point types and the constants `TAU` (present since 2020, updated documentation), `PI2`, `PIHALF`, `TODEGREES` and `TORADIANS`
- Bumped Unity.Burst dependency to version 1.8.18
- Moved `Promise` type declaration from `MaxMath.maxmath` (class) to `MaxMath` (namespace)
- Two-parameter `all_dif` overloads are no longer a wrapper for `math.all(a != b)`; these now return true if and only if both vectors do not share any components with each other
- `double` to `(u)long` type conversion outside of the `(u)long` domain is now undefined behavior (just like in the C standard) - the Mono runtime changed its behavior with an upgrade to some Unity 2022 version and is more performance intensive to follow "correctly" now. Reduced latency by 2 to 4 cycles, removed up to 5 instructions and 3 pieces of constant data read from RAM
- `rol`/`ror` return values are now undefined if the rotation amount is outside [0, BITS) (compliant with Unity.Mathematics). Reduced latency by up to 2 cycles if the rotation amount is not a compile time constant
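The changed meaning of the two-parameter `all_dif` overloads listed above is easy to miss, so here is a small scalar sketch contrasting the two behaviors. Vectors are modeled as `int[]` arrays, and `AllDifSketch` with its method names is hypothetical; the sketch only mirrors the semantics described in the list and is not the library's vectorized code.

```csharp
using System;

// Hypothetical helpers for illustration; not part of MaxMath.
public static class AllDifSketch
{
    // Old behavior: equivalent to all(a != b), comparing matching positions only.
    // Assumes both arrays have the same length.
    public static bool AllComponentwiseDifferent(int[] a, int[] b)
    {
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] == b[i])
            {
                return false;
            }
        }
        return true;
    }

    // New behavior: true only if no element of a equals any element of b.
    public static bool NoSharedComponents(int[] a, int[] b)
    {
        foreach (int x in a)
        {
            foreach (int y in b)
            {
                if (x == y)
                {
                    return false;
                }
            }
        }
        return true;
    }
}
```

For `a = (1, 2)` and `b = (2, 1)`, the old position-wise check passes because no matching positions are equal, while the new check fails because both vectors contain 1 and 2.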
Fixed Oversights
- Added `asbyte(sbyte{X})`, `assbyte(byte{X})`, `asshort(ushort{X})`, `asushort(short{X})`, `asint(uint8)`, `asuint(int8)`, `aslong(ulong{X})`, `asulong(long{X})`
- Added `minmax(a, b, out min, out max)` for scalar types; also improved managed fallback performance
- Added `minmaxmag(a, b, out minmag, out maxmag)` for scalar types; also improved managed fallback performance
- Added `bitmask` overloads for `bool2` and `bool3`, an oversight by the Unity.Mathematics team
- Added missing `(u)long` matrix from/to `(u)int` matrix conversion operators
- Added C# unboxing casts to `(U)Int128` `Equals(object other)`
- `countbits(uint8 x)` now returns an `int8`
Upcoming
The next release, 3.0, is going to be a major deviation from the previous premise of this library being an extension to Unity.Mathematics. It is going to become a wrapper library instead, replacing Unity.Mathematics entirely, while retaining the things that work well and cannot be done with MaxMath and Burst alone.
The most important guideline is an easy transition from either Unity.Mathematics alone or in combination with MaxMath. The `maxmath` class will be renamed to `math`; a simple case sensitive find&replace ("Unity.Mathematics" -> "MaxMath" and "maxmath" -> "math") will be enough to migrate to this new version.
Reasons for this are:
- Performance: Integer division/mod is not vectorized in Unity.Mathematics - just as an example. More dramatically, Unity.Mathematics' `bool` vectors perform poorly and a solution is at hand.
- Unity.Mathematics is not really updated anymore. Thus we cannot expect certain fixes, like `half` types not being IEEE 754 compliant at all.
Release 3.0 is going to be released as both an extension- and as a wrapper library. It is only going to focus on extra test coverage and possible bugfixes, i.e. there are not going to be any improvements or additions.
Questions are welcome. Please use the issue tracker or seek contact on Discord