Taking the new System.Runtime.Intrinsics namespace for a spin and comparing it to scalar float and Vector<float> operations.
- Introduction to Intrinsics
- First steps
- Loading and storing data
- Aligned vs. Unaligned Memory
- Dataset Sizes vs Caches
- Basic Operations
- Comparisons
- What's Missing?
- Some Benchmark Results
The new functionality (available in .NET Core 3.0 and beyond) under the System.Runtime.Intrinsics namespace opens up some of the Intel and AMD processor intrinsics (see Intel's full guide here and a Microsoft blog entry by Tanner Gooding on the subject). The coverage is not 100% but I imagine it will grow further as time passes. ARM processor support is planned for the future.
In a nutshell, the new functionality expands SIMD processing beyond what's possible using System.Numerics.Vector<T> by adding dozens of new instructions.
You prepare your code by adding some using statements:
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
Intrinsics contains the different new vector classes and structures (Microsoft documentation): Vector64<T>, Vector128<T> and Vector256<T>. The number refers to the bit-length of the vector, as expected.
The classes offer functions for creating and transforming vectors: Vector256.Create(1.0f) creates a new Vector256<float> with every component float initialized to 1.0f; Vector128.AsByte(someVector128) creates a new Vector128<byte>, reinterpreting the bits of the float values as bytes. Also, you can create vectors using Create and explicitly passing all elements.
using System.Runtime.Intrinsics;

namespace Core3Intrinsics
{
    public class Intro
    {
        public Intro()
        {
            Vector128<float> middleVector = Vector128.Create(1.0f); // middleVector = <1, 1, 1, 1>
            middleVector = Vector128.CreateScalar(-1.0f); // middleVector = <-1, 0, 0, 0>
            Vector64<byte> floatBytes = Vector64.AsByte(Vector64.Create(1.0f, -1.0f)); // floatBytes = <0, 0, 128, 63, 0, 0, 128, 191>
            Vector256<float> left = Vector256.Create(-1.0f, -2.0f, -3.0f, -4.0f, -5.0f, -6.0f, -7.0f, -8.0f);
        }
    }
}
Intrinsics.X86 contains the SIMD classes, like Sse and Avx. It can be quite daunting (see Microsoft's documentation here) since it does not contain any explanation of the functionality. For functions like Add that might not be necessary, but a name like Blend is not exactly enlightening (unless you are already familiar with Intel's intrinsics).
All classes within Intrinsics.X86 contain a static IsSupported bool: if true, all is well and the platform supports the specific functionality (e.g. AVX2). If false, you are on your own; no software fallback is provided. If your code does not check for availability and happens to run on a hardware platform which does not support the functionality you are using, a PlatformNotSupportedException will be thrown at runtime.
These classes contain all the currently supported SIMD functions, like Add, LoadVector256 and many more.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace Core3Intrinsics
{
    public class Intro
    {
        public Intro()
        {
            if (Avx.IsSupported)
            {
                var left = Vector256.Create(-2.5f);
                var right = Vector256.Create(5.0f);
                Vector256<float> result = Avx.Add(left, right); // result = <2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5>
                result = Avx.Multiply(left, right); // result = <-12.5, -12.5, -12.5, -12.5, -12.5, -12.5, -12.5, -12.5>

                double[] someDoubles = new double[] { 1.0, 3.0, -2.5, 7.5, 10.8, 0.33333 };
                unsafe
                {
                    fixed (double* ptr = &someDoubles[1])
                    {
                        Vector256<double> res2 = Avx.LoadVector256(ptr); // res2 = <3, -2.5, 7.5, 10.8>
                    }
                }
            }
        }
    }
}
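Since there is no software fallback, a common pattern is to branch on IsSupported and keep a scalar path next to the vector path. A minimal sketch (the Add8 helper and its fixed eight-element size are my own illustration, not part of the API):

using System.Runtime.Intrinsics.X86;

public static class SupportCheck
{
    // Adds 8 floats from a and b into dest, using AVX when the CPU supports it.
    public static unsafe void Add8(float* a, float* b, float* dest)
    {
        if (Avx.IsSupported)
        {
            Avx.Store(dest, Avx.Add(Avx.LoadVector256(a), Avx.LoadVector256(b)));
        }
        else
        {
            for (int i = 0; i < 8; i++) // scalar fallback
            {
                dest[i] = a[i] + b[i];
            }
        }
    }
}

The JIT treats IsSupported as a constant, so the branch that doesn't apply is eliminated from the generated code.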
The documentation contains the intrinsic function used by the processor (for Add(Vector256<Single>, Vector256<Single>), for example, the instruction is __m256 _mm256_add_ps (__m256 a, __m256 b)). This comes in handy in order to find the equivalent instruction in the Intel guide:
__m256 _mm256_add_ps (__m256 a, __m256 b)
Synopsis
__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX
Description
Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.
Operation
FOR j := 0 to 7
i := j*32
dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0
Performance
| Architecture | Latency | Throughput (CPI) |
|--------------|---------|------------------|
| Skylake      | 4       | 0.5              |
| Broadwell    | 3       | 1                |
| Haswell      | 3       | 1                |
| Ivy Bridge   | 3       | 1                |
This gives you the exact description of the operation(s) being performed and also performance data ("Latency" is "the number of processor clocks it takes for an instruction to have its data available for use by another instruction", "Throughput" is "the number of processor clocks it takes for an instruction to execute or perform its calculations"; see Intel's definitions here).
As seen above, you can create vectors one-by-one using the various Create functions. Another possibility is to use the (unsafe) Load...() functions. Storing data can be achieved with the Store...() functions:
double[] someDoubles = new double[] { 1.0, 3.0, -2.5, 7.5, 10.8, 0.33333 };
double[] someResult = new double[someDoubles.Length];
unsafe
{
    fixed (double* ptr = &someDoubles[1])
    fixed (double* ptr2 = &someResult[0])
    {
        Vector256<double> res2 = Avx.LoadVector256(ptr); // res2 = <3, -2.5, 7.5, 10.8>
        Avx.Store(ptr2, res2);
    }
}
You can also create a new vector by interleaving two others:
left = Vector256.Create(-1.0f, -2.0f, -3.0f, -4.0f, -50.0f, -60.0f, -70.0f, -80.0f);
right = Vector256.Create(1.0f, 2.0f, 3.0f, 4.0f, 50.0f, 60.0f, 70.0f, 80.0f);
result = Avx.UnpackLow(left, right); // result = <-1, 1, -2, 2, -50, 50, -60, 60>
result = Avx.UnpackHigh(left, right); // result = <-3, 3, -4, 4, -70, 70, -80, 80>
R = UnpackLow(A, B)
|------|------|------|------|------|------|------|------|
| A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 |
|------|------|------|------|------|------|------|------|
|------|------|------|------|------|------|------|------|
| B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
|------|------|------|------|------|------|------|------|
R0 R1 R2 R3 R4 R5 R6 R7
|------|------|------|------|------|------|------|------|
| A0 | B0 | A1 | B1 | A4 | B4 | A5 | B5 |
|------|------|------|------|------|------|------|------|
Many times you'll use the intrinsics for huge amounts of data, so a more practical approach to creating vectors could be:
public float[] ProcessData(ref Span<float> input)
{
    float[] results = new float[input.Length];
    Span<Vector256<float>> resultVectors = MemoryMarshal.Cast<float, Vector256<float>>(results);
    ReadOnlySpan<Vector256<float>> inputVectors = MemoryMarshal.Cast<float, Vector256<float>>(input);
    for (int i = 0; i < inputVectors.Length; i++)
    {
        resultVectors[i] = Avx.Sqrt(inputVectors[i]);
    }
    return results;
}
System.Runtime.InteropServices.MemoryMarshal.Cast<TFrom, TTo>() will reinterpret the values in place (i.e. no copying involved). At the end of the loop, the results array will automagically contain the individual floats from the vector operation. (By the way, the above example does not check whether the input length fits neatly into Vector256<float>s; normally you'd need to process any remaining elements in a scalar way, as in the sketch below.)
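A minimal sketch of that remainder handling (same shape as ProcessData above; assumes AVX support has already been checked):

public float[] ProcessDataWithRemainder(ReadOnlySpan<float> input)
{
    float[] results = new float[input.Length];
    Span<Vector256<float>> resultVectors = MemoryMarshal.Cast<float, Vector256<float>>(results);
    ReadOnlySpan<Vector256<float>> inputVectors = MemoryMarshal.Cast<float, Vector256<float>>(input);
    for (int i = 0; i < inputVectors.Length; i++)
    {
        resultVectors[i] = Avx.Sqrt(inputVectors[i]);
    }
    // Cast truncates: elements that don't fill a whole Vector256 are left
    // over, so finish them one by one.
    for (int i = inputVectors.Length * 8; i < input.Length; i++)
    {
        results[i] = MathF.Sqrt(input[i]);
    }
    return results;
}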
You can also go unsafe and loop through pointers, of course:
public unsafe float[] ProcessDataUnsafe(ref Span<float> input)
{
    float[] results = new float[input.Length];
    fixed (float* inputPtr = &input[0])
    fixed (float* resultPtr = &results[0])
    {
        float* inCurrent = inputPtr;
        float* resCurrent = resultPtr;
        float* resEnd = resultPtr + results.Length;
        while (resCurrent < resEnd)
        {
            Avx.Store(resCurrent, Avx.Sqrt(Avx.LoadVector256(inCurrent)));
            resCurrent += 8; // 8 floats per Vector256<float>
            inCurrent += 8;
        }
    }
    return results;
}
No performance difference on my machine, though.
Although moving data around using vectors seems pretty efficient, I was surprised to measure System.Runtime.CompilerServices.Unsafe.CopyBlock(ref byte destination, ref byte source, uint byteCount) as faster, independently of data size (i.e. even data far bigger than the cache will be copied efficiently). Of course it's unsafe in the sense that you need to know what you are doing (it doesn't require the unsafe keyword, though).
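A quick sketch of how that looks (array names are illustrative):

using System.Runtime.CompilerServices;

byte[] source = new byte[16384];
byte[] destination = new byte[source.Length];

// Copies byteCount bytes from source to destination; no bounds checks are
// performed, so the caller must guarantee both buffers are big enough.
Unsafe.CopyBlock(ref destination[0], ref source[0], (uint)source.Length);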
| Method | numberOfBytes | Mean | Error | StdDev | Median | Ratio | RatioSD |
|------------------------------ |-------------- |---------------:|--------------:|--------------:|---------------:|------:|--------:|
| ScalarStoreBlock | 16384 | 306.1 ns | 8.539 ns | 12.246 ns | 302.8 ns | 1.00 | 0.00 |
| VectorStoreArrayMemPtr | 16384 | 401.3 ns | 8.049 ns | 12.998 ns | 397.5 ns | 1.32 | 0.07 |
| ScalarStoreBlock | 8388608 | 1,106,074.5 ns | 17,544.390 ns | 14,650.360 ns | 1,107,074.2 ns | 1.00 | 0.00 |
| VectorStoreArrayMemPtr | 8388608 | 1,573,258.0 ns | 34,312.238 ns | 44,615.601 ns | 1,561,962.8 ns | 1.43 | 0.05 |
An impressive 32 - 43% advantage... It shows that a properly optimized scalar method (probably using some very smart assembly instructions) beats a naïve vectorization with ease.
If you look through the different Load... instructions available, you'll notice that you have, for example, LoadVector256(T*) and LoadAlignedVector256(T*).
⚠️ The "Aligned" part refers to the memory alignment of the pointer to the beginning of the data: in order to use the LoadAligned version of the functions, your data needs to start at a specific boundary. For 256-bit vectors (32 bytes), the data needs to start at a location (pointer address) that is a multiple of 32; for 128-bit vectors it needs to be aligned at 16-byte boundaries. Failure to do so can result in a runtime general protection fault.
In the past, aligned data used to work much better than unaligned data, but modern processors don't really care, as long as your accesses don't straddle cache-line or page boundaries (see this comment by T. Gooding, for example).
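There's no managed way to allocate 32-byte-aligned arrays in .NET Core 3.0, so to try LoadAligned you have to align a native allocation yourself. A minimal sketch (names are illustrative):

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AlignedLoadDemo
{
    public static unsafe void Run()
    {
        const int alignment = 32; // Vector256 loads want a 32-byte boundary
        IntPtr raw = Marshal.AllocHGlobal(8 * sizeof(float) + alignment - 1);
        try
        {
            // Round the address up to the next 32-byte boundary.
            float* aligned = (float*)(((long)raw + alignment - 1) & ~(long)(alignment - 1));
            for (int i = 0; i < 8; i++)
            {
                aligned[i] = i;
            }
            // The pointer is now aligned, so the aligned load is safe to use.
            Vector256<float> v = Avx.LoadAlignedVector256(aligned);
        }
        finally
        {
            Marshal.FreeHGlobal(raw);
        }
    }
}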
Comparing aligned to unaligned on my machine:
Method | NumberOfBytes | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
VectorStoreAlignedUnsafe | 8388608 | 1,508,063.2 ns | 30,407.034 ns | 26,955.044 ns | 1,501,035.9 ns | 1.00 | 0.00 |
VectorStoreUnalignedUnsafe | 8388608 | 1,527,444.0 ns | 29,279.764 ns | 30,068.162 ns | 1,514,013.7 ns | 1.02 | 0.03 |
VectorStoreUnalignedToAlignedUnsafe | 8388608 | 1,485,540.1 ns | 12,131.046 ns | 10,129.973 ns | 1,486,236.1 ns | 0.99 | 0.02 |
There's really no meaningful difference for bigger data chunks.
Often overlooked, the size of your datasets may have an important impact on your processing times (apart from the obvious increase in elements): if all data fits in a processor core's cache, memory access is cheap; once the dataset outgrows the cache and only a few operations are performed per data point, memory access times become crucial and you'll notice a non-linear increase in processing time vs. data size.
⚠️ In other words, when you measure your loop in order to determine your gains (if any!) from using intrinsics, it's important to test with data sizes close to the real data. For huge data, test with arrays several times bigger than the available cache size, at least.
As mentioned above, System.Runtime.Intrinsics.X86 contains the SSE, AVX etc. functionality. You can add, subtract, multiply and divide all kinds of vectors. You also have Sqrt and ReciprocalSqrt, Min and Max; they all do what you expect.
Some more exotic operations are:
__m256 _mm256_addsub_ps (__m256 a, __m256 b)
var left = Vector256.Create(-2.5f); // <-2.5, -2.5, -2.5, -2.5, -2.5, -2.5, -2.5, -2.5>
var right = Vector256.Create(5.0f); // <5, 5, 5, 5, 5, 5, 5, 5>
Vector256<float> result = Avx.AddSubtract(left, right); // result = <-7.5, 2.5, -7.5, 2.5, -7.5, 2.5, -7.5, 2.5>
AddSubtract will subtract the even components (0, 2, ...) and add the odd ones (1, 3, ...).
|------|------|------|------|------|
| A0 | A1 | A2 | A3 | ... |
|------|------|------|------|------|
- + - + ...
|------|------|------|------|------|
| B0 | B1 | B2 | B3 | ... |
|------|------|------|------|------|
__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)
Avx.DotProduct is a bit out of the ordinary:
left = Vector256.Create(-1.0f, -2.0f, -3.0f, -4.0f, -50.0f, -60.0f, -70.0f, -80.0f);
right = Vector256.Create(1.0f, 2.0f, 3.0f, 4.0f, 50.0f, 60.0f, 70.0f, 80.0f);
result = Avx.DotProduct(left, right, 0b1111_0001); // result = <-30, 0, 0, 0, -17400, 0, 0, 0>
This will actually create 2 dot products of 128-bit vectors: one from the first four elements of left and right, stored in the first element of result, and the same for the last four elements, stored in the 5th element. In other words, it performs a dot product on two 128-bit float vectors independently. It can be visualized as doing the dot products of two four-float vectors separately and simultaneously.
You can control which products are included by using the 4 high-order bits of the third parameter, in reverse order: all ones means include all 4 products (on each 128-bit half). A value of 0b0001 would mean that only the first element's product is included; a value of 0b1010 will multiply the second and fourth:
result = Avx.DotProduct(left, right, 0b1010_0001); // result = <-20, 0, 0, 0, -10000, 0, 0, 0>
If you think of vectors with x, y, z and w components, the order in which you turn the products on or off is thus (w, z, y, x).
The second (low) half of the third parameter byte indicates where to store the dot product results, again in reverse order: 0001 means store the result in the first element of each 128-bit half.
R = DotProduct(A, B, bitMask)
bit mask = b7 b6 b5 b4 0 0 0 1
b4 b5 b6 b7 b4 b5 b6 b7
|------|------|------|------||------|------|------|------|
| A0 | A1 | A2 | A3 || A4 | A5 | A6 | A7 |
|------|------|------|------||------|------|------|------|
* * * * * * * *
|------|------|------|------||------|------|------|------|
| B0 | B1 | B2 | B3 || B4 | B5 | B6 | B7 |
|------|------|------|------||------|------|------|------|
= = ...
0 0 0 0
or + or ... or + or ...
A0*B0 A1*B1 A4*B4 A5*B5
|__________________________||__________________________|
stored in stored in
| |
| |
1 0 0 0 1 0 0 0
|------|------|------|------||------|------|------|------|
| R0 | 0 | 0 | 0 || R4 | 0 | 0 | 0 |
|------|------|------|------||------|------|------|------|
⚠️ You should do some benchmarking before using this instruction; its performance doesn't seem to be too hot.
Floor and Ceiling do what you expect:
var left = Vector256.Create(-2.5f); // <-2.5, -2.5, -2.5, -2.5, -2.5, -2.5, -2.5, -2.5>
var right = Vector256.Create(5.0f); // <5, 5, 5, 5, 5, 5, 5, 5>
result = Avx.Floor(left); // result = <-3, -3, -3, -3, -3, -3, -3, -3>
result = Avx.Ceiling(left); // result = <-2, -2, -2, -2, -2, -2, -2, -2>
In order to have finer control you also have RoundToNearestInteger, RoundToNegativeInfinity etc.
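A quick sketch of the differences (note that RoundToNearestInteger rounds half to even):

var v = Vector256.Create(-2.5f);
var rounded = Avx.RoundToNearestInteger(v);   // rounds half to even: rounded = <-2, -2, -2, -2, -2, -2, -2, -2>
rounded = Avx.RoundToNegativeInfinity(v);     // like Floor: rounded = <-3, -3, -3, -3, -3, -3, -3, -3>
rounded = Avx.RoundToZero(v);                 // truncates: rounded = <-2, -2, -2, -2, -2, -2, -2, -2>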
__m256 _mm256_hadd_ps (__m256 a, __m256 b)
__m256 _mm256_hsub_ps (__m256 a, __m256 b)
var left = Vector256.Create(-2.5f); // <-2.5, -2.5, -2.5, -2.5, -2.5, -2.5, -2.5, -2.5>
var right = Vector256.Create(5.0f); // <5, 5, 5, 5, 5, 5, 5, 5>
result = Avx.HorizontalAdd(left, right); // result = <-5, -5, 10, 10, -5, -5, 10, 10>
result = Avx.HorizontalSubtract(left, right); // result = <0, 0, 0, 0, 0, 0, 0, 0>
HorizontalAdd will add elements 0 and 1 from left, then elements 2 and 3; they get stored in elements 0 and 1 of result. Then it does the same for right and stores the results in elements 2 and 3 of result, and so on for the upper halves...
R = HorizontalAdd(A, B)
|------|------|------|------|------|------|------|------|
| A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 |
|------|------|------|------|------|------|------|------|
|------|------|------|------|------|------|------|------|
| B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
|------|------|------|------|------|------|------|------|
R0 R1 R2 R3 R4 R5 R6 R7
|----------|----------|----------|----------|----------|----------|----------|----------|
| A0 + A1 | A2 + A3 | B0 + B1 | B2 + B3 | A4 + A5 | A6 + A7 | B4 + B5 | B6 + B7 |
|----------|----------|----------|----------|----------|----------|----------|----------|
__m256 _mm256_fmadd_ps etc.
if (Fma.IsSupported)
{
    var resultFma = Fma.MultiplyAdd(left, right, other); // = left * right + other for each element
    resultFma = Fma.MultiplyAddNegated(left, right, other); // = -(left * right) + other for each element
    resultFma = Fma.MultiplySubtract(left, right, other); // = left * right - other for each element
    resultFma = Fma.MultiplyAddSubtract(left, right, other); // even elements (0, 2, ...) like MultiplySubtract, odd elements like MultiplyAdd
}
These instructions combine a multiply with an add or subtract in several variants.
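The classic use is an accumulation loop. A minimal sketch (aVectors and bVectors are assumed to be spans of Vector256<float>, as produced by MemoryMarshal.Cast above, and Fma.IsSupported is assumed to have been checked):

// acc ends up holding, per lane, the sum over i of aVectors[i] * bVectors[i].
Vector256<float> acc = Vector256<float>.Zero;
for (int i = 0; i < aVectors.Length; i++)
{
    acc = Fma.MultiplyAdd(aVectors[i], bVectors[i], acc); // acc = a * b + acc, fused
}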
There are several intrinsics to compare vectors. A set of Sse.Compare... functions exists for 128-bit vectors:
var left128 = Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);
var right128 = Vector128.Create(2.0f, 3.0f, 4.0f, 5.0f);
Vector128<float> compResult128 = Sse.CompareGreaterThan(left128, right128); // compResult128 = <0, 0, 0, 0>
You also have CompareLessThanOrEqual, CompareNotEqual and many more.
If the comparison is false for a given element, the result vector will have a zero in that position. If true, the position will be occupied by a value with all bits set to 1 (which reads as NaN for float and double).
For 256-bit vectors, Avx.Compare(vector a, vector b, flag) will compare both vectors according to the FloatComparisonMode flag given.
left = Vector256.Create(-1.0f, 3.0f, -3.0f, 4.0f, -50.0f, 60.0f, -70.0f, 80.0f);
right = Vector256.Create(0.0f, 2.0f, 3.0f, 2.0f, 50.0f, -60.0f, 70.0f, -80.0f);
var compareResult = Avx.Compare(left, right, FloatComparisonMode.OrderedGreaterThanNonSignaling); // compareResult = <0, NaN, 0, NaN, 0, NaN, 0, NaN>
FloatComparisonMode.OrderedGreaterThanNonSignaling will compare whether elements in left are greater than elements in right. As above, if the comparison is false, the result vector will have a zero in that position; if true, the position will be occupied by a value with all bits set to 1 (which reads as NaN for float and double).
The Ordered... part of the flag's name refers to how NaN values in the vectors are treated; the ...NonSignaling part means that no exceptions are thrown when NaNs occur, although I am not really sure how this works yet [TO BE CONTINUED].
Once you have the comparison result, there are several things you can do with it:
left = Vector256.Create(-1.0f, 3.0f, -3.0f, 4.0f, -50.0f, 60.0f, -70.0f, 80.0f);
right = Vector256.Create(0.0f, 2.0f, 3.0f, 2.0f, 50.0f, -60.0f, 70.0f, -80.0f);
var compareResult = Avx.Compare(left, right, FloatComparisonMode.OrderedGreaterThanNonSignaling); // compareResult = <0, NaN, 0, NaN, 0, NaN, 0, NaN>
int res = Avx.MoveMask(compareResult); // res = 0b10101010 = 0xAA = 170
if (res > 0)
{
    // At least one comparison is true, do something
}
MoveMask will create an int whose bits indicate which elements are true (in reality, it copies each element's highest-order bit, which comes down to the same thing, since true has all bits set). The int lists the elements in reverse order: bit 0 corresponds to element 0, so the binary literal, read left to right, shows the highest element first.
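A small sketch of picking out the matching elements from the mask:

int mask = Avx.MoveMask(compareResult); // here: 0b10101010
for (int i = 0; i < 8; i++)
{
    if ((mask & (1 << i)) != 0)
    {
        // element i of compareResult was true (elements 1, 3, 5 and 7 here)
    }
}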
If you don't need to know which elements satisfy the comparison but just whether any did, you can do:
left = Vector256.Create(-1.0f, 3.0f, -3.0f, 4.0f, -50.0f, 60.0f, -70.0f, 80.0f);
right = Vector256.Create(0.0f, 2.0f, 3.0f, 2.0f, 50.0f, -60.0f, 70.0f, -80.0f);
var compareResult = Avx.Compare(left, right, FloatComparisonMode.OrderedGreaterThanNonSignaling); // compareResult = <0, NaN, 0, NaN, 0, NaN, 0, NaN>
bool anyTrue = !Avx.TestZ(compareResult, compareResult); // TestZ is true only if all elements are zero, so anyTrue = true
if (anyTrue)
{
    // At least one comparison is true, do something
}
You can also use the resulting vector to selectively load vector elements:
left = Vector256.Create(-1.0f, 3.0f, -3.0f, 4.0f, -50.0f, 60.0f, -70.0f, 80.0f);
right = Vector256.Create(0.0f, 2.0f, 3.0f, 2.0f, 50.0f, -60.0f, 70.0f, -80.0f);
var mask = Avx.Compare(left, right, FloatComparisonMode.OrderedGreaterThanNonSignaling); // mask = <0, NaN, 0, NaN, 0, NaN, 0, NaN>
Vector256<float> mixed = Avx.BlendVariable(left, right, mask); // mixed = <-1, 2, -3, 2, -50, -60, -70, -80>
For each element of the third parameter (mask), BlendVariable will pick the corresponding element from the second vector (right in the above snippet) if the mask's value is true; otherwise it will pick the element from the first vector.
In the above snippet, left[0] = -1.0f and right[0] = 0.0f. The mask is 0 (false) at this position, so the result vector's first position gets the value from the first vector: -1.0f.
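Together, Compare and BlendVariable act as a vectorized ternary operator. A sketch (variable names are illustrative) that replaces negative elements with zero:

var values = Vector256.Create(-1.0f, 3.0f, -3.0f, 4.0f, -50.0f, 60.0f, -70.0f, 80.0f);
var zeros = Vector256<float>.Zero;
// The mask is all ones (NaN) wherever the element is negative...
var negativeMask = Avx.Compare(values, zeros, FloatComparisonMode.OrderedLessThanNonSignaling);
// ...so BlendVariable picks the zero there: clamped = <0, 3, 0, 4, 0, 60, 0, 80>
var clamped = Avx.BlendVariable(values, zeros, negativeMask);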
As mentioned above, there are some intrinsics that compare values and return a scalar (int or bool): TestZ, TestC etc. and MoveMask.
There are no trigonometric functions as yet: cosine, sine etc. are all missing. Maybe some others too, but that's the category that caught my eye.
Some benchmarks, with small data sizes (i.e. the data should fit into the L2 cache) and larger sizes (i.e. 10× the L3 cache size) on my machine.
A simple scalar loop:
[BenchmarkCategory("MultiplyAdd"), Benchmark(Baseline = true)]
public unsafe void MultiplyAddScalarFloat()
{
    var sp1 = new ReadOnlySpan<float>(data, 0, numberOfFloatItems);
    var sp12 = new ReadOnlySpan<float>(data2, 0, numberOfFloatItems);
    var sp13 = new ReadOnlySpan<float>(data3, 0, numberOfFloatItems);
    var sp2 = new Span<float>(result, 0, numberOfFloatItems);
    for (int i = 0; i < sp1.Length; i++)
    {
        sp2[i] = sp1[i] * sp12[i] + sp13[i];
    }
}
The same using Fma:
[BenchmarkCategory("MultiplyAdd"), Benchmark]
public unsafe void FmaMultiplyAddvector256Float()
{
    ReadOnlySpan<Vector256<float>> d1 = MemoryMarshal.Cast<float, Vector256<float>>(new Span<float>(data, 0, numberOfFloatItems));
    ReadOnlySpan<Vector256<float>> d2 = MemoryMarshal.Cast<float, Vector256<float>>(new Span<float>(data2, 0, numberOfFloatItems));
    ReadOnlySpan<Vector256<float>> d3 = MemoryMarshal.Cast<float, Vector256<float>>(new Span<float>(data3, 0, numberOfFloatItems));
    Span<Vector256<float>> r = MemoryMarshal.Cast<float, Vector256<float>>(new Span<float>(result, 0, numberOfFloatItems));
    for (int i = 0; i < d1.Length; i++)
    {
        r[i] = Fma.MultiplyAdd(d1[i], d2[i], d3[i]);
    }
}
Comparing both gives:
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i7-4500U CPU 1.80GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-rc1-014190
[Host] : .NET Core 3.0.0-rc1-19456-20 (CoreCLR 4.700.19.45506, CoreFX 4.700.19.45604), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0-rc1-19456-20 (CoreCLR 4.700.19.45506, CoreFX 4.700.19.45604), 64bit RyuJIT
Method | ParamCacheSizeBytes | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
MultiplyAddScalarFloat | 262144 | 20.128 us | 0.5597 us | 0.8377 us | 1.00 | 0.00 |
FmaMultiplyAddvector256Float | 262144 | 6.750 us | 0.1338 us | 0.1186 us | 0.33 | 0.02 |
MultiplyAddScalarFloat | 41943040 | 5,208.768 us | 103.2312 us | 257.0815 us | 1.00 | 0.00 |
FmaMultiplyAddvector256Float | 41943040 | 4,021.671 us | 75.5671 us | 70.6856 us | 0.78 | 0.04 |
As expected, with only a small number of operations inside the loop the memory access times take their toll: only a 22% time reduction for larger data sizes with vector intrinsics, using safe operations. (Although 22% could really be many hours for really huge jobs, of course...)
If, on the other hand, we perform 3 FMA operations per step in the loop and use pointers for the vectors, we get a more consistent speedup: still 1.67x for the bigger data sets (see the source code for the implementation of the test):
Method | ParamCacheSizeBytes | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
ScalarFloatMultipleOps | 262144 | 40.88 us | 1.1768 us | 1.2592 us | 40.44 us | 1.00 | 0.00 |
Vector256FloatMultipleOpsUnsafe | 262144 | 17.75 us | 0.0963 us | 0.0752 us | 17.74 us | 0.43 | 0.02 |
VectorTFloatMultipleOps | 262144 | 18.29 us | 0.1063 us | 0.0942 us | 18.29 us | 0.45 | 0.01 |
ScalarFloatMultipleOps | 41943040 | 7,877.13 us | 142.2414 us | 126.0933 us | 7,851.68 us | 1.00 | 0.00 |
Vector256FloatMultipleOpsUnsafe | 41943040 | 4,659.43 us | 91.8353 us | 176.9355 us | 4,731.19 us | 0.60 | 0.02 |
VectorTFloatMultipleOps | 41943040 | 5,239.05 us | 126.9370 us | 118.7370 us | 5,246.28 us | 0.67 | 0.02 |
The processor will likely prefetch data while it performs operations, I'd assume, effectively hiding the access time.
Vector<T> is slower, as expected: partly (probably) because it doesn't implement Fma, and partly because we are using safe code and the loop is slowed down by range checks.
The Mandelbrot set is an all-time favorite to show off parallel processing. On my machine I get the following results for a 1920 × 1080 image (this is just generating values, not creating a bitmap):
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
FloatMandel | 134.92 ms | 1.1073 ms | 0.9247 ms | 1.00 |
Vector256Mandel | 25.98 ms | 0.1739 ms | 0.1626 ms | 0.19 |
A 5.3x speedup is nice! The vector loop could probably be optimized further, though; I just did a naïve translation of the scalar code.
Some basic integer operations show an average 1.5 - 1.6x speed increase with intrinsics for a single operation, probably illustrating that modern processors are already very good at handling ints and that, again, memory access times perhaps dominate.
Method | NumberOfItems | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
IntAdd | 4096000 | 4.811 ms | 0.0262 ms | 0.0219 ms | 1.00 | 0.00 |
IntAddVector256 | 4096000 | 3.041 ms | 0.0499 ms | 0.0442 ms | 0.63 | 0.01 |
IntXor | 4096000 | 4.834 ms | 0.0838 ms | 0.0700 ms | 1.00 | 0.01 |
IntXorVector256 | 4096000 | 3.028 ms | 0.0457 ms | 0.0405 ms | 0.63 | 0.01 |
IntMultiply | 4096000 | 4.777 ms | 0.1163 ms | 0.0971 ms | 0.99 | 0.02 |
IntMultiplyLowVector256 | 4096000 | 3.013 ms | 0.0380 ms | 0.0337 ms | 0.63 | 0.01 |
IntShiftLeft | 4096000 | 4.057 ms | 0.1107 ms | 0.1036 ms | 0.84 | 0.02 |
IntShiftLeftVector256 | 4096000 | 3.063 ms | 0.0638 ms | 0.0597 ms | 0.64 | 0.01 |
IntMax | 4096000 | 4.757 ms | 0.1295 ms | 0.1272 ms | 0.99 | 0.03 |
IntMaxVector256 | 4096000 | 3.018 ms | 0.0286 ms | 0.0239 ms | 0.63 | 0.01 |
Chaining three ops inside the loop gives:
Method | NumberOfItems | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
IntMultipleOps | 4096000 | 5.430 ms | 0.1067 ms | 0.1048 ms | 1.00 |
IntMultipleOpsvector256 | 4096000 | 3.016 ms | 0.0551 ms | 0.0516 ms | 0.56 |
Only a small additional improvement, so processor optimization probably plays the bigger role.