vector dot - inconsistent results #88958

ernest-morariu · 2023-07-16T03:07:38Z

void Main()
{
	double a1 = 1.4, b1 = 4.3;
	double a2 = 1.5, b2 = 5.3;
	double a3 = 1.6, b3 = 6.3;
	double a4 = 1.7, b4 = 7.3;

	// Vector.Dot
	// Vector<double>.Count is 4 on my system Win10, x64
	Vector<double> a = new Vector<double>(new double[] { a1, a2, a3, a4});
	Vector<double> b = new Vector<double>(new double[] { b1, b2, b3, b4});
	double vdot = Vector.Dot(a, b);
	
	// common dot
	double cdot = a1 * b1 + a2 * b2 + a3 * b3 + a4 * b4;
	
	// vdot is 36.46
	// cdot is 36.459999999999994
	
	// Is this normal?
}

hez2010 · 2023-07-16T09:29:07Z

I think it is expected because the accuracy between scalar floating instructions and SIMD instructions is different.
You should always be careful about floating-point accuracy when you work with SIMD.

1 reply

ernest-morariu Jul 16, 2023
Author

... the accuracy between scalar floating instructions and SIMD instructions is different.

Can you please point me to a link that explains this statement?

ChatGPT says: "SIMD instructions, when implemented correctly, should not introduce additional numerical errors compared to scalar computation."

tannergooding · 2023-07-16T13:43:37Z

Collaborator

The purely mathematical definition is indeed:

$$\mathbf {a} \cdot \mathbf {b} =\sum _{i=1}^{n}a_{i}b_{i}=a_{1}b_{1}+a_{2}b_{2}+\cdots +a_{n}b_{n}$$

If you were to define this programmatically, you'd end up with something like the following:

public static T DotProduct<T>(ReadOnlySpan<T> a, ReadOnlySpan<T> b)
    where T : INumber<T>
{
    if (a.Length != b.Length)
    {
        throw new ArgumentOutOfRangeException();
    }

    // Element-wise products first, then reduce them to a single value.
    T[] product = Multiply(a, b);

    // Explicit type argument: before C# 14, T cannot be inferred from a T[] argument.
    return Sum<T>(product);
}

public static T[] Multiply<T>(ReadOnlySpan<T> a, ReadOnlySpan<T> b)
    where T : INumber<T>
{
    if (a.Length != b.Length)
    {
        throw new ArgumentOutOfRangeException();
    }

    T[] result = new T[a.Length];

    for (int i = 0; i < result.Length; i++)
    {
        result[i] = a[i] * b[i];
    }

    return result;
}

For the naive case, the implementation of Sum ends up being the following:

public static T Sum<T>(ReadOnlySpan<T> a)
    where T : INumber<T>
{
    T result = T.Zero;

    for (int i = 0; i < a.Length; i++)
    {
        result += a[i];
    }

    return result;
}

Floating-point values, however, are not infinitely precise: every operation rounds its result. Because of that rounding, floating-point addition is not associative, which is to say that (a + b) + c can differ from a + (b + c). The naive Sum defined above therefore forces one long, linear dependency chain, and that effectively prevents optimization of the code, since any reordering can observably change the result.

Because of this, most hardware implementations did not settle on the naive, left-to-right Sum. Instead, you'll find that hardware and software have both largely standardized on doing a series of pairwise operations. This effectively turns the algorithm into a binary tree, which significantly shortens the dependency chain and allows a broad range of optimizations that can use SIMD, SIMT, or even regular multi-threading/parallelization techniques.
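To make the non-associativity concrete, here is a minimal sketch with three ordinary doubles. It assumes .NET Core 3.0 or later, where `double.ToString` prints the shortest round-trippable representation; the variable names are only illustrative.

```csharp
using System;
using System.Globalization;

// None of these three values is exactly representable in binary floating-point.
double a = 0.1, b = 0.2, c = 0.3;

// Same operands and same operations; only the grouping differs.
double leftGrouped = (a + b) + c;
double rightGrouped = a + (b + c);

Console.WriteLine(leftGrouped.ToString(CultureInfo.InvariantCulture));  // 0.6000000000000001
Console.WriteLine(rightGrouped.ToString(CultureInfo.InvariantCulture)); // 0.6
Console.WriteLine(leftGrouped == rightGrouped);                         // False
```

Each addition rounds once, so the two groupings round different intermediate values and end up one unit in the last place apart.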

That is to say, the "standard" algorithm is actually more akin to the following:

public static T Sum<T>(ReadOnlySpan<T> a)
    where T : INumber<T>
{
    T result;

    if (a.Length > 2)
    {
        int halfLength = a.Length / 2;
        result = Sum(a[..halfLength]) + Sum(a[halfLength..]);
    }
    else if (a.Length == 2)
    {
        result = a[0] + a[1];
    }
    else if (a.Length == 1)
    {
        result = a[0];
    }
    else
    {
        result = T.Zero;
    }

    return result;
}

You can see this in the formal definitions of addv (Add Across, Arm64), dpps/dppd (Dot Product, x86/x64), etc.: they are all specified as a series of pairwise operations so that the hardware can accelerate them.

1 reply

tannergooding Jul 16, 2023
Collaborator

-- Noting that I may have made a mistake in the pairwise algorithm. It's early and I didn't think too deeply 😅

The general intent is that you add neighbors to each other recursively: (a[0] + a[1]), (a[2] + a[3]), and so on. This halves the length. You then do it again on those intermediates until only one result is left.

So for 4 elements:

var tmp1 = (a[0] + a[1]);
var tmp2 = (a[2] + a[3]);
return tmp1 + tmp2;

Or for 8 elements:

var tmp1 = (a[0] + a[1]);
var tmp2 = (a[2] + a[3]);
var tmp3 = tmp1 + tmp2;

var tmp4 = (a[4] + a[5]);
var tmp5 = (a[6] + a[7]);
var tmp6 = tmp4 + tmp5;

return tmp3 + tmp6;

etc
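The neighbor-by-neighbor reduction described above can also be written as a loop. This is only a sketch: `PairwiseSum` is a hypothetical helper, not a .NET API, and carrying an unpaired last element into the next round is an assumption the thread does not spell out.

```csharp
using System;

// Element-wise products of the operands from the original question.
double[] products = { 1.4 * 4.3, 1.5 * 5.3, 1.6 * 6.3, 1.7 * 7.3 };

// Left-to-right order, as in the scalar expression from the question.
double sequential = ((products[0] + products[1]) + products[2]) + products[3];
double pairwise = PairwiseSum(products);

Console.WriteLine(sequential);
Console.WriteLine(pairwise);

// Hypothetical helper: adds neighbouring elements, halving the length each
// round, until a single value is left.
static double PairwiseSum(ReadOnlySpan<double> values)
{
    if (values.IsEmpty)
    {
        return 0.0;
    }

    // Work on a copy so the caller's data is left untouched.
    double[] buffer = values.ToArray();
    int length = buffer.Length;

    while (length > 1)
    {
        int pairs = length / 2;

        // buffer[i] receives the sum of the i-th pair of neighbours.
        for (int i = 0; i < pairs; i++)
        {
            buffer[i] = buffer[2 * i] + buffer[2 * i + 1];
        }

        if ((length & 1) != 0)
        {
            // Assumption: an unpaired last element is carried over unchanged.
            buffer[pairs] = buffer[length - 1];
            length = pairs + 1;
        }
        else
        {
            length = pairs;
        }
    }

    return buffer[0];
}
```

For these operands the thread reports 36.459999999999994 for the left-to-right order and 36.46 for the pairwise order.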

Answer selected by ernest-morariu

ernest-morariu · 2023-07-16T15:35:47Z

Author

Hi @tannergooding,

Thank you for your very detailed answer.

In fact, by simply replacing double cdot = a1 * b1 + a2 * b2 + a3 * b3 + a4 * b4; with double cdot = (a1 * b1 + a2 * b2) + (a3 * b3 + a4 * b4);, I get the same result as the vectorized one.

Clearly, no single grouping is guaranteed to produce better results than the others; everything depends on the operands being summed. This is why compensated summation algorithms such as Kahan and Neumaier exist.

Since pairwise grouping is more SIMD-friendly, you should probably consider changing the left-to-right grouping used by ordinary scalar summation to a pairwise one, simply for the sake of producing consistent results.
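For reference, a minimal sketch of Neumaier's variant of Kahan summation, one of the compensated algorithms mentioned above. `NeumaierSum` is a hypothetical helper, not a .NET API, and the input is picked so that a plain left-to-right sum loses both small terms.

```csharp
using System;

// The exact sum is 2, but 1.0 is far below the precision of 1e100.
double[] input = { 1.0, 1e100, 1.0, -1e100 };

// Plain left-to-right summation for comparison.
double naive = 0.0;
foreach (double x in input)
{
    naive += x;
}

Console.WriteLine(naive);              // 0
Console.WriteLine(NeumaierSum(input)); // 2

// Hypothetical helper: keeps a running compensation term holding the
// low-order bits that each addition rounds away.
static double NeumaierSum(ReadOnlySpan<double> values)
{
    double sum = 0.0;
    double compensation = 0.0;

    foreach (double value in values)
    {
        double t = sum + value;

        if (Math.Abs(sum) >= Math.Abs(value))
        {
            // The low-order bits of value were lost in the addition.
            compensation += (sum - t) + value;
        }
        else
        {
            // The low-order bits of sum were lost in the addition.
            compensation += (value - t) + sum;
        }

        sum = t;
    }

    // Apply the accumulated correction once, at the end.
    return sum + compensation;
}
```

The compensation term costs a few extra operations per element and makes the loop harder to vectorize, which is the usual trade-off against plain or pairwise summation.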

0 replies

ernest-morariu · 2023-07-16T15:36:05Z

Author

Thank you everybody.

0 replies
