Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Runtime.Intrinsics.X86 #28

Open
macaba opened this issue Sep 9, 2019 · 12 comments
Open

System.Runtime.Intrinsics.X86 #28

macaba opened this issue Sep 9, 2019 · 12 comments
Labels
enhancement New feature or request help wanted Extra attention is needed pinned This issue will never be considered stale

Comments

@macaba
Copy link

macaba commented Sep 9, 2019

Hello,

My library that uses NaCl.Core has 2 variants, one that uses NaCl.Core, and the other that uses the libsodium native library. The libsodium variant runs 4 times faster, but has deployment pitfalls where you have to ensure the correct native file is used for the processor architecture and OS. I've worked around those pitfalls but it got me wondering about optimization in NaCl.Core as a fully managed solution has less friction.

I was wondering if you had looked at the System.Runtime.Intrinsics.X86 namespace in .NET Core 3.0?

I am new to intrinsics so don't consider the following to be authoritative but I thought I'd present my experience as a data point for using intrinsics.

XOR proof-of-concept

I did a proof of concept on an easy bit to update - the XOR in Snuffle.cs (this doesn't improve performance much).

At the top:

#if NETCOREAPP3_0
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;
#endif

In Process:

            using (var owner = MemoryPool<byte>.Shared.Rent(BLOCK_SIZE_IN_BYTES))
            {
                for (var i = 0; i < numBlocks; i++)
                {
                    ProcessKeyStreamBlock(nonce, i + InitialCounter, owner.Memory.Span);

#if NETCOREAPP3_0
                    if (i == numBlocks - 1)
                        Xor(output, input, owner.Memory.Span, length % BLOCK_SIZE_IN_BYTES, offset, i); // last block
                    else
                        XorIntrinsic(output, input, owner.Memory.Span, BLOCK_SIZE_IN_BYTES, offset, i);
#else
                    if (i == numBlocks - 1)
                        Xor(output, input, owner.Memory.Span, length % BLOCK_SIZE_IN_BYTES, offset, i); // last block
                    else
                        Xor(output, input, owner.Memory.Span, BLOCK_SIZE_IN_BYTES, offset, i);
#endif

                    owner.Memory.Span.Clear();
                }
            }

New method at the bottom:

#if NETCOREAPP3_0
        private static unsafe void XorIntrinsic(Span<byte> output, ReadOnlySpan<byte> input, ReadOnlySpan<byte> block, int len, int offset, int curBlock)
        {
            var blockOffset = curBlock * BLOCK_SIZE_IN_BYTES;

            //To do - input length validation checks
            fixed (byte* pOut = output, pInA = input, pInB = block)
            {
                byte* pOutEnd = pOut + offset + blockOffset + len;
                byte* pOutCurrent = pOut + offset + blockOffset;
                byte* pInACurrent = pInA + blockOffset;
                byte* pInBCurrent = pInB;

                while (pOutCurrent + 8 <= pOutEnd)
                {
                    var inputAVector = Avx.LoadVector256(pInACurrent);
                    var inputBVector = Avx.LoadVector256(pInBCurrent);
                    var outputVector = Avx2.Xor(inputAVector, inputBVector);
                    Avx.Store(pOutCurrent, outputVector);

                    pOutCurrent += 8;
                    pInACurrent += 8;
                    pInBCurrent += 8;
                }
            }
        }
#endif

That compiled and benchmarked successfully, running approximately the same speed, perhaps a tiny amount faster.

Conclusion

It seems that to fully implement intrinsics in the Snuffle.cs class (and hence get large performance gains) ProcessKeyStreamBlock would be the place to start, and hence the implementation of u0.h/u1.h/u4.h/u8.h in this directory.

Whilst I don't have the bandwidth for a pull request that is a mass update to include intrinsics, if you were to establish a style/methodology for including intrinsics, I could contribute in parts when time allows.

@daviddesmet
Copy link
Owner

daviddesmet commented Sep 11, 2019

Hello Mark,

I haven’t looked into any new .NET Core 3.0 features yet but I find very interesting the hardware intrinsics one, it is the first time I heard about it.

Looking at your PoC, you seem to be more into the subject... I could create a branch where we could experiment on those features until is stable and ready for merge (master branch will help as a reference point for the benchmarks), would you be interested on it?

@macaba
Copy link
Author

macaba commented Sep 11, 2019

That sounds like a good plan, I am interested.

@daviddesmet
Copy link
Owner

Great! 😃 I've created a new branch named intrinsics and added you as collaborator.

@macaba
Copy link
Author

macaba commented Sep 15, 2019

A little update.

Before going too far into this, I wanted to find out if intrinsics will give any worthwhile improvement, so I picked ChaCha20 to fully convert to intrinsics.

The first thing was to look at how much speedup there is from simply using .NET Core 3.0 without intrinsics:
image

A tiny amount faster, perhaps some gains could be made with more Spans/ref.

After adding intrinsics:
image

From 16.26us to 6.829us, a ratio of 0.42. There are still improvements to be done but this is an encouraging sign that this is a worthwhile development task!

@macaba
Copy link
Author

macaba commented Sep 15, 2019

image

Now down to 2.3us, a ratio of 0.14. The unit tests have been extremely valuable in ironing out implementation bugs. (Particularly the RFC and Wycheproof test vectors)

Update: 1.5us.

@daviddesmet
Copy link
Owner

That’s a very good improvement, in speed and allocation! Looks indeed very promising!

Very well done!

@macaba
Copy link
Author

macaba commented Sep 15, 2019

Thanks for the feedback!

I was thinking about what the ultimate goal is - it's to get near libsodium performance which is probably the fastest implemention. So I hacked together a very thin DllImport libsodium wrapper and benchmarked it:

image

For small packet sizes (which is the region I'm interested in) - it's very close. For large sizes, it's at least within the same magnitude of performance (where it was significantly slower before).

My next area of intrinsic experimentation is Poly1305 then I can benchmark the ChaCha20Poly1305 suite.

@daviddesmet
Copy link
Owner

I’m curious, do you have some benchmarks without Intrinsics vs libsodium?, would be interesting to see where it stands.

@macaba
Copy link
Author

macaba commented Sep 16, 2019

This micro benchmark run of ChaCha20 encrypt in NaCl.Core and libsodium is in 3 runtimes: .NET 4.7.2, Core 2.2, and Core 3.0.
Core 3.0 has intrinsics enabled, the other 2 don't (obviously).
Core 3.0 performance without intrinsics is very similar to Core 2.2, so this gives us a nice idea of the spread of libsodium vs. vanilla vs. intrinsics.

image

Interpretation of results:

  • libsodium runs at the same speed regardless of runtime.
  • (Using NET 4.7.2) libsodium runs 9.3x faster [size 100], 22.8x faster [size 1000], 45.8x faster [size 10000].
  • (Using Core 2.2) libsodium runs 8.5x faster [size 100], 19.5x faster [size 1000], 38.7x faster [size 10000].
  • (Using Core 3.0 with intrinsics) libsodium runs 1.2x faster [size 100], 1.7x faster [size 1000], 2.9x faster [size 10000].

@daviddesmet
Copy link
Owner

daviddesmet commented Sep 16, 2019

Thanks a lot for sharing the benchmarks, it gives a really good point of reference on where it stands right now and with intrinsics vs libsodium. The difference from vanilla vs intrinsics is huge!

@stale
Copy link

stale bot commented Nov 15, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs.
Thank you for your contributions.

@stale stale bot added the stale This issue had no activity for two months label Nov 15, 2019
@daviddesmet daviddesmet added the pinned This issue will never be considered stale label Nov 20, 2019
@stale stale bot removed the stale This issue had no activity for two months label Nov 20, 2019
@daviddesmet
Copy link
Owner

Keeping the issue alive!

@daviddesmet daviddesmet linked a pull request Sep 18, 2020 that will close this issue
@daviddesmet daviddesmet added enhancement New feature or request help wanted Extra attention is needed labels Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed pinned This issue will never be considered stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants