Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make safe/unsafe foreign calls size dependent #282

Open
WJWH opened this issue Sep 9, 2020 · 9 comments
Open

Make safe/unsafe foreign calls size dependent #282

WJWH opened this issue Sep 9, 2020 · 9 comments

Comments

@WJWH
Copy link

WJWH commented Sep 9, 2020

In https://github.com/haskell/bytestring/blob/master/Data/ByteString/Internal.hs#L747, we import memcpy as an unsafe FFI call, which means that it runs "inline" and will block all other Haskell threads running on the same capability in the threaded runtime. If the Bytestring being copied is very large, this might block the other Haskell threads for longer than is desired.

In the crypto libraries, there are both safe and unsafe imports which are switched between depending on the size of the input. This trick might be useful here as well.

Interested to hear your opinions about this, I can pick it up if this is something desired.

@vdukhovni
Copy link
Contributor

One might say that, for large enough buffers, performance-sensitive applications should not be using strict bytestrings, and use Lazy (chunked) ones instead or streams.

I am not sure that we should introduce automatic switching based on size, but we could perhaps leave the choice to the caller. If we were to provide a switch, the cuttoff buffer size should probably be larger than with SHA and the like, because copying data is much cheaper than computing a cryptographic hash.

@WJWH
Copy link
Author

WJWH commented Sep 9, 2020

I agree that high performance applications probably should use something other than large strict bytestrings and they do already mention that it'll take O(N) time. They don't mention that (for example) cons on a strict bytestring will block all other Haskell threads on the capability for multithreaded programs and this is probably not what most users would expect.

Apparently the difference between safe and unsafe is only a few hundred nanoseconds, so we could set the switchover value suitably high that the safe path would only be called when the overhead is less than a few percent?

@vdukhovni
Copy link
Contributor

Right, I'm just hesitant to assume that the cutoff is comparable on all platforms, so would be fairly conservative with the buffer size (>= 128k? More?) to be sure that we're not too strongly biased to just X86_64 platforms.

Do you have measurements of how long it takes to copy N bytes for a few values of N on X86 and perhaps ARM? How big are the samples that motivated this issue?

@vdukhovni
Copy link
Contributor

Also what is the cost of moving all the other threads off the capability so they get a chance elsewhere while the copy is happening? How does that scale with the number of threads?

@WJWH
Copy link
Author

WJWH commented Sep 9, 2020

I was thinking more >512 MB or something like that 😅. The crypto example is really very low since it's so cpu intensive even for fairly small chunks of data. For memcmp and memcpy I'd expect the cutoff to be much higher. To me, it's not so much about squeezing every last bit of performance out of this, just preventing some worst case scenario's when people are accidentally being stupid.

@vdukhovni
Copy link
Contributor

OK, so you're on board for a fairly conservative cutoff, perhaps not quite that high, but some multiple MB sounds about right. We'll have to see what others think of this idea. I'm not opposed, though it will cost an extra branch on every dispatch, I don't know whether that's ever an issue...

@Bodigrim
Copy link
Contributor

memcpy is used in several functions. I would like to see an analysis which of these functions are mostly used with smaller and which with larger inputs. And benchmarks about performance costs of branching and safe calls.

@hsyl20
Copy link
Contributor

hsyl20 commented Sep 17, 2020

We should really add memcpy/memset/memmove primops into GHC. GHC already has thresholds to inline internal calls (see https://gitlab.haskell.org/ghc/ghc/-/blob/master/compiler/GHC/CmmToAsm/X86/CodeGen.hs#L2194 for memcpy on X86). It could insert unsafe/safe calls with new thresholds if needed.

@vdukhovni
Copy link
Contributor

We should really add memcpy/memset/memmove primops into GHC.

Indeed, see #274

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants