Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance optimizations #530

Merged
merged 8 commits into from
Jul 25, 2024
Merged

Performance optimizations #530

merged 8 commits into from
Jul 25, 2024

Conversation

tomaswolf
Copy link
Member

A series of performance optimizations:

  • Manual optimizations in the ChaCha20-Poly1305 cipher implementation (loop unrolling, use int instead of long where possible, avoid unnecessary method calls).
  • Optimize writing ints/longs into buffers.
  • Avoid copying buffers if possible. If we prepare an SSH packet buffer, we should avoid copying it later on if possible.
  • Streamline SFTP OutputStreams. Give them a transferFrom operation to be able to read directly into an internal buffer, and internally read-ahead into a second buffer while the first one is being sent.
  • Add a convenience operation SftpClient.put() for uploading.
  • Avoid using the AES implementation from Bouncy Castle (typically a Java implementation). Prefer the one from SunJCE, which uses native code.

@gnodet
Copy link
Contributor

gnodet commented Jul 21, 2024

In chacha20-poly1305, would it make sense to use a different mechanism for unpackIntLE and packIntLE, such as using Unsafe or VarHandle ?

I've also seen that there is a multithreaded implementation of chacha20-poly1305 which seems to be much faster, not sure if there's anything that could be useful.

@gnodet
Copy link
Contributor

gnodet commented Jul 21, 2024

In chacha20-poly1305, would it make sense to use a different mechanism for unpackIntLE and packIntLE, such as using Unsafe or VarHandle ?

I've also seen that there is a multithreaded implementation of chacha20-poly1305, not sure if there's anything that could be useful.

Also chacha20-poly1305 is supported since JDK 11. How do both performs ?

@tomaswolf
Copy link
Member Author

Haven't looked at further optimizations. Most of the time is spent in Poly1305.processBlock() anyway. We could continue looking at trying to make this cipher and mac faster, but for now I think a 40% improvement and being nearly as fast as native AES (in my benchmarks) is good enough.

I don't think the Java built-in ChaCha20-Poly1305 can be used for SSH. SSH uses a 64bit nonce and 64bit counter, while the JDK uses a 96bit nonce and 32bit counter.

@tomaswolf
Copy link
Member Author

In chacha20-poly1305, would it make sense to use a different mechanism for unpackIntLE and packIntLE, such as using Unsafe or VarHandle ?

VarHandle doesn't exist in Java 8. But ByteBuffer.wrap(someByteArray).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer() does.

However, using IntBuffer.put() instead of packIntLE(), and processing ints via IntBuffers, makes matters
worse on Java 8 (a slow-down of 20-30% compared to the version in this PR at commit e152cc3). Benchmarked
with an older JDK 8 (1.8.0_201) and with a brand-new 1.8.0_422.

On Java 11, benchmarking shows a speed improvement of 5%, on Java 17 of 9%. (Again, with commit e152cc3 as baseline.)

The problem is that on Java 8 the ByteBufferAsIntBufferL implementation ends up doing exactly the same as
our packLE/unpackLE routines. So trying to process ints for whole blocks is going to do lots of conversions
to assemble bytes into ints, or split an int into its four bytes. So it's definitely not worth it on Java 8.

On Java 11 or newer, ByteBufferAsIntBufferL is much more efficient and can write or read whole ints.
Only then does such a change make sense.

But we could make the choice at run-time, based on the Java version we're running on, using IntBuffers only
for Java >= 11. Or we could try a multi-release JAR for sshd-common (and sshd-osgi).

Either way, I'd prefer not to attempt that in this PR. If we do something like this, let's do it in some
later commit.

...multithreaded implementation of chacha20-poly1305...

The idea is that one can pre-compute the ChaCha20 key stream asynchronously. This may indeed give some
improvements for file transfers. An implementation might be not quite trivial, though, and it may mean
that we'd need at least one extra thread for every connection that uses ChaCha20-Poly1305. Which might
be a problem for a server, or also for a client doing many connections in parallel. Maybe it would need
another fixed-size thread pool, and fall back to compute the key stream synchronously if not precomputed
yet when it's needed. In any case, not trivial to get right. The Poly1305 mac computation cannot be offloaded
in that way.

@tomaswolf
Copy link
Member Author

I don't think the Java built-in ChaCha20-Poly1305 can be used for SSH.

Indeed it cannot, but mainly because the Poly1305 setup is done differently. The IETF version from RFC 8439 is just not the same as the SSH version of ChaCha20-Poly1305. But on Java 11 or newer, one can use the simple "ChaCha20" cipher from the standard SunJCE provider as a replacement for our ChaChaEngine, and do the special Poly1305 setup as we do now. The 96-32bit vs. 64-64bit difference can be dealt with. Unfortunately SunJCE does not provide a plain Poly1305 mac (see JDK-8253394), so that one would still be hand-written code.

Such a ChaCha20-Poly1305 based on javax.crypto.Cipher.getInstance("ChaCha20", "SunJCE"); and our Poly1305Mac is again faster than any games we can play with IntBuffer, so if one wants to have different implementations per Java version, I'd take this approach.

For Java 8, I have not found anything better than the version from commit e152cc3.

@tomaswolf
Copy link
Member Author

@gnodet: could we do a bug fix release first before merging this? See the mailing list.

@gnodet
Copy link
Contributor

gnodet commented Jul 22, 2024

@gnodet: could we do a bug fix release first before merging this? See the mailing list.

Sure, I was assuming you wanted to include that one too.

@gnodet gnodet added this to the 2.14.0 milestone Jul 22, 2024
Unroll the permutation loop of ChaCha20 and simplify the crypt() method
to work on bytes instead of ints. Unroll and optimize the pack/unpack
methods in ChaCha20 and in Poly1305. Don't use Integer.toUnsignedLong():
avoid the extra method call by doing (int & 0xFFFF_FFFFL) inline. In
Poly1305.processBlock(), use ints instead of longs where possible
(variables t0-t3).

All this brings a speed-up of about 40% for the encryption/decryption.
Of the time spent in ChaCha20-Poly1305 about one third is spent in the
ChaChaEngine and two thirds in Poly1305.processBlock().

Bug: apache#524
Simplify the implementation, and make sure it works when called
directly with a non-zero timeout.
For ByteArrayBuffer, optimize the getInt()/putInt() operations to work
directly in the buffer without copying into/from a local work area.
Simplify some of the operations; in particular, avoid extending bytes
to long and then shifting.

Bug: apache#524
In ChannelAsyncOutputStream, do not create a new SSH packet if the
caller already had passed in an SSH packet buffer, and we're writing
it fully. This avoids allocating a new buffer, which gives GC less
to do, and it also avoid needlessly copying the data around. Instead
the prepared SSH packet buffer is encoded and written directly.

Bug: apache#524
Give SftpOutPutStreamAsync a transferFrom operation to be able to read
directly into a prepared SSH packet uffer. Previously, we always had
to read into a general byte buffer, and then copy into the SSH packet.
This extra data copy can be avoided. If the copy buffer size is such
that the data fits into a single SSH packet, then this packet buffer
can be encoded and sent directly without any further data copying.

Simplify the implementation: since the stream always uses a PacketBuffer
the code paths for other general byte buffers can simply be removed.

Also enable reading the next buffer while the previous one is being
sent. Use two buffers alternatingly; one being sent, the other being
filled.

Use this implementation also in SftpRemotePathChannel in its
transferFrom() implementation. _Do_ close the stream there, otherwise
the final ACKs may not be checked.

Bug: apache#524
A utility method to upload a file via SFTP.
Bouncy Castle apparently only has native implementations in its "LTS"
releases. BC 1.78.1 has none. SunJCE uses native code.

The "security registrar" architecture in Apache MINA sshd prefers
registered providers over the platform providers, and it automatically
registers Bouncy Castle if it's present. So with Bouncy Castle on the
classpath and an SSH connection for which an AES cipher was specified,
Apache MINA sshd would use the much slower Bouncy Castle AES.

Add a new "SunJCEWrapper" registrar that brings the SunJCE AES and
HmacSHA* to the front of the list, so that they are used even if Bouncy
Castle is also registered.

The new registrar can be disabled via a system property as usual, and
it is only enabled if the JVM indeed has a "SunJCE" security provider
registered.

See also https://issues.apache.org/jira/browse/SSHD-719 (issue closed
as "not needed").

Bug: apache#524
The update() implementation copied _all_ bytes (successively) first into
an internal 16-byte buffer and then processed that buffer. This is no
needed if the input is long.

Use the internal 16-byte buffer only for inputs shorter than 16 bytes,
or if there is a leftover of less than 16 bytes at the end of a long
input. In between process 16-byte chunks directly from the input byte
array.

For 32kB inputs this saves us some 2048 calls to System.arraycopy()
copying all those 32kB. The speedup is minimal but noticeable in
benchmarking.

Bug: apache#524
@tomaswolf tomaswolf merged commit 2b93f5c into apache:master Jul 25, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants