I function optimisation #9

clausecker · 2024-06-20T11:46:17Z

clausecker
Jun 20, 2024

With BMI1, the I function can leverage the andn instruction as such:

I(b, c, d) = c ^ (b | ~d) = ~(c ^ (~b & d)) = -1 - (c ^ (~b & d))

The -1 can be absorbed into the round key, and instead of adding I(b, c, d) we subtract c ^ (~b & d).
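
For anyone who wants to sanity-check the identity, here is a minimal C sketch. The names `md5_i`, `md5_i_andn` and `check_i_identity` are made up for illustration and do not come from either code base. It compares both forms on pseudo-random inputs using wrapping 32-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* I as defined for MD5: c ^ (b | ~d). */
static uint32_t md5_i(uint32_t b, uint32_t c, uint32_t d) {
    return c ^ (b | ~d);
}

/* The andn-friendly part: c ^ (~b & d), the bitwise complement of I. */
static uint32_t md5_i_andn(uint32_t b, uint32_t c, uint32_t d) {
    return c ^ (~b & d);
}

/* Hypothetical helper: returns 1 if I == -1 - (c ^ (~b & d))
 * holds for every sampled input, 0 otherwise. */
static int check_i_identity(void) {
    uint32_t x = 0x12345678u;  /* simple LCG state, for illustration only */
    for (int n = 0; n < 100000; n++) {
        uint32_t b = (x = x * 1664525u + 1013904223u);
        uint32_t c = (x = x * 1664525u + 1013904223u);
        uint32_t d = (x = x * 1664525u + 1013904223u);
        uint32_t lhs = md5_i(b, c, d);
        /* -1 - y, computed modulo 2^32 */
        uint32_t rhs = (uint32_t)(0u - 1u - md5_i_andn(b, c, d));
        if (lhs != rhs)
            return 0;
    }
    return 1;
}
```

Calling `check_i_identity()` from a test harness should return 1, since `~y` and `-1 - y` are the same value in two's complement.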

This improved performance slightly.

Note that mov instructions between registers are free on modern CPUs as they get turned into register renames. So optimising for the fewest mov instructions is generally not very useful.

animetosho · 2024-06-21T01:56:33Z

animetosho
Jun 21, 2024
Maintainer

That's a neat trick - thanks for sharing!

Saving a mov can still help the CPU's front-end, but the effect is likely limited since most CPUs that support BMI1 also support move elimination. I do recall measuring a slight gain on AMD Jaguar (which is probably the only CPU here without move elimination).

> This improved performance slightly.

On a similar note, I imagine Jaguar would be one of the few cores where you'd be able to measure a gain, since the trick doesn't shorten the dependency chain on the B input.
Is this what you observe? What CPU did you test this on?


clausecker · 2024-06-21T10:53:49Z

clausecker
Jun 21, 2024
Author

I tested on Tiger Lake. See here for my code. I think optimising I improved performance by a percent or two.

Yeah, it doesn't shorten the dependency chain, but it reduces the number of µops running alongside the critical path, which allows other instructions to run earlier.

1 reply

animetosho Jun 23, 2024
Maintainer

Interesting result. I don't have access to Tiger Lake unfortunately, but for a performance core like that, it's surprising to hear that you got a 1-2% gain.
My tests on performance cores have typically been within 2% of the theoretical limit (4.5 cpb), even without the use of BMI1.

I tested this on Golden Cove (Tiger Lake successor) and didn't see any measurable difference (4.57 cpb in either case).
On AMD Jaguar, I saw a gain from 4.73 cpb to 4.72 cpb, which is surprisingly small.
I wasn't able to get stable figures on my AMD Piledriver, but there didn't seem to be much of a difference.

In ParPar, I do two concurrent MD5s - in such a case, this trick yields a much better gain on less performance-oriented cores.
But thanks for reporting your results regardless!

By the way, I should probably mention that the AVX-512 variant is undesirable on Zen4 due to a 2 cycle latency on vector bit-rotate. I've updated the info file to mention this.

clausecker · 2024-06-23T13:37:43Z

clausecker
Jun 23, 2024
Author

My baseline implementation has for I:

	mov	$-1, %ebp
	xor	\d, %ebp
	or	\b, %ebp
	xor	\c, %ebp
	add	$\k, \a			// a + k[i]
	add	((\m)%16*4)(%rsi), \a	// a + k[i] + m[g]
	add	%ebp, \a		// a + k[i] + m[g] + f
	rol	$\s, \a
	add	\b, \a

I use the mov $-1, \f; xor \d, \f sequence rather than mov \d, \f; not \f because the move-immediate instruction is not on the critical path, and some microarchitectures (like Skylake) have move elimination disabled in a microcode patch, which leads to slightly better performance for the former.

I'm not exactly sure how your code works, but the BMI1-enabled sequence instead has

	andn	\d, \b, %ebp
	add	$\k - 1, \a		// a + k[i] - 1
	add	((\m)%16*4)(%rsi), \a	// a + k[i] + m[g] - 1
	xor	\c, %ebp
	sub	%ebp, \a		// a + k[i] + m[g] + f
	rol	$\s, \a
	add	\b, \a

saving two µops off the critical path and some front end bandwidth.
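
As a rough cross-check, the two sequences can be mirrored in C and compared. This is only a sketch under assumptions: the function names (`rotl32`, `step_i_baseline`, `step_i_andn`, `steps_agree`) are hypothetical, the inputs are pseudo-random rather than real MD5 state, and the compiler is free to emit different instructions than the assembly above.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left; s must be in 1..31 (the round 4 shifts are 6, 10, 15, 21). */
static uint32_t rotl32(uint32_t x, unsigned s) {
    return (x << s) | (x >> (32 - s));
}

/* Mirrors the baseline sequence: f = c ^ (b | ~d), then a += k + m + f. */
static uint32_t step_i_baseline(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                                uint32_t k, uint32_t m, unsigned s) {
    uint32_t f = 0xFFFFFFFFu ^ d;  /* mov $-1 ; xor d */
    f |= b;                        /* or b */
    f ^= c;                        /* xor c */
    a += k;
    a += m;
    a += f;
    return rotl32(a, s) + b;
}

/* Mirrors the BMI1 sequence: f = c ^ (~b & d), then a += (k - 1) + m - f. */
static uint32_t step_i_andn(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t k, uint32_t m, unsigned s) {
    uint32_t f = ~b & d;           /* andn */
    a += k - 1u;                   /* round key with the -1 folded in */
    a += m;
    f ^= c;                        /* xor c */
    a -= f;                        /* sub instead of add */
    return rotl32(a, s) + b;
}

/* Hypothetical helper: returns 1 if both variants agree on all sampled inputs. */
static int steps_agree(void) {
    static const unsigned shifts[4] = {6, 10, 15, 21};
    uint32_t x = 0xC0FFEE11u;      /* simple LCG state, for illustration only */
    for (int n = 0; n < 100000; n++) {
        uint32_t v[6];
        for (int i = 0; i < 6; i++)
            v[i] = (x = x * 1664525u + 1013904223u);
        unsigned s = shifts[n % 4];
        if (step_i_baseline(v[0], v[1], v[2], v[3], v[4], v[5], s) !=
            step_i_andn(v[0], v[1], v[2], v[3], v[4], v[5], s))
            return 0;
    }
    return 1;
}
```

Both functions should return the same value for any input, because all additions wrap modulo 2^32 and the order in which the terms are accumulated does not matter.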

4 replies

animetosho Jun 23, 2024
Maintainer

> leading to slightly better performance for the former.

That's strange. Without move elimination, both are two instructions. With move elimination, the latter is clearly superior. So the latter should be the same or better than the former?

> saving two µops off the critical path and some front end bandwidth.

I was comparing old BMI1 code against new BMI1 code, in which case, it only saves one instruction.
Comparing new BMI1 with no BMI1 is a saving of two instructions, but they aren't on the critical path.

Are you able to lock your CPU's frequency and obtain cycles per byte figures? This can help you determine how close you are to the limit.
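
For reference, the conversion from a throughput measurement to cycles per byte is just a division, sketched below in C. The function names and the example frequency are assumptions for illustration; only the 4.5 cpb limit and the 4.57 cpb figure come from this thread. Since MD5 processes a 64-byte block in 64 steps, a cpb figure is numerically the same as the average number of cycles per step.

```c
#include <assert.h>

/* Cycles per byte from a locked core frequency (Hz) and measured throughput (bytes/s). */
static double cycles_per_byte(double freq_hz, double bytes_per_second) {
    return freq_hz / bytes_per_second;
}

/* Cycles spent per 64-byte MD5 block at a given cpb. */
static double cycles_per_block(double cpb) {
    return cpb * 64.0;
}

/* How far a measured cpb is above a reference limit, in percent. */
static double percent_over_limit(double measured_cpb, double limit_cpb) {
    return (measured_cpb - limit_cpb) / limit_cpb * 100.0;
}
```

For example, a hypothetical core locked at 3.0 GHz hashing 600 MB/s (600e6 bytes/s) works out to 5.0 cpb, and 4.57 cpb is roughly 1.6% above the 4.5 cpb limit mentioned earlier.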

clausecker Jun 23, 2024
Author

Note that moving -1 to ebp is not on the critical path, so it can be executed in parallel with all the other stuff. Effectively, it does not contribute to the total latency. But a non-eliminated move does.

> I was comparing old BMI1 code against new BMI1 code, in which case, it only saves one instruction.
> Comparing new BMI1 with no BMI1 is a saving of two instructions, but they aren't on the critical path.

I observed the 1% or so change when changing just that bit while I was developing the BMI1-enhanced variant.

> Are you able to lock your CPU's frequency and obtain cycles per byte figures? This can help you determine how close you are to the limit.

I'll try to figure that out.

animetosho Jun 23, 2024
Maintainer

> Effectively, it does not contribute to the total latency. But a non-eliminated move does.

Hmm, how does mov \d, \f contribute to latency? The D input isn't on the critical path, and can execute in parallel as well?
Perhaps there's a uArch you're testing on that has a scheduler issue where it can't recognise that?

clausecker Jun 23, 2024
Author

Yes, it's not on the critical path and it should not matter, even when not eliminated. Perhaps the limit on ILP is reached with this extra instruction? Or we get scheduler conflicts where the critical path instructions get scheduled poorly?
