I function optimisation #9
Replies: 3 comments 5 replies
-
That's a neat trick - thanks for sharing! Saving a
On a similar note, I imagine Jaguar would be one of the few cores where you'd be able to measure a gain, since the trick doesn't shorten the dependency chain on the B input.
-
I tested on Tiger Lake. See here for my code. I think optimising I improved performance by a percent or two. Yeah, it doesn't shorten the dependency chain, but it reduces the number of µops running beside the critical path, which allows other instructions to run earlier.
-
My baseline implementation has for
I do the I'm not exactly sure how your code works, but the BMI1-enabled sequence instead has
saving two µops off the critical path and some front-end bandwidth.
-
With BMI1, the I function can leverage the `andn` instruction as such: the -1 can be absorbed into the round key, and instead of adding `I(b, c, d)` we subtract `c ^ (~b & d)`.

This improved performance slightly.
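For anyone wanting to check the identity behind this, here is a minimal C sketch. It relies on two facts: `~I(b, c, d)` equals `c ^ (~b & d)`, and `~x` equals `-x - 1` in two's complement, so `I(b, c, d)` equals `-(c ^ (~b & d)) - 1`. The function names (`md5_i`, `sum_baseline`, `sum_andn`, `check_equivalence`) and the parameter `k_minus_1` are made up for illustration and are not taken from the code discussed in this thread. Only the sum that feeds the rotate is modelled, not a full MD5 round.

```c
#include <assert.h>
#include <stdint.h>

/* Reference MD5 I function: c ^ (b | ~d). */
static uint32_t md5_i(uint32_t b, uint32_t c, uint32_t d) {
    return c ^ (b | ~d);
}

/* Baseline: add I(b, c, d), the round key k and the message word m to a. */
static uint32_t sum_baseline(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t k, uint32_t m) {
    return a + md5_i(b, c, d) + k + m;
}

/* andn form: subtract c ^ (~b & d) and use a round key that already has
 * the -1 folded in (k_minus_1 is assumed to be k - 1, precomputed). */
static uint32_t sum_andn(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                         uint32_t k_minus_1, uint32_t m) {
    uint32_t t = c ^ (~b & d);  /* ~b & d is the operation andn computes */
    return a - t + k_minus_1 + m;
}

/* Compare both forms on pseudo-random inputs (xorshift32 generator). */
static void check_equivalence(void) {
    uint32_t x = 0x12345678u;
    for (int i = 0; i < 1000; i++) {
        uint32_t v[6];
        for (int j = 0; j < 6; j++) {
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            v[j] = x;
        }
        assert(sum_baseline(v[0], v[1], v[2], v[3], v[4], v[5]) ==
               sum_andn(v[0], v[1], v[2], v[3], v[4] - 1u, v[5]));
    }
}
```

Calling `check_equivalence()` exercises both forms on the same inputs. Whether a compiler actually emits `andn` for `~b & d` depends on the target flags (for example, enabling BMI1), so inspecting the generated assembly is the only way to be sure.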
Note that `mov` instructions between registers are free on modern CPUs as they get turned into register renames. So optimising for least `mov` instructions is generally not super useful.