Specialize SIMD SHA-512 for half-length input, iterated hashing #5425
Conversation
The SIMDSHA512body(): Build 3 specialized versions of the function commit here reduces the need to have manually optimized the …

Force-pushed from 4db6069 to 3e89333
Testing
The above is in a build with ASan. The exact same test errors occur without ASan, and the exact same faulty indices persist across runs. 3 other formats fail tests as well; I haven't looked at them yet.
They are: Blackberry-ES10, eCryptfs, and Drupal7. Out of these, Blackberry-ES10 and eCryptfs are affected by the changes in this PR (and got a good speedup when they work), but Drupal7 shouldn't be. I have no idea whether they worked with …
Apparently, the AVX-512 specific portion of the SIMDSHA512body(): Optimize SSEi_FLAT_OUT commit here is buggy for that case. Reverting this commit makes these 4 formats pass the same tests. I'll review and revise it.
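For context, here's a rough scalar sketch of what SSEi_FLAT_OUT means (the function name is made up and this is not the actual simd-intrinsics.c code; the AVX-512 specific portion is an optimized form of this de-interleaving):

```c
/* Illustrative sketch only (made-up function name).  SIMD buffers hold the
 * hashes interleaved across lanes: word 0 of every lane first, then word 1,
 * and so on.  SSEi_FLAT_OUT de-interleaves that into one contiguous 8-word
 * (64-byte) hash per candidate. */
#include <stdint.h>

#define SIMD_COEF_64 8 /* lanes per vector, e.g. 8 with AVX-512 */

static void flat_out_scalar(uint64_t flat[SIMD_COEF_64][8],
                            const uint64_t interleaved[8 * SIMD_COEF_64])
{
    for (int lane = 0; lane < SIMD_COEF_64; lane++)
        for (int word = 0; word < 8; word++)
            flat[lane][word] = interleaved[word * SIMD_COEF_64 + lane];
}
```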
Force-pushed from 0f49e5a to 148f28d
And turn off the manual optimization since it started causing a major sha512crypt performance regression on AVX with RHEL6's old gcc after the iterated hashing commit. There appear to be no regressions from turning this off now, meaning that the compiler is hopefully able to figure the zeroes out now that they're written to w[] by this very function.
This avoids uninitialized warnings with RHEL6's old gcc, and is hopefully optimized out by more reasonable compilers.
We still waste memory on the second halves. We may want to clean that up separately.
This avoids uninitialized warnings with RHEL6's old gcc, and is hopefully optimized out by more reasonable compilers. This new code could also be used without SSEi_LOOP, but trying to do so causes a major performance regression for e.g. sha512crypt with the same old gcc on RHEL6, so we continue with memcpy() in the else path for now.
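To make that concrete, here is a scalar sketch of the block layout involved (the helper name is made up and the real code is vectorized): for a 64-byte half-length input, everything past the input words is constant, so the zeroes can be written where the optimizer sees them.

```c
#include <stdint.h>
#include <string.h>

/* Scalar illustration only.  A 64-byte input fills one SHA-512 block as:
 *   w[0..7]  = the eight 64-bit input words
 *   w[8]     = the mandatory 0x80 padding bit in the top byte
 *   w[9..14] = zero
 *   w[15]    = message length in bits (512)
 * Writing the zeroes in this very function, rather than relying on the
 * caller, both avoids old gcc's uninitialized warnings and lets a decent
 * compiler treat them as constants in the message schedule. */
static void sha512_half_block(uint64_t w[16], const uint64_t in[8])
{
    memcpy(w, in, 8 * sizeof(uint64_t));
    w[8] = 0x8000000000000000ULL;
    for (int i = 9; i < 15; i++)
        w[i] = 0;
    w[15] = 512;
}
```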
This way, we start fetching everything, including the last hash, sooner, and the fetches complete sooner.
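A generic sketch of that scheduling idea (all names here are assumed, and the actual change may differ in detail): issue every fetch before the per-hash work, so the last hash's load is already in flight while the earlier ones are processed.

```c
#include <stddef.h>
#include <stdint.h>

void process_hash(const uint64_t *h); /* hypothetical consumer */

/* Generic illustration with assumed names: start all fetches up front so
 * the loads overlap the per-hash work instead of being issued one by one. */
void consume_all(const uint64_t (*hashes)[8], size_t n)
{
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(hashes[i]); /* all fetches start now */
    for (size_t i = 0; i < n; i++)
        process_hash(hashes[i]);       /* overlaps the outstanding loads */
}
```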
This introduces specialized versions of SIMDSHA512body() and makes use of them in a few formats, including in the recently added Armory, where it provides a speedup of 4% on i7-4770K and some smaller speedups on some other machines I tested.
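The iterated-hashing pattern this serves looks roughly like the sketch below (the argument layout is assumed here; see simd-intrinsics.h for the real prototype): each 64-byte SHA-512 output is itself a half-length input, so the buffer can stay in the SIMD-interleaved layout between iterations.

```c
#include <stdint.h>
#include "simd-intrinsics.h" /* SIMDSHA512body() and the SSEi_* flags */

/* Call-pattern sketch only.  SSEi_HALF_IN says the 64-byte previous output
 * is the entire input, and SSEi_OUTPUT_AS_INP_FMT keeps the result in the
 * interleaved input layout, so the same buffer is fed straight back in on
 * the next iteration with no reshuffling in between. */
static void iterate_sha512(void *buf, unsigned iterations)
{
    while (iterations--)
        SIMDSHA512body(buf, (uint64_t *)buf, NULL,
                       SSEi_HALF_IN | SSEi_OUTPUT_AS_INP_FMT);
}
```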
There are 3 specialized functions now, but I mostly care about the case of SSEi_HALF_IN|SSEi_OUTPUT_AS_INP_FMT (as that's what's used in some iterated formats now) and the original case from prior to my changes here (not to introduce performance regressions via increased code size). The third case of having SSEi_HALF_IN along with other or no flags is specialized merely to avoid code size increase in the original case. We could alternatively unsupport SSEi_HALF_IN except in SSEi_HALF_IN|SSEi_OUTPUT_AS_INP_FMT, but for now I opted to have it supported separately as well.

Further optimization potential is doing similar for SHA-256 (should be easy now) and maybe for others (it'd be different there), and introducing function specializations also for the previously existing flags (to reduce code size in the functions that many/most formats use).
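One way to picture the resulting structure (a shape sketch only: the inner names are made up, and the real code may build its variants via the preprocessor rather than a runtime dispatcher):

```c
/* Shape sketch with made-up inner names.  The point is that each
 * specialization is compiled with its flag tests resolved at compile time,
 * so the hot paths carry no dead code for the other flag combinations. */
void SIMDSHA512body(vtype *data, uint64_t *out, uint64_t *reload_state,
                    unsigned SSEi_flags)
{
    if ((SSEi_flags & (SSEi_HALF_IN | SSEi_OUTPUT_AS_INP_FMT)) ==
        (SSEi_HALF_IN | SSEi_OUTPUT_AS_INP_FMT))
        body_half_as_input_fmt(data, out, reload_state, SSEi_flags);
    else if (SSEi_flags & SSEi_HALF_IN)
        body_half_in(data, out, reload_state, SSEi_flags);
    else
        body_original(data, out, reload_state, SSEi_flags);
}
```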