-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize layout calculations in HashMap #51340
Conversation
r? @kennytm (rust_highfive has picked a reviewer for you, use r? to override) |
(stupid git pull doesn't update submodules by default...) Playground link for playing around with the generated code: https://play.rust-lang.org/?gist=b233a122e3223db47521f5b0a0a3911c&version=nightly&mode=release |
r? @SimonSapin |
7bf6245
to
0b56145
Compare
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
hashes.extend(pairs).map(|(layout, _)| { | ||
// LLVM seems to have trouble properly const-propagating pairs.align(), | ||
// possibly due to the use of NonZeroUsize. This little hack allows it | ||
// to generate optimal code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird, the NonZeroUsize::get(self)
function is marked #[inline]
and the constructor is a const fn
. Did you by chance happen to look at the generated LLVM-IR for calculate_layout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try the playground link that I posted above. If you remove the map
then it generates worse code. What seems to happen is that LLVM will emit an overflowing multiply to calculate pairs.size
, and will use the overflow flag to set pairs.align
to either 0 (LayoutErr
) or the actual alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the comment should include the second sentence of that explanation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every caller of calculate_layout
only use one of the returned Layout
or offset. Now that they’re mostly calculated separately, aren’t we better off having two separate functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this feels like more of a temporary hack to deal with a LLVM bug, rather than a permanent solution. Logically the offset calculation belongs as part of the layout calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nnethercote I added a reference to #51346 in the comment.
Nice! Hopefully this will fix the rustc perf regression. rustc uses hash tables a lot :) |
Avoiding playground::new:
push rax
mov rsi, rdi
mov ecx, 8
mov rax, rsi
mul rcx
mov rcx, rax
jo .LBB3_3
mov edx, 16
mov rax, rsi
mul rdx
jo .LBB3_3
add rax, rcx
jb .LBB3_3
mov rax, rcx
pop rcx
ret
.LBB3_3:
call core::result::unwrap_failed
ud2 which is at least shorter (don't know if better) than using playground::new:
mov ecx, 8
xor esi, esi
mov rax, rdi
mul rcx
mov r8, rax
setno r9b
jo .LBB3_7
mov edx, 16
xor ecx, ecx
mov rax, rdi
mul rdx
setno dl
jo .LBB3_7
mov sil, r9b
shl rsi, 3
mov cl, dl
shl rcx, 3
cmp rcx, rsi
cmovae rsi, rcx
lea rdx, [r8 + rcx]
add rdx, -1
neg rcx
and rcx, rdx
sub rcx, r8
add rcx, r8
jb .LBB3_7
add rax, rcx
jb .LBB3_7
mov rdx, rsi
neg rdx
cmp rax, rdx
ja .LBB3_7
test rsi, rsi
je .LBB3_7
lea rax, [rsi + 15]
and rax, rsi
jne .LBB3_7
mov rax, rcx
ret
.LBB3_7:
push rbp
mov rbp, rsp
call core::result::unwrap_failed
ud2 However, given that both situations should just generate: playground::original:
lea rax, [8*rdi]
ret I don't think that |
|
@Amanieu that one was as good as the original IIRC (that is, this PR fixes the perf regression). My point was that maybe it is worth fixing whatever codegen bug is causing the perf regression in the first place because, while we might add these workarounds in You mentioned the regression might be caused by making |
@gnzlbg If you use my original playground link and remove the
This is the codegen bug that I was talking about, which is caused by |
Can you explain how does that show that it is caused by |
It's obvious if you look at the generated LLVM IR:
|
Looking only at playground::new_optimized:
mov rcx, rdi
mov edx, 8
mov rax, rcx
mul rdx
mov rsi, rax
mov edx, 16
xor edi, edi
mov rax, rcx
mul rdx
setno dil
lea rax, [rsi + 8*rdi]
add rax, -1
shl rdi, 3
neg rdi
and rdi, rax
mov rax, rdi
ret and without playground::new_optimized:
lea rax, [8*rdi]
ret It looks like not using |
@bors p=1 for when this is approved |
This now produces the same assembly code as the previous implementation.
📌 Commit b69724f has been approved by |
Optimize layout calculations in HashMap This now produces the same assembly code as the previous implementation. cc #51163 @nnethercote @gnzlbg @andjo403
☀️ Test successful - status-appveyor, status-travis |
I just did some local measurements with Cachegrind and the numbers are looking much better -- the regression from #51163 looks to be entirely fixed. Thank you! |
This doesn't seem to show any difference though. Am using it incorrectly? |
That should be accurate; I'm not sure why we're seeing a difference from @nnethercote's local benchmarks. |
Note that #51226 also significantly improve the performance since it marked several |
My measurements were with Cachegrind rather than perf-stat... that shouldn't make any difference though. |
Can anyone think of reasons why I would see improvements on my machine, but the rustc-perf machine would not? Are the clang versions guaranteed to be the same? |
Clang is not installed on the perf collector (though not sure what specific clang you are referring to...); gcc is version 5.4.0. |
Sorry, I meant LLVM, not clang. Are the LLVM versions guaranteed to be the same? |
Yes, LLVM should be version equivalent (by default), though the way it's compiled might be different, e.g. using gcc locally vs. clang on Travis. I suppose that could play a role, but it would be decidedly odd. |
This now produces the same assembly code as the previous implementation.
cc #51163 @nnethercote @gnzlbg @andjo403