Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This proposal does not conditionally load a value, which could be less future proof. Indeed, having only one "virtual branch" loading a value is what could cause a compiler to create a branch, as it can avoid a load from memory given a condition - which is usually considered as a profitable optimization.
It's not only a speculation as this is what seemed to have originally happened with clang 18. Let's indeed look at clang's optimization pipeline using compiler explorer's very nice viewer on the original
poly_frommsg
function (https://godbolt.org/z/eYrM3scbe):before the X86 cmov conversion pass, we actually have "virtual" cmov instructions:
after the X86 cmov conversion pass, we have the introduction of branches:
Let's look at what this pass is doing: https://github.com/llvm/llvm-project/blob/768118d1ad38bf13c545828f67bd6b474d61fc55/llvm/lib/Target/X86/X86CmovConversion.cpp#L10-L31. I think we hit all the conditions in that case:
checkForProfitableCmovCandidates
: https://github.com/llvm/llvm-project/blob/768118d1ad38bf13c545828f67bd6b474d61fc55/llvm/lib/Target/X86/X86CmovConversion.cpp#L421-L443 that give more details). It would be interesting to debug that analysis to really understand what is considered more profitable here, as the generated assembly doesn't look very efficient to me:What we can note though is that if it figures out one of the "branch" of the cmov has a slower latency than the other, it will generate a branch. This is why I think the original patch is asking for troubles, and that we should avoid exposing such patterns.
As to when this has been introduced in Clang, for what I've checked clang 15 is already doing that optimization, so it's been "a while". Checking LLVM's git log could help answer that question more precisely if that's of interest.
This patch also defines the
cmov_int16
function asnoinline
. Indeed, Link-Time Optimization (LTO) could inline that function: the compiler could then realize theselect
argument is just one bit, which could then generate aselect
from the LLVM IR point of view, and we could then have the x86 cmov conversion pass be tempted to generate a branch again.As an idea for the future, we could verify codegen in a test by verifying that the code generated by the compiler does not contain any branch. This could be done by using LLVM's FileCheck utility - which is basically an enhanced grep that can reason with context. The test would basically check that in the general textual assembly, we don't have any test or branch instructions (depending on the targeted architecture). We could also check that on the LLVM IR, but:
This test could be run only if that tool is available in the environment, so that we don't make the build system more complex for everyone. I can make a first tentative if that's something of interest to the maintainers!