Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector128.WithElement codegen regression in .NET 9.0 #108197

Open
EgorBo opened this issue Sep 24, 2024 · 7 comments
Open

Vector128.WithElement codegen regression in .NET 9.0 #108197

EgorBo opened this issue Sep 24, 2024 · 7 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Sep 24, 2024

struct HBOARD
{
    Vector128<byte> data;

    public void Test(int i, byte b)
    {
        data = data.WithElement(i, b);
    }
}

.NET 8.0:

; Assembly listing for method HBOARD:Test(int,ubyte):this (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       sub      rsp, 56
       vzeroupper 
G_M000_IG02:                ;; offset=0x0007
       vmovups  xmm0, xmmword ptr [rcx]
       cmp      edx, 16
       jae      SHORT G_M000_IG04
       vmovaps  xmmword ptr [rsp+0x20], xmm0
       lea      rax, bword ptr [rsp+0x20]
       movsxd   rdx, edx
       mov      byte  ptr [rax+rdx], r8b
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rcx], xmm0
G_M000_IG03:                ;; offset=0x002C
       add      rsp, 56
       ret      
G_M000_IG04:                ;; offset=0x0031
       mov      ecx, 21
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int)]
       int3     
; Total bytes of code 61

.NET 9.0:

; Assembly listing for method HBOARD:Test(int,ubyte):this (FullOpts)
G_M56601_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 64
       mov      rbx, rcx
       mov      eax, edx
G_M56601_IG02:  ;; offset=0x000A
       vmovups  xmm0, xmmword ptr [rbx]
       vmovaps  xmmword ptr [rsp+0x20], xmm0
       lea      rdx, [rsp+0x20]
       lea      rcx, [rsp+0x30]
       movzx    r9, r8b
       mov      r8d, eax
       call     [System.Runtime.Intrinsics.Vector128:WithElement[ubyte](System.Runtime.Intrinsics.Vector128`1[ubyte],int,ubyte):System.Runtime.Intrinsics.Vector128`1[ubyte]]
       vmovaps  xmm0, xmmword ptr [rsp+0x30]
       vmovups  xmmword ptr [rbx], xmm0
G_M56601_IG03:  ;; offset=0x0035
       add      rsp, 64
       pop      rbx
       ret      
; Total bytes of code 59
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 24, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 24, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member Author

EgorBo commented Sep 24, 2024

WithElement is not directly accelerated by jit when its index arg is not a constant. In net8.0, we just inlined the underlying helper for this case. In net9.0 we don't inline it and just hope that the non-constant index becomes a constant after all optimizations. But if it doesn't we end up with a worse codegen as it's too late to inline it.

I guess in general, WithElement should be avoided to be used with a non-constant index as it's not really a SIMD way

@tannergooding
Copy link
Member

The improvements given from not inlining and trying to opportunistically find a constant end up more profitable in typical use-cases, so I don't think we want to go back to the .NET 8 behavior there.

The proper fix here is to ensure the non-constant case is handled, much as it is for GetElement, but I don't know if that meets the bar for backporting since its not a correctness issue and expected to be somewhat less common.

Developers have a workaround where they can do something like the following, which would also be faster if they know the index is definitively in bounds but is also definitively non-constant:

public static Vector128<T> WithElementUnsafe<T>(Vector128<T> vector, int index, T value)
{
    ref T address = ref Unsafe.As<Vector128<T>, T>(ref vector);
    Unsafe.Add(ref address, index) = value;
    return vector;
}

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 25, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Sep 25, 2024
@JulieLeeMSFT
Copy link
Member

Because there is workaround, setting this to .NET 10.

Developers have a workaround where they can do something like the following, which would also be faster if they know the index is definitively in bounds but is also definitively non-constant:

@Ruihan-Yin
Copy link
Contributor

Hi @tannergooding, just checking in if this issue is on your list for now, if not, we have a new folk onboarding and I think this will be a good issue for him to get started with the contribution to dotnet.

Regrading to the issue itself, the handling I'm looking at in GetElement with non-constant input is that we insert some range check nodes but not using the fallback, and the backend has already been able to handle the codegen, so we only need to format the tree properly, am I understanding it correctly?

@tannergooding
Copy link
Member

Happy to let them pick it up, feel free to reach out with any questions.

Regrading to the issue itself, the handling I'm looking at in GetElement with non-constant input is that we insert some range check nodes but not using the fallback, and the backend has already been able to handle the codegen, so we only need to format the tree properly, am I understanding it correctly?

We have various handling in place for NI_Vector128_GetElement. We then have some minimal handling for non-constant NI_Vector128_WithElement up through rationalize.cpp. I would expect we need to mirror most of the GetElement handling for WithElement (NI_Vector128_*, NI_Vector256_*, and NI_Vector512_*) to ensure that everything ends up as expected.

I would expect that most of this handling is relegated to rationalize.cpp (no longer rewriting back to a user call), decomposelongs.cpp (ensuring 64-bit WithElement is handled on 32-bit platforms), hwintrinsiccodegenxarch.cpp (where we actually emit the spill, set, and reload like is done for GetElement: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L1624-L1779), and possibly lowerxarch.cpp/lsraxarch.cpp (to handle any specialized containment or temp register requirements).

@JulieLeeMSFT
Copy link
Member

we have a new folk onboarding and I think this will be a good issue for him to get started

@Ruihan-Yin, please reassign this issue to the new member when ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

4 participants