add an option in insertGpuAllocs to skip the func args copy #916

Open · wants to merge 3 commits into base: main

Conversation

@zhczhong (Member) commented Oct 9, 2024

Add an option in insertGpuAllocs to skip the copy of function arguments, so that the user can decide whether a gpu.memcpy is needed for the input args.
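
To illustrate the intended difference, here is a rough sketch (illustrative only, not the pass's literal output; the function, types, and names below are made up for the example):

```mlir
// Input: a host function whose memref arguments are used inside a gpu.launch
// region; the pass currently materializes device buffers for such arguments.
func.func @eltwise(%arg0: memref<8xf32>, %arg1: memref<8xf32>) {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %c1, %sy = %c1, %sz = %c1) {
    %v = memref.load %arg0[%tx] : memref<8xf32>
    memref.store %v, %arg1[%tx] : memref<8xf32>
    gpu.terminator
  }
  return
}

// Default behavior: a device buffer plus a copy is created per argument,
// roughly
//   %0 = gpu.alloc () : memref<8xf32>
//   gpu.memcpy %0, %arg0 : memref<8xf32>, memref<8xf32>
// and the uses inside the region are redirected to %0.
//
// With the new option enabled, the arguments are assumed to be USM and are
// used inside the region directly, so no gpu.alloc / gpu.memcpy pair is
// inserted for them.
```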

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

@kurapov-peter requested a review from chencha3 on October 9, 2024
@silee2 (Contributor) commented Oct 9, 2024

I don't think this is the right solution.
If the caller function's args are used directly, a GPU memory allocation and copy are only needed when the arg (a memref) does not already have a gpu memory space.
USM vs. device memory shouldn't matter in this case.
A better way would be to use the GPU address spaces (https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces) to mark gpu or usm memory args, and then have this pass skip the func arg alloc and copy based on the address-space attribute.
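
For reference, marking an argument with the upstream attribute would look roughly like this (global is only an example address space; whether the pass should key off such a marking is exactly what is debated below):

```mlir
// A memref argument carrying the upstream GPU address-space attribute in its
// memory-space slot; the pass could then skip the alloc/copy for it.
func.func @eltwise(%arg0: memref<10x20xbf16, #gpu.address_space<global>>) {
  return
}
```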

@zhczhong (Member, Author) commented:

> I don't think this is the right solution. If caller functions args are directly used, Gpu mem allocation and copy is only need if the arg's (a memref) memspace is not gpu. USM vs Device mem shouldn't matter in this case. Better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces For marking gpu or usm memory args. And then in this pass skip func arg alloc and copy based on the address space attribute.

If the memory is USM-shared and we mark it with a gpu address space, the copy on the device side is indeed skipped, but memory access on the host side will still need an additional explicit copy.

@kurapov-peter commented:

> I don't think this is the right solution. If caller functions args are directly used, Gpu mem allocation and copy is only need if the arg's (a memref) memspace is not gpu. USM vs Device mem shouldn't matter in this case. Better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces For marking gpu or usm memory args. And then in this pass skip func arg alloc and copy based on the address space attribute.

Marking host-side USM memory with a gpu address space is semantically incorrect: you would be relying on the CPU lowering to ignore a gpu address space that shouldn't have been there in the first place. Marking only the callee (which is what actually happens after the lowering) would be feasible if not for the type-checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.

@silee2 (Contributor) commented Oct 11, 2024

> I don't think this is the right solution. If caller functions args are directly used, Gpu mem allocation and copy is only need if the arg's (a memref) memspace is not gpu. USM vs Device mem shouldn't matter in this case. Better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces For marking gpu or usm memory args. And then in this pass skip func arg alloc and copy based on the address space attribute.

> Marking host-side usm memory with a gpu address space is semantically incorrect. You are relying on the cpu lowering to ignore a gpu address space that shouldn't have been there in the first place. The solution to mark the callee only (this is what actually happens after the lowering) would be feasible if not for type-checking between the kernel caller and the callee. This is a simple way to avoid the problem and does not change the default behavior.

Makes sense.
Unfortunately, MLIR does not have an address-space attribute for USM memory.
The pass option is not fail-safe either: it assumes all directly forwarded caller args are USM memory. There is no way for the pass to check this, so the user needs to make that guarantee, especially if multiple callers and callees are involved. I will approve, but would like to see one more approval before merging.

@silee2 (Contributor) left a review comment:

Found one thing I overlooked: please add a test case for the new option "is-usm-args".
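
Something along these lines would do (a sketch only, assuming the pass is exposed through imex-opt as insert-gpu-allocs and the option keeps the is-usm-args spelling; the actual test added in the PR may differ):

```mlir
// RUN: imex-opt --insert-gpu-allocs='is-usm-args=true' %s | FileCheck %s

// With is-usm-args=true the function arguments are treated as USM, so no
// gpu.alloc / gpu.memcpy pair should be created for %arg0.
// CHECK-LABEL: func.func @fill_arg
// CHECK-NOT: gpu.alloc
// CHECK-NOT: gpu.memcpy
func.func @fill_arg(%arg0: memref<8xf32>) {
  %c1 = arith.constant 1 : index
  %cst = arith.constant 1.0 : f32
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %c1, %sy = %c1, %sz = %c1) {
    memref.store %cst, %arg0[%tx] : memref<8xf32>
    gpu.terminator
  }
  return
}
```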

@kurapov-peter commented:

> I don't think this is the right solution. If caller functions args are directly used, Gpu mem allocation and copy is only need if the arg's (a memref) memspace is not gpu. USM vs Device mem shouldn't matter in this case. Better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces For marking gpu or usm memory args. And then in this pass skip func arg alloc and copy based on the address space attribute.

> Marking host-side usm memory with a gpu address space is semantically incorrect. You are relying on the cpu lowering to ignore a gpu address space that shouldn't have been there in the first place. The solution to mark the callee only (this is what actually happens after the lowering) would be feasible if not for type-checking between the kernel caller and the callee. This is a simple way to avoid the problem and does not change the default behavior.

> Makes sense.

> Unfortunately, MLIR does not have an address space attribute for USM memory.

> Pass option is not fail safe either. In assumes, all caller direct forwarded args are USM memory. There is no way for the pass to check, so user needs to make that guarantee. Especially if there are multiple callees and callers involved. I will approve but would like to see one more approval before merging.

Actually, if we care about guarantees, there's an option to insert runtime checks. For the case at hand we know for sure those are USM allocations, so that would just be overhead.

As for a USM address space, that could be a viable solution in the long run. However, we'd need to revisit the LLVM type-converter behavior, as it makes assumptions about the default address space.

@zhczhong (Member, Author) commented:

> Found one thing I overlooked. Please add a test case for the new option "is-usm-args"

Thanks for the reminder! A test case has been added.

@mshahneo (Contributor) commented:

Hi @zhczhong,

An important addition, thank you so much :).
Could you please add a bf16 end-to-end test case? It could be a very simple test, but one that uses the bf16-to-gpu pass; I want to check how the new option interacts with the bf16-to-gpu transformation. A reference for a simple test case is: https://github.com/intel/mlir-extensions/blob/main/test/Integration/Dialect/Gpu/EltwiseAdd_BF16.mlir.

You could just create a similar USM version of this test, or you can create something else :)

@mshahneo (Contributor) commented:

> I don't think this is the right solution. If caller functions args are directly used, Gpu mem allocation and copy is only need if the arg's (a memref) memspace is not gpu. USM vs Device mem shouldn't matter in this case. Better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces For marking gpu or usm memory args. And then in this pass skip func arg alloc and copy based on the address space attribute.

> Marking host-side usm memory with a gpu address space is semantically incorrect. You are relying on the cpu lowering to ignore a gpu address space that shouldn't have been there in the first place. The solution to mark the callee only (this is what actually happens after the lowering) would be feasible if not for type-checking between the kernel caller and the callee. This is a simple way to avoid the problem and does not change the default behavior.

> Makes sense. Unfortunately, MLIR does not have an address space attribute for USM memory. Pass option is not fail safe either. In assumes, all caller direct forwarded args are USM memory. There is no way for the pass to check, so user needs to make that guarantee. Especially if there are multiple callees and callers involved. I will approve but would like to see one more approval before merging.

This may be out of scope for this PR, but the concern raised by SangIk is a valid one: adding this option does let us avoid creating gpu.allocs and memcpys for USM memory, but it also adds a very tricky pitfall. If users enable this option, they need to make absolutely sure that every host-function argument passed to the device is USM memory; otherwise it would result in a runtime crash/segfault.

What do you think about adding an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM, and marking memrefs with this attribute whenever we pass a USM memref to a function? insert-gpu-alloc could in turn use that attribute.
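
Purely illustrative (neither the attribute nor its spelling exists today), the idea would be to mark USM memrefs in their memory-space slot and let the pass key off that instead of a global option:

```mlir
// Hypothetical #xegpu.usm marker in the memref's memory-space slot; the pass
// would skip the alloc/copy only for arguments that carry it.
func.func @eltwise(%arg0: memref<10x20xbf16, #xegpu.usm>,
                   %arg1: memref<10x20xbf16, #xegpu.usm>) {
  return
}
```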

@zhczhong (Member, Author) commented:

> Hi @zhczhong ,

> An important addition, thank you so much :). Could you please add a bf16 end-to-end test case? It could be a very simple test case, but uses the bf16-to-gpu pass. Wanted to check how it would interact with bf16-to-gpu transformation. A reference to a simple test case is: https://github.com/intel/mlir-extensions/blob/main/test/Integration/Dialect/Gpu/EltwiseAdd_BF16.mlir.

> You could just create a similar USM version of this test or you can create something else :)

Thanks for the suggestion! The test has been added, but bf16-to-gpu indeed fails to process the USM case:

BF16ToGPU/EltwiseAdd.bf16.mlir:81:5: error: 'func.return' op has 1 operands, but enclosing function (@test) returns 0
    return %alloc : memref<10x20xbf16>
    ^
"builtin.module"() <{sym_name = "eltwise_add_usm"}> ({
    "memref.global"() <{constant, initial_value = dense<5.000000e-01> : tensor<10x20xbf16>, sym_name = "__constant_10x20xbf16", sym_visibility = "private", type = memref<10x20xbf16>}> : () -> ()
    "func.func"() <{function_type = (memref<10x20xi16>, memref<10x20xi16>) -> (), sym_name = "test"}> ({
    ^bb0(%arg3: memref<10x20xi16>, %arg4: memref<10x20xi16>):
      %23 = "arith.constant"() <{value = 20 : index}> : () -> index
      %24 = "arith.constant"() <{value = 10 : index}> : () -> index
      %25 = "arith.constant"() <{value = 1 : index}> : () -> index
      %26 = "arith.constant"() <{value = 0 : index}> : () -> index
      %27 = "gpu.alloc"() <{hostShared, operandSegmentSizes = array<i32: 0, 0, 0>}> : () -> memref<400xi8>
      %28 = "memref.view"(%27, %26) : (memref<400xi8>, index) -> memref<10x20xbf16>
      %29 = "memref.view"(%27, %26) : (memref<400xi8>, index) -> memref<10x20xi16>
      "gpu.launch_func"(%24, %23, %25, %25, %25, %25, %arg3, %arg4, %29) <{kernel = @test_kernel::@test_kernel, operandSegmentSizes = array<i32: 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 0>}> : (index, index, index, index, index, index, memref<10x20xi16>, memref<10x20xi16>, memref<10x20xi16>) -> ()
      %30 = "memref.alloc"() <{operandSegmentSizes = array<i32: 0, 0>}> : () -> memref<10x20xbf16>
      "memref.copy"(%28, %30) : (memref<10x20xbf16>, memref<10x20xbf16>) -> ()
      "gpu.dealloc"(%27) : (memref<400xi8>) -> ()
      "func.return"(%30) : (memref<10x20xbf16>) -> ()
    }) : () -> ()
    "gpu.module"() <{sym_name = "test_kernel"}> ({
      "gpu.func"() <{function_type = (memref<10x20xi16>, memref<10x20xi16>, memref<10x20xi16>) -> ()}> ({
      ^bb0(%arg0: memref<10x20xi16>, %arg1: memref<10x20xi16>, %arg2: memref<10x20xi16>):
        %4 = "gpu.block_id"() <{dimension = #gpu<dim x>}> : () -> index
        %5 = "gpu.block_id"() <{dimension = #gpu<dim y>}> : () -> index
        %6 = "arith.constant"() <{value = 16128 : i16}> : () -> i16
        %7 = "memref.load"(%arg0, %4, %5) : (memref<10x20xi16>, index, index) -> i16
        %8 = "memref.load"(%arg1, %4, %5) : (memref<10x20xi16>, index, index) -> i16
        %9 = "arith.bitcast"(%7) : (i16) -> bf16
        %10 = "arith.extf"(%9) : (bf16) -> f32
        %11 = "arith.bitcast"(%8) : (i16) -> bf16
        %12 = "arith.extf"(%11) : (bf16) -> f32
        %13 = "arith.addf"(%10, %12) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
        %14 = "arith.truncf"(%13) : (f32) -> bf16
        %15 = "arith.bitcast"(%14) : (bf16) -> i16
        %16 = "arith.bitcast"(%15) : (i16) -> bf16
        %17 = "arith.extf"(%16) : (bf16) -> f32
        %18 = "arith.bitcast"(%6) : (i16) -> bf16
        %19 = "arith.extf"(%18) : (bf16) -> f32
        %20 = "arith.addf"(%17, %19) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
        %21 = "arith.truncf"(%20) : (f32) -> bf16
        %22 = "arith.bitcast"(%21) : (bf16) -> i16
        "memref.store"(%22, %arg2, %4, %5) : (i16, memref<10x20xi16>, index, index) -> ()
        "gpu.return"() : () -> ()
      }) {VectorComputeFunctionINTEL, gpu.kernel, gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 10, 20, 1>, spirv.entry_point_abi = #spirv.entry_point_abi<>, sym_name = "test_kernel", workgroup_attributions = 0 : i64} : () -> ()
    }) {spirv.target_env = #spirv.target_env<#spirv.vce<v1.0, [Addresses, Float16Buffer, Int64, Int16, Int8, Kernel, Linkage, Vector16, GenericPointer, Groups, Float16, Float64, AtomicFloat32AddEXT, ExpectAssumeKHR, SubgroupDispatch, VectorComputeINTEL, VectorAnyINTEL, Bfloat16ConversionINTEL], [SPV_EXT_shader_atomic_float_add, SPV_KHR_expect_assume, SPV_INTEL_vector_compute, SPV_INTEL_bfloat16_conversion]>, api=OpenCL, #spirv.resource_limits<>>} : () -> ()
    "func.func"() <{function_type = () -> (), sym_name = "main"}> ({
      %0 = "memref.get_global"() <{name = @__constant_10x20xbf16}> : () -> memref<10x20xbf16>
      %1 = "memref.get_global"() <{name = @__constant_10x20xbf16}> : () -> memref<10x20xbf16>
      %2 = "func.call"(%0, %1) <{callee = @test}> : (memref<10x20xbf16>, memref<10x20xbf16>) -> memref<10x20xbf16>
      %3 = "memref.cast"(%2) : (memref<10x20xbf16>) -> memref<*xbf16>
      "func.call"(%3) <{callee = @printMemrefBF16}> : (memref<*xbf16>) -> ()
      "func.return"() : () -> ()
    }) : () -> ()
    "func.func"() <{function_type = (memref<*xbf16>) -> (), sym_name = "printMemrefBF16", sym_visibility = "private"}> ({
    }) {llvm.emit_c_interface} : () -> ()
  }) {gpu.container_module} : () -> ()

@zhczhong closed this Oct 15, 2024
@zhczhong reopened this Oct 15, 2024
@zhczhong (Member, Author) commented:

> I don't think this is the right solution. If caller functions args are directly used, Gpu mem allocation and copy is only need if the arg's (a memref) memspace is not gpu. USM vs Device mem shouldn't matter in this case. Better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces For marking gpu or usm memory args. And then in this pass skip func arg alloc and copy based on the address space attribute.

> Marking host-side usm memory with a gpu address space is semantically incorrect. You are relying on the cpu lowering to ignore a gpu address space that shouldn't have been there in the first place. The solution to mark the callee only (this is what actually happens after the lowering) would be feasible if not for type-checking between the kernel caller and the callee. This is a simple way to avoid the problem and does not change the default behavior.

> Makes sense. Unfortunately, MLIR does not have an address space attribute for USM memory. Pass option is not fail safe either. In assumes, all caller direct forwarded args are USM memory. There is no way for the pass to check, so user needs to make that guarantee. Especially if there are multiple callees and callers involved. I will approve but would like to see one more approval before merging.

> This may be out of scope of this PR. But the concern raised by SangIk is a valid one, adding this option to the pass does give us the option to not create gpu-allocs and memcpy for the USM memory, but adds a very tricky pitfall. If the user uses this option, then they need to absolutely make sure that, all their host function uses USM arguments that gets passed to the device. Otherwise, it would result in runtime crash/segfault.

> What do you think if we could add an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM, and mark memrefs with this attribute whenever we pass a USM memref to a function. insert-gpu-alloc could inturn use that attribute.

Yes, marking the memref with a USM memory space would make the semantics correct and the pass behavior more precise, but it also requires the frontend to make the corresponding change. @kurapov-peter, what do you think about this approach?

@kurapov-peter commented:

> What do you think if we could add an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM, and mark memrefs with this attribute whenever we pass a USM memref to a function. insert-gpu-alloc could inturn use that attribute.

I agree there should be a proper attribute that ensures type consistency for USM buffers (and does not violate host-side semantics). XeGPU does not seem like a good place for it, though, since that would be the same as putting a GPU attribute on a host-side buffer. Host-device shared memory is not unique to Intel GPUs; the abstraction should be vendor-agnostic.
