Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary NOPs between DPP instructions [HCC, HIP, LLVM] #1235

Open
misos1 opened this issue Aug 12, 2019 · 1 comment
Open

Unnecessary NOPs between DPP instructions [HCC, HIP, LLVM] #1235

misos1 opened this issue Aug 12, 2019 · 1 comment

Comments

@misos1
Copy link

misos1 commented Aug 12, 2019

According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ and "Vega" Instruction Set Architecture Reference Guide:

v_add_f32 v1, v0, v0 row_shr:1 bound_ctrl:0 // Instruction 1
v_add_f32 v1, v0, v1 row_shr:2 bound_ctrl:0 // Instruction 2
v_add_f32 v1, v0, v1 row_shr:3 bound_ctrl:0 // Instruction 3

If a previous VALU instruction modifies a VGPR read by DPP, two wait states are required. Note that this hazard affects only the operand that DPP reads. Consider instructions 2 and 3 in the example above; they consume the output from the previous VALU instruction by reading v1 . But DPP applies to v0 , and because v0 is unmodified, wait states are unnecessary.

But when I tried to implement this using hc::__amdgcn_move_dpp I got two NOPs between each two of these instructions:

#include <hc.hpp>
int main()
{
	hc::array_view<int> data(1);
	parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
	{
		asm("s_nop 0");
		int v0 = data[0], v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
		v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
		data[i[0]] = v1;
	});
	return 0;
}
KMDUMPISA=1 hcc -hc main.cpp

dump-gfx900.isa:

	v_add_u32_dpp v3, v2, v2  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
	s_nop 1
	v_add_u32_dpp v3, v2, v3  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
	s_nop 1
	v_add_u32_dpp v2, v2, v3  row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0

These NOPs does not must be here because next instruction does not have "DPP dependency" from previous. DPP modifier here affects first source operand.

Ideally this should generate 3 instructions as shown above without NOPs.

Also it is interesting how generated code changes with small modifications to source:

Removing asm("s_nop 0");:

	parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
	{
		int v0 = data[0], v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
		v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
		data[i[0]] = v1;
	});
	v_mov_b32_dpp v2, v2  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_add_u32_e32 v2, s6, v2
	v_mov_b32_e32 v3, s1
	s_nop 0
	v_add_u32_dpp v2, v1, v2  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
	s_nop 1
	v_add_u32_dpp v2, v1, v2  row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0

Changing 0 to i[0]:

	parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
	{
		int v0 = data[i[0]], v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x111, 0xF, 0xF, true) + v0;
		v1 = hc::__amdgcn_move_dpp(v0, 0x112, 0xF, 0xF, true) + v1;
		v1 = hc::__amdgcn_move_dpp(v0, 0x113, 0xF, 0xF, true) + v1;
		data[i[0]] = v1;
	});
	v_mov_b32_e32 v4, v2
	s_nop 0
	v_add_u32_dpp v3, v2, v2  row_shr:1 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_mov_b32_dpp v4, v4  row_shr:2 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_mov_b32_dpp v2, v2  row_shr:3 row_mask:0xf bank_mask:0xf bound_ctrl:0
	v_add3_u32 v2, v3, v4, v2

This last has correctly only one NOP for v4 dependency so together with v_add_u32_dpp v3, v2, v2... there are 2 independent instructions between modification to v4 v_mov_b32_e32 v4, v2 and DPP read v_mov_b32_dpp v4, v4.... And uses add3 but this still looks worse than 3 v_add_dpp.

@b-sumner
Copy link

Thank you for reporting this problem! We have opened an internal ticket to get it fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants