"Unable to unroll loop" error compiling generated HLSL code that writes to a matrix row in a loop #4436
Well, this is unfortunate... We might have to come up with a workaround for this (unroll the inner loop ourselves?), but I don't think it's a high priority right now. Possibly related: #4460.
Yeah, I'm not sure what the right solution would be in this case. Fundamentally, WGSL allows something that HLSL does not support (dynamic indexing of a matrix row). Notifying the user by throwing an error in this case seems appropriate to me; working around the issue entirely may not be possible or worthwhile. Perhaps the HLSL compiler can be tricked into unrolling the inner loop but not the outer one by applying attributes (see the HLSL documentation; 'unroll' is one of the attributes). That would fix this specific case, but dynamic indexing into matrices (and perhaps vectors) could also occur in other usage scenarios.
@pixelspark Could you pull out a reduced test case that you could put in this issue?
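(As an aside, a minimal case for the pattern named in the title might look something like the sketch below. This is a hypothetical illustration, not the reporter's shader; it writes a matrix component through a loop-dependent index, which WGSL allows but FXC cannot address natively:)

@group(0) @binding(0)
var<storage, read_write> output_data: array<f32>;
struct Params {
    n: u32,
}
@group(1) @binding(0)
var<uniform> params: Params;

@compute @workgroup_size(1)
fn main() {
    var m = mat2x2<f32>();
    // `i` is only known at runtime, so the matrix component cannot be
    // addressed statically; FXC reacts by forcing the loop to unroll,
    // which fails once the trip count is large or data-dependent.
    for (var i = 0u; i < params.n; i++) {
        m[0][i % 2u] = f32(i);
    }
    output_data[0] = m[0][0] + m[0][1];
}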
Seeing this still on DXC:
In Device::create_compute_pipeline
note: label = `1d_conv3_scalar`
Internal error: FXC D3DCompile error (0x80004005): D:\a\1d_conv3_scalar(69,21-32): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
D:\a\1d_conv3_scalar(36,5-15): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (497 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number

Full shader for context:

//Each workgroup is responsible for a single filter.
//Each thread computes a single element of the output.
//Each thread places the 3 column wide filter over the input, and multiplies and accumulates the values
//to a SINGLE output element.
@group(0) @binding(0)
var<storage, read> X: array<f32>;
@group(0) @binding(1)
var<storage, read> W: array<f32>;
@group(0) @binding(2)
var<storage, read> B: array<f32>;
@group(0) @binding(3)
var<storage, read_write> Y: array<f32>;
struct Meta {
padding: u32,
stride: u32,
Cin: u32,
Lin: u32,
KS: u32,
F_numel: u32,
W_numel: u32,
Lout: u32,
Fperthread: u32,
}
@group(1) @binding(0)
var<uniform> metadata: Meta;
var<workgroup> F: array<f32, {{ F_numel }}u>;
fn inner(input_index: u32, filter_index: u32, output_index: u32, bias_index: u32, start: u32, end: u32) {
var inp = vec3<f32>(0f);
var kernel = vec3<f32>(0f);
var acc = vec3<f32>(0f);
for(var i = 0u; i < metadata.Cin; i++) {
let input_start = input_index + (i * metadata.Lin) - 1u; //-1 is for padding
//We only populate the input between the provided indices, used for padding
for(var j = start; j <= end; j++) {
inp[j] = X[input_start + j];
}
let filter_start = i * metadata.KS;
kernel.x = F[filter_start];
kernel.y = F[filter_start + 1u];
kernel.z = F[filter_start + 2u];
acc = fma(inp, kernel, acc);
}
Y[output_index] = acc.x + acc.y + acc.z + B[bias_index];
}
//Each thread may load more than 1 element into shared memory
fn load_filters_into_smem(local_id: vec3<u32>, filter_index: u32) {
let windex = filter_index + (local_id.x * metadata.Fperthread);
let findex = (local_id.x * metadata.Fperthread);
for(var i=0u; i < metadata.Fperthread; i++) {
if findex + i < metadata.F_numel {
F[findex + i] = W[windex + i];
}
}
}
//Doesn't support dynamic padding
@compute @workgroup_size({{ workgroup_size_x }}, {{ workgroup_size_y }}, {{ workgroup_size_z }})
fn main(@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) workgroup_id: vec3<u32>) {
let input_index = (workgroup_id.x * {{ workgroup_size_x }}u + local_id.x) * metadata.stride;
let filter_index = (workgroup_id.y * {{ F_numel }}u);
load_filters_into_smem(local_id, filter_index);
workgroupBarrier();
if input_index >= metadata.Lin {
//Break after loading because all threads may be needed for loading F
return;
}
let output_index = (workgroup_id.x * {{ workgroup_size_x }}u + local_id.x) + (workgroup_id.y * metadata.Lout);
let bias_index = workgroup_id.y;
//TODO: dynamic padding
if input_index == metadata.Lin - metadata.padding {
inner(input_index, filter_index, output_index, bias_index, 0u, 1u);
} else if input_index == 0u {
inner(input_index, filter_index, output_index, bias_index, 1u, 2u);
} else {
inner(input_index, filter_index, output_index, bias_index, 0u, 2u);
}
}

Obviously the inner function is causing issues.
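(Following up on the "unroll the inner loop ourselves" idea from the earlier comment, one possible workaround is to address each vec3 component statically instead of indexing inp with the loop counter. The sketch below is a hypothetical, untested rewrite of inner, and it assumes start and end only ever take the values passed from main, i.e. 0 through 2:)

fn inner(input_index: u32, filter_index: u32, output_index: u32, bias_index: u32, start: u32, end: u32) {
    var inp = vec3<f32>(0f);
    var kernel = vec3<f32>(0f);
    var acc = vec3<f32>(0f);
    for (var i = 0u; i < metadata.Cin; i++) {
        let input_start = input_index + (i * metadata.Lin) - 1u; //-1 is for padding
        // The original `for (var j = start; j <= end; j++) { inp[j] = ... }`
        // indexes the vec3 with a runtime value. Spelling out the three
        // components keeps every write statically addressable; each guard
        // mirrors the original condition `start <= j && j <= end`.
        if start == 0u { inp.x = X[input_start]; }
        if start <= 1u && end >= 1u { inp.y = X[input_start + 1u]; }
        if end >= 2u { inp.z = X[input_start + 2u]; }
        let filter_start = i * metadata.KS;
        kernel.x = F[filter_start];
        kernel.y = F[filter_start + 1u];
        kernel.z = F[filter_start + 2u];
        acc = fma(inp, kernel, acc);
    }
    Y[output_index] = acc.x + acc.y + acc.z + B[bias_index];
}

(Whether FXC would then leave the outer metadata.Cin loop alone would still need to be verified.)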
The error is from FXC.
You're correct - using DXC fixes this.
I believe the Naga bug here is that we don't generate valid HLSL for dynamically-indexed vectors and matrices. If that's the case, then getting that right would probably stop provoking FXC into trying to fix the bug with "not natively addressable, forcing loop to unroll", and things would be fine. We should definitely look at what code Tint generates for these things.
See also #4337 |
We are using wgpu in wonnx to run machine learning models using compute shaders. As I understand it, the WGSL code is translated to HLSL (and MSL, etc.) through naga. All of this has been working great so far; your work is much appreciated!

While testing code that works fine on macOS, I nevertheless encountered an issue on Windows (webonnx/wonnx#166). The error is the following:
The linked issue contains both the input WGSL code and the generated HLSL code. It appears the issue is the following:
Our system replaces {{k_chunks}} and {{kernel_size}} at runtime with constant integers. The kernel_size is typically 4, while k_chunks can get quite large.

Apparently HLSL requires the inner loop to be unrolled, as it disallows indexing into matNxN dynamically (while WGSL allows this). It helpfully applies a workaround (unrolling the loop) but inadvertently seems to unroll both the inner and the outer loop.

I managed to resolve it in this case (see webonnx/wonnx@c772653) by replacing the inner loop with simply product = mat_right * mat_left;. This is of course better regardless of this issue. Nevertheless, you may be interested to know about this error, as I suspect it may occur in other situations as well.

(Note that I am now seeing some other error; see the linked issue. If anyone here happens to have an idea about that one, I'd be very interested to hear.)
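(As a generic illustration of why that rewrite helps, and not the actual wonnx shader: accumulating a matrix product column by column forces a runtime-indexed write into the matrix, whereas the built-in operator leaves nothing for the HLSL backend to work around.)

// Hypothetical example: writes a matrix column through a runtime index,
// the pattern FXC cannot address natively.
fn mul_with_dynamic_writes(a: mat4x4<f32>, b: mat4x4<f32>) -> mat4x4<f32> {
    var product = mat4x4<f32>();
    for (var c = 0u; c < 4u; c++) {
        product[c] = a * b[c]; // column c of a*b
    }
    return product;
}

// Equivalent result using the built-in operator; no component indexing at all.
fn mul_with_operator(a: mat4x4<f32>, b: mat4x4<f32>) -> mat4x4<f32> {
    return a * b;
}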