"Unable to unroll loop" error compiling generated HLSL code that writes to a matrix row in a loop #4436
Well, this is unfortunate... We might have to come up with a workaround for this (unroll the inner loop ourselves?), but I don't think it's a high priority right now. Possibly related: #4460.
Yeah, I'm not sure what the right solution would be in this case. Fundamentally, WGSL allows something that HLSL does not support (dynamic indexing of a matrix row). Notifying the user by throwing an error in this case seems appropriate to me; working around the issue entirely may not be possible or worthwhile. Perhaps the HLSL compiler can be tricked into unrolling the inner loop but not the outer one by applying attributes (see the HLSL documentation; 'unroll' is one of the attributes). That would fix this specific case, but dynamic indexing into matrices (and perhaps vectors) could also occur in other usage scenarios.
@pixelspark Could you pull out a reduced test case that you could put in this issue?
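(As an aside, a minimal case for the pattern named in the title might look something like the sketch below. This is a hypothetical illustration, not the reporter's shader; it writes a matrix component through a loop-dependent index, which WGSL allows but FXC cannot address natively:)

@group(0) @binding(0)
var<storage, read_write> output_data: array<f32>;
struct Params {
    n: u32,
}
@group(1) @binding(0)
var<uniform> params: Params;

@compute @workgroup_size(1)
fn main() {
    var m = mat2x2<f32>();
    // `i` is only known at runtime, so the matrix component cannot be
    // addressed statically; FXC reacts by forcing the loop to unroll,
    // which fails once the trip count is large or data-dependent.
    for (var i = 0u; i < params.n; i++) {
        m[0][i % 2u] = f32(i);
    }
    output_data[0] = m[0][0] + m[0][1];
}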
Seeing this still on DXC:
In Device::create_compute_pipeline
note: label = `1d_conv3_scalar`
Internal error: FXC D3DCompile error (0x80004005): D:\a\1d_conv3_scalar(69,21-32): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
D:\a\1d_conv3_scalar(36,5-15): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (497 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number

Full shader for context:

//Each workgroup is responsible for a single filter.
//Each thread computes a single element of the output.
//Each thread places the 3 column wide filter over the input, and multiplies and accumulates the values
//to a SINGLE output element.
@group(0) @binding(0)
var<storage, read> X: array<f32>;
@group(0) @binding(1)
var<storage, read> W: array<f32>;
@group(0) @binding(2)
var<storage, read> B: array<f32>;
@group(0) @binding(3)
var<storage, read_write> Y: array<f32>;
struct Meta {
padding: u32,
stride: u32,
Cin: u32,
Lin: u32,
KS: u32,
F_numel: u32,
W_numel: u32,
Lout: u32,
Fperthread: u32,
}
@group(1) @binding(0)
var<uniform> metadata: Meta;
var<workgroup> F: array<f32, {{ F_numel }}u>;
fn inner(input_index: u32, filter_index: u32, output_index: u32, bias_index: u32, start: u32, end: u32) {
var inp = vec3<f32>(0f);
var kernel = vec3<f32>(0f);
var acc = vec3<f32>(0f);
for(var i = 0u; i < metadata.Cin; i++) {
let input_start = input_index + (i * metadata.Lin) - 1u; //-1 is for padding
//We only populate the input between the provided indices, used for padding
for(var j = start; j <= end; j++) {
inp[j] = X[input_start + j];
}
let filter_start = i * metadata.KS;
kernel.x = F[filter_start];
kernel.y = F[filter_start + 1u];
kernel.z = F[filter_start + 2u];
acc = fma(inp, kernel, acc);
}
Y[output_index] = acc.x + acc.y + acc.z + B[bias_index];
}
//Each thread may load more than 1 element into shared memory
fn load_filters_into_smem(local_id: vec3<u32>, filter_index: u32) {
let windex = filter_index + (local_id.x * metadata.Fperthread);
let findex = (local_id.x * metadata.Fperthread);
for(var i=0u; i < metadata.Fperthread; i++) {
if findex + i < metadata.F_numel {
F[findex + i] = W[windex + i];
}
}
}
//Doesn't support dynamic padding
@compute @workgroup_size({{ workgroup_size_x }}, {{ workgroup_size_y }}, {{ workgroup_size_z }})
fn main(@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) workgroup_id: vec3<u32>) {
let input_index = (workgroup_id.x * {{ workgroup_size_x }}u + local_id.x) * metadata.stride;
let filter_index = (workgroup_id.y * {{ F_numel }}u);
load_filters_into_smem(local_id, filter_index);
workgroupBarrier();
if input_index >= metadata.Lin {
//Break after loading because all threads may be needed for loading F
return;
}
let output_index = (workgroup_id.x * {{ workgroup_size_x }}u + local_id.x) + (workgroup_id.y * metadata.Lout);
let bias_index = workgroup_id.y;
//TODO: dynamic padding
if input_index == metadata.Lin - metadata.padding {
inner(input_index, filter_index, output_index, bias_index, 0u, 1u);
} else if input_index == 0u {
inner(input_index, filter_index, output_index, bias_index, 1u, 2u);
} else {
inner(input_index, filter_index, output_index, bias_index, 0u, 2u);
}
}

Obviously the inner function is causing issues.
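(Following up on the "unroll the inner loop ourselves" idea from the earlier comment, one possible workaround is to address each vec3 component statically instead of indexing inp with the loop counter. The sketch below is a hypothetical, untested rewrite of inner, and it assumes start and end only ever take the values passed from main, i.e. 0 through 2:)

fn inner(input_index: u32, filter_index: u32, output_index: u32, bias_index: u32, start: u32, end: u32) {
    var inp = vec3<f32>(0f);
    var kernel = vec3<f32>(0f);
    var acc = vec3<f32>(0f);
    for (var i = 0u; i < metadata.Cin; i++) {
        let input_start = input_index + (i * metadata.Lin) - 1u; //-1 is for padding
        // The original `for (var j = start; j <= end; j++) { inp[j] = ... }`
        // indexes the vec3 with a runtime value. Spelling out the three
        // components keeps every write statically addressable; each guard
        // mirrors the original condition `start <= j && j <= end`.
        if start == 0u { inp.x = X[input_start]; }
        if start <= 1u && end >= 1u { inp.y = X[input_start + 1u]; }
        if end >= 2u { inp.z = X[input_start + 2u]; }
        let filter_start = i * metadata.KS;
        kernel.x = F[filter_start];
        kernel.y = F[filter_start + 1u];
        kernel.z = F[filter_start + 2u];
        acc = fma(inp, kernel, acc);
    }
    Y[output_index] = acc.x + acc.y + acc.z + B[bias_index];
}

(Whether FXC would then leave the outer metadata.Cin loop alone would still need to be verified.)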
The error is from FXC.
You're correct - using DXC fixes this.
I believe the Naga bug here is that we don't generate valid HLSL for dynamically-indexed vectors and matrices. If that's the case, then getting that right would probably stop provoking FXC into trying to fix the bug with "not natively addressable, forcing loop to unroll", and things would be fine. We should definitely look at what code Tint generates for these things.
See also #4337 |
We are using wgpu in wonnx to run machine learning models using compute shaders. As I understand it, the WGSL code is translated to HLSL (and MSL, etc.) through naga. All of this has been working great so far; your work is much appreciated!

While testing code that works fine on macOS, I nevertheless encountered an issue on Windows (webonnx/wonnx#166). The error is the following:
The linked issue contains both the input WGSL code and the generated HLSL code. It appears the issue is the following:
Our system replaces {{k_chunks}} and {{kernel_size}} at runtime with constant integers. The kernel_size is typically 4, while k_chunks can get quite large.

Apparently HLSL requires the inner loop to be unrolled, as it disallows indexing into matNxN dynamically (while WGSL allows this). It helpfully applies a workaround (unrolling the loop) but inadvertently seems to unroll both the inner and the outer loop.

I managed to resolve it in this case (see webonnx/wonnx@c772653) by replacing the inner loop with simply product = mat_right * mat_left;. This is of course better regardless of this issue. Nevertheless, you may be interested to know about this error, as I suspect it may occur in other situations as well.

(Note that I am now seeing some other error; see the linked issue. If anyone here happens to have an idea about that one, I'd be very interested to hear.)
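(As a generic illustration of why that rewrite helps, and not the actual wonnx shader: accumulating a matrix product column by column forces a runtime-indexed write into the matrix, whereas the built-in operator leaves nothing for the HLSL backend to work around.)

// Hypothetical example: writes a matrix column through a runtime index,
// the pattern FXC cannot address natively.
fn mul_with_dynamic_writes(a: mat4x4<f32>, b: mat4x4<f32>) -> mat4x4<f32> {
    var product = mat4x4<f32>();
    for (var c = 0u; c < 4u; c++) {
        product[c] = a * b[c]; // column c of a*b
    }
    return product;
}

// Equivalent result using the built-in operator; no component indexing at all.
fn mul_with_operator(a: mat4x4<f32>, b: mat4x4<f32>) -> mat4x4<f32> {
    return a * b;
}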