Multi-packed DSPs for MVU/VVU layers #1021
FINN v0.10 introduces multi-packed DSPs for Convolutional layers. By packing multiple elements on the input data paths of DSP slices, we’re able to achieve 2-4x higher computational density depending on the precision and DSP architecture. A summary of the MACs / DSP given the input datatype and DSP architecture is shown below.
[Table: MACs / DSP by input datatype (Weights: [4, 4]-bit INT, Weights: (4, 8]-bit INT) and DSP architecture]
Note that a 2-4x improvement in MACs / DSP directly translates to a 2-4x reduction in DSP usage while achieving the same performance! Alternatively, investing the same number of DSPs yields a 2-4x increase in performance.
How to enable multi-packed DSPs in your FINN-generated IPs?
To enable multi-packed DSPs, all you need to do is ensure that your convolutions use datatypes in the ranges specified above! The FINN compiler automatically detects a match between the RTL backend and the specified bit widths and DSP architecture, and takes care of instantiating the correct kernel to leverage multiple MACs per DSP.
If, for whatever reason, you want to forgo the optimized DSP mapping and implement the convolution arithmetic in fabric instead, simply set the `preferred_impl_style` of the corresponding target layer to `hls` in the `specialize_layers_config_file` passed to your `DataflowBuildConfig`. Note that you don't need to create this JSON file by hand -- simply running `step_create_dataflow_partition` will create one for you, which you can then modify according to your requirements. Please also note that the Thresholding layer needs to be implemented as a standalone layer rather than embedded within the MVAU layer -- this can be done by simply setting `standalone_thresholds` in your `DataflowBuildConfig` to `True`.
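As a concrete sketch of the options above, the snippet below shows how `standalone_thresholds` and a `specialize_layers_config_file` could be wired into a `DataflowBuildConfig`. The model path, clock period, board and output selection are placeholders to adapt to your own build; only the two options discussed above come from this post.

```python
# Minimal sketch, assuming a FINN v0.10 installation; paths/board/clock are placeholders.
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg

cfg = build_cfg.DataflowBuildConfig(
    output_dir="output_example",                    # placeholder
    synth_clk_period_ns=5.0,                        # placeholder
    board="U250",                                   # placeholder
    # Thresholding must run as a standalone layer (not embedded in the MVAU):
    standalone_thresholds=True,
    # JSON created by step_create_dataflow_partition; set
    # "preferred_impl_style": "hls" per layer here to fall back to fabric:
    specialize_layers_config_file="specialize_layers_config.json",
    generate_outputs=[build_cfg.DataflowOutputType.STITCHED_IP],
)
build.build_dataflow_cfg("model.onnx", cfg)         # placeholder model path
```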
How do these changes affect the layer folding?
FINN allows you to tweak the parallelism and performance of each layer by exposing SIMD and/or PE folding factors to the user. Please see the corresponding notebook and documentation for more information.
The table below shows, given the input datatype precisions and the DSP architecture, which PE or SIMD folding factors achieve the highest utilization of the multi-packed DSPs. For example, given 4-bit weights and activations for the MVU layer, PE should ideally be divisible by 4 or 3 for DSP48 and DSP58, respectively (a small sketch after the table illustrates this). Please note that, on top of that, the folding-factor constraints inherent to each layer still apply.
[Table: PE / SIMD folding-factor multiples that maximize multi-packed DSP utilization, by input datatype (Weights: [4, 4]-bit INT, Weights: (4, 8]-bit INT) and DSP architecture]
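To illustrate the divisibility rule above (for 4-bit weights and activations on the MVU: PE divisible by 4 on DSP48, by 3 on DSP58), here is a minimal sketch of picking a PE value. The helper and its fallback behaviour are purely illustrative and not part of FINN; the constraint that PE must divide the matrix height MH is the usual MVAU folding constraint.

```python
# Illustrative only: the packing factors below are for 4-bit weights/activations
# on the MVU; consult the table above for other precisions.
PACK_FACTOR = {"DSP48": 4, "DSP58": 3}

def best_pe(mh: int, dsp: str, max_pe: int) -> int:
    """Largest PE <= max_pe that divides MH and is a multiple of the DSP
    packing factor; falls back to the largest legal PE if none exists."""
    legal = [pe for pe in range(1, max_pe + 1) if mh % pe == 0]
    packed = [pe for pe in legal if pe % PACK_FACTOR[dsp] == 0]
    return max(packed) if packed else max(legal)

print(best_pe(mh=48, dsp="DSP48", max_pe=16))  # 16 (multiple of 4)
print(best_pe(mh=48, dsp="DSP58", max_pe=16))  # 12 (multiple of 3)
```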
Having presented what the multi-packed DSPs achieve, how you can enable them in your designs, and the best practices for setting the folding factors, the following sections go into a deeper level of detail for those who are interested.
First, we will revisit the computational pattern of convolutions, then we'll answer why we want multi-packed DSPs, and last but not least we'll go into more detail on how this is implemented.
Convolution Compute Pattern
As said, let’s first revisit what the computation pattern of MVU/VVU layers are. Below, you can find a visualization of how a Convolution ONNX layer transforms to FINN’s hardware abstraction layers. In the leftmost graph, you can find a snippet of two Convolutional layers from MobileNet-V1 (which you can find in our FINN-examples repository). The first convolution is a standard convolution, while the second convolution is a depthwise separable convolution. In the lowering process, a Convolution operator is converted to an Im2Col followed by a MatMul operator. This leads to the second graph (from the left). By calling
convert_to_hw_layers
-transformations, we’re able to make the step towards realizing these layers in FINN’s backend. As can be seen in the third graph, our first ONNX Convolution is realized as a FINN’s ConvolutionInputGenerator and MVAU node. Last but not least, a final step is applied to specialize these FINN-ops to their equivalent HLS/RTL-supported variants.The take-away point above is that the computational part of a Convolution is captured in the MatMul operator, which becomes either the MVAU or VVAU depending on the type of convolution we’re dealing with. In FINN’s terminology, we would refer to a matrix of weights (representing the weights of the Convolution) and a matrix of (input) activations (representing the inputs coming from the preceding Im2Col / ConvolutionInputGenerator(_hls,_rtl) layer). The weight-matrix would have a dimension of MH x MW, and the activation-matrix would have a dimension of MW x OFM (where OFM represents the height and width of the output feature map from the Convolution operator). For simplicity, let’s assume OFM = 1, which leads us to a matrix-vector multiplication as depicted below.
The traditional way we did this in HLS was to place one element from the weight matrix and one element from the activation vector on the input data paths of a DSP slice.
Below, you can find a simplified time diagram of how this matrix-vector multiplication is executed in the traditional HLS approach.
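As an illustration of that schedule (not FINN code), the sketch below walks the matrix-vector product in tiles of PE rows and SIMD columns, with every "DSP" consuming one weight element and one activation element per cycle; in this toy model the product takes (MH/PE) * (MW/SIMD) cycles.

```python
def mvau_traditional(W, x, PE, SIMD):
    """Matrix-vector product with PE x SIMD multipliers, one MAC each per cycle."""
    MH, MW = len(W), len(W[0])
    assert MH % PE == 0 and MW % SIMD == 0          # folding constraints
    y, cycles = [0] * MH, 0
    for row_tile in range(MH // PE):                # fold over output rows
        for col_tile in range(MW // SIMD):          # fold over the dot product
            cycles += 1                             # PE * SIMD MACs this cycle
            for p in range(PE):
                r = row_tile * PE + p
                for s in range(SIMD):
                    c = col_tile * SIMD + s
                    y[r] += W[r][c] * x[c]
    return y, cycles

W = [[(r + c) % 5 for c in range(8)] for r in range(4)]   # MH = 4, MW = 8
x = list(range(8))
y, cycles = mvau_traditional(W, x, PE=2, SIMD=4)
assert y == [sum(W[r][c] * x[c] for c in range(8)) for r in range(4)]
print(cycles)  # 4 cycles = (4/2) * (8/4)
```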
Since FINN mainly deals with highly quantized networks (say 4-bit), this essentially means that we're using the DSP48E2's 27x18-bit multiplier to compute a single 4x4-bit product. This naturally raises the question: can we utilize those unused lanes to achieve a higher computational density, i.e. implement multiple MACs per DSP per cycle? In addition to that, moving to RTL-hardened operators gives the following other advantages:
Now, let’s zoom a little bit more into the implementation details for the DSP48 and DSP58 variants.
Multi-packed DSP48
Let’s, for simplicity, assume 4-bit weights and activations and a DSP48E2 architecture. The explanation below extends in a similar way up to 8-bit weights and activations as well as the DSP48E1 architecture.$(A \cdot B + C)$ , where C is the accumulated value of the output from the previous cycle. Note that these DSPs can be chained and the ALU would be used to accumulate an extra term (i.e. to compute $(A \cdot B+C)$ (dsp i) + $(A \cdot B+C)$ (dsp i-1)). For the DSP48E2, the internal multiplier takes two inputs of 27 (referred as the A data path) and 18 (referred to as the B data path) bits wide. For the case of 4-bit products, we can make use of the wide lanes to compute 4 products in parallel by constructing our A data path to hold 4 elements. Ideally, if the data paths are wide enough, we could construct the following multiplication:
$A \cdot B = (a_0 + 2^8 \cdot a_1 + 2^{16} \cdot a_2 + 2^{24} \cdot a_3) \cdot b_0 = a_0 \cdot b_0 + 2^8 \cdot a_1 \cdot b_0 + 2^{16} \cdot a_2 \cdot b_0 + 2^{24} \cdot a_3 \cdot b_0$
Given the two input data paths to the multiplier, which will be referred to as A and B, a single DSP essentially computes
Note that our 8-bit output products$a_i \cdot b_0$ are spaced exactly 8-bits apart. In turn, we could simply extract those 4 products by slicing the output product accordingly.$a_i$ tighter together. This is achieved by keeping track of the carry propagations that would corrupt the neighboring lanes and applying a final correction to the dot-product accumulation to restore the correct products. Below you can find a schematic how this block looks like:
In reality, our A data path is only 27-bits, so this method doesn’t work. However, @preusser came up with an ingenious technique to pack the products
In summary, the following technique would lead to a matrix-vector computation as shown below.
For more details, you can find the code here: https://github.com/Xilinx/finn/blob/dev/finn-rtllib/mvu/mvu_4sx4u.sv and https://github.com/Xilinx/finn/blob/dev/finn-rtllib/mvu/mvu_8sx8u_dsp48.sv.
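To make the ideal lane packing above concrete, the toy sketch below packs four 4-bit values at 8-bit spacing, performs a single wide multiplication, and recovers the four products by slicing. For simplicity it uses unsigned operands so every product fits in 8 bits and no correction is needed; the actual mvu_4sx4u.sv additionally handles signed operands and the 27-bit A-port limit with the correction scheme described above.

```python
def packed_mul(a, b0):
    """a: four 4-bit values, b0: one 4-bit value -> the four products a_i * b0."""
    A = sum(ai << (8 * i) for i, ai in enumerate(a))   # pack at 8-bit spacing
    P = A * b0                                         # one wide multiplication
    return [(P >> (8 * i)) & 0xFF for i in range(4)]   # slice out the 8-bit lanes

a, b0 = [3, 7, 11, 15], 13
assert packed_mul(a, b0) == [ai * b0 for ai in a]
print(packed_mul(a, b0))  # [39, 91, 143, 195]
```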
Multi-packed DSP58
Let’s assume 8-bit weights and activations and a DSP58 architecture. The DSP58 has a unique feature called the vector fixed-point ALU mode. Instead of having the wide multiplier (in this case a 27x24-bit) as the DSP48 architectures have, the vector fixed-point ALU mode allows us to instead use 3 8x9-bit multipliers followed by an adder to reduce these 3 products to a single 19-bit product as shown below.
As depicted, instead of computing$A[0:26] \cdot B[0:23]$ , we’re computing $A[0:8] \cdot B[0:7] + A[9:17] \cdot B[9:15] + A[19:26] \cdot B[16:23]$ in this mode.
$A[0:8] \cdot B[0:7] + A[9:17] \cdot B[9:15] + A[19:26] \cdot B[16:23] = a_0 \cdot b_0 + a_1 \cdot b_0 + a_2 \cdot b_0$
Since the matrix-vector computation consists of essentially multiple vector-vector products, this is a perfect fit to our problem. By packing multiple products on the A data path of the DSP slice, we have the following computation:
Without requiring any lane extensions, we can compute 3 multiplications and reduce them to a single product within a single cycle in the DSP58.
The following technique would then lead to a matrix-vector computation as shown below.
For more details, you can find the code here: https://github.com/Xilinx/finn/blob/dev/finn-rtllib/mvu/mvu_vvu_8sx9_dsp58.sv.
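To illustrate the vector mode above, the toy sketch below emulates the lane split in plain Python: the 27-bit A input is read as three 9-bit signed lanes, the 24-bit B input as three 8-bit signed lanes, and the three lane-wise products are reduced to a single accumulated value -- 3 MACs per call. This is an emulation for clarity only, not the RTL.

```python
def to_signed(val, bits):
    """Interpret the low `bits` bits of val as a two's-complement number."""
    val &= (1 << bits) - 1
    return val - (1 << bits) if val & (1 << (bits - 1)) else val

def dsp58_vector_mac(A, B, acc=0):
    a = [to_signed(A >> (9 * i), 9) for i in range(3)]   # A[0:8], A[9:17], A[18:26]
    b = [to_signed(B >> (8 * i), 8) for i in range(3)]   # B[0:7], B[8:15], B[16:23]
    return acc + sum(ai * bi for ai, bi in zip(a, b))    # 3 MACs, reduced to one sum

# Pack three values into each port and check against a plain dot product.
acts, wts = [17, -3, 42], [-8, 5, 127]
A = sum((ai & 0x1FF) << (9 * i) for i, ai in enumerate(acts))
B = sum((bi & 0xFF) << (8 * i) for i, bi in enumerate(wts))
assert dsp58_vector_mac(A, B) == sum(a * b for a, b in zip(acts, wts))
```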