Multi-packed DSPs for MVU/VVU layers #1021
FINN v0.10 introduces multi-packed DSPs for Convolutional layers. By packing multiple elements on the input data paths of DSP slices, we’re able to achieve 2-4x higher computational density depending on the precision and DSP architecture. A summary of the MACs / DSP given the input datatype and DSP architecture is shown below.
[Table: MACs / DSP by input datatype (Weights: [4, 4]-bit INT, Weights: (4, 8]-bit INT) and DSP architecture]
Note that a 2-4x improvement in MACs / DSP directly translates to a 2-4x reduction in DSP usage while achieving the same performance! Alternatively, investing the same number of DSPs yields a 2-4x increase in performance.
How to enable multi-packed DSPs in your FINN-generated IPs?
To enable multi-packed DSPs, all you need to do is ensure that your convolutions use datatypes in the ranges specified above! The FINN compiler automatically detects a match between the RTL backend and the specified bit widths and DSP architecture, and takes care of instantiating the correct kernel to leverage multiple MACs per DSP.
If, for whatever reason, you want to forgo the optimized DSP mapping and implement the convolution arithmetic in fabric instead, simply set the `preferred_impl_style` of the corresponding target layer to `hls` in the `specialize_layers_config_file` passed to your `DataflowBuildConfig`. Note that you don't need to create this JSON file by hand -- simply running `step_create_dataflow_partition` will create one for you, which you can then modify according to your requirements. Please also note that the Thresholding layer needs to be implemented as a standalone layer rather than embedded within the MVAU layer -- this can be done by simply setting `standalone_thresholds` in your `DataflowBuildConfig` to `True`.
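As a concrete sketch of the options above, the snippet below shows how `standalone_thresholds` and a `specialize_layers_config_file` could be wired into a `DataflowBuildConfig`. The model path, clock period, board and output selection are placeholders to adapt to your own build; only the two options discussed above come from this post.

```python
# Minimal sketch, assuming a FINN v0.10 installation; paths/board/clock are placeholders.
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg

cfg = build_cfg.DataflowBuildConfig(
    output_dir="output_example",                    # placeholder
    synth_clk_period_ns=5.0,                        # placeholder
    board="U250",                                   # placeholder
    # Thresholding must run as a standalone layer (not embedded in the MVAU):
    standalone_thresholds=True,
    # JSON created by step_create_dataflow_partition; set
    # "preferred_impl_style": "hls" per layer here to fall back to fabric:
    specialize_layers_config_file="specialize_layers_config.json",
    generate_outputs=[build_cfg.DataflowOutputType.STITCHED_IP],
)
build.build_dataflow_cfg("model.onnx", cfg)         # placeholder model path
```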
How do these changes affect the layer folding?
FINN allows you to tweak the parallelism and performance of each layer by exposing SIMD and/or PE folding factors to the user. Please see the corresponding notebook and documentation for more information.
The table below shows, given the input datatype precisions and the DSP architecture, which PE or SIMD folding factors achieve the highest utilization of the multi-packed DSPs. For example, given 4-bit weights and activations for the MVU layer, PE should ideally be divisible by 4 or 3 for DSP48 and DSP58, respectively (a small sketch after the table illustrates this). Please note that, on top of that, the folding-factor constraints inherent to each layer still apply.
[Table: PE / SIMD folding-factor multiples that maximize multi-packed DSP utilization, by input datatype (Weights: [4, 4]-bit INT, Weights: (4, 8]-bit INT) and DSP architecture]
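To illustrate the divisibility rule above (for 4-bit weights and activations on the MVU: PE divisible by 4 on DSP48, by 3 on DSP58), here is a minimal sketch of picking a PE value. The helper and its fallback behaviour are purely illustrative and not part of FINN; the constraint that PE must divide the matrix height MH is the usual MVAU folding constraint.

```python
# Illustrative only: the packing factors below are for 4-bit weights/activations
# on the MVU; consult the table above for other precisions.
PACK_FACTOR = {"DSP48": 4, "DSP58": 3}

def best_pe(mh: int, dsp: str, max_pe: int) -> int:
    """Largest PE <= max_pe that divides MH and is a multiple of the DSP
    packing factor; falls back to the largest legal PE if none exists."""
    legal = [pe for pe in range(1, max_pe + 1) if mh % pe == 0]
    packed = [pe for pe in legal if pe % PACK_FACTOR[dsp] == 0]
    return max(packed) if packed else max(legal)

print(best_pe(mh=48, dsp="DSP48", max_pe=16))  # 16 (multiple of 4)
print(best_pe(mh=48, dsp="DSP58", max_pe=16))  # 12 (multiple of 3)
```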
Having presented what the multi-packed DSPs achieve, how you can enable them in your designs, and the best practices for setting the folding factors, the following sections go into a deeper level of detail for those who are interested.
First, we will revisit the computational pattern of convolutions, then we'll answer why we want multi-packed DSPs, and last but not least we'll go into more detail on how this is implemented.
Convolution Compute Pattern
As said, let’s first revisit what the computation pattern of MVU/VVU layers are. Below, you can find a visualization of how a Convolution ONNX layer transforms to FINN’s hardware abstraction layers. In the leftmost graph, you can find a snippet of two Convolutional layers from MobileNet-V1 (which you can find in our FINN-examples repository). The first convolution is a standard convolution, while the second convolution is a depthwise separable convolution. In the lowering process, a Convolution operator is converted to an Im2Col followed by a MatMul operator. This leads to the second graph (from the left). By calling
convert_to_hw_layers
-transformations, we’re able to make the step towards realizing these layers in FINN’s backend. As can be seen in the third graph, our first ONNX Convolution is realized as a FINN’s ConvolutionInputGenerator and MVAU node. Last but not least, a final step is applied to specialize these FINN-ops to their equivalent HLS/RTL-supported variants.The take-away point above is that the computational part of a Convolution is captured in the MatMul operator, which becomes either the MVAU or VVAU depending on the type of convolution we’re dealing with. In FINN’s terminology, we would refer to a matrix of weights (representing the weights of the Convolution) and a matrix of (input) activations (representing the inputs coming from the preceding Im2Col / ConvolutionInputGenerator(_hls,_rtl) layer). The weight-matrix would have a dimension of MH x MW, and the activation-matrix would have a dimension of MW x OFM (where OFM represents the height and width of the output feature map from the Convolution operator). For simplicity, let’s assume OFM = 1, which leads us to a matrix-vector multiplication as depicted below.
The traditional way we did this in HLS was to place one element from the weight matrix and one element from the activation vector on the input data paths of a DSP slice.
Below, you can find a simplified time diagram of how this matrix-vector multiplication is executed in the traditional HLS approach.
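As an illustration of that schedule (not FINN code), the sketch below walks the matrix-vector product in tiles of PE rows and SIMD columns, with every "DSP" consuming one weight element and one activation element per cycle; in this toy model the product takes (MH/PE) * (MW/SIMD) cycles.

```python
def mvau_traditional(W, x, PE, SIMD):
    """Matrix-vector product with PE x SIMD multipliers, one MAC each per cycle."""
    MH, MW = len(W), len(W[0])
    assert MH % PE == 0 and MW % SIMD == 0          # folding constraints
    y, cycles = [0] * MH, 0
    for row_tile in range(MH // PE):                # fold over output rows
        for col_tile in range(MW // SIMD):          # fold over the dot product
            cycles += 1                             # PE * SIMD MACs this cycle
            for p in range(PE):
                r = row_tile * PE + p
                for s in range(SIMD):
                    c = col_tile * SIMD + s
                    y[r] += W[r][c] * x[c]
    return y, cycles

W = [[(r + c) % 5 for c in range(8)] for r in range(4)]   # MH = 4, MW = 8
x = list(range(8))
y, cycles = mvau_traditional(W, x, PE=2, SIMD=4)
assert y == [sum(W[r][c] * x[c] for c in range(8)) for r in range(4)]
print(cycles)  # 4 cycles = (4/2) * (8/4)
```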
Since FINN mainly deals with highly quantized networks (say 4-bit), this essentially means that we're using the DSP48E2's 27x18-bit multiplier to compute a single 4x4-bit product. This naturally raises the question: can we utilize those unused lanes to achieve a higher computational density, i.e. implement multiple MACs per DSP per cycle? In addition to that, moving to RTL-hardened operators gives the following other advantages:
Now, let’s zoom a little bit more into the implementation details for the DSP48 and DSP58 variants.
Multi-packed DSP48
Let’s, for simplicity, assume 4-bit weights and activations and a DSP48E2 architecture. The explanation below extends in a similar way up to 8-bit weights and activations as well as the DSP48E1 architecture.$(A \cdot B + C)$ , where C is the accumulated value of the output from the previous cycle. Note that these DSPs can be chained and the ALU would be used to accumulate an extra term (i.e. to compute $(A \cdot B+C)$ (dsp i) + $(A \cdot B+C)$ (dsp i-1)). For the DSP48E2, the internal multiplier takes two inputs of 27 (referred as the A data path) and 18 (referred to as the B data path) bits wide. For the case of 4-bit products, we can make use of the wide lanes to compute 4 products in parallel by constructing our A data path to hold 4 elements. Ideally, if the data paths are wide enough, we could construct the following multiplication:
$A \cdot B = (a_0 + 2^8 \cdot a_1 + 2^{16} \cdot a_2 + 2^{24} \cdot a_3) \cdot b_0 = a_0 \cdot b_0 + 2^8 \cdot a_1 \cdot b_0 + 2^{16} \cdot a_2 \cdot b_0 + 2^{24} \cdot a_3 \cdot b_0$
Given the two input data paths to the multiplier, which will be referred to as A and B, a single DSP essentially computes
Note that our 8-bit output products$a_i \cdot b_0$ are spaced exactly 8-bits apart. In turn, we could simply extract those 4 products by slicing the output product accordingly.$a_i$ tighter together. This is achieved by keeping track of the carry propagations that would corrupt the neighboring lanes and applying a final correction to the dot-product accumulation to restore the correct products. Below you can find a schematic how this block looks like:
In reality, our A data path is only 27-bits, so this method doesn’t work. However, @preusser came up with an ingenious technique to pack the products
In summary, the following technique would lead to a matrix-vector computation as shown below.
For more details, you can find the code here: https://github.com/Xilinx/finn/blob/dev/finn-rtllib/mvu/mvu_4sx4u.sv and https://github.com/Xilinx/finn/blob/dev/finn-rtllib/mvu/mvu_8sx8u_dsp48.sv.
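To make the ideal lane packing above concrete, the toy sketch below packs four 4-bit values at 8-bit spacing, performs a single wide multiplication, and recovers the four products by slicing. For simplicity it uses unsigned operands so every product fits in 8 bits and no correction is needed; the actual mvu_4sx4u.sv additionally handles signed operands and the 27-bit A-port limit with the correction scheme described above.

```python
def packed_mul(a, b0):
    """a: four 4-bit values, b0: one 4-bit value -> the four products a_i * b0."""
    A = sum(ai << (8 * i) for i, ai in enumerate(a))   # pack at 8-bit spacing
    P = A * b0                                         # one wide multiplication
    return [(P >> (8 * i)) & 0xFF for i in range(4)]   # slice out the 8-bit lanes

a, b0 = [3, 7, 11, 15], 13
assert packed_mul(a, b0) == [ai * b0 for ai in a]
print(packed_mul(a, b0))  # [39, 91, 143, 195]
```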
Multi-packed DSP58
Let’s assume 8-bit weights and activations and a DSP58 architecture. The DSP58 has a unique feature called the vector fixed-point ALU mode. Instead of having the wide multiplier (in this case a 27x24-bit) as the DSP48 architectures have, the vector fixed-point ALU mode allows us to instead use 3 8x9-bit multipliers followed by an adder to reduce these 3 products to a single 19-bit product as shown below.
As depicted, instead of computing$A[0:26] \cdot B[0:23]$ , we’re computing $A[0:8] \cdot B[0:7] + A[9:17] \cdot B[9:15] + A[19:26] \cdot B[16:23]$ in this mode.
$A[0:8] \cdot B[0:7] + A[9:17] \cdot B[9:15] + A[19:26] \cdot B[16:23] = a_0 \cdot b_0 + a_1 \cdot b_0 + a_2 \cdot b_0$
Since the matrix-vector computation consists of essentially multiple vector-vector products, this is a perfect fit to our problem. By packing multiple products on the A data path of the DSP slice, we have the following computation:
Without requiring any lane extensions, we can compute 3 multiplications and reduce them to a single product within a single cycle in the DSP58.
The following technique would then lead to a matrix-vector computation as shown below.
For more details, you can find the code here: https://github.com/Xilinx/finn/blob/dev/finn-rtllib/mvu/mvu_vvu_8sx9_dsp58.sv.
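To illustrate the vector mode above, the toy sketch below emulates the lane split in plain Python: the 27-bit A input is read as three 9-bit signed lanes, the 24-bit B input as three 8-bit signed lanes, and the three lane-wise products are reduced to a single accumulated value -- 3 MACs per call. This is an emulation for clarity only, not the RTL.

```python
def to_signed(val, bits):
    """Interpret the low `bits` bits of val as a two's-complement number."""
    val &= (1 << bits) - 1
    return val - (1 << bits) if val & (1 << (bits - 1)) else val

def dsp58_vector_mac(A, B, acc=0):
    a = [to_signed(A >> (9 * i), 9) for i in range(3)]   # A[0:8], A[9:17], A[18:26]
    b = [to_signed(B >> (8 * i), 8) for i in range(3)]   # B[0:7], B[8:15], B[16:23]
    return acc + sum(ai * bi for ai, bi in zip(a, b))    # 3 MACs, reduced to one sum

# Pack three values into each port and check against a plain dot product.
acts, wts = [17, -3, 42], [-8, 5, 127]
A = sum((ai & 0x1FF) << (9 * i) for i, ai in enumerate(acts))
B = sum((bi & 0xFF) << (8 * i) for i, bi in enumerate(wts))
assert dsp58_vector_mac(A, B) == sum(a * b for a, b in zip(acts, wts))
```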