Graph optimization is a collection of optimization passes that convert a general network description into a network description for GPU execution. It happens in the constructor of `cldnn::program`. In other words, the input of graph optimization is `topology` (link) and the output is `program` (link).
The transformation from the original graph into the final graph is quite complicated, so it is divided into smaller steps called passes. The purpose of this documentation is not to explain every step in detail, but to explain the key steps.
For debugging purposes, you can dump the optimized graph after each step. Please see this link for details.
Note: The optimization passes run in sequence, and the numeric prefix indicates each pass's position in that sequence. However, this sequence number might change in the future.
- 00_init: First step of the optimization. If you want to see the first cldnn graph, you can check this. It collects network output node information and sets the node processing order.
- 08_prepare_primitive_fusing: Fuse post-operations into other primitives. For example, relu is fused into convolution. An element-wise add operation can usually be fused into its predecessor, too. The layout of the primitive has not been chosen at this point, so we do not know which kernel will be selected for it. However, support for a post-operation depends on the chosen kernel. That is why this pass contains some logic to guess the layout.
- 09_reorder_inputs: Select a layout format for each primitive. This is done by calling the `layout_optimizer::get_preferred_format` function, which returns the preferred format for a node (or “any”, which means that the format must be propagated from adjacent nodes if possible). Then it propagates formats for nodes with the “any” preferred format to minimize local reorders. After propagating formats, it inserts the actual reorder nodes into the graph. As a result of this pass, we get a quite complicated graph with many redundant reorders. They will be removed by `remove_redundant_reorders`.
- 17_remove_redundant_reorders: This pass is about removing reorders, but it has two conceptual purposes. The first is removing redundant reorders. For example, when the network contains a pattern like `reorder - reorder - reorder`, it can be shrunk into a single `reorder`. The second is supporting cross-layout operation of a primitive. For example, when a `convolution` needs to receive `bfyx` input and generate `b_fs_yx_fsv16` output, the initial graph from `reorder_inputs` looks like this: `data(bfyx) --> reorder(b_fs_yx_fsv16) --> convolution(b_fs_yx_fsv16)`. This pass looks for such a pattern and removes the reorder to generate a cross-layout graph for the target convolution: `data(bfyx) --> convolution(b_fs_yx_fsv16)`
- 19_prepare_buffer_fusing: This pass is for implicit concat or implicit crop. Implicit concat is about removing the `concatenation` primitive when the two predecessors can put their results directly into the target buffer of the concat. For example, if two convolution results are concatenated along the f-axis, the layout is bfyx, and b=1, we can just remove the concat primitive and adjust the output addresses of the convolutions to point to the proper locations.
- 20_add_required_reorders: This pass tries to keep the graph consistent and adds a reorder if the current format is not supported by a node. It checks whether the current input format is present in the `implementation_map<op_t>` defined in the `<op_type>_gpu.cpp` file. If it is not defined, this pass tries to change the layout to one of the most common formats [bfyx, yxfb, byxf] and picks the first supported format.
- 21_add_onednn_optimization_attributes: This pass generates onednn attributes for post operations (link). The OpenVINO GPU plugin (a.k.a. cldnn) has its own set of defined post operations, and some transformation is required to map those into onednn post-operations.
- 22_compile_graph: This pass creates `primitive_impl` through the kernel selector. In this pass, the kernel for each node is chosen. For onednn primitives, OpenCL code is compiled in this stage. For cldnn primitives, OpenCL code will be compiled after all passes.
- 26_propagate_constants: This pass reorders weights for convolution, deconvolution and FC to the required format. As the kernel is chosen in the `compile_graph` stage, it is now known that some reordering is required for weights. This is because the weights are stored in a simple planar format in IR, but another format is usually required for optimized convolution (or deconv, FC). In order to reorder weights, this pass creates a simple graph that receives weights and generates reordered weights. We get the reordered weights by executing that network, and the reordered weights are inserted back into the original graph.
- 31_oooq_memory_dependencies: On a GPU, device memory is a limited resource, and it is not necessary to keep all the intermediate results when running inference on a network. Therefore, a buffer is reused when its content is not needed anymore. However, it must be taken into account that the intel_gpu plugin uses an out-of-order queue (OOOQ). As we cannot be sure of the exact sequence of execution, there is an additional limitation on reusing buffers. For example, in a multi-branch structure like inception, there are no direct dependencies between the branches except for the common ancestor. However, in OOOQ execution mode, as we cannot be sure of the sequence of execution within the inception module, a buffer from one branch must not be reused by another branch. Such implicit dependency information is processed in this pass.
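
The first purpose of 17_remove_redundant_reorders, shrinking a run of reorders into one, can be illustrated with a minimal sketch. The `Node` struct and the `collapse_reorders` function are hypothetical and are not part of cldnn. The sketch assumes a purely linear chain and reorders that only change the layout, so that only the last reorder of a run matters.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified node: a primitive kind plus its output format.
struct Node {
    std::string kind;
    std::string format;
};

// Collapse every run of consecutive reorders into the last reorder of the run.
// Assumption: each reorder only changes the layout, so intermediate formats
// have no effect on the final result.
std::vector<Node> collapse_reorders(const std::vector<Node>& chain) {
    std::vector<Node> out;
    for (const Node& n : chain) {
        if (n.kind == "reorder" && !out.empty() && out.back().kind == "reorder") {
            out.back() = n;  // the later reorder replaces the earlier one
        } else {
            out.push_back(n);
        }
    }
    return out;
}
```

For a chain `data(bfyx) --> reorder(yxfb) --> reorder(byxf) --> reorder(b_fs_yx_fsv16) --> convolution`, the sketch keeps only `reorder(b_fs_yx_fsv16)` between the data and the convolution. The real pass operates on the full program graph and covers many more cases.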
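
The implicit concat example from 19_prepare_buffer_fusing (concatenation along the f-axis, bfyx layout, b=1) comes down to simple offset arithmetic. The sketch below only illustrates that arithmetic; `concat_offsets_bfyx_b1` is a hypothetical helper, not a cldnn function, and it ignores padding and data type sizes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: for a concat along the feature axis in bfyx layout with
// batch size 1, return the element offset at which each input starts inside
// the shared concat output buffer.
std::vector<std::size_t> concat_offsets_bfyx_b1(
        const std::vector<std::size_t>& input_features,
        std::size_t height,
        std::size_t width) {
    std::vector<std::size_t> offsets;
    std::size_t next = 0;
    for (std::size_t features : input_features) {
        offsets.push_back(next);
        // With b=1, each input occupies one contiguous block of f*y*x elements.
        next += features * height * width;
    }
    return offsets;
}
```

For two convolutions that produce 16 and 32 feature maps of size 4x4, the offsets are 0 and 16*4*4 = 256 elements, so each convolution can write its result directly at its own offset. With b > 1 in bfyx, the data of one input would no longer form a single contiguous range of the concat buffer, which is why this simple scheme assumes b=1.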
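
The fallback rule described for 20_add_required_reorders (keep the current format if it is supported, otherwise pick the first supported one of bfyx, yxfb, byxf) can be sketched as follows. The function name and the use of a `std::set` of strings to stand in for the contents of `implementation_map<op_t>` are assumptions made for illustration only.

```cpp
#include <set>
#include <string>

// Hypothetical sketch: `supported` stands in for the formats registered for a
// primitive. Returns the format to use, or an empty string if neither the
// current format nor any of the common fallback formats is supported.
std::string pick_supported_format(const std::string& current,
                                  const std::set<std::string>& supported) {
    if (supported.count(current) != 0) {
        return current;  // no reorder needed
    }
    static const char* const fallbacks[] = {"bfyx", "yxfb", "byxf"};
    for (const char* fmt : fallbacks) {
        if (supported.count(fmt) != 0) {
            return fmt;  // first supported common format wins
        }
    }
    return "";
}
```

If the returned format differs from the current one, a reorder to that format would be inserted in front of the node.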
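
The buffer reuse restriction described for 31_oooq_memory_dependencies can be illustrated with one conservative rule: a node may reuse another node's output buffer only if every consumer of that buffer is an ancestor of the reusing node, so that all readers are guaranteed to have finished regardless of execution order. This is a sketch of the idea under that assumption, not the algorithm the pass actually implements; the `Graph` encoding and both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical encoding: deps[n] lists the direct predecessors of node n.
using Graph = std::vector<std::vector<std::size_t>>;

// True if `ancestor` is reachable from `node` through dependency edges, which
// means `ancestor` is guaranteed to finish before `node` starts.
bool depends_on(const Graph& deps, std::size_t node, std::size_t ancestor) {
    std::vector<bool> seen(deps.size(), false);
    std::vector<std::size_t> stack(1, node);
    while (!stack.empty()) {
        std::size_t cur = stack.back();
        stack.pop_back();
        for (std::size_t p : deps[cur]) {
            if (p == ancestor) return true;
            if (!seen[p]) {
                seen[p] = true;
                stack.push_back(p);
            }
        }
    }
    return false;
}

// True if `reuser` may safely write its output into the buffer of `owner`.
bool can_reuse_buffer(const Graph& deps, std::size_t owner, std::size_t reuser) {
    bool has_consumer = false;
    for (std::size_t n = 0; n < deps.size(); ++n) {
        for (std::size_t p : deps[n]) {
            if (p != owner) continue;
            has_consumer = true;
            // Every reader of the buffer must be finished before `reuser` runs.
            if (!depends_on(deps, reuser, n)) return false;
        }
    }
    // A buffer without consumers is treated as a network output: never reused.
    return has_consumer;
}
```

In an inception-like graph where node 0 feeds two branches (1 --> 2 and 3 --> 4) that join in node 5, node 4 cannot reuse the buffer of node 1, because node 2 in the other branch may still be reading it. Node 5 can reuse it, since it depends on node 2.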