const weight packing support #146
It's beneficial for loading a high-precision weight, packing it into a lower precision, and caching the result, but it's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario).
The first datatype that we want to support is BF16. And I don't think weight packing is inapplicable to INT4: for the W4A16 scenario, we can still use a similar block format like NK8k16n2k for weight packing to improve cache locality when converting INT4 to BF16 before BRGEMM. Is there a special reason that makes it not applicable?
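To illustrate the blocked-packing idea, here is a minimal sketch that repacks a row-major [K x N] weight into [N/n0][K/k0][k0][n0] blocks so each micro-kernel tile is contiguous. The block sizes, dimension order, and element type are illustrative assumptions, not the actual NK8k16n2k definition:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Repack a row-major [K x N] weight into [N/n0][K/k0][k0][n0] blocks so that
// the k0 x n0 tile consumed by one micro-kernel call is contiguous in memory.
// T would be uint16_t for raw BF16 bits (packing only moves bytes around).
template <typename T>
std::vector<T> pack_weight_blocked(const std::vector<T>& w, int K, int N,
                                   int k0 = 16, int n0 = 8) {
    assert(K % k0 == 0 && N % n0 == 0);
    std::vector<T> packed(static_cast<size_t>(K) * N);
    size_t idx = 0;
    for (int nb = 0; nb < N / n0; ++nb)        // blocks along N
        for (int kb = 0; kb < K / k0; ++kb)    // blocks along K
            for (int k = 0; k < k0; ++k)       // rows inside a block
                for (int n = 0; n < n0; ++n)   // contiguous inner dimension
                    packed[idx++] =
                        w[static_cast<size_t>(kb * k0 + k) * N + (nb * n0 + n)];
    return packed;
}
```

For the W4A16 case, the same loop nest could fold the INT4-to-BF16 conversion into the write, so the decompressed weight lands directly in the cache-friendly layout.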
My bad, I was referring to it not being beneficial for the datatype-conversion step (from INT4 up to a higher precision such as FP16 or INT8) within the overall tensor transformation pipeline.
In OpenVINO, the IR of a model consists of its topology and constant values, like weights, in memory. For each graph, there is a … Take … In the compile stage, the operations (for example, type-casting ops) that follow the … When the … When the …
I think we can still use most of the design of the constant tensor cache for oneDNN Graph. The model will be split into two parts: …
A big difference between OpenVINO's design and this design is that compile and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.
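For concreteness, here is a minimal sketch of what such a constant tensor cache could look like on the GC side. The class name, the key choice (identity of the frozen constant buffer), and the locking scheme are all assumptions for illustration, not the actual OpenVINO or oneDNN Graph API:

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical constant tensor cache: the folded result of const-only ops
// (e.g. cast + pack of a frozen weight) is computed once, keyed by the
// identity of the original constant buffer, and reused by later executions.
class ConstTensorCache {
public:
    using Buffer = std::shared_ptr<std::vector<uint8_t>>;

    // Returns the cached folded buffer, computing it with `fold` on first use.
    Buffer get_or_fold(const void* const_key,
                       const std::function<Buffer()>& fold) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(const_key);
        if (it != cache_.end())
            return it->second;          // hit: reuse across iterations
        Buffer folded = fold();         // miss: run the const subgraph once
        cache_.emplace(const_key, folded);
        return folded;
    }

private:
    std::mutex mu_;
    std::unordered_map<const void*, Buffer> cache_;
};
```

With a graph-wise backend, `fold` would correspond to executing the whole constant subgraph produced by the split, rather than a single op.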
Based on the current MLIR integration here: openvinotoolkit/openvino@4b524ca#diff-b40ca25e9ca41e663971ae2274f78b5a444a0f3ba5d014d2323e20f199b690b0, an MLIR subgraph will be represented as a single op in the OV graph, which sounds like a perfect match.
During model inference, the model weights are frozen and do not change between iterations. The CPU prefers a special weight layout to accelerate execution, so we need to pre-pack the model weights before execution. This issue covers the items below:
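For illustration, a minimal sketch (hypothetical names, float elements, and plain loops instead of real BRGEMM kernels) of an execution that consumes such a pre-packed weight, so each iteration reads contiguous tiles and no repacking happens inside the inference loop:

```cpp
#include <vector>

// Multiply a row-major [M x K] activation by a weight pre-packed into
// [N/n0][K/k0][k0][n0] blocks (see the packing sketch above). The packed
// buffer is produced once before inference; every iteration reuses it
// because the weight is frozen.
void matmul_prepacked(const std::vector<float>& a,        // [M x K], row-major
                      const std::vector<float>& packed_w, // blocked layout
                      std::vector<float>& c,              // [M x N], row-major
                      int M, int K, int N, int k0 = 16, int n0 = 8) {
    c.assign(static_cast<size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int nb = 0; nb < N / n0; ++nb)
            for (int kb = 0; kb < K / k0; ++kb) {
                // One contiguous k0 x n0 tile of the packed weight.
                const float* tile =
                    &packed_w[(static_cast<size_t>(nb) * (K / k0) + kb) * k0 * n0];
                for (int k = 0; k < k0; ++k)
                    for (int n = 0; n < n0; ++n)
                        c[static_cast<size_t>(m) * N + nb * n0 + n] +=
                            a[static_cast<size_t>(m) * K + kb * k0 + k] *
                            tile[k * n0 + n];
            }
}
```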